Keeping up with Fraud: An Active Learning Approach for Imbalanced Non-Stationary Data Streams

Kemper, D.

Files

Kemper, D._MSc_Thesis_2017.pdf (4.58 MB)

Authors

Kemper, D.

Issue Date

2017-07-13

Language

en

URI

http://theses.ubn.ru.nl/handle/123456789/5237

Abstract

Fraud detection is a difficult task in which multiple problems co-occur. The data often come from a non-stationary stream. Moreover, correct labels are available for only a small part of the data, and fraudulent cases are much rarer than non-fraudulent cases. A promising technique for handling this combination of problems is active learning, in which instances are selected for labeling such that the classifier can learn the most from them. Previously, the critical sampling strategy was proposed, which selects instances close to the decision boundary and oversamples fraudulent cases. The current project proposed an extension to this strategy that also explores the full input space. These strategies were compared to state-of-the-art active learning strategies, implemented in Massive Online Analysis (MOA), using a new data stream sampled from the KDD'99 dataset. The original critical sampling algorithm was found not to perform better than random sampling, as has been found previously. An explanation could be that critical sampling induces a sampling bias, specifically if minority data come from multiple dense and sparse areas in input space. In further research, this sampling bias could be overcome by combining critical sampling with a clustering- or diversity-based approach.
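The labeling decision described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's MOA implementation: the function name `should_query`, the parameters `threshold` and `explore_rate`, the 0.5 decision boundary, and the doubled band on the predicted-fraud side are all assumptions made for illustration. The sketch queries a label when the classifier's fraud probability lies close to the decision boundary, uses a wider band for instances predicted as fraud (a stand-in for oversampling the minority class), and occasionally queries at random to explore the rest of the input space.

```python
import random


def should_query(p_fraud, threshold=0.2, explore_rate=0.05, rng=random):
    """Decide whether to request a label for one stream instance.

    p_fraud is the classifier's estimated probability that the instance
    is fraudulent. All parameter names and defaults are hypothetical.
    """
    # Exploration (assumed form): with a small probability, query the
    # label regardless of how confident the classifier is.
    if rng.random() < explore_rate:
        return True

    # Distance of the prediction from an assumed decision boundary at 0.5.
    margin = abs(p_fraud - 0.5)

    # Assumed stand-in for oversampling fraud: instances predicted as
    # fraud get a band twice as wide as instances predicted as non-fraud.
    band = 2 * threshold if p_fraud >= 0.5 else threshold

    return margin <= band
```

With `explore_rate=0.0` the rule reduces to pure boundary-based sampling, which makes it easy to compare against the exploring variant on the same stream.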