Click here to find my master’s thesis in statistics from Lund University.
Abstract
The aim of the thesis is to evaluate solutions to the class imbalance problem using real-world data sets with varying degrees of class imbalance. The analysis is limited to binary classification. Three large data sets, relating to credit card fraud, vehicle insurance and heart disease, are used for the analysis.
Several methods are compared and evaluated. Logistic regression, SVC and decision trees serve as benchmark classifiers against which the imbalanced learning techniques are compared. Random undersampling and SMOTE are used to evaluate resampling techniques. Cost-sensitive versions of logistic regression, SVC and decision trees are used to evaluate cost-sensitive algorithms. The resampling techniques are also used in combination with the cost-sensitive algorithms. The results are evaluated using six measures: accuracy, recall, precision, F-measure, G-mean and AUC.
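As an illustration of the simplest of the resampling techniques named above, the following is a minimal sketch of random undersampling in plain Python. It is not the code used in the thesis: the function name `random_undersample`, the toy data and the choice to balance the classes exactly 1:1 are all assumptions made for the example.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class is as small as the smallest one.

    Hypothetical helper for illustration; balancing to an exact 1:1 ratio
    is an assumption, not necessarily the ratio used in the thesis.
    """
    rng = random.Random(seed)
    minority_size = min(Counter(y).values())

    kept = []
    for label in set(y):
        indices = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement; the minority class is kept in full.
        kept.extend(rng.sample(indices, minority_size))

    kept.sort()  # preserve the original row order
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data: 5 positive cases (label 1) among 100 observations.
X = [[i] for i in range(100)]
y = [1] * 5 + [0] * 95

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

SMOTE takes the opposite approach: instead of discarding majority cases, it creates synthetic minority cases by interpolating between existing ones.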
The conclusion of the thesis is that none of the methods evaluated outperforms all others. The methods produced varying scores on the different evaluation measures depending on the data set used for analysis. For example, the method that produced the highest precision score for the credit card fraud detection data was not the one that produced the highest precision score for the heart disease data. The analysis further showed that the choice of evaluation measure depends on the goal of the analysis.
In short, none of the evaluated techniques is optimal for all data sets. Which methods and evaluation measures are appropriate depends on the data set and on the goals of the analysis.
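To make the point about evaluation measures concrete, here is a small sketch that computes five of the six measures from confusion matrix counts. The function name `imbalance_metrics` and the counts are invented for illustration, and the G-mean is assumed to follow the common definition as the geometric mean of recall and specificity. AUC is left out because it requires predicted scores rather than counts.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute threshold-based measures from confusion matrix counts.

    Hypothetical helper; undefined ratios (zero denominators) are
    reported as 0.0, which is a convention chosen for this sketch.
    """
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * precision * recall / denom if denom else 0.0,
        "g_mean": math.sqrt(recall * specificity),
    }


# Invented counts: 10 positive cases among 1000 observations.
# A classifier that always predicts the majority class ...
always_negative = imbalance_metrics(tp=0, fp=0, tn=990, fn=10)
# ... versus one that finds most positives at the cost of false alarms.
detector = imbalance_metrics(tp=8, fp=40, tn=950, fn=2)

for name, scores in [("always negative", always_negative), ("detector", detector)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

With these made-up counts the always-negative classifier has the higher accuracy but a recall of zero, while the second classifier detects most positive cases with low precision. Which of the two is preferable depends on the relative cost of missed cases and false alarms.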