Friday, June 3, 2016

Dealing with class imbalance in a dataset

When we use machine learning to solve a problem, we usually split the data into a train set, a validation set, and a test set. The model is trained on the train set, tuned against the validation set, and finally evaluated on the test set.

Class imbalance means that the number of samples belonging to one or more classes is much larger than that of the other classes in the dataset. This is bad for training, since the model will focus on the classes with many samples and neglect the others.

Some solutions:

- Adjust the cost function. The issue with class imbalance is often that "accuracy" is no longer what you want to optimize, since a classifier can trivially reach very high accuracy by heavily favoring the majority class. By default this is exactly the metric classifiers optimize during training, so naturally we get bad results. Ideally we adjust the cost function so that classifying the minority class correctly becomes a higher priority for the algorithm; if we cannot, we can oversample/undersample to simulate this effect [1]. This also applies to deep neural networks. (See the sketch after this list.)

- Preprocess the data: the sampling method is often much more decisive for the quality of the resulting data mining model than the choice of modeling algorithm [2].
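As a concrete illustration of the first point, here is a minimal sketch of cost-function re-weighting using scikit-learn's class_weight option; the toy data and variable names are illustrative, not from this post:

```python
# Minimal sketch: re-weighting the cost function with scikit-learn.
# class_weight='balanced' scales each class's loss contribution by
# n_samples / (n_classes * count_of_that_class), so mistakes on the
# minority class cost more during training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Imbalanced toy data: 950 majority samples (class 0), 50 minority (class 1).
X = np.vstack([rng.normal(0, 1, size=(950, 2)),
               rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 950 + [1] * 50)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight='balanced').fit(X, y)

# The weighted model typically recovers far more of the minority class.
print("plain recall on class 1:   ", plain.predict(X)[y == 1].mean())
print("weighted recall on class 1:", weighted.predict(X)[y == 1].mean())
```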

So what are oversampling and undersampling?
Oversampling: duplicate observations of the minority class to obtain a balanced dataset.
Undersampling: drop observations of the majority class to obtain a balanced dataset.

[Figure: illustration of oversampling and undersampling]
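Both can be implemented with nothing more than index resampling. Below is a minimal NumPy sketch under the same illustrative toy data as above:

```python
# Minimal sketch: random oversampling and undersampling with NumPy.
import numpy as np

rng = np.random.RandomState(0)
# Imbalanced toy data: 950 majority samples (class 0), 50 minority (class 1).
X = rng.normal(size=(1000, 2))
y = np.array([0] * 950 + [1] * 50)

majority = np.where(y == 0)[0]
minority = np.where(y == 1)[0]

# Oversampling: duplicate minority observations (draw with replacement)
# until both classes have the same number of samples.
over_idx = np.concatenate([
    majority,
    rng.choice(minority, size=len(majority), replace=True),
])
X_over, y_over = X[over_idx], y[over_idx]

# Undersampling: drop majority observations (draw without replacement)
# down to the size of the minority class.
under_idx = np.concatenate([
    rng.choice(majority, size=len(minority), replace=False),
    minority,
])
X_under, y_under = X[under_idx], y[under_idx]

print("oversampled class counts: ", np.bincount(y_over))   # [950 950]
print("undersampled class counts:", np.bincount(y_under))  # [50 50]
```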

Which one is appropriate depends on the case, as discussed in [2].


[1] https://www.reddit.com/r/MachineLearning/comments/4m7d0c/fighting_against_class_imbalance_in_a_supervised/
[2] https://zyxo.wordpress.com/2008/12/30/oversampling-or-undersampling/
[3] https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
