Handling Imbalanced Data

Shahzad Abbas · Nerd For Tech · 3 min read · Apr 21, 2021

Introduction

Imbalanced data refers to a categorical dataset in which the class distribution is not uniform — that is, the classes are unequally represented in the dataset.


For example, suppose you have a dataset with 1,000 records and two classes (Yes/No). Only 50 of the 1,000 records belong to class ‘Yes’ and the remaining 950 belong to class ‘No’, so the distribution of records between the two classes is highly unequal.
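The 1,000-record example above can be sketched with pandas (the DataFrame and column name here are assumptions for illustration, not from the article):

```python
import pandas as pd

# Hypothetical dataset matching the example: 950 'No' and 50 'Yes' records
df = pd.DataFrame({"label": ["No"] * 950 + ["Yes"] * 50})

# value_counts() makes the imbalance obvious at a glance
print(df["label"].value_counts())
# No     950
# Yes     50
```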

Why do we need to handle Imbalance Data?

Training a model on imbalanced data can bias it toward the majority class and cause it to ignore the minority class entirely.

Suppose your boss asks you to train a model to classify products as ‘Defective’ or ‘Not Defective’, and the dataset given to you has 99.3% of its records labeled ‘Not Defective’ and only 0.7% labeled ‘Defective’.


You train the model to 98% accuracy and deploy it. A few days later your boss comes back and says the model is wrong: it is classifying ‘Defective’ products as ‘Not Defective’.

Now you are confused: the accuracy looks good, so what is the problem?

The problem is that you trained the model on imbalanced data, which biased it toward the class with more records, so it classifies nearly every product as the majority class.
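This accuracy trap can be demonstrated with a sketch using scikit-learn (the labels and the always-majority "model" here are assumptions for illustration): a classifier that never predicts ‘Defective’ still scores 99.3% accuracy on the split described above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 993 'Not Defective' (0) and 7 'Defective' (1),
# matching the 99.3% / 0.7% split from the text
y_true = np.array([0] * 993 + [1] * 7)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.993 — looks impressive
print(recall_score(y_true, y_pred))    # 0.0 — every defect is missed
```

Accuracy alone hides the failure; minority-class recall exposes it, which is why “right evaluation metrics” appears in the list of techniques below.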

What is the solution?

The solution is to balance the data. We can do that using the following techniques:

  1. Resample the training set
  2. Right evaluation metrics
  3. K-fold Cross-Validation
  4. Cluster the abundant class

But here we’ll discuss only the resampling technique.

Resampling Technique

Resampling is a method for balancing an imbalanced class distribution. It provides a convenient and effective way to deal with imbalanced learning problems using standard classifiers, because it alters the original training set rather than modifying the learning algorithm.

Resampling comes in two forms:

  1. Under Sampling
  2. Over Sampling

Under Sampling

The under-sampling technique decreases the frequency of the majority class. In doing so, we lose a large portion of data that could have been useful for training the model.

In the following example, there is a large difference between the frequencies of class 0 and class 1.


Let’s apply under-sampling.
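The article’s screenshots are not reproduced here, so the following is a minimal sketch of random under-sampling using plain pandas (the toy DataFrame, column names, and 950/50 split are assumptions for illustration):

```python
import pandas as pd

# Hypothetical imbalanced dataset: class 0 is the majority
df = pd.DataFrame({
    "feature": range(1000),
    "label":   [0] * 950 + [1] * 50,
})

minority = df[df["label"] == 1]
# Randomly drop majority rows until both classes have the same count
majority = df[df["label"] == 0].sample(n=len(minority), random_state=42)

balanced = pd.concat([majority, minority])
print(balanced["label"].value_counts())
# 50 records per class
```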


We can see that it reduces the frequency of majority class 0 to match that of class 1.

Over Sampling

This technique increases the frequency of the minority class. Its drawback is that it can result in overfitting, because it makes exact copies of the minority-class records. Moreover, the size of the training set grows, which increases the time needed to build a model.

Let’s apply over-sampling.
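Again in place of the screenshots, here is a sketch of random over-sampling with plain pandas (same assumed toy DataFrame as before): sampling the minority class with replacement duplicates rows, which is exactly the source of the overfitting risk mentioned above.

```python
import pandas as pd

# Hypothetical imbalanced dataset: class 0 is the majority
df = pd.DataFrame({
    "feature": range(1000),
    "label":   [0] * 950 + [1] * 50,
})

majority = df[df["label"] == 0]
# Sample minority rows WITH replacement until the classes match;
# the result contains exact duplicate copies of minority records
minority = df[df["label"] == 1].sample(
    n=len(majority), replace=True, random_state=42
)

balanced = pd.concat([majority, minority])
print(balanced["label"].value_counts())
# 950 records per class
```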


We can see that the over-sampling technique increases the number of instances of minority class 1.

Those were the random resampling techniques for balancing imbalanced data.


Shahzad is a data scientist with one and a half years of experience. He studied CS and knows multiple programming languages.