Handling Imbalanced Data

Shahzad Abbas · Nerd For Tech · 3 min read · Apr 21, 2021

Introduction

Imbalanced data refers to a categorical dataset in which the class distribution is not uniform — that is, the classes are unequally represented in the dataset.


For example, suppose you have a dataset with 1,000 records and two classes (Yes/No). Only 50 of the 1,000 records belong to class ‘Yes’ and the remaining 950 belong to class ‘No’, so the distribution of records between the two classes is highly unequal.
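The 1,000-record example above can be sketched with pandas (the DataFrame and column name here are assumptions for illustration, not from the article):

```python
import pandas as pd

# Hypothetical dataset matching the example: 950 'No' and 50 'Yes' records
df = pd.DataFrame({"label": ["No"] * 950 + ["Yes"] * 50})

# value_counts() makes the imbalance obvious at a glance
print(df["label"].value_counts())
# No     950
# Yes     50
```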

Why do we need to handle Imbalance Data?

Training a model on imbalanced data can bias it toward the majority class and cause it to ignore the minority class entirely.

Suppose your boss asks you to train a model to classify products as ‘Defective’ or ‘Not Defective’, and the dataset given to you has 99.3% of its records labeled ‘Not Defective’ and only 0.7% labeled ‘Defective’.


You train the model to 98% accuracy and deploy it. A few days later your boss comes back and says the model is wrong: it is classifying ‘Defective’ products as ‘Not Defective’.

Now you are confused: the accuracy looks good, so what is the problem?

The problem is that you trained the model on imbalanced data, which biased it toward the class with more records, so it classifies nearly every product as the majority class.
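This accuracy trap can be demonstrated with a sketch using scikit-learn (the labels and the always-majority "model" here are assumptions for illustration): a classifier that never predicts ‘Defective’ still scores 99.3% accuracy on the split described above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 993 'Not Defective' (0) and 7 'Defective' (1),
# matching the 99.3% / 0.7% split from the text
y_true = np.array([0] * 993 + [1] * 7)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.993 — looks impressive
print(recall_score(y_true, y_pred))    # 0.0 — every defect is missed
```

Accuracy alone hides the failure; minority-class recall exposes it, which is why “right evaluation metrics” appears in the list of techniques below.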

What is the solution?

The solution is to balance the data. We can do that using the following techniques:

  1. Resample the training set
  2. Right evaluation metrics
  3. K-fold Cross-Validation
  4. Cluster the abundant class

But here we’ll discuss only the resampling technique.

Resampling Technique

Resampling is a method for balancing an imbalanced class distribution. It provides a convenient and effective way to deal with imbalanced learning problems using standard classifiers, because it alters the original training set rather than modifying the learning algorithm.

Resampling comes in two forms:

  1. Under Sampling
  2. Over Sampling

Under Sampling

The under-sampling technique decreases the frequency of the majority class. In doing so, we lose a large portion of data that could have been useful for training the model.

In the following example, there is a large difference between the frequencies of class 0 and class 1.


Let’s apply under-sampling.
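The article’s screenshots are not reproduced here, so the following is a minimal sketch of random under-sampling using plain pandas (the toy DataFrame, column names, and 950/50 split are assumptions for illustration):

```python
import pandas as pd

# Hypothetical imbalanced dataset: class 0 is the majority
df = pd.DataFrame({
    "feature": range(1000),
    "label":   [0] * 950 + [1] * 50,
})

minority = df[df["label"] == 1]
# Randomly drop majority rows until both classes have the same count
majority = df[df["label"] == 0].sample(n=len(minority), random_state=42)

balanced = pd.concat([majority, minority])
print(balanced["label"].value_counts())
# 50 records per class
```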


We can see that it reduces the frequency of majority class 0 to match that of class 1.

Over Sampling

This technique increases the frequency of the minority class. Its drawback is that it can result in overfitting, because it makes exact copies of the minority-class records. Moreover, the size of the training set grows, which increases the time needed to build a model.

Let’s apply over-sampling.
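Again in place of the screenshots, here is a sketch of random over-sampling with plain pandas (same assumed toy DataFrame as before): sampling the minority class with replacement duplicates rows, which is exactly the source of the overfitting risk mentioned above.

```python
import pandas as pd

# Hypothetical imbalanced dataset: class 0 is the majority
df = pd.DataFrame({
    "feature": range(1000),
    "label":   [0] * 950 + [1] * 50,
})

majority = df[df["label"] == 0]
# Sample minority rows WITH replacement until the classes match;
# the result contains exact duplicate copies of minority records
minority = df[df["label"] == 1].sample(
    n=len(majority), replace=True, random_state=42
)

balanced = pd.concat([majority, minority])
print(balanced["label"].value_counts())
# 950 records per class
```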


We can see that the over-sampling technique increases the number of instances of minority class 1.

Those were the random resampling techniques for balancing imbalanced data.


Shahzad is a data scientist with one and a half years of experience. He studied CS and knows multiple programming languages.