
How to Handle Imbalanced Datasets in Classification

Abhay Abhay 4 min read

Picture this: you build a fraud detector, run it on a million transactions, and it scores a glorious 99.8% accuracy. You celebrate, ship it, and then accounting calls to ask why it never actually caught any fraud. The grim punchline? If only 0.2% of transactions are fraudulent, a model that smugly predicts “all clear” every single time also hits 99.8%. Your classifier learned the laziest trick in the book and you handed it a trophy for it.

This is the trap of imbalanced datasets, and it shows up everywhere the interesting thing is rare: fraud, disease screening, defect detection, churn. The minority class is the whole point, and accuracy quietly ignores it.

Why accuracy lies

Accuracy is just “fraction of predictions that were right.” When 99.8% of your data belongs to one class, always predicting that class gets you 99.8% for free. The metric isn’t broken; it’s answering a question you don’t care about. You don’t want to know how often the model is right overall; you want to know how well it finds the rare thing without crying wolf constantly.
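
To see the trap in numbers, here is a minimal sketch in plain Python. The counts mirror the fraud story above (a million transactions, 0.2% fraud); the always-"all clear" predictions are a made-up stand-in for a lazy model, not anything you would actually train.

```python
# Toy numbers matching the fraud example: 1,000,000 transactions, 0.2% fraud.
n_total = 1_000_000
n_fraud = 2_000                      # 0.2% of the total
n_legit = n_total - n_fraud

# Hypothetical "model" that predicts "all clear" (0) for every transaction.
y_true = [1] * n_fraud + [0] * n_legit
y_pred = [0] * n_total

correct = sum(t == p for t, p in zip(y_true, y_pred))
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / n_total         # share of all predictions that were right
recall = caught / n_fraud            # share of actual fraud that was caught

print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
# accuracy = 99.8%, recall = 0.0%
```

Same predictions, two very different verdicts: the headline number looks superb while the number you actually care about is zero.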

So the very first move is to stop staring at accuracy and use metrics that respect the minority class:

  • Recall (a.k.a. sensitivity): of all the actual fraud, how much did we catch? Miss real fraud and recall punishes you.
  • Precision: of everything we flagged, how much was actually fraud? Flag everything and precision tanks.
  • F1: the harmonic mean of the two, when you need a single number.
  • PR-AUC (average precision): summarises precision/recall across every threshold. For heavy imbalance it’s far more honest than ROC-AUC, which can look flatteringly high even when the model is mediocre on the rare class.
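
from sklearn.metrics import average_precision_score, classification_report

# Assumes a fitted binary classifier `model` and a held-out X_test, y_test.
y_pred = model.predict(X_test)               # hard labels at the default threshold
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive (rare) class

print(classification_report(y_test, y_pred, digits=3))  # per-class precision, recall, F1
print("PR-AUC:", average_precision_score(y_test, y_proba))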
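
The gap between ROC-AUC and PR-AUC is easy to reproduce. The sketch below is an illustration on made-up data: the scores are drawn from two overlapping normal distributions rather than from a trained model, and the 0.5% fraud rate is an assumption chosen for the demo.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores (assumption): 99,500 legitimate rows and 500 fraud rows.
n_neg, n_pos = 99_500, 500
scores_neg = rng.normal(loc=0.0, scale=1.0, size=n_neg)
scores_pos = rng.normal(loc=1.5, scale=1.0, size=n_pos)  # higher on average, but overlapping

y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
y_score = np.concatenate([scores_neg, scores_pos])

roc_auc = roc_auc_score(y_true, y_score)            # ranking quality, blind to class ratio
pr_auc = average_precision_score(y_true, y_score)   # precision/recall trade-off on the rare class

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC:  {pr_auc:.3f}")
```

On data like this you should expect ROC-AUC to land well above PR-AUC. ROC-AUC is built from rates computed within each class, so it barely notices that even a small false-positive rate on 99,500 legitimate rows swamps the 500 real frauds. Precision notices immediately.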

Technique 1: Reweight the classes (the lazy genius move)

Before you touch your data, try the one-liner. Many scikit-learn classifiers, including LogisticRegression, SVC, and RandomForestClassifier, accept class_weight="balanced", which weights each class in inverse proportion to its frequency so that mistakes on the rare class cost more during training. No new rows, no extra libraries, no fuss.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

This is often the cheapest fix with the biggest payoff, and it costs you one keyword argument. Start here.
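
If you are curious what "balanced" actually computes, the sketch below checks scikit-learn's compute_class_weight helper against the formula n_samples / (n_classes * count_per_class). The label counts (990 versus 10) are toy values picked for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 990 majority rows (0) and 10 minority rows (1).
y = np.array([0] * 990 + [1] * 10)
classes = np.unique(y)

# scikit-learn's helper for the "balanced" heuristic.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same numbers by hand: n_samples / (n_classes * count_per_class).
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
# {0: 0.505, 1: 50.0}
```

With a 99:1 split, each minority row ends up weighted 99 times more heavily than a majority row, which is exactly the "inverse proportion" promised above.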

Technique 2: Resample the data

If reweighting isn’t enough, change the class balance directly. Two flavours:

  • Oversampling the minority class. The crude version duplicates rows; the smarter version, SMOTE (Synthetic Minority Over-sampling Technique), creates new synthetic examples by interpolating between a minority point and one of its nearest minority-class neighbours. It lives in imbalanced-learn (version 0.14 as of mid-2026), scikit-learn’s sister library.
  • Undersampling the majority class, by throwing away majority rows until things even out. Fast, but you’re literally discarding data, so use it when the majority is enormous.
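
Both flavours are simple enough to sketch in NumPy. The two helpers below are hypothetical names written for illustration, not imbalanced-learn's API: random_undersample keeps an equal-sized random subset of every class, and smote_like_oversample shows only the core SMOTE idea of interpolating between a minority point and one of its nearest minority neighbours. The library version handles more cases, so treat this as a picture of the mechanism rather than a replacement.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style interpolation between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority rows to interpolate")
    k = min(k, n - 1)

    # Brute-force k nearest minority neighbours for every minority row.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]   # shape (n, k)

    base = rng.integers(0, n, size=n_new)                       # starting points
    partner = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    lam = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + lam * (X_min[partner] - X_min[base])


# Toy data (assumption): 970 majority rows, 30 minority rows, 2 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = np.array([0] * 970 + [1] * 30)

X_under, y_under = random_undersample(X, y)
X_min = X[y == 1]
X_synth = smote_like_oversample(X_min, n_new=940)

print(np.bincount(y_under))   # [30 30]
print(X_synth.shape)          # (940, 2)
```

Every synthetic row sits on a line segment between two real minority rows, which is why SMOTE fills in the minority region instead of stacking exact copies. In practice you would normally let imbalanced-learn do this for you: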

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)  # train only!

That # train only! comment is doing heavy lifting, which brings us to the part everyone gets wrong.

The pitfall that ruins everything: resampling the test set

Repeat after me: you resample only the training data, never the test set. Your test set must reflect reality, and reality is imbalanced. If you SMOTE before splitting, synthetic minority points leak across the train/test boundary, your test set gets stuffed with fabricated examples, and your metrics turn into fan fiction. You’ll see beautiful numbers and ship a disaster.
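
Here is a small sketch of how bad the leak gets. It uses plain row duplication instead of SMOTE so that leaked rows can be counted exactly, and it tracks row ids rather than real features; the data (2,000 rows, 1% minority) is invented for the demo.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data (assumption): 2,000 rows, 20 of them minority (1%).
n_rows, n_minority = 2_000, 20
y = np.zeros(n_rows, dtype=int)
y[:n_minority] = 1
row_ids = np.arange(n_rows)   # stand-in for the feature rows

# WRONG ORDER: oversample first (duplicate minority rows), then split.
n_extra = (n_rows - n_minority) - n_minority
extra = rng.choice(row_ids[y == 1], size=n_extra, replace=True)
ids_over = np.concatenate([row_ids, extra])
y_over = y[ids_over]
bad_train_ids, bad_test_ids, _, y_test_bad = train_test_split(
    ids_over, y_over, test_size=0.25, stratify=y_over, random_state=0
)
bad_test_minority = bad_test_ids[y_test_bad == 1]
leaked = np.isin(bad_test_minority, bad_train_ids).mean()

# RIGHT ORDER: split first, then oversample only the training part.
train_ids, test_ids, y_train, y_test = train_test_split(
    row_ids, y, test_size=0.25, stratify=y, random_state=0
)
n_extra = int((y_train == 0).sum() - (y_train == 1).sum())
extra = rng.choice(train_ids[y_train == 1], size=n_extra, replace=True)
train_ids_over = np.concatenate([train_ids, extra])

print(f"wrong order: {leaked:.0%} of test minority rows also sit in train")
print(f"wrong order: test fraud rate = {y_test_bad.mean():.1%}")
print(f"right order: test fraud rate = {y_test.mean():.1%}")
print("right order: train/test overlap =", bool(np.isin(test_ids, train_ids_over).any()))
```

In the wrong order, the test set is half "fraud" and nearly every one of those rows has a twin in the training data, so the model is graded on answers it has already seen. In the right order, the test set keeps its real 1% rate and shares nothing with the training rows.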

The fix is a pipeline that resamples inside cross-validation, so it only ever touches the training fold:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# scoring on average precision = PR-AUC
scores = cross_val_score(pipe, X, y, cv=5, scoring="average_precision")

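Note that’s imblearn’s Pipeline, not scikit-learn’s. scikit-learn’s Pipeline does not accept resampling steps, while imblearn’s applies them during fit only and skips them at prediction time, which is exactly the behaviour you need.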

Technique 3: Tune the threshold

Classifiers default to a 0.5 decision threshold, an arbitrary midpoint that rarely suits skewed data. Since predict_proba gives you class probabilities, you can pick the cutoff that matches your actual priorities. Catching fraud matters more than the odd false alarm? Lower the threshold to buy recall. Plot precision and recall across thresholds on held-out validation data and choose deliberately instead of letting 0.5 decide for you.

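import numpy as np
from sklearn.metrics import precision_recall_curve

# Tune on a validation split (y_val, y_val_proba), not the test set,
# so the final test metrics stay honest.
precision, recall, thresholds = precision_recall_curve(y_val, y_val_proba)
f1s = 2 * precision * recall / (precision + recall + 1e-9)
# precision and recall have one more entry than thresholds; drop the final point.
best = thresholds[np.argmax(f1s[:-1])]
print("Best F1 threshold:", best)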
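
F1 treats a missed fraud and a false alarm as equally bad, which is rarely true. If you can put rough prices on the two mistakes, you can pick the threshold that minimises total cost instead. A minimal sketch, assuming made-up costs and a hypothetical helper name (best_threshold_by_cost); the six scores are toy values, not model output, and in real use you would feed it validation data.

```python
import numpy as np


def best_threshold_by_cost(y_true, y_score, cost_fp=1.0, cost_fn=20.0):
    """Return (threshold, total_cost) minimising cost_fp * FP + cost_fn * FN."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)

    # Try every observed score as a cutoff, plus "flag nothing" (infinity).
    candidates = np.append(np.unique(y_score), np.inf)
    costs = []
    for t in candidates:
        flagged = y_score >= t
        fp = np.sum(flagged & ~y_true)    # false alarms
        fn = np.sum(~flagged & y_true)    # missed positives
        costs.append(cost_fp * fp + cost_fn * fn)

    best = int(np.argmin(costs))
    return float(candidates[best]), float(costs[best])


# Toy validation data (assumption): two positives, one of them scored low.
y_val = [0, 0, 0, 0, 1, 1]
y_val_proba = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]

# Missing fraud is 20x worse than a false alarm: the cutoff drops to catch both.
t_recall, cost_recall = best_threshold_by_cost(y_val, y_val_proba, cost_fp=1, cost_fn=20)

# False alarms are 20x worse than a miss: the cutoff rises to flag only the sure thing.
t_precise, cost_precise = best_threshold_by_cost(y_val, y_val_proba, cost_fp=20, cost_fn=1)

print(t_recall, cost_recall)     # 0.4 1.0
print(t_precise, cost_precise)   # 0.9 1.0
```

Same scores, opposite priorities, different thresholds. That is the whole point of tuning: the model ranks, and you decide where to draw the line.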

Your imbalanced-data checklist

When the interesting class is rare, do this, in order:

  1. Drop accuracy as your headline metric. Track precision, recall, F1, and PR-AUC instead.
  2. Stratify your split so train and test keep the same class ratio.
  3. Try class_weight="balanced" first — it’s one argument and often enough.
  4. Resample if needed (SMOTE or undersampling), and do it inside a pipeline, only on training data.
  5. Tune the threshold to your real-world cost of false positives vs. false negatives.
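
Step 2 is the one item on this list without its own section, so here is a quick sketch of what stratify buys you. The data is a toy label vector with 20 positives out of 1,000; the ten seeds are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption): 980 majority rows, 20 minority rows.
y = np.array([0] * 980 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature column

plain_counts, strat_counts = [], []
for seed in range(10):
    # Plain random split: the number of rare rows in the test set is left to chance.
    _, _, _, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    plain_counts.append(int(y_te.sum()))

    # Stratified split: train and test keep the same class ratio.
    _, _, _, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    strat_counts.append(int(y_te.sum()))

print("minority rows in test, plain:     ", plain_counts)
print("minority rows in test, stratified:", strat_counts)
```

The stratified split puts 5 of the 20 rare rows in every test set, while the plain split drifts from seed to seed. With only a handful of positives, that drift is enough to make recall on the test set swing wildly for reasons that have nothing to do with your model.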

Imbalanced data isn’t a bug to be erased; it’s the shape of the problem. Respect it, measure honestly, and your model will start catching the rare thing that actually pays the bills — instead of winning gold medals for doing nothing.
