Dealing with Class Imbalance: Resampling, Weighting, Ensembles

Class imbalance is a common problem in practice: real-world data is rarely as clean as a public benchmark dataset, and the positive and negative counts are often heavily skewed. There are three families of fixes: resampling, weight balancing, and ensembles.

Resampling

Resampling goes in two directions: undersample the majority class, or oversample the minority class.

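A classic undersampling algorithm is Tomek Links. A Tomek link is a pair of samples from opposite classes that are each other’s nearest neighbors; removing the majority-class member of each pair strips out the majority points that sit closest to the minority class and sharpens the decision boundary. It doesn’t always help, though: where the true boundary is subtle, removing these points can erase it and backfire.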

Figure: Undersampling with Tomek Links. Majority-class points hugging the minority class are removed to clean up the boundary.
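The removal rule can be sketched in a few lines of plain Python. This is a minimal brute-force illustration, not a reference implementation: the function names (`nearest_neighbor`, `tomek_links`, `undersample_tomek`), the Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math

def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    best_j, best_d = None, math.inf
    for j, x in enumerate(X):
        if j == i:
            continue
        d = math.dist(X[i], x)  # Euclidean distance (an assumption)
        if d < best_d:
            best_j, best_d = j, d
    return best_j

def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def undersample_tomek(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(X, y):
        drop.add(i if y[i] == majority_label else j)
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]

# Toy data (hypothetical): label 0 is the majority class, label 1 the minority.
X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.9, 0.0), (3.0, 0.0), (5.0, 0.0)]
y = [0, 0, 0, 0, 1, 1]
X_res, y_res = undersample_tomek(X, y, majority_label=0)
```

In this toy set the only link is between the majority point at 2.9 and the minority point at 3.0, so only the point at 2.9 is dropped. The pairwise search is quadratic in the number of samples, which is fine for a sketch but would need a spatial index on real data.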

A classic oversampling algorithm is SMOTE: for a minority-class sample $x_i$, find its K nearest minority-class neighbors, randomly pick one neighbor $\hat{x}_i$, and synthesize a new sample on the line segment between them, $x_{\text{new}} = x_i + \lambda\,(\hat{x}_i - x_i)$ with $\lambda$ drawn uniformly from $[0, 1]$; repeat until enough new samples exist.

Figure: Oversampling with SMOTE. New samples are synthesized along the line between a minority sample and its neighbor.
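As a rough sketch of the interpolation step, the snippet below generates synthetic minority points in plain Python. It is a simplification under stated assumptions: the function name `smote`, its parameters, and the toy points are hypothetical, the seed sample is picked at random rather than by iterating over every minority sample, and at least two minority samples are required.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Synthesize n_new points by interpolating between minority samples and neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # k nearest minority-class neighbors of x_i, excluding x_i itself
        others = [x for j, x in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
        x_hat = rng.choice(neighbors)
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x_i, x_hat)))
    return synthetic

# Toy minority-class points (hypothetical).
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 2.0)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority points, it always stays inside the region the minority class already spans; SMOTE fills that region in rather than extending it.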

Weight balancing

Another family leaves the data alone and reweights each sample’s contribution to the loss, the flagship being Focal Loss:

$$\text{FL}(p_t) = -\alpha_t\,(1-p_t)^{\gamma}\log(p_t)$$

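Here $p_t$ is the probability the model assigns to the sample’s true class. $\alpha_t$ handles positive/negative imbalance by giving positives and negatives different loss weights, while $\gamma$ handles easy/hard imbalance: the larger $\gamma$, the more the loss of high-confidence “easy” samples is suppressed, which concentrates training on the hard ones. With $\gamma = 0$ the formula reduces to class-weighted cross-entropy.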

Figure: Focal Loss focuses on hard samples. Larger γ pushes down the loss of easy (high-confidence) samples.
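To make the effect of $\gamma$ concrete, here is a minimal per-sample sketch of the binary case in plain Python. The function name `focal_loss`, the default values `alpha=0.25` and `gamma=2.0`, and the clipping constant `eps` are illustrative assumptions, not something the text above prescribes.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one sample.

    p: predicted probability of the positive class; y: true label (1 or 0).
    """
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class weight of the true class
    p_t = min(max(p_t, eps), 1.0)                # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (confidently correct) versus a hard positive.
easy = focal_loss(0.95, y=1)
hard = focal_loss(0.30, y=1)

# The same two samples with gamma = 0, i.e. class-weighted cross-entropy.
easy_ce = focal_loss(0.95, y=1, gamma=0.0)
hard_ce = focal_loss(0.30, y=1, gamma=0.0)
```

Comparing `hard / easy` with `hard_ce / easy_ce` shows the mechanism: the factor $(1-p_t)^{\gamma}$ shrinks the easy sample’s loss far more than the hard one’s, so the hard sample dominates the total by a much larger margin than under plain cross-entropy.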

Ensembles

A third family is ensemble learning. The common Bagging (bootstrap aggregating) trains several classifiers on different bootstrap samples of the data (subsets drawn with replacement), then aggregates their predictions by voting. The combined vote is often more robust than any single classifier.

Figure: Bagging algorithm. Several classifiers are trained on different data subsets, then aggregated by voting.
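The train-on-subsets-then-vote loop can be sketched as follows. This is a toy illustration in plain Python: the nearest-centroid base learner is an assumption chosen only to keep the example short, and all function names and data points are hypothetical.

```python
import math
import random
from collections import Counter

def fit_centroids(X, y):
    """Base learner: one centroid (mean point) per class."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: tuple(sum(c) / len(pts) for c in zip(*pts))
            for label, pts in groups.items()}

def predict_centroids(model, x):
    """Predict the label of the closest class centroid."""
    return min(model, key=lambda label: math.dist(model[label], x))

def bagging_fit(X, y, n_models=11, seed=0):
    """Train each base learner on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Aggregate the base learners by majority vote."""
    votes = Counter(predict_centroids(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Toy imbalanced data (hypothetical): six majority points, three minority points.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.4), (0.3, 0.2), (0.4, 0.0), (0.0, 0.3),
     (3.0, 3.0), (3.2, 2.9), (2.9, 3.1)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
models = bagging_fit(X, y)
label_near_majority = bagging_predict(models, (0.1, 0.1))
label_near_minority = bagging_predict(models, (3.1, 3.0))
```

One caveat this sketch makes visible: with a heavily skewed dataset, a plain bootstrap sample can miss the minority class entirely, in which case that base learner can only ever vote for the majority class. The vote across many learners absorbs a few such models.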

References

He, Haibo, Ma, Yunqian. Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley, 2013.
Chawla, Nitesh V., et al. SMOTE: Synthetic Minority Over-sampling Technique. JAIR, 2002.
Lin, Tsung-Yi, et al. Focal Loss for Dense Object Detection. ICCV, 2017.


2021 · 01 · 18