Overfitting, Underfitting, and the Bias-Variance Tradeoff

There’s a student we all remember from school: the one who memorised every answer in the textbook, aced the practice questions verbatim, then completely fell apart when the exam asked something slightly different. That student is a perfect metaphor for an overfit machine learning model — brilliant at the past, useless at the future.

At the other end sits the student who skimmed one chapter, decided “eh, close enough,” and answered every question with a confident shrug. That’s underfitting. Somewhere between these two academic disasters lives a good model — and getting there is what the bias-variance tradeoff is all about.

Memorising vs. shrugging

Overfitting happens when a model learns the training data too well — including its noise, quirks, and random flukes. It mistakes coincidence for signal. Show it new data and it stumbles, because it never learned the underlying pattern; it just memorised the answer key.

Underfitting is the opposite failure. The model is too simple to capture the real relationship in the data. It fits poorly on the training set and on everything else. Picture fitting a straight line through data that clearly curves — no amount of new data saves you; the model just isn’t flexible enough to care.

The tell-tale sign:

Overfit: great training accuracy, lousy test accuracy. (Big gap.)
Underfit: lousy training accuracy and lousy test accuracy. (Bad at both.)
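
To see both signatures side by side, here is a minimal sketch using NumPy only. Everything in it is an illustrative assumption rather than something from a real project: the noisy sine-wave data, the sample sizes, and the three polynomial degrees standing in for "too simple", "about right", and "too flexible". It reports mean squared error rather than accuracy, so lower is better.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine wave (a made-up ground truth for illustration)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set, easy to memorise
x_test, y_test = make_data(500)     # fresh data the model never saw

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 12):           # too rigid, reasonable, too flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree {degree:>2}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

The straight line should score poorly on both sets (the shrugger), while the degree-12 fit should show a small training error next to a noticeably larger test error (the memoriser).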

Bias and variance, in human terms

These two failures map onto two sources of error.

Bias is error from being too rigid — the model makes strong, simplifying assumptions and misses the truth. High bias is the perpetual shrugger: it’ll give roughly the same wrong answer no matter what data you train it on.

Variance is error from being too sensitive. A high-variance model reacts dramatically to the specific training set it saw. Swap in a different sample of data and it produces a wildly different model — because it’s been busy memorising noise that won’t repeat.
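
One way to make these two definitions concrete is to train the same kind of model on many different training sets and watch how its predictions move. The sketch below does that with NumPy polynomial fits. The sine-wave ground truth, the noise level, the sample sizes, and the degrees 1 and 9 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Evaluation points, kept slightly inside the sampled range to avoid extrapolation.
x_grid = np.linspace(-0.9, 0.9, 50)
true_f = np.sin(np.pi * x_grid)     # assumed ground truth, known only because we made it up

def bias_variance(degree, n_datasets=200, n_points=30, noise=0.5):
    """Estimate squared bias and variance of a polynomial fit by retraining
    on many independently drawn training sets."""
    preds = np.empty((n_datasets, x_grid.size))
    for i in range(n_datasets):
        x = rng.uniform(-1.0, 1.0, size=n_points)
        y = np.sin(np.pi * x) + rng.normal(scale=noise, size=n_points)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_f) ** 2))   # how far the average model is from the truth
    variance = float(np.mean(preds.var(axis=0)))          # how much individual models disagree
    return bias_sq, variance

rigid = bias_variance(degree=1)      # the shrugger
flexible = bias_variance(degree=9)   # the memoriser

print(f"degree 1: bias^2 = {rigid[0]:.3f}, variance = {rigid[1]:.3f}")
print(f"degree 9: bias^2 = {flexible[0]:.3f}, variance = {flexible[1]:.3f}")
```

The rigid model’s error should be dominated by bias, and the flexible model’s by variance.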

Here’s the catch, and it’s the whole point: reducing one tends to increase the other. Make a model more flexible and bias drops but variance climbs. Make it simpler and variance drops but bias climbs. You can’t slam both to zero. You can only find the balance.
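
For squared-error loss the tradeoff can be written down exactly. The notation here is an assumption, not something defined earlier in this post: data are generated as y = f(x) + ε with zero-mean noise of variance σ², and \hat{f}_D is the model trained on a randomly drawn training set D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat{f}_D(x)\bigr)^2\right]
= \underbrace{\bigl(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\bigr)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\bigl(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\bigr)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The last term is a floor that no model choice removes, which is one more reason the total never reaches zero.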

The U-shaped curve

Plot a model’s error against its complexity and you get one of the most famous pictures in machine learning. As complexity rises, training error keeps falling — the model gets ever better at reciting the textbook. But test error (the part you actually care about) traces a U-shape: it drops as the model gets capable enough to learn real patterns, bottoms out at the sweet spot, then climbs again as the model starts memorising noise.

The bottom of that U is where you want to live. Too far left, you’re underfitting. Too far right, you’re overfitting. The art is finding the dip.

A worth-knowing plot twist: double descent

The classic U-shape ruled textbooks for decades, then enormous neural networks showed up and complicated the story. Researchers observed a phenomenon called double descent: push model size past the point where it perfectly fits (interpolates) the training data, and test error can, surprisingly, start falling again. It’s one reason a giant overparameterised transformer can fit its training set almost perfectly and still generalise well, something the classical picture said shouldn’t happen. The U-shape isn’t wrong; it’s just one act in a longer play. For most everyday models, though, the U is still your map.

The practical toolkit

You don’t need a PhD to manage this. You need a handful of reliable moves:

Get more data. Often the most reliable cure for high variance: noise is harder to memorise when there’s more genuine signal to learn. It does little for high bias, though; a model that’s too simple stays too simple.
Adjust complexity. Underfitting? Add features, use a more expressive model. Overfitting? Simplify, prune features, shrink the network.
Regularise. L1 and L2 penalties add a cost for large weights, so the model can’t lean too hard on any one feature (L1 can push some weights all the way to zero); dropout randomly switches off neurons during training so the network can’t over-rely on a clique of them.
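Early stopping. Watch validation error during training and stop once it has failed to improve for several checks in a row (a window often called "patience", since validation error is noisy and a single uptick can be a fluke), then keep the best checkpoint. That’s the U-curve turning the corner in real time.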
Cross-validation. Don’t trust a single train/test split. K-fold rotates the held-out fold across the whole dataset and averages the results, giving you a far more honest estimate of how the model will actually generalise.
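
Two of these moves, regularisation and cross-validation, fit together naturally, so here is a hedged sketch that combines them with scikit-learn. It deliberately uses an overly flexible model (degree-10 polynomial features) and lets 5-fold cross-validation pick the strength of an L2 penalty. The synthetic data, the degree, and the list of candidate alpha values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Made-up data: a noisy sine wave with only 40 points.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 1))
y = np.sin(np.pi * X[:, 0]) + rng.normal(scale=0.3, size=40)

# Shuffled 5-fold split, reused for every alpha so the comparison is fair.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

alphas = [1e-4, 1e-2, 1.0, 1e2, 1e4]   # L2 penalty strengths, weak to strong
cv_mse = {}
for alpha in alphas:
    model = make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),  # far more flexible than needed
        StandardScaler(),                                   # put features on a common scale
        Ridge(alpha=alpha),                                 # L2-regularised linear regression
    )
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[alpha] = -scores.mean()      # flip the sign back to an ordinary MSE
    print(f"alpha={alpha:g}: cross-validated MSE = {cv_mse[alpha]:.3f}")

best_alpha = min(cv_mse, key=cv_mse.get)
print(f"Best alpha: {best_alpha:g}")
```

A very large alpha flattens the model into an underfit, a very small one leaves the flexible model free to chase noise, and the cross-validated error points to a value in between.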

Here’s the diagnosis pattern in code:

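from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Stand-in data so the snippet runs on its own; swap in your own X and y.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    flip_y=0.1, random_state=0,
)

depths = np.arange(1, 20)  # candidate values for max_depth
# Each result has shape (n_depths, n_folds): one row per depth, one column per fold.
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)

# Average across folds to get one score per depth.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# Training accuracy keeps rising; validation accuracy peaks, then falls.
# The depth where val_mean peaks is your sweet spot.
best_depth = depths[np.argmax(val_mean)]
print(f"Best max_depth: {best_depth}")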

Watch train_mean climb towards 1.0 while val_mean rises, peaks, then sags. That’s the bias-variance tradeoff drawn live. These are accuracy scores rather than errors, so the picture is upside down: the peak of val_mean is the bottom of your U.

The takeaway

When a model disappoints, run one quick check before anything else: compare training and validation error.

Both bad? You’re underfitting — add complexity.
Training great, validation bad? You’re overfitting — add data or regularisation.
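
That check is simple enough to write as a tiny helper. The function below is a hypothetical illustration, not a standard API, and its two thresholds (what counts as a "good enough" training score and how large a train/validation gap is tolerable) are assumptions you would tune for your own problem.

```python
def diagnose(train_score, val_score, good_enough=0.85, max_gap=0.05):
    """Classify a model from its training and validation accuracy.

    good_enough and max_gap are illustrative defaults, not universal constants.
    """
    if train_score < good_enough:
        # The model can't even fit the data it was trained on.
        return "underfitting"
    if train_score - val_score > max_gap:
        # Fits the training data well but doesn't carry over to new data.
        return "overfitting"
    return "reasonable fit"

print(diagnose(0.62, 0.60))
print(diagnose(0.99, 0.71))
print(diagnose(0.91, 0.89))
```
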

That single comparison tells you which direction to move along the curve. Master that reflex and you’ve internalised the whole tradeoff — no memorisation required.