Regularization: L1, L2, Dropout, and Weight Decay
Abhay
Every model wants to be a star pupil. Give it enough freedom and it will happily memorise your training data down to the last typo — then fall flat on anything new. We’ve already met that overachiever in the bias-variance post: the model that confuses noise for signal and overfits spectacularly. This post is about the discipline that keeps it honest. Regularization is, roughly, the art of telling your model: “Be confident, but not that confident.”
The core idea is simple. Overfitting usually shows up as a model with wild, oversized weights — a few parameters cranked up to compensate for quirks in the data. Regularization fights back by adding a penalty for complexity to the loss function. The model now has two jobs: fit the data and keep itself small. Let’s meet the main tools.
L2 (Ridge): shrink everything
L2 regularization adds the sum of the squared weights to the loss:
loss = error + λ · Σ(wᵢ²)
That λ (lambda) is the strength dial. Crank it up and the model is heavily punished for large weights, so it spreads influence thinly across many features instead of betting the house on a few. L2 rarely sets a weight to exactly zero — it just keeps everyone politely small. In scikit-learn it’s a one-liner:
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0) # alpha is λ; higher = more shrinkage
model.fit(X_train, y_train)
L2 is the reliable default. It’s stable, smooth, and plays nicely with gradient descent.
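To see the shrinkage without any library magic, here is a minimal NumPy sketch of ridge regression using its closed-form solution. It assumes the error term is the plain sum of squared residuals; the `ridge_fit` helper and the synthetic data are made up for illustration and are not taken from scikit-learn.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimiser of ||y - Xw||^2 + lam * sum(w_i^2)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)          # synthetic data, purely illustrative
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:>6}: ||w|| = {norm:.3f}")
```

The weight norm falls as λ grows, but the individual weights only get smaller; none of them lands on exactly zero.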
L1 (Lasso): pick favourites
L1 uses the absolute values of the weights instead of squares:
loss = error + λ · Σ|wᵢ|
That small change has a dramatic consequence. Because of the geometry of the absolute-value penalty, L1 pushes some weights all the way to exactly zero — effectively deleting those features. Lasso doesn’t just shrink; it performs automatic feature selection, handing you a sparse, interpretable model. If you’ve got 500 features and suspect only a handful matter, Lasso is your friend:
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
# with a large enough alpha, some model.coef_ entries will be exactly 0.0
The catch: L1 can be a bit temperamental when features are correlated, sometimes arbitrarily keeping one and dropping its twin. Can’t decide between sparsity and stability? Elastic Net mixes both penalties and is often the pragmatic choice.
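The "geometry" argument is easier to trust with numbers. The sketch below works under a simplifying assumption: the features are orthonormal, so each weight can be solved on its own, and the error term is half the squared error. In that special case L1 reduces to soft-thresholding and L2 to a uniform rescaling. The function names and the starting weights are hypothetical.

```python
import numpy as np

def l1_solution(w_ols, lam):
    """Soft-thresholding: minimises 0.5*(w - w_ols)^2 + lam*|w| per weight."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

def l2_solution(w_ols, lam):
    """Minimises 0.5*(w - w_ols)^2 + lam*w^2 per weight."""
    return w_ols / (1.0 + 2.0 * lam)

w_ols = np.array([3.0, -0.5, 0.2, -2.0])   # hypothetical unregularized weights
lam = 1.0
w_l1 = l1_solution(w_ols, lam)
w_l2 = l2_solution(w_ols, lam)
print("L1:", w_l1)
print("L2:", w_l2)
```

Any weight whose magnitude is below λ is snapped to zero by the L1 rule, while the L2 rule divides every weight by the same factor and leaves all four non-zero.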
Dropout: the neural-net classic
L1 and L2 are general-purpose, but deep networks have their own favourite trick. Dropout randomly switches off a fraction of neurons on each training pass — typically 20–50% of them. The network can never rely on any single neuron always being there, so it’s forced to learn redundant, robust representations rather than fragile co-dependencies. It’s a bit like rotating staff so the whole team learns the job, not just one indispensable hero.
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # drop 50% of activations during training
    nn.Linear(256, 10),
)
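Crucially, dropout only happens during training. At inference time every neuron is back on deck. PyTorch’s nn.Dropout uses inverted dropout: it scales the surviving activations by 1/(1 − p) during training, so nothing needs rescaling at inference. The result behaves like a cheap approximation to averaging an ensemble of thinned networks. Just remember to call model.eval() so the framework turns dropout off; forgetting that is a rite-of-passage bug.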
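If the train/eval distinction feels abstract, a from-scratch version may help. This is a minimal NumPy sketch of inverted dropout, not PyTorch's actual implementation; the `dropout` function and its `training` flag are illustrative names.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each activation with probability p while training."""
    if not training or p == 0.0:
        return x                       # inference: identity, no rescaling needed
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)   # scale survivors so the expected value is unchanged

rng = np.random.default_rng(0)
activations = np.ones(10_000)

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print("fraction zeroed in training:", np.mean(train_out == 0.0))
print("mean activation in training:", train_out.mean())
print("mean activation at inference:", eval_out.mean())
```

Roughly half the activations are zeroed in training mode, yet the mean stays close to 1 because the survivors are doubled. In inference mode the input passes through untouched.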
Weight decay: the same thing, mostly
Here’s where people tie themselves in knots. Weight decay and L2 look identical, since both nudge weights toward zero, and for plain SGD they’re mathematically equivalent (up to a rescaling of the coefficient). But with adaptive optimizers like Adam, they part ways. Classic L2 folds the penalty into the gradient, where Adam’s per-parameter scaling then distorts it: weights with a history of large gradients end up regularized less than the rest. AdamW instead decouples the decay, shrinking each weight by the same fraction directly during the update, outside the adaptive scaling. That subtle fix often gives better generalization, which is why AdamW has become a common default for training transformers and many modern deep nets.
import torch
# weight_decay here is decoupled — this is true weight decay, not L2-via-gradient
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
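The difference is easiest to see on a single update step. The sketch below is a simplified NumPy illustration, not the PyTorch optimizers themselves: `sgd_step` and `adam_first_step` are hypothetical helpers, only the first Adam step is modelled (where the bias-corrected moments reduce to g and g²), and the `l2` coefficient multiplies the weight directly in the gradient, so it corresponds to 2λ in the loss written earlier.

```python
import numpy as np

def sgd_step(w, grad, lr, l2=0.0, decay=0.0):
    """One plain SGD step with either an L2 gradient term or decoupled decay."""
    w_new = w - lr * (grad + l2 * w)   # L2: penalty gradient joins the data gradient
    return w_new - lr * decay * w      # decoupled: shrink the weight directly

def adam_first_step(w, grad, lr, l2=0.0, decay=0.0, eps=1e-8):
    """First Adam step (after bias correction the moments are g and g^2)."""
    g = grad + l2 * w                  # L2 term is rescaled by Adam along with the gradient
    w_new = w - lr * g / (np.sqrt(g * g) + eps)
    return w_new - lr * decay * w      # decoupled decay bypasses the rescaling

w = np.array([1.0, 1.0])
grad = np.array([10.0, 0.1])           # hypothetical data gradients of very different size
lr, strength = 0.1, 0.1

sgd_l2 = sgd_step(w, grad, lr, l2=strength)
sgd_wd = sgd_step(w, grad, lr, decay=strength)
adam_l2 = adam_first_step(w, grad, lr, l2=strength)
adam_wd = adam_first_step(w, grad, lr, decay=strength)
print("SGD  L2 vs decay:", sgd_l2, sgd_wd)
print("Adam L2 vs decay:", adam_l2, adam_wd)
```

Under SGD the two variants land on the same weights. Under the Adam-style step the L2 term is swallowed by the gradient normalisation, so only the decoupled version shrinks the weights by the extra lr × decay fraction.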
The free regularizers
Two more techniques cost almost nothing. Early stopping simply halts training when validation performance stops improving — you stop the model before it starts memorising. And data augmentation (flipping, cropping, or jittering images; paraphrasing text) shows the model endless variations of the same data, which makes overfitting to specific examples much harder. Neither adds a penalty term, but both quietly do a regularizer’s job.
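Early stopping is simple enough to sketch in a few lines of plain Python. The version below assumes you already have a validation loss per epoch and uses a "patience" counter, which is one common convention rather than the only one; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # improvement: remember this checkpoint
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch        # ran out of epochs without triggering

# Hypothetical validation curve: improves, then starts creeping back up.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(f"stopped at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In a real training loop you would save a checkpoint whenever the validation loss improves and reload the best one after stopping.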
Tuning the strength
All of this hinges on one knob: how much to regularize.
- Too little (λ near zero, dropout near zero): the model overfits, with great training scores and poor validation scores.
- Too much: you’ve handcuffed the model into underfitting, and both scores sag.
The sweet spot is empirical. Sweep λ (or alpha) across a log-scale grid, say 0.001, 0.01, 0.1, 1, 10, and dropout p across a handful of values in its usual 0.2 to 0.5 range, using cross-validation, and watch the validation curve, not the training curve.
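A log-scale sweep with cross-validation can be sketched in plain NumPy. This is a minimal illustration under stated assumptions: closed-form ridge regression as the model, five contiguous folds, and synthetic data. The helpers `ridge_fit` and `cv_mse` are hypothetical names, and in practice a library utility would normally do this job.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights for ||y - Xw||^2 + lam * sum(w_i^2)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """Mean validation MSE over k contiguous folds."""
    folds = np.array_split(np.arange(len(y)), k)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False                 # hold this fold out
        w = ridge_fit(X[train_mask], y[train_mask], lam)
        scores.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(0)                      # synthetic data, illustrative only
X = rng.normal(size=(60, 30))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=60)

grid = [0.001, 0.01, 0.1, 1, 10]
results = {lam: cv_mse(X, y, lam) for lam in grid}
best_lam = min(results, key=results.get)
for lam, mse in results.items():
    print(f"lambda={lam:<6} validation MSE={mse:.3f}")
print("best lambda:", best_lam)
```

If the best value sits at either end of the grid, that is usually a hint to extend the grid in that direction and sweep again.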
The takeaway: reach for L2/weight decay (AdamW) as your default, add dropout in neural nets, switch to L1/Elastic Net when you want feature selection, and lean on early stopping and augmentation for free. Then tune the strength by watching the gap between training and validation scores — close that gap, and your star pupil finally learns to handle the real exam.