Learning Rate Schedules and Warmup, Explained

You already know what a learning rate is: the size of the step your model takes downhill each time it learns. What almost nobody tells you up front is that picking one good number and leaving it there for the entire run is, frankly, a rookie move. A fixed learning rate is like driving from your house to the motorway, down the motorway, and into a tight car park — all at exactly 50 mph. Too slow for the open road, lethal in the car park. The fix is to change the speed as you go. That’s all a learning rate schedule is.

Why a fixed rate is suboptimal

Training has phases. Early on, you’re miles from any good solution, so big confident steps pay off — you want to cover ground. Later, you’re hovering near a minimum, and big steps just make you bounce around the valley floor, never settling. The same number can’t be right for both. Hold it high and you never converge cleanly; hold it low and the first half of training crawls while your GPU meter spins like a taxi fare.

The intuition that runs through every schedule below is the same: start high to explore, end low to settle. Explore the landscape boldly, then tiptoe into the valley.

The classic decay schedules

The oldest trick is to simply turn the learning rate down over time. Three flavours dominate:

Step decay: keep it constant, then chop it (say, multiply by 0.1) every N epochs. Crude, effective, beloved by old-school computer-vision papers. You can usually spot the cliffs in the loss curve where the rate dropped.
Exponential decay: multiply by a fixed factor every step, so it glides down a smooth curve instead of falling off staircases. Gentler, fewer knobs to tune.
Cosine annealing: follow the shape of a half-cosine from your starting rate down to (near) zero. It lingers high for a while, then eases off gracefully at the end. This is the modern default, and for good reason — it just tends to work, and it has only one real parameter (how long the run is).
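
To make the three shapes concrete, here is a minimal sketch of each as a plain function of the step count. The function names, the default values (drop=0.1, every=30, gamma=0.95) and the min_lr argument are illustrative assumptions, not taken from any particular library.

```python
import math

def step_decay(step, base_lr, drop=0.1, every=30):
    """Hold the rate constant, then multiply it by `drop` once every `every` steps."""
    return base_lr * drop ** (step // every)

def exponential_decay(step, base_lr, gamma=0.95):
    """Multiply the rate by a fixed factor `gamma` at every step."""
    return base_lr * gamma ** step

def cosine_annealing(step, base_lr, total_steps, min_lr=0.0):
    """Follow a half-cosine from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Print the three schedules side by side at a few points in a 100-step run.
for step in (0, 25, 50, 75, 100):
    print(step,
          step_decay(step, 0.1),
          exponential_decay(step, 0.1),
          cosine_annealing(step, 0.1, total_steps=100))
```

Note that the cosine version is the only one that needs to know the run length up front, which is the "one real parameter" mentioned above.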

Warmup: starting slow on purpose

Here’s the plot twist that confuses everyone. After all that talk of starting high, the very first thing many modern training runs do is start low and ramp up over the first few hundred or few thousand steps. This is warmup, and it looks like a contradiction until you see why.

The culprit is the optimiser, usually Adam. Adam scales each step using a running estimate of recent squared gradients (loosely, their variance), but at step one it has almost no history, so that estimate is wildly noisy. Feed it your full learning rate immediately and those early, jittery updates can fling the weights somewhere awful before training has found its footing. Warmup hands Adam a few easy reps to stabilise its estimates before you ask it to lift heavy.
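
One way to see the problem is to trace the textbook Adam update for a single scalar weight. The sketch below is a hand-rolled illustration with the commonly used default hyperparameters assumed; the function name adam_updates is hypothetical and this is not any library's implementation.

```python
import math

def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the update Adam would apply to one scalar weight at each step."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the short history
        v_hat = v / (1 - beta2 ** t)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

# After one step both estimates rest on a single gradient, so the update
# is close to lr in magnitude whether that gradient was huge or tiny.
print(adam_updates([100.0])[0], adam_updates([0.001])[0])
```

Because the first update is roughly the full learning rate whatever the gradient's size, the only lever left for taming those early steps is the learning rate itself, which is exactly what warmup turns down.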

Why transformers especially? This noisy-estimate problem can in principle affect any Adam-trained model, yet in practice it’s transformers, and deep ones in particular, that genuinely fall over without warmup; very shallow models can often skip it. Their architecture is unusually sensitive in the early phase, when the weights are still changing quickly, so warmup acts as a throttle, preventing runaway updates before the network settles into a well-behaved regime. The canonical recipe of warming up and then decaying back down (linearly in BERT, along a cosine in many recent LLMs) is everywhere for exactly this reason.
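
The warmup-then-cosine recipe can be written as one function of the step count. This is a minimal sketch: the name warmup_cosine_lr, the min_lr argument, and the choice to start the ramp at peak_lr / warmup_steps rather than at zero are all assumptions made for illustration.

```python
import math

def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear ramp up to peak_lr, then half-cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup: rise linearly, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: fraction of the post-warmup run completed, clamped to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Sample the schedule at the start, the peak, the middle, and the end.
for step in (0, 250, 499, 500, 5_000, 10_000):
    print(step, warmup_cosine_lr(step, peak_lr=3e-4, warmup_steps=500, total_steps=10_000))
```

Measuring the cosine's progress from the end of warmup, rather than from step zero, keeps the curve continuous at the handover: the decay starts exactly at the peak the ramp just reached.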

Warm restarts and ReduceLROnPlateau

Two more tools worth knowing:

Warm restarts (SGDR) periodically yank the rate back up to its peak and cosine-anneal down again, over and over. Deliberately jolting it high can knock the model out of a mediocre minimum so it can find a better one — like shaking the box to settle the cereal properly.
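
As a rough sketch, warm restarts amount to running the cosine curve on the step count modulo a cycle length. The version below assumes every cycle has the same length, which is a simplification (cycles can also be made to grow over time); the name cosine_with_restarts and its parameters are hypothetical. In PyTorch the ready-made equivalent is CosineAnnealingWarmRestarts.

```python
import math

def cosine_with_restarts(step, peak_lr, cycle_steps, min_lr=0.0):
    """Cosine-anneal from peak_lr towards min_lr, jumping back to peak_lr every cycle_steps."""
    position = (step % cycle_steps) / cycle_steps   # progress within the current cycle, in [0, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * position))

# The rate falls through each 100-step cycle, then snaps back to the peak.
for step in (0, 50, 99, 100, 150, 199, 200):
    print(step, cosine_with_restarts(step, peak_lr=0.1, cycle_steps=100))
```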

ReduceLROnPlateau is the pragmatist’s schedule: don’t decide in advance, just watch. If the validation loss stops improving for a few epochs, cut the rate. It reacts to your actual training instead of a fixed timetable, which makes it a great low-effort default when you don’t know how long things will take.
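
The watch-and-cut logic is simple enough to sketch by hand. The class below is a stripped-down illustration, not PyTorch's implementation: the name PlateauReducer is hypothetical, and it leaves out refinements such as an improvement threshold or a cooldown period. It assumes a lower validation loss is better and cuts the rate once the loss has failed to improve for more than patience epochs in a row.

```python
class PlateauReducer:
    """Cut the learning rate when the validation loss stops improving."""

    def __init__(self, lr, factor=0.1, patience=3, min_lr=0.0):
        self.lr = lr
        self.factor = factor        # multiply the rate by this on each cut
        self.patience = patience    # non-improving epochs tolerated before cutting
        self.min_lr = min_lr        # never go below this rate
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the (possibly reduced) rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr

# The loss improves twice, then stalls; the rate is halved once patience runs out.
reducer = PlateauReducer(lr=0.1, factor=0.5, patience=2)
for val_loss in (1.0, 0.9, 0.9, 0.9, 0.9):
    print(val_loss, reducer.step(val_loss))
```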

In code

PyTorch bundles these as schedulers you step alongside the optimiser. Here’s the warmup-then-cosine combo that powers a lot of transformer training:

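import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# Placeholder model and data so the snippet runs on its own; swap in your own.
model = nn.Linear(16, 1)
train_loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(100)]

total_steps = 10_000    # optimiser steps planned for the whole run
warmup_steps = 500

optimizer = AdamW(model.parameters(), lr=3e-4)

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)   # ramp up
decay = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)      # ease down over the rest
scheduler = SequentialLR(optimizer, [warmup, decay], milestones=[warmup_steps])

for inputs, targets in train_loader:
    optimizer.zero_grad()                                   # clear gradients from the previous step
    loss = nn.functional.mse_loss(model(inputs), targets)   # forward pass
    loss.backward()
    optimizer.step()
    scheduler.step()        # update the LR every step, after the optimiser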

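For the reactive version, replace the three scheduler lines with a single ReduceLROnPlateau(optimizer), which takes its own arguments such as factor and patience rather than a list of schedulers, and call scheduler.step(val_loss) once per epoch instead of once per batch.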

The takeaway

Don’t ship a constant learning rate. Reach for cosine annealing as your default decay — one parameter, reliably good. If you’re training anything transformer-shaped, add a short linear warmup (a few hundred to a few thousand steps) before the decay; it’s not optional, it’s the thing standing between you and a diverged run. And if you genuinely don’t know how training will behave, ReduceLROnPlateau lets the loss curve make the call for you. High then low, explore then settle — let the schedule do what one frozen number never can.