
Optimizers Explained: SGD, Momentum, and Adam

Abhay · 4 min read

Gradient descent tells you which way is downhill. The optimizer decides how to actually take the step. That distinction sounds pedantic until your loss curve flatlines, oscillates like a heart monitor, or explodes to NaN — at which point the choice of optimizer stops being academic and starts being the difference between a working model and a wasted weekend.

So let’s skip the calculus (gradient descent has its own post) and talk about the variants you’ll actually pick from a dropdown. Same gradients go in; what comes out is a smarter, sometimes wildly faster, weight update.

Plain SGD: one cautious step at a time

Stochastic Gradient Descent is the honest baseline. For each batch, it computes the gradient and nudges every weight a little in the opposite direction:

new weight = old weight − learning rate × gradient

That’s it. No memory, no cleverness. SGD is the person who reads the map at every intersection and walks exactly where it points. The upside: it’s simple, well understood, and with a well-tuned learning rate it often generalizes very well. Many record-setting vision models were trained with little more than SGD (usually with momentum, covered next), a learning-rate schedule, and patience. The downside: on a ravine-shaped loss surface (steep along one axis, shallow along another) it ping-pongs across the steep walls while creeping painfully slowly toward the actual minimum.
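To make the ping-pong concrete, here is a minimal sketch of the update rule on a toy ravine. The loss 0.5 * (x² + 50y²), the starting point, and the learning rate of 0.035 are all assumptions chosen for illustration, and `sgd_step` and `ravine_grad` are hypothetical helper names rather than anything from a library.

```python
def sgd_step(weights, grads, lr):
    """Plain SGD: new weight = old weight - learning rate * gradient."""
    return [w - lr * g for w, g in zip(weights, grads)]


def ravine_grad(weights):
    """Gradient of the toy loss 0.5 * (x**2 + 50 * y**2).

    The surface is shallow along x and steep along y (illustrative assumption).
    """
    x, y = weights
    return [x, 50.0 * y]


weights = [10.0, 1.0]
history = [weights]
for _ in range(20):
    weights = sgd_step(weights, ravine_grad(weights), lr=0.035)
    history.append(weights)

# Print the first few steps: y flips sign every step, x barely moves.
for step, (x, y) in enumerate(history[:5]):
    print(f"step {step}: x={x:.3f}, y={y:+.3f}")
```

On this toy surface the steep coordinate overshoots and changes sign on every step, while the shallow coordinate has covered only about half the distance to the minimum after 20 steps.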

Momentum: rolling downhill instead of stepping

Momentum fixes the ping-pong by giving SGD a sense of inertia. Instead of reacting only to the current gradient, it keeps an exponentially decaying running average of past gradients, called the velocity, and moves along that instead.

Picture a ball rolling down the hill rather than a hiker pausing at every step. The ball builds speed in consistent directions and smooths over the side-to-side jitter, because oscillations that flip sign cancel out while the true downhill direction reinforces itself. In practice this means faster convergence and far less thrashing in those nasty ravines. The trade-off is one more knob — the momentum coefficient — though 0.9 is the value almost everyone uses and almost never regrets.
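A small sketch of the cancellation effect, assuming the "accumulate" form of momentum (velocity = beta × velocity + gradient), which is how PyTorch's SGD keeps its momentum buffer. Some write-ups scale the gradient by (1 − beta) instead, which only changes the effective learning rate. The gradient stream here is synthetic and `momentum_step` is a hypothetical helper name.

```python
def momentum_step(weights, velocity, grads, lr, beta=0.9):
    """Heavy-ball momentum: the velocity accumulates gradients, weights follow it."""
    new_velocity = [beta * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


weights = [0.0, 0.0]
velocity = [0.0, 0.0]
for step in range(50):
    flip = 1.0 if step % 2 == 0 else -1.0
    # Axis 0: a consistent slope. Axis 1: a gradient whose sign flips every step.
    grads = [1.0, flip]
    weights, velocity = momentum_step(weights, velocity, grads, lr=0.01)

print(f"velocity on the consistent axis:  {velocity[0]:.2f}")
print(f"velocity on the oscillating axis: {velocity[1]:.2f}")
```

With beta = 0.9 the velocity on the consistent axis climbs toward 1 / (1 − 0.9) = 10 times the raw gradient, while the flipping axis stays below the size of a single gradient.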

RMSProp: a custom learning rate per parameter

Momentum smooths direction. RMSProp tackles a different problem: not every parameter deserves the same step size. Some weights sit on steep cliffs and need tiny, careful moves; others sit on near-flat plains and could happily take giant strides.

RMSProp keeps an exponentially weighted average of each parameter’s squared gradients, then divides that parameter’s update by the square root of it (plus a tiny constant to avoid dividing by zero). Steep, high-gradient directions get scaled down; gentle, low-gradient directions get scaled up. Everyone ends up taking a sensibly sized step regardless of how dramatic their slice of the loss surface is. It’s the optimizer equivalent of adaptive cruise control.
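Here is a rough sketch of that per-parameter scaling. The decay rate `alpha=0.9`, the learning rate, and the two wildly different gradients are illustrative assumptions (library defaults differ), and `rmsprop_step` is a hypothetical name.

```python
import math


def rmsprop_step(weights, sq_avg, grads, lr=0.01, alpha=0.9, eps=1e-8):
    """One RMSProp update: divide each gradient by the root of its running mean square."""
    new_sq_avg = [alpha * s + (1 - alpha) * g * g for s, g in zip(sq_avg, grads)]
    new_weights = [
        w - lr * g / (math.sqrt(s) + eps)
        for w, g, s in zip(weights, grads, new_sq_avg)
    ]
    return new_weights, new_sq_avg


weights = [0.0, 0.0]
sq_avg = [0.0, 0.0]
# One parameter sits on a cliff (gradient 100), the other on a plain (gradient 0.01).
grads = [100.0, 0.01]
for _ in range(10):
    weights, sq_avg = rmsprop_step(weights, sq_avg, grads)

print(f"distance moved by the steep parameter: {abs(weights[0]):.4f}")
print(f"distance moved by the flat parameter:  {abs(weights[1]):.4f}")
```

Even though the raw gradients differ by a factor of 10,000, both parameters end up travelling almost exactly the same distance, because each gradient is normalized by its own recent magnitude.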

Adam: momentum and RMSProp move in together

Here’s the obvious move once you’ve seen both ideas: why not combine them? That’s Adam (Adaptive Moment Estimation). It tracks momentum (the first moment, a running mean of gradients) and the RMSProp term (the second moment, a running mean of squared gradients), then uses both to shape every update. You get a smooth direction, per-parameter step sizing, and a bias correction that compensates for both averages starting at zero, so the first few updates are not mis-scaled while the averages warm up.
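The pieces fit together roughly like this. This is a scalar sketch that assumes the commonly used hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8); `adam_step` is a hypothetical helper, and the step counter `t` is assumed to start at 1.

```python
import math


def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: running mean of squares
    m_hat = m / (1 - beta1 ** t)                # bias correction: both averages start at 0
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, m, v, grad=5.0, t=1)
print(f"weight after one step: {w:.6f}")
```

On the very first step the corrected moments equal the gradient and its square, so the update is almost exactly `lr` in size whatever the gradient's magnitude. That is part of why Adam tends to be forgiving about learning rates.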

The result behaves like a heavy ball with friction, and it often works remarkably well out of the box. Adam has become the de facto default for neural networks across vision, language, diffusion, and tabular deep learning, largely because it converges fast and is forgiving about learning rates, which is exactly what you want at 2 a.m.

AdamW: the version you should actually use

There’s a subtle catch. The classic way to fight overfitting is weight decay (gently shrinking weights toward zero). With Adam, weight decay is usually implemented as an L2 penalty added to the gradient, so the decay term gets divided by the same adaptive denominator as everything else. Weights with a history of large gradients end up decayed less than weights with small ones, and regularization quietly underperforms. AdamW decouples weight decay from the gradient update, applying it as a separate shrink step, and the payoff is typically better generalization. It’s why Adam with decoupled weight decay is the usual choice for transformer-scale models in the BERT and GPT families. If your framework offers it (PyTorch does), prefer it.

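import torch

# Placeholder model so the snippet runs on its own; substitute your own nn.Module
model = torch.nn.Linear(10, 1)

# A common starting point (tune lr and weight_decay for your task)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# The classic vision-style alternative
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)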
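To see what "decoupled" means in practice, here is a dependency-free scalar sketch that contrasts the two ways of applying weight decay. The function names are hypothetical, the hyperparameters are illustrative assumptions, and the data gradient is set to zero so that only the decay term acts on the weight.

```python
import math


def _adam_update(m, v, grad, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected Adam direction plus the updated moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(w, m, v, grad, t, lr=1e-3, wd=0.01):
    """Adam with an L2 penalty: the decay term is folded into the gradient,
    so it gets rescaled by the adaptive denominator."""
    direction, m, v = _adam_update(m, v, grad + wd * w, t)
    return w - lr * direction, m, v


def adamw_step(w, m, v, grad, t, lr=1e-3, wd=0.01):
    """AdamW: shrink the weight directly, then apply Adam to the raw gradient."""
    w = w - lr * wd * w
    direction, m, v = _adam_update(m, v, grad, t)
    return w - lr * direction, m, v


# Zero data gradient: any movement comes purely from weight decay.
w_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, grad=0.0, t=1)
w_decoupled, _, _ = adamw_step(1.0, 0.0, 0.0, grad=0.0, t=1)
print(f"Adam + L2: {w_l2:.6f}")
print(f"AdamW:     {w_decoupled:.6f}")
```

In this toy case the decoupled version shrinks the weight by `lr * wd * w`, exactly as requested. The L2 version moves it by roughly a full `lr`, because the adaptive normalization erases the size of the decay term. That mismatch between the decay you asked for and the decay you get is the entanglement AdamW removes.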

So which one do you pick?

A rule-of-thumb cheat sheet:

  • Just want it to train? Reach for AdamW, lr=3e-4, weight_decay=0.01. This is the boring, dependable starting point for most deep-learning tasks.
  • Training a transformer or a large language/diffusion model? AdamW, no debate.
  • Doing image classification and you have time to tune? SGD with momentum (0.9) plus a learning-rate schedule often squeezes out slightly better final accuracy.
  • Plain SGD with no momentum? Mostly a teaching tool now — keep the momentum.

The real takeaway: don’t agonize over this on day one. Start with AdamW and a learning rate near 3e-4, get a model training, and only reach for SGD-with-momentum if you’re chasing that last fraction of a percent on a vision task. The optimizer is a dial, not a destiny — but starting on the right setting saves you a lot of NaN-staring.


Sources: Ruder — An overview of gradient descent optimization algorithms, DigitalOcean — Momentum, RMSProp and Adam, GeeksforGeeks — Why AdamW beats Adam with L2 regularization.
