Gradient Descent: How Machine Learning Models Learn

Strip away the hype and almost every machine learning model — from a humble linear regression to a 70-billion-parameter language model — learns the same way: it makes a guess, measures how wrong it is, and nudges itself in a slightly less-wrong direction. Repeat a few million times and you have intelligence. That nudging process is gradient descent, and it is the single most important algorithm you’ve probably never been formally introduced to.

Loss as a landscape

Start with the idea of loss: a single number that says how badly the model is doing. Predictions spot-on? Loss near zero. Predictions wildly off? Loss is huge. Training is just the hunt for the model settings (the “weights”) that make this number as small as possible.
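To put a number on "how badly the model is doing", here is a minimal sketch using mean squared error, one common choice of loss. It assumes a toy one-weight model (prediction = weight * x) and made-up data; the names mse_loss, xs and ys are illustrative and not taken from any library.

```python
def mse_loss(weight, xs, ys):
    """Mean squared error of the toy model `prediction = weight * x`."""
    errors = [(weight * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# Made-up data that follows the rule y = 3 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

for weight in (0.0, 2.0, 3.0):
    print(f"weight = {weight}: loss = {mse_loss(weight, xs, ys):.2f}")
```

The loss falls from 67.50 to 7.50 to 0.00 as the weight approaches 3, the value that generated the data.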

Here’s the useful mental picture. Imagine every possible combination of weights laid out as a vast, hilly landscape. The height at any point is the loss. Peaks are terrible models; valleys are good ones. Your job is to find the lowest valley — and you’re doing it blindfolded, in fog, able to feel only the slope directly under your feet.

So you do the obvious thing: feel which way is downhill, take a step that way, and repeat. That’s it. That’s the whole algorithm. The “gradient” is just the mathematical word for the slope under your feet: strictly speaking it points in the steepest uphill direction and tells you how steep it is, so “descent” is the bit where you walk the opposite way.
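To make "feeling the slope" concrete, here is a rough sketch that estimates the slope by nudging each weight a tiny amount and watching how the loss changes. The two-weight loss and the helper name numerical_gradient are assumptions made up for illustration; real frameworks typically compute gradients with automatic differentiation rather than this way.

```python
def loss(w):
    """Illustrative bowl-shaped loss with its minimum at w = [1, -2]."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def numerical_gradient(fn, w, eps=1e-6):
    """Estimate the slope along each weight with a central difference."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad

w = [4.0, 0.0]
grad = numerical_gradient(loss, w)

# The gradient points uphill, so the downhill step subtracts it.
step_size = 0.1
new_w = [wi - step_size * gi for wi, gi in zip(w, grad)]

print("gradient:", [round(g, 3) for g in grad])
print("loss before:", loss(w), "loss after:", round(loss(new_w), 3))
```

One step is enough to see the loss drop; repeating it is the whole algorithm.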

The size of your steps

The one knob that decides whether this works is the learning rate — how big a step you take each time.

Too small, and you inch down the mountain one timid centimetre per step. You’ll get there eventually, but training takes forever and your electricity bill files a complaint. Too big, and you take such enormous leaps that you overshoot the valley entirely, bounce up the opposite slope, overshoot that, and ping-pong around forever — or rocket off to infinity and produce the dreaded NaN. The Goldilocks zone is real, and finding it is half the art of training models.

Batch, stochastic, or mini-batch?

To know which way is downhill, you need to check your data. The question is how much of it per step.

Batch gradient descent uses the entire dataset for every single step. The direction is beautifully accurate, but if you have ten million examples, every step is agonisingly slow.
Stochastic gradient descent (SGD) goes to the other extreme: one random example per step. Each step is lightning-fast but jittery — like navigating downhill while being gently shoved. Counter-intuitively, that noise is a feature, not a bug: the random jostling can knock you out of shallow dead-ends.
Mini-batch gradient descent is the sensible compromise everyone actually uses: a small handful of examples (typically 32 to 256) per step. Fast enough to be practical, accurate enough to be stable. It’s the porridge that’s just right.
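The three flavours differ only in how many examples feed each step, which the sketch below tries to show with a single batch_size parameter. Everything here is an assumption for illustration: the made-up dataset follows y = 2x + 1 exactly, the model is y = w * x + b, and the helper names gradient and train are hypothetical.

```python
import random

random.seed(0)

# Made-up dataset following y = 2x + 1 exactly (no noise, for clarity).
data = [(i / 100, 2 * (i / 100) + 1) for i in range(200)]

def gradient(w, b, batch):
    """Gradient of mean squared error for the model y = w * x + b."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

def train(batch_size, epochs=200, learning_rate=0.1):
    """Gradient descent where each step sees `batch_size` examples."""
    samples = list(data)              # copy so shuffling leaves `data` alone
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            grad_w, grad_b = gradient(w, b, batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

print("batch      :", train(batch_size=len(data)))  # one step per pass over the data
print("stochastic :", train(batch_size=1))          # one example per step
print("mini-batch :", train(batch_size=32))
```

Because this toy data has no noise, all three end up close to w = 2 and b = 1. What changes is how many updates each pass over the data buys you: 1 for batch, 200 for stochastic, 7 for mini-batches of 32.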

Local minima and saddle points

The naive worry is getting stuck in a local minimum: a small dip that isn’t the true lowest point, like settling for a roadside ditch when the Grand Canyon is one ridge over. In practice, for the enormous models of today, the bigger nuisance is often the saddle point: a spot that slopes down in some directions and up in others, like the seat of a saddle. The ground there is nearly flat, so the gradient shrinks toward zero and your descent stalls, twiddling its thumbs while you wonder why the loss stopped dropping.
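A stall like that is easy to reproduce on a hand-made surface. The sketch below assumes an illustrative two-weight loss, x² + y⁴/4 − y²/2, chosen because it has a saddle at (0, 0) and true minima at y = ±1; the function names are hypothetical.

```python
def loss(x, y):
    """Illustrative surface: saddle at (0, 0), minima at (0, 1) and (0, -1)."""
    return x ** 2 + y ** 4 / 4 - y ** 2 / 2

def gradient(x, y):
    return 2 * x, y ** 3 - y

def descend(x, y, steps, learning_rate=0.1):
    """Plain gradient descent; also records the loss before every step."""
    history = []
    for _ in range(steps):
        history.append(loss(x, y))
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    history.append(loss(x, y))
    return x, y, history

# Starting exactly on the line y = 0, descent slides into the saddle and stays there.
print("on the ridge:", descend(1.0, 0.0, steps=300)[:2])

# A tiny nudge off that line does escape, but only after a long plateau.
x, y, history = descend(1.0, 1e-6, steps=300)
for step in (0, 50, 100, 150, 200, 300):
    print(f"step {step:3d}: loss = {history[step]:+.6f}")
```

The second run sits at a loss of roughly zero for dozens of steps before it finally drops toward -0.25, which is the "why did the loss stop dropping?" moment in miniature.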

Giving descent some momentum

This is where the fancier optimisers earn their keep. Momentum lets your steps accumulate speed in a consistent direction, like a ball rolling downhill: it powers through flat saddle points and dampens the side-to-side bouncing in narrow ravines. Adam, a very common default optimiser in deep learning projects today, goes further: it blends momentum with a per-weight adaptive learning rate, automatically taking relatively bigger steps for weights whose recent gradients have been small and smaller ones for weights whose gradients have been large or erratic.
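For the curious, here is a minimal sketch of both update rules on a made-up "ravine", f(x, y) = 0.5 * (x² + 25y²), which is gentle along x and steep along y. The surface, starting point and learning rates are assumptions picked for illustration, and the Adam function follows the standard published update with its usual default constants; it is not a substitute for a library optimiser, and the printed numbers are not a benchmark.

```python
import math

def loss(x, y):
    return 0.5 * (x ** 2 + 25 * y ** 2)

def gradient(x, y):
    """Gradient of the illustrative ravine above."""
    return x, 25 * y

def plain_gd(steps, learning_rate=0.01):
    x, y = 10.0, 1.0
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return loss(x, y)

def momentum_gd(steps, learning_rate=0.01, beta=0.9):
    x, y = 10.0, 1.0
    vx, vy = 0.0, 0.0                  # accumulated "velocity"
    for _ in range(steps):
        gx, gy = gradient(x, y)
        vx = beta * vx + gx            # keep most of the old velocity, add the new slope
        vy = beta * vy + gy
        x -= learning_rate * vx
        y -= learning_rate * vy
    return loss(x, y)

def adam(steps, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    params = [10.0, 1.0]
    m = [0.0, 0.0]                     # running mean of gradients (the momentum part)
    v = [0.0, 0.0]                     # running mean of squared gradients (the scale part)
    for t in range(1, steps + 1):
        grads = gradient(*params)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)   # bias correction for starting at zero
            v_hat = v[i] / (1 - beta2 ** t)
            # Each weight gets its own effective step size.
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return loss(*params)

print("start loss   :", loss(10.0, 1.0))
print("plain GD     :", plain_gd(100))
print("with momentum:", momentum_gd(100))
print("Adam         :", adam(100))
```

With these particular settings, momentum gets much further along the gentle x direction than plain descent in the same 100 steps. Treat the comparison as a feel for the mechanics rather than a verdict on which optimiser is best.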

A tiny gradient-descent loop

Enough metaphor. Here’s the entire idea in a few lines of Python, minimising the simplest possible landscape, f(x) = x², whose lowest point is at x = 0. The derivative (our gradient) is 2x:

def f(x):
    return x ** 2

def gradient(x):
    return 2 * x          # slope of x² at point x

x = 10.0                  # start far from the minimum
learning_rate = 0.1

for step in range(25):
    grad = gradient(x)
    x = x - learning_rate * grad   # step downhill
    print(f"step {step:2d}: x = {x:.4f}, loss = {f(x):.4f}")

Run it and watch x march from 10 toward 0, the loss shrinking each step. Bump learning_rate up to 1.1 and you’ll see the overshoot disaster live: x flips sign and grows with every step instead of settling. That short loop is, in spirit, exactly what’s happening inside a neural network, just with millions of x’s and a landscape no human could ever picture.
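If you would rather compare learning rates side by side than edit the script by hand, a small wrapper does the job. This is a sketch under the same assumptions as the loop above (f(x) = x², gradient 2x); the function name descend and the three rates are illustrative. On this particular landscape the update simplifies to x = (1 - 2 * learning_rate) * x, so x shrinks only when that factor lies strictly between -1 and 1.

```python
def descend(learning_rate, steps=20, start=10.0):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * (2 * x)   # same update as the loop above
    return x

for lr in (0.001, 0.1, 1.1):
    # On this landscape each step multiplies x by (1 - 2 * lr).
    factor = 1 - 2 * lr
    print(f"lr = {lr}: per-step factor = {factor:+.3f}, "
          f"x after 20 steps = {descend(lr):.4g}")
```

The tiny rate has barely moved after 20 steps, the middle one is close to zero, and 1.1 has a factor of -1.2, which is why x bounces from side to side while growing.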

The takeaway

Gradient descent is “feel for downhill, take a step, repeat.” When you train a model and the loss won’t budge or blows up, your first suspect should almost always be the learning rate — halve it if things explode, raise it if progress crawls. Reach for mini-batches (start at 32) for speed without chaos, and let Adam handle the fiddly per-weight tuning so you don’t have to. Master that one knob and that one default, and you understand the engine room of essentially all of modern machine learning.
