Backpropagation, Explained (Without the Scary Math)

A neural network is, at heart, a very confident guesser that is wrong a lot at first. You show it a photo of a cat, it shouts “dog!” with great conviction, and somehow it has to learn from that humiliation. The mechanism that turns “you were wrong” into “here’s exactly how each of your million knobs should change” is backpropagation — the algorithm that, more than any other, is why deep learning works at all. Popularised by a famous 1986 paper from Rumelhart, Hinton, and Williams, it’s really just one clever idea about assigning blame.

First, the forward pass

Before a network can be corrected, it has to commit to a guess. That’s the forward pass: data goes in one end, gets multiplied by weights, squished through some nonlinear functions, and out pops a prediction. Think of it as a row of dominoes — input topples the first layer, which topples the next, all the way to the final answer.
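To make the dominoes concrete, here is a minimal sketch of a forward pass in plain Python. The two-step shape, the choice of tanh as the "squish", and every number in it are assumptions picked for illustration, not anything a real network would use as-is.

```python
import math

def forward(x, w1, b1, w2, b2):
    """Two-step forward pass: weighted sum, nonlinearity, weighted sum."""
    h = math.tanh(w1 * x + b1)   # hidden value after the nonlinear "squish"
    return w2 * h + b2           # the final prediction

# Hypothetical weights, chosen only so the arithmetic is easy to follow.
prediction = forward(x=1.0, w1=0.5, b1=0.0, w2=2.0, b2=0.1)
print(prediction)                # roughly 1.024
```

Nothing has been learned yet at this point. The network has only committed to a guess.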

Then we measure the damage with a loss function: a single number for how wrong the guess was. Predicted “dog” with 90% confidence on a cat photo? That’s a big, embarrassing loss. The entire goal of training is to make that number small.
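As a rough illustration, here is one common loss, cross-entropy, applied to the cat photo. The article does not say which loss it has in mind, so treat the choice of cross-entropy and the two-class setup as assumptions.

```python
import math

def cross_entropy(prob_of_true_class):
    """Loss for one example: small when the true class gets high probability."""
    return -math.log(prob_of_true_class)

# Assuming only two classes, 90% on "dog" leaves 10% for the true class, "cat".
loss_confident_wrong = cross_entropy(0.10)
loss_confident_right = cross_entropy(0.90)

print(loss_confident_wrong)   # about 2.30
print(loss_confident_right)   # about 0.105
```

The confidently wrong guess costs roughly twenty times more than the confidently right one, which is the "big, embarrassing loss" in number form.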

The real question: who’s to blame?

Here’s the catch. The loss is one number at the end of the network, but the blame belongs to thousands of weights scattered throughout it. If the answer was wrong, which knobs caused it, and in which direction should each one turn?

That’s a credit-assignment problem, and backpropagation solves it with one tool from first-year calculus: the chain rule. The intuition, minus the symbols: if A affects B, and B affects C, then A’s influence on C is just A’s effect on B multiplied by B’s effect on C. Influence flows along the chain by multiplication.
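The multiplication claim is easy to check numerically. In this small sketch the two functions are made up for the occasion (B is three times A, C is B squared). The chain-rule answer is then compared against simply nudging A and watching C move.

```python
def b_from_a(a):
    return 3.0 * a              # B moves 3 units for every unit of A

def c_from_b(b):
    return b ** 2               # C's sensitivity to B is 2 * b

a = 2.0
b = b_from_a(a)                 # 6.0

effect_a_on_b = 3.0
effect_b_on_c = 2.0 * b         # 12.0
chain_rule = effect_a_on_b * effect_b_on_c   # 36.0

# Sanity check: nudge A a tiny bit each way and measure how C responds.
eps = 1e-6
measured = (c_from_b(b_from_a(a + eps)) - c_from_b(b_from_a(a - eps))) / (2 * eps)

print(chain_rule)               # 36.0
print(measured)                 # very close to 36
```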

A neural network is exactly such a chain — a long composition of functions. So to find how a weight buried in the first layer affected the final loss, you multiply together the local effects of every step between them. Do that for every weight and you’ve computed the gradient: the full list of “nudge this knob up or down, by this much, to reduce the loss.”

Gradients flow backward

The genius of backprop is the order it does this in. The naive approach would recompute that long chain of multiplications separately for every single weight, which is wildly redundant because weights in nearby layers share most of their chain. Instead, backprop works backward from the loss, computing each layer’s contribution once and reusing it for every layer that comes earlier in the network.

Picture toppling the dominoes in reverse. The error at the output gets handed back to the last layer: “you contributed this much.” That layer adjusts the blame and passes a share to the layer before it, which does the same, and so on, until every weight has received its personal slice of responsibility. One backward sweep, gradients for the whole network. That reuse is the difference between training being feasible and taking until the heat death of the universe.
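Here is a hedged sketch of that backward sweep for a toy chain of one-weight layers. The tanh layers, the squared-error loss, and all the numbers are assumptions for illustration. The point to watch is the `upstream` variable: it is computed once per layer and handed back, rather than rebuilt from scratch for each weight.

```python
import math

def forward(x, weights):
    """Run the chain and keep every intermediate value for the backward sweep."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations

def loss_fn(x, weights, target):
    return (forward(x, weights)[-1] - target) ** 2

def backward(activations, weights, target):
    """One backward sweep: each layer's blame is computed once and passed on."""
    grads = [0.0] * len(weights)
    upstream = 2.0 * (activations[-1] - target)      # d(loss)/d(output)
    for k in reversed(range(len(weights))):
        h_in, h_out = activations[k], activations[k + 1]
        local = 1.0 - h_out ** 2                     # slope of tanh at this layer
        grads[k] = upstream * local * h_in           # this weight's share of blame
        upstream = upstream * local * weights[k]     # blame handed to the layer before
    return grads

weights = [0.8, -1.2, 0.5]      # hypothetical values
x, target = 1.0, 0.3
grads = backward(forward(x, weights), weights, target)
print(grads)                    # one gradient per weight, from a single sweep
```

Each pass through the loop costs the same small amount of work no matter how deep the layer sits, which is where the savings over the naive approach come from.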

Then the optimizer takes over

Backprop’s job ends at computing the gradients. It doesn’t actually change anything; it just hands the optimizer a treasure map saying “downhill is that way.” The optimizer (plain SGD, or more often Adam) takes each weight and nudges it a small step against its gradient, with the size of the step set by the learning rate. Backprop is the diagnosis; the optimizer is the treatment. Run forward pass → loss → backward pass → update a few million times, and the confident guesser slowly stops being wrong.
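The whole loop fits in a few lines for a model small enough to differentiate by hand. This is a minimal sketch of plain gradient descent on a hypothetical one-line model, `y = w1 * x + w2`. The starting weights, the target, and the learning rate of 0.01 are all assumptions chosen so the loss visibly shrinks.

```python
def loss_and_grads(w1, w2, x, target):
    y = w1 * x + w2                 # forward pass
    error = y - target
    loss = error ** 2               # squared-error loss
    # backward pass, written out by hand with the chain rule
    return loss, 2.0 * error * x, 2.0 * error

w1, w2 = 2.0, 3.0                   # hypothetical starting weights
x, target = 4.0, 20.0
learning_rate = 0.01                # assumed step size

losses = []
for step in range(20):
    loss, grad_w1, grad_w2 = loss_and_grads(w1, w2, x, target)
    losses.append(loss)
    w1 -= learning_rate * grad_w1   # step against the gradient
    w2 -= learning_rate * grad_w2

print(losses[0])                    # 81.0
print(losses[-1])                   # far smaller than where it started
```

Adam and friends change how the step is sized and smoothed, but the division of labour stays the same: gradients come in, nudged weights go out.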

You’ll never write this by hand

Here’s the genuinely good news: in real life you never compute these chains yourself. Modern frameworks do automatic differentiation (autodiff): they record every operation in your forward pass into a graph, then walk it backward to get gradients that are exact up to floating-point rounding, without you writing a single derivative. In PyTorch it’s a single method call:

import torch

# two weights we want gradients for
w1 = torch.tensor(2.0, requires_grad=True)
w2 = torch.tensor(3.0, requires_grad=True)

x = torch.tensor(4.0)         # input
y = w1 * x + w2               # forward pass: a tiny "network" (y = 11)
loss = (y - 20.0) ** 2        # how wrong are we? (target = 20, so loss = 81)

loss.backward()               # backprop: gradients flow backward

print(w1.grad)                # d(loss)/d(w1) = 2 * (y - 20) * x  ->  tensor(-72.)
print(w2.grad)                # d(loss)/d(w2) = 2 * (y - 20)      ->  tensor(-18.)

That loss.backward() call is backpropagation. PyTorch walked the computation graph in reverse, applied the chain rule at each node, and stashed the gradient on every weight that asked for one (requires_grad=True). The negative signs tell the optimizer to increase both weights to shrink the loss. Swap the toy expression for a hundred-layer transformer and nothing about the principle changes — only the size of the graph.
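If you want to convince yourself there is no magic involved, the same two gradients can be reproduced by hand in plain Python. This sketch just spells out the chain rule for the toy expression above; the variable names are mine, not PyTorch's.

```python
w1, w2, x, target = 2.0, 3.0, 4.0, 20.0

y = w1 * x + w2                 # 11.0
loss = (y - target) ** 2        # 81.0

dloss_dy = 2.0 * (y - target)   # -18.0: how the loss reacts to y
dy_dw1 = x                      # 4.0:   how y reacts to w1
dy_dw2 = 1.0                    #        how y reacts to w2

grad_w1 = dloss_dy * dy_dw1     # -72.0
grad_w2 = dloss_dy * dy_dw2     # -18.0

print(grad_w1, grad_w2)         # -72.0 -18.0
```

Note that `dloss_dy` is computed once and used twice, which is the same reuse trick backprop relies on at scale.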

The takeaway

Backpropagation is the chain rule with a smart, backward-flowing schedule that reuses work, so a single error signal at the output becomes a precise correction for every weight at once. Lock in this four-beat loop and you understand how all of modern deep learning trains: forward pass → loss → backward pass (backprop computes gradients) → optimizer updates the weights. When training misbehaves and gradients shrink to nothing or blow up (the classic vanishing/exploding gradient problem), you now know which step to look at: the backward pass, where those tiny or huge local effects get multiplied together down the chain. And the next time you call loss.backward(), give a small nod to the dominoes falling in reverse on your behalf.
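For a feel of how quickly that multiplication gets out of hand, here is a deliberately simplified sketch. It pretends every layer contributes the same local factor and ignores the weights entirely, which real networks do not do. The 0.25 is the steepest slope a sigmoid can have; the 1.5 and both depths are arbitrary assumptions.

```python
def product_of_factors(factor, depth):
    """Multiply the same local effect together once per layer."""
    signal = 1.0
    for _ in range(depth):
        signal *= factor
    return signal

vanishing = product_of_factors(0.25, 20)   # 0.25 = maximum slope of a sigmoid
exploding = product_of_factors(1.5, 50)    # a factor only slightly above 1

print(vanishing)   # about 9.1e-13: the first layers barely hear the error
print(exploding)   # hundreds of millions: the update overshoots wildly
```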