Knowledge Distillation, Explained

Picture a brilliant but exhausting professor: knows everything, answers every question with footnotes, and takes forty seconds to do it. Now picture a sharp teaching assistant who watched the professor for a semester and can give you 97% of the same answers in a fraction of the time. That swap, give or take, is knowledge distillation: a small “student” model learns to mimic a big “teacher” model, ending up dramatically faster while keeping most of the smarts.

It sounds like cheating. It mostly isn’t. Let’s unpack why it works.

The trick: don’t copy the answer, copy the hesitation

Normally you train a model on hard labels: this image is a dog (1), and definitely not a cat (0), and absolutely not a car (0). Clean, but lossy. The label “dog” throws away everything interesting the teacher knows.

A well-trained teacher, shown a photo of a golden retriever, doesn’t just scream “DOG.” Its raw output, once softened, looks more like:

Dog: 0.85   Cat: 0.12   Car: 0.03

Look at that. The teacher quietly admits a retriever is a bit cat-like (both are furry quadrupeds) and nothing at all like a car. That ranking of wrongness is the gold. Hinton and colleagues, who formalised this in 2015, called it dark knowledge: the relational structure hiding in a model’s confidence, not just its top pick.

The student trains on these soft labels instead of (or alongside) the hard ones. Instead of being told “the answer is dog,” it’s told “the answer is mostly dog, somewhat cat-adjacent, definitely not car.” That is a far richer lesson per example, which is why distilled students often learn from less data than training from scratch would demand.
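To make the "ranking of wrongness" concrete, here is a minimal sketch in plain Python. The two student predictions and the helper name cross_entropy are illustrative assumptions, not part of any library. The soft label is the teacher output quoted above.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) between two probability lists."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Class order: dog, cat, car
hard_label = [1.0, 0.0, 0.0]      # one-hot ground truth
soft_label = [0.85, 0.12, 0.03]   # the teacher's softened output from the text

# Two hypothetical students, equally sure about "dog" but disagreeing
# on which wrong answer is the more plausible one.
student_a = [0.80, 0.15, 0.05]    # cat above car, like the teacher
student_b = [0.80, 0.05, 0.15]    # car above cat

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(soft_label, student_a)
soft_b = cross_entropy(soft_label, student_b)

print(f"hard-label loss: A={hard_a:.3f}  B={hard_b:.3f}")
print(f"soft-label loss: A={soft_a:.3f}  B={soft_b:.3f}")
```

The hard label cannot tell the two students apart, because it only looks at the probability assigned to "dog". The soft label gives student A the lower loss, since A agrees with the teacher that a cat is the more forgivable mistake.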

Temperature: the dial that loosens the model’s tongue

There’s a catch. A confident teacher’s softmax is brutal: it might output Dog: 0.999, squashing those juicy cat/car distinctions to near zero. So we add a temperature parameter, T, that divides the logits before the softmax:

T = 1: normal, spiky output. The teacher mumbles.
T > 1 (say 3–5): the distribution flattens, lifting the tiny probabilities on the wrong classes until the differences between them become visible. The teacher opens up.
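A small sketch of the dial in action, using only the standard library. The logits are made-up values for (dog, cat, car), and softmax_with_temperature is a name chosen for this example rather than a library function.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for (dog, cat, car)
teacher_logits = [9.0, 2.0, 0.0]

for T in (1.0, 3.0, 5.0):
    probs = softmax_with_temperature(teacher_logits, T)
    print(f"T={T}: " + "  ".join(f"{p:.4f}" for p in probs))
```

As T grows, the winning class gives up probability to the others, but the ordering dog > cat > car never changes. Temperature reveals the ranking; it does not rewrite it.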

Crank the temperature and something like 0.999/0.0009/0.0001 relaxes into something like 0.85/0.12/0.03. Now the student can actually see the relationships. (Just remember to multiply the soft-target loss by T², because dividing the logits by T shrinks its gradients by roughly 1/T².)
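The T² correction is easier to believe with numbers. The sketch below assumes the soft loss is the KL divergence between the teacher and student distributions at temperature T; working through the softmax derivative, its gradient with respect to each student logit is (q_i - p_i) / T. The logits and function names are hypothetical.

```python
import math

def softmax(logits, T=1.0):
    """Temperature softmax, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_loss_gradient(student_logits, teacher_logits, T):
    """Gradient of KL(teacher_T || student_T) w.r.t. the student logits.

    p and q are the teacher and student distributions at temperature T;
    the derivative works out to (q_i - p_i) / T for each logit.
    """
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return [(qi - pi) / T for qi, pi in zip(q, p)]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

# Hypothetical logits for (dog, cat, car)
teacher_logits = [5.0, 2.0, -1.0]
student_logits = [2.0, 1.0, 0.1]

for T in (1.0, 2.0, 4.0, 8.0):
    g = norm(soft_loss_gradient(student_logits, teacher_logits, T))
    print(f"T={T}: |grad|={g:.4f}  |grad| * T^2={g * T * T:.4f}")
```

The raw gradient keeps shrinking as T rises, while the rescaled column settles to a roughly constant size at the higher temperatures. That is the point of the T² factor: it keeps the soft term from fading relative to the hard-label term when you change T.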

The loss function, in a few lines of PyTorch

The student optimises a weighted blend of two objectives: match the teacher’s soft predictions (so it inherits the dark knowledge), and still get the real labels right (so it stays honest). The first part is a KL divergence between soft distributions; the second is plain cross-entropy.

import torch.nn.functional as F  # assumes PyTorch; any framework with these ops works

def distillation_loss(student_logits, teacher_logits, true_labels, T=4.0, alpha=0.5):
    # Logits have shape (batch, classes); true_labels holds class indices.
    # Soft targets: KL divergence between temperature-softened distributions.
    # F.kl_div takes log-probabilities as input and probabilities as target.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T ** 2)

    # Hard targets: ordinary cross-entropy against the real answers (at T = 1)
    hard_loss = F.cross_entropy(student_logits, true_labels)

    # Blend the two: alpha weights the teacher, (1 - alpha) the ground truth
    return alpha * soft_loss + (1 - alpha) * hard_loss

alpha is just how much you trust the teacher versus the ground truth: at 1 the student learns only from soft targets, at 0 it is ordinary supervised training. The teacher’s weights stay frozen throughout; only the student learns.
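To see the blended loss and the frozen teacher working together, here is a deliberately tiny sketch in plain Python. It is a toy under loud assumptions: the "student" is just a vector of three logits for a single example rather than a network, the teacher logits are made up, and distill_step is a hypothetical helper that applies hand-derived gradients of the loss above.

```python
import math

def softmax(logits, T=1.0):
    """Temperature softmax, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_step(student_logits, teacher_logits, label, T=4.0, alpha=0.5, lr=0.5):
    """One gradient-descent step on the blended loss for a single example."""
    p_soft = softmax(teacher_logits, T)   # teacher, softened (only ever read)
    q_soft = softmax(student_logits, T)   # student, softened
    q_hard = softmax(student_logits)      # student at T = 1
    new_logits = []
    for i, z in enumerate(student_logits):
        y = 1.0 if i == label else 0.0
        # Soft term: d(T^2 * KL)/dz_i = T * (q_soft_i - p_soft_i)
        # Hard term: d(cross-entropy)/dz_i = q_hard_i - y_i
        grad = alpha * T * (q_soft[i] - p_soft[i]) + (1 - alpha) * (q_hard[i] - y)
        new_logits.append(z - lr * grad)
    return new_logits

# Class order: dog, cat, car. All values are hypothetical.
teacher_logits = [5.0, 2.0, -1.0]   # frozen: never modified below
student_logits = [0.0, 0.0, 0.0]    # the student starts with no opinion

for _ in range(200):
    student_logits = distill_step(student_logits, teacher_logits, label=0)

print([round(p, 3) for p in softmax(student_logits)])
```

The student ends up picking "dog", as the hard label demands, but it also ranks cat above car. The one-hot label never said anything about that ordering; it came entirely from the teacher's soft targets.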

Why bother? Smaller, faster, on-device

Distillation is the quiet workhorse behind models that fit where the big ones can’t:

DistilBERT, the poster child, is about 40% smaller and 60% faster than BERT while retaining roughly 97% of its language-understanding performance. That’s a lot of accuracy to keep for about 60% of the footprint.
On-device inference: a distilled model can run on a phone or laptop without a round trip to a data centre, which means lower latency, lower cost, and data that never leaves the device.
Serving at scale: when you’re handling millions of requests, a 2x speedup can roughly halve the compute bill.

This is the same instinct behind the rise of compact models generally. If the idea of “smaller is often smarter” interests you, I dug into it in Small Language Models: Why Smaller Is Often Smarter — distillation is one of the main techniques that makes those tiny models punch above their weight.

The takeaway

When you have a model that’s accurate but too heavy to ship, reach for distillation before you reach for “just buy a bigger GPU.” Concretely:

Train or grab a strong teacher — accuracy first, speed second.
Define a smaller student — fewer layers, fewer parameters.
Distil with soft labels, blending KL-divergence-on-soft-targets with ordinary cross-entropy, and tune T (start around 3–5) and alpha (start near 0.5).
Measure the trade-off — if you keep ~95%+ of the accuracy for half the size, ship the student.
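The last step is simple arithmetic, sketched below. The function names and the accuracy and parameter counts are hypothetical placeholders; the 95% and half-size thresholds are the rule of thumb from the list above, not a universal standard.

```python
def distillation_tradeoff(teacher_acc, student_acc, teacher_params, student_params):
    """Return (fraction of accuracy retained, student size as a fraction of teacher)."""
    retained = student_acc / teacher_acc
    size_ratio = student_params / teacher_params
    return retained, size_ratio

def worth_shipping(retained, size_ratio, min_retained=0.95, max_size_ratio=0.5):
    """Apply the rule of thumb: keep ~95%+ of the accuracy at half the size or less."""
    return retained >= min_retained and size_ratio <= max_size_ratio

# Hypothetical measurements from your own evaluation set
retained, size_ratio = distillation_tradeoff(
    teacher_acc=0.91, student_acc=0.88,
    teacher_params=100_000_000, student_params=45_000_000,
)

print(f"retained {retained:.1%} of accuracy at {size_ratio:.0%} of the size")
print("ship the student" if worth_shipping(retained, size_ratio) else "keep tuning")
```

Swap in whatever metric and cost measure matter for your deployment; latency or memory may be a better denominator than parameter count.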

The professor was never going to fit in your pocket. The teaching assistant just might.