Model Quantization and Compression, Explained

A modern language model is a bit like a grand piano: magnificent, capable of extraordinary things, and an absolute nightmare to carry up the stairs. A 70-billion-parameter model wants something like 140GB just to sit in memory at 16-bit precision, and twice that at full 32-bit, before it does a single useful thing. That’s fine if you rent racks of GPUs. It’s less fine if you want it on a laptop, a phone, or a cloud bill that doesn’t make your CFO weep.

Model compression is the art of getting most of the music out of a much smaller instrument. Three techniques do the heavy lifting: quantization, pruning, and knowledge distillation. Let’s unpack them.

Quantization: fewer bits, same vibe

Neural networks store their weights as numbers. Traditionally those numbers are 32-bit floating point (FP32), luxuriously precise and mostly wastefully so, though most LLM checkpoints now ship in 16-bit (FP16 or BF16). Quantization asks a simple question: do we really need all that precision? Usually, no.

Quantization maps those high-precision values onto a smaller numeric type: FP16, INT8, or even 4-bit. The payoff is roughly linear in bit-width: measured against FP32, 8-bit gives you about a 4x memory reduction and 4-bit roughly 8x (halve those figures against a 16-bit baseline), while cutting memory bandwidth and often speeding up inference. The trick is keeping the model’s answers intact while throwing away the fine print.
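To make the arithmetic concrete, here is a back-of-the-envelope sketch. The function name `weight_memory_gb` is made up for illustration, and it counts weights only: real quantized formats also store scales and other metadata, and inference needs room for activations and the KV cache on top.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Compare the same 7B-parameter model at different bit-widths.
for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:>2}-bit: {weight_memory_gb(7e9, bits):5.1f} GB")
```

Treat the result as a lower bound on what you will actually need, not a promise.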

There are two main flavours:

Post-training quantization (PTQ): take an already-trained model and compress it, no retraining required. INT8 PTQ is the workhorse: it cuts memory about 4x versus FP32, usually with minimal accuracy loss. Push naively to INT4 and accuracy can fall off a cliff, which is why smarter methods like GPTQ quantize layer by layer, using a small calibration set to compensate for rounding error, and reach 3–4 bits with little degradation.
Quantization-aware training (QAT): simulate the low-bit noise during training so the model learns to cope. It typically yields better accuracy than PTQ, but it’s expensive — you need the training pipeline and data — so for giant LLMs it’s used sparingly.
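The simplest form of PTQ fits in a few lines. This is a minimal sketch of symmetric "absmax" INT8 quantization on a toy weight list, assuming one scale for the whole tensor; production libraries typically use per-channel or per-group scales and real tensor types. The helper names `quantize_int8` and `dequantize` are illustrative, not any library's API.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 if every weight is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers and the scale."""
    return [v * scale for v in q]


weights = [0.82, -0.41, 0.03, -1.27, 0.56]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The round-trip error per weight is bounded by half the scale, which is why one large outlier hurts: it stretches the scale for everyone. QAT, roughly speaking, runs this same quantize-then-dequantize step inside the training forward pass so the weights adapt to the rounding.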

Rule of thumb: reach for PTQ first. It’s a free lunch surprisingly often. Only break out QAT when you need aggressive low-bit precision and PTQ has visibly hurt quality.

Pruning: trimming the dead weight

If quantization makes each number cheaper, pruning removes numbers entirely. Over-parameterized networks are full of weights doing approximately nothing, so magnitude-based pruning snips the smallest ones and sets them to zero.
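As a rough sketch of the idea, the function below zeroes out a chosen fraction of the smallest-magnitude weights in a flat list. `magnitude_prune` and its `sparsity` parameter are names invented for this example; real frameworks apply masks to tensors and usually fine-tune afterwards.

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_prune = set(order[:num_to_prune])
    return [0.0 if i in to_prune else w for i, w in enumerate(weights)]


weights = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05, 0.3, -0.003]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

Note that the list is exactly as long as before: the zeros are still stored, and still multiplied, unless something downstream takes advantage of them.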

There’s a catch worth tattooing on your monitor: fewer parameters does not automatically mean faster. Unstructured sparsity can shrink storage (once the weights are saved in a sparse or compressed format) while doing little for speed, because dense kernels still multiply by every zero and sparse kernels pay for irregular memory access. Structured pruning, which removes whole channels or attention heads, leaves smaller dense matrices that real hardware handles well. Always measure latency, not just parameter count.
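For contrast, here is a small sketch of structured pruning on a toy weight matrix, where each row stands in for one output channel. It ranks rows by L2 norm, which is one common heuristic among several, and `prune_rows` is a hypothetical helper rather than a library function.

```python
import math


def prune_rows(matrix: list[list[float]], keep: int) -> list[list[float]]:
    """Keep the `keep` rows with the largest L2 norm, preserving their order."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore original row order
    return [matrix[i] for i in kept]


layer = [
    [0.8, -0.5, 0.3],
    [0.01, 0.02, -0.01],
    [-0.6, 0.7, 0.2],
    [0.03, -0.02, 0.0],
]
smaller = prune_rows(layer, keep=2)
print(len(layer), "rows ->", len(smaller), "rows")
```

The result is a genuinely smaller dense matrix, so an ordinary matrix multiply gets cheaper with no special kernels. In a real network the matching input columns of the next layer have to be removed too.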

Knowledge distillation: teach a smaller model

The third technique is the most charming. Instead of shrinking a model, you train a small student to imitate a big teacher. Crucially, the student learns from the teacher’s full probability distribution — the soft “I’m 70% sure it’s a cat, 20% a fox” signal — which carries far richer information than the raw label alone. It’s the difference between copying a friend’s exam answers and actually having them explain their reasoning. Distillation is exactly how many of today’s compact, capable small language models come into existence.
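The "soft signal" part can be sketched as a loss function. This is a minimal version of the classic recipe, assuming temperature-softened softmax outputs and a KL divergence scaled by the temperature squared; the logits are made-up numbers, and in practice this term is usually mixed with an ordinary cross-entropy loss on the true labels.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax with a temperature; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: list[float],
                      teacher_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for the classes [cat, fox, car].
teacher = [3.0, 1.8, -1.0]
student = [1.0, 1.2, 0.5]
print(distillation_loss(student, teacher))
```

The loss is zero only when the student reproduces the teacher's whole distribution, including how much weight it gives the "wrong" classes, and that is where the extra information lives.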

A practical taste

In practice you rarely hand-roll this. Hugging Face Transformers, via its bitsandbytes integration, quantizes on load in a few lines:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a 7B model in 4-bit instead of 16-bit: ~14GB -> ~4GB
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # 4-bit NormalFloat
    bnb_4bit_compute_dtype="float16",
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=config,
    device_map="auto",
)

For on-device CPU inference, the GGUF format (used by llama.cpp and friends) is the de facto standard — it ships pre-quantized weights at various bit-widths so a 7B model runs happily on a laptop with no GPU in sight.

The takeaway

Compression is a three-way trade between accuracy, size, and speed — you don’t get all three for free. A sane default playbook:

Start with INT8 PTQ. Big memory win, usually a small accuracy hit, almost no effort.
Drop to 4-bit (NF4/GPTQ) when you need it small enough to fit, and test the accuracy.
Prune structurally, and benchmark real latency — never trust the parameter count.
Distill when you want a permanently smaller, faster model rather than a compressed copy.
Combine them. The techniques stack; a common ordering is prune, then distill to recover accuracy, then quantize last, since quantization is the step most tied to the deployment hardware.

The piano doesn’t have to come up the stairs. Sometimes a very good keyboard is exactly what the room needs.