Model Quantization and Compression, Explained
Abhay
A modern language model is a bit like a grand piano: magnificent, capable of extraordinary things, and an absolute nightmare to carry up the stairs. A 70-billion-parameter model stored at 16 bits per weight wants something like 140GB just to sit in memory (double that at full 32-bit precision) before it does a single useful thing. That’s fine if you rent racks of GPUs. It’s less fine if you want it on a laptop, a phone, or a cloud bill that doesn’t make your CFO weep.
Model compression is the art of getting most of the music out of a much smaller instrument. Three techniques do the heavy lifting: quantization, pruning, and knowledge distillation. Let’s unpack them.
Quantization: fewer bits, same vibe
Neural networks store their weights as numbers, and by default those numbers are 32-bit floating point (FP32) — luxuriously precise, and mostly wastefully so. Quantization asks a simple question: do we really need all that precision? Usually, no.
Quantization maps those high-precision values onto a smaller numeric type: FP16, INT8, or even 4-bit. The memory payoff is roughly linear in bit width: against an FP32 baseline, 8-bit gives you about a 4x reduction and 4-bit roughly 8x (halve those ratios if you start from a 16-bit checkpoint). Fewer bytes per weight also means less memory traffic, which is often what limits inference speed. The trick is keeping the model’s answers intact while throwing away the fine print.
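The arithmetic behind those ratios fits in a few lines. This is a back-of-the-envelope sketch, not a profiler: the helper name `weight_memory_gb` is made up for illustration, and it counts weights only, ignoring activations, the KV cache, and the small overhead of quantization scales.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9  # the 70-billion-parameter model from the introduction

# Weights-only footprint at each bit width
footprints = {bits: weight_memory_gb(PARAMS_70B, bits) for bits in (32, 16, 8, 4)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit: {gb:6.1f} GB")
```

For the 70B example this works out to 280, 140, 70 and 35 GB at 32, 16, 8 and 4 bits respectively.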
There are two main flavours:
- Post-training quantization (PTQ): take an already-trained model and compress it, no retraining required. INT8 PTQ is the workhorse: it usually cuts memory 4x relative to FP32 (2x relative to FP16) with minimal accuracy loss. Push naively to INT4 and accuracy can fall off a cliff, which is why smarter methods like GPTQ quantize layer-by-layer to 3–4 bits with little degradation.
- Quantization-aware training (QAT): simulate the low-bit noise during training so the model learns to cope. It typically yields better accuracy than PTQ, but it’s expensive — you need the training pipeline and data — so for giant LLMs it’s used sparingly.
Rule of thumb: reach for PTQ first. It’s a free lunch surprisingly often. Only break out QAT when you need aggressive low-bit precision and PTQ has visibly hurt quality.
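To make the "naive INT4 hurts" point concrete, here is a minimal sketch of the simplest possible PTQ scheme: symmetric round-to-nearest with a single scale for the whole tensor. It is an illustration under stated assumptions, not how any particular library does it. The weights are random toy values, the function names are invented for this example, and real toolkits typically use per-channel or per-group scales plus calibration data. GPTQ in particular is considerably more sophisticated and is not shown here.

```python
import random


def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


def mean_abs_error(weights, bits):
    """Average round-trip error after quantizing and dequantizing."""
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]  # toy weights

err_int8 = mean_abs_error(weights, 8)
err_int4 = mean_abs_error(weights, 4)
print(f"8-bit mean abs error: {err_int8:.6f}")
print(f"4-bit mean abs error: {err_int4:.6f}")
```

With only 15 usable levels instead of 255, the 4-bit round trip loses noticeably more information, and a single outlier weight stretches the scale for everyone else. That is the gap the smarter low-bit methods try to close.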
Pruning: trimming the dead weight
If quantization makes each number cheaper, pruning removes numbers entirely. Over-parameterized networks are full of weights doing approximately nothing, so magnitude-based pruning snips the smallest ones and sets them to zero.
There’s a catch worth tattooing on your monitor: smaller on disk does not automatically mean faster. Unstructured sparsity can shrink storage while doing nothing for speed, because standard dense kernels still multiply by every zero, and sparse formats scatter the surviving weights in ways that wreck memory access patterns. Structured pruning, which removes whole channels or attention heads, leaves a smaller dense model that plays nicer with real hardware. Always measure latency, not just parameter count.
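The difference between the two styles is easiest to see on a tiny matrix. The sketch below is a toy under stated assumptions: the function names and the matrix `W` are hypothetical, rows stand in for channels, and row L2 norm is just one common importance score among several.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured: zero the `sparsity` fraction of entries with smallest |w|.

    Ties at the threshold are all zeroed, so slightly more may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def prune_rows(weights, keep):
    """Structured: keep the `keep` rows with the largest L2 norm, drop the rest."""
    norms = [sum(w * w for w in row) ** 0.5 for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weights[i][:] for i in kept]


# Hypothetical 3x4 weight matrix; each row plays the role of one channel
W = [
    [0.90, -0.02, 0.40, 0.01],
    [0.03, 0.05, -0.04, 0.02],
    [-0.70, 0.60, 0.08, -0.50],
]

sparse_W = magnitude_prune(W, sparsity=0.5)  # same 3x4 shape, half the entries zero
small_W = prune_rows(W, keep=2)              # a genuinely smaller 2x4 matrix

print(sparse_W)
print(small_W)
```

Note that `sparse_W` still has the same shape as `W`, so a dense matrix multiply does exactly as much work as before. `small_W` is the one that shrinks the computation.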
Knowledge distillation: teach a smaller model
The third technique is the most charming. Instead of shrinking a model, you train a small student to imitate a big teacher. Crucially, the student learns from the teacher’s full probability distribution — the soft “I’m 70% sure it’s a cat, 20% a fox” signal — which carries far richer information than the raw label alone. It’s the difference between copying a friend’s exam answers and actually having them explain their reasoning. Distillation is exactly how many of today’s compact, capable small language models come into existence.
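Here is a minimal sketch of the loss a student might minimise, in the classic soft-target formulation: a KL term between temperature-softened teacher and student distributions, blended with ordinary cross-entropy on the true label. Everything specific is an assumption for illustration: the logits are invented, and the `temperature` and `alpha` values are arbitrary defaults rather than recommendations.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * soft-target KL (scaled by T^2) + (1 - alpha) * hard-label cross-entropy."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term


# Hypothetical logits over three classes: cat, fox, car (true label: cat = 0)
teacher = [2.0, 0.7, -1.0]
good_student = [1.8, 0.6, -0.9]  # roughly agrees with the teacher
poor_student = [0.1, 0.0, 2.0]   # confidently wrong

loss_good = distillation_loss(good_student, teacher, label=0)
loss_poor = distillation_loss(poor_student, teacher, label=0)
print(f"good student loss: {loss_good:.4f}")
print(f"poor student loss: {loss_poor:.4f}")
```

The soft term is what rewards the student for matching the teacher's "mostly cat, a bit fox" shape, not just for picking the right class.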
A practical taste
In practice you rarely hand-roll this. The bitsandbytes library, which is integrated into Hugging Face Transformers, quantizes on load in a couple of lines:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a 7B model in 4-bit instead of 16-bit: ~14GB -> ~4GB
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",         # 4-bit NormalFloat
    bnb_4bit_compute_dtype="float16",  # weights are stored in 4-bit, math runs in FP16
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=config,
    device_map="auto",  # let the library place layers on the available devices
)
For on-device CPU inference, the GGUF format (used by llama.cpp and friends) is the de facto standard — it ships pre-quantized weights at various bit-widths so a 7B model runs happily on a laptop with no GPU in sight.
The takeaway
Compression is a three-way trade between accuracy, size, and speed — you don’t get all three for free. A sane default playbook:
- Start with INT8 PTQ. Big memory win, negligible accuracy hit, almost no effort.
- Drop to 4-bit (NF4/GPTQ) when you need it small enough to fit, and test the accuracy.
- Prune structurally, and benchmark real latency — never trust the parameter count.
- Distill when you want a permanently smaller, faster model rather than a compressed copy.
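- Combine them. Prune, distill to recover the lost accuracy, then quantize for deployment: that is a well-trodden, effective pipeline. Quantize last, because any further full-precision training would knock the weights off their low-bit grid.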
The piano doesn’t have to come up the stairs. Sometimes a very good keyboard is exactly what the room needs.