Fine-Tuning LLMs with LoRA and PEFT

Abhay Abhay 4 min read

Full fine-tuning a large language model is a bit like repainting your entire house because you wanted the bathroom door to be a slightly nicer shade of blue. Technically it works. It is also gloriously wasteful. To nudge a 7-billion-parameter model into behaving a certain way, classic fine-tuning updates all seven billion of those parameters, keeps a full set of optimizer states for each one, and then asks you to store a complete new copy of the model when you are done. That is the kind of thing that turns a weekend experiment into a cloud bill with its own area code.

This is the problem PEFT — Parameter-Efficient Fine-Tuning — was invented to solve. And the technique that made PEFT a household name (well, in households that argue about GPUs) is LoRA.

Why full fine-tuning hurts

The pain isn’t just the parameters themselves. Training also needs gradients and optimizer state: with Adam, you’re carrying two extra numbers per weight (the first and second moment estimates). So a 7-billion-parameter model that takes about 14 GB in half precision can balloon past 100 GB of VRAM during training once you add gradients, optimizer moments, and activations. That’s multi-GPU territory for a task that might only need to teach the model your company’s tone of voice.
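To put rough numbers on that, here is a back-of-the-envelope estimate. It assumes one common mixed-precision Adam recipe (half-precision weights and gradients, a 32-bit master copy of the weights, and two 32-bit Adam moments) and ignores activations entirely, so treat the result as a floor rather than a measurement. The function names and the byte table are illustrative, not taken from any framework.

```python
# Rough training-memory estimate for full fine-tuning with Adam.
# Assumed recipe (an illustration, not a measurement): mixed precision with
# half-precision weights and gradients, an fp32 master copy of the weights,
# and two fp32 Adam moments per weight. Activations are NOT included.

BYTES_PER_PARAM = {
    "half_precision_weights": 2,
    "half_precision_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def full_finetune_gb(num_params: float) -> float:
    """Return the estimated training footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


def weights_only_gb(num_params: float) -> float:
    """Return the size of the half-precision weights alone in GB."""
    return num_params * BYTES_PER_PARAM["half_precision_weights"] / 1e9


params = 7e9  # the 7-billion-parameter example from the introduction
print(f"weights only: {weights_only_gb(params):.0f} GB")
print(f"full fine-tuning, before activations: {full_finetune_gb(params):.0f} GB")
```

Under these assumptions the training state is eight times the size of the half-precision weights, and activations come on top of that.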

And then there’s the deployment tax: every fine-tuned variant is a brand-new multi-gigabyte file. Ten use cases, ten full copies. Your object storage weeps.

The LoRA trick: freeze the giant, train the tiny

LoRA’s insight is delightfully cheeky. Instead of editing the model’s huge weight matrices directly, you freeze them completely and inject a small, trainable detour alongside each one.

The math is the elegant part. A weight update for a big matrix is itself a big matrix, but LoRA’s working hypothesis is that it tends to be low-rank, meaning it can be approximated by multiplying two skinny matrices together. So rather than learn a full update ΔW (say, 4096×4096 ≈ 16.8M numbers), LoRA learns two small matrices B (4096×8) and A (8×4096) whose product BA approximates it. That 8 is the rank r, and it shrinks the trainable count to about 65K (2 × 4096 × 8 = 65,536), a rounding error by comparison. At inference, the model computes its normal output plus this little low-rank correction.
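A minimal NumPy sketch of the idea, using the shapes from the example above. It follows the LoRA paper's naming, where the update is the product of B (4096×8) and A (8×4096). The variable names, the 0.01 initialization scale, and the random stand-in for the pretrained weight are illustrative assumptions, and the alpha/r scaling factor is left out to keep the focus on the shapes.

```python
import numpy as np

d_out, d_in, r = 4096, 4096, 8
rng = np.random.default_rng(0)

# Frozen pretrained weight (random stand-in): never updated during LoRA training.
W = rng.standard_normal((d_out, d_in)).astype(np.float32)

# Trainable low-rank factors. A starts random and B starts at zero, so the
# adapter contributes nothing before training begins.
A = (rng.standard_normal((r, d_in)) * 0.01).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)

full_update_params = d_out * d_in   # size of a dense delta-W
lora_params = A.size + B.size       # size of the two skinny factors


def lora_forward(x: np.ndarray) -> np.ndarray:
    """Normal output plus the low-rank correction, without ever forming B @ A."""
    return W @ x + B @ (A @ x)


x = rng.standard_normal(d_in).astype(np.float32)
y = lora_forward(x)

print(f"dense update: {full_update_params:,} numbers")
print(f"LoRA factors: {lora_params:,} numbers")
print(f"reduction: {full_update_params // lora_params}x")
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the extra work proportional to r, which is why the detour stays cheap.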

The base model never changes. You only ever train and save the adapter, which is often just a few megabytes and grows with the rank and the number of layers you adapt. Want a different behaviour? Train a different adapter. Same frozen giant underneath.

QLoRA: now do it on a laptop GPU

QLoRA takes the idea one decadent step further. It loads the frozen base model in 4-bit precision (via bitsandbytes), slashing the memory needed to hold the model, then trains LoRA adapters on top in higher precision. The clever bit is that quantization only affects the frozen weights you’re not updating anyway, so quality stays remarkably close to full fine-tuning. This is how people fine-tune sizeable models on a single consumer card instead of a server rack.
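To make the 4-bit idea concrete, here is a deliberately simplified sketch of blockwise quantization: each block of 64 weights shares one scale, and every weight is stored as one of 16 codes. This is a uniform absmax scheme chosen for readability. It is not the NF4 data type and not how bitsandbytes is implemented (which also packs two codes per byte), and the function names are hypothetical.

```python
import numpy as np


def quantize_4bit(weights: np.ndarray, block_size: int = 64):
    """Quantize a flat float array to 4-bit codes with one scale per block."""
    blocks = weights.reshape(-1, block_size)
    # One absmax scale per block keeps a single outlier from hurting the whole tensor.
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    # Map [-1, 1] onto the 16 integer codes 0..15.
    codes = np.round((blocks / scales + 1.0) * 7.5).astype(np.uint8)
    return codes, scales


def dequantize_4bit(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from codes and per-block scales."""
    return ((codes.astype(np.float32) / 7.5) - 1.0) * scales


rng = np.random.default_rng(0)
w = rng.standard_normal(4096 * 64).astype(np.float32)  # stand-in for frozen weights

codes, scales = quantize_4bit(w)
w_hat = dequantize_4bit(codes, scales).reshape(-1)

max_error = np.abs(w - w_hat).max()
print(f"codes use values {codes.min()}..{codes.max()}")
print(f"largest round-trip error: {max_error:.4f}")
```

The round-trip error is bounded by a fraction of each block's scale, and because these weights stay frozen, that error is a fixed offset the higher-precision adapter can learn around.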

The actual code

Hugging Face’s peft library (currently v0.19) makes this almost suspiciously short. Here’s a QLoRA sketch, meaning LoRA on top of a 4-bit base model:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit (the "Q" in QLoRA)
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16; the default is float32
)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", quantization_config=quant)
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=16,                      # rank: bigger = more capacity, more params
    lora_alpha=32,             # scaling for the adapter's contribution
    lora_dropout=0.05,
    target_modules="all-linear",  # let PEFT find the layers to adapt
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
# e.g. roughly 0.5% of parameters trainable with this config; that's the whole point

You then hand this to a normal Trainer or SFTTrainer and train as usual. The two knobs worth knowing: r (capacity — start at 8–16) and lora_alpha (how loudly the adapter speaks — a common rule is alpha ≈ 2×r).
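To see what those two knobs do, the arithmetic below counts LoRA parameters for a Llama-3.1-8B-shaped model at a few ranks. The layer shapes are assumptions read off the public model config, and the sketch assumes "all-linear" adapts the seven projections in each transformer block while skipping the output head. The helper names are hypothetical, not PEFT functions.

```python
# Count LoRA trainable parameters for a Llama-3.1-8B-shaped model.
# Assumed shapes: hidden size 4096, MLP size 14336, grouped-query KV width
# 1024, 32 transformer blocks. Adjust these for any other architecture.

LINEAR_SHAPES = {  # name: (in_features, out_features)
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_trainable_params(r: int) -> int:
    """Each adapted layer adds A (r x in_features) and B (out_features x r)."""
    per_block = sum(r * (fan_in + fan_out) for fan_in, fan_out in LINEAR_SHAPES.values())
    return per_block * NUM_LAYERS


def lora_scaling(r: int, lora_alpha: int) -> float:
    """Standard LoRA multiplies the adapter output by lora_alpha / r."""
    return lora_alpha / r


for r in (8, 16, 64):
    alpha = 2 * r  # the alpha = 2 x r rule of thumb
    print(f"r={r:>2}  trainable={lora_trainable_params(r):>11,}  scaling={lora_scaling(r, alpha)}")
```

With alpha held at 2×r the scaling stays constant while the trainable count grows linearly with r, which is why r is the capacity knob and alpha is the volume knob.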

A realistic workflow

  1. Prepare data. A few hundred to a few thousand high-quality examples in instruction/response format beats a giant pile of noise. Quality, not volume.
  2. Train the adapter. Run QLoRA on your dataset. Minutes-to-hours on one GPU, not days on a cluster.
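  3. Serve. Either keep the adapter separate and load it on top of the base model at runtime (great when you have many adapters), or call merge_and_unload() to fold the adapter back into the weights for a single, standalone model with no extra adapter computation at inference. If you trained with QLoRA, it is generally safer to reload the base model in half precision before merging, since merging into 4-bit weights introduces rounding error.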
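The two serving options in step 3 are mathematically the same thing, which a few lines of NumPy can illustrate. This is a toy sketch with small random matrices standing in for a trained adapter. It shows the algebra behind merging, not the internals of merge_and_unload(), and all names and sizes are illustrative.

```python
import numpy as np

d_out, d_in, r, lora_alpha = 64, 64, 8, 16
scaling = lora_alpha / r
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((r, d_in))      # stand-ins for trained adapter factors
B = rng.standard_normal((d_out, r))
x = rng.standard_normal(d_in)

# Option 1: keep the adapter separate and add its correction at runtime.
y_separate = W @ x + scaling * (B @ (A @ x))

# Option 2: fold the adapter into the weight once, then serve a plain matrix.
W_merged = W + scaling * (B @ A)
y_merged = W_merged @ x

print("outputs match:", np.allclose(y_separate, y_merged))
```

The merged matrix has the same shape as the original, so the served model needs no adapter code at all. The separate form keeps the base weights shareable across many adapters.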

When to fine-tune at all

Here’s the honest caveat: fine-tuning teaches a model how to behave — a style, a format, a skill. It is not how you give it fresh facts. For “answer questions about our constantly-changing docs,” reach for retrieval instead. I dug into that exact decision in RAG vs Fine-Tuning — read that before you train anything.

The takeaway

If you’ve decided fine-tuning is the right tool, default to QLoRA. Freeze the base model, load it in 4-bit, train a small LoRA adapter with r=16 and alpha=32, and ship the small adapter instead of a fresh copy of the whole model. You’ll get most of the quality of full fine-tuning for a tiny fraction of the memory, money, and storage, and you’ll never repaint the whole house to change one door again.
