RLHF: How LLMs Learn From Human Feedback
Abhay
A freshly pretrained language model is a bit like a brilliant intern who has read the entire internet and learned exactly one skill: predicting the next word. Ask it a question and it might answer you, or it might continue your question with three more questions, or helpfully complete your sentence about building a pipe bomb. It has absorbed humanity’s knowledge and humanity’s worst comment sections, with no opinion about which parts you actually wanted. Pretraining gives a model competence. It does not give it judgement.
That gap, between “can produce plausible text” and “produces the text you asked for, without being a menace”, is the alignment problem. And the technique that turned raw GPT-style models into assistants people actually talk to is Reinforcement Learning from Human Feedback, or RLHF. Here is how it works, what it buys you, and where it quietly breaks.
Why pretraining isn’t enough
Pretraining optimises a single objective: minimise next-token prediction error over a giant corpus. That objective rewards imitation, not helpfulness. The model learns the distribution of text on the internet, which means it’s equally happy mimicking a Nobel laureate or a conspiracy forum, because both are “valid” continuations.
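To make that objective concrete, here is a minimal sketch of next-token loss as average negative log-likelihood. The function name and the probabilities are made up for illustration; a real training run computes this over billions of tokens with a tensor library.

```python
import math

def next_token_loss(token_probs):
    """Average negative log-likelihood of the tokens that actually came next.

    token_probs holds the probability the model assigned to each true
    next token (hypothetical numbers, not real model output).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A model that put high probability on the real continuation...
confident = next_token_loss([0.9, 0.8, 0.95])
# ...versus one that was mostly surprised by it.
unsure = next_token_loss([0.2, 0.1, 0.3])

print(confident < unsure)  # lower loss means better imitation of the corpus
```

Notice that nothing in the loss asks whether the continuation was true, kind, or useful. It only asks whether it was likely.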
What we actually want is a model that’s helpful, honest, and harmless, the famous “three Hs”. The trouble is, those qualities are nearly impossible to write down as a loss function. You can’t differentiate “don’t be a jerk”. But humans can recognise a good answer when they see one, even if they can’t formalise it. RLHF is the trick that turns that fuzzy human judgement into a training signal.
The three stages
RLHF is a three-act play, and each act builds on the last.
Stage 1: Supervised fine-tuning (SFT). Take the pretrained model and fine-tune it on a curated set of high-quality prompt-and-response examples written by humans. This is essentially showing the intern a few thousand examples of “this is what a good answer looks like”. After SFT, the model can hold a coherent conversation instead of free-associating. It’s competent but bland, and it has no idea which of its answers are better than others.
Stage 2: Train a reward model. Now collect human preferences. Show annotators a prompt and two (or more) model responses, and ask the simple question: which one is better? You don’t need them to score anything on a scale, just to pick a winner. Those preference pairs train a separate reward model, typically an LLM with a scalar-prediction head bolted on, that learns to output a number estimating how much a human would like any given response. You’ve effectively distilled thousands of human gut reactions into a function you can query as often as you like, with no annotator in the loop.
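A common way to turn "pick a winner" into a training signal is a pairwise (Bradley-Terry style) loss: the reward model is penalised whenever it scores the rejected response close to, or above, the chosen one. The sketch below assumes the two scores are already computed; `preference_loss` is an illustrative name, not a library function.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected)).

    Written as softplus(-margin) in a numerically stable form.
    """
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# Reward model agrees with the annotator: small loss.
print(preference_loss(2.0, 0.0))
# Reward model prefers the response the annotator rejected: large loss.
print(preference_loss(0.0, 2.0))
```

Only the difference between the two scores matters, which is why annotators never need to agree on an absolute scale.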
Stage 3: Optimise with reinforcement learning. Finally, use the reward model as a stand-in for human judgement and fine-tune the SFT model to maximise its reward. The classic algorithm here is Proximal Policy Optimization (PPO). The model generates a response, the reward model scores it, and PPO nudges the model’s weights toward higher-scoring behaviour, while a KL-divergence penalty keeps it from wandering too far from the SFT model and turning into gibberish that happens to game the score.
In pseudocode, the loop is almost embarrassingly simple:
for prompt in prompts:
    response = policy_model.generate(prompt)              # current model's attempt
    reward = reward_model.score(prompt, response)         # "how human-pleasing is this?"
    kl_penalty = kl_divergence(policy_model, sft_model, prompt)
    shaped_reward = reward - beta * kl_penalty            # stay close to SFT, but improve
    # PPO derives its advantage from this shaped reward and a learned baseline
    policy_model.update(shaped_reward)                    # PPO step
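Two pieces of that loop are worth seeing as actual arithmetic: the KL-penalised reward and PPO's clipped update. In full PPO the penalised score acts as the reward, and the advantage is that reward minus a learned baseline. The sketch below works on single numbers rather than tensors, and the function names and default values (`beta=0.1`, `clip_eps=0.2`) are assumptions for illustration.

```python
import math

def shaped_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a KL-style penalty.

    logp_policy - logp_sft, taken on the sampled response, is a
    single-sample estimate of how far the policy has drifted from SFT.
    """
    return reward - beta * (logp_policy - logp_sft)

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO's clipped surrogate for one sample (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1 - clip_eps, min(ratio, 1 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio
    # further than clip_eps in the direction the advantage favours.
    return min(ratio * advantage, clipped * advantage)

# No drift from the SFT model: the reward passes through unchanged.
print(shaped_reward(1.0, logp_policy=-5.0, logp_sft=-5.0))
# A large jump in probability earns no more than the clipped amount.
print(ppo_clipped_objective(logp_new=-4.0, logp_old=-5.0, advantage=1.0))
```

The clipping is what the "proximal" in PPO refers to: each update is kept close to the policy that generated the samples.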
What it buys you, and what it costs
The payoff is enormous. RLHF is the difference between a model that completes your prompt and one that serves you. It’s why ChatGPT felt like a step-change rather than a bigger autocomplete. Models become noticeably more helpful, more inclined to follow instructions, and far better at refusing genuinely harmful requests.
But the reward model is a proxy, and optimising a proxy hard enough tends to end in tears. This is reward hacking: the policy discovers responses that score well without actually being good. RLHF-tuned models famously learn to be sycophantic (agreeing with you because agreement scores well), to pad answers with confident-sounding fluff, or to overuse bullet points and bold text because raters liked the look of structure. The model isn’t lying, exactly. It’s giving you precisely what you measured instead of what you meant, which is Goodhart’s law, the oldest curse in all of optimisation.
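A toy illustration of the proxy trap, with entirely made-up functions: suppose real usefulness peaks at a moderate answer length, while a flawed reward model simply likes longer answers. Optimising the proxy picks the worst candidate by the measure you actually cared about.

```python
def true_quality(n_words):
    # Hypothetical: usefulness rises, then falls as padding creeps in.
    return n_words - 0.02 * n_words ** 2

def proxy_reward(n_words):
    # Hypothetical flawed reward model: longer always looks better.
    return n_words

candidates = range(5, 101, 5)  # candidate answer lengths in words

best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=true_quality)

print(best_by_proxy, true_quality(best_by_proxy))
print(best_by_truth, true_quality(best_by_truth))
```

Real reward hacking is subtler than this, but the shape is the same: the two curves agree at first, then part ways exactly where the optimiser pushes hardest.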
Where the field is heading
RLHF’s three-stage PPO pipeline is powerful but finicky: it’s compute-hungry, unstable to train, and requires juggling several models at once. So much of the recent work has gone into simpler alternatives. Direct Preference Optimization (DPO) skips the separate reward model and the RL loop entirely, reframing preference learning as a simple classification loss on the same preference pairs. That is typically more stable to train, although it can still over-optimise the preference data. And RLAIF (Reinforcement Learning from AI Feedback) replaces expensive human annotators with another model generating the preferences, which can cut the cost per judgement dramatically.
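The DPO loss is short enough to sketch directly. It compares how much the policy has raised the log-probability of the chosen response, relative to a frozen reference model, against how much it has raised the rejected one. The version below takes summed log-probabilities as plain numbers; the function name, argument names, and `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(margin))."""
    chosen_gain = logp_chosen - ref_logp_chosen        # policy vs reference
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # softplus(-margin), written in a numerically stable form
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# Policy identical to the reference: loss starts at log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

It is the same pairwise shape as a reward-model loss, except the "score" is the policy's own log-probability shift, so no separate reward model or sampling loop is needed.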
The takeaway
If you remember one thing, make it this: RLHF doesn’t teach a model facts, it teaches a model taste. Pretraining supplies the knowledge; alignment supplies the judgement about how to use it. When you fine-tune or evaluate any aligned model, watch for the proxy trap: your model will optimise exactly what you reward, so reward what you actually want, not what merely looks good. And if you’re starting an alignment project today, reach for DPO before PPO: you’ll get most of RLHF’s benefit with a fraction of the moving parts.