Reinforcement Learning, Explained
Abhay
Most machine learning is taught with flashcards. You show the model a photo labelled “cat,” it guesses, you correct it, repeat ten million times. Reinforcement learning (RL) throws the flashcards out. Instead it drops an agent into a world, lets it fumble around, and pays it in points. No one tells it the right answer — it has to discover which actions tend to pay off. It’s less like a student cramming and more like a dog learning that “sit” reliably produces treats. Let’s unpack how that works, and why in 2026 it’s quietly running everything from chatbots to data-centre thermostats.
The loop that runs the whole show
RL has exactly five moving parts, and they spin in a loop:
- Agent — the decision-maker (the thing we’re training).
- Environment — everything the agent can’t directly control (the game, the road, the warehouse).
- State — a snapshot of the situation right now.
- Action — what the agent does next.
- Reward — a number the environment hands back saying “good” or “bad.”
The cycle is relentless: the agent observes a state, picks an action, the environment responds with a new state and a reward, and round we go. The agent’s one job is to maximise the cumulative reward over time: not just the next treat, but the whole jar. That distinction matters. Giving up a small reward now for a bigger one later, like sacrificing a pawn to win the game, is exactly the trade-off RL is built to learn, and the reason it can pick up genuinely clever long-horizon behaviour.
# The canonical RL loop, stripped to its bones
state = env.reset()
done = False                                         # no episode has ended yet
while not done:
    action = agent.choose(state)                     # decide
    next_state, reward, done = env.step(action)      # act, observe
    agent.learn(state, action, reward, next_state)   # update beliefs
    state = next_state
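The loop above leaves `env` and `agent` undefined, so here is a minimal sketch that fills them in so you can run it end to end. Everything here is invented for illustration: `CorridorEnv` is a hypothetical five-cell corridor with a reward at the right-hand end, `RandomAgent` is a placeholder that never learns, and `run_episode` plus its `max_steps` cap are my own additions, not part of any library.

```python
import random


class CorridorEnv:
    """Hypothetical toy world: cells 0..4 in a row, reward only at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (step left) or +1 (step right); the walls clamp the move
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


class RandomAgent:
    """Placeholder agent: acts at random and learns nothing."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def choose(self, state):
        return self.rng.choice([-1, 1])

    def learn(self, state, action, reward, next_state):
        pass  # a real agent would update its policy or value estimates here


def run_episode(env, agent, max_steps=1000):
    """Run the observe/act/learn loop once, with a step cap as a safety net."""
    state = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = agent.choose(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        state = next_state
        total_reward += reward
        steps += 1
    return total_reward, steps


total_reward, steps = run_episode(CorridorEnv(), RandomAgent())
print(f"finished in {steps} steps with total reward {total_reward}")
```

Swapping `RandomAgent` for something with a real `learn` method is where all the actual RL happens; the loop itself never changes.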
Explore or exploit? The agent’s eternal dilemma
Here’s where it gets interesting. Suppose your agent has found a restaurant that gives a reliably decent reward. Does it keep going back (exploit what it knows), or try the mysterious new place that might be amazing — or might be a disaster (explore)?
Pure exploitation gets you stuck in a rut, forever ordering the same mediocre dish because you never sampled anything better. Pure exploration means you never settle down and cash in on what you’ve learned. Good RL balances the two, often with a simple trick called epsilon-greedy: most of the time pick the best-known action, but with small probability epsilon, roll the dice and try something random. Over training, you slowly shrink epsilon — curious youth, settled adulthood.
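Epsilon-greedy is short enough to sketch in a few lines. This is an illustration rather than a reference implementation: the function names, the linear decay schedule, and the default values (`start=1.0`, `end=0.05`, `decay_steps=1000`) are all assumptions I picked for the example, and the restaurant scores are made up.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an index into `estimates`: random with probability epsilon, else the best."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: try anything
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploit


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly shrink epsilon from `start` to `end` over `decay_steps` steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


# Three restaurants with made-up estimated rewards
estimates = [0.2, 0.8, 0.5]
rng = random.Random(42)
for step in (0, 500, 1000):
    eps = decayed_epsilon(step)
    choice = epsilon_greedy(estimates, eps, rng)
    print(f"step={step} epsilon={eps:.3f} chose restaurant {choice}")
```

With `epsilon=0` the agent always picks restaurant 1 (the best-known one); with `epsilon=1` it picks uniformly at random. Linear decay is only one option, and exponential schedules are also common.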
Policy and value: two ways to be smart
Agents reason about the world using two related ideas:
- A policy is the agent’s strategy — a mapping from states to actions. “When the light is red, stop.” The goal of RL is to find the optimal policy that racks up the most reward.
- A value function estimates how good a state (or a state-action pair) is in the long run — the total reward you can expect from here on, if you keep playing well. It’s the agent’s sense of “this position looks promising” before any reward actually arrives.
Some methods learn the policy directly; others learn values and derive the policy from them. (The most famous value-based method, Q-learning, gets its own post — I won’t derive the Bellman update here, just know it’s the workhorse behind a lot of this.)
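To make “derive the policy from values” concrete, here is a small sketch. The table `q_values` and its numbers are invented for the traffic-light example; in a real agent those numbers would be learned, and this snippet says nothing about how.

```python
# Hypothetical action-value table for a traffic light: q_values[state][action]
q_values = {
    "red": {"stop": 1.0, "go": -10.0},
    "green": {"stop": -1.0, "go": 2.0},
}


def greedy_policy(q_values):
    """Derive a policy (state -> action) by taking the best-valued action in each state."""
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in q_values.items()
    }


def state_values(q_values):
    """Value of each state under the greedy policy: the best action value on offer."""
    return {state: max(av.values()) for state, av in q_values.items()}


policy = greedy_policy(q_values)
print(policy)                  # {'red': 'stop', 'green': 'go'}
print(state_values(q_values))  # {'red': 1.0, 'green': 2.0}
```

The policy here is just a lookup table, which only works when states are few and discrete. Larger problems typically replace the table with a function approximator such as a neural network.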
How this differs from supervised learning
If you’ve read the rest of this blog, you know supervised learning lives on labelled examples: every input comes with the correct output stapled to it. RL has no such luxury.
| | Supervised learning | Reinforcement learning |
|---|---|---|
| Signal | Correct label, immediately | Reward, often delayed |
| Feedback | “Here’s the right answer” | “That was worth 3 points” |
| Data | Fixed dataset | Generated by acting |
The killer difference is delayed, evaluative feedback. A chess agent doesn’t learn that move 12 was a blunder until it loses 30 moves later. Untangling which action deserves credit (or blame) for an outcome is the credit assignment problem, and it’s the central headache RL exists to solve. Supervised learning never has to worry about it; the answer key is right there.
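One simple, widely used way to spread a delayed reward back over earlier moves is the discounted return: each step is credited with its own reward plus a shrinking share of everything that follows. The sketch below assumes a discount factor `gamma`, a symbol I’m introducing here (it doesn’t appear elsewhere in this post), and a made-up five-move game. It is one ingredient of credit assignment, not a full solution.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A five-move game where the only reward is a loss (-1) at the very end
rewards = [0.0, 0.0, 0.0, 0.0, -1.0]
print(discounted_returns(rewards, gamma=0.5))
# [-0.0625, -0.125, -0.25, -0.5, -1.0]
```

Every move gets some of the blame for the loss, with moves closer to the end getting more. Whether that distribution is fair to move 12 is exactly what the harder credit assignment methods try to work out.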
Where it actually shows up
RL stopped being a lab curiosity a while ago. The greatest hits:
- Games: AlphaGo beat a Go world champion, and OpenAI Five (Dota 2) and AlphaStar (StarCraft II) went on to tackle messier, real-time, hidden-information environments.
- Robotics: learning grasping and locomotion through trial and error rather than hand-coded rules.
- Infrastructure: Google and DeepMind used RL to cut data-centre cooling energy, and later live trials on commercial cooling systems reported roughly 9–13% savings. Boring, lucrative, very real.
- Your chatbot: RLHF (Reinforcement Learning from Human Feedback) is how nearly every 2026 large language model gets aligned — pre-train on text, then use RL to nudge the model toward responses humans actually prefer. Newer variants like GRPO trim the memory cost, and “verifiable rewards” let models train themselves on maths and code where correctness is checkable.
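To give the last bullet some texture, here is a rough sketch of two ideas it mentions. A “verifiable reward” can be as simple as checking an answer against a known result, and the core trick in GRPO, as I understand it, is to score each sampled response relative to the other samples for the same prompt instead of training a separate value model. The function names, the exact-match check, and the sample answers are all assumptions for illustration; real training pipelines are far more involved.

```python
from statistics import mean, pstdev


def verifiable_reward(model_answer, correct_answer):
    """1.0 if the answer checks out, else 0.0. No human judge needed."""
    return 1.0 if model_answer.strip() == correct_answer.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to its own group: (reward - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled answers to "What is 12 * 12?"
samples = ["144", "124", " 144 ", "156"]
rewards = [verifiable_reward(s, "144") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # correct answers score positive, wrong ones negative
```

Responses with a positive score get reinforced and the rest get discouraged. When every sample in a group earns the same reward, all the scores come out as zero and that prompt contributes no learning signal.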
The takeaway
When you’re sizing up a problem, ask one question: do I have labelled answers, or only a notion of “better” and “worse”? If it’s the former, reach for supervised learning. If it’s the latter — sequential decisions, delayed payoffs, no answer key, just a score you want to maximise — you’re in RL territory. Frame it as agent, environment, state, action, reward, get the reward signal right (a badly designed reward is how agents learn to cheat), and balance exploration against exploitation. Nail those, and you’ve got the mental model that powers everything from game-playing superhumans to the assistant you’re probably chatting with right now.