Q-Learning and the Bellman Equation, Explained

Imagine teaching someone chess by never explaining the rules of strategy — you just let them play a few million games and whisper “good” or “bad” after each move. Eventually they’d figure out that protecting the king is wise and hanging the queen is not. That, in a nutshell, is Q-learning: an agent that learns to act well purely from rewards, without anyone ever handing it a rulebook.

If you’re hazy on the basics — agents, states, rewards, the whole loop — start with the reinforcement learning overview. This post zooms in on the engine room: the Q-table and the famously elegant update that fills it in.

The Q-table: a giant cheat sheet

At the heart of classic Q-learning sits a table. The rows are states (where you are) and the columns are actions (what you could do). Each cell holds a single number, Q(s, a), which answers one deceptively deep question: “If I’m in state s and take action a, how much total reward can I expect from here on out, assuming I play smart afterwards?”

Get that table filled in correctly and you don’t need to think anymore. In any state, just scan the row and pick the action with the highest number. The table is the strategy. The entire problem reduces to one thing: figuring out the right numbers.

The Bellman update: today’s reward plus tomorrow’s promise

Here’s the trick. You don’t know the true Q-values up front — if you did, you’d already be a chess grandmaster. So you start with a table full of zeros (or random guesses) and improve it one experience at a time.

Every time the agent takes action a in state s, lands in a new state s', and collects reward r, it nudges its estimate using the Bellman update:

# Tabular Q-learning update for one step
def update_q(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    best_next = max(Q[next_state])          # best you could do from s'
    td_target = reward + gamma * best_next  # reality says: this is the value
    td_error = td_target - Q[state][action] # how wrong was our guess?
    Q[state][action] += alpha * td_error    # nudge toward reality
    return Q

Read the td_target line slowly, because it’s the whole idea: the value of acting now equals the reward you get immediately plus the discounted best you can do next. The discount factor gamma (a number like 0.99) is patience dialled in — close to 1 and the agent plans far ahead; close to 0 and it lives for the moment like a toddler near a cookie jar.

The gap between what we believed (Q[state][action]) and what reality just suggested (td_target) is the temporal-difference error. The learning rate alpha decides how boldly we close that gap. Big alpha, jumpy and reactive; small alpha, slow but steady. We don’t leap to the new estimate — we drift toward it.

Exploration vs. exploitation: the epsilon-greedy bargain

There’s a catch. If the agent always picks the action it currently thinks is best, it’ll happily get stuck in the first half-decent rut it finds — never discovering the brilliant move two squares over. But if it always acts randomly, it never uses what it learned.

The classic compromise is epsilon-greedy: most of the time, exploit (pick the best-known action); but with small probability epsilon, explore (roll the dice on something random).

import random

def choose_action(Q, state, n_actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore
    return max(range(n_actions), key=lambda a: Q[state][a])  # exploit

A common move is to start with high epsilon (be curious early) and decay it over time (settle down once you know the terrain).

Does it actually work? Convergence

Reassuringly, yes — with caveats. Q-learning is mathematically guaranteed to converge to the optimal values, Q*, because the Bellman update is a contraction mapping: each pass squeezes the estimates inexorably closer to the truth. It’s also off-policy, meaning it learns the optimal strategy even while wandering around exploring — that max in the update always asks “what’s the best I could do next,” regardless of what the agent actually did.

The fine print: the guarantee assumes every state-action pair is visited infinitely often, and that alpha decays sensibly. Translation — a tabular Q-learner needs to try everything, repeatedly. Fine for a gridworld; hopeless for chess, where the states outnumber the atoms in the universe.

The leap to Deep Q-Networks

That’s exactly the wall Q-tables hit — and where Deep Q-Networks (DQN) come in: swap the lookup table for a neural network that predicts Q(s, a) from raw inputs like pixels, so the agent can generalise across states it has never literally seen.

The takeaway

If you remember one line, make it the Bellman update: value now = reward now + discounted best value next, and you only ever nudge toward that target, never jump. To build a Q-learner, you need exactly four knobs: alpha (how fast to learn), gamma (how far to plan), epsilon (how much to explore), and an epsilon decay schedule. Get those four right on a small problem first — then reach for a neural network when the table won’t fit.