Batch Size, Epochs, and Iterations Demystified

Three little terms trip up almost everyone learning deep learning: epoch, iteration, and batch size. They sound like they should be interchangeable. They are not. Mix them up and your training logs become hieroglyphics, your learning rate goes haywire, and you spend an afternoon wondering why “100 epochs” finished in four seconds.

Let’s fix that for good. Think of training a model like reading a giant textbook to study for an exam.

The three terms, in plain English

An epoch is one complete read-through of the entire training dataset. Every single example has been seen by the model exactly once. Read the whole textbook cover to cover — that’s one epoch. Most models need many epochs, because nobody learns calculus in a single read.

Batch size is how many examples the model looks at before it updates its weights. You almost never feed the whole dataset in at once — it won’t fit in memory, and it turns out not to be ideal anyway. Instead you read a few pages, pause, and update your understanding. Those few pages are a batch.

An iteration (or “step”) is one weight update — one batch processed. Read a chunk, think about it, adjust. That’s an iteration.

The relationship that ties them together is refreshingly simple arithmetic:

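iterations per epoch = number of training samples ÷ batch size (rounded up)

So with 10,000 samples and a batch size of 100, one epoch is 10,000 ÷ 100 = 100 iterations. Run 20 epochs and you’ve performed 2,000 weight updates total. If the division leaves a remainder, the leftover examples form one final, smaller batch (unless your data loader is told to drop it), which is why you round up. That’s the whole formula. No calculus required.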
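If you’d rather let the computer do the division, here is a minimal sketch of that arithmetic in plain Python. The function names are made up for illustration, and the drop_last flag is an assumption that mirrors the option of the same name on PyTorch’s DataLoader: when it is on, an incomplete final batch is skipped instead of counted.

```python
import math

def iterations_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of weight updates in one full pass over the data."""
    if drop_last:
        # An incomplete final batch is thrown away.
        return num_samples // batch_size
    # An incomplete final batch still counts as one iteration.
    return math.ceil(num_samples / batch_size)

def total_updates(num_samples, batch_size, num_epochs, drop_last=False):
    """Weight updates performed over the whole training run."""
    return iterations_per_epoch(num_samples, batch_size, drop_last) * num_epochs

print(iterations_per_epoch(10_000, 100))                   # 100
print(total_updates(10_000, 100, 20))                      # 2000
print(iterations_per_epoch(10_050, 100))                   # 101
print(iterations_per_epoch(10_050, 100, drop_last=True))   # 100
```

The last two lines show the only wrinkle: 50 leftover examples either become a 101st, half-sized batch or get dropped.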

A training loop, with the terms labelled

Here’s a stripped-down PyTorch-style loop so you can see exactly where each term lives:

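from torch.utils.data import DataLoader

# Assumed to exist already: train_dataset, model, loss_fn, optimizer
batch_size = 100
num_epochs = 20

# DataLoader chops the dataset into batches of `batch_size`
loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# iterations_per_epoch == len(loader) == ceil(len(train_dataset) / batch_size)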
for epoch in range(num_epochs):           # one epoch = full pass
    for inputs, targets in loader:        # one loop = one iteration / step
        optimizer.zero_grad()
        predictions = model(inputs)       # forward pass on this batch
        loss = loss_fn(predictions, targets)
        loss.backward()                   # backward pass: compute gradients
        optimizer.step()                  # ONE weight update per batch

The outer loop counts epochs; the inner loop counts iterations. Batch size determines how many times optimizer.step() fires per epoch: halve the batch size and you double the number of updates. Everything flows from there.
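To watch the counters move without installing anything, here is a rough framework-free sketch of the same two nested loops. The iterate_batches helper, the 1,050-item stand-in dataset, and the global_step counter are all invented for illustration; a real loop would do a forward pass, a backward pass, and an optimizer step where the counter is bumped.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, each at most `batch_size` long."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(1050))   # stand-in dataset of 1,050 examples
batch_size = 100
num_epochs = 3

global_step = 0               # total weight updates so far
first_epoch_batch_sizes = []  # batch sizes seen during epoch 0

for epoch in range(num_epochs):                          # one epoch = full pass
    for batch in iterate_batches(samples, batch_size):   # one iteration
        global_step += 1      # a real loop would update the weights here
        if epoch == 0:
            first_epoch_batch_sizes.append(len(batch))

print(len(first_epoch_batch_sizes))   # 11 iterations per epoch
print(first_epoch_batch_sizes[-1])    # 50, the smaller final batch
print(global_step)                    # 33 updates over 3 epochs
```

A running counter like global_step is usually what training logs and learning-rate schedules mean by “step”, which is why it keeps climbing across epochs instead of resetting.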

Small batches vs. large batches: pick your poison

Batch size isn’t just a memory knob — it changes how your model learns.

Small batches produce noisy gradient estimates. Each update is computed from only a handful of examples, so it jitters around the “true” direction. Counterintuitively, that noise is often a feature: it helps the optimizer hop out of bad local minima and tends to land in flatter regions that generalize better to unseen data. The cost is slower wall-clock training, because you’re doing more, smaller updates.

Large batches give smooth, accurate gradients and crunch through an epoch faster — especially on GPUs that love big parallel chunks. The catch is the well-documented generalization gap: very large batches can converge to sharp minima that look great on training data and disappoint on test data. They also eat memory voraciously.
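The “noisy versus smooth” claim is easy to check on a toy problem. The sketch below is an illustration under made-up assumptions, not a benchmark: it builds a synthetic one-parameter regression task (y ≈ 3x plus noise), draws many random batches, and measures how much the batch gradient at w = 0 varies from batch to batch. Every name and constant in it is hypothetical.

```python
import random
import statistics

def make_data(n, seed=0):
    """Synthetic (x, y) pairs with y = 3x + Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x + rng.gauss(0.0, 0.5)
        data.append((x, y))
    return data

def batch_gradient(batch, w):
    """Gradient of the mean squared error (w*x - y)**2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def gradient_spread(data, batch_size, w=0.0, trials=200, seed=1):
    """Standard deviation of the batch gradient across random batches."""
    rng = random.Random(seed)
    grads = [batch_gradient(rng.sample(data, batch_size), w)
             for _ in range(trials)]
    return statistics.pstdev(grads)

data = make_data(10_000)
spread_small = gradient_spread(data, batch_size=8)
spread_large = gradient_spread(data, batch_size=512)
print(f"batch   8: gradient spread {spread_small:.3f}")
print(f"batch 512: gradient spread {spread_large:.3f}")
```

The spread for the batch of 8 should come out several times larger than for the batch of 512, roughly in line with noise shrinking like one over the square root of the batch size. Whether that noise helps or hurts generalization is a separate question this toy cannot answer.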

The honest rule of thumb: powers of two between 32 and 256 are a sane default for most problems. Start there, and only push higher if you have the hardware and the patience to tune for it.

The learning-rate catch: the linear scaling rule

Here’s the trap that bites people who crank up batch size for speed. A bigger batch means fewer, more confident updates per epoch — so each update needs to be bolder to cover the same ground. The fix is the linear scaling rule: when you multiply the batch size by k, multiply the learning rate by k too.

This is exactly the trick Goyal et al. used in their famous 2017 result, training ResNet-50 on ImageNet in one hour with batches of 8,192 images. They paired the scaled-up learning rate with a brief warmup period, ramping the learning rate up gradually over the first few epochs, to avoid blowing up early in training. The rule is a heuristic rather than a law, and it tends to break down at extreme batch sizes. Forget to scale, though, and a huge batch with a tiny learning rate will train glacially or stall entirely.
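In code, the rule and the warmup are only a few lines. This is a minimal sketch, not anyone’s official schedule: the function names are invented, the reference setting of learning rate 0.1 at batch size 256 is an illustrative assumption, and the warmup is counted in abstract steps and then simply holds the target value, whereas a real schedule would usually decay afterwards.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: scale the LR by the same factor k as the batch."""
    k = batch_size / base_batch_size
    return base_lr * k

def lr_with_warmup(step, warmup_steps, start_lr, target_lr):
    """Ramp linearly from start_lr to target_lr, then hold target_lr."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps

base_lr, base_batch = 0.1, 256   # illustrative reference setting (assumption)
big_batch = 8192                 # k = 8192 / 256 = 32

target = scaled_lr(base_lr, base_batch, big_batch)
schedule = [
    lr_with_warmup(step, warmup_steps=5, start_lr=base_lr, target_lr=target)
    for step in range(8)
]

print(f"target lr: {target:.2f}")            # target lr: 3.20
print([round(lr, 2) for lr in schedule])
```

The schedule starts at the small-batch learning rate, climbs in equal increments, and sits at the scaled value from step 5 onward.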

How many epochs is enough?

The unsatisfying-but-correct answer: as many as it takes until your validation loss stops improving — and not one more. Train too few epochs and the model underfits; train too many and it starts memorising noise and overfits.

You don’t have to guess. Use early stopping: watch the validation loss each epoch and halt when it hasn’t improved for a set number of epochs (the “patience”). It’s a one-line callback in Keras and a short loop check in PyTorch, and it saves both compute and your model’s dignity.
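The patience logic is simple enough to sketch without any framework. The helper below is hypothetical (it is not the Keras callback or any library’s API): it takes a list of per-epoch validation losses and reports the epoch at which training would have been halted. The min_delta parameter is an added assumption that sets how much smaller a loss must be to count as an improvement.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 0-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best validation loss: reset the patience counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73, 0.60]
stop_at = early_stopping_epoch(losses, patience=3)
print(stop_at)   # 5
```

Note what happens here: training halts at epoch 5 and never sees the 0.60 at epoch 6. That is the trade-off baked into patience, and it is also why early stopping is usually paired with saving a checkpoint of the best model so far (epoch 2 in this example).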

The takeaway

Burn this into memory:

Epoch = one full pass over the data.
Iteration = one batch = one weight update.
Batch size = examples per update; iterations/epoch = samples ÷ batch size, rounded up.
Default batch size to 32–256; smaller often generalizes better, larger is faster.
Scale your learning rate with your batch size (linear scaling rule + warmup).
Let early stopping decide your epoch count, not a number you typed at 2 a.m.