Convolutional Neural Networks (CNNs), Explained

Show a toddler a cat once and they’ll spot cats forever — fluffy ones, grumpy ones, cats half-hidden behind a curtain. For decades, computers were hopeless at this. Then convolutional neural networks (CNNs) came along and quietly became the reason your phone can tell a hotdog from a not-hotdog. Let’s unpack how they actually pull it off.

Why a plain network chokes on images

First, the obvious question: why not just feed images into a regular neural network and call it a day?

Because the maths is brutal. A modest 32×32 colour image is already 3,072 numbers. Wire that into a fully connected layer and every neuron connects to every pixel. Scale up to a real photo (say 224×224×3, around 150,000 inputs) and a first layer of just a few hundred neurons already needs tens of millions of weights. To put the absurdity in perspective: a single fully connected layer over a 32×32×3 image producing a 28×28×6 output needs roughly 14 million parameters. The equivalent convolutional layer, six 5×5 filters? Just 456, or 156 if the image is greyscale.
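If you want to check that arithmetic yourself, here is a small sketch. The helper names are made up for illustration, and it assumes the convolutional layer uses six 5×5 filters with no padding (which is what turns a 32×32 input into a 28×28 output), with one bias per output unit or filter.

```python
# Parameter counts for the comparison above (weights plus biases).

def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Fully connected layer: every output unit sees every input."""
    return n_inputs * n_outputs + n_outputs

def conv_params(kernel_h: int, kernel_w: int, in_channels: int, n_filters: int) -> int:
    """Convolutional layer: each filter has one small weight grid per input channel."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters

n_in = 32 * 32 * 3    # 3,072 input values
n_out = 28 * 28 * 6   # 4,704 output values

fc = dense_params(n_in, n_out)
conv_rgb = conv_params(5, 5, 3, 6)    # colour input, 3 channels
conv_grey = conv_params(5, 5, 1, 6)   # greyscale input, 1 channel

print(f"fully connected:        {fc:,}")         # 14,455,392
print(f"convolutional (colour): {conv_rgb:,}")   # 456
print(f"convolutional (grey):   {conv_grey:,}")  # 156
```

The convolutional count depends only on the filter size and channel counts, never on the image size, which is why it stays tiny: 456 parameters for a colour input, 156 for a single-channel one.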

It’s not just the parameter bloat. A fully connected layer also throws away the one thing that makes an image an image: spatial structure. Flatten a photo into a long list of numbers and the network has no idea that two pixels were neighbours. It’s like trying to appreciate a painting by reading its hex codes aloud in random order.

Convolution: a tiny window that slides

CNNs fix both problems with one elegant trick. Instead of looking at the whole image at once, a filter (also called a kernel) — a small grid of weights, often 3×3 — slides across the image, computing a weighted sum at each position. That sliding-and-summing operation is the convolution.
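To make the sliding window concrete, here is a deliberately naive sketch in plain Python. The function name `conv2d` and the toy numbers are illustrative assumptions; it uses stride 1 and no padding, and, like most deep learning libraries, it does not flip the kernel. Real code would hand this off to a framework.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):            # top edge of the window
        row = []
        for j in range(out_w):        # left edge of the window
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

A 4×4 input and a 2×2 kernel give a 3×3 output: one weighted sum for each position the window can occupy.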

Here’s the clever part. The same filter is reused at every position. This is parameter sharing, and it’s why CNNs are so lean: one small filter learns to detect, say, a vertical edge, and then it hunts for that edge everywhere in the image. An edge in the top-left and an edge in the bottom-right are found by the same handful of weights. Cheap, and exactly how vision should work — a cat’s ear is an ear no matter which corner of the frame it’s lurking in.
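A toy demonstration of that sharing, under made-up assumptions: a hand-written 2×2 vertical-edge filter (a trained network would learn its own weights) and a tiny image with one dark-to-bright edge near the top-left and another near the bottom-right.

```python
def conv2d(image, kernel):
    """Stride-1, no-padding sliding weighted sum (kernel is not flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]

# Rows 0-1 have an edge between columns 1 and 2;
# rows 2-3 have an edge between columns 5 and 6.
image = [
    [0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1],
]

# Four weights in total: responds where the right column is brighter than the left.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)
# [0, 2, 0, 0, 0, 0, 0]
# [0, 1, 0, 0, 0, 1, 0]
# [0, 0, 0, 0, 0, 2, 0]
```

The same four weights fire at both edges. Nothing about the filter knows or cares where in the frame it is.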

A single filter’s output is a feature map: a grid showing where in the image that particular pattern lit up. Early-layer filters learn embarrassingly simple things — edges, blobs, gradients of colour. The weights aren’t hand-designed; they’re learned by training, the same way the rest of the network learns.

Pooling: zoom out, keep the gist

After convolution comes pooling, which downsamples the feature maps. The most common flavour, max pooling, moves a 2×2 window across each feature map in steps of two and keeps only the largest value in each patch. A 32×32×12 volume becomes 16×16×12: a quarter of the data, same number of channels.
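Max pooling is simple enough to sketch in a few lines. This version assumes non-overlapping windows (stride equal to the window size) on a single feature map; the function name and numbers are illustrative.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size patch."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled = max_pool2d(fm)
print(pooled)  # [[4, 2], [2, 8]]

# The same halving applied to the volume in the text:
before = 32 * 32 * 12
after = 16 * 16 * 12
print(before // after)  # 4
```

Each channel is pooled independently, which is why the channel count is untouched while height and width are halved.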

Pooling does two useful jobs. It shrinks the computation, and it makes the network a little translation-invariant: nudge the cat a few pixels to the left and the pooled output barely flinches. The network learns to care that an ear is present, not that it sits at coordinate (47, 112).
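Here is a small sketch of that "barely flinches" behaviour, and of its limits. It assumes a single strong activation on an otherwise empty 4×4 feature map and the same non-overlapping 2×2 max pooling as before; the names are illustrative.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size patch."""
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(len(feature_map[0]) // size)
        ]
        for i in range(len(feature_map) // size)
    ]

def single_activation(row, col, size=4, value=9):
    """A size x size feature map that is zero except at (row, col)."""
    return [[value if (r, c) == (row, col) else 0 for c in range(size)]
            for r in range(size)]

original = max_pool2d(single_activation(0, 0))
nudged = max_pool2d(single_activation(0, 1))    # moved 1 pixel, same 2x2 patch
moved = max_pool2d(single_activation(0, 2))     # moved 2 pixels, next patch

print(original)  # [[9, 0], [0, 0]]
print(nudged)    # [[9, 0], [0, 0]]
print(moved)     # [[0, 9], [0, 0]]
```

A shift that stays inside one pooling window disappears entirely; a larger shift still moves the output, just on a coarser grid. That is the "a little" in "a little translation-invariant".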

A hierarchy of features

Stack these blocks and something beautiful emerges. The first convolutional layers detect edges. Feed those edges into the next layers and they assemble into textures and corners. Stack more, and you get eyes, ears, wheels, faces. Each layer composes the patterns below it into something richer — a hierarchy of features, built from raw pixels up to recognisable objects, with nobody ever telling the network what an “ear” is.
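One way to see why depth buys richer features is to track how much of the input image a single unit can "see" after each layer, usually called its receptive field. The sketch below is a back-of-the-envelope calculation, not something from a particular library; the function name and the example stack (3×3 convolutions with stride 1, 2×2 pooling with stride 2) are assumptions chosen for illustration.

```python
def receptive_fields(layers):
    """Receptive-field width, in input pixels, after each layer.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    """
    size, jump = 1, 1   # start at one pixel; neighbouring units are 1 pixel apart
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * jump   # the window widens by (kernel - 1) steps
        jump *= stride                # striding spreads neighbouring units apart
        sizes.append(size)
    return sizes

# conv 3x3 -> pool 2x2 -> conv 3x3 -> pool 2x2
stack = [(3, 1), (2, 2), (3, 1), (2, 2)]
fields = receptive_fields(stack)
print(fields)  # [3, 4, 8, 10]
```

A unit in the first layer sees a 3×3 patch, enough for an edge. After two conv and pool rounds it is summarising a 10×10 patch, enough room for a corner or a texture, and the window keeps growing with every extra block.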

The classic architecture is a sandwich: convolution → pooling, repeated several times to extract and condense features, then a couple of dense (fully connected) layers at the end to turn those features into a final answer (“87% cat”). LeNet-5 pioneered the recipe back in the 1990s; AlexNet, VGG, and ResNet later scaled it into the powerhouses that conquered image recognition.

What it looks like in code

In Keras, that whole conv→pool→dense pattern is refreshingly readable:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                # 28x28 greyscale images
    layers.Conv2D(32, (3, 3), activation="relu"),  # 32 learned 3x3 filters
    layers.MaxPooling2D((2, 2)),                   # halve height and width
    layers.Conv2D(64, (3, 3), activation="relu"),  # 64 filters over the pooled maps
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # feature maps -> one long vector
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),        # 10 classes
])

# Integer labels (0-9) pair with the sparse variant of cross-entropy.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

Two Conv2D layers learning 32 then 64 filters, each followed by pooling, then flattened into dense layers for the verdict. That’s a complete digit classifier in about a dozen lines, ready to train with model.fit.
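To see where the numbers go inside that model, you can trace the shapes and parameter counts by hand. The sketch below is plain Python with no Keras involved, and it bakes in a few assumptions: 3×3 convolutions with no padding (the Keras default), 2×2 pooling with stride 2 that drops any odd leftover row or column, and one bias per filter or dense unit. The function name is made up for illustration.

```python
def trace_model():
    """Walk the conv -> pool -> conv -> pool -> dense -> dense stack by hand."""
    side, channels = 28, 1
    rows = []

    for n_filters in (32, 64):
        params = (3 * 3 * channels + 1) * n_filters
        side = side - 3 + 1        # 3x3 kernel, no padding
        channels = n_filters
        rows.append((f"conv {n_filters}", (side, side, channels), params))
        side //= 2                 # 2x2 max pooling, stride 2
        rows.append(("max pool", (side, side, channels), 0))

    features = side * side * channels
    rows.append(("flatten", (features,), 0))
    for units in (64, 10):
        rows.append((f"dense {units}", (units,), features * units + units))
        features = units
    return rows

rows = trace_model()
for name, shape, params in rows:
    print(f"{name:<10} {str(shape):<14} {params:>8,}")

total_params = sum(params for _, _, params in rows)
print(f"total parameters: {total_params:,}")  # 121,930
```

Most of the roughly 122,000 parameters sit in the first dense layer (102,464 of them), while the two convolutional layers together need fewer than 19,000. If you have TensorFlow installed, `model.summary()` should show the same breakdown.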

Where CNNs show up

CNNs are the workhorse of computer vision: photo tagging, medical imaging, self-driving car perception, face unlock, satellite imagery, and the quality-control camera squinting at your soda cans on a factory line. Transformers have lately muscled into vision too, but for sheer efficiency on image-shaped data, convolution is still everywhere — often quietly running inside the bigger systems that get the headlines.

The takeaway

Remember three words: filters, hierarchy, sharing. A CNN works because small reusable filters slide across an image detecting patterns, pooling condenses them, and stacked layers build simple edges into complex objects — all while sharing weights to keep the parameter count sane. Next time you reach for a neural network on anything with spatial structure — images, audio spectrograms, even some time series — don’t flatten and hope. Convolve. Your accuracy, and your GPU, will thank you.
