Diffusion Models: How AI Generates Images
Abhay
Here is a fact that should not work but does: the way modern AI makes a photorealistic corgi-astronaut is by starting with a screen of pure static — the kind of snow your grandparents’ TV showed at 2am — and then carefully un-noising it until a corgi appears. It is the digital equivalent of looking at a cloud and not just seeing a dog, but slowly sculpting the cloud into the dog. This is the strange, beautiful idea behind diffusion models, the engine inside DALL·E, Midjourney, and Stable Diffusion.
Let’s unpack how destruction became the secret to creation.
Step one: ruin a perfectly good image
Diffusion has two halves, and the first one is almost comically simple. You take a real image and add a tiny bit of Gaussian noise. Then a bit more. Then more. Repeat this hundreds or thousands of times and your crisp photo dissolves into featureless static. This is the forward process: a Markov chain that walks an image, step by step, into noise. There is nothing to learn here. Adding noise is easy; any toddler with a salt shaker understands entropy.
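Here is a minimal sketch of that noising chain, in plain Python so it runs without any libraries. The linear schedule from 1e-4 to 0.02 over 1,000 steps is an assumption (it matches the original DDPM paper), `make_schedule` and `add_noise_step` are hypothetical names, and a short list of floats stands in for the image.

```python
import math
import random

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule: beta_t is the noise variance added at step t."""
    span = beta_end - beta_start
    return [beta_start + span * t / (num_steps - 1) for t in range(num_steps)]

def add_noise_step(x, beta, rng):
    """One forward step: shrink the signal slightly, then add Gaussian noise."""
    keep = math.sqrt(1.0 - beta)
    spread = math.sqrt(beta)
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in x]

rng = random.Random(0)
betas = make_schedule()
image = [0.8, -0.3, 0.5, 0.1]   # toy "image": four pixel values
x = list(image)
signal_scale = 1.0              # how much of the original survives
for beta in betas:
    x = add_noise_step(x, beta, rng)
    signal_scale *= math.sqrt(1.0 - beta)

print(f"fraction of original signal left: {signal_scale:.4f}")
```

Under these assumptions, well under 1% of the original signal survives the full walk, which is why the end state is treated as pure static.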
The clever bit is what we record along the way. At each step, we know exactly how much noise we added. That gives us a giant pile of training examples of the form: “here is a slightly noisier image, and here is the noise that was sprinkled on it.”
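One such training example can be sketched as follows. This assumes the widely used closed-form shortcut that jumps straight to noise level t instead of looping through every step, plus the same linear schedule as the DDPM paper. `cumulative_signal`, `make_training_example`, and `mse` are hypothetical helper names. The loss is the mean squared error between the true noise and whatever the network predicts.

```python
import math
import random

def cumulative_signal(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar[t]: fraction of signal variance remaining after t + 1 steps."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def make_training_example(x0, t, alpha_bar, rng):
    """Return (noisy image at level t, the noise that was mixed in)."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    noisy = [a * v + b * n for v, n in zip(x0, noise)]
    return noisy, noise

def mse(predicted, target):
    """Training loss: how far the predicted noise is from the true noise."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

rng = random.Random(0)
alpha_bar = cumulative_signal()
x0 = [0.8, -0.3, 0.5, 0.1]
t = rng.randrange(len(alpha_bar))      # pick a random noise level
noisy, noise = make_training_example(x0, t, alpha_bar, rng)
perfect_loss = mse(noise, noise)       # a perfect prediction scores 0
```

In real training, `noisy` and `t` would be fed to the network and its output compared against `noise` with this loss.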
Step two: learn to run the tape backwards
The reverse process is where the magic (and the neural network) lives. We train a model (usually a U-Net) to look at a noisy image and predict the noise that was added. Subtract a scaled portion of that prediction, and you get a slightly cleaner image. Do it again. And again. Start from pure static, denoise a few dozen times, and a coherent image crystallises out of the chaos.
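In rough PyTorch-flavoured code, with the network and the update rule passed in as placeholders, generation looks like this:
import torch

def generate(unet, denoise_step, image_shape, num_steps):
    # unet and denoise_step are placeholders supplied by the caller
    x = torch.randn(image_shape)  # x starts as pure random noise
    for t in reversed(range(num_steps)):  # walk from noisy -> clean
        predicted_noise = unet(x, t)  # the model's best guess
        x = denoise_step(x, predicted_noise, t)  # remove a slice of noise
    return x  # x is now a freshly generated image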
That is genuinely the whole show. No adversaries, no game theory — just a model that has gotten very good at one humble task: spotting noise and peeling it away.
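The `denoise_step` above is left abstract. One common choice is the update rule from the original DDPM paper, sketched here in plain Python on a list of floats. The linear schedule and the use of sqrt(beta_t) as the amount of fresh noise re-injected at each step are assumptions; other samplers use different rules.

```python
import math
import random

NUM_STEPS = 1000
BETAS = [1e-4 + (0.02 - 1e-4) * t / (NUM_STEPS - 1) for t in range(NUM_STEPS)]
ALPHA_BAR = []
_running = 1.0
for _beta in BETAS:
    _running *= 1.0 - _beta
    ALPHA_BAR.append(_running)

def denoise_step(x, predicted_noise, t, rng=random):
    """One DDPM-style reverse step from noise level t to t - 1."""
    beta = BETAS[t]
    noise_weight = beta / math.sqrt(1.0 - ALPHA_BAR[t])
    mean = [(v - noise_weight * e) / math.sqrt(1.0 - beta)
            for v, e in zip(x, predicted_noise)]
    if t == 0:
        return mean                      # final step: no fresh noise
    sigma = math.sqrt(beta)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

# Sanity check: noise a toy image by one step, then undo it with an
# "oracle" prediction that happens to be the exact noise that was added.
x0 = [0.8, -0.3, 0.5, 0.1]
true_noise = [0.2, -1.1, 0.4, 0.9]
x1 = [math.sqrt(ALPHA_BAR[0]) * v + math.sqrt(1.0 - ALPHA_BAR[0]) * e
      for v, e in zip(x0, true_noise)]
recovered = denoise_step(x1, true_noise, t=0)
```

With a perfect noise prediction, the last step returns the original values up to floating-point error. A trained network only approximates that prediction, which is why the loop needs many small steps.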
Why this beats GANs
For years, GANs (Generative Adversarial Networks) ran the image-generation scene by pitting two networks against each other: a generator faking images and a discriminator calling out the fakes. It worked, but GANs are famously temperamental. They suffer from mode collapse, where the generator discovers a handful of images that reliably fool the critic and then refuses to make anything else — like a chef who learns one dish gets applause and serves it at every meal forever. Training is a hyperparameter knife-edge; nudge it wrong and the whole thing diverges.
Diffusion models sidestep this. By breaking generation into many small, easy denoising steps instead of one heroic leap, they’re far more stable to train and tend to cover much more of the data’s diversity. The landmark 2021 OpenAI paper was literally titled “Diffusion Models Beat GANs on Image Synthesis”: its authors reported better FID scores (a standard quality metric, where lower is better) while producing genuinely varied output. Quality and diversity, without the drama.
Turning words into pictures
Denoising random static is neat, but you want a corgi-astronaut on demand. That’s conditioning. During training, the model also sees a text description alongside each image, encoded by a text encoder such as CLIP or T5, and learns to denoise in a direction that matches the words. At generation time, your prompt steers every denoising step toward “corgi” and away from “spaghetti.”
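One widely used way to make the prompt steer each step is classifier-free guidance: the model predicts the noise twice, once with the prompt and once without, and the sampler exaggerates the difference. The article does not say which mechanism a given product uses, so treat this as an illustrative sketch. `guided_noise` is a hypothetical name, and the default scale of 7.5 is an assumed, commonly seen value.

```python
def guided_noise(noise_uncond, noise_cond, guidance_scale=7.5):
    """Blend two noise predictions, pushing toward the prompt-conditioned one.

    guidance_scale = 0 ignores the prompt, 1 uses the conditioned prediction
    as-is, and larger values follow the prompt more aggressively.
    """
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_cond)]

# Toy predictions for a four-pixel "image".
noise_without_prompt = [0.10, -0.20, 0.05, 0.30]
noise_with_prompt = [0.30, -0.10, 0.00, 0.25]
steered = guided_noise(noise_without_prompt, noise_with_prompt)
```

The steered prediction then takes the place of the plain noise prediction in the denoising loop.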
The other breakthrough was efficiency. Running diffusion on full 512×512 pixel images is brutally expensive. Latent diffusion — the basis of Stable Diffusion — first compresses images into a small latent space (a 64×64 representation) using a VAE, runs the entire noisy-to-clean dance there, then decodes back to full resolution. Same idea, a fraction of the compute. This is why Stable Diffusion can run on a decent gaming GPU instead of a data centre.
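A quick back-of-the-envelope check shows how much smaller the latent is. The 3 colour channels and 4 latent channels are assumptions (they match the first Stable Diffusion releases), and `num_values` is a hypothetical helper.

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels array."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)    # RGB image
latent_values = num_values(64, 64, 4)     # assumed 4-channel latent
reduction = pixel_values / latent_values

print(pixel_values, latent_values, reduction)  # 786432 16384 48.0
```

Under those assumptions, the denoiser handles 48 times fewer values per step. Actual compute savings depend on the network, so read the ratio as a rough guide.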
The catch: speed
There’s no free lunch. Because diffusion generates by iterating — sometimes 50 steps, historically up to 1,000 — it’s inherently slower than a GAN, which produces an image in a single forward pass. Much of the field’s recent work is about shrinking that step count: distillation and consistency models now squeeze decent images out of 1–4 steps, closing the speed gap fast.
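Since each step costs roughly one network call, the simplest speed-up is to visit only a subset of the noise levels the model was trained on. This sketch picks an evenly spaced subset. `spaced_timesteps` is a hypothetical name, and a real sampler also needs an update rule that copes with the bigger jumps, as DDIM-style samplers do.

```python
def spaced_timesteps(train_steps=1000, sample_steps=50):
    """Pick evenly spaced noise levels, ordered from noisiest to cleanest."""
    stride = train_steps // sample_steps
    levels = list(range(0, train_steps, stride))[:sample_steps]
    return levels[::-1]

fast = spaced_timesteps(1000, 50)    # 50 network calls instead of 1,000
faster = spaced_timesteps(1000, 4)   # the few-step regime
```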
The takeaway
Next time an image generator dazzles you, remember the actual recipe: systematically add noise, then train a network to reverse it, one small step at a time. If you’re building with these tools, three rules of thumb pay off:
- Reach for latent diffusion (Stable Diffusion-style) when compute matters — it’s the difference between a laptop and a server farm.
- Treat sampling steps as a quality/speed dial. More steps generally means a cleaner image and slower output, with diminishing returns past a point. Start around 20–30 and tune from there.
- Your prompt is the conditioning signal that bends every denoising step — vague words give the model too much freedom, so be specific.
Creation, it turns out, is just destruction played in reverse — very carefully, and with excellent taste.