Transformers: The Architecture Behind Modern AI

Every time you chat with an AI, ask it to summarise a contract, or watch it generate an image of a corgi in a spacesuit, you are very likely poking at the same underlying machine: the transformer. It is the engine room of most modern AI. And the strange part is that it was born from a 2017 Google paper with the cheeky title “Attention Is All You Need”, a paper now cited over 170,000 times, putting it among the most-cited works of the century. The title is a big claim. It has held up remarkably well.

So what changed? Let’s start with what came before.

The problem with reading one word at a time

Before transformers, the dominant tool for language was the recurrent neural network (RNN). An RNN reads a sentence the way you’d read it with a torch in a dark room: one word at a time, left to right, carrying a little summary of everything so far in its head.

This has two nasty side effects. First, by the time the model reaches the end of a long paragraph, the beginning has faded into a blur: the “memory” gets diluted. Second, and worse for the GPU era, you cannot process word five until you’ve processed word four. It’s strictly sequential. That makes RNNs painfully slow to train, and slowness is the enemy of scale.
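To make the torch-in-a-dark-room picture concrete, here is a minimal sketch of a recurrent loop in plain Python. The scalar weights `w_h` and `w_x` and the toy inputs are made up for illustration; a real RNN learns weight matrices and works on vectors, so treat this as a cartoon rather than an implementation.

```python
import math

def rnn_step(hidden, x, w_h=0.5, w_x=1.0):
    """One recurrent update: mix the previous summary with the new input."""
    return math.tanh(w_h * hidden + w_x * x)

def run_rnn(inputs):
    """Read the inputs strictly left to right, returning every hidden state."""
    hidden = 0.0
    states = []
    for x in inputs:  # step t cannot start until step t-1 has finished
        hidden = rnn_step(hidden, x)
        states.append(hidden)
    return states

# Toy scalar "words": a strong signal at the start, then nothing.
sequence = [1.0, 0.0, 0.0, 0.0, 0.0]
states = run_rnn(sequence)
for step, state in enumerate(states):
    print(step, round(state, 4))
```

In this toy setup the trace of the first input shrinks at every step, which is the fading memory described above, and the `for` loop is the sequential bottleneck: each iteration needs the result of the previous one.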

The transformer’s headline trick is that it throws out the torch and turns on all the lights at once.

Attention: “what should I pay attention to?”

The core idea is attention, and the intuition is more human than it sounds. Consider the sentence:

The trophy didn’t fit in the suitcase because it was too big.

What does “it” refer to — the trophy or the suitcase? You know instantly it’s the trophy, because trophies being too big is what stops them fitting. To understand “it,” your brain quietly looked back and weighted the other words by relevance.

That weighting is attention. For every word, the model asks: given my job right now, which other words should I pay attention to, and how much? “It” pays a lot of attention to “trophy,” some to “big,” and almost none to “the.”

Here are the mechanics. Each token (a word or word-piece) is turned into three vectors: a Query (“what am I looking for?”), a Key (“what do I offer?”), and a Value (“here’s my actual content”). A token’s Query is compared against every token’s Key, its own included, to produce attention scores, and those scores, normalised so they sum to one, blend the Values together. When tokens attend to other tokens in the same sentence, we call it self-attention, the beating heart of the whole thing.

In rough pseudocode:

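# scores[i][j] = how much token i should attend to token j (each row sums to 1)
# key_dim is the length of one Key vector; the softmax runs along each row
scores = softmax(Query @ Key.T / sqrt(key_dim))
output = scores @ Value   # weighted blend of every token's content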
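For readers who want something they can run, here is a sketch of the same computation in plain Python, using only the standard library rather than NumPy. The toy `query`, `key`, and `value` matrices are invented for illustration; in a real transformer they come from learned projections of the token embeddings.

```python
import math

def softmax(row):
    """Turn a list of raw scores into positive weights that sum to 1."""
    top = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def attention(query, key, value):
    """Scaled dot-product attention; returns (weights, output)."""
    key_dim = len(key[0])
    raw = matmul(query, transpose(key))
    weights = [softmax([s / math.sqrt(key_dim) for s in row]) for row in raw]
    return weights, matmul(weights, value)

# Three toy tokens with 2-dimensional Query/Key/Value vectors (made up).
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
key = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
value = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

weights, output = attention(query, key, value)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all three tokens. The third token's Query lines up best with the third Key, so that entry gets the largest weight in its row.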

Run several of these in parallel, each with its own learned Query, Key, and Value projections, and you get multi-head attention. Each “head” can learn to track a different relationship: one might watch grammar, another chase long-range references, another mind the topic. It’s a committee of specialists, all reading at once.
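The committee idea can be sketched in a few lines. This is a simplified version under a loud assumption: instead of giving each head its own learned projections, it just slices every token vector into equal chunks and lets each head attend over its own chunk. The function names `attend` and `multi_head` and the toy `tokens` are hypothetical, chosen for this example.

```python
import math

def attend(q, k, v):
    """Scaled dot-product attention for one head (lists of row vectors)."""
    scale = math.sqrt(len(k[0]))
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

def multi_head(x, num_heads):
    """Slice each vector into num_heads chunks, attend per chunk, concatenate."""
    dim = len(x[0])
    assert dim % num_heads == 0, "dim must divide evenly across heads"
    head_dim = dim // num_heads
    merged = [[] for _ in x]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        piece = [row[lo:hi] for row in x]       # this head's view of every token
        head_out = attend(piece, piece, piece)  # assumption: no learned projections
        for token_out, head_row in zip(merged, head_out):
            token_out.extend(head_row)
    return merged

# Three toy tokens with 4-dimensional vectors (made up for illustration).
tokens = [[1.0, 0.0, 0.0, 2.0],
          [0.0, 1.0, 2.0, 0.0],
          [1.0, 1.0, 1.0, 1.0]]
result = multi_head(tokens, num_heads=2)
print(len(result), len(result[0]))
```

The output keeps the input's shape: three tokens, four numbers each. What changes is that the first two numbers were blended using one head's attention pattern and the last two using another's.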

Tokens, embeddings, and a sense of order

A transformer doesn’t see words; it sees numbers. Text is chopped into tokens, and each token becomes an embedding — a long vector of numbers where “king” and “queen” land near each other in meaning-space, while “king” and “broccoli” do not.
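One common way to measure “near each other” is cosine similarity, which compares the direction of two vectors. The sketch below uses three hand-picked 3-dimensional vectors purely for illustration; real embeddings are learned during training and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, invented for this example (not taken from any real model).
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "broccoli": [0.1, 0.05, 0.95],
}

royal = cosine_similarity(embeddings["king"], embeddings["queen"])
veg = cosine_similarity(embeddings["king"], embeddings["broccoli"])
print(royal > veg)  # → True
```

With these made-up numbers, “king” and “queen” point in almost the same direction while “king” and “broccoli” are close to perpendicular.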

But there’s a catch. Because attention looks at all tokens simultaneously, the model has no built-in sense of order — “dog bites man” and “man bites dog” would look identical, which is a problem if you’re the man. So transformers add positional information: a signal stitched into each embedding that says “I’m token number 3.” Word and place, bundled together.
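One way to stitch in that signal is the sinusoidal encoding used in the original paper, where each position gets a fixed pattern of sines and cosines that is added to the token embedding. The sketch below assumes that scheme (learned position embeddings are another option), and the 4-dimensional word vectors in `table` are invented for illustration.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal position signal: sine on even indices, cosine on odd ones."""
    encoding = []
    for i in range(dim):
        # Indices 2k and 2k+1 share the same frequency, 1 / 10000^(2k / dim).
        angle = position / (10000 ** ((i - i % 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def embed(words, table):
    """Add each word's position signal to its embedding."""
    dim = len(next(iter(table.values())))
    return [[e + p for e, p in zip(table[word], positional_encoding(pos, dim))]
            for pos, word in enumerate(words)]

# Toy word embeddings, made up for this example.
table = {
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
}

a = embed(["dog", "bites", "man"], table)
b = embed(["man", "bites", "dog"], table)

print(a[0] == b[2])  # "dog" at position 0 vs "dog" at position 2 → False
print(a[1] == b[1])  # "bites" at position 1 in both sentences → True
```

Without the position signal, both sentences would hand the model the same three vectors. With it, “dog” as the first word and “dog” as the last word are different inputs, which is what lets the model tell who bit whom.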

Why parallelism let them eat the world

Here’s the payoff. Because training processes every token in a sequence at once rather than in a queue, transformers map beautifully onto GPUs, which love doing thousands of identical operations in parallel. More parallelism means you can train bigger models on more data, faster. And it turns out that as you scale transformers up with more parameters, data, and compute, their performance keeps improving rather than quickly plateauing. That property is a big part of why we got GPT-style large language models at all.

And it didn’t stop at text. Chop an image into patches, treat each patch as a “token,” and you get the Vision Transformer; one scaled to two billion parameters hit what was then state-of-the-art ImageNet accuracy. The same architecture now underpins language models, image generators, protein-folding tools, and speech systems. One idea, an embarrassing number of trophies (which, yes, fit fine in their own suitcases).
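The patch trick is simple enough to sketch. The function below, with the hypothetical name `image_to_patches`, cuts a tiny greyscale “image” (a grid of numbers) into square patches and flattens each one into a vector. A real Vision Transformer would then project each flattened patch to an embedding and add position information, which this sketch leaves out.

```python
def image_to_patches(image, patch_size):
    """Split a 2D grid of pixel values into flattened square patches, row by row."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0, \
        "image sides must be multiples of patch_size"
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A 4x4 toy image whose pixel values are just 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
for patch in patches:
    print(patch)
# → [0, 1, 4, 5]
# → [2, 3, 6, 7]
# → [8, 9, 12, 13]
# → [10, 11, 14, 15]
```

Each of the four printed lists plays the role of one “token”, so the rest of the transformer can treat the image like a four-word sentence.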

The takeaway

If you remember one thing, make it this: a transformer is a machine that, for every piece of input, decides what else in the input is worth paying attention to — and does it for all pieces simultaneously. Attention gives it understanding; parallelism gives it scale. RNNs read with a torch; transformers flip on the floodlights.

Next time an AI nails a tricky pronoun or summarises a 40-page PDF without losing the plot, you’ll know the trick: it isn’t magic, it’s just very, very good at knowing where to look.
