Word Embeddings: word2vec and GloVe, Explained

Computers are brilliant at arithmetic and hopeless at meaning. To a raw program, “cat” and “dog” are just different strings, no more related than “cat” and “thermostat”. So before a machine could do anything clever with language, somebody had to teach it that words have neighbours, families, and rivalries. The trick that cracked it was disarmingly simple: turn every word into a list of numbers and let geometry do the heavy lifting.

That list of numbers is a word embedding, and it remains one of the most elegant ideas in all of NLP.

Words as arrows in space

Imagine a giant room with, say, 300 dimensions (just go with it). Every word gets a fixed spot in that room, represented by a vector of 300 numbers. The whole point is that position encodes meaning. Words used in similar contexts end up as neighbours: “coffee” sits near “tea” and “espresso”, while “espresso” gives “elephant” a very wide berth.
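
To make "position encodes meaning" concrete, here is a minimal sketch using cosine similarity, the usual way to measure how close two word vectors are. The three-dimensional vectors are hand-picked for illustration (an assumption, not output from any trained model), and the helper names `cosine_similarity` and `neighbours` are made up for this example.

```python
import math

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
# Real embeddings have hundreds of dimensions and are learned from text.
toy_vectors = {
    "coffee":   [0.9, 0.8, 0.1],
    "tea":      [0.8, 0.9, 0.2],
    "espresso": [0.95, 0.7, 0.05],
    "elephant": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def neighbours(word, vectors):
    """Rank every other word by cosine similarity to `word`, closest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# With these toy numbers, the drinks rank well above "elephant".
for other, score in neighbours("espresso", toy_vectors):
    print(f"{other:10s} {score:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is why it is the default choice for this kind of lookup.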

The guiding principle is a 1950s linguistics slogan from John Firth: “You shall know a word by the company it keeps.” If two words keep showing up surrounded by the same other words, they probably mean similar things. Embeddings turn that intuition into coordinates.

And once words are vectors, you can do arithmetic on them. This is where the party trick comes in.

king − man + woman ≈ queen

The single most famous demo in the field: take the vector for king, subtract man, add woman, and the nearest vector to your result is queen. The model was never told about gender or royalty. It just absorbed enough text that the direction from “man” to “woman” turned out to be roughly the same direction as “king” to “queen”.

It feels like magic, so one honest caveat: the standard implementation of this nearest-neighbour lookup excludes the input words from the answer. If it didn’t, “king − man + woman” would often land closest to… king itself. Strip the originals out and “queen” rises to the top. The analogy is real and lovely, but it has a little stage management behind the curtain. The deeper reason it works at all is that consistent relationships in language (gender, tense, capital-cities-of-countries) show up as consistent directions in the vector space.
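
The stage management is easy to see in a toy version. The sketch below uses hand-made vectors (hypothetical values, chosen so that the effect shows up) and a made-up helper called `analogy`. It uses raw vector addition and subtraction, which is a simplification of what real libraries do.

```python
import math

# Hypothetical hand-made vectors; not taken from any trained model.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.3, 0.4],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.6, 0.2],
    "apple": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors, exclude_inputs=True):
    """Return the word closest to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Optionally drop the query words themselves from the candidate list.
    skipped = set(positive) | set(negative) if exclude_inputs else set()
    candidates = [w for w in vectors if w not in skipped]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(["king", "woman"], ["man"], toy, exclude_inputs=False))  # king
print(analogy(["king", "woman"], ["man"], toy, exclude_inputs=True))   # queen
```

With these numbers the result vector barely moves away from "king", so "king" wins unless it is removed from the running, which is the same behaviour described above.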

word2vec: learn by guessing

The model that lit the fuse was word2vec, released by Tomas Mikolov and colleagues at Google in 2013. It’s prediction-based: it slides a window across billions of words and plays a fill-in-the-blank game, nudging the vectors every time it guesses. It comes in two flavours:

CBOW (Continuous Bag of Words): given the surrounding context, predict the missing middle word. (“The cat sat on the ___.”) It’s fast and does nicely on common words.
Skip-gram: the mirror image — given one word, predict its neighbours. It’s slower but shines on rare words and small datasets.
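
The difference between the two flavours is easiest to see in the training examples each one generates. Here is a rough sketch with a fixed window of two words on each side. The function names are invented for this post, and the real word2vec adds refinements (such as varying the window size and down-sampling very frequent words) that are left out here.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word): the context predicts the middle word."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

def skipgram_examples(tokens, window=2):
    """Yield (center_word, context_word): one word predicts each neighbour."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]

sentence = "the cat sat on the mat".split()
cbow = list(cbow_examples(sentence))
skipgram = list(skipgram_examples(sentence))

print(cbow[2])        # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram[5:9])  # the four pairs whose centre word is 'sat'
```

CBOW produces one example per position, while skip-gram produces one per (word, neighbour) pair. That is part of why skip-gram is slower, and also why a rare word gets more individual updates from it.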

Either way, the magic is a side effect. word2vec never directly tries to learn meaning; it just gets better at a guessing game, and useful vectors fall out as a by-product. Train long enough and the geometry above emerges on its own.
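
To show what "nudging the vectors" can look like, here is a toy sketch of the update used in one common training variant, skip-gram with negative sampling: real (word, neighbour) pairs are pulled together and randomly chosen non-neighbours are pushed apart. The starting vectors, learning rate, word pairs, and function names are all assumptions made for illustration. A real run starts from random vectors and loops over a whole corpus.

```python
import math

# Hand-picked starting values standing in for random initialisation.
# Skip-gram keeps two tables: a vector per word as a centre word,
# and a separate vector per word as a context word.
center_vecs = {"cat": [0.1, -0.2, 0.3]}
context_vecs = {"sat": [0.2, 0.1, -0.1], "stock": [-0.1, 0.3, 0.2]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def probability(center, context):
    """Model's current belief that `context` really appears near `center`."""
    return sigmoid(dot(center_vecs[center], context_vecs[context]))

def sgd_step(center, context, label, lr=0.1):
    """Nudge both vectors so the probability moves towards label (1 real, 0 noise)."""
    v, u = center_vecs[center], context_vecs[context]
    error = sigmoid(dot(v, u)) - label  # gradient of the log loss w.r.t. the dot product
    center_vecs[center] = [vi - lr * error * ui for vi, ui in zip(v, u)]
    context_vecs[context] = [ui - lr * error * vi for vi, ui in zip(v, u)]

before_pos = probability("cat", "sat")
before_neg = probability("cat", "stock")

for _ in range(200):
    sgd_step("cat", "sat", label=1)    # observed pair: pull together
    sgd_step("cat", "stock", label=0)  # made-up negative sample: push apart

after_pos = probability("cat", "sat")
after_neg = probability("cat", "stock")
print(f"cat/sat:   {before_pos:.2f} -> {after_pos:.2f}")
print(f"cat/stock: {before_neg:.2f} -> {after_neg:.2f}")
```

Nothing in the loop mentions meaning. The only objective is to score real pairs higher than fake ones, and the geometry is whatever arrangement of vectors makes that possible.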

GloVe: count, don’t guess

A year later, in 2014, Stanford’s Jeffrey Pennington, Richard Socher, and Christopher Manning offered a different recipe: GloVe (Global Vectors). Their gripe with word2vec was that it learns from local windows one at a time, never stepping back to see the whole corpus at once.

GloVe is count-based. It first builds a giant co-occurrence matrix (a tally of how often every word appears near every other word across the entire corpus), then fits word vectors whose dot products approximate the logarithm of those counts, which amounts to a weighted factorisation of the matrix. Same destination, different vehicle: it bakes the global statistics in from the start rather than discovering them window by window. In practice the two methods produce vectors of broadly similar quality, and “prediction vs. count” became the great embeddings debate of the mid-2010s.
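
The counting half of that recipe fits in a few lines. The sketch below tallies co-occurrences over a three-sentence corpus invented for this post. The function name and the window size are assumptions, and it uses plain counts, whereas GloVe itself gives nearer words more weight. The second half, fitting vectors to these counts, is not shown.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Tiny made-up corpus, already split into tokens.
corpus = [
    "i drink coffee every morning".split(),
    "i drink tea every evening".split(),
    "the elephant drinks water".split(),
]
counts = cooccurrence_counts(corpus)

print(counts[("drink", "coffee")])     # 1
print(counts[("drink", "every")])      # 2
print(counts[("coffee", "elephant")])  # 0
```

Even at this scale the pattern is visible: "coffee" and "tea" never appear together, yet they share neighbours ("drink", "every"), and that shared company is what pulls their vectors close.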

Seeing it in action

You don’t need to train anything to play. The gensim library can fetch pretrained vectors for you (the first call downloads and caches them):

import gensim.downloader as api

# Pretrained GloVe vectors (50-dimensional, ~66MB)
model = api.load("glove-wiki-gigaword-50")

# The party trick
result = model.most_similar(positive=["king", "woman"], negative=["man"])
print(result[0])          # ('queen', 0.85...)

# Plain old similarity
print(model.similarity("coffee", "tea"))     # high
print(model.similarity("coffee", "asphalt")) # low

That most_similar call is king − man + woman, and it quietly drops the input words for you.

Static vs. contextual: knowing the limits

Here’s the catch that eventually dethroned these models: word2vec and GloVe are static. Each word gets exactly one vector, for life. So “bank” has a single embedding whether you’re depositing cash or fishing off a river bank. Context is lost.
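
A static model is, at heart, a lookup table, which makes the limitation easy to demonstrate. The table below is a stand-in with made-up numbers, and `embed` is a hypothetical helper, not a gensim function.

```python
# Hypothetical static embedding table: one fixed vector per word,
# no matter which sentence the word appears in.
static_table = {
    "bank":  [0.4, 0.7, 0.1],
    "cash":  [0.5, 0.8, 0.0],
    "river": [0.1, 0.2, 0.9],
}

def embed(sentence, table):
    """Look up each known word; the rest of the sentence is never consulted."""
    return {tok: table[tok] for tok in sentence.lower().split() if tok in table}

money_sense = embed("She deposited cash at the bank", static_table)
river_sense = embed("They fished from the river bank", static_table)

# Same word, same vector, despite two different meanings.
print(money_sense["bank"] == river_sense["bank"])  # True
```

A contextual model would instead compute the vector for "bank" from the whole sentence, so the two occurrences would come out different.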

Modern systems use contextual embeddings, from BERT and the transformer family, where a word’s vector shifts depending on the sentence around it. Vectors from that family of models, usually pooled into a single embedding per sentence or passage, are what power today’s semantic search and vector databases. (If that’s your interest, see the companion post on embeddings and vector databases for the modern, retrieval-focused side of the story.)

The takeaway

word2vec and GloVe are no longer state of the art, but they’re the cleanest way to get what embeddings actually are — and that intuition transfers straight to the contextual models running production NLP today. Your rule of thumb: reach for static embeddings (word2vec/GloVe via gensim) when you want something tiny, fast, and offline — keyword expansion, quick similarity, teaching a class. Reach for contextual embeddings when meaning depends on context, which for real applications is almost always. Either way, the mental model is the same: meaning is a place, and similarity is distance.
