Embeddings and Vector Databases: The Memory Behind RAG

Large language models are dazzling conversationalists with the long-term memory of a goldfish. Ask one about your company’s internal handbook and it will cheerfully invent answers, because it has never read it. Retrieval-Augmented Generation (RAG) fixes this by handing the model the right pages just before it answers. But “the right pages” implies you can find them, and finding meaning rather than keywords is exactly where embeddings and vector databases earn their keep.

From words to coordinates

An embedding turns a chunk of text into a long list of numbers, a vector that captures its meaning rather than its spelling. OpenAI’s text-embedding-3-small, for instance, maps any input into 1,536 dimensions by default; its bigger sibling text-embedding-3-large uses 3,072. The magic is that the model learns to place related ideas near each other. “I love my dog” and “my puppy is the best” land in roughly the same neighbourhood, while “quarterly tax filing” sulks off in a different postcode entirely. Barely a shared word between them; the meaning does the work.

Once everything lives in the same coordinate space, “similar” becomes a measurable distance. The usual yardstick is cosine similarity: the cosine of the angle between two vectors, ranging from -1 (pointing in opposite directions) to 1 (pointing the same way). Because it looks only at direction, vector length never enters into it. A handy detail is that OpenAI’s vectors come pre-normalised to length 1, so cosine similarity is just a dot product, and it ranks results identically to Euclidean distance (for unit vectors, the squared distance is simply 2 minus twice the cosine). Closer angle, closer meaning. That’s the whole intuition.

Here’s the idea in a few lines of Python:

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text):
    """Return the embedding of `text` as a NumPy vector (one API call per text)."""
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(resp.data[0].embedding)

a = embed("I adore my golden retriever")
b = embed("my puppy is the best thing ever")
c = embed("the invoice is overdue again")

# Vectors are already length-1, so cosine == dot product
print("dog vs dog :", a @ b)   # ~0.6, clearly related
print("dog vs bill:", a @ c)   # ~0.1, strangers

Run it and the two dog sentences score far higher than either does against the invoice. You just measured meaning with arithmetic, which never stops being a little bit magical.
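
If the earlier claim that cosine similarity and Euclidean distance rank unit vectors identically feels like a leap, it can be checked offline with no API calls at all. The sketch below uses random unit vectors as stand-ins for real embeddings (an assumption made purely for illustration) and compares the two orderings:

```python
import numpy as np

# Random unit vectors stand in for real embeddings (illustrative assumption).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = rng.normal(size=64)
query /= np.linalg.norm(query)

cosine = vectors @ query                              # dot product == cosine for unit vectors
euclidean = np.linalg.norm(vectors - query, axis=1)   # straight-line distance

# For unit vectors: |a - b|^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(euclidean ** 2, 2 - 2 * cosine)

by_cosine = np.argsort(-cosine)       # highest similarity first
by_euclidean = np.argsort(euclidean)  # smallest distance first

print("identity holds:", identity_holds)
print("same top 10   :", np.array_equal(by_cosine[:10], by_euclidean[:10]))
```

The identity is why the choice of metric matters less than it first appears, as long as the vectors really are normalised.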

Why you need a database for this

With ten vectors, you can compare against all of them in a loop and call it a day. With ten million, that brute-force scan becomes a small eternity on every single query. This is where a vector database steps in. Its job is approximate nearest neighbour (ANN) search: finding the closest vectors fast by accepting “very nearly the closest” instead of demanding mathematically perfect results.
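
For reference, the brute-force scan is only a few lines. This is a minimal sketch, assuming unit-length vectors held in one NumPy array; the function name and the random stand-in data are illustrative. Every query scores every stored vector, which is exactly the cost that grows with the size of the collection:

```python
import numpy as np

def brute_force_top_k(vectors, query, k=5):
    """Exact nearest neighbours: score every stored vector against the query."""
    scores = vectors @ query                   # one dot product per stored vector
    top = np.argpartition(-scores, k)[:k]      # the k best, in no particular order (needs k < len)
    return top[np.argsort(-scores[top])]       # sort just those k, best first

# Random unit vectors stand in for a real collection (illustrative assumption).
rng = np.random.default_rng(0)
stored = rng.normal(size=(10_000, 64))
stored /= np.linalg.norm(stored, axis=1, keepdims=True)

probe = stored[42]                             # query with a vector we know is in there
print(brute_force_top_k(stored, probe, k=3))   # index 42 should come first
```

The answer is always exact, which makes this a useful baseline for measuring the recall of any approximate index.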

The workhorse algorithm is HNSW (Hierarchical Navigable Small World). Picture a stack of social networks: a sparse top layer of well-connected hubs lets you teleport across the space in a few hops, then progressively denser layers refine the search until you land on your neighbours. You navigate from coarse to fine, touching a tiny fraction of the data instead of all of it. The result is millisecond lookups over millions of vectors, with recall you can tune by trading a little accuracy for a lot of speed.
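
To make the graph-walking idea concrete, here is a deliberately simplified sketch. It is not HNSW itself: there is a single layer rather than a hierarchy, the graph is built by brute force, and the ring links exist only to keep the toy graph connected. What it does show is the core move, a best-first walk that keeps the `ef` most promising nodes seen so far and stops once nothing unexplored can beat them. All names and parameters here are illustrative.

```python
import heapq
import numpy as np

def build_toy_graph(vectors, m=8):
    """Link each vector to its m nearest neighbours, plus a ring for connectivity."""
    n = len(vectors)
    sims = vectors @ vectors.T
    np.fill_diagonal(sims, -np.inf)                # a node is not its own neighbour
    nearest = np.argsort(-sims, axis=1)[:, :m]
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in nearest[i].tolist():
            graph[i].add(j)
            graph[j].add(i)                        # make links bidirectional
        ring = (i + 1) % n                         # toy-only link, not part of HNSW
        graph[i].add(ring)
        graph[ring].add(i)
    return graph

def graph_search(vectors, graph, query, k=5, ef=32, entry=0):
    """Best-first search; returns (top-k ids, number of vectors scored)."""
    def dist(i):
        return 1.0 - float(vectors[i] @ query)     # cosine distance for unit vectors

    visited = {entry}
    d0 = dist(entry)
    candidates = [(d0, entry)]                     # min-heap: closest unexpanded node first
    best = [(-d0, entry)]                          # max-heap by negation: worst kept node first
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                                  # nothing left can improve the kept set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)            # drop the current worst
    top = sorted((-neg, i) for neg, i in best)[:k]
    return [i for _, i in top], len(visited)

# Random unit vectors stand in for real embeddings (illustrative assumption).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = rng.normal(size=32)
query /= np.linalg.norm(query)

graph = build_toy_graph(vectors)
found, scored = graph_search(vectors, graph, query, k=5, ef=32)
print(f"top-5 ids {found}, scored {scored} of {len(vectors)} vectors")
```

Raising `ef` scores more vectors and finds better neighbours; lowering it does the opposite. That is the accuracy-for-speed dial in miniature. A real HNSW index adds the layered hubs and builds the graph incrementally instead of by exhaustive comparison.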

A vector database also handles the unglamorous-but-essential parts: storing metadata alongside vectors, filtering (“only docs from this user, after this date”), and, increasingly, hybrid search, which blends classic keyword matching (such as BM25) with semantic vectors so you catch both exact terms and fuzzy meaning.
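
How the blending happens varies from engine to engine. One widely used approach is reciprocal rank fusion (RRF), sketched below; it is offered as an example of the idea, not as what any particular database does internally. Each document earns 1 / (k + rank) from every result list it appears in, and the totals decide the final order. The document ids are made up, and k = 60 is the conventional default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical BM25 results, best first
vector_hits = ["doc_c", "doc_a", "doc_d"]    # hypothetical semantic results, best first

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both searches agree on rise to the top, and because only ranks are used, the incompatible score scales of BM25 and cosine similarity never need reconciling.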

A quick tour of the options

The field is crowded in 2026, but a few names keep coming up:

pgvector — a Postgres extension. If your data already lives in Postgres, this is the path of least resistance: ACID guarantees, no new system to babysit, and perfectly capable up to millions of vectors.
Qdrant — open source with the strongest metadata-filtering performance in its class. A great default when queries are heavily filtered.
Weaviate — the hybrid-search champion, fusing BM25, dense vectors, and metadata filters in a single query.
Milvus — built for the eye-watering scale where lighter systems fall over.
Pinecone — fully managed, so it handles sharding, backups, and tuning while you sleep.
Chroma and FAISS — beloved for prototyping and notebooks before you commit to anything heavier.
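
To give a flavour of the first option on that list, here is a rough sketch of a pgvector setup driven from Python. It assumes a Postgres database with the pgvector extension available, a hypothetical table called `chunks`, and a psycopg-style connection object passed in by the caller; none of these names come from a real project. The `<=>` operator is pgvector’s cosine distance, so smaller means closer.

```python
# Run once against the database (for example via psql). The table and index
# names are hypothetical; vector(1536) matches text-embedding-3-small.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    body      text NOT NULL,
    embedding vector(1536)
);
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

SEARCH_SQL = """
SELECT id, body, embedding <=> %s::vector AS cosine_distance
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""

def to_vector_literal(values):
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

def search(conn, query_embedding, k=5):
    """Return the k nearest chunks; `conn` is assumed to be a psycopg-style connection."""
    literal = to_vector_literal(query_embedding)
    with conn.cursor() as cur:
        cur.execute(SEARCH_SQL, (literal, literal, k))
        return cur.fetchall()
```

Metadata filters are then ordinary SQL: add a WHERE clause on whatever columns sit next to the embedding.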

The takeaway

Embeddings turn meaning into geometry; vector databases make that geometry searchable at scale. If you’re building RAG, here’s the rule of thumb: start with text-embedding-3-small and pgvector. The embedding model is cheap (around $0.02 per million tokens), pgvector is boring in the best way, and the pair will carry you comfortably into the millions of documents. Only when you outgrow them should you reach for a purpose-built engine, choosing by your real bottleneck: filtering (Qdrant), hybrid search (Weaviate), raw scale (Milvus), or hands-off ops (Pinecone). Don’t pay for a distributed cluster to give your chatbot the memory it should have had all along.
