Natural Language Processing (NLP), Explained

Human language is a glorious mess. We use “bank” for both rivers and money, end sentences with prepositions our teachers warned us about, and somehow understand “yeah, right” as sarcasm. Computers, meanwhile, prefer numbers that behave. Natural Language Processing (NLP) is the field that bridges that gap: it teaches machines to read, interpret, and generate human language without losing their minds. Every time autocomplete finishes your sentence, your inbox quietly sweeps spam away, or a chatbot answers at 2 a.m., NLP is doing the heavy lifting.

What NLP actually does

Strip away the hype and NLP is a collection of well-defined tasks. The classics show up everywhere:

Text classification — is this email spam or not? Is this review positive or negative? Sorting text into buckets.
Sentiment analysis — a flavour of classification that reads the emotional temperature of a sentence. Brands use it to find out, in bulk, exactly how angry the internet is today.
Named Entity Recognition (NER) — pulling out the who, where, and when: people, companies, places, dates. The unglamorous backbone of every “extract the data from this contract” tool.
Machine translation — turning English into Japanese and back, the task that quietly went from comically bad to eerily good in a decade.
Summarization — squashing ten paragraphs into three sentences without lying about the content (mostly).

These tasks haven’t changed much. What changed dramatically is how we solve them.

The journey: from counting words to understanding them

Bag-of-words and TF-IDF

In the beginning, computers treated a sentence like a shopping list. Bag-of-words simply counts how often each word appears and throws grammar in the bin. “The dog bit the man” and “the man bit the dog” look identical to it — same words, same counts, two very different days.
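Here is a minimal sketch of that blind spot, using only the standard library. The helper name bag_of_words is made up for illustration, and splitting on whitespace is a simplifying assumption; real pipelines also deal with punctuation and other details.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lowercase, whitespace-separated words; word order is discarded."""
    return Counter(sentence.lower().split())


a = bag_of_words("The dog bit the man")
b = bag_of_words("the man bit the dog")

print(a)       # word counts for the first sentence
print(a == b)  # True: both sentences produce exactly the same counts
```

Because the counts are all the model ever sees, nothing downstream can tell who bit whom.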

TF-IDF (Term Frequency–Inverse Document Frequency) added a clever twist: weight words by how distinctive they are. Common words like “the” get crushed toward zero; rare, telling words like “refund” or “lawsuit” get boosted. It’s still a remarkably strong baseline for classification, and you can spin one up in a few lines:

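from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny example documents.
corpus = [
    "I love this product, absolutely fantastic",
    "Terrible experience, I want a refund",
    "Great value and fast delivery",
]

# Drop common English stop words, then learn the vocabulary and IDF weights.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

print(vectorizer.get_feature_names_out())  # learned vocabulary, in column order
print(X.toarray().round(2))                # TF-IDF weights, one row per document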

Feed that matrix, along with a label for each document, into a classifier and you have a working sentiment model before your coffee cools. The catch? It still has no idea that “fantastic” and “great” are basically the same thing.
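As a rough sketch of that "matrix plus classifier" step, the snippet below chains TfidfVectorizer with scikit-learn's LogisticRegression. The four training sentences and their labels are invented for illustration, and logistic regression is just one reasonable choice of simple classifier, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 1 = positive, 0 = negative.
train_texts = [
    "I love this product, absolutely fantastic",
    "Great value and fast delivery",
    "Terrible experience, I want a refund",
    "Awful quality, broke after one day",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text with TF-IDF, then fits the classifier on the result.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(train_texts, train_labels)

# Every content word here appears only in the positive training examples.
prediction = model.predict(["fantastic product, great value"])
print(prediction)
```

Four sentences are far too few for a real model, of course; the point is only that the wiring is this short.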

Word embeddings

In 2013, Word2Vec changed the game by mapping each word to a dense vector (a list of numbers) where similar meanings land near each other in space. Suddenly “king − man + woman ≈ queen” became a party trick that actually worked. Words gained meaning by neighbourhood. The limitation: each word got exactly one vector, so “bank” was stuck being an awkward average of finance and rivers forever.
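The arithmetic behind that party trick can be sketched with hand-made vectors. The three-dimensional numbers below are invented for this example and are not real Word2Vec output (real embeddings have hundreds of learned dimensions); the cosine helper is the standard cosine similarity.

```python
import math

# Hand-made toy vectors, NOT real Word2Vec output.
# Loosely: [royalty, maleness, femaleness].
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.05, 0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# king - man + woman, computed dimension by dimension.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# As is usual for analogy queries, the three input words are excluded from the search.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```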

Transformers and LLMs

Then came 2017 and a Google paper with the swaggering title “Attention Is All You Need.” It introduced the Transformer, an architecture whose attention mechanism lets a model weigh every word against every other word in the sentence at once. Now “bank” gets a different representation depending on whether “river” or “deposit” is nearby. Context, finally.
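A stripped-down sketch of the attention idea, in plain Python. It makes several simplifying assumptions: the two-dimensional embeddings are invented, and queries, keys, and values are all just the raw embeddings. A real Transformer learns separate projections for each, runs many attention heads in parallel, and adds positional information.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with queries = keys = values = embeddings."""
    dim = len(embeddings[0])
    outputs = []
    for query in embeddings:
        # Score this word against every word in the sequence (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in embeddings]
        weights = softmax(scores)
        # The new representation is a weighted mix of all the word vectors.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, embeddings)) for i in range(dim)]
        outputs.append(mixed)
    return outputs


# Invented toy embeddings: "bank" starts out halfway between the two senses.
toy = {"river": [1.0, 0.0], "deposit": [0.0, 1.0], "bank": [0.5, 0.5]}

bank_near_river = self_attention([toy["river"], toy["bank"]])[1]
bank_near_deposit = self_attention([toy["deposit"], toy["bank"]])[1]

print(bank_near_river)    # pulled toward the "river" direction
print(bank_near_deposit)  # pulled toward the "deposit" direction
```

The same starting vector for "bank" ends up in two different places depending on its neighbour, which is the whole trick in miniature.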

Transformers scaled beautifully, and scaling them up gave us the Large Language Models (LLMs) behind today’s chatbots and translators. Modern generative models are, at heart, relentless next-word predictors (strictly, next-token predictors) trained on staggering amounts of text, and that one simple objective, done at scale, produces something that feels uncannily like understanding.
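To get a feel for the objective itself, here is a deliberately tiny stand-in: a count-based bigram model that predicts the next word from the previous one. This is not how an LLM works internally (those use a neural network over long contexts); it only illustrates what "predict the next word" means. The corpus and the predict_next helper are made up for the example.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; real models train on vastly more text.
corpus = "the dog bit the man . the dog chased the cat . the dog barked ."

# For each word, count which words follow it.
follow_counts = defaultdict(Counter)
tokens = corpus.split()
for current, following in zip(tokens, tokens[1:]):
    follow_counts[current][following] += 1


def predict_next(word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = follow_counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


print(predict_next("the"))  # dog
```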

Tokenization: the unglamorous first step

Before any of this, text has to become numbers, and that starts with tokenization — chopping text into pieces. Modern models rarely split on whole words; they use subword schemes so they never choke on a word they’ve never seen. The big three:

BPE (Byte-Pair Encoding) — used by the GPT family
WordPiece — used by BERT
SentencePiece — used by T5

So “tokenization” might become token + ization, and a made-up word like “claudeify” still gets handled gracefully by gluing known fragments together. This is also why you’re billed by the token, not the word — and why emoji can be surprisingly expensive. With a modern library, the whole pipeline collapses into a couple of lines:

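from transformers import pipeline

# With no model named, the library downloads a default English sentiment model on first use.
classifier = pipeline("sentiment-analysis")
print(classifier("NLP went from counting words to writing essays."))
# Example of the output shape (the label and score depend on the model):
# [{'label': 'POSITIVE', 'score': 0.99}]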

That one call quietly tokenizes the text, runs a Transformer, and hands back a verdict.
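To make the subword splitting described earlier a little more concrete, here is a toy splitter. It uses greedy longest-match against a hand-picked vocabulary, which is similar in spirit to how WordPiece splits a word at inference time; it is not BPE, and it is not the code any real tokenizer runs. The vocabulary and the split_into_subwords helper are hypothetical, and real tokenizers learn their vocabularies from data.

```python
# Hypothetical subword vocabulary, chosen by hand for this example.
VOCAB = {"token", "ization", "claude", "ify"}


def split_into_subwords(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; falls back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known fragment starts here: emit one character and move on.
            pieces.append(word[start])
            start += 1
    return pieces


print(split_into_subwords("tokenization", VOCAB))  # ['token', 'ization']
print(split_into_subwords("claudeify", VOCAB))     # ['claude', 'ify']
```

The single-character fallback is what keeps the splitter from ever failing on an unseen word, at the cost of producing more tokens.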

Where it shows up (everywhere)

NLP isn’t a lab curiosity — it’s in your search bar, your spam filter, your phone keyboard, your voice assistant, customer-support bots, translation apps, and the grammar checker nagging you right now. You interact with a dozen NLP systems before lunch and notice none of them, which is exactly the point.

The takeaway

Here’s the rule of thumb worth keeping: match the tool to the job, not the hype. For straightforward classification on a tidy dataset, a TF-IDF model plus a simple classifier is fast, cheap, explainable, and frequently good enough — reach for it first. Save the transformer pipeline for when meaning, context, and nuance genuinely matter. Knowing why the field climbed from word-counting to attention is what lets you pick the right rung on that ladder instead of always grabbing the most expensive one.

Sources: Attention Is All You Need / Transformer overview, A Brief Timeline of NLP from Bag of Words to the Transformer Family, Tokenization methods in LLMs.