k-Nearest Neighbors (kNN), Explained

There’s an old saying that you’re the average of the five people you spend the most time with. k-Nearest Neighbors takes that idea, makes it mathematically literal, and turns it into one of the simplest classifiers in machine learning. Want to know what something is? Look at the things closest to it and let them vote. That’s the whole algorithm. No training loop, no loss function to minimise, no gradient descent grinding through epochs. kNN is the friend who refuses to do any work until the very last second — and then does surprisingly well anyway.

The laziest learner in the room

Most models study hard upfront: they pore over the training data, fit parameters, and distil everything into a tidy set of weights. kNN skips all of that. It just memorises the training set and goes back to sleep. This is why it’s affectionately called a lazy learner — all the real computation is deferred to prediction time.

When a new point shows up, kNN finally rolls up its sleeves:

1. Measure the distance from the new point to every point in the training set.
2. Grab the k closest ones.
3. For classification, take a majority vote of their labels. For regression, take the average of their values.

That’s it. The “model” is the dataset itself. (Quick note to avoid a common mix-up: kNN is not k-means. They both start with “k” and both involve distances, but k-means is unsupervised clustering, while kNN is supervised prediction. Different beasts, different jobs.)
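To make those three steps concrete, here is a minimal from-scratch sketch using only the Python standard library. The function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are invented for illustration, and Euclidean distance is assumed; a real project would more likely reach for a library implementation.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Return the indices of the k training points closest to `query` (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return order[:k]


def knn_classify(train_X, train_y, query, k=3):
    """Majority vote over the labels of the k nearest training points."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Average of the target values of the k nearest training points."""
    nearest = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in nearest) / len(nearest)


# Toy data (made up for illustration): two small clusters in 2-D.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["blue", "blue", "blue", "red", "red", "red"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]

print(knn_classify(points, labels, (2.0, 2.0), k=3))  # blue
print(knn_regress(points, values, (6.2, 6.4), k=3))   # 51.0
```

Notice there is no `fit` step at all: the "training" is simply keeping `points` around, and every prediction sorts the whole training set by distance.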

“Nearest” depends on how you measure

Everything hinges on what “close” means, and that’s a choice. The default is Euclidean distance — the straight-line, as-the-crow-flies distance you learned in school. But it’s not the only option:

Manhattan distance sums the absolute differences along each axis, like a taxi navigating a grid of city blocks. Because it does not square the differences, one unusually large gap in a single feature counts for less than it does under Euclidean distance.
Cosine similarity cares about the angle between vectors rather than their magnitude, which makes it a favourite for text and high-dimensional embeddings. To use it as a distance, take 1 minus the similarity.

Pick the metric that matches your data’s geometry, not just whichever one is the default.
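Here is a small sketch of the three metrics side by side, in plain Python. The function names are my own, and the cosine version is written as a distance (1 minus the similarity) so that smaller always means closer.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity. Undefined if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a, b = (3.0, 4.0), (6.0, 8.0)   # b points the same way as a, but is twice as long
print(euclidean(a, b))          # 5.0
print(manhattan(a, b))          # 7.0
print(cosine_distance(a, b))    # 0.0 (same direction, so "identical" by angle)
```

The same pair of points is 5 apart, 7 apart, or not apart at all, depending on the metric, which is why the choice deserves a moment's thought.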

Choosing k: the Goldilocks problem

k is the one knob that matters most, and it’s a balancing act. Set k = 1 and every prediction simply copies its single nearest neighbour, noisy outliers included, so the decision boundary becomes a jagged, overfit mess. Crank k too high and you smooth everything into mush; once k reaches the size of the training set, you are just predicting the majority class for everyone.

A few rules of thumb:

Start somewhere around the square root of your sample size and tune from there.
Use an odd k for binary classification so votes can’t tie.
Let cross-validation pick the winner rather than your gut.

Small k = low bias, high variance. Large k = high bias, low variance. The sweet spot lives in between.
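As a rough illustration of letting the data pick k, the sketch below scores a handful of odd candidates with leave-one-out cross-validation, the simplest form of cross-validation to write by hand. Everything here is an assumption made for the example: the synthetic two-cluster data, the candidate list, and the `loo_accuracy` helper. On a real dataset, k-fold cross-validation from a library is the more usual route.

```python
import math
import random
from collections import Counter


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the other points."""
    correct = 0
    for i, query in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        others.sort(key=lambda j: math.dist(X[j], query))
        votes = Counter(y[j] for j in others[:k])
        if votes.most_common(1)[0][0] == y[i]:
            correct += 1
    return correct / len(X)


# Synthetic data (invented): two overlapping Gaussian clusters, 40 points each.
rng = random.Random(0)
X, y = [], []
for label, centre in ((0, 0.0), (1, 2.0)):
    for _ in range(40):
        X.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
        y.append(label)

# Odd candidates bracketing sqrt(80), which is roughly 9.
candidates = [1, 3, 5, 7, 9, 11, 15, 21]
scores = {k: loo_accuracy(X, y, k) for k in candidates}
best_k = max(scores, key=scores.get)

for k in candidates:
    print(f"k={k:2d}  accuracy={scores[k]:.3f}")
print("best k:", best_k)
```

The exact winner depends on the data and the random seed, which is rather the point: the scores, not intuition, decide.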

Why scaling is non-negotiable

Here’s the trap that catches almost everyone. Suppose you’re classifying people using age (range 0–100) and salary (range 0–200,000). Distance is computed across all features at once, so salary’s enormous numbers will utterly dominate the calculation. Age might as well not exist. The “nearest” neighbour is really just the nearest salary, and your model is quietly broken.

The fix is to scale your features — standardise them to comparable ranges before computing any distances — so every feature gets a fair say. With kNN, forgetting to scale isn’t a minor oversight; it’s the difference between a working model and noise.
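To see the trap in miniature, here is a sketch with four made-up people described by (age, salary). The numbers and helper names are invented for illustration, and the scaling is a plain z-score computed from the training points only.

```python
import math
import statistics

# Hypothetical people as (age, salary); values invented for illustration.
people = [(25, 50_000), (60, 51_000), (27, 58_000), (58, 90_000)]
query = (26, 51_500)


def nearest(points, q):
    """Index of the point closest to q by Euclidean distance."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))


def standardise(points, extra):
    """Z-score each column using the mean and std of `points` only."""
    cols = list(zip(*points))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]

    def scale(p):
        return tuple((v - m) / s for v, m, s in zip(p, means, stds))

    return [scale(p) for p in points], scale(extra)


raw_idx = nearest(people, query)
scaled_people, scaled_query = standardise(people, query)
scaled_idx = nearest(scaled_people, scaled_query)

print("raw nearest:   ", people[raw_idx])     # (60, 51000)
print("scaled nearest:", people[scaled_idx])  # (25, 50000)
```

On raw numbers, the 26-year-old's "nearest neighbour" is a 60-year-old who happens to earn 500 less. After standardising, it is the 25-year-old with a similar salary, which is what most of us would have guessed.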

The catch: the curse of dimensionality

kNN has an Achilles’ heel, and it has a wonderfully dramatic name: the curse of dimensionality. As you add features, the volume of the space grows exponentially, and your data points scatter into a vast emptiness. The cruel consequence is that pairwise distances start to converge: the nearest neighbour ends up barely closer than the farthest one. Once “near” and “far” mean roughly the same thing, the entire premise of kNN collapses.
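You can watch distances converge with a quick experiment. The sketch below scatters random points uniformly in a unit cube and compares the nearest and farthest distance from one random query point; the point count, seed, and function name are arbitrary choices for the demo, not anything canonical.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from one random query point.

    A ratio near 0 means "near" and "far" are clearly different;
    a ratio near 1 means they have become almost the same thing.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(p, query) for p in points]
    return min(dists) / max(dists)


for dim in (2, 10, 100, 1000):
    print(f"{dim:5d} dimensions: nearest/farthest = {nearest_to_farthest_ratio(dim):.3f}")
```

In two dimensions the ratio should come out small; by a thousand dimensions it should sit much closer to 1, meaning the "nearest" neighbour is hardly nearer than anyone else.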

The second catch is speed. Because a brute-force prediction compares the new point against the whole training set, the cost of each query grows with both the number of training points and the number of features, so kNN gets sluggish as your data grows. Lazy at training time, expensive at inference. (Tricks like KD-trees and approximate nearest-neighbour search soften the blow, but the fundamental tension remains.)

Seeing it in action

In scikit-learn, the whole thing is refreshingly short:

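from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Example data (an assumption for illustration; any labelled feature matrix works).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling first, then kNN, bundled so we never forget to scale.
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
)

model.fit(X_train, y_train)          # fits the scaler, then stores the scaled training data
predictions = model.predict(X_test)  # the neighbour search happens here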

That StandardScaler sitting in front isn’t decoration — it’s the single most important line for getting sane results.

The takeaway

kNN is the perfect first algorithm: no training, no math intimidation, just “look around and ask the neighbours.” When you reach for it, remember three things: always scale your features, tune k with cross-validation (odd numbers for binary problems), and be wary once your feature count climbs, because the curse of dimensionality is real and patient. Use it on low-dimensional, well-scaled data and it can hold its own against far more complex models. You really are who your neighbours are.
