k-Means Clustering, Explained

Picture a school gym at the start of a dance class. The teacher claps and shouts, “Everybody, form three groups!” Nobody knows who belongs with whom, so people shuffle toward whoever’s nearest. Then the teacher marks the centre of each huddle, claps again, and everyone re-shuffles to their closest centre. A few rounds of this and the groups settle. Congratulations — you’ve just watched k-means clustering. No labels, no answer key, just “sort yourselves out, and here’s how many piles I want.”

That’s the whole charm of it. k-means is the workhorse of unsupervised learning: you hand it a pile of unlabeled data and a number, and it discovers groups you never told it about.

The assign-and-update loop

k-means runs the same two steps over and over until things stop moving.

1. Pick k starting centres (called centroids). These are just points in your feature space, often chosen by a smart trick we’ll get to.
2. Assign: label every data point with its nearest centroid. Now you have k tentative clusters.
3. Update: move each centroid to the average position of the points assigned to it. (The “mean” in k-means: that’s literally where the name comes from.)
4. Repeat steps 2 and 3 until assignments stop changing.
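
If you’d like to see the loop with the lid off, here is a minimal NumPy sketch of it. The function name kmeans, the iteration cap, and the choice to start from k randomly picked data points are assumptions made for illustration; this is not how scikit-learn does it internally.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain assign-and-update loop; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Pick: use k distinct data points as the starting centroids.
    start = rng.choice(len(X), size=k, replace=False)
    centroids = X[start].astype(float)
    labels = None
    for _ in range(n_iters):
        # Assign: distance from every point to every centroid, keep the nearest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Repeat: stop as soon as no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Three made-up blobs to try it on.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans(X, k=3)
print(centroids.round(2))
```

Starting from random data points is the naive option; a different seed can give a different final grouping.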

Each round can only lower the total spread or leave it unchanged, never raise it, so the algorithm is guaranteed to converge, though only to a local optimum and not necessarily the best possible grouping. The thing it’s quietly minimising is WCSS, the within-cluster sum of squares, also called inertia: the total squared distance from every point to its own centroid. Lower inertia means tighter, more cohesive blobs.
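
To make “total squared distance from every point to its own centroid” concrete, this small sketch computes it by hand and compares it with the inertia_ value scikit-learn reports. The synthetic blobs and the variable names are assumptions for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Look up, for every point, the centroid of the cluster it was assigned to.
own_centroids = km.cluster_centers_[km.labels_]

# Sum of squared distances from each point to that centroid.
wcss = ((X - own_centroids) ** 2).sum()

print("by hand:", wcss)
print("sklearn:", km.inertia_)
```

The two numbers should agree up to floating-point rounding.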

One catch: where you drop those first centroids matters a lot. Land them badly and you can converge to a lousy arrangement. The fix is k-means++, a smarter initialisation that still picks centres at random but favours points far from the ones already chosen, so the starting centres end up spread apart instead of plopped down anywhere. It’s the default in scikit-learn, and you should basically never turn it off.
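
The core idea behind k-means++ fits in a few lines: pick the first centre at random, then pick each later centre with probability proportional to its squared distance from the nearest centre chosen so far. The sketch below is a simplified illustration with a made-up function name (kmeans_pp_init); scikit-learn’s own implementation is more elaborate.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Choose k starting centres from the rows of X by squared-distance sampling."""
    rng = np.random.default_rng(seed)
    # First centre: a data point chosen uniformly at random.
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centre.
        diffs = X[:, None, :] - np.array(centres)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points are proportionally more likely to be picked next.
        probs = d2 / d2.sum()
        centres.append(X[rng.choice(len(X), p=probs)])
    return np.array(centres)

# Three made-up blobs to try it on.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
init = kmeans_pp_init(X, k=3)
print(init.round(2))
```

A point that has already been chosen sits at distance zero from itself, so it can’t be drawn twice.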

Choosing k (the awkward part)

k-means won’t tell you how many clusters exist — you have to. Pick wrong and the algorithm cheerfully splits one real group in half or merges two distinct ones. Two tools help you guess well.

The elbow method. Run k-means for k = 1, 2, 3… and plot inertia against k. It always drops as k grows (more centroids, shorter distances), but the rate of improvement falls off a cliff at some point. That bend — the “elbow” — is a reasonable choice for k. The honest caveat: real data often gives you a gentle curve with no obvious joint, leaving you squinting at a chart trying to find an elbow that isn’t there.
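
Here’s a rough sketch of the elbow sweep on synthetic blobs. The range of k values, the dataset, and n_init=10 are choices made for the demo, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

ks = range(1, 9)
inertias = []
for k in ks:
    # Fit once per candidate k and record the inertia it reaches.
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Hand ks and inertias to whatever plotting library you like and look for the bend.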

The silhouette score. This measures, for each point, how close it is to its own cluster versus the nearest other cluster; the score for a whole clustering is the average over all points. It ranges from -1 to 1; closer to 1 means well-matched and well-separated. It needs at least two clusters to compare, so start the sweep at k = 2. Where the elbow asks “are clusters tight?”, the silhouette asks “are clusters distinct?” Run both, and when they agree on a k, you can stop second-guessing yourself.
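
The same kind of sweep works for the silhouette. This sketch starts at k = 2 because the score isn’t defined for a single cluster; the dataset and the range of k are again assumptions for the demo.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette over all points for this value of k.
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")

# The k with the highest average silhouette is the candidate to compare with the elbow.
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```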

The fine print: where k-means trips up

k-means is fast and intuitive, but it makes assumptions that bite if you ignore them.

It loves spherical, similarly-sized blobs. Because it judges everything by distance to a centre, it struggles with long, snaking, or oddly-shaped clusters. Two crescents interlocking? k-means will slice straight through them.
Scaling matters — a lot. Distance is the only thing k-means understands, so a feature measured in the thousands (annual income) will steamroll one measured in single digits (number of visits). Always standardise your features first, or income silently becomes the only thing that counts.
It’s sensitive to outliers, since means get yanked around by extreme points.
You must pick k yourself, as covered above.
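
Two of these pitfalls are easy to reproduce on toy data. In the sketch below, the income and visits numbers are invented, the cluster helper is a made-up convenience, and agreement with the true groups is measured with scikit-learn’s adjusted Rand index (1 means a perfect match, around 0 means no better than chance), which is my choice of yardstick for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def cluster(data, k=2):
    """Hypothetical helper: fit k-means and return the labels."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)

# Pitfall: unscaled features. The two invented groups differ only in visits;
# income is pure noise, but its numbers are thousands of times larger.
group = np.repeat([0, 1], 100)
visits = rng.normal(loc=np.where(group == 0, 2.0, 10.0), scale=1.0)
income = rng.normal(loc=50_000, scale=15_000, size=200)
X = np.column_stack([income, visits])

ari_raw = adjusted_rand_score(group, cluster(X))
ari_scaled = adjusted_rand_score(group, cluster(StandardScaler().fit_transform(X)))
print(f"unscaled: {ari_raw:.2f}   standardised: {ari_scaled:.2f}")

# Pitfall: non-spherical clusters. Two interlocking crescents.
moons, moon_id = make_moons(n_samples=300, noise=0.05, random_state=0)
ari_moons = adjusted_rand_score(moon_id, cluster(moons))
print(f"crescents: {ari_moons:.2f}")
```

On the raw data the split follows income and ignores visits; after standardising, the real groups come back. The crescents stay badly split either way, because the problem there is shape, not scale.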

A quick taste in code

Here’s the whole thing in scikit-learn — fit, predict, and read off the inertia and silhouette:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)   # scale first — always

km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)

print("Inertia (WCSS):", round(km.inertia_, 1))
print("Silhouette:", round(silhouette_score(X, labels), 3))
print("Centroids:\n", km.cluster_centers_)

Note init="k-means++" for the smart start and n_init=10, which reruns the whole thing ten times from different random starts and keeps the run with the lowest inertia: cheap insurance against an unlucky initialisation. (Setting n_init="auto" would run only once when the initialisation is k-means++.)
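
One thing the snippet above glosses over: it scales X and then throws the fitted scaler away, so there’s no way to give new points the same treatment later. A possible pattern, sketched here with made-up stand-in data, is to wrap the scaler and the clusterer in a scikit-learn pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# The pipeline keeps the scaling it learned, so new points are scaled the same way.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=42),
)
model.fit(X_raw)

# Stand-ins for fresh, unscaled observations arriving later.
new_points = X_raw[:5] + 0.1
new_labels = model.predict(new_points)
print(new_labels)
```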

Where you’ll actually use it

The classic is customer segmentation: feed in spending behaviour and let k-means surface “weekend bulk buyers,” “midnight snackers,” and “people who only appear during sales” — groups nobody hand-labeled. It also powers image colour compression (cluster pixels into k representative colours), document grouping, and anomaly detection (points far from every centroid are suspicious).
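
As a taste of the anomaly-detection use, this sketch fits k-means on ordinary data and then flags new points that sit unusually far from every centroid. The 99th-percentile cutoff and the two test points are assumptions picked for illustration, not a standard rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform() gives each point's distance to every centroid; keep the smallest.
train_dist = km.transform(X).min(axis=1)

# Hypothetical cutoff: anything farther out than 99% of the training points.
threshold = np.quantile(train_dist, 0.99)

new_points = np.array([
    km.cluster_centers_[0] + 0.1,   # sits right next to a centroid
    [30.0, 30.0],                   # far away from all of them
])
new_dist = km.transform(new_points).min(axis=1)
flags = new_dist > threshold
print(flags)
```

Fitting on data you trust and scoring new points afterwards keeps an extreme point from dragging a centroid toward itself.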

The takeaway

When you reach for k-means, run this checklist: scale your features, use k-means++ initialisation, and choose k with the elbow and silhouette together rather than gut feel. Then sanity-check the result — if your clusters look like crescents or wildly different sizes, k-means may be the wrong tool, and something like DBSCAN or Gaussian mixtures will serve you better. Get those four habits right and k-means goes from “mysterious black box” to the reliable first thing you try whenever you’ve got unlabeled data and a hunch there’s structure hiding in it.