PCA and Dimensionality Reduction, Explained

Imagine a dataset with a thousand columns. Customer records, gene expression levels, pixel intensities — pick your poison. Your instinct says more features means more information, so more must be better. Your model disagrees, loudly, by overfitting, training slowly, and producing nearest-neighbour results that are about as meaningful as drawing names from a hat.

Welcome to the curse of dimensionality, and to the tool most people reach for first to break it: Principal Component Analysis.

The curse, briefly

High-dimensional space is weird, and not in a fun way. As you pile on dimensions, the volume of the space explodes exponentially, so your data points scatter into a vast, mostly empty void. To keep the same density you’d need exponentially more samples — which you never have.

Worse is distance concentration. Add enough dimensions and the distances from any point to its nearest and farthest neighbours converge toward the same value. “Nearest neighbour” quietly stops meaning anything, which is awkward, because algorithms like k-NN and k-means stake their entire personality on it. In a high-dimensional hypercube, almost all the volume sits in a thin shell near the surface, and almost every point ends up roughly equidistant from every other. The interior is a ghost town.
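You can watch the concentration happen with a few lines of NumPy. The sketch below is an illustration, not a proof: it assumes points drawn uniformly from the unit hypercube, and the helper name nearest_to_farthest_ratio is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def nearest_to_farthest_ratio(n_points, n_dims):
    """Ratio of the nearest to the farthest distance from one random query point."""
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()


# Same number of points each time; only the dimensionality changes.
ratios = {d: nearest_to_farthest_ratio(1000, d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"{d:>5} dims: nearest/farthest = {r:.3f}")
```

A ratio near 0 means the nearest neighbour is genuinely close compared with the farthest point. As the dimension climbs, the ratio should drift toward 1, which is the "everything is equally far away" problem in numeric form.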

The fix is to compress those thousand columns down to the handful that actually carry the signal. That’s dimensionality reduction.

What PCA actually does

PCA’s bet is simple: the directions in which your data varies the most are the directions that matter most. A feature where every row has nearly the same value tells you almost nothing; a feature that swings wildly tells you a lot.

So PCA finds new axes, called principal components, aligned with the spread of the data. The first component points along the direction of maximum variance. The second points along the direction of most remaining variance, at a right angle to the first. The third does the same while staying perpendicular to both, and so on. Mechanically, this is the eigendecomposition of the covariance matrix of the centered data (or, as scikit-learn typically does it, a singular value decomposition of the centered data matrix, which is more numerically stable). The eigenvectors are the directions; the eigenvalues say how much variance each one captures.

Think of it as rotating the coordinate system to line up with the shape of your data, like turning a flat photo of a tilted line until the line runs straight across. Once rotated, you keep the top k components and throw away the rest. You’ve gone from D dimensions to k, holding on to as much of the variance — the interesting bit — as possible.
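The mechanics fit in a short NumPy sketch. This is a from-scratch illustration rather than how scikit-learn is implemented internally; the toy "tilted line" data and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): 200 points along a tilted line in 2D.
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

# 1. Center the data so the covariance is measured around the mean.
X_centered = X - X.mean(axis=0)

# 2. Eigendecomposition of the covariance matrix.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

# 3. Sort by variance captured, largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order].T  # each row is one principal direction

# 4. Rotate into the new axes and keep only the top k.
k = 1
X_projected = X_centered @ components[:k].T

# The SVD route reaches the same directions and variances (up to sign).
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_variances = singular_values**2 / (len(X) - 1)

print("variance per component:", eigenvalues)
print("fraction explained:    ", eigenvalues / eigenvalues.sum())
```

The sign of each direction is arbitrary, so the two routes can disagree on whether a component points "left" or "right" while describing the same axis.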

“Variance explained”: how many to keep

Each component captures a fraction of the total variance, and a fitted scikit-learn PCA hands you those fractions, sorted from largest to smallest, in explained_variance_ratio_. This is your dial for choosing k.

The usual moves:

Cumulative threshold: keep the fewest components that reach ~90–95% cumulative variance. If five components explain 98% of the action, why carry fifty?
Scree plot: plot variance-per-component and look for the “elbow” where the curve flattens. Past the elbow, you’re collecting noise.
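Both moves read the same array. Here is a hedged sketch of the cumulative-threshold approach done by hand, using scikit-learn's bundled digits dataset as a stand-in for your own data; the 0.95 target and the variable names are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: 1797 handwritten digits, 64 pixel features each.
X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit with every component kept so the full variance profile is visible.
pca_all = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)

# Smallest k whose cumulative variance reaches the target.
target = 0.95
k = int(np.searchsorted(cumulative, target) + 1)

print(f"{k} of {X.shape[1]} components reach {target:.0%} of the variance")

# Scree-style readout: variance per component for the first few.
for i, ratio in enumerate(pca_all.explained_variance_ratio_[:10], start=1):
    print(f"component {i:>2}: {ratio:.3f}")
```

Plotting explained_variance_ratio_ against the component index gives the scree plot; the printed values are the same numbers without the chart.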

The shortcut is to let PCA do the bookkeeping for you. Hand it a float and it keeps just enough components to hit that variance target:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X is assumed to be a numeric array of shape (n_samples, n_features).
# Standardize FIRST: PCA chases variance, so unscaled
# features with big numeric ranges will hijack it.
X_scaled = StandardScaler().fit_transform(X)

# Keep 95% of the variance, however many components that takes
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # often dramatically smaller
print(pca.explained_variance_ratio_)   # variance carried by each kept component

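One gotcha worth tattooing somewhere: the float-variance trick only works with a full decomposition. The default svd_solver="auto" picks a compatible solver when you pass a float, but the "arpack" and "randomized" solvers reject it, so spelling out svd_solver="full" keeps the intent explicit. And note that StandardScaler is not optional ceremony. PCA equates variance with importance, so a feature measured in dollars will steamroll one measured in fractions purely because its numbers are bigger. Standardize first whenever your features sit on different scales, which in practice is almost always.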
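To see the steamrolling in action, here is a small sketch with two synthetic, independent features. The "income" and "rate" columns and their distributions are invented for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two independent, made-up features on wildly different scales.
income = rng.normal(50_000, 15_000, size=n)  # dollars
rate = rng.uniform(0.0, 1.0, size=n)         # a fraction between 0 and 1
X = np.column_stack([income, rate])

# First principal direction without and with standardization.
raw_first = PCA(n_components=1).fit(X).components_[0]
scaled_first = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

print("raw    [income, rate] weights:", np.round(raw_first, 4))
print("scaled [income, rate] weights:", np.round(scaled_first, 4))
```

Without scaling, the first component is essentially the income axis. After scaling, both features get a comparable say.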

Where it shines

Visualization: squash data to 2D or 3D and actually look at it.
Speed: fewer features means faster training and less room to overfit.
Denoising: the low-variance components are frequently just noise — drop them and the signal gets cleaner.
Compression: store k numbers per row instead of D.
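Compression and denoising are two views of the same round trip: project down to k numbers, then map back with inverse_transform and see what was lost. A minimal sketch, again assuming the bundled digits dataset and an arbitrary choice of k = 16:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Compress: 64 numbers per row become 16.
pca = PCA(n_components=16).fit(X_scaled)
X_compressed = pca.transform(X_scaled)

# Decompress: back to 64 numbers, minus whatever the dropped components carried.
X_restored = pca.inverse_transform(X_compressed)

mse = np.mean((X_scaled - X_restored) ** 2)
kept = pca.explained_variance_ratio_.sum()

print(f"stored {X_compressed.shape[1]} of {X_scaled.shape[1]} values per row")
print(f"variance kept: {kept:.1%}, reconstruction MSE: {mse:.4f}")
```

Whether the discarded part was noise or signal depends on your data, so it is worth checking the reconstruction error before trusting the smaller representation.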

Where it doesn’t

PCA is linear. It can only find flat structure, so if your data curls around a curved manifold — the classic “Swiss roll” — PCA will flatten it into mush and lose the very structure you cared about.

It’s also hard to interpret. Each component is a blend of all your original features, so “Component 1” rarely maps to anything a human would name.
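You can inspect the blend directly through components_, where each row holds one component's weights on the original features. A sketch using scikit-learn's bundled wine dataset, chosen here only because it has named columns:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_scaled)

# Weights of the first component on each original feature, largest magnitude first.
first = pca.components_[0]
ranked = sorted(zip(data.feature_names, first), key=lambda pair: abs(pair[1]), reverse=True)

for name, weight in ranked:
    print(f"{name:<30} {weight:+.3f}")
```

Expect many features to carry a non-trivial weight, which is exactly why a component rarely gets a tidy human name.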

For visualization specifically, this is where t-SNE and UMAP come in. Both are nonlinear and obsessive about preserving local neighbourhoods, so they reveal clusters that PCA smears together. UMAP is generally faster than t-SNE and keeps a bit more global structure. But treat them as eye candy, not preprocessing: their axes mean nothing, the gaps and cluster sizes aren’t trustworthy, and they’re stochastic, so two runs give two different pictures unless you fix the random seed. A popular hybrid is to run PCA down to ~50 dimensions first, then feed that into UMAP for the final plot, which is faster and pre-denoised.
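The hybrid pipeline looks roughly like this. To keep the sketch dependent on scikit-learn alone, it swaps in t-SNE for the final step instead of UMAP (which lives in the separate umap-learn package); the 500-row subsample and the fixed seeds are choices made for speed and repeatability.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Subsample the bundled digits data so the embedding runs quickly.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Step 1: standardize, then let PCA strip the 64 features down to 50.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=50, random_state=0).fit_transform(X_scaled)

# Step 2: nonlinear embedding to 2D for plotting only.
# random_state pins the seed so the picture is the same on every run.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_pca)

print(X.shape, "->", X_pca.shape, "->", embedding.shape)
```

Scatter-plotting embedding coloured by y gives the familiar cluster map. The 2D coordinates are for looking at, not for feeding into a model.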

The takeaway

Reach for PCA when you have too many numeric features and want a fast, principled way to shrink them while keeping the signal. The recipe is three lines: standardize with StandardScaler, fit PCA(n_components=0.95), and read off explained_variance_ratio_ to see what you kept. If your goal is a pretty 2D cluster map rather than feeding a downstream model, switch to UMAP — but never read its axes like a chart. Compress the noise, keep the variance, and let your model breathe.