Anomaly Detection, Explained

Abhay Abhay 5 min read

Most machine learning is about finding the haystack: the patterns that repeat, the trends that hold, the average customer who behaves like all the other average customers. Anomaly detection is the awkward sibling that only cares about the needle — the one transaction in a million that’s fraud, the sensor reading that means a turbine is about to eat itself, the login that’s actually an intruder wearing your password like a borrowed coat.

The catch is that needles are, by definition, scarce, sneaky, and forever inventing new disguises. Let’s walk through how we go looking for them, and why it’s harder than it sounds.

What “anomalous” even means

An anomaly is a data point that doesn’t fit the story the rest of your data is telling. That’s a deliberately vague definition because the field is deliberately vague. Fraud, equipment faults, and network intrusions don’t share a tidy mathematical signature — they just share the property of being rare and different. So instead of one grand algorithm, we have three broad families, each with a different theory of what “different” means.

Family 1: Statistical (the back-of-the-envelope approach)

If your data is roughly bell-shaped, the simplest detectors are also the oldest. The z-score asks: how many standard deviations is this point from the mean? Anything past, say, three is suspicious.

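import numpy as np

# 15 ordinary readings plus one spike
data = np.array([12, 11, 13, 12, 14, 12, 13, 11, 12, 13, 14, 12, 11, 13, 99, 12])
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 3]  # array([99])

# Caveat: with n points, |z| can never exceed sqrt(n - 1), so a
# threshold of 3 cannot fire on samples of 10 or fewer.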

The IQR method is the z-score’s robust cousin: it flags anything below Q1 - 1.5×IQR or above Q3 + 1.5×IQR. Because it uses quartiles instead of the mean and standard deviation, one monster outlier can’t drag the threshold around. Both are fast and explainable, but both look at one variable at a time, so they miss points that are only unusual in combination, and the z-score’s three-sigma rule loses its meaning once the data is skewed or heavy-tailed. That describes most real datasets.
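As a rough sketch, here is the IQR rule on a small made-up sample (the readings are illustrative, not from a real dataset):

```python
import numpy as np

# Illustrative readings: seven ordinary values and one spike
data = np.array([12, 11, 13, 12, 14, 99, 12, 13])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [99]
```

The fences come out at roughly 10.1 and 15.1, so only the spike lands outside them, and the spike itself barely moved the fences.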

Family 2: Distance and density (judging by your neighbours)

If “normal” points cluster together, anomalies are the loners. k-Nearest Neighbours (kNN) scores a point by its distance to its k closest friends — far away means weird. Local Outlier Factor (LOF) is the smarter version: it compares a point’s local density to the density of its neighbours. This matters because real data has dense and sparse regions, and a point that looks isolated globally might be perfectly at home in its sparse little neighbourhood. LOF catches the relative loner, not just the geographically distant one.

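The price? Distance-based methods struggle in high dimensions, where the gap between a point’s nearest and farthest neighbours shrinks until distances stop telling you much (the famous “curse of dimensionality”). They can also be slow on big datasets, because a naive neighbour search compares every point with every other point.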
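To make the kNN-versus-LOF difference concrete, here is a minimal sketch using scikit-learn. The data is synthetic and invented for illustration: one tight cluster, one loose cluster, and a single "loner" parked just outside the tight one. The choice of k = 10 is also just an assumption.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a tight cluster,
# a loose cluster, and one point sitting just outside the tight one.
dense = rng.normal(loc=0.0, scale=0.1, size=(100, 2))
sparse = rng.normal(loc=10.0, scale=2.0, size=(100, 2))
loner = np.array([[1.0, 1.0]])
X = np.vstack([dense, sparse, loner])
loner_idx = len(X) - 1

k = 10

# kNN score: distance to the k-th nearest neighbour (larger = more anomalous).
# kneighbors() with no argument leaves each point out of its own neighbour list.
knn_dist, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors()
knn_score = knn_dist[:, -1]

# LOF score: scikit-learn stores the negated factor, so flip the sign
# (larger = more anomalous, values near 1 = typical).
lof = LocalOutlierFactor(n_neighbors=k).fit(X)
lof_score = -lof.negative_outlier_factor_

print("kNN rank of loner:", int((knn_score > knn_score[loner_idx]).sum()) + 1)
print("LOF rank of loner:", int((lof_score > lof_score[loner_idx]).sum()) + 1)
```

On data like this, plain kNN distance tends to rank the outer members of the loose cluster above the loner, because they really are farther from their neighbours. LOF should put the loner first, since it is far away relative to how tightly its own neighbours are packed.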

Family 3: Model-based (let the algorithm learn “normal”)

The modern workhorses learn a model of normality and flag whatever doesn’t fit.

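Isolation Forest is the clever one. Instead of measuring distance or density, it builds an ensemble of random trees, each splitting on a randomly chosen feature at a randomly chosen value, and asks how many cuts it takes to isolate a point. Anomalies, being rare and different, tend to get isolated in just a few cuts. Training each tree on a small subsample keeps it fast (roughly linear in the number of points) and reduces the “masking” and “swamping” effects that trip up older methods. It copes with high-dimensional data better than distance-based methods, though lots of irrelevant features still hurt. It is often competitive with or ahead of one-class SVM and LOF in published comparisons, but results vary by dataset, so test it on yours.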
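The "how many cuts" idea is easy to try by hand. The sketch below is a deliberately stripped-down, one-dimensional toy, not scikit-learn's implementation (which also picks random features, subsamples the data, and normalises path lengths). The function name `isolation_depth` and the data are made up for illustration.

```python
import numpy as np

def isolation_depth(x, values, rng):
    """Count the random cuts needed until x sits alone in its partition."""
    depth = 0
    subset = values
    while len(subset) > 1:
        lo, hi = subset.min(), subset.max()
        if lo == hi:  # only duplicates left; cannot split further
            break
        cut = rng.uniform(lo, hi)
        # keep the side of the cut that still contains x
        subset = subset[subset < cut] if x < cut else subset[subset >= cut]
        depth += 1
    return depth

rng = np.random.default_rng(0)
# 200 ordinary values plus one obvious outlier at 10.0
values = np.append(rng.normal(0.0, 1.0, size=200), 10.0)
inlier = np.sort(values)[100]  # an actual data point near the middle

def mean_depth(x, n_trials=200):
    return float(np.mean([isolation_depth(x, values, rng) for _ in range(n_trials)]))

outlier_depth = mean_depth(10.0)
inlier_depth = mean_depth(inlier)
print(f"outlier: {outlier_depth:.1f} cuts, inlier: {inlier_depth:.1f} cuts")
```

The outlier usually falls out after a couple of cuts, while the point in the middle of the crowd needs many more. The average number of cuts is, in spirit, the anomaly score.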

Autoencoders take the deep-learning route: train a neural network to compress data and reconstruct it. Feed it mostly-normal data and it gets good at rebuilding normal things. Anomalies reconstruct badly — the reconstruction error is your anomaly score. Great for images, sequences, and rich sensor streams; overkill for a spreadsheet of three columns.
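For a feel of reconstruction error as a score, here is a small sketch. To keep it dependency-light it uses scikit-learn's `MLPRegressor` trained to reproduce its own input as a stand-in autoencoder; a real project would more likely use a deep-learning framework. The data, layer sizes, and the anomalous point are all assumptions made up for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic "normal" data (an assumption for illustration): 5 features
# driven by 2 hidden factors, plus a little noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X_normal = latent @ mixing + 0.05 * rng.normal(size=(500, 5))

scaler = StandardScaler().fit(X_normal)
X_scaled = scaler.transform(X_normal)

# The 2-unit middle layer is the bottleneck; the target is the input itself.
autoencoder = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                           max_iter=2000, random_state=0)
autoencoder.fit(X_scaled, X_scaled)

def reconstruction_error(X):
    """Mean squared reconstruction error per row (higher = more anomalous)."""
    X_s = scaler.transform(X)
    return np.mean((autoencoder.predict(X_s) - X_s) ** 2, axis=1)

normal_errors = reconstruction_error(X_normal)
# A point far outside anything seen in training
anomaly_error = reconstruction_error(np.array([[10.0, -10.0, 10.0, -10.0, 10.0]]))[0]

print(f"typical normal error: {np.median(normal_errors):.4f}")
print(f"anomaly error:        {anomaly_error:.4f}")
```

In practice you would pick a cut-off on the error (say, a high percentile of the errors on held-out normal data) and flag anything above it.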

Here’s Isolation Forest in scikit-learn:

from sklearn.ensemble import IsolationForest

# X_train, X_test: 2-D feature arrays of shape (n_samples, n_features)
# contamination = your estimated fraction of anomalies
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(X_train)

scores = model.decision_function(X_test)   # lower = more anomalous; below 0 = flagged
labels = model.predict(X_test)             # -1 = anomaly, 1 = normal

That contamination parameter is doing a lot of quiet work — it sets the threshold, and guessing it wrong is one of the easiest ways to ruin your day.
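A quick way to see this is to fit the same forest with different contamination values and count what gets flagged. The data below is synthetic and assumed for illustration (990 ordinary points and 10 far-off ones):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic data (an assumption for illustration): 990 ordinary points, 10 far-off ones
X = np.vstack([rng.normal(0, 1, size=(990, 2)),
               rng.uniform(6, 8, size=(10, 2))])

flagged = {}
for contamination in (0.01, 0.05, 0.10):
    model = IsolationForest(contamination=contamination, random_state=42).fit(X)
    flagged[contamination] = int((model.predict(X) == -1).sum())
    print(f"contamination={contamination:.2f} -> {flagged[contamination]} points flagged")
```

The underlying scores are the same in all three runs; only the cut-off moves. The flagged count tracks whatever fraction you asked for, not how many anomalies actually exist.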

Why this is genuinely hard

Three things conspire against you. Anomalies are rare (often well under 1% of the data), so there’s barely anything to learn from. They’re evolving — fraudsters and attackers actively change tactics, so yesterday’s model rots. And the data is brutally imbalanced, which warps both training and evaluation. (If imbalance is your bottleneck, that’s its own deep topic — see How to Handle Imbalanced Datasets in Classification.)

The evaluation trap

This is where good projects quietly fail. If 99.9% of your data is normal, a model that screams “everything’s fine!” is 99.9% accurate and 100% useless. That’s the accuracy paradox, and it means accuracy is the wrong yardstick entirely.
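The arithmetic is worth seeing once. A minimal sketch with hypothetical labels, where one point in a thousand is an anomaly and the "detector" never flags anything:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1 anomaly in every 1,000 points (1 = anomaly, 0 = normal)
y_true = np.zeros(10_000, dtype=int)
y_true[::1000] = 1  # 10 anomalies

# A "detector" that never flags anything
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"accuracy: {accuracy:.3f}")  # accuracy: 0.999
print(f"recall:   {recall:.3f}")    # recall:   0.000
```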

Reach for precision (of the things I flagged, how many were real?) and recall (of the real anomalies, how many did I catch?). For threshold-independent comparison, prefer PR-AUC over the more famous ROC-AUC: ROC curves paint an over-optimistic picture under heavy imbalance, while precision-recall curves focus squarely on the rare class you actually care about. Just remember the PR-AUC random baseline equals your anomaly rate, so a “0.4” means very different things across datasets.
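Here is a rough sketch of that comparison using scikit-learn's `roc_auc_score` and `average_precision_score` (average precision is a common way to summarise the PR curve). Both sets of scores are simulated, and the "detector" is hypothetical: it simply scores anomalies higher on average.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n, anomaly_rate = 100_000, 0.01
y_true = (rng.random(n) < anomaly_rate).astype(int)

# Scores with no information at all
random_scores = rng.random(n)
# Scores from a hypothetical detector: anomalies score higher on average
detector_scores = rng.normal(loc=2.0 * y_true, scale=1.0)

results = {}
for name, scores in [("random", random_scores), ("detector", detector_scores)]:
    roc = roc_auc_score(y_true, scores)
    ap = average_precision_score(y_true, scores)  # a common PR-AUC estimate
    results[name] = (roc, ap)
    print(f"{name:>8}: ROC-AUC={roc:.3f}  PR-AUC={ap:.3f}")
```

The random scores should land near 0.5 on ROC-AUC and near the 1% anomaly rate on PR-AUC. The detector's ROC-AUC looks far more flattering than its PR-AUC, even though both numbers describe the same scores.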

The takeaway

A practical recipe:

  1. Start dumb. Try z-score or IQR first. If a one-liner catches your needles, ship it.
  2. Match the method to the data. Tabular and medium-sized? Reach for Isolation Forest. Rich/high-dimensional (images, sequences)? Try an autoencoder. Normal data with both dense and sparse regions? LOF.
  3. Never trust accuracy. Evaluate with precision, recall, and PR-AUC — and tune your threshold deliberately, not by accepting a default contamination.
  4. Plan to retrain. Anomalies evolve; a model you set and forget is a model that’s already wrong.
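For step 3, one way to tune the threshold deliberately is to pick it from the precision-recall curve on a labelled validation set. This is a sketch under assumptions: the labels and scores are simulated, higher scores mean more anomalous, and the 80% recall target is an arbitrary example.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(1)
# Hypothetical validation set: about 2% anomalies, higher score = more anomalous.
# (For IsolationForest you could use -model.decision_function(X) as the score.)
y_val = (rng.random(5_000) < 0.02).astype(int)
scores_val = rng.normal(loc=2.5 * y_val, scale=1.0)

precision, recall, thresholds = precision_recall_curve(y_val, scores_val)

# precision[i] and recall[i] describe the rule "flag if score >= thresholds[i]";
# the final entry of precision/recall has no matching threshold, so drop it.
target_recall = 0.80
candidates = np.where(recall[:-1] >= target_recall)[0]
best = candidates[-1]  # highest threshold that still meets the recall target
threshold = thresholds[best]

print(f"threshold={threshold:.2f}  precision={precision[best]:.2f}  recall={recall[best]:.2f}")
```

Whether you anchor on a recall target, a precision floor, or a cap on alerts per day depends on what a missed anomaly costs compared with a false alarm.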

Find the needle, sure — but make sure you can prove it’s actually a needle and not just an interesting piece of hay.
