ROC Curves and AUC, Explained
Abhay
Your classifier spits out a number between 0 and 1. To turn that into a yes/no answer, you pick a threshold — say, 0.5 — and call everything above it “positive.” But 0.5 is just a hunch in a lab coat. Move it to 0.3 and you catch more positives while also crying wolf more often. Move it to 0.8 and you’re stingy but precise. So how do you judge a model when its behaviour depends entirely on a knob you haven’t turned yet?
That’s the question the ROC curve answers. It refuses to commit to a single threshold and instead asks: across every threshold, how does this model trade catching real positives against raising false alarms?
Two rates, one curve
The ROC curve plots two quantities against each other:
- True Positive Rate (TPR) — of all the actual positives, how many did you catch? This is just recall. Higher is better.
- False Positive Rate (FPR) — of all the actual negatives, how many did you wrongly flag? Lower is better.
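To make the two definitions concrete, here is a minimal sketch that counts the four outcomes at a single threshold and turns them into TPR and FPR. The function name `rates_at_threshold` and the toy data are made up for illustration, and the sketch assumes a score at or above the threshold counts as "positive".

```python
def rates_at_threshold(y_true, y_scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are called positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1   # real positive, caught
        elif label == 1:
            fn += 1   # real positive, missed
        elif predicted_positive:
            fp += 1   # real negative, wrongly flagged
        else:
            tn += 1   # real negative, correctly ignored
    tpr = tp / (tp + fn)  # share of actual positives caught (recall)
    fpr = fp / (fp + tn)  # share of actual negatives wrongly flagged
    return tpr, fpr


# Made-up toy data: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.3):
    tpr, fpr = rates_at_threshold(y_true, y_scores, threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Dropping the threshold from 0.5 to 0.3 catches the last positive but flags two more negatives, which is exactly the trade the curve is about to draw.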
Now imagine sweeping the threshold from 1.0 down to 0.0. At 1.0, the model says “positive” to nobody: TPR and FPR are both 0 (bottom-left corner). At 0.0, it says “positive” to everybody: both rates hit 1 (top-right corner). In between, each threshold gives you one (FPR, TPR) point, and connecting them traces the curve.
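The sweep itself is easy to sketch by hand. The snippet below is an illustrative pure-Python version, not how any particular library does it: `roc_points` is a hypothetical helper, the data is invented, and it simply tries every distinct score as a threshold.

```python
def roc_points(y_true, y_scores):
    """Return a list of (FPR, TPR) points, one per distinct threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Start above the highest score so nothing is flagged: the (0, 0) corner.
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time.
    for threshold in sorted(set(y_scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up toy data: 3 positives, 3 negatives
y_true = [1, 1, 0, 1, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, y_scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each positive the threshold passes moves the point up, each negative moves it right, and the walk always ends at (1, 1).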
The shape tells the story. A curve that hugs the top-left corner is the dream: high catch rate, almost no false alarms. The boring diagonal line from corner to corner is a model doing no better than a coin flip: every step up in TPR costs an equal step up in FPR. If your curve dips below the diagonal, congratulations, your model has learned the truth and is confidently telling you the opposite. (Flip its predictions and you’re back in business.)
AUC: squashing the curve into one number
Comparing whole curves by eye is fine for two models and miserable for twenty. So we collapse the curve into a single number: the Area Under the Curve, or AUC. It ranges from 0 to 1:
- 1.0 = perfect separation. Every positive scored higher than every negative.
- 0.5 = the diagonal. Pure guessing.
- < 0.5 = worse than guessing (the inverted-model situation above).
Here’s the genuinely useful intuition, the one worth tattooing on your mental model: AUC is the probability that the model scores a randomly chosen positive higher than a randomly chosen negative. An AUC of 0.9 means that if you grab one real positive and one real negative at random, the model ranks them correctly 90% of the time. That’s why people call AUC a measure of ranking quality — it doesn’t care where you put the threshold, only whether positives generally float above negatives. Threshold-independence is the whole point.
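You can check that ranking interpretation by brute force. The sketch below compares every positive with every negative and counts how often the positive wins. `pairwise_auc` is a hypothetical helper written for this post, the data is invented, and counting ties as half a win is the usual convention rather than something stated above.

```python
from itertools import product


def pairwise_auc(y_true, y_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_scores) if y == 1]
    neg = [s for y, s in zip(y_true, y_scores) if y == 0]
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy data: 3 positives, 3 negatives, so 9 pairs in total
y_true = [1, 1, 0, 1, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = pairwise_auc(y_true, y_scores)
flipped = pairwise_auc(y_true, [1 - s for s in y_scores])
print(f"AUC: {auc:.3f}, AUC of flipped scores: {flipped:.3f}")
```

Eight of the nine pairs are ordered correctly, so the AUC is 8/9. Flipping the scores turns every win into a loss, which is why an inverted model with AUC below 0.5 becomes a useful one.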
In scikit-learn it’s a two-liner:
```python
from sklearn.metrics import roc_auc_score, roc_curve

# y_true: 0/1 labels, y_scores: model probabilities (e.g. clf.predict_proba(X)[:, 1])
auc = roc_auc_score(y_true, y_scores)
print(f"ROC-AUC: {auc:.3f}")

# Want the curve itself? roc_curve hands back the raw ingredients:
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
```
Note you feed roc_auc_score the raw probabilities, not the thresholded 0/1 predictions. Hand it hard 0/1 predictions and you’ve thrown away the very ranking information AUC exists to measure.
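A small sketch shows how much gets lost. It scores the same invented data twice with a hand-rolled pairwise AUC (ties count half): once on the raw scores, once after thresholding at 0.5. The helper `pairwise_auc` and the numbers are illustrative assumptions, not output from a real model.

```python
def pairwise_auc(y_true, y_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_scores) if y == 1]
    neg = [s for y, s in zip(y_true, y_scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data: 3 positives, 4 negatives
y_true = [1, 1, 1, 0, 0, 0, 0]
y_scores = [0.95, 0.60, 0.45, 0.55, 0.30, 0.20, 0.10]

# Thresholding collapses seven distinct scores into just two values.
y_hard = [1 if s >= 0.5 else 0 for s in y_scores]

auc_scores = pairwise_auc(y_true, y_scores)
auc_hard = pairwise_auc(y_true, y_hard)
print(f"AUC on probabilities: {auc_scores:.3f}")
print(f"AUC on hard labels:   {auc_hard:.3f}")
```

After thresholding, most pairs are tied, so the metric can no longer tell a confident 0.95 from a marginal 0.60, and the score drops even though the model has not changed.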
When AUC quietly lies to you
Now the part that trips up production teams. ROC-AUC has a blind spot, and it’s a big one: class imbalance.
Look again at FPR. Its denominator is the total number of actual negatives. When negatives massively outnumber positives — fraud, rare disease, click-through, the usual suspects — that denominator is enormous. You can rack up thousands of false positives and barely nudge the FPR, because thousands is a rounding error against millions. The curve still hugs the corner. AUC still looks gorgeous. Meanwhile your fraud team is drowning in false alerts.
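The arithmetic is worth doing once by hand. The counts below are hypothetical, picked only to show the scale effect: a thousand positives hiding among a million negatives, with a model that catches 900 of them and raises 5,000 false alerts along the way.

```python
# Hypothetical counts, chosen only to illustrate the arithmetic
n_pos = 1_000        # actual positives (e.g. fraud cases)
n_neg = 1_000_000    # actual negatives
tp = 900             # positives the model catches
fp = 5_000           # negatives the model wrongly flags

tpr = tp / n_pos             # what the ROC curve sees on its y-axis
fpr = fp / n_neg             # what the ROC curve sees on its x-axis
precision = tp / (tp + fp)   # what the person reviewing alerts sees

print(f"TPR:       {tpr:.3f}")
print(f"FPR:       {fpr:.3f}")
print(f"Precision: {precision:.3f}")
```

On the ROC plot this point sits at (0.005, 0.9), practically in the top-left corner, yet only about 15% of the alerts are real.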
The numbers make it vivid. In a worked example on a dataset with under 1% positives, the same model scored a flattering ROC-AUC of 0.957 but a sobering PR-AUC of 0.708. Same predictions, two very different verdicts. On a balanced dataset the two metrics tracked each other closely; the gap only yawned open under heavy imbalance.
The fix is the Precision-Recall curve and its area, PR-AUC (in sklearn, usually summarised with average_precision_score). It plots precision against recall and, crucially, uses no true-negative count anywhere. It stares only at the positive class, so the ocean of easy negatives can’t pad your score. Mind the baseline, too: a random model’s PR-AUC sits near the positive rate, not 0.5. When the positive class is rare and finding it is the entire job, PR-AUC tells you the truth ROC-AUC was too polite to mention.
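To see why easy negatives can't pad the score, here is a hand-rolled sketch of average precision: the mean of the precision values at each rank where a true positive shows up. `average_precision` is a simplified helper written for this post (it assumes the positives have no tied scores), and the data is invented. It is meant to mirror the step-wise sum that sklearn documents for average_precision_score, not to replace it.

```python
def average_precision(y_true, y_scores):
    """Mean precision at the ranks of the true positives (sketch, no tie handling)."""
    # Rank examples from highest score to lowest.
    ranked = sorted(zip(y_scores, y_true), reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / n_pos


# Made-up toy data: 2 positives, 3 negatives
y_true = [1, 0, 1, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5]

# Same data plus 1,000 easy negatives that the model scores very low
y_true_padded = y_true + [0] * 1000
y_scores_padded = y_scores + [0.01] * 1000

ap_small = average_precision(y_true, y_scores)
ap_padded = average_precision(y_true_padded, y_scores_padded)
print(f"AP without easy negatives: {ap_small:.3f}")
print(f"AP with easy negatives:    {ap_padded:.3f}")
```

The thousand extra negatives all rank below the last positive, so they never enter a precision calculation and the score does not move. ROC-AUC on the same padded data would climb, because every one of those negatives adds correctly ordered pairs.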
The takeaway
Three rules of thumb to carry out the door:
- Use ROC-AUC to compare ranking quality on roughly balanced data, or when both classes matter equally. Remember: 0.5 is a coin flip, not a passing grade.
- Switch to PR-AUC the moment your positives are rare and false positives are expensive. If your classes are more lopsided than 1-in-10, default to it.
- Always score on probabilities, never hard labels — and never report a single threshold’s accuracy on imbalanced data and call it a day.
A 0.95 AUC is not a trophy until you’ve checked who’s in the room. On a 1%-positive problem, that number might just be the majority class taking a bow.