
Cross-Validation and Data Leakage: Why Your Model's Score Is Lying

Abhay Abhay 5 min read

Every data scientist has felt the high: you train a model, check the test score, and it reads 96%. You mentally draft the Slack message to your boss. Then the model meets real, live data and faceplants at a humbling 71%. What happened? Your score wasn’t measuring skill. It was measuring how well you’d accidentally cheated. Two culprits are almost always to blame, and they’re best friends: a flimsy evaluation setup and a sneaky thing called data leakage.

One split is a coin toss

The textbook starting point is the train/test split — carve off 80% of your data to learn from, hold back 20% to grade yourself on. Sensible. The problem is that which 20% you hold back matters enormously. Get an easy slice and you look like a genius; get a hard one and you look like you’ve never seen a spreadsheet.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Change random_state to 7 and your accuracy might swing three or four points for no reason other than luck. A single split gives you a number, but not a trustworthy number. It’s a measurement with no error bars — the statistical equivalent of judging a restaurant by one bite of one dish.
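To see the lottery for yourself, here is a minimal sketch that repeats the same 80/20 split with ten different seeds. The data is synthetic (make_classification stands in for your own X and y), and the names split_scores and seed are just for illustration, so treat the exact numbers as meaningless and the spread as the point.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data; swap in your own X and y.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

split_scores = []
for seed in range(10):
    # Same data, same model, different 80/20 split each time.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

print(f"lowest:  {min(split_scores):.3f}")
print(f"highest: {max(split_scores):.3f}")
```

The gap between the lowest and highest score is pure split luck, since nothing about the model changed between runs.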

k-fold: grade yourself five times

Cross-validation fixes the lottery by refusing to rely on a single draw. k-fold chops the data into k roughly equal chunks (folds), then trains k times, each time holding out a different fold as the test set and training on the other k-1. You get k scores, and their average is a far steadier estimate of how your model actually behaves.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X, y, cv=5)

print(f"Scores: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")

Now you have a mean and a spread. If the five folds read 0.94, 0.71, 0.93, 0.69, 0.92, that wide spread is screaming at you: the score depends heavily on which rows get held out, and any single split would have lied to your face.
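If cross_val_score feels like a black box, the loop it runs can be sketched by hand. This is a rough illustration, not scikit-learn's actual implementation: it uses synthetic data and swaps in LogisticRegression so the results are deterministic, and manual_scores and library_scores are illustrative names.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: synthetic stand-in data and a simple deterministic model.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in kf.split(X):
    fold_model = clone(model)                    # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])   # learn from the other four folds
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The library call with the same splitter should give the same five numbers.
library_scores = cross_val_score(model, X, y, cv=kf)

print(f"manual:  {manual_scores.round(3)}")
print(f"library: {library_scores.round(3)}")
```

Every row gets exactly one turn in a test fold, which is why the averaged score leans on the whole dataset instead of one lucky slice.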

Stratify when classes are lopsided

If you’re predicting something rare (fraud, disease, churn), plain k-fold can hand you a fold with almost no positive cases, wrecking the estimate. StratifiedKFold keeps each fold’s class balance close to that of the full dataset. For classifiers, scikit-learn already uses StratifiedKFold when you pass an integer to cv, but without shuffling, so being explicit is a good habit when you want shuffled, reproducible folds:

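from sklearn.model_selection import StratifiedKFold, cross_val_score

# shuffle=True mixes the rows before folding; random_state makes the folds reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)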
scores = cross_val_score(model, X, y, cv=cv)
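A quick way to see what stratification buys you is to count the positives that land in each test fold. The sketch below assumes a made-up label vector with 10 positives out of 200 rows; plain_counts and strat_counts are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumption: a toy imbalanced target, 10 positives in 200 rows (5%).
rng = np.random.default_rng(0)
y = np.zeros(200, dtype=int)
y[:10] = 1
rng.shuffle(y)
X = np.zeros((200, 1))  # features are irrelevant for counting labels

plain = KFold(n_splits=5, shuffle=True, random_state=42)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Number of positive cases in each held-out fold.
plain_counts = [int(y[test_idx].sum()) for _, test_idx in plain.split(X)]
strat_counts = [int(y[test_idx].sum()) for _, test_idx in strat.split(X, y)]

print(f"KFold positives per fold:           {plain_counts}")
print(f"StratifiedKFold positives per fold: {strat_counts}")  # 2 in every fold here
```

With plain KFold the positives fall wherever the shuffle puts them, while the stratified splitter deals them out evenly across folds.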

The silent killer: data leakage

Here’s where the truly embarrassing scores come from. Data leakage is when information that won’t exist at prediction time sneaks into training, so the model “knows” things during evaluation that it could never know in production. The model isn’t smart. It peeked at the answer sheet.

The most common offender is fitting your preprocessing on the whole dataset before splitting. Watch this innocent-looking mistake:

from sklearn.preprocessing import StandardScaler

# WRONG: scaler learns the mean/std of ALL data, test set included
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)        # leakage!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

By calling fit_transform on the full X, the scaler computed statistics using rows that later end up in your test set. Your test data has quietly informed your training. The same trap snaps shut with encoders, imputers, feature selection, and SMOTE resampling — anything that learns from data.
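For contrast, here is a minimal sketch of the same steps in the safe order: split first, fit the scaler on the training rows only, then reuse those statistics on the test rows. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data; swap in your own X and y.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Split FIRST, so the test rows are set aside before anything is learned.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Fit the scaler on the training rows only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Transform the test rows with the training mean/std. No fitting here.
X_test_scaled = scaler.transform(X_test)

print(f"rows the scaler learned from: {scaler.n_samples_seen_}")
print(f"rows in the training set:     {len(X_train)}")
```

The test set's mean will usually not come out at exactly zero after scaling, and that is fine: it was scaled with statistics it had no part in computing, just like new data in production.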

There are two more flavours worth naming. Target leakage is when a feature secretly encodes the answer — like using total_paid to predict whether a customer paid, or a “discharge date” column to predict hospital admission. Temporal leakage is using the future to predict the past — shuffling time-series data so the model trains on Thursday to predict Tuesday. Both produce gorgeous scores and useless models.
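For the temporal case, scikit-learn's TimeSeriesSplit keeps the calendar honest. The sketch below assumes 20 rows already sorted oldest to newest (the row index stands in for a timestamp) and simply prints which rows land in each fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Assumption: 20 observations sorted by time, oldest first.
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)

folds = []
for train_idx, test_idx in tscv.split(X):
    folds.append((train_idx, test_idx))
    # Training rows always come strictly before the test rows.
    print(
        f"train rows {train_idx.min()}-{train_idx.max()}  ->  "
        f"test rows {test_idx.min()}-{test_idx.max()}"
    )
```

Each fold trains on an expanding window of the past and tests on the block that follows it, so the model never trains on Thursday to predict Tuesday.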

Pipelines: the seatbelt that’s always on

The fix for preprocessing leakage is delightfully simple: bundle preprocessing and model into a Pipeline. When you cross-validate a pipeline, scikit-learn re-fits every step inside each fold, using only that fold’s training data. The scaler never sees the held-out rows, so that leak is closed off by construction.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv)
print(f"Honest mean accuracy: {scores.mean():.3f}")

That cross_val_score line now does the right thing automatically: for each of the five folds, it fits the scaler on the training rows, transforms the test rows with those stats, fits the model, and scores. No peeking, no inflated numbers. (For SMOTE and other resamplers, reach for imblearn.pipeline.Pipeline from the imbalanced-learn package, which resamples only while fitting, so the held-out fold is never touched.)
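Scaling leaks are often subtle. Feature selection leaks are not. Here is a rough sketch on pure noise, where no model should beat a coin flip: selecting the "best" features on the full dataset before cross-validating manufactures a score out of thin air, while the same selector inside a Pipeline stays honest. The data, the choice of SelectKBest with LogisticRegression, and the names leaky_mean and honest_mean are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Assumption: 100 rows of pure noise with labels unrelated to the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = np.repeat([0, 1], 50)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# WRONG: pick the 20 "best" features using every row, then cross-validate.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_mean = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=cv
).mean()

# RIGHT: the selector lives in the pipeline and is re-fit inside each fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_mean = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"Leaky mean accuracy:  {leaky_mean:.3f}")
print(f"Honest mean accuracy: {honest_mean:.3f}")
```

There is no signal in this data at all, so any gap between the two numbers is leakage and nothing else.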

The honesty checklist

Before you trust any score, run through this:

  • Never fit on data you’ll later test on. Fit on train, transform test — or just use a Pipeline and stop thinking about it.
  • Cross-validate instead of trusting one split. Report the mean and the standard deviation.
  • Stratify for imbalanced classification; use TimeSeriesSplit for anything time-ordered.
  • Interrogate suspiciously good features. If a column is too predictive, ask whether it would actually exist before the prediction is made.
  • Put every learning step inside the pipeline — scalers, encoders, imputers, selectors, resamplers.
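As one way to act on that last item, here is a minimal sketch of a pipeline that carries an imputer, a scaler, and an encoder along with the model, so all of them are re-fit inside each fold. The toy data (two numeric columns with missing values plus one integer-coded category column) and every name in it are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumption: toy data with two numeric columns and one category column.
rng = np.random.default_rng(0)
n = 300
numeric = rng.normal(size=(n, 2))
category = rng.integers(0, 3, size=n)
y = (numeric[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)
numeric[rng.random(size=(n, 2)) < 0.1] = np.nan  # knock out ~10% of values
X = np.column_stack([numeric, category])

# Numeric columns: impute with the median, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route columns 0-1 through the numeric steps and column 2 through the encoder.
preprocess = ColumnTransformer([
    ("num", numeric_steps, [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv)
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The medians, scaling statistics, and category lists are all learned from each fold's training rows only, which is the whole checklist in one object.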

A model that scores 88% honestly beats one that scores 99% by cheating, every single time — because only one of them survives contact with the real world. Build the pipeline, average the folds, and let your score finally tell the truth.
