Regression Metrics: MAE, MSE, RMSE, and R-Squared

Your regression model spits out a number. The true answer is a different number. The gap between them is the error — and the whole job of a regression metric is to squash a pile of those gaps into a single score you can actually reason about. Simple enough. The catch is that there are several ways to do the squashing, each tells a subtly different story, and picking the wrong one is how you end up shipping a model that looks great in the notebook and disappoints in production.

This is the regression counterpart to the confusion-matrix world of precision and recall. No classes here, just continuous predictions and the distances you missed by. Let’s meet the four metrics you’ll actually use, plus one you should handle with care.

MAE: the honest, blunt one

Mean Absolute Error is exactly what it sounds like: take the absolute value of each error, average them. If your house-price model has an MAE of $18,000, then on a typical prediction it’s off by about eighteen grand. That’s it. No interpretation gymnastics.

MAE’s two virtues are that it’s in the same units as your target (dollars, degrees, minutes) and it’s robust to outliers — one wildly wrong prediction nudges the average a little, but doesn’t detonate it. The price you pay is that MAE treats a single $100k miss the same as five $20k misses. If big mistakes are especially bad for your use case, MAE shrugs where you’d want it to scream.
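To make that trade-off concrete, here is a minimal pure-Python sketch. The helper name mae and the toy prices are illustrative assumptions, not part of any library. It shows that one $100k miss and five $20k misses produce exactly the same MAE.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data: five houses that all truly sold for $300k.
y_true = [300_000] * 5
one_big_miss = [400_000, 300_000, 300_000, 300_000, 300_000]  # one $100k miss
five_small_misses = [320_000] * 5                             # five $20k misses

print(mae(y_true, one_big_miss))       # 20000.0
print(mae(y_true, five_small_misses))  # 20000.0
```

Both models score $20,000, even though one of them got a house wrong by six figures.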

MSE and RMSE: the drama queens

Mean Squared Error squares each error before averaging. Squaring does two things: it kills the sign (no cancelling positives against negatives) and it punishes large errors disproportionately. A prediction off by 10 contributes 100; off by 20 contributes 400. Outliers don’t just count — they dominate.

The downside is that MSE lives in squared units. “Squared dollars” means nothing to anyone, including you. So we take the square root and get RMSE, which lands back in the original units and stays interpretable. RMSE is the default reporting metric in much of the field precisely because it’s intuitive and it leans on big errors.

Rule of thumb: RMSE is always ≥ MAE, and the two are equal only when every error has the same magnitude. When they are close, your errors are fairly uniform. When RMSE towers over MAE, you’ve got a few ugly outliers dragging things around, which is a useful diagnostic in itself.
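A small sketch of that diagnostic, using hand-rolled helpers (the names mae and rmse and both error lists are made up for illustration). The two lists have the same MAE, but only one hides an outlier.

```python
import math

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

uniform = [10, -10, 10, -10]  # every miss has the same size
spiky = [1, -1, 1, -37]       # three tiny misses and one large one

print(mae(uniform), round(rmse(uniform), 2))  # 10.0 10.0
print(mae(spiky), round(rmse(spiky), 2))      # 10.0 18.52
```

MAE cannot tell these two error patterns apart. RMSE nearly doubles on the second one, because the single miss of 37 contributes 1,369 to the sum of squares.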

R²: variance explained, with traps

MAE and RMSE tell you how far off you are, but not whether that’s good. Is an RMSE of 50 impressive? Depends entirely on the scale of the thing. Enter R-squared (the coefficient of determination), which answers a different question: how much better are you than just guessing the average every time?

R² = 1.0 — perfect predictions.
R² = 0.0 — you’re no better than a constant model that always predicts the mean. Ouch.
R² < 0 — yes, negative R² is real, and it means your model is worse than that lazy mean-predictor. Usually a sign of a serious bug, evaluating on data wildly different from training, or a model that didn’t actually learn anything.
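All three cases fall out of the usual definition, R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared errors of the model and SS_tot is the sum of squared errors of the mean-predictor. A minimal sketch follows; the helper r2 is a hand-written illustration (in practice you would call scikit-learn’s r2_score), and it assumes the true values are not all identical, since SS_tot would then be zero.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # mean-predictor's squared error
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

print(r2(y_true, [1.0, 2.0, 3.0, 4.0]))  # 1.0  (perfect)
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0  (always predicts the mean)
print(r2(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)
```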

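R²’s trap is that, measured on the training data, it never decreases when you add features to a least-squares linear model, even useless ones: toss in a column of random noise and training R² nudges up. That’s why adjusted R² exists: it penalises freeloading features so you don’t fool yourself into thinking complexity equals quality. On held-out data R² can and does drop when you add junk, which is one more reason to evaluate there.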
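For reference, the standard adjustment is 1 − (1 − R²)(n − 1) / (n − p − 1), with n samples and p features. The sketch below applies it to two hypothetical models. The R² values, sample count, and feature counts are invented for illustration, and adjusted_r2 is a hand-written helper rather than a scikit-learn function.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R²: shrinks R² according to how many features were used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical comparison on 50 samples: the larger model has a slightly
# higher raw R² but uses four times as many features.
small_model = adjusted_r2(0.80, n_samples=50, n_features=5)
large_model = adjusted_r2(0.81, n_samples=50, n_features=20)

print(round(small_model, 3))  # 0.777
print(round(large_model, 3))  # 0.679
```

The raw scores favour the 20-feature model by a hair, but the adjusted scores rank the 5-feature model clearly higher.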

The one everyone reaches for and shouldn’t: MAPE

Mean Absolute Percentage Error is tempting because percentages feel universal: “we’re off by 8% on average” sounds great in a meeting. But MAPE has two nasty habits. It divides by the true value, so it blows up toward infinity whenever the actual is near zero and is undefined when the actual is exactly zero. It is also asymmetric: with non-negative predictions an under-prediction can never cost more than 100%, while an over-prediction has no ceiling, so MAPE quietly favours models that guess low. If your target ever flirts with zero, MAPE will lie to you with confidence. Use it only when values are safely positive and far from zero (forecasting demand for a busy product, say) and even then, keep RMSE alongside it.
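Both failure modes are easy to reproduce. The sketch below uses a hand-written mape helper that returns a percentage; the function and the numbers are illustrative assumptions. It shows what happens when one true value sits near zero, and how under- and over-predictions are scored for a true value of 100.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Undefined if any true value is 0."""
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)

# Same absolute misses (10, 20, 5) in both cases; only the last true value changes.
far_from_zero = mape([100, 200, 50], [110, 180, 55])
near_zero = mape([100, 200, 0.5], [110, 180, 5.5])
print(f"{far_from_zero:.1f}%")  # 10.0%
print(f"{near_zero:.1f}%")      # 340.0%

# Asymmetry for a true value of 100 and non-negative predictions.
under = mape([100], [0])    # the worst possible under-prediction
over = mape([100], [300])   # an over-prediction, which has no upper bound
print(f"{under:.1f}% vs {over:.1f}%")  # 100.0% vs 200.0%
```

A miss of 5 units on a true value of 0.5 is scored as a 1,000% error, which is enough to drag the whole average from 10% to 340%.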

Computing them in scikit-learn

Modern scikit-learn (1.4+) gives you a dedicated root_mean_squared_error — no more passing the now-deprecated squared=False flag to mean_squared_error.

# root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    root_mean_squared_error,
    r2_score,
)

# Toy house prices in dollars: actual sale prices vs. model predictions
y_true = [310_000, 250_000, 420_000, 180_000]
y_pred = [305_000, 270_000, 390_000, 230_000]

print(f"MAE:  {mean_absolute_error(y_true, y_pred):,.0f}")      # 26,250
print(f"MSE:  {mean_squared_error(y_true, y_pred):,.0f}")       # 956,250,000
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):,.0f}")  # 30,923
print(f"R²:   {r2_score(y_true, y_pred):.3f}")                  # 0.877

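That’s the entire toolkit. Notice that MAE and RMSE come back in dollars and MSE in squared dollars, while R² is a unitless score. That makes R² easier to compare across problems than a raw error, though it still depends on how much the target varies in each dataset, so treat cross-dataset comparisons with care.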

Which to report when

A quick decision guide:

Want a plain-English “off by roughly X”? Report MAE — same units, outlier-robust.
Big errors are especially costly? Optimise and report RMSE — it punishes them.
Need to know if the model beats a trivial baseline? Use R²; treat anything ≤ 0 as a red flag, not a number to polish.
Tempted by a percentage? Be careful with MAPE — only when values stay well clear of zero.

The takeaway: never report a single metric. Show MAE and RMSE together — the gap between them reveals your outlier story — and add R² for context on whether you’ve actually learned something. One number can flatter a bad model; three numbers rarely conspire to lie.
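One way to make that habit stick is a tiny helper that always returns the three numbers together, plus the RMSE/MAE ratio as a quick outlier indicator. This is a dependency-free sketch: regression_report is a hypothetical name, not a scikit-learn function, and it assumes the true values are not all identical. The example data is the toy house-price set from the scikit-learn section.

```python
import math

def regression_report(y_true, y_pred):
    """Return MAE, RMSE, their ratio, and R² for one set of predictions."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    sum_sq = sum(e * e for e in errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum_sq / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": mae,
        "rmse": rmse,
        "rmse_over_mae": rmse / mae,  # 1.0 means every miss has the same size
        "r2": 1 - sum_sq / ss_tot,
    }

report = regression_report(
    [310_000, 250_000, 420_000, 180_000],
    [305_000, 270_000, 390_000, 230_000],
)
for name, value in report.items():
    print(f"{name}: {value:,.3f}")
```

Here the ratio comes out a little under 1.2, so the errors are fairly even and no single prediction dominates.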

Sources: scikit-learn — Metrics and scoring, scikit-learn — root_mean_squared_error, scikit-learn — r2_score.