Classification vs Regression: Which ML Problem Do You Have?

Before you reach for a fancy model, ask yourself one boringly important question: am I predicting a label or a number? Get this wrong and everything downstream — your algorithm, your loss function, your evaluation metric — quietly goes sideways. The good news is that the answer is usually obvious once you know what to look for. The bad news is that plenty of people skip the question entirely and then wonder why their “97% accurate” model is useless.

This is the great fork in the road of supervised learning: classification predicts a category, regression predicts a quantity. That’s it. That’s the distinction. Everything else is detail.

Predicting a category vs predicting a number

If the thing you want to predict is something you could write on a label — spam or not spam, cat or dog or iguana, will churn or won’t — you have a classification problem. The output is discrete, drawn from a fixed set of classes.

If the thing you want to predict is something you’d measure with a number that lives on a continuum — house price, tomorrow’s temperature, how many minutes until your food arrives — you have a regression problem. The output is continuous.

A quick gut check: would it make sense to be “off by 3”? Being off by 3 degrees Celsius is meaningful (regression). Being “off by 3 dog breeds” is nonsense (classification). Numbers have distance; categories don’t.

The flavours of classification

Classification isn’t one thing. There are three common shapes:

Binary: exactly two classes. Spam or ham. Fraud or legit. The bread and butter.
Multiclass: more than two mutually exclusive classes. An image is an apple or a pear or an orange — pick one. Handily, every scikit-learn classifier does multiclass out of the box.
Multilabel: each sample can wear several labels at once. A news article can be tagged politics and economics and Europe. Here the labels aren’t mutually exclusive, so you predict a set, not a single winner.

Mixing these up is its own trap: forcing a multilabel reality (a movie that’s both “comedy” and “romance”) into a single-label multiclass model means inventing awkward combo-classes like “rom-com” and watching your class count explode.
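To make the combo-class blow-up concrete, here is a minimal sketch using scikit-learn’s MultiLabelBinarizer. The movie tags are invented for illustration. The point is only that k independent tags need k yes/no columns, while folding every combination into a single class could need up to 2^k classes.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical movie tags, invented for illustration
movie_tags = [
    {"comedy", "romance"},
    {"comedy"},
    {"horror"},
    {"romance", "horror"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(movie_tags)   # one 0/1 column per tag

print(mlb.classes_)   # tag names, sorted alphabetically
print(Y)              # each row is a set of tags, not a single winner

n_tags = len(mlb.classes_)
print(n_tags, "columns as multilabel vs up to", 2 ** n_tags, "combo classes")
```

With three tags the difference is small (3 columns vs up to 8 classes), but it doubles with every tag you add.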

Different problems, different toolkits

The two camps share a surprising amount of machinery: a decision tree, a random forest, or a neural network can do either job depending on how its output and loss are set up (a class or class probabilities for classification, a predicted value for regression). But the metrics are where they part ways completely, and metrics are how you know if you’re winning.

For classification, you live in the land of accuracy, precision, recall, and F1. Accuracy alone is a liar when classes are imbalanced — a model that always shouts “not fraud” scores 99.9% accuracy on a dataset that’s 0.1% fraud, while catching exactly zero fraudsters. That’s why precision and recall exist.
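The always-“not fraud” model is easy to reproduce. The sketch below uses synthetic labels (1,000 transactions, exactly one fraudulent, so 0.1% fraud) rather than real data, and scores a “model” that predicts the majority class every time.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic labels: 1 = fraud, 0 = legit; one fraud case in 1,000 (0.1%)
y_true = np.zeros(1000, dtype=int)
y_true[0] = 1

# A "model" that always answers "not fraud"
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# Precision is undefined when nothing is predicted positive;
# zero_division=0 tells scikit-learn to report 0 instead of warning
prec = precision_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.3f} recall={rec:.1f} precision={prec:.1f}")
```

Accuracy comes out at 0.999 while recall is 0.0: the one fraudster in the data walks straight past.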

For regression, you measure how far off you are with mean squared error (MSE), its friendlier-scaled cousin RMSE, mean absolute error (MAE), or R² (how much variance you actually explained). There’s no “accuracy” because being exactly right on a continuous value almost never happens — you’re always a little off, and the question is how much.
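As a small worked example, here are the four regression metrics on a toy set of four predictions. The numbers are made up so the arithmetic is easy to check by hand: the errors are -1, 0, 1 and 3.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values, invented for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

mse = mean_squared_error(y_true, y_pred)    # (1 + 0 + 1 + 9) / 4 = 2.75
rmse = np.sqrt(mse)                         # back in the target's own units
mae = mean_absolute_error(y_true, y_pred)   # (1 + 0 + 1 + 3) / 4 = 1.25
r2 = r2_score(y_true, y_pred)               # 1 - 11/20 = 0.45

print(f"MSE={mse:.2f} RMSE={rmse:.3f} MAE={mae:.2f} R2={r2:.2f}")
```

Note how the single error of 3 dominates MSE (9 of the 11 squared-error units) but counts for only 3 of the 5 absolute-error units in MAE. That sensitivity to large misses is the practical difference between the two.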

Here’s the same dataset shape, two different problems, in scikit-learn:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import f1_score, root_mean_squared_error

# X_train / X_test are the same feature matrices in both cases;
# only the target changes (assumed already split into train and test).

# Classification: will this customer churn? (label)
clf = RandomForestClassifier()
clf.fit(X_train, y_labels_train)
label_preds = clf.predict(X_test)     # -> ['churn', 'stay', 'stay', ...]
print(f1_score(y_labels_test, label_preds, pos_label="churn"))

# Regression: how much will this customer spend? (number)
reg = RandomForestRegressor()
reg.fit(X_train, y_amounts_train)
amount_preds = reg.predict(X_test)    # -> [248.10, 19.99, 512.40, ...]
print(root_mean_squared_error(y_amounts_test, amount_preds))

Notice the family name barely changes (Classifier vs Regressor) — but .predict() hands you categories in one case and floats in the other, and the metric you grade with is entirely different.

The classic trap: forcing one into the other

The most common mistake is treating a number like a category, or vice versa.

Binning a continuous target into buckets (“low / medium / high” income) throws away magnitude, and a standard classifier ignores the ordering too: it can’t tell that “high” is closer to “medium” than to “low.” You’ve also manufactured arbitrary cliffs at the bucket edges. If you genuinely care about the number, predict the number.
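The bucket-edge cliff is easy to see with a few numbers. This sketch uses NumPy’s digitize with two hypothetical cut-offs (50,000 and 100,000). Both the incomes and the boundaries are invented for illustration.

```python
import numpy as np

# Hypothetical incomes and bucket boundaries, invented for illustration
incomes = np.array([49_999, 50_001, 99_000, 150_000])
edges = [50_000, 100_000]
names = np.array(["low", "medium", "high"])

# digitize returns the index of the bucket each income falls into
buckets = names[np.digitize(incomes, edges)]

for income, bucket in zip(incomes, buckets):
    print(income, "->", bucket)
```

Incomes of 49,999 and 50,001 differ by 2 yet land in different classes, while 50,001 and 99,000 differ by almost 49,000 and share one. That is the information a regressor would have kept.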

Treating ordered categories as raw numbers is the reverse sin. Encoding star ratings as 1–5 and running plain regression assumes the gap between 1 and 2 stars equals the gap between 4 and 5 — which is rarely true. (Ordinal models exist for exactly this awkward middle ground.)

And then there’s the eternal source of confusion: logistic regression is a classification algorithm. The “regression” in its name feels like a historical prank on every beginner; it survives because the model fits a linear function to the log-odds of a class. What comes out is a probability, which you threshold into a class.
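Here is a minimal sketch of that “probability, then threshold” step, using scikit-learn’s LogisticRegression on synthetic data from make_classification. The dataset and the 0.9 cut-off are assumptions chosen for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (labels are 0 and 1), for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# The model's raw output: probability of class 1 for each sample
proba = model.predict_proba(X)[:, 1]

# .predict() is equivalent to thresholding that probability at 0.5
manual = (proba > 0.5).astype(int)
labels = model.predict(X)
print("same as predict():", (manual == labels).all())

# A stricter, hypothetical threshold flags fewer samples as class 1
strict = (proba > 0.9).astype(int)
print(manual.sum(), "positives at 0.5 vs", strict.sum(), "at 0.9")
```

The threshold is a decision you make on top of the model, which is why the same fitted model can trade precision against recall without being retrained.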

The takeaway

Run this two-step check before writing a line of model code:

Is the target a label or a number? Label → classification. Number → regression.
If it’s a label, how many and are they exclusive? Two → binary. Many, pick-one → multiclass. Many, pick-several → multilabel.
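The two-step check can be roughly automated with scikit-learn’s type_of_target helper, which inspects the values of a target array. The sample targets below are invented. Treat the result as a sanity check rather than a verdict: the helper only sees values, so an integer-valued quantity such as a count would be reported as multiclass even when regression is what you want.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, invented for illustration
targets = {
    "spam filter": ["spam", "ham", "spam", "ham"],
    "fruit image": ["apple", "pear", "orange", "apple"],
    "article tags": np.array([[1, 0, 1], [0, 1, 0]]),  # one 0/1 column per tag
    "house price": [251000.5, 310499.99, 189250.25],
}

kinds = {name: type_of_target(y) for name, y in targets.items()}

for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

The four targets come back as binary, multiclass, multilabel-indicator and continuous respectively, which maps onto the checklist above.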

Nail those, and your choice of algorithm and — crucially — your evaluation metric fall out almost for free. The most expensive bugs in machine learning aren’t in the code; they’re in answering the wrong question precisely.

Sources: IBM — Classification vs Regression, scikit-learn — Multiclass and multioutput algorithms, Coursera — Classification vs. Regression.