Decision Trees and Random Forests, Explained

If you have ever played twenty questions, you already understand a decision tree. “Is it an animal? Does it have fur? Bigger than a breadbox?” Each answer carves the world of possibilities in half until only one suspect remains. Machine learning’s decision tree does the same thing, except it figures out the questions for you, and it never gets bored. The trouble is that a single tree, left to its own devices, becomes the kind of know-it-all who has memorised the textbook but flunks the exam. Random forests are the fix. Let’s see how we get from one to the other.

How a tree decides where to split

A decision tree learns by repeatedly asking: which question splits my data into the cleanest groups? “Cleanest” needs a number, and the usual one is Gini impurity. A Gini score of 0 means a node is perfectly pure (everyone in it belongs to the same class); higher means more of a mixed bag. The tree tries every candidate split, measures how much purity it buys, and greedily picks the winner.

Its main rival is entropy, borrowed from information theory, which measures the same idea as “surprise” or uncertainty in the node. In practice the two almost always agree. Gini is slightly cheaper to compute, which is why it’s the default in most libraries; entropy occasionally yields marginally more balanced trees. Honestly, for 99% of problems you can flip a coin between them and move on with your life.

The tree keeps splitting, level after level, until the leaves are pure or some stopping rule kicks in. And therein lies the problem.

Why one tree overfits

A tree with no leash will keep splitting until every training point sits in its own private leaf. At that point it has 100% training accuracy and the predictive instincts of a stopped clock. It hasn’t learned the signal; it has memorised the noise. Show it data it has never seen and it falls apart. In the jargon, single trees are low bias, high variance: change a few training rows and you can get a wildly different tree.

You can prune a tree, cap its depth, or demand a minimum number of samples per leaf. These help. But there’s a cleverer trick: instead of agonising over one perfect tree, grow a crowd of imperfect ones.

Bagging: wisdom of the (tree) crowd

The crowd technique is bagging, short for bootstrap aggregating. You draw many random samples of your data with replacement (so some rows repeat, some sit out), train a separate tree on each, and average their votes. Any one tree is twitchy and overconfident, but their errors are largely independent, so averaging cancels much of the variance out. It’s the statistical version of asking a thousand people to guess the weight of an ox: individually noisy, collectively uncanny.

Random forests: bagging with a twist

A random forest is bagging plus one extra dose of randomness. At every split, each tree is only allowed to consider a random subset of the features, not all of them. By default scikit-learn uses the square root of the total feature count (max_features='sqrt').

Why hobble the trees like this? Because without it, if one feature is wildly predictive, every tree would pick it for the first split and they’d all end up near-identical clones. Correlated trees vote together, which defeats the point of a crowd. Forcing each tree to sometimes ignore the obvious star produces a genuinely diverse committee, and diversity is what makes the averaging work. The default forest in scikit-learn grows 100 such trees (n_estimators=100) and lets them vote.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = RandomForestClassifier(
    n_estimators=300,     # more trees, more stable votes
    max_features="sqrt",  # feature randomness at each split
    random_state=42,
)
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test):.3f}")

# Which features actually mattered?
for name, imp in sorted(
    zip(load_wine().feature_names, clf.feature_importances_),
    key=lambda t: t[1], reverse=True,
)[:5]:
    print(f"{name:25s} {imp:.3f}")

Feature importance (with a caveat)

That feature_importances_ attribute is a lovely freebie. The forest tracks how much each feature reduced impurity across all its splits and reports the totals: a quick ranking of what drove the predictions. Just don’t worship it blindly. This default (impurity-based) importance is biased toward high-cardinality features like continuous numbers or IDs, and it can mislead when features are correlated. When the stakes are real, reach for permutation importance instead, which measures how much accuracy drops when you scramble a column.

The trade-offs, honestly

Forests are the reliable workhorse of tabular machine learning: accurate out of the box, robust to outliers, indifferent to feature scaling, and refreshingly hard to overfit. The price is interpretability and size. A single tree you can print and read aloud; 300 trees are a black box you query, not a flowchart you follow. They’re also heavier to store and slower to predict.

The takeaway: when you have tabular data and want a strong baseline with almost no fuss, start with a random forest, leave the defaults mostly alone, bump n_estimators up a bit, and read off feature_importances_ for a first look at what matters, then confirm the important features with permutation importance before you bet anything on them. One tree is a clever guesser; a forest is a wise committee, and the committee almost always wins.