Ensemble Methods: Bagging, Boosting, and Stacking

Ask a thousand strangers to guess the number of jelly beans in a jar and something faintly magical happens: the average of their guesses usually beats almost every individual, including the smug guy who said he “has a feel for these things.” That’s the wisdom of crowds, and it’s also the entire philosophy behind ensemble learning. One model has opinions. A committee of models, combined cleverly, has better opinions.

The catch is how you combine them. Throw a hundred identical models in a room and you just get the same wrong answer a hundred times. The art is in making the members disagree in useful ways, then aggregating their disagreement into something smarter than any single voice. There are three classic recipes for this: bagging, boosting, and stacking. Let’s meet the family.

Bagging: many voices, voting in parallel

Bagging (bootstrap aggregating) trains lots of models independently and in parallel, each on a slightly different random sample of the data (sampled with replacement). Then it averages their predictions, or takes a majority vote.

Because each model sees a different slice of reality, their individual quirks and overfitted patterns tend to cancel out when you average them, at least to the extent that their errors are uncorrelated. That’s the headline: bagging reduces variance. It takes a high-variance, twitchy model (a deep decision tree being the poster child) and calms it down by committee.
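
To make the mechanics concrete, here is a minimal hand-rolled sketch of bagging. The synthetic dataset, the choice of 51 trees, and the variable names are all assumptions for illustration, not anything prescribed by the method; in practice you would more likely reach for scikit-learn’s BaggingClassifier or a random forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary data stands in for a real dataset (assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
n_models = 51  # odd, so a binary majority vote can never tie
votes = np.zeros(len(X_test))

for _ in range(n_models):
    # Bootstrap sample: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    votes += tree.predict(X_test)  # each tree casts a 0/1 vote per test row

# Majority vote: class 1 wins where more than half the trees chose it.
bagged_pred = (votes > n_models / 2).astype(int)
single_pred = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

bagged_acc = (bagged_pred == y_test).mean()
single_acc = (single_pred == y_test).mean()
print(f"single tree: {single_acc:.3f}  bagged trees: {bagged_acc:.3f}")
```

On noisy data like this the committee should typically score noticeably higher than the lone tree, though the exact gap depends on the dataset and the seed.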

The most famous bagging method is the random forest, which adds a second sprinkle of randomness (each split considers only a random subset of features) to make the trees even more independent. I’ve covered random forests in their own post, so I’ll resist re-explaining the internals here — just know that “a forest of decorrelated trees, averaged” is bagging in its most successful form.

Boosting: learning from your mistakes, sequentially

Boosting flips the philosophy. Instead of training models independently, it trains them one after another, with each new model focusing on the examples its predecessors got wrong. The committee isn’t a crowd voting at once; it’s a relay team where each runner is told exactly where the last one stumbled.

Because every round chases the residual errors of the round before, boosting attacks bias — it builds an increasingly accurate composite model out of weak learners that, alone, would barely beat a coin flip. The trade-off: it can overfit if you let it run too long, and the sequential nature means you can’t trivially parallelise it like bagging.
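
The relay idea can be sketched in a few lines. What follows is a toy, gradient-boosting-style loop for squared error on made-up regression data, where each depth-1 tree is fitted to the residuals of the ensemble so far. The learning rate, the number of rounds, and the data are assumptions for illustration; real libraries add regularisation and much more.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (assumption): a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # round 0: just predict the mean
train_mse = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    residual = y - prediction  # where the committee is still wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)  # weak learner
    prediction += learning_rate * stump.predict(X)  # small corrective step
    train_mse.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {train_mse[0]:.4f} -> {train_mse[-1]:.4f}")
```

Note that this tracks training error only. Watching a held-out set as rounds accumulate is how you would spot the overfitting mentioned above.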

The dominant flavour today is gradient boosting, with battle-hardened implementations in XGBoost, LightGBM, and CatBoost that quietly win a suspicious number of Kaggle competitions. Again, I’ve given gradient boosting a dedicated write-up, so here the point is just the taxonomy: bagging is parallel and variance-killing; boosting is sequential and bias-killing.

Voting and averaging: the no-fuss ensemble

Before the fancy stuff, there’s the simplest ensemble of all: train a few different models — say logistic regression, a random forest, and a gradient booster — and just combine their outputs. Hard voting takes the majority class; soft voting averages the predicted probabilities (usually better, since it respects confidence). No meta-magic, no extra training. Often surprisingly effective.
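
Here is a tiny worked example of how the two rules can disagree. The probabilities are invented for illustration: three hypothetical models each give the probability of the positive class for four samples.

```python
import numpy as np

# Made-up positive-class probabilities: one row per model, one column per sample.
proba = np.array([
    [0.45, 0.90, 0.20, 0.60],
    [0.45, 0.80, 0.30, 0.40],
    [0.90, 0.70, 0.10, 0.45],
])

# Hard voting: each model casts a 0/1 vote at the 0.5 threshold; majority wins.
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=0) >= 2).astype(int)

# Soft voting: average the probabilities first, then threshold.
soft = (proba.mean(axis=0) >= 0.5).astype(int)

print(hard)  # [0 1 0 0]
print(soft)  # [1 1 0 0]
```

The first sample is the interesting one: two lukewarm “no” votes at 0.45 outnumber one confident “yes” at 0.90 under hard voting, but the average (0.60) tips soft voting the other way. Scikit-learn’s VotingClassifier applies the same two rules to fitted estimators via voting="hard" or voting="soft".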

Stacking: a model that referees the other models

Stacking (stacked generalisation) is the clever cousin. Instead of averaging base models with a fixed rule, it trains another model — a “meta-learner” — to figure out the best way to combine them. Maybe the random forest is trustworthy on easy cases but the booster should win on the hard ones; a stacker can learn exactly that nuance.

The crucial detail is avoiding leakage. You can’t train the meta-learner on predictions the base models made about data they were trained on: they’d look like geniuses and the meta-model would learn nonsense. Scikit-learn’s StackingClassifier handles this for you with cross_val_predict: the meta-learner is trained on the base models’ out-of-fold predictions, while the base models used at prediction time are refitted on the full training data. Honest predictions in, honest combiner out.

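from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data so the snippet runs as-is; swap in your own X and y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),  # the meta-learner / referee
    cv=5,  # meta-learner trains on out-of-fold predictions, preventing leakage
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))  # mean accuracy on the held-out test set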
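
To make the out-of-fold step less of a black box, here is a rough manual version of the same idea. It is a sketch under assumptions, not scikit-learn’s actual internals: the synthetic data, the two base models (a random forest and k-nearest neighbours, chosen arbitrarily), and the use of only the positive-class probability as a meta-feature are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    KNeighborsClassifier(),
]

# Step 1: out-of-fold probabilities. Each training row is predicted by a
# clone of the model that never saw that row during fitting.
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Step 2: the meta-learner is trained only on those honest predictions.
meta = LogisticRegression().fit(oof, y_train)

# Step 3: refit each base model on all training data for prediction time,
# then feed their test-set probabilities to the meta-learner.
test_features = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])
manual_score = meta.score(test_features, y_test)
print(manual_score)
```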

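Blending is the lazy sibling of stacking: instead of cross-validation, you carve off a small holdout set and train the meta-learner on the base models’ predictions for it. Simpler and faster, but the base models never get to learn from the holdout rows, the meta-learner sees only that small slice, and the result is more sensitive to how the split happens to fall. Stacking is usually worth the extra plumbing.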
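
For comparison, a minimal blending sketch might look like the following. The 25% holdout size, the synthetic data, the base models, and the base_features helper are all assumptions made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Carve a holdout off the training data; the base models never see it.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    KNeighborsClassifier(),
]
for model in base_models:
    model.fit(X_fit, y_fit)


def base_features(X_part):
    """Hypothetical helper: stack each base model's positive-class probability."""
    return np.column_stack([m.predict_proba(X_part)[:, 1] for m in base_models])


# The meta-learner trains on holdout predictions only, so nothing leaks.
blender = LogisticRegression().fit(base_features(X_hold), y_hold)
blend_score = blender.score(base_features(X_test), y_test)
print(blend_score)
```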

Which one should you reach for?

A quick rule of thumb:

Model overfitting (high variance)? → Bagging / random forests. Parallel, robust, low-maintenance.
Model underfitting (high bias)? → Boosting / XGBoost. Sequential, accurate, your go-to for tabular leaderboards.
Already have several decent, diverse models and want to squeeze out the last few points? → Stacking (or just soft voting if you want it cheap).
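
One rough way to tell which of the first two situations you are in is to compare training and validation scores. The sketch below does that for two deliberately extreme trees on synthetic data. The diagnose helper and the dataset are assumptions for illustration, and the “big gap means variance, low scores everywhere means bias” reading is a heuristic, not a proof.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary data stands in for a real dataset (assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)


def diagnose(model):
    """Hypothetical helper: mean train and validation accuracy over 5 folds."""
    cv = cross_validate(model, X, y, cv=5, return_train_score=True)
    return cv["train_score"].mean(), cv["test_score"].mean()


# Unlimited depth: memorises the training folds (a high-variance suspect).
deep_train, deep_val = diagnose(DecisionTreeClassifier(random_state=0))
# Depth 1: too simple to fit even the training folds (a high-bias suspect).
stump_train, stump_val = diagnose(DecisionTreeClassifier(max_depth=1, random_state=0))

print(f"deep tree: train={deep_train:.2f}  val={deep_val:.2f}")
print(f"stump:     train={stump_train:.2f}  val={stump_val:.2f}")
```

A near-perfect training score with a much lower validation score points towards bagging; two similarly mediocre scores point towards boosting.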

The takeaway: ensembles aren’t a single algorithm, they’re a strategy, and the strategy that works depends on what’s wrong with your single model. Diagnose first — is it bias or variance? — then pick the recipe. Start with a random forest as a strong baseline, try a gradient booster, and only when you’ve got a few genuinely different models worth combining should you bring in stacking. The crowd is wise, but only if you ask it the right way.

Sources: scikit-learn ensemble guide, StackingClassifier docs, VotingClassifier docs.