Gradient Boosting and XGBoost: Why It Keeps Winning

Every few years, someone declares that deep learning has finally conquered tabular data and the boring old gradient-boosted trees can retire. And every few years, the Kaggle leaderboard quietly hands the trophy back to XGBoost. As of 2026, gradient boosting still wins most structured-data competitions and still beats fancy transformer architectures on the spreadsheet-shaped problems that make up the bulk of real-world ML. So let’s talk about why this 25-year-old idea refuses to die.

Bagging vs boosting: committees vs apprentices

To understand boosting, it helps to know its sibling, bagging, the trick behind random forests. Bagging trains many deep trees independently, each on a bootstrap sample of the data (rows drawn at random with replacement), then averages their predictions or takes a majority vote. It’s a committee where everyone studies in their own room and you take the consensus. This reduces variance and is gloriously parallel, but no single member ever learns from anyone else’s mistakes.
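
Here is a minimal sketch of that committee, built by hand so the moving parts are visible. The synthetic dataset, the tree count, and the `max_features="sqrt"` setting are all assumptions chosen for illustration; a real random forest adds more refinements than this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, for illustration only).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many rows as the training set has, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(
        max_features="sqrt",                       # random feature subset at each split
        random_state=int(rng.integers(1_000_000)),
    )
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Every tree votes on its own; the ensemble takes the majority.
votes = np.stack([t.predict(X_test) for t in trees])   # shape (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = (trees[0].predict(X_test) == y_test).mean()
bagged_acc = (bagged_pred == y_test).mean()
print(f"single tree accuracy:     {single_acc:.3f}")
print(f"bagged ensemble accuracy: {bagged_acc:.3f}")
```

Note that the loop body never looks at the other trees, which is why bagging parallelizes so easily.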

Boosting is the opposite philosophy. Instead of one big committee, you train a long chain of weak learners — typically shallow trees, often just a few levels deep. The catch: each new tree is trained specifically to fix the errors the previous ensemble made. It’s less a committee, more an apprenticeship. The first tree makes a rough guess, the second studies where the first went wrong and patches it, the third patches what’s still broken, and so on for hundreds of rounds. Many weak models, each correcting its predecessor, add up to one strong one.

The “gradient” in gradient boosting

So what does each new tree actually fit? Here’s the elegant bit. After the current ensemble makes its predictions, you compute the residuals (the gap between truth and prediction), and the next tree is trained to predict those residuals. Add a fraction of that tree’s output to the predictions (the fraction is the learning rate), recompute the gaps, repeat.
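
The loop is short enough to write out. The following is a sketch for squared-error regression only, assuming a toy sine-wave dataset, depth-2 trees, and a step size of 0.1 (all illustrative choices, and the `predict` helper is hypothetical); production libraries add regularization and many optimizations on top.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (an assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

learning_rate = 0.1
n_rounds = 100

# Round zero: a constant guess, the mean of the target.
base = y.mean()
pred = np.full_like(y, base)
trees = []
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(n_rounds):
    residuals = y - pred                                   # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)                # nudge predictions toward the truth
    trees.append(tree)
    mse_history.append(np.mean((y - pred) ** 2))

def predict(X_new):
    """Sum the base value and every tree's scaled correction."""
    out = np.full(len(X_new), base)
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out

print(f"training MSE after 0 rounds: {mse_history[0]:.4f}")
print(f"training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

The final model is nothing more than the starting constant plus a sum of small corrections, which is why prediction is just a loop over the stored trees.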

The “gradient” name comes from generalizing this: for squared-error loss, those residuals are (up to a constant factor) the negative gradient of the loss with respect to the current predictions. For any other loss, each tree fits the negative gradient itself, often called the pseudo-residuals. Either way, you’re doing gradient descent, except instead of stepping through a space of numbers, you’re stepping through a space of functions, one tree at a time. That reframing is what lets boosting optimize any differentiable loss, from squared error to log-loss to ranking objectives, with the same machinery.
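
For readers who want the algebra, here is the short derivation behind that claim. The notation is an assumption introduced here, not something from the text above: F is the current ensemble’s prediction, y the target, p the predicted probability, h_m the tree added at round m, and \eta the learning rate.

```latex
% Squared error: the negative gradient is the ordinary residual.
L(y, F) = \tfrac{1}{2}(y - F)^2
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F} = y - F

% Log-loss, with F as the log-odds and p = 1 / (1 + e^{-F}):
L(y, F) = -\bigl[\, y \log p + (1 - y) \log(1 - p) \,\bigr]
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F} = y - p

% One boosting round: fit h_m to the negative gradients r_i, then take a step.
r_i = -\left.\frac{\partial L(y_i, F)}{\partial F}\right|_{F = F_{m-1}(x_i)},
\qquad
F_m(x) = F_{m-1}(x) + \eta\, h_m(x)
```

In both cases the quantity the tree fits looks like "truth minus current prediction", which is why the residual intuition carries over to classification.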

Why XGBoost, LightGBM, and CatBoost dominate

Gradient boosting is the idea; the libraries are why it’s everywhere. Three frameworks own the space, and they’re genuinely different beasts:

XGBoost is the reliable all-rounder. It added serious regularization, clever sparsity-aware splitting, and parallelized tree construction. When in doubt, this is the safe default.
LightGBM is built for speed and scale. Its Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) make it blisteringly fast on large, numeric-heavy tables, which makes it the right pick when you’re iterating dozens of times an hour.
CatBoost is the one to reach for when you have meaningful categorical columns. Its ordered boosting and native categorical handling mean far less preprocessing and fewer footguns around leakage.

All three are fast, handle missing values gracefully, and squeeze remarkable accuracy out of structured data. That’s why winning Kaggle solutions are still, overwhelmingly, gradient boosting — often stacked several layers deep.

Three knobs that matter (and one trap)

Most of your results come from three hyperparameters:

learning_rate: how much each tree is allowed to nudge predictions. Smaller values usually generalize better but need more trees to reach the same fit. Think of it as step size.
n_estimators: how many trees in the chain. More isn’t automatically better; past a certain point, extra trees start fitting noise.
max_depth: how complex each individual tree is. Boosting likes shallow trees (3–8); deep ones overfit fast.
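
To watch the first two knobs interact, the sketch below tracks validation error round by round for a large and a small learning rate. It uses scikit-learn’s `GradientBoostingRegressor` as a stand-in (not XGBoost), purely because its `staged_predict` method makes per-round predictions easy to get; the synthetic data and the specific values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic noisy regression data (an assumption, for illustration only).
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10,
                       noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for lr in (0.5, 0.05):
    model = GradientBoostingRegressor(
        n_estimators=300, learning_rate=lr, max_depth=3, random_state=0
    )
    model.fit(X_train, y_train)
    # staged_predict yields validation predictions after 1, 2, ..., n_estimators trees.
    val_mse = [mean_squared_error(y_val, p) for p in model.staged_predict(X_val)]
    best_round = int(np.argmin(val_mse)) + 1          # convert 0-based index to a tree count
    results[lr] = (best_round, val_mse[best_round - 1])
    print(f"learning_rate={lr}: best validation MSE "
          f"{val_mse[best_round - 1]:.1f} at {best_round} trees")
```

The round at which validation error bottoms out is the number early stopping tries to find automatically, and it typically arrives later when the learning rate is smaller.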

The trap: because each tree chases the last one’s errors, boosting will happily memorize your training data — including its noise — if you let it run too long. The fix is the classic combo: a low learning rate paired with early stopping, which halts training once a validation score stops improving.

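from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Example data only; swap in your own feature matrix X and labels y.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a validation set for early stopping to monitor.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Passing early_stopping_rounds to the constructor needs XGBoost 1.6 or newer.
model = XGBClassifier(
    n_estimators=2000,         # an upper bound; early stopping decides the real count
    learning_rate=0.03,        # small steps, more trees
    max_depth=4,               # shallow weak learners
    subsample=0.8,             # row sampling adds regularization
    early_stopping_rounds=50,  # stop after 50 rounds with no validation improvement
    eval_metric="logloss",
)

model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
# best_iteration is a 0-based index, so the tree count is one more.
print("Trees actually used:", model.best_iteration + 1)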
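
If the stopping rule feels like magic, it is really just a patience counter. Here is a simplified sketch of that logic in plain Python; the function name `best_round_with_patience` and the loss values are made up for illustration, and this is not XGBoost’s actual implementation.

```python
def best_round_with_patience(val_losses, patience):
    """Return (stop_round, best_round), both 0-based, for per-round validation losses."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx    # new best: reset the clock
        elif round_idx - best_round >= patience:
            return round_idx, best_round               # no improvement for `patience` rounds
    return len(val_losses) - 1, best_round             # ran out of rounds first

# Hypothetical validation losses: improving, then drifting upward.
losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.46, 0.49, 0.50]
stop_round, best_round = best_round_with_patience(losses, patience=3)
print(stop_round, best_round)   # prints: 6 3
```

Training halts at round 6, but the model worth keeping is the one from round 3, which is the distinction between where training stopped and the best iteration.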

The takeaway

When you’re staring at a CSV — rows, columns, a target to predict — don’t reach for a neural network. Reach for gradient boosting. Start with XGBoost or LightGBM as a baseline (CatBoost if categoricals dominate), set a low learning rate with a generous tree budget, and let early stopping find the sweet spot. You’ll have a strong model in minutes, and on tabular data it’ll very likely beat whatever deep architecture you were tempted to spend the week building. The leaderboard has been telling us this for over a decade. It’s worth listening.
