Experiment Tracking and Reproducibility in ML

There is a specific flavour of dread reserved for the sentence “but it worked on my machine last Tuesday.” A model that scored 94% accuracy in a meeting has quietly become a model that scores 88% today, and nobody can say why. Was it a different random seed? A library that auto-updated overnight? A teammate who “cleaned up” the training data? You don’t know, because last Tuesday lives only in a Jupyter notebook that’s been re-run nineteen times since.

This is the central misery of machine learning without discipline: results that can’t be reproduced aren’t results. They’re rumours.

Why ML breaks reproducibility harder than normal code

Regular software is deterministic enough that “run it again” usually works. ML laughs at this. A training run is a soup of moving parts: stochastic optimisation, random weight initialisation, shuffled batches, and even non-deterministic GPU kernels that give slightly different floating-point sums depending on the hardware’s mood. Change any of them and your numbers drift.
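The floating-point part is easy to see without a GPU. Here is a minimal sketch in plain Python, with values that are arbitrary and chosen only for illustration: float addition is not associative, so a parallel reduction that adds the same numbers in a different order can land on a slightly different total.

```python
import functools
import operator
import random

# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
grouping_matters = left != right

# The same effect at scale: add identical values in two different orders.
# reduce() performs plain left-to-right addition, like a naive accumulator.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) * 10 ** rng.randint(-8, 8) for _ in range(10_000)]
shuffled = values[:]
rng.shuffle(shuffled)
drift = abs(functools.reduce(operator.add, values)
            - functools.reduce(operator.add, shuffled))

print(f"grouping changes the sum: {grouping_matters}")
print(f"drift after reordering 10,000 terms: {drift:.3e}")
```

The drift is tiny for a single sum, but a training run performs a very large number of such reductions, and small differences can compound over many optimisation steps.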

Worse, the inputs are bigger than your code. The 2025 AI Magazine survey on ML reproducibility identified five pillars you have to pin down: code versioning, data access, data versioning, experiment logging, and pipeline creation. Miss one and the whole thing wobbles. Git happily versions your train.py, but it has no idea your 40 GB dataset got re-labelled, or that you bumped scikit-learn from 1.4 to 1.6 and a default quietly changed under you.
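As a rough illustration of pinning down the two things Git cannot see here, the sketch below fingerprints a data file and records installed package versions using only the standard library. The function names file_sha256 and installed_versions, and the paths in the usage comment, are hypothetical; a dedicated data-versioning tool does this more thoroughly.

```python
import hashlib
from importlib import metadata
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def installed_versions(package_names):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical usage (the path and package list are placeholders):
# fingerprint = file_sha256("data/train.csv")
# versions = installed_versions(["scikit-learn", "numpy"])
```

If the dataset is re-labelled, the fingerprint changes; if a dependency is bumped, the version map changes. Storing both next to each result makes either kind of drift visible.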

Track everything that touches a result

If a number ends up in a slide, you should be able to produce four things that explain it:

Parameters — the learning rate, batch size, architecture, and every hyperparameter you fiddled with.
Metrics — accuracy, loss, F1, and the validation curve, not just the final headline figure.
Artifacts — the trained model file, plots, confusion matrices, the actual outputs.
Code + data versions — the exact Git commit and the exact data snapshot that produced it.

That last item is the one everyone skips and everyone regrets.
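One way to picture the four pieces together is as a single record per run. The sketch below is a hypothetical schema, not the format of any particular tool: RunRecord and its field names are made up for illustration, and the missing() helper simply flags which pieces were never filled in.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """Everything needed to explain one reported number (hypothetical schema)."""
    params: dict
    metrics: dict
    artifacts: list = field(default_factory=list)  # paths to model files, plots
    git_commit: str = "unknown"
    data_version: str = "unknown"                  # e.g. a dataset hash or tag

    def missing(self):
        """Return the names of the pieces that were never recorded."""
        gaps = []
        if not self.params:
            gaps.append("params")
        if not self.metrics:
            gaps.append("metrics")
        if not self.artifacts:
            gaps.append("artifacts")
        if self.git_commit == "unknown":
            gaps.append("git_commit")
        if self.data_version == "unknown":
            gaps.append("data_version")
        return gaps

    def to_json(self):
        """Serialise the record so it can be stored next to the model file."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = RunRecord(
    params={"learning_rate": 0.01, "batch_size": 64},
    metrics={"val_accuracy": 0.94},
    artifacts=["model.pkl"],
    git_commit="a1b9f3c",
)
print(record.missing())
```

Here the record has params, metrics, an artifact, and a commit, but no data version, which is the gap the check reports.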

The tooling: MLflow, W&B, and DVC

You don’t have to build this yourself. Three tools dominate in 2026, and the honest answer is that serious teams usually run two of the three together.

MLflow is the open-source, self-hostable workhorse. Its tracking API is refreshingly blunt: roughly three calls (start_run, log_params, log_metrics) cover 80% of what anyone logs, and it ships with a model registry. The UI looks like it was designed in 2019, but it works and you own it.

Weights & Biases is what researchers reach for during active training. Real-time dashboards, native hyperparameter Sweeps, and shareable Reports make it genuinely pleasant. The trade-off is cost — around $50/user/month for teams — and metered artifact storage.

DVC solves the part the other two are weakest at: data versioning. It bolts onto Git so your data, models, and pipeline DAG (dvc.yaml) get versioned alongside your code, giving you audit-grade reproducibility. It assumes you’re comfortable with Git and brings little experiment tracking of its own — which is exactly why it pairs so well with MLflow.

A common, sane stack: MLflow for runs + DVC for data + Git for code. Each covers the others’ blind spots.
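To make the hand-off between the tools concrete, here is a hedged sketch of pulling the data hash out of a DVC pointer file so it can be attached to a run. It assumes the pointer file records an md5 field for its output, which may vary between DVC versions, and it uses a regular expression rather than a YAML parser to stay dependency-free. The function name dvc_data_hash and the file path are hypothetical.

```python
import re
from pathlib import Path

# Assumed layout: a line such as "- md5: <32 hex chars>" (directories end in ".dir").
_MD5_LINE = re.compile(
    r"^\s*-?\s*md5:\s*([0-9a-f]{32}(?:\.dir)?)\s*$", re.MULTILINE
)


def dvc_data_hash(pointer_path):
    """Return the first md5 recorded in a .dvc pointer file, or None if absent."""
    text = Path(pointer_path).read_text(encoding="utf-8")
    match = _MD5_LINE.search(text)
    return match.group(1) if match else None


# Hypothetical usage, inside an active MLflow run:
# mlflow.set_tag("data_md5", dvc_data_hash("data/train.csv.dvc"))
```

With the hash stored as a tag, each tracked run points at one specific data snapshot, and the Git commit that contains the pointer file ties the code and the data together.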

What logging actually looks like

The barrier to entry is lower than the anxiety suggests. Here’s MLflow capturing a run end to end:

import subprocess

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X_train, y_train, X_val, y_val are assumed to be loaded already.
# Read the current commit from Git instead of hard-coding it.
git_sha = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()

mlflow.set_experiment("churn-prediction")

with mlflow.start_run():
    params = {"n_estimators": 300, "max_depth": 12, "random_state": 42}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    acc = accuracy_score(y_val, model.predict(X_val))

    mlflow.log_params(params)                 # what you chose
    mlflow.log_metric("val_accuracy", acc)    # what happened
    mlflow.sklearn.log_model(model, "model")  # the artifact itself
    mlflow.set_tag("git_commit", git_sha)     # the exact code

Four extra lines turn an unrepeatable experiment into one with a permanent receipt.

Reproducibility practices that pay rent

Tracking tells you what you did. These habits make it possible to redo it:

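Set every seed. Seed Python's random module, NumPy, and your framework (torch.manual_seed, etc.). On CUDA, set torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False, and call torch.use_deterministic_algorithms(True) if you need PyTorch to raise an error on operations that have no deterministic implementation. Then run with a few different seeds anyway, so you know whether that 2% gain is real or just luck.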
Freeze the environment. A requirements.txt with pinned versions is the floor; a Docker image or Conda lockfile is the ceiling. “Latest” is not a version.
Version your data, not just your code. Use DVC (or equivalent) so every run points at an immutable data snapshot. Re-labelling shouldn’t silently rewrite history.
Stamp the commit on the run. Log the Git SHA with each experiment so “the model from Tuesday” maps to actual lines of code.
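The seed advice can be sketched as two small helpers. This is a minimal version under stated assumptions: set_global_seed and score_across_seeds are hypothetical names, NumPy and PyTorch are seeded only if they happen to be installed, and fake_train_and_score is a stand-in for a real training run.

```python
import random
import statistics


def set_global_seed(seed):
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


def score_across_seeds(train_and_score, seeds):
    """Run a training function once per seed; return (mean, stdev) of the scores."""
    scores = []
    for seed in seeds:
        set_global_seed(seed)
        scores.append(train_and_score())
    return statistics.mean(scores), statistics.stdev(scores)


def fake_train_and_score():
    # Hypothetical stand-in for a real training run: a noisy accuracy value.
    return 0.90 + random.uniform(-0.02, 0.02)


mean, spread = score_across_seeds(fake_train_and_score, seeds=[0, 1, 2, 3, 4])
print(f"accuracy {mean:.3f} +/- {spread:.3f} over 5 seeds")
```

Reporting the spread alongside the mean is the point: a gain smaller than the seed-to-seed variation is hard to distinguish from luck.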

The takeaway

Do this next, today: wrap your current training script in an mlflow.start_run(), log your params, metrics, and the Git commit, pin your dependencies, and put your dataset under DVC. It’s an afternoon of work that converts “it worked on my machine last Tuesday” into “here’s the exact run, rebuild it whenever you like.” Future-you, staring down a regression three weeks from now, will be quietly, enormously grateful.