Experiment Tracking and Reproducibility in ML
Abhay
There is a specific flavour of dread reserved for the sentence “but it worked on my machine last Tuesday.” A model that scored 94% accuracy in a meeting has quietly become a model that scores 88% today, and nobody can say why. Was it a different random seed? A library that auto-updated overnight? A teammate who “cleaned up” the training data? You don’t know, because last Tuesday lives only in a Jupyter notebook that’s been re-run nineteen times since.
This is the central misery of machine learning without discipline: results that can’t be reproduced aren’t results. They’re rumours.
Why ML breaks reproducibility harder than normal code
Regular software is deterministic enough that “run it again” usually works. ML laughs at this. A training run is a soup of moving parts: stochastic optimisation, random weight initialisation, shuffled batches, and even non-deterministic GPU kernels, which can add the same numbers in a different order from run to run and so return slightly different floating-point sums. Change any of them and your numbers drift.
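Why would the order of a sum matter? Floating-point addition is not associative, so regrouping the same numbers changes the rounding. Here is a small illustration in plain Python, with no GPU involved; the helper name naive_sum and the random test values are made up for the demo:

```python
import math
import random

# Same three numbers, different grouping, different rounding.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False


def naive_sum(values):
    """Plain left-to-right accumulation, the way a simple loop would do it."""
    total = 0.0
    for value in values:
        total += value
    return total


# The same values summed in two different orders.
rng = random.Random(0)
values = [rng.uniform(-1e6, 1e6) for _ in range(100_000)]
shuffled = values[:]
rng.shuffle(shuffled)

naive_gap = abs(naive_sum(values) - naive_sum(shuffled))
# math.fsum is exactly rounded, so it does not depend on the order.
exact_gap = abs(math.fsum(values) - math.fsum(shuffled))
print(naive_gap, exact_gap)
```

A parallel reduction on a GPU is, roughly speaking, the shuffled case: the order in which partial sums are combined is not guaranteed, so the last few bits can differ between runs.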
Worse, the inputs are bigger than your code. The 2025 AI Magazine survey on ML reproducibility identified five pillars you have to pin down: code versioning, data access, data versioning, experiment logging, and pipeline creation. Miss one and the whole thing wobbles. Git happily versions your train.py, but it has no idea your 40 GB dataset got re-labelled, or that you bumped scikit-learn from 1.4 to 1.6 and a default quietly changed under you.
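To make those invisible inputs visible, one option is to fingerprint them at the start of every run. The sketch below uses only the standard library; the function names and the fields in the returned dictionary are illustrative choices, not part of any tool mentioned here:

```python
import hashlib
import os
import platform
import sys
import tempfile
from importlib import metadata


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def environment_fingerprint(data_path, packages):
    """Collect the things Git does not see: interpreter, libraries, data."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": package_versions(packages),
        "data_sha256": file_sha256(data_path),
    }


# Demo on a throwaway file standing in for a real dataset.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_path = tmp.name
fingerprint = environment_fingerprint(demo_path, ["scikit-learn", "numpy"])
os.remove(demo_path)
print(fingerprint)
```

If the data hash or a package version differs between two runs, you have a concrete suspect instead of a hunch.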
Track everything that touches a result
If a number ends up in a slide, you should be able to produce four things that back it up:
- Parameters — the learning rate, batch size, architecture, and every hyperparameter you fiddled with.
- Metrics — accuracy, loss, F1, and the validation curve, not just the final headline figure.
- Artifacts — the trained model file, plots, confusion matrices, the actual outputs.
- Code + data versions — the exact Git commit and the exact data snapshot that produced it.
That last bullet is the one everyone skips and everyone regrets.
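Before reaching for a tool, it can help to see how small the record actually is. This is a hypothetical, tool-free sketch: the RunRecord class, its field names, and the sample values are invented for illustration and are not any library's format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    params: dict        # what you chose
    git_commit: str     # exact code version
    data_version: str   # exact data snapshot (hash, tag, or DVC revision)
    metrics: dict = field(default_factory=dict)    # name -> list of [step, value]
    artifacts: list = field(default_factory=list)  # paths to saved outputs

    def log_metric(self, name, value, step):
        """Append to the metric's history so the whole curve is kept."""
        self.metrics.setdefault(name, []).append([step, value])

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = RunRecord(
    params={"learning_rate": 0.01, "batch_size": 64},
    git_commit="a1b9f3c",
    data_version="hypothetical-snapshot-v1",
)
for step, value in enumerate([0.71, 0.80, 0.84]):
    record.log_metric("val_accuracy", value, step)
record.artifacts.append("model.pkl")

restored = json.loads(record.to_json())
print(restored["metrics"]["val_accuracy"])
```

The tracking tools discussed in this article store essentially this, plus a UI and a server so the record outlives your laptop.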
The tooling: MLflow, W&B, and DVC
You don’t have to build this yourself. Three tools dominate in 2026, and the honest answer is that serious teams usually run two of the three together.
MLflow is the open-source, self-hostable workhorse. Its tracking API is refreshingly blunt: roughly three calls (start_run, log_params, log_metrics) cover 80% of what anyone logs, and it ships with a model registry. The UI looks like it was designed in 2019, but it works and you own it.
Weights & Biases is what researchers reach for during active training. Real-time dashboards, native hyperparameter Sweeps, and shareable Reports make it genuinely pleasant. The trade-off is cost — around $50/user/month for teams — and metered artifact storage.
DVC solves the part the other two are weakest at: data versioning. It bolts onto Git so your data, models, and pipeline DAG (dvc.yaml) get versioned alongside your code, giving you audit-grade reproducibility. It assumes you’re comfortable with Git and brings little experiment tracking of its own — which is exactly why it pairs so well with MLflow.
A common, sane stack: MLflow for runs + DVC for data + Git for code. Each covers the others’ blind spots.
What logging actually looks like
The barrier to entry is lower than the anxiety suggests. Here’s MLflow capturing a run end to end:
```python
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X_train, y_train, X_val, y_val are assumed to be loaded already.
mlflow.set_experiment("churn-prediction")

with mlflow.start_run():
    params = {"n_estimators": 300, "max_depth": 12, "random_state": 42}
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))

    mlflow.log_params(params)                 # what you chose
    mlflow.log_metric("val_accuracy", acc)    # what happened
    mlflow.sklearn.log_model(model, "model")  # the artifact itself
    mlflow.set_tag("git_commit", "a1b9f3c")   # the exact code
```
Four extra lines turn an unrepeatable experiment into one with a permanent receipt.
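The commit hash in that example is typed in by hand, which is fine for a demo but easy to forget in real life. One way to read it from the repository instead is sketched below; the function names are hypothetical, and the fallback values for "not in a Git repo" are an assumption about what you would want:

```python
import subprocess


def current_git_commit(short=True):
    """Return the current commit hash, or "unknown" if Git cannot provide one."""
    command = ["git", "rev-parse", "HEAD"]
    if short:
        command.insert(2, "--short")
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"
    return result.stdout.strip()


def working_tree_is_dirty():
    """True if there are uncommitted changes, None if Git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "status", "--porcelain"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return bool(result.stdout.strip())


commit = current_git_commit()
dirty = working_tree_is_dirty()
print(commit, dirty)
```

The value of commit can then be passed to mlflow.set_tag("git_commit", commit) in place of the literal string. Logging the dirty flag as a second tag is worth considering, since a hash says nothing about uncommitted edits.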
Reproducibility practices that pay rent
Tracking tells you what you did. These habits make it possible to redo it:
- Set every seed. Pin Python, NumPy, and your framework (torch.manual_seed, etc.). On CUDA, set torch.backends.cudnn.deterministic = True. Then run with a few different seeds anyway, so you know whether that 2% gain is real or just luck.
- Freeze the environment. A requirements.txt with pinned versions is the floor; a Docker image or Conda lockfile is the ceiling. “Latest” is not a version.
- Version your data, not just your code. Use DVC (or equivalent) so every run points at an immutable data snapshot. Re-labelling shouldn’t silently rewrite history.
- Stamp the commit on the run. Log the Git SHA with each experiment so “the model from Tuesday” maps to actual lines of code.
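The seeding habit from the list above might look like this in practice. It is a sketch under assumptions: set_global_seeds and toy_experiment are hypothetical names, NumPy and PyTorch are seeded only if they happen to be installed, and the "experiment" is a stand-in that returns a noisy number rather than training anything.

```python
import random
import statistics


def set_global_seeds(seed):
    """Seed Python, plus NumPy and PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False  # autotuning can pick different kernels
    except ImportError:
        pass


def toy_experiment(seed):
    """Hypothetical stand-in for a training run: a noisy 'accuracy'."""
    set_global_seeds(seed)
    return 0.90 + random.gauss(0, 0.01)


# Same seed, same result: the run is repeatable.
print(toy_experiment(3) == toy_experiment(3))  # True

# Different seeds: how much of a "gain" is just noise?
scores = [toy_experiment(seed) for seed in (0, 1, 2, 3, 4)]
mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)
print(f"{mean_score:.4f} +/- {spread:.4f}")
```

If the improvement you are excited about is smaller than the spread across seeds, it is probably not an improvement yet.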
The takeaway
Do this next, today: wrap your current training script in an mlflow.start_run(), log your params, metrics, and the Git commit, pin your dependencies, and put your dataset under DVC. It’s an afternoon of work that converts “it worked on my machine last Tuesday” into “here’s the exact run, rebuild it whenever you like.” Future-you, staring down a regression three weeks from now, will be quietly, enormously grateful.