Putting a Model in Production: Serving ML Models
Abhay
Your model hits 94% accuracy in the notebook. Everyone claps. Then someone asks the question that turns triumph into cold sweat: “Great — how do users actually call it?”
That gap, between a model that works on your laptop and one that answers real requests at 3 a.m. without you babysitting it, is where most ML projects quietly go to die. A .ipynb file is not a product. Let’s walk the bridge from notebook to served endpoint.
Step 1: Get the model out of memory
Right now your trained model lives in a Python variable that vanishes the moment the kernel restarts. You need to serialize it — freeze its weights and structure to disk. Your choices, roughly in order of how much you should think about them:
- pickle is Python’s built-in serializer. It works, but it’s slow, version-fragile, and a genuine security hole (unpickling untrusted data can execute arbitrary code). Fine for a quick local save, never for a file you downloaded or received from someone else.
- joblib — pickle’s smarter cousin for scikit-learn and anything with big NumPy arrays. The default for classic ML.
- ONNX is the interesting one. It exports your model to a framework-neutral graph, so a PyTorch model can be served by a C++ runtime with no Python in the loop. ONNX Runtime can speed up inference through graph optimization and quantization (figures of 2–10x are commonly cited, though the gain depends heavily on the model and the hardware). If latency matters, this is your lever.
- safetensors — Hugging Face’s format for neural net weights. It’s just the tensors, loads fast, and (unlike pickle) can’t smuggle in code. The default for transformer-era models.
Rule of thumb: joblib for scikit-learn, safetensors for deep learning weights, ONNX when you need speed or cross-platform portability.
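To make the pickle warning concrete, here is a minimal and deliberately harmless sketch using only the standard library. The class name `NotReallyAModel` is made up for illustration; it stands in for a file that claims to be a saved model. A real attack would call something far nastier than `eval("6 * 7")`.

```python
import pickle


class NotReallyAModel:
    """Hypothetical stand-in for a malicious file posing as a saved model."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of any object state.
        # An attacker would put os.system or similar here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(NotReallyAModel())

# Loading never rebuilds NotReallyAModel; it simply runs the stored call.
loaded = pickle.loads(blob)
print(type(loaded).__name__, loaded)  # int 42
```

The loader ran code chosen by whoever wrote the file, which is exactly why formats that store only tensors or a graph are the safer choice for anything you did not produce yourself.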
Step 2: Wrap it in an API
A saved model on disk is a paperweight until something can talk to it. The lingua franca is a REST endpoint, and in 2026 the default tool is still FastAPI — async, fast, and it generates request validation and OpenAPI docs from your type hints for free.
The pattern that matters: load the model once at startup, not per request. Loading weights is expensive; doing it on every call is the rookie mistake that tanks your latency.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

ml = {}


@asynccontextmanager
async def lifespan(app: FastAPI):
    ml["model"] = joblib.load("model.joblib")  # load once
    yield
    ml.clear()


app = FastAPI(lifespan=lifespan)


class Features(BaseModel):
    values: list[float]


@app.post("/predict")
def predict(payload: Features):
    prediction = ml["model"].predict([payload.values])
    return {"prediction": prediction[0].item()}
That’s a production-shaped skeleton: validated input, a model held in memory, a clean JSON response. Add a /health route and you can already put it behind a load balancer.
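As a rough sketch of what that /health route could check, here is the readiness logic on its own, kept framework-free so it runs without a server. The function name `health_status` and the response bodies are assumptions, not part of the skeleton above. In FastAPI you would call it from an `@app.get("/health")` handler and return the body with the matching status code, for example via `JSONResponse`.

```python
def health_status(models: dict) -> tuple[int, dict]:
    """Return (HTTP status code, JSON body) for a readiness probe."""
    if "model" in models:
        return 200, {"status": "ok"}
    # 503 tells the load balancer to keep traffic away until loading finishes.
    return 503, {"status": "model not loaded"}


ml = {}
print(health_status(ml))   # before startup has loaded the model
ml["model"] = object()     # placeholder for joblib.load("model.joblib")
print(health_status(ml))   # after startup
```

Reporting "not ready" until the model is actually in memory keeps a freshly started replica from receiving requests it cannot answer yet.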
Step 3: Decide when it runs
Not every prediction needs to happen this instant. Pick your serving mode deliberately:
- Real-time (online) inference — one request, one answer, now. Fraud checks, search ranking, recommendations. Latency is king.
- Batch inference — score a million rows overnight on a schedule. Throughput is king; nobody’s waiting. Cheaper and simpler — don’t stand up a 24/7 endpoint for a job a cron task can do at 2 a.m.
Choosing batch when you only think you need real-time is one of the easiest ways to save money and stress.
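For a sense of how small a batch job can be, here is a sketch that streams rows through a model in fixed-size chunks and writes the predictions out. Everything specific in it is assumed for illustration: the `score` function is a stand-in for `model.predict`, the `id` and `amount` columns are invented, and in-memory strings take the place of real input and output files.

```python
import csv
import io
from itertools import islice


def score(rows):
    """Stand-in for model.predict: flags rows whose amount exceeds 100."""
    return [1 if float(r["amount"]) > 100 else 0 for r in rows]


def batch_score(reader, writer, chunk_size=2):
    """Stream rows through the model in fixed-size chunks."""
    writer.writeheader()
    total = 0
    while True:
        # Pull at most chunk_size rows so memory use stays bounded.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        for row, pred in zip(chunk, score(chunk)):
            writer.writerow({"id": row["id"], "prediction": pred})
        total += len(chunk)
    return total


source = io.StringIO("id,amount\n1,20\n2,250\n3,99\n4,101\n5,5\n")
sink = io.StringIO()
n = batch_score(
    csv.DictReader(source),
    csv.DictWriter(sink, fieldnames=["id", "prediction"]),
)
print(n)  # 5
print(sink.getvalue())
```

Point the reader and writer at real files, schedule the script with cron, and there is no endpoint to keep alive between runs.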
Step 4: Box it up and scale it out
To run anywhere reliably, containerize it. A Dockerfile pins your Python version, the exact onnxruntime build, and the model file into one immutable image — killing the “works on my machine” demon. From there, an orchestrator (Kubernetes, or a managed equivalent) handles scaling: more replicas behind a load balancer for throughput, autoscaling for spiky traffic.
Two more non-negotiables:
- Versioning. Tag every deployed model (fraud-v3) and log which version served which prediction. When metrics drift, you’ll want to know exactly what was live. It also makes rolling back, or canarying v4 to 5% of traffic, a one-line change instead of an archaeology dig.
- Latency vs throughput. They pull in opposite directions. Batching requests together boosts throughput but adds wait time. Measure p99 latency, not the average: the average hides the angry tail of users who waited two seconds.
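Both points can be sketched in a few lines of standard-library Python. The helper names `pick_version` and `p99`, the hash-based split, and the latency numbers are all assumptions for illustration. In practice the traffic split usually lives in your load balancer or serving platform rather than in application code.

```python
import hashlib
import statistics


def pick_version(request_id: str, canary_percent: int = 5) -> str:
    """Route a stable slice of traffic to the canary version."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the request to a bucket 0..99; the same ID always lands in the same one.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "fraud-v4" if bucket < canary_percent else "fraud-v3"


def p99(latencies_ms):
    """99th percentile of the observed latencies."""
    return statistics.quantiles(latencies_ms, n=100, method="inclusive")[98]


# Log the version next to every prediction so drift can be traced later.
print({"request_id": "req-123", "model_version": pick_version("req-123")})

# 97 fast requests and 3 slow ones (made-up numbers).
latencies = [50.0] * 97 + [2000.0] * 3
print(statistics.mean(latencies))  # 108.5
print(p99(latencies))              # 2000.0
```

The mean looks healthy at about 109 ms while the p99 exposes the two-second tail. Rolling back the canary is the one-line change of setting `canary_percent` to 0.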
When to stop hand-rolling
FastAPI is perfect until you’re juggling GPUs and dynamic batching. Then reach for purpose-built serving: BentoML to package and ship, NVIDIA Triton or Ray Serve for high-throughput GPU inference, vLLM for LLMs, and KServe if you live on Kubernetes and want autoscaling (even scale-to-zero) for free. Managed platforms — SageMaker, Vertex AI — trade flexibility for not paging you at midnight.
The takeaway
Serving a model is four concrete moves: serialize it (joblib / safetensors / ONNX), wrap it in an API that loads weights once, choose batch vs real-time on purpose, and containerize + version it so you can scale and roll back without fear. Start with FastAPI in Docker; reach for Triton, Ray, or KServe only when traffic forces your hand. Do that, and your 94% model stops being a party trick and starts being a product.
Sources: PyImageSearch — FastAPI for MLOps, ML Journey — Low-Latency Inference with FastAPI and ONNX, TrueFoundry — Model Deployment Tools 2026, Spheron — KServe vs Seldon vs BentoML.