Putting a Model in Production: Serving ML Models

Your model hits 94% accuracy in the notebook. Everyone claps. Then someone asks the question that turns triumph into cold sweat: “Great — how do users actually call it?”

That gap, between a model that works on your laptop and one that answers real requests at 3 a.m. without you babysitting it, is where most ML projects quietly go to die. A .ipynb file is not a product. Let’s walk the bridge from notebook to served endpoint.

Step 1: Get the model out of memory

Right now your trained model lives in a Python variable that vanishes the moment the kernel restarts. You need to serialize it — freeze its weights and structure to disk. Your choices, roughly in order of how much you should think about them:

pickle: Python’s built-in. It works, but it can be slow for large arrays, it is fragile across Python and library versions, and it is a genuine security hole (unpickling untrusted data can execute arbitrary code). Fine for a quick local save, never for loading a file you didn’t create yourself.
joblib: pickle’s smarter cousin for scikit-learn and anything with big NumPy arrays. The default for classic ML. It is built on pickle, so the same warning about untrusted files applies.
ONNX: the interesting one. It exports your model to a framework-neutral graph, so a PyTorch model can be served by a C++ runtime with no Python in the loop. ONNX Runtime’s graph optimizations and quantization can deliver substantial inference speedups (figures of 2–10x are commonly cited, though the gain depends heavily on the model and hardware, so benchmark your own). If latency matters, this is your lever.
safetensors: Hugging Face’s format for neural net weights. It stores only the tensors, loads fast, and (unlike pickle) can’t smuggle in code. The default for transformer-era models.

Rule of thumb: joblib for scikit-learn, safetensors for deep learning weights, ONNX when you need speed or cross-platform portability.
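The warning about pickle is easy to check with the standard library alone. The sketch below is illustrative only: Payload is a hypothetical class, and the harmless expression passed to eval stands in for whatever an attacker would actually run.

```python
import pickle


class Payload:
    """Hypothetical malicious object, for illustration only."""

    def __reduce__(self):
        # __reduce__ tells pickle how to rebuild the object at load time.
        # Here it says: "call eval('6 * 7')". An attacker would put
        # something far less friendly in that string.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())  # bytes that look like any other saved "model"
loaded = pickle.loads(blob)     # loading runs eval() before returning anything

print(loaded)  # the result of the evaluated expression, not a Payload object
```

Nothing in the bytes hints at what will run on load, which is why formats that store only tensors or a declarative graph are the safer choice for files that cross a trust boundary.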

Step 2: Wrap it in an API

A saved model on disk is a paperweight until something can talk to it. The lingua franca is a REST endpoint, and in 2026 the default tool is still FastAPI — async, fast, and it generates request validation and OpenAPI docs from your type hints for free.

The pattern that matters: load the model once at startup, not per request. Loading weights is expensive; doing it on every call is the rookie mistake that tanks your latency.

from contextlib import asynccontextmanager
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

ml = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    ml["model"] = joblib.load("model.joblib")  # load once
    yield
    ml.clear()

app = FastAPI(lifespan=lifespan)

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
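def predict(payload: Features):
    # scikit-learn expects a 2D input: one row per sample
    prediction = ml["model"].predict([payload.values])
    # .item() turns the NumPy scalar into a plain Python value that JSON can encode
    return {"prediction": prediction[0].item()}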

That’s a production-shaped skeleton: validated input, a model held in memory, a clean JSON response. Add a /health route and you can already put it behind a load balancer.

Step 3: Decide when it runs

Not every prediction needs to happen this instant. Pick your serving mode deliberately:

Real-time (online) inference — one request, one answer, now. Fraud checks, search ranking, recommendations. Latency is king.
Batch inference — score a million rows overnight on a schedule. Throughput is king; nobody’s waiting. Cheaper and simpler — don’t stand up a 24/7 endpoint for a job a cron task can do at 2 a.m.

Choosing batch when you only think you need real-time is one of the easiest ways to save money and stress.
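To make the batch option concrete, here is a minimal sketch of a scheduled scoring job using only the standard library. The score_batch function is a hypothetical stand-in for a real model.predict call, and the in-memory CSV stands in for whatever files or tables your scheduler would point the job at.

```python
import csv
import io
from itertools import islice


def score_batch(rows):
    # Hypothetical stand-in for model.predict(rows): sums each feature row.
    return [sum(row) for row in rows]


def run_batch_job(reader, writer, chunk_size=1000):
    """Score rows chunk by chunk so memory use stays flat for any input size."""
    total = 0
    while True:
        chunk = list(islice(reader, chunk_size))  # next chunk_size rows, or fewer
        if not chunk:
            break
        features = [[float(x) for x in row] for row in chunk]
        for row, score in zip(chunk, score_batch(features)):
            writer.writerow(row + [score])  # original columns plus the score
        total += len(chunk)
    return total


# Demo with in-memory "files"; a real job would open paths given by the scheduler.
source = io.StringIO("1.0,2.0\n3.0,4.5\n0.5,0.5\n")
sink = io.StringIO()
scored = run_batch_job(csv.reader(source), csv.writer(sink), chunk_size=2)
```

There is no server, no port, and no autoscaler here: the job starts, scores everything, writes the results, and exits.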

Step 4: Box it up and scale it out

To run anywhere reliably, containerize it. A Dockerfile pins your Python version, your exact dependency builds (scikit-learn, onnxruntime, or whatever the model was saved with), and the model file into one immutable image, killing the “works on my machine” demon. From there, an orchestrator (Kubernetes, or a managed equivalent) handles scaling: more replicas behind a load balancer for throughput, autoscaling for spiky traffic.

Two more non-negotiables:

Versioning. Tag every deployed model (fraud-v3) and log which version served which prediction. When metrics drift, you’ll want to know exactly what was live. It also makes rolling back, or canarying v4 to 5% of traffic, a one-line change instead of an archaeology dig.
Latency vs throughput. They pull in opposite directions. Batching requests together boosts throughput but adds wait time. Measure p99 latency, not the average — the average hides the angry tail of users who waited two seconds.
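The point about averages is worth seeing in numbers. The sketch below uses made-up latencies (98 fast requests and 2 slow ones) and a simple nearest-rank percentile. Monitoring tools may interpolate slightly differently, so treat the helper as an illustration rather than a reference implementation.

```python
import statistics

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [50.0] * 98 + [2000.0] * 2


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

With this data the mean comes out at 89 ms, which looks healthy, while p99 lands on the full 2000 ms that the unlucky requests actually experienced.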

When to stop hand-rolling

FastAPI is perfect until you’re juggling GPUs and dynamic batching. Then reach for purpose-built serving: BentoML to package and ship, NVIDIA Triton or Ray Serve for high-throughput GPU inference, vLLM for LLMs, and KServe if you live on Kubernetes and want autoscaling (even scale-to-zero) for free. Managed platforms — SageMaker, Vertex AI — trade flexibility for not paging you at midnight.

The takeaway

Serving a model is four concrete moves: serialize it (joblib / safetensors / ONNX), wrap it in an API that loads weights once, choose batch vs real-time on purpose, and containerize + version it so you can scale and roll back without fear. Start with FastAPI in Docker; reach for Triton, Ray, or KServe only when traffic forces your hand. Do that, and your 94% model stops being a party trick and starts being a product.
