diagram.mmd — flowchart
Inference Pipeline flowchart diagram

An inference pipeline is the real-time serving path that transforms a prediction request into a model output — retrieving features, loading the correct model version, running the forward pass, and returning a structured prediction with low latency.

What the diagram shows

This flowchart maps the execution path of a single inference request through a production ML serving system:

1. Prediction request: an upstream service or user-facing application sends a prediction request containing entity identifiers (e.g., user ID, item ID).
2. Input validation: the request schema is validated and required fields are checked before any computation begins.
3. Feature retrieval: the pipeline fetches precomputed features from the online feature store (see Feature Engineering Pipeline). Features absent from the store may fall back to real-time computation.
4. Feature assembly: retrieved features are joined, ordered, and shaped into the exact input tensor format expected by the model.
5. Model version lookup: the serving layer consults the model registry to resolve the currently active model version and its serving endpoint (see Model Version Deployment).
6. Model forward pass: the assembled feature vector is passed to the model, which performs inference and returns raw scores or logits.
7. Post-processing: raw outputs are transformed — scores are calibrated, logits converted to probabilities via softmax, or class labels decoded.
8. Result caching: predictions for high-traffic entity pairs may be cached with a short TTL to reduce repeated model invocations.
9. Logging: input features, model version, and prediction output are logged to a feature/prediction store for downstream use in the AI Feedback Loop.
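The nine steps above can be sketched as a single request handler. Everything below the stub section — the feature store, registry, model, and feature ordering — is a stand-in invented for illustration, not a real serving framework's API:

```python
import math
import time

# --- stand-in stubs; a real system would use an online store, a registry service, etc. ---
FEATURE_STORE = {("user_1", "item_9"): {"ctr_7d": 0.12, "price": 19.9}}
MODEL_REGISTRY = {"ranker": "v42"}      # active version per model name (step 5)
PREDICTION_CACHE = {}                   # (entity key, version) -> (expiry, result)
PREDICTION_LOG = []                     # downstream feedback-loop sink (step 9)
FEATURE_ORDER = ["ctr_7d", "price"]     # exact input ordering the model expects
CACHE_TTL_S = 30

def compute_features_realtime(key):
    # fallback when the online store misses (step 3)
    return {name: 0.0 for name in FEATURE_ORDER}

def model_forward(vector, version):
    # toy linear model standing in for the real forward pass (step 6)
    weights = [2.0, -0.01]
    return sum(w * x for w, x in zip(weights, vector))

def predict(request):
    # steps 1-2: validate the request schema before any computation
    if "user_id" not in request or "item_id" not in request:
        return {"status": 400, "error": "missing entity identifiers"}
    key = (request["user_id"], request["item_id"])

    # step 5: resolve the active model version from the registry
    version = MODEL_REGISTRY["ranker"]

    # step 8: serve from cache if a fresh prediction for this version exists
    cached = PREDICTION_CACHE.get((key, version))
    if cached and cached[0] > time.monotonic():
        return cached[1]

    # steps 3-4: fetch precomputed features (real-time fallback), then assemble
    features = FEATURE_STORE.get(key) or compute_features_realtime(key)
    vector = [features[name] for name in FEATURE_ORDER]

    # steps 6-7: forward pass, then squash the raw score to a probability
    raw = model_forward(vector, version)
    prob = 1.0 / (1.0 + math.exp(-raw))
    result = {"status": 200, "model_version": version, "score": round(prob, 4)}

    # steps 8-9: cache with a short TTL and log for the feedback loop
    PREDICTION_CACHE[(key, version)] = (time.monotonic() + CACHE_TTL_S, result)
    PREDICTION_LOG.append({"features": features, "version": version, "output": result})
    return result
```

Keying the cache on both entity identifiers and model version means a new deployment can never serve scores produced by a previous model.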

Why this matters

Low-latency inference with consistent feature retrieval is critical for user-facing ML applications such as recommendation, ranking, and fraud detection. A well-structured inference pipeline separates feature serving from model serving, making each independently scalable.


Frequently asked questions

What is an ML inference pipeline?

An ML inference pipeline is the real-time serving path that takes a prediction request, retrieves the required features from an online feature store, assembles them into the model's expected input format, runs the forward pass, post-processes the output, and returns a structured prediction — all within a latency budget typically measured in tens of milliseconds.

How do you reduce inference latency?

Optimisation targets each stage independently: feature retrieval is accelerated with low-latency online stores (Redis, DynamoDB); model forward passes are optimised via quantization, TensorRT, or ONNX export; predictions for hot entity pairs are cached with a short TTL; and model versions are pre-loaded into serving instances to avoid cold starts.

When should you use online versus batch inference?

Use online inference when a prediction is needed synchronously before a user action — recommendations, fraud scoring, search ranking. Use batch inference when predictions can be precomputed in advance and stored — nightly report generation, bulk item scoring — since batch is far cheaper per prediction.

What are common failure modes in inference pipelines?

Frequent issues include training-serving skew (features computed differently at training and serving time), feature store staleness (a TTL set too long serving stale values), model version mismatches (serving an old artifact after a failed deployment), and long tail latencies (p95/p99 spikes) caused by slow feature joins.
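The TTL caching mentioned above can be as simple as a dict keyed by entity pair and model version, with a monotonic-clock expiry. A minimal in-process sketch; production systems would typically reach for Redis with `EXPIRE` instead:

```python
import time

class TTLPredictionCache:
    """Minimal in-process prediction cache with lazy expiry."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (expiry timestamp, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() >= expiry:
            del self._entries[key]  # lazily evict stale entries on read
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (time.monotonic() + self.ttl, value)
```

Using a monotonic clock rather than wall time keeps expiry correct across NTP adjustments; including the model version in the key ensures a fresh deployment invalidates old scores automatically.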
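Training-serving skew, the first issue listed above, is often caught by replaying logged offline feature rows against the online path and diffing the values. A simple spot check under that assumption — not a full skew-monitoring system:

```python
def detect_feature_skew(offline, online, rel_tolerance=1e-3):
    """Return names of features that are missing online or whose values
    diverge from the offline (training-time) values by more than
    rel_tolerance, relative to the larger magnitude."""
    skewed = []
    for name, off_val in offline.items():
        if name not in online:
            skewed.append(name)  # feature never computed at serving time
            continue
        on_val = online[name]
        denom = max(abs(off_val), abs(on_val), 1e-12)  # guard zero division
        if abs(off_val - on_val) / denom > rel_tolerance:
            skewed.append(name)
    return skewed
```

Running this check on a sampled slice of production traffic, and alerting when the skewed-feature rate rises, catches divergent feature logic before it silently degrades model quality.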
```mermaid
flowchart TD
    A([Prediction request: entity IDs]) --> B[Validate request schema]
    B --> C{Valid?}
    C -- No --> D([Return 400 Bad Request])
    C -- Yes --> E[Fetch features from online feature store]
    E --> F{All features found?}
    F -- Missing --> G[Compute missing features in real time]
    G --> H[Assemble feature vector]
    F -- Found --> H
    H --> I[Look up active model version from registry]
    I --> J[Model forward pass]
    J --> K[Post-process: calibration and decoding]
    K --> L{Cache eligible?}
    L -- Yes --> M[Write prediction to cache with TTL]
    M --> N[Log features + model version + prediction]
    L -- No --> N
    N --> O([Return prediction response])
```