diagram.mmd — flowchart
Inference Pipeline flowchart diagram

An inference pipeline is the real-time serving path that transforms a prediction request into a model output — retrieving features, loading the correct model version, running the forward pass, and returning a structured prediction with low latency.

What the diagram shows

This flowchart maps the execution path of a single inference request through a production ML serving system:

1. Prediction request: an upstream service or user-facing application sends a prediction request containing entity identifiers (e.g., user ID, item ID).
2. Input validation: the request schema is validated and required fields are checked before any computation begins.
3. Feature retrieval: the pipeline fetches precomputed features from the online feature store (see Feature Engineering Pipeline). Features absent from the store may fall back to real-time computation.
4. Feature assembly: retrieved features are joined, ordered, and shaped into the exact input tensor format expected by the model.
5. Model version lookup: the serving layer consults the model registry to resolve the currently active model version and its serving endpoint (see Model Version Deployment).
6. Model forward pass: the assembled feature vector is passed to the model, which performs inference and returns raw scores or logits.
7. Post-processing: raw outputs are transformed — scores are calibrated, logits converted to probabilities via softmax, or class labels decoded.
8. Result caching: predictions for high-traffic entity pairs may be cached with a short TTL to reduce repeated model invocations.
9. Logging: input features, model version, and prediction output are logged to a feature/prediction store for downstream use in the AI Feedback Loop.
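The nine steps above can be sketched as a single request handler. Everything below the stub section — the feature store, registry, model, and feature ordering — is a stand-in invented for illustration, not a real serving framework's API:

```python
import math
import time

# --- stand-in stubs; a real system would use an online store, a registry service, etc. ---
FEATURE_STORE = {("user_1", "item_9"): {"ctr_7d": 0.12, "price": 19.9}}
MODEL_REGISTRY = {"ranker": "v42"}      # active version per model name (step 5)
PREDICTION_CACHE = {}                   # (entity key, version) -> (expiry, result)
PREDICTION_LOG = []                     # downstream feedback-loop sink (step 9)
FEATURE_ORDER = ["ctr_7d", "price"]     # exact input ordering the model expects
CACHE_TTL_S = 30

def compute_features_realtime(key):
    # fallback when the online store misses (step 3)
    return {name: 0.0 for name in FEATURE_ORDER}

def model_forward(vector, version):
    # toy linear model standing in for the real forward pass (step 6)
    weights = [2.0, -0.01]
    return sum(w * x for w, x in zip(weights, vector))

def predict(request):
    # steps 1-2: validate the request schema before any computation
    if "user_id" not in request or "item_id" not in request:
        return {"status": 400, "error": "missing entity identifiers"}
    key = (request["user_id"], request["item_id"])

    # step 5: resolve the active model version from the registry
    version = MODEL_REGISTRY["ranker"]

    # step 8: serve from cache if a fresh prediction for this version exists
    cached = PREDICTION_CACHE.get((key, version))
    if cached and cached[0] > time.monotonic():
        return cached[1]

    # steps 3-4: fetch precomputed features (real-time fallback), then assemble
    features = FEATURE_STORE.get(key) or compute_features_realtime(key)
    vector = [features[name] for name in FEATURE_ORDER]

    # steps 6-7: forward pass, then squash the raw score to a probability
    raw = model_forward(vector, version)
    prob = 1.0 / (1.0 + math.exp(-raw))
    result = {"status": 200, "model_version": version, "score": round(prob, 4)}

    # steps 8-9: cache with a short TTL and log for the feedback loop
    PREDICTION_CACHE[(key, version)] = (time.monotonic() + CACHE_TTL_S, result)
    PREDICTION_LOG.append({"features": features, "version": version, "output": result})
    return result
```

Keying the cache on both entity identifiers and model version means a new deployment can never serve scores produced by a previous model.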

Why this matters

Low-latency inference with consistent feature retrieval is critical for user-facing ML applications such as recommendation, ranking, and fraud detection. A well-structured inference pipeline separates feature serving from model serving, making each independently scalable.


Frequently asked questions

What is an ML inference pipeline?

An ML inference pipeline is the real-time serving path that takes a prediction request, retrieves the required features from an online feature store, assembles them into the model's expected input format, runs the forward pass, post-processes the output, and returns a structured prediction — all within a latency budget typically measured in tens of milliseconds.

How do you reduce inference latency?

Optimisation targets each stage independently: feature retrieval is accelerated with low-latency online stores (Redis, DynamoDB); model forward passes are optimised via quantization, TensorRT, or ONNX export; predictions for hot entity pairs are cached with a short TTL; and model versions are pre-loaded into serving instances to avoid cold starts.

When should you use online versus batch inference?

Use online inference when a prediction is needed synchronously before a user action — recommendations, fraud scoring, search ranking. Use batch inference when predictions can be precomputed in advance and stored — nightly report generation, bulk item scoring — since batch is far cheaper per prediction.

What are common failure modes in inference pipelines?

Frequent issues include training-serving skew (features computed differently at training and serving time), feature store staleness (a TTL set too long serving stale values), model version mismatches (serving an old artifact after a failed deployment), and long tail latencies (p95/p99 spikes) caused by slow feature joins.
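The TTL caching mentioned above can be as simple as a dict keyed by entity pair and model version, with a monotonic-clock expiry. A minimal in-process sketch; production systems would typically reach for Redis with `EXPIRE` instead:

```python
import time

class TTLPredictionCache:
    """Minimal in-process prediction cache with lazy expiry."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (expiry timestamp, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() >= expiry:
            del self._entries[key]  # lazily evict stale entries on read
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (time.monotonic() + self.ttl, value)
```

Using a monotonic clock rather than wall time keeps expiry correct across NTP adjustments; including the model version in the key ensures a fresh deployment invalidates old scores automatically.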
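Training-serving skew, the first issue listed above, is often caught by replaying logged offline feature rows against the online path and diffing the values. A simple spot check under that assumption — not a full skew-monitoring system:

```python
def detect_feature_skew(offline, online, rel_tolerance=1e-3):
    """Return names of features that are missing online or whose values
    diverge from the offline (training-time) values by more than
    rel_tolerance, relative to the larger magnitude."""
    skewed = []
    for name, off_val in offline.items():
        if name not in online:
            skewed.append(name)  # feature never computed at serving time
            continue
        on_val = online[name]
        denom = max(abs(off_val), abs(on_val), 1e-12)  # guard zero division
        if abs(off_val - on_val) / denom > rel_tolerance:
            skewed.append(name)
    return skewed
```

Running this check on a sampled slice of production traffic, and alerting when the skewed-feature rate rises, catches divergent feature logic before it silently degrades model quality.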
```mermaid
flowchart TD
    A([Prediction request: entity IDs]) --> B[Validate request schema]
    B --> C{Valid?}
    C -- No --> D([Return 400 Bad Request])
    C -- Yes --> E[Fetch features from online feature store]
    E --> F{All features found?}
    F -- Missing --> G[Compute missing features in real time]
    G --> H[Assemble feature vector]
    F -- Found --> H
    H --> I[Look up active model version from registry]
    I --> J[Model forward pass]
    J --> K[Post-process: calibration and decoding]
    K --> L{Cache eligible?}
    L -- Yes --> M[Write prediction to cache with TTL]
    M --> N[Log features + model version + prediction]
    L -- No --> N
    N --> O([Return prediction response])
```