diagram.mmd — sequence
LLM Request Flow sequence diagram

An LLM request flow describes the end-to-end lifecycle of a single inference call — from the moment a client application sends a prompt to the moment a generated response is returned and logged.

What the diagram shows

This sequence diagram traces the path a request takes through each layer of a production LLM stack:

1. Client → API Gateway: the application sends an HTTP POST carrying the model identifier, messages array, and parameters such as temperature and max tokens.
2. Authentication & rate limiting: the gateway validates the API key and enforces per-user or per-organization token quotas. Rejected requests receive a 401 or 429 before reaching the model.
3. Request routing: the gateway forwards the validated request to the appropriate model serving cluster, selecting the correct model version and region.
4. Tokenization: the serving layer converts the raw text prompt into a sequence of integer tokens using the model's tokenizer.
5. KV cache lookup: the serving layer checks whether a prefix of the token sequence is already cached in GPU memory, avoiding redundant computation for repeated context.
6. Model inference: the transformer performs a forward pass, producing logit distributions over the vocabulary at each output position.
7. Sampling / decoding: a decoding strategy (greedy, top-p, or beam search) selects the next token until an end-of-sequence token is produced or the max token limit is reached.
8. Detokenization: the output token IDs are converted back to text.
9. Logging & metering: token counts and latency are recorded for billing and observability.
10. Response: the final text is returned to the client, wrapped in the API response envelope.
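Steps 4–8 form the core serving loop: tokenize, forward pass, sample, repeat until EOS, then detokenize. A minimal sketch of that loop, with a toy character-level tokenizer and a stand-in "model" that always generates a canned reply (real servers call a transformer and sample from its logits):

```python
EOS = 0                   # assumed end-of-sequence token id
CANNED_REPLY = "Hello!"   # what the toy "model" always generates

def tokenize(text):
    """Stand-in tokenizer: one token per character (real ones use BPE)."""
    return [ord(c) for c in text]

def detokenize(tokens):
    return "".join(chr(t) for t in tokens)

def forward_pass(prompt_tokens, generated):
    """Stand-in for the forward pass + sampling: emits the next token of a
    canned reply, then EOS. A real model returns logits over the vocabulary
    and a decoding strategy (greedy / top-p) picks the token."""
    reply = tokenize(CANNED_REPLY)
    return reply[len(generated)] if len(generated) < len(reply) else EOS

def generate(prompt, max_tokens=32):
    """Steps 4-8: tokenize, loop until EOS or max_tokens, detokenize."""
    prompt_tokens = tokenize(prompt)       # step 4
    generated = []
    for _ in range(max_tokens):            # steps 6-7
        tok = forward_pass(prompt_tokens, generated)
        if tok == EOS:
            break
        generated.append(tok)
    return detokenize(generated)           # step 8

print(generate("Hi"))  # prints "Hello!"
```

The `max_tokens` bound is what caps generation time when the model never emits EOS; `generate("Hi", max_tokens=3)` stops early and returns a truncated reply.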

Why this matters

Understanding the full request path helps engineers identify where latency is introduced — whether in network overhead, tokenization, KV cache misses, or pure model compute. It also clarifies which layers are responsible for safety, cost control, and observability.
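One practical way to attribute latency to a specific layer is to time each stage of the path independently. A minimal sketch (the stage names and the `timed` helper are illustrative, not a real metrics API):

```python
import time
from contextlib import contextmanager

# Per-stage latency measurements, keyed by stage name.
stage_latency_ms = {}

@contextmanager
def timed(stage):
    """Record wall-clock duration of a stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_latency_ms[stage] = (time.perf_counter() - start) * 1000

with timed("tokenize"):
    tokens = list("example prompt".encode())   # stand-in for tokenization
with timed("inference"):
    time.sleep(0.01)                           # stand-in for model compute

for stage, ms in stage_latency_ms.items():
    print(f"{stage}: {ms:.2f} ms")
```

In a real deployment these per-stage timings would be exported as histogram metrics, so a spike in end-to-end latency can be traced to network, tokenization, cache, or GPU time.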

For streaming variants see LLM Streaming Response. To understand how the prompt itself is assembled before the request is sent, see Prompt Processing Pipeline. The caching layer is explored in depth in Prompt Cache System.


Frequently asked questions

What is an LLM request flow?
An LLM request flow is the end-to-end lifecycle of a single inference call — covering every hop from client authentication through tokenization, KV cache lookup, model forward pass, sampling, detokenization, and logging before the response is returned to the caller.
How does an LLM request flow work?
The client sends an HTTP POST to an API gateway, which authenticates the key and enforces rate limits. The validated request is routed to a model serving cluster, where the prompt is tokenized, checked against the KV cache, passed through the transformer for inference, decoded via a sampling strategy, and then returned as text with token usage logged for billing.
Why does the request flow matter?
It matters most when diagnosing latency — whether delays stem from network overhead, tokenization, KV cache misses, or pure GPU compute. It is also essential when designing rate limiting, cost control, and observability instrumentation around a production LLM deployment.
What are common issues in an LLM request flow?
Common issues include KV cache misses on long system prompts (fixed by prefix caching), cold-start latency from model loading, throttling at the gateway tier, and unbounded max-token settings that extend generation time unpredictably.
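The prefix-caching fix mentioned above amounts to finding the longest already-cached prefix of the incoming token sequence so the forward pass can skip recomputing it. A simplified sketch — real servers use block/page tables keyed by hashed token blocks rather than a set of whole prefixes:

```python
def longest_cached_prefix(prompt_tokens, cache):
    """Return the length of the longest prefix of prompt_tokens whose
    KV entries are already cached, so the forward pass can start there.
    `cache` holds cached prefixes as tuples (an assumption for brevity)."""
    for end in range(len(prompt_tokens), 0, -1):
        if tuple(prompt_tokens[:end]) in cache:
            return end
    return 0

# A long shared system prompt produces a long cached prefix; only the
# request-specific suffix needs fresh computation.
cache = {(1, 2, 3), (1, 2, 3, 4, 5)}
hit = longest_cached_prefix([1, 2, 3, 4, 9], cache)  # → 3
```

With this in place, two requests sharing the same system prompt recompute only their differing suffixes, which is why prefix caching eliminates the KV cache misses on long system prompts.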
How does an LLM request differ from a standard REST API call?
A REST API call typically executes a deterministic function in milliseconds. An LLM request involves autoregressive token generation — output length is variable, computation scales with the number of output tokens, and the serving layer must manage GPU memory (KV cache) across concurrent requests in ways that have no parallel in standard web services.
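The gateway-tier quota check that produces the 401/429 responses in step 2 is commonly implemented as a token bucket per API key. A toy sketch — illustrative only, since real gateways meter LLM token usage rather than request counts and share state across replicas:

```python
import time

class TokenBucket:
    """Per-key quota check in the spirit of the gateway's rate-limit step."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self, cost=1):
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True    # request proceeds to routing
        return False       # gateway returns HTTP 429

bucket = TokenBucket(capacity=2, refill_per_sec=0)
print([bucket.allow() for _ in range(3)])  # prints [True, True, False]
```

The third call is rejected before any model compute is spent, which is the point of enforcing quotas at the gateway rather than inside the serving cluster.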
mermaid
sequenceDiagram
    participant Client as Client App
    participant GW as API Gateway
    participant Auth as Auth & Rate Limiter
    participant Router as Model Router
    participant Serving as Model Serving
    participant Model as LLM (GPU)
    participant Logger as Logging & Metering
    Client->>GW: POST /v1/chat/completions {model, messages, params}
    GW->>Auth: Validate API key + check quota
    Auth-->>GW: 401 Unauthorized (if invalid)
    Auth-->>GW: 429 Too Many Requests (if quota exceeded)
    Auth-->>GW: OK (valid, within quota)
    GW->>Router: Route to model cluster
    Router->>Serving: Forward request to serving replica
    Serving->>Serving: Tokenize prompt
    Serving->>Serving: KV cache lookup (prefix match)
    Serving->>Model: Forward pass (cached prefix skipped)
    Model-->>Serving: Logits for next token
    Serving->>Serving: Sample / decode next token
    Serving->>Model: Continue until EOS or max_tokens
    Model-->>Serving: Final token sequence
    Serving->>Serving: Detokenize output tokens
    Serving->>Logger: Record token counts + latency
    Logger-->>Serving: Ack
    Serving-->>GW: Response {id, choices, usage}
    GW-->>Client: HTTP 200 response with generated text