diagram.mmd — sequence
LLM Streaming Response sequence diagram

LLM streaming response is a delivery pattern where the model's output tokens are sent to the client incrementally as they are generated, rather than waiting for the complete response to be assembled — dramatically reducing time-to-first-token and improving perceived responsiveness.

What the diagram shows

This sequence diagram illustrates how server-sent events (SSE) or chunked transfer encoding enables token-level streaming from the model serving layer to the client:

1. Client request: the client sends a completion request with stream: true, signaling that it wants incremental delivery.
2. Connection upgrade: the API gateway establishes a persistent HTTP connection using SSE or chunked transfer encoding, keeping the socket open for the duration of generation.
3. Tokenization and inference start: the serving layer tokenizes the prompt and begins the autoregressive forward pass.
4. Token streaming loop: as each output token is sampled from the model's logit distribution, it is immediately detokenized and dispatched as a data: chunk event, typically as {"delta": {"content": " token"}} in OpenAI-compatible APIs.
5. Client renders tokens: the client UI appends each token to the display buffer as it arrives, producing the typewriter effect familiar from chat interfaces.
6. Stream termination: when the model produces an end-of-sequence token or reaches max_tokens, a final data: [DONE] event is sent and the connection is closed.
7. Usage accounting: the full token count (prompt + completion) is reported in the final event or via a separate metering call.
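The client side of this loop can be sketched as a small parser for the `data:` event lines. This is a minimal illustration, not a production SSE client: it assumes the simplified chunk shape shown above (`{"delta": {"content": "..."}}`) rather than the full OpenAI wire format, and reads from an in-memory list instead of a live HTTP connection.

```python
import json

def parse_sse_stream(lines):
    """Accumulate text from OpenAI-style SSE 'data:' events.

    Yields content deltas one at a time; stops at the [DONE] sentinel.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # stream terminator: generation is complete
        event = json.loads(payload)
        delta = event.get("delta", {}).get("content")
        if delta:
            yield delta  # the client appends this to its display buffer

# Simulated token stream, as it would arrive over the wire
stream = [
    'data: {"delta":{"content":"Hel"}}',
    'data: {"delta":{"content":"lo"}}',
    'data: [DONE]',
]
print("".join(parse_sse_stream(stream)))  # prints Hello
```

In a real client the same generator would be fed line-by-line from the open HTTP response body, appending each yielded delta to the UI as it arrives.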

Why this matters

Without streaming, users wait for the entire response before seeing any output — for long answers this can take many seconds. Streaming cuts perceived latency to near-zero and makes AI chat interfaces feel interactive. See LLM Request Flow for the non-streaming variant and AI Chat Application Architecture for the broader application context.
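A back-of-envelope calculation makes the latency gap concrete. The decode rate, completion length, and time-to-first-token below are illustrative assumptions, not measured figures:

```python
# All numbers are assumptions for illustration only.
tokens = 500          # completion length in tokens
decode_rate = 40.0    # decode speed, tokens per second
ttft = 0.3            # time to first token, seconds

# Non-streaming: the user sees nothing until every token is generated.
non_streaming_wait = ttft + tokens / decode_rate

# Streaming: the first token appears as soon as decoding starts.
streaming_wait = ttft

print(f"non-streaming: {non_streaming_wait:.1f}s before any output")  # 12.8s
print(f"streaming:     {streaming_wait:.1f}s before first token")     # 0.3s
```

Total generation time is unchanged; only the wait before the first visible output shrinks, which is what drives the perceived-latency improvement.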

Free online editor
Edit this diagram in Graphlet
Fork, modify, and export to SVG or PNG. No sign-up required.
Open in Graphlet →

Frequently asked questions

What is an LLM streaming response?
An LLM streaming response is a delivery pattern in which the model's output tokens are sent to the client incrementally as they are generated, using server-sent events (SSE) or chunked transfer encoding, rather than waiting for the complete response to be assembled before transmission.

How does LLM streaming work?
The client sends a completion request with `stream: true`. The API gateway opens a persistent HTTP connection. As the model samples each output token, it is immediately detokenized and dispatched as a `data:` SSE event. The client appends each token to its display buffer in real time, producing a typewriter effect. A final `data: [DONE]` event closes the stream when generation ends.

When should you disable streaming?
Disable streaming when the downstream system requires the complete response before it can act: for example, when post-processing the full output for structured data extraction, when applying output moderation that must review the complete text, or when feeding the response into a batch pipeline where partial delivery provides no UX benefit.
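The moderation case above can be sketched in a few lines: when a policy must see the complete text, the consumer buffers every chunk before releasing anything. The banned-word check and withholding message here are hypothetical placeholders for a real moderation rule.

```python
def moderated_reply(chunks, banned=("secret",)):
    """Buffer the full completion before release.

    Hypothetical moderation rule: withhold the whole response if it
    contains any banned word. This is why partial delivery is useless
    here; no chunk can be shown until the complete text passes review.
    """
    full = "".join(chunks)
    if any(word in full.lower() for word in banned):
        return "[response withheld by moderation]"
    return full

print(moderated_reply(["Hello ", "world"]))          # passes moderation
print(moderated_reply(["The ", "secret", " code"]))  # withheld
```

Since the client would see nothing until the buffer is complete anyway, setting `stream: false` and receiving one JSON response is the simpler design.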
```mermaid
sequenceDiagram
    participant Client as Client App
    participant GW as API Gateway
    participant Serving as Model Serving
    participant Model as LLM (GPU)
    participant Meter as Usage Metering
    Client->>GW: POST /v1/chat/completions {stream: true, messages}
    GW->>Serving: Forward streaming request
    Serving->>Serving: Tokenize prompt
    GW-->>Client: HTTP 200 (SSE connection open)
    Serving->>Model: Begin autoregressive forward pass
    loop Token generation
        Model-->>Serving: Next token logits
        Serving->>Serving: Sample token from distribution
        Serving->>Serving: Detokenize token to text
        Serving-->>GW: data: {"delta":{"content":" token"}}
        GW-->>Client: data: {"delta":{"content":" token"}}
        Client->>Client: Append token to display buffer
    end
    Model-->>Serving: End-of-sequence token reached
    Serving-->>GW: data: {"finish_reason":"stop"}
    Serving-->>GW: data: [DONE]
    GW-->>Client: data: [DONE] (close SSE stream)
    Serving->>Meter: Record prompt_tokens + completion_tokens
    Meter-->>Serving: Ack
```