diagram.mmd — sequence
LLM Streaming Response sequence diagram

LLM streaming response is a delivery pattern where the model's output tokens are sent to the client incrementally as they are generated, rather than waiting for the complete response to be assembled — dramatically reducing time-to-first-token and improving perceived responsiveness.

What the diagram shows

This sequence diagram illustrates how server-sent events (SSE) or chunked transfer encoding enables token-level streaming from the model serving layer to the client:

1. Client request: the client sends a completion request with stream: true, signaling that it wants incremental delivery.
2. Connection upgrade: the API gateway establishes a persistent HTTP connection using SSE or chunked transfer encoding, keeping the socket open for the duration of generation.
3. Tokenization and inference start: the serving layer tokenizes the prompt and begins the autoregressive forward pass.
4. Token streaming loop: as each output token is sampled from the model's logit distribution, it is immediately detokenized and dispatched as a data: chunk event, typically as {"delta": {"content": " token"}} in OpenAI-compatible APIs.
5. Client renders tokens: the client UI appends each token to the display buffer as it arrives, producing the typewriter effect familiar from chat interfaces.
6. Stream termination: when the model produces an end-of-sequence token or reaches max_tokens, a final data: [DONE] event is sent and the connection is closed.
7. Usage accounting: the full token count (prompt + completion) is reported in the final event or via a separate metering call.
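The client side of this loop can be sketched as a small parser for the `data:` event lines. This is a minimal illustration, not a production SSE client: it assumes the simplified chunk shape shown above (`{"delta": {"content": "..."}}`) rather than the full OpenAI wire format, and reads from an in-memory list instead of a live HTTP connection.

```python
import json

def parse_sse_stream(lines):
    """Accumulate text from OpenAI-style SSE 'data:' events.

    Yields content deltas one at a time; stops at the [DONE] sentinel.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # stream terminator: generation is complete
        event = json.loads(payload)
        delta = event.get("delta", {}).get("content")
        if delta:
            yield delta  # the client appends this to its display buffer

# Simulated token stream, as it would arrive over the wire
stream = [
    'data: {"delta":{"content":"Hel"}}',
    'data: {"delta":{"content":"lo"}}',
    'data: [DONE]',
]
print("".join(parse_sse_stream(stream)))  # prints Hello
```

In a real client the same generator would be fed line-by-line from the open HTTP response body, appending each yielded delta to the UI as it arrives.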

Why this matters

Without streaming, users wait for the entire response before seeing any output — for long answers this can take many seconds. Streaming cuts perceived latency to near-zero and makes AI chat interfaces feel interactive. See LLM Request Flow for the non-streaming variant and AI Chat Application Architecture for the broader application context.
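A back-of-envelope calculation makes the latency gap concrete. The decode rate, completion length, and time-to-first-token below are illustrative assumptions, not measured figures:

```python
# All numbers are assumptions for illustration only.
tokens = 500          # completion length in tokens
decode_rate = 40.0    # decode speed, tokens per second
ttft = 0.3            # time to first token, seconds

# Non-streaming: the user sees nothing until every token is generated.
non_streaming_wait = ttft + tokens / decode_rate

# Streaming: the first token appears as soon as decoding starts.
streaming_wait = ttft

print(f"non-streaming: {non_streaming_wait:.1f}s before any output")  # 12.8s
print(f"streaming:     {streaming_wait:.1f}s before first token")     # 0.3s
```

Total generation time is unchanged; only the wait before the first visible output shrinks, which is what drives the perceived-latency improvement.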

Free online editor
Edit this diagram in Graphlet
Fork, modify, and export to SVG or PNG. No sign-up required.
Open in Graphlet →

Frequently asked questions

What is an LLM streaming response?
An LLM streaming response is a delivery pattern in which the model's output tokens are sent to the client incrementally as they are generated, using server-sent events (SSE) or chunked transfer encoding, rather than waiting for the complete response to be assembled before transmission.

How does LLM streaming work?
The client sends a completion request with `stream: true`. The API gateway opens a persistent HTTP connection. As the model samples each output token, it is immediately detokenized and dispatched as a `data:` SSE event. The client appends each token to its display buffer in real time, producing a typewriter effect. A final `data: [DONE]` event closes the stream when generation ends.

When should you disable streaming?
Disable streaming when the downstream system requires the complete response before it can act: for example, when post-processing the full output for structured data extraction, when applying output moderation that must review the complete text, or when feeding the response into a batch pipeline where partial delivery provides no UX benefit.
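The moderation case above can be sketched in a few lines: when a policy must see the complete text, the consumer buffers every chunk before releasing anything. The banned-word check and withholding message here are hypothetical placeholders for a real moderation rule.

```python
def moderated_reply(chunks, banned=("secret",)):
    """Buffer the full completion before release.

    Hypothetical moderation rule: withhold the whole response if it
    contains any banned word. This is why partial delivery is useless
    here; no chunk can be shown until the complete text passes review.
    """
    full = "".join(chunks)
    if any(word in full.lower() for word in banned):
        return "[response withheld by moderation]"
    return full

print(moderated_reply(["Hello ", "world"]))          # passes moderation
print(moderated_reply(["The ", "secret", " code"]))  # withheld
```

Since the client would see nothing until the buffer is complete anyway, setting `stream: false` and receiving one JSON response is the simpler design.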
```mermaid
sequenceDiagram
    participant Client as Client App
    participant GW as API Gateway
    participant Serving as Model Serving
    participant Model as LLM (GPU)
    participant Meter as Usage Metering
    Client->>GW: POST /v1/chat/completions {stream: true, messages}
    GW->>Serving: Forward streaming request
    Serving->>Serving: Tokenize prompt
    GW-->>Client: HTTP 200 (SSE connection open)
    Serving->>Model: Begin autoregressive forward pass
    loop Token generation
        Model-->>Serving: Next token logits
        Serving->>Serving: Sample token from distribution
        Serving->>Serving: Detokenize token to text
        Serving-->>GW: data: {"delta":{"content":" token"}}
        GW-->>Client: data: {"delta":{"content":" token"}}
        Client->>Client: Append token to display buffer
    end
    Model-->>Serving: End-of-sequence token reached
    Serving-->>GW: data: {"finish_reason":"stop"}
    Serving-->>GW: data: [DONE]
    GW-->>Client: data: [DONE] (close SSE stream)
    Serving->>Meter: Record prompt_tokens + completion_tokens
    Meter-->>Serving: Ack
```