diagram.mmd — flowchart
Prompt Cache System flowchart diagram

A prompt cache system stores previously computed LLM responses keyed by a deterministic hash of the input prompt, enabling instant cache-hit responses that bypass the model entirely — reducing both latency and per-token inference costs.

What the diagram shows

This flowchart details the read and write paths of a two-tier prompt caching architecture:

1. Incoming prompt: a fully assembled prompt (after the Prompt Processing Pipeline has run) arrives at the caching layer.
2. Cache key computation: a canonical hash (e.g., SHA-256) is computed from the prompt contents, the model identifier, and the generation parameters that affect output, such as temperature. Requests with temperature > 0 may be excluded from caching, since their outputs are non-deterministic.
3. L1 cache lookup (in-memory): the hash is looked up in a fast in-memory store (Redis or a local LRU cache). If found, the response is returned immediately.
4. L2 cache lookup (distributed): on an L1 miss, the lookup falls through to a distributed cache layer (e.g., a shared Redis cluster or an object-store index).
5. Cache miss → LLM dispatch: on a full cache miss, the prompt is forwarded to the LLM serving layer (see LLM Request Flow).
6. Response received: the LLM returns the generated text.
7. Cache write: the response is written to both the L1 and L2 caches under the computed key, with a TTL appropriate to the content's expected freshness.
8. Return response: the response is returned to the caller, flagged with a cache: miss or cache: hit header for observability.
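The steps above can be sketched in Python. This is a minimal illustration, not a production implementation: plain dicts stand in for the in-memory L1 tier and the distributed L2 tier, and `generate` is a placeholder for the actual LLM call.

```python
import hashlib
import json
import time

class TwoTierPromptCache:
    """Illustrative two-tier prompt cache. In practice L1 would be an
    in-process LRU and L2 a shared Redis cluster or object-store index."""

    def __init__(self, ttl_seconds=3600):
        self.l1 = {}  # fast in-memory tier
        self.l2 = {}  # stands in for a distributed tier
        self.ttl = ttl_seconds

    def _key(self, prompt, model, params):
        # Canonical SHA-256 over prompt, model ID, and generation params.
        payload = json.dumps(
            {"prompt": prompt, "model": model, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt, model, params, generate):
        key = self._key(prompt, model, params)
        now = time.time()
        for tier, store in (("l1", self.l1), ("l2", self.l2)):
            entry = store.get(key)
            if entry and entry[1] > now:   # unexpired hit
                if tier == "l2":
                    self.l1[key] = entry   # promote L2 hit into L1
                return entry[0], "hit"
        response = generate(prompt)        # full miss: call the LLM
        entry = (response, now + self.ttl) # (payload, expiry timestamp)
        self.l1[key] = entry               # write both tiers with a TTL
        self.l2[key] = entry
        return response, "miss"
```

The second identical request hits L1 and never reaches the model, which is the latency and cost win the diagram describes.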

Why this matters

For applications where identical or near-identical prompts are common — documentation Q&A, template-based generation, repeated system prompts — a prompt cache can reduce LLM API costs by 40–80% and cut response latency from seconds to milliseconds.
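As a rough back-of-the-envelope check of that cost claim: if a fraction of requests hit the cache, only the misses pay for inference, so the savings track the hit rate. A sketch (function name and dollar figures are illustrative):

```python
def effective_cost_per_request(base_cost, hit_rate, cache_cost=0.0):
    """Average per-request cost: cache hits bypass inference entirely,
    so only the (1 - hit_rate) fraction of misses pays base_cost."""
    return (1 - hit_rate) * base_cost + cache_cost

# At a 60% hit rate, a request that would cost $0.01 averages $0.004,
# i.e., a 60% cost reduction (ignoring the small cache-serving cost).
avg = effective_cost_per_request(base_cost=0.01, hit_rate=0.6)
```

Under this model, the quoted 40-80% cost reduction corresponds to hit rates of roughly 40-80%, which is plausible for workloads dominated by repeated prompts.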

Free online editor
Edit this diagram in Graphlet
Fork, modify, and export to SVG or PNG. No sign-up required.
Open in Graphlet →

Frequently asked questions

What is a prompt cache system?
A prompt cache system stores previously computed LLM responses keyed by a deterministic hash of the input prompt, enabling instant cache-hit responses that bypass the model entirely and eliminate redundant inference costs for repeated or near-identical prompts.
How does a prompt cache work?
When a prompt arrives, a canonical hash is computed from the prompt content, the model ID, and the generation parameters. The hash is checked against a fast in-memory L1 cache, then a distributed L2 cache. On a hit, the stored response is returned immediately, with no GPU compute required. On a miss, the LLM is called and the response is written to both cache tiers with a TTL, so subsequent identical prompts hit the cache.
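The key computation matters more than it looks: serialization must be canonical so that semantically identical requests hash identically. A minimal sketch using a stable JSON encoding:

```python
import hashlib
import json

def cache_key(prompt: str, model: str, params: dict) -> str:
    """Canonical cache key. sort_keys and fixed separators ensure that
    the same (prompt, model, params) always serializes to the same bytes,
    regardless of dict insertion order."""
    canonical = json.dumps(
        {"prompt": prompt, "model": model, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Including the model ID and parameters in the key prevents a response generated by one model or configuration from being served for another.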
Which prompts should not be cached?
Exclude prompts where responses must be non-deterministic, personalized per request, or time-sensitive. Requests with `temperature > 0` where response variety matters, prompts injecting real-time data (current prices, live events), and prompts that include session-specific content that should not be shared across users should all bypass the cache.
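These exclusion rules can be expressed as a simple gate in front of the cache. The function and flag names below are hypothetical; how real-time or session-specific content is detected depends on the application.

```python
def is_cacheable(params: dict,
                 has_realtime_data: bool,
                 is_session_specific: bool) -> bool:
    """Return True only when a cached response is safe to reuse."""
    # Non-zero temperature: outputs are intentionally varied, skip caching.
    if params.get("temperature", 0) > 0:
        return False
    # Live data (prices, events) or per-user content must not be shared.
    if has_realtime_data or is_session_specific:
        return False
    return True
```

Requests that fail this check dispatch straight to the LLM, matching the "No" branch at the top of the flowchart.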
```mermaid
flowchart TD
    A([Assembled prompt]) --> B{Caching enabled and temperature = 0?}
    B -- No --> Z[Dispatch directly to LLM]
    Z --> ZR([Return response])
    B -- Yes --> C[Compute cache key: SHA-256 of prompt + model + params]
    C --> D{L1 in-memory cache hit?}
    D -- Hit --> E([Return cached response: latency < 1ms])
    D -- Miss --> F{L2 distributed cache hit?}
    F -- Hit --> G[Populate L1 cache with response]
    G --> H([Return cached response])
    F -- Miss --> I[Forward prompt to LLM serving layer]
    I --> J[Receive LLM response]
    J --> K[Write response to L2 cache with TTL]
    K --> L[Write response to L1 cache]
    L --> M([Return response with cache:miss header])
```