Rate Limiting Architecture: Mermaid Flowchart

Rate limiting is a traffic control mechanism that restricts how many requests a client can make within a given time window, protecting backend services from overload, abuse, and denial-of-service conditions.

What the diagram shows

This flowchart describes the decision path a request takes through a rate limiting layer, covering two common algorithm choices — Token Bucket and Sliding Window — and the system components involved:

1. Identify client: the rate limiter extracts a client key (usually an API key, user ID, or IP address) from the request.
2. Fetch counter from shared store: rate limit state is stored in a fast shared data store (Redis is the canonical choice) so that all gateway replicas apply the same limits.
3. Algorithm check: the limiter checks whether tokens remain (token bucket) or whether the request count in the current window is below the threshold (sliding window).
4. Allow or reject: requests within limits are forwarded, with the updated counter state written back to the store. Requests that exceed the limit receive a 429 Too Many Requests response with a Retry-After header.
5. Limit headers: allowed requests include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers so clients can self-throttle.
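
The five steps above can be sketched in a few lines of Python. This is an illustrative outline only, not the implementation of any particular gateway: `InMemoryStore` is a hypothetical stand-in for the shared store (Redis in the text), the request is assumed to be a plain dict, and the algorithm check uses a simple fixed-window counter as a placeholder for the token bucket or sliding window check. All names, limits, and defaults are assumptions.

```python
import math
import time
from dataclasses import dataclass, field


@dataclass
class Decision:
    status: int                                  # 200 when allowed, 429 when rejected
    headers: dict = field(default_factory=dict)


class InMemoryStore:
    """Hypothetical stand-in for the shared store; works in one process only."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def client_key(request: dict) -> str:
    # Step 1: prefer the API key, then the user ID, then the IP address.
    for name in ("api_key", "user_id", "ip"):
        if request.get(name):
            return f"{name}:{request[name]}"
    raise ValueError("request carries no usable client identifier")


def check(request: dict, store, limit: int = 5, window: int = 60, now=None) -> Decision:
    now = time.time() if now is None else now
    key = client_key(request)

    # Step 2: fetch counter state; start a fresh window if missing or expired.
    state = store.get(key)
    if state is None or now >= state["reset_at"]:
        state = {"count": 0, "reset_at": now + window}

    # Steps 3 and 4: over the limit means 429 plus Retry-After.
    if state["count"] >= limit:
        retry_after = max(1, math.ceil(state["reset_at"] - now))
        return Decision(429, {"Retry-After": str(retry_after)})

    # Step 4: count the request and write the state back to the store.
    state["count"] += 1
    store.set(key, state)

    # Step 5: headers that let the client self-throttle.
    return Decision(200, {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(limit - state["count"]),
        "X-RateLimit-Reset": str(int(state["reset_at"])),
    })
```

With a real shared store, the read and the write-back would need to happen atomically (for example in a single server-side script or transaction), otherwise two gateway replicas could both admit the last permitted request.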

Why this matters

A single misbehaving client — whether a buggy script or a deliberate attacker — can saturate backend resources and degrade the experience for all users. Rate limiting isolates that impact at the edge. It also enforces fair use policies in multi-tenant SaaS platforms.

For what happens after the rate limit check passes, see API Gateway Request Flow. For the client-side response to a 429, explore Request Retry Logic. The Bulkhead Pattern complements rate limiting by isolating resource pools per tenant.

Frequently asked questions

Rate limiting is a traffic control mechanism that restricts how many requests a client can make within a given time window. It protects backend services from overload, abuse, and denial-of-service conditions by rejecting requests that exceed configured thresholds with a 429 Too Many Requests response, typically including a Retry-After header.

The rate limiter extracts a client key (API key, user ID, or IP address) from the request and fetches the client's current counter from a shared store like Redis. It checks the counter against the configured limit using an algorithm — token bucket or sliding window — and either allows the request (updating the counter) or rejects it with a 429. Allowed requests also receive rate limit headers so clients can self-throttle.

Use rate limiting in any public-facing API to protect against misbehaving clients, buggy scripts, and deliberate abuse. It is also essential for enforcing fair use in multi-tenant SaaS platforms where one tenant's traffic could otherwise degrade service for all others. Apply different limits by client tier — free plans get lower limits than paid plans.
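
Per-tier limits can be expressed as a small lookup table. The sketch below is illustrative: the tier names and the numbers are invented for the example, and `limit_for` is a hypothetical helper rather than part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimit:
    requests: int         # requests permitted per window
    window_seconds: int   # length of the window


# Example values only; real limits depend on the product and its capacity.
TIER_LIMITS = {
    "free": TierLimit(requests=60, window_seconds=60),
    "paid": TierLimit(requests=600, window_seconds=60),
}


def limit_for(tier: str) -> TierLimit:
    # Unknown or missing tiers fall back to the most restrictive limit.
    return TIER_LIMITS.get(tier, TIER_LIMITS["free"])
```

Falling back to the lowest tier for unrecognized clients is one reasonable default; rejecting them outright is another.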

Token bucket allows short bursts above the average rate: tokens accumulate in the bucket up to a maximum capacity, and each request consumes one token. Clients can burst until the bucket is empty, then must wait for tokens to refill. Sliding window counts all requests within a rolling time window (e.g., the last 60 seconds) and rejects any that would exceed the limit. It enforces the limit more evenly and prevents clients from exploiting bursts at window boundaries, but it has to track per-request timestamps or weighted sub-window counts, which costs more memory and bookkeeping. Token bucket favors burst-tolerant APIs; sliding window is better when the limit must hold strictly over every rolling window.
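
The difference between the two algorithms is easier to see side by side. The following is a minimal single-process sketch, assuming the caller passes the current time in seconds; the class names and parameters are chosen for illustration, and the sliding window is implemented as a timestamp log, which is only one of several possible variants.

```python
from collections import deque


class TokenBucket:
    def __init__(self, capacity: int, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)   # the bucket starts full
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= 1:
            self.tokens -= 1            # each request consumes one token
            return True
        return False


class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()       # times of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop accepted requests that have left the rolling window.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

A bucket with capacity 3 admits three back-to-back requests and then one more per refill interval, whereas the log never admits more than `limit` requests in any `window_seconds` span.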

mermaid

flowchart TD
    A([Inbound Request]) --> B[Extract client identifier]
    B --> C[Fetch rate limit counter from Redis]
    C --> D{Algorithm type}
    D -- Token Bucket --> E{Tokens available?}
    D -- Sliding Window --> F{Request count below threshold?}

    E -- No tokens --> G[Return 429 with Retry-After header]
    E -- Tokens available --> H[Consume one token]
    H --> I[Write updated token count to Redis]

    F -- Threshold exceeded --> G
    F -- Below threshold --> J[Increment request counter with TTL]
    J --> I

    I --> K[Add rate limit headers to request]
    K --> L[Forward request to upstream service]
    L --> M[Upstream processes request]
    M --> N[Add X-RateLimit-Remaining header to response]
    N --> O([Return response to client])