diagram.mmd — flowchart
Rate Limiting Architecture flowchart diagram

Rate limiting is a traffic control mechanism that restricts how many requests a client can make within a given time window, protecting backend services from overload, abuse, and denial-of-service conditions.

What the diagram shows

This flowchart describes the decision path a request takes through a rate limiting layer, covering two common algorithm choices — Token Bucket and Sliding Window — and the system components involved:

1. Identify client: the rate limiter extracts a client key — usually an API key, user ID, or IP address — from the request.
2. Fetch counter from shared store: rate limit state is stored in a fast shared data store (Redis is the canonical choice) so that all gateway replicas apply the same limits.
3. Algorithm check: the limiter checks whether tokens remain (token bucket) or whether the request count in the current window is below threshold (sliding window).
4. Allow or reject: requests within limits are forwarded with updated counter state written back to the store. Requests that exceed the limit receive a 429 Too Many Requests with a Retry-After header.
5. Limit headers: allowed requests include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers so clients can self-throttle.

Why this matters

A single misbehaving client — whether a buggy script or a deliberate attacker — can saturate backend resources and degrade the experience for all users. Rate limiting isolates that impact at the edge. It also enforces fair use policies in multi-tenant SaaS platforms.

For what happens after the rate limit check passes, see API Gateway Request Flow. For the client-side response to a 429, explore Request Retry Logic. The Bulkhead Pattern complements rate limiting by isolating resource pools per tenant.


Frequently asked questions

What is rate limiting?

Rate limiting is a traffic control mechanism that restricts how many requests a client can make within a given time window. It protects backend services from overload, abuse, and denial-of-service conditions by rejecting requests that exceed configured thresholds with a 429 Too Many Requests response, typically including a Retry-After header.
How does a rate limiter work?

The rate limiter extracts a client key (API key, user ID, or IP address) from the request and fetches the client's current counter from a shared store like Redis. It checks the counter against the configured limit using an algorithm — token bucket or sliding window — and either allows the request (updating the counter) or rejects it with a 429. Allowed requests also receive rate limit headers so clients can self-throttle.
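The headers and the 429 rejection described above can be sketched as two small helpers. The function names here are illustrative, not a real framework API; only the header names themselves come from the diagram.

```python
def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict:
    """Headers attached to allowed responses so clients can self-throttle."""
    return {
        "X-RateLimit-Limit": str(limit),          # configured ceiling for this client
        "X-RateLimit-Remaining": str(remaining),  # requests left in the current window
        "X-RateLimit-Reset": str(reset_epoch),    # when the window/bucket resets
    }

def reject_response(retry_after_seconds: int) -> tuple[int, dict]:
    """429 Too Many Requests plus Retry-After so clients know when to resume."""
    return 429, {"Retry-After": str(retry_after_seconds)}
```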
When should you use rate limiting?

Use rate limiting in any public-facing API to protect against misbehaving clients, buggy scripts, and deliberate abuse. It is also essential for enforcing fair use in multi-tenant SaaS platforms where one tenant's traffic could otherwise degrade service for all others. Apply different limits by client tier — free plans get lower limits than paid plans.
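Per-tier limits are often just a lookup table consulted before the algorithm check. A minimal sketch — the tier names and numbers below are hypothetical, not taken from the diagram:

```python
# Hypothetical tier table; values are illustrative only.
TIER_LIMITS = {
    "free":       {"requests_per_minute": 60,   "burst": 10},
    "pro":        {"requests_per_minute": 600,  "burst": 100},
    "enterprise": {"requests_per_minute": 6000, "burst": 1000},
}

def limit_for(tier: str) -> dict:
    # Unknown or missing tiers fall back to the most restrictive limit.
    return TIER_LIMITS.get(tier, TIER_LIMITS["free"])
```

Falling back to the strictest tier on an unrecognized plan is a deliberate fail-safe choice: a misconfigured client gets throttled rather than unlimited access.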
What is the difference between token bucket and sliding window?

Token bucket allows short bursts above the average rate: tokens accumulate in the bucket up to a maximum capacity, and each request consumes one token. Clients can burst until the bucket is empty, then must wait for tokens to refill. Sliding window counts all requests within a rolling time window (e.g., the last 60 seconds) and rejects any that would exceed the limit — it is smoother and prevents burst exploitation but requires more precise counter management. Token bucket favors burst-tolerant APIs; sliding window is better for strict per-second enforcement.
```mermaid
flowchart TD
    A([Inbound Request]) --> B[Extract client identifier]
    B --> C[Fetch rate limit counter from Redis]
    C --> D{Algorithm type}
    D -- Token Bucket --> E{Tokens available?}
    D -- Sliding Window --> F{Request count below threshold?}
    E -- No tokens --> G[Return 429 with Retry-After header]
    E -- Tokens available --> H[Consume one token]
    H --> I[Write updated token count to Redis]
    F -- Threshold exceeded --> G
    F -- Below threshold --> J[Increment request counter with TTL]
    J --> I
    I --> K[Add rate limit headers to request]
    K --> L[Forward request to upstream service]
    L --> M[Upstream processes request]
    M --> N[Add X-RateLimit-Remaining header to response]
    N --> O([Return response to client])
```