Design a Rate Limiter: System Design Interview Walkthrough

"Design a rate limiter" tests whether you know the standard algorithms, where rate limiting belongs in an architecture, and how to make it correct across many servers. It's a focused question, so depth on the algorithm and the distributed-counter problem is what earns signal.

Here's the full walkthrough with a diagram, covering the four common algorithms and the Redis-backed implementation interviewers expect.

Rate limiter architecture: client, token-bucket limiter, Redis counters, protected API, 429 rejection path

1. Clarify the requirements

Functional requirements

Allow at most N requests per user (or IP, or API key) per time window
Reject excess requests with HTTP 429 and a Retry-After header
Support different limits per endpoint or plan tier

Non-functional requirements

Low added latency (single-digit ms)
Correct across a fleet of API servers (distributed)
Memory-efficient at millions of users
Fail open or closed by policy if the limiter store is down

Back-of-the-envelope scale: Assume 10M users and 10K requests/sec. Each user needs a counter; the limiter must do an atomic check-and-update per request without becoming the bottleneck.

2. API design

Rate limiting is usually middleware at the API gateway, not a public endpoint. Conceptually it exposes one operation:

allow(userId, endpoint) -> { allowed: bool, retryAfter?: seconds }

# On reject the gateway returns:
HTTP 429 Too Many Requests
Retry-After: 30

3. High-level architecture

The limiter lives at the API gateway / middleware layer so it runs before any real work. For a single server an in-memory counter suffices, but across a fleet you need a shared store — Redis — so all servers see the same counts.

Token bucket is the most common choice: each user has a bucket that refills at a fixed rate up to a capacity; each request consumes a token, and an empty bucket means reject. It's simple, allows short bursts, and maps cleanly to Redis.

The other algorithms: leaky bucket (smooths output to a constant rate), fixed window counter (simplest, but allows double-rate bursts at window edges), sliding window log (exact, but stores a timestamp per request — memory-heavy), and sliding window counter (a weighted approximation that's the usual production pick for balancing accuracy and cost).

4. Data model & storage

Per user you store a tiny amount of state: for token bucket, the current token count and the last-refill timestamp; for sliding window counter, the counts in the current and previous windows. All of it lives in Redis with a TTL so idle users expire automatically.

The critical detail is atomicity: the check-and-decrement must be a single atomic operation. Use Redis INCR with EXPIRE for window counters, or a small Lua script for token-bucket refill-and-consume so two concurrent requests can't both pass on the last token.

5. Scaling and bottlenecks

Atomic Redis operations or Lua eliminate race conditions on the counter.
Shard users across Redis nodes by user ID; counters are independent so there are no cross-shard transactions.
Local + global hybrid: a small in-process cache reduces Redis round-trips, syncing periodically — trading a little accuracy for latency.
Decide fail-open vs fail-closed if Redis is unreachable: fail-open keeps the API available, fail-closed protects it.

Key trade-offs the interviewer probes

Accuracy vs memory. Sliding window log is exact but stores a timestamp per request; the sliding window counter approximates it with two integers. Production systems almost always choose the cheaper approximation.
Local vs distributed. Local counters are fastest but inconsistent across servers; a shared Redis store is consistent but adds a network hop. The hybrid splits the difference.
Token bucket vs leaky bucket. Token bucket allows bursts (good for user-facing APIs); leaky bucket enforces a strict constant rate (good for protecting a fragile downstream).

Framework reminder: every system design answer follows the same arc — requirements → estimates → API → high-level design → data model → scale → trade-offs. Keep the system design cheat sheet in mind and narrate which stage you're in.

Reason through trade-offs with live AI support

CoPilot Interview surfaces a structured design skeleton — requirements, API, data model, and scaling — in about 4 seconds during real Zoom and Teams calls. Free for Windows and macOS, with a private desktop window.

Download free

FAQ

Which rate-limiting algorithm should I use?

Token bucket is the most common default: it's simple, allows short bursts, and maps cleanly to Redis. For tighter accuracy without the memory cost of a full request log, the sliding window counter is the usual production choice. Mention fixed-window's edge-burst weakness to show depth.

How do you make a rate limiter work across many servers?

Store the counters in a shared store like Redis so every API server sees the same counts, and make the check-and-update atomic using INCR/EXPIRE or a Lua script. Optionally add a small local cache that syncs periodically to cut Redis round-trips.

How do you avoid race conditions on the counter?

Use atomic operations. Redis INCR with EXPIRE is atomic for window counters, and a short Lua script makes token-bucket refill-and-consume atomic so two concurrent requests can't both consume the last token.

Where should the rate limiter live?

At the API gateway or middleware layer, before requests reach application logic. That way rejected requests cost almost nothing, and the policy is centralized rather than duplicated in each service.

What's the difference between token bucket and leaky bucket?

Token bucket refills tokens at a fixed rate and allows short bursts up to the bucket capacity - good for user-facing APIs. Leaky bucket processes requests at a strict constant rate, smoothing output - good when you must protect a fragile downstream service.

Design a Rate Limiter

1. Clarify the requirements

2. API design

3. High-level architecture

4. Data model & storage

5. Scaling and bottlenecks

Key trade-offs the interviewer probes

Reason through trade-offs with live AI support

FAQ

Related guides