"Design a rate limiter" tests whether you know the standard algorithms, where rate limiting belongs in an architecture, and how to make it correct across many servers. It's a focused question, so depth on the algorithm and the distributed-counter problem is what earns signal.
Here's the full walkthrough with a diagram, covering the four common algorithms and the Redis-backed implementation interviewers expect.
1. Clarify the requirements
Functional requirements
- Allow at most N requests per user (or IP, or API key) per time window
- Reject excess requests with HTTP 429 and a Retry-After header
- Support different limits per endpoint or plan tier
Non-functional requirements
- Low added latency (single-digit ms)
- Correct across a fleet of API servers (distributed)
- Memory-efficient at millions of users
- Fail open or closed by policy if the limiter store is down
Back-of-the-envelope scale: Assume 10M users and 10K requests/sec. Each user needs a counter; the limiter must do an atomic check-and-update per request without becoming the bottleneck.
2. API design
Rate limiting is usually middleware at the API gateway, not a public endpoint. Conceptually it exposes one operation:
allow(userId, endpoint) -> { allowed: bool, retryAfter?: seconds }
# On reject the gateway returns:
HTTP 429 Too Many Requests
Retry-After: 30
3. High-level architecture
The limiter lives at the API gateway / middleware layer so it runs before any real work. For a single server an in-memory counter suffices, but across a fleet you need a shared store — Redis — so all servers see the same counts.
Token bucket is the most common choice: each user has a bucket that refills at a fixed rate up to a capacity; each request consumes a token, and an empty bucket means reject. It's simple, allows short bursts, and maps cleanly to Redis.
The other algorithms: leaky bucket (smooths output to a constant rate), fixed window counter (simplest, but allows double-rate bursts at window edges), sliding window log (exact, but stores a timestamp per request — memory-heavy), and sliding window counter (a weighted approximation that's the usual production pick for balancing accuracy and cost).
4. Data model & storage
Per user you store a tiny amount of state: for token bucket, the current token count and the last-refill timestamp; for sliding window counter, the counts in the current and previous windows. All of it lives in Redis with a TTL so idle users expire automatically.
The critical detail is atomicity: the check-and-decrement must be a single atomic operation. Use Redis INCR with EXPIRE for window counters, or a small Lua script for token-bucket refill-and-consume so two concurrent requests can't both pass on the last token.
5. Scaling and bottlenecks
- Atomic Redis operations or Lua eliminate race conditions on the counter.
- Shard users across Redis nodes by user ID; counters are independent so there are no cross-shard transactions.
- Local + global hybrid: a small in-process cache reduces Redis round-trips, syncing periodically — trading a little accuracy for latency.
- Decide fail-open vs fail-closed if Redis is unreachable: fail-open keeps the API available, fail-closed protects it.
Key trade-offs the interviewer probes
- Accuracy vs memory. Sliding window log is exact but stores a timestamp per request; the sliding window counter approximates it with two integers. Production systems almost always choose the cheaper approximation.
- Local vs distributed. Local counters are fastest but inconsistent across servers; a shared Redis store is consistent but adds a network hop. The hybrid splits the difference.
- Token bucket vs leaky bucket. Token bucket allows bursts (good for user-facing APIs); leaky bucket enforces a strict constant rate (good for protecting a fragile downstream).
Framework reminder: every system design answer follows the same arc — requirements → estimates → API → high-level design → data model → scale → trade-offs. Keep the system design cheat sheet in mind and narrate which stage you're in.
Reason through trade-offs with live AI support
CoPilot Interview surfaces a structured design skeleton — requirements, API, data model, and scaling — in about 4 seconds during real Zoom and Teams calls. Free for Windows and macOS, invisible on screen-share.
Download freeFAQ
Which rate-limiting algorithm should I use?
Token bucket is the most common default: it's simple, allows short bursts, and maps cleanly to Redis. For tighter accuracy without the memory cost of a full request log, the sliding window counter is the usual production choice. Mention fixed-window's edge-burst weakness to show depth.
How do you make a rate limiter work across many servers?
Store the counters in a shared store like Redis so every API server sees the same counts, and make the check-and-update atomic using INCR/EXPIRE or a Lua script. Optionally add a small local cache that syncs periodically to cut Redis round-trips.
How do you avoid race conditions on the counter?
Use atomic operations. Redis INCR with EXPIRE is atomic for window counters, and a short Lua script makes token-bucket refill-and-consume atomic so two concurrent requests can't both consume the last token.
Where should the rate limiter live?
At the API gateway or middleware layer, before requests reach application logic. That way rejected requests cost almost nothing, and the policy is centralized rather than duplicated in each service.
What's the difference between token bucket and leaky bucket?
Token bucket refills tokens at a fixed rate and allows short bursts up to the bucket capacity - good for user-facing APIs. Leaky bucket processes requests at a strict constant rate, smoothing output - good when you must protect a fragile downstream service.