HomeBlog › Design a Rate Limiter

Design a Rate Limiter

Token bucket vs sliding window, where to place it, the Redis implementation, and the accuracy-versus-memory trade-off interviewers push on.

"Design a rate limiter" tests whether you know the standard algorithms, where rate limiting belongs in an architecture, and how to make it correct across many servers. It's a focused question, so depth on the algorithm and the distributed-counter problem is what earns signal.

Here's the full walkthrough with a diagram, covering the four common algorithms and the Redis-backed implementation interviewers expect.

Rate limiter architecture: client, token-bucket limiter, Redis counters, protected API, 429 rejection path
Rate limiter architecture: client, token-bucket limiter, Redis counters, protected API, 429 rejection path

1. Clarify the requirements

Functional requirements

Non-functional requirements

Back-of-the-envelope scale: Assume 10M users and 10K requests/sec. Each user needs a counter; the limiter must do an atomic check-and-update per request without becoming the bottleneck.

2. API design

Rate limiting is usually middleware at the API gateway, not a public endpoint. Conceptually it exposes one operation:

allow(userId, endpoint) -> { allowed: bool, retryAfter?: seconds }

# On reject the gateway returns:
HTTP 429 Too Many Requests
Retry-After: 30

3. High-level architecture

The limiter lives at the API gateway / middleware layer so it runs before any real work. For a single server an in-memory counter suffices, but across a fleet you need a shared store — Redis — so all servers see the same counts.

Token bucket is the most common choice: each user has a bucket that refills at a fixed rate up to a capacity; each request consumes a token, and an empty bucket means reject. It's simple, allows short bursts, and maps cleanly to Redis.

The other algorithms: leaky bucket (smooths output to a constant rate), fixed window counter (simplest, but allows double-rate bursts at window edges), sliding window log (exact, but stores a timestamp per request — memory-heavy), and sliding window counter (a weighted approximation that's the usual production pick for balancing accuracy and cost).

4. Data model & storage

Per user you store a tiny amount of state: for token bucket, the current token count and the last-refill timestamp; for sliding window counter, the counts in the current and previous windows. All of it lives in Redis with a TTL so idle users expire automatically.

The critical detail is atomicity: the check-and-decrement must be a single atomic operation. Use Redis INCR with EXPIRE for window counters, or a small Lua script for token-bucket refill-and-consume so two concurrent requests can't both pass on the last token.

5. Scaling and bottlenecks

Key trade-offs the interviewer probes

Framework reminder: every system design answer follows the same arc — requirements → estimates → API → high-level design → data model → scale → trade-offs. Keep the system design cheat sheet in mind and narrate which stage you're in.

Reason through trade-offs with live AI support

CoPilot Interview surfaces a structured design skeleton — requirements, API, data model, and scaling — in about 4 seconds during real Zoom and Teams calls. Free for Windows and macOS, invisible on screen-share.

Download free

FAQ

Which rate-limiting algorithm should I use?

Token bucket is the most common default: it's simple, allows short bursts, and maps cleanly to Redis. For tighter accuracy without the memory cost of a full request log, the sliding window counter is the usual production choice. Mention fixed-window's edge-burst weakness to show depth.

How do you make a rate limiter work across many servers?

Store the counters in a shared store like Redis so every API server sees the same counts, and make the check-and-update atomic using INCR/EXPIRE or a Lua script. Optionally add a small local cache that syncs periodically to cut Redis round-trips.

How do you avoid race conditions on the counter?

Use atomic operations. Redis INCR with EXPIRE is atomic for window counters, and a short Lua script makes token-bucket refill-and-consume atomic so two concurrent requests can't both consume the last token.

Where should the rate limiter live?

At the API gateway or middleware layer, before requests reach application logic. That way rejected requests cost almost nothing, and the policy is centralized rather than duplicated in each service.

What's the difference between token bucket and leaky bucket?

Token bucket refills tokens at a fixed rate and allows short bursts up to the bucket capacity - good for user-facing APIs. Leaky bucket processes requests at a strict constant rate, smoothing output - good when you must protect a fragile downstream service.