Design a Rate Limiter — System Design

Q: Token bucket or leaky bucket — when do you use each?

Token bucket allows short bursts up to the bucket capacity then throttles to the refill rate — ideal for APIs that tolerate bursty traffic. Leaky bucket enforces a constant outflow rate via a FIFO queue, smoothing bursts into a steady stream — ideal when a downstream system needs a stable, predictable request rate.

Q: Why does a fixed window allow double the limit at the boundary, and how does a sliding window fix it?

A fixed window resets the counter at each interval boundary, so a client can send the full limit in the last second of one window and again in the first second of the next — up to 2x the limit in a short span. A sliding window log keeps exact timestamps (accurate but memory-heavy); a sliding window counter weights the previous window's count by overlap, fixing the boundary burst at O(1) memory and near-exact accuracy.

Q: Why do you need a Lua script for a Redis rate limiter?

Check-then-increment is a read-modify-write race: two concurrent requests can both read count=99, both decide they are under the limit of 100, and both increment to 101 — over-admitting. A Lua script runs atomically inside Redis as a single operation, so the read, the limit check, and the increment happen with no interleaving.

Q: Should the rate limiter fail open or fail closed when Redis is down?

Most production systems fail open — if the rate-limit store is unreachable, allow the request rather than reject all traffic, because a limiter outage should not become a full outage. Critical endpoints like login or payments may fail closed or fall back to a conservative local in-memory limit. State the choice explicitly; it is an availability-versus-abuse tradeoff.

Q: Where should the rate limiter live?

Centralize it at the API gateway or a dedicated middleware in front of your services so every service inherits protection and the rules live in one place. A client-side limiter is advisory only because clients can bypass it. For ultra-low-latency hot paths, combine a local in-process token bucket as the first line with a shared Redis counter as global truth.

A rate limiter caps how many requests a client may make in a given window, protecting your services from abuse, accidental traffic storms, and the cascading failures that follow when one tenant saturates a shared resource. It sounds like a single counter and an if statement, but the design gets hard fast: the counter must be shared and consistent across a fleet of stateless servers, the check must add almost no latency to the hot path, the algorithm has to handle bursts and window boundaries fairly, and the whole thing must keep working — fail open or fail closed by deliberate choice — when its own backing store is having a bad day. This guide walks the algorithms (token bucket, leaky bucket, fixed and sliding windows), the distributed implementation in Redis, the atomicity guarantees that make concurrent counting correct, and the HTTP semantics (429, Retry-After, X-RateLimit-*) that clients depend on.

⚡ Quick Takeaways

Token bucket is the default — allows bursts up to bucket capacity, then throttles to a steady refill rate; used by Stripe, AWS, and most public APIs.
Leaky bucket smooths output — a FIFO queue drained at a constant rate; choose it when a downstream needs a stable, predictable request rate.
Fixed window is simplest but leaks at the boundary — a client can send up to 2× the limit across two adjacent windows.
Sliding window counter fixes the boundary — weights the previous window's count by overlap; O(1) memory, near-exact accuracy.
Distributed counters live in Redis — INCR + EXPIRE per (client, window) key; one shared source of truth across all gateway nodes.
Lua makes the check atomic — read, compare, and increment run as one indivisible Redis operation, eliminating the check-then-increment race.
Lives at the API gateway — one choke point protects every service; client-side limiting is advisory only.
Reject with 429 + headers — return Retry-After and X-RateLimit-Limit/Remaining/Reset so well-behaved clients back off.

tldr

Put the limiter at the API gateway so every service is protected from one place. Token bucket is the go-to algorithm — bursty-friendly with a smooth refill; leaky bucket when you need constant output; sliding window counter when you want fixed-window simplicity without the boundary burst. Keep the counter in Redis and make the check-and-increment atomic with a Lua script. Reject over-limit traffic with 429 Too Many Requests plus Retry-After and X-RateLimit-* headers. Decide fail-open vs fail-closed for when Redis is unreachable.

Step 1 — Clarify Requirements

Rate limiter is a deceptively deep interview prompt because "limit requests" hides a dozen decisions. Pin down scope before drawing boxes.

Functional requirements

Throttle requests by a key: per API key, per user ID, per IP, or per (user, endpoint) tuple.
Support multiple rules — e.g. 10 req/s burst and 1,000 req/hour sustained on the same client.
Return a clear signal to the client when throttled: 429 with a Retry-After hint.
Rules are configurable and hot-reloadable without redeploying services.
Different limits for different tiers (free vs paid) and different endpoints (cheap reads vs expensive writes).

Non-functional requirements

Low added latency — the check sits on every request; target <1–2 ms p99 overhead.
Accuracy under concurrency — counting must be correct when thousands of requests for the same key hit different nodes simultaneously.
High availability — the limiter must not become a single point of failure; decide its behavior when the counter store is down.
Distributed correctness — limits are global across a horizontally scaled, stateless server fleet behind a load balancer, not per-node.
Scalability — handle the platform's full request volume (hundreds of thousands of checks/sec) with headroom.

interview tip

The first clarifying question worth asking: "Is the limit global across all servers, or per server?" Per-server is trivial (an in-memory counter) but wrong for most real systems — behind a load balancer, a client's requests fan out across nodes, so a per-node limit of N with M nodes effectively permits N×M. Stating this distinction early signals you understand the real problem is distributed counting.

Step 2 — Capacity Estimation

Size the limiter against the traffic it must guard. Assume an API platform serving 500K requests/sec at peak.

Throughput

Every request triggers one rate-limit check → 500K checks/sec. Each check is one Redis round trip (or one Lua eval).
A single Redis instance handles ~100K–200K simple ops/sec. At 500K/sec you need a sharded Redis cluster (counters partitioned by client key) or a local-plus-global hybrid to cut Redis traffic.
Budget the network: a co-located Redis adds ~0.2–0.5 ms per check. Keep Redis in the same AZ as the gateway to avoid cross-AZ latency on the hot path.

Memory

Each active counter key is tiny: rl:{client}:{window} → an integer plus TTL ≈ ~100 bytes with Redis overhead.
10M distinct active clients in a window → 10M × 100 B ≈ ~1 GB. Comfortably fits in memory; TTLs evict expired windows automatically so memory does not grow unbounded.
Token-bucket state needs two fields per client (token count + last-refill timestamp) — still tiny, stored as a Redis hash.

note

The data is small; the hard part is throughput and correctness, not storage. A rate limiter is a latency- and consistency-bound system, not a storage-bound one — which is exactly why the design conversation centers on algorithms, atomicity, and where the check lives, not on sharding terabytes.

Step 3 — Where the Rate Limiter Lives

Placement is a first-class design decision. The same algorithm behaves very differently depending on where it sits.

Location	Pros	Cons
Client-side	Zero server cost; instant feedback	Advisory only — a malicious client just ignores it. Never a security boundary.
API gateway / reverse proxy	One choke point protects all services; rules centralized; rejects before requests reach app servers	Gateway needs access to shared counter state; can become a bottleneck if not scaled
Middleware in each service	Fine-grained per-service rules; no extra hop	Logic duplicated across services; harder to keep consistent
Dedicated sidecar (service mesh)	Decoupled from app code; uniform policy via mesh control plane	Operational complexity of a mesh; per-pod state coordination

The canonical answer is the API gateway: it is the natural single entry point, it can reject over-limit traffic before it consumes downstream resources, and it keeps the rules in one place. The gateway is stateless itself and consults a shared store (Redis) for the counts, so horizontal scaling of the gateway does not fragment the limit. Client-side limiting is still worth adding as a courtesy layer (it reduces wasted round trips) but is never trusted for enforcement.

Step 4 — Rules and Configuration

Limits are data, not code. A rules engine lets product define limits per tier and endpoint and reload them without a deploy. A compact representation:

YAML

# rate-limit rules, hot-reloaded from a config service
rules:
  - match: { tier: free,  endpoint: "POST /api/*" }
    limit:  { requests: 10,   window: 1s,  algorithm: token_bucket, burst: 20 }
  - match: { tier: free,  endpoint: "*" }
    limit:  { requests: 1000, window: 1h,  algorithm: sliding_window }
  - match: { tier: paid,  endpoint: "*" }
    limit:  { requests: 100,  window: 1s,  algorithm: token_bucket, burst: 200 }
key_by:   api_key        # or user_id, ip, "user_id:endpoint"
on_fail:  allow          # fail-open if the counter store is unreachable

Multiple rules can apply to one request — a free user hitting POST /api/x is governed by both the 10/s burst rule and the 1,000/hour sustained rule. The limiter evaluates all matching rules and rejects if any is exceeded (the most restrictive wins). Order rules from most specific to least specific and short-circuit on the first rejection to save Redis calls.

Step 5 — Algorithms: The Landscape

Five algorithms dominate the conversation. Each trades accuracy, memory, and burst behavior differently.

Algorithm	Burst handling	Memory	Accuracy
Token bucket	Allows bursts up to bucket size, then steady refill	O(1) — 2 fields/key	Exact
Leaky bucket	Smooths bursts into constant outflow (queue)	O(queue size)	Exact rate, adds delay
Fixed window	Permits up to 2× at window boundary	O(1) — 1 counter/key	Approximate (boundary leak)
Sliding window log	No boundary leak; exact	O(N) — timestamp per request	Exact
Sliding window counter	Near-exact; weighted approximation	O(1) — 2 counters/key	Approximate (very close)

In interviews, name token bucket as your default and sliding window counter as the refinement when fixed-window's boundary leak matters. Leaky bucket is the right call only when the downstream genuinely needs a smoothed, constant output rate (e.g. feeding a fixed-throughput worker pool). The next steps go deep on each.

Step 6 — Token Bucket vs Leaky Bucket

Token bucket

A bucket holds up to capacity tokens and refills at refill_rate tokens per second. Each request removes one token; if the bucket is empty, the request is rejected. Because a full bucket can be drained in one burst, token bucket tolerates bursts up to its capacity while bounding the long-run average to the refill rate — exactly the behavior most public APIs want.

Python

def allow(bucket, capacity, refill_rate, now):
    # lazily refill based on elapsed time — no background timer needed
    elapsed = now - bucket.last_refill
    bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_rate)
    bucket.last_refill = now

    if bucket.tokens >= 1:
        bucket.tokens -= 1
        return True          # admit
    return False             # reject → 429

The key trick is lazy refill: rather than a background timer adding tokens, you compute how many tokens would have accrued since the last request from the elapsed time. This makes each check O(1) and stateless between calls except for the two stored fields (tokens, last_refill). In a distributed setting those two fields live in a Redis hash per client key, and the whole read-modify-write must be atomic (Step 10).

why it's the default

Token bucket separates two concerns cleanly: capacity controls how big a burst you tolerate, refill_rate controls the sustained rate. You can let a client burst to 200 requests instantly after an idle period yet hold them to 100/s on average — friendly to real bursty clients (page loads, retries) without permitting sustained abuse. That tunability is why Stripe, GitHub, and AWS publish token-bucket-style limits.

Step 7 — Leaky Bucket

Leaky bucket models requests as water poured into a bucket with a hole: requests enter a FIFO queue and are processed (leak out) at a fixed rate. If the queue is full, new requests overflow and are rejected. The defining property is the constant outflow rate regardless of how bursty the input is.

Property	Token bucket	Leaky bucket
Output rate	Bursty (up to capacity instantly)	Constant (smoothed)
Added latency	None — admit or reject immediately	Queued requests wait
Best for	User-facing APIs that tolerate bursts	Protecting a fixed-throughput downstream
Memory	O(1)	O(queue length)

The tradeoff is latency: leaky bucket can delay a request (it sits in the queue) rather than rejecting it, which smooths spikes but adds tail latency and requires queue memory. Use it when the thing you protect — a payment processor, a legacy system, a worker pool with fixed capacity — needs a steady drip and would rather make callers wait than see a spike. For most rate-limiting (reject-don't-delay), token bucket is the better fit.

Step 8 — Fixed Window and Sliding Window

The window-counter family is the easiest to implement on top of Redis but hides a subtle fairness bug at window boundaries.

Fixed window

Bucket time into fixed intervals (e.g. each calendar minute). Keep one counter per (client, minute); increment on each request; reject when it exceeds the limit; let the key expire when the window ends. Dead simple — a single INCR — but it permits a boundary burst: a client can fire the full limit in the last moment of one window and the full limit again in the first moment of the next, admitting up to 2× the limit within a span shorter than one window.

Sliding window log

Store a timestamp for every request in a sorted set per client. On each request, drop timestamps older than now - window and count what remains; admit if under the limit. This is exact and has no boundary leak — but it stores one entry per request, so a client doing 10K req/s costs 10K timestamps in memory per window. Accurate but expensive.

Sliding window counter

The pragmatic middle ground: keep a counter for the current and previous fixed windows, and estimate the rolling count by weighting the previous window by how much it still overlaps the sliding window.

Python

# weighted estimate of requests in the trailing window
elapsed = now - current_window_start          # seconds into current window
weight  = (window_size - elapsed) / window_size  # overlap with previous window

estimated = current_count + previous_count * weight
if estimated < limit:
    current_count += 1
    return True     # admit
return False        # reject → 429

If you're 25% into the current minute, the previous minute still contributes 75% of its count to the rolling estimate. This kills the boundary burst, costs only two counters per key (O(1) memory), and is accurate to within a fraction of a percent in practice. It's the algorithm Cloudflare popularized for exactly this reason — fixed-window cost with sliding-window fairness.

interview tip

If asked "fixed or sliding window?", lead with the boundary-burst problem and draw it: two adjacent windows, full limit at the end of one and start of the next. Then offer sliding window counter as the O(1) fix. Reach for sliding window log only when the interviewer demands exactness and is fine paying per-request memory.

Step 9 — Distributed Counters in Redis

Behind a load balancer the limiter is stateless and the count must be shared, so the counter lives in Redis — single-threaded (no internal data races), in-memory (sub-millisecond), and equipped with atomic INCR and per-key TTL. A fixed-window limiter is just two commands:

Redis

# key = rl:{client}:{window_id}, e.g. rl:api_key_42:1719600000
INCR rl:api_key_42:1719600000        # returns the new count
EXPIRE rl:api_key_42:1719600000 60   # TTL = window size; auto-cleanup

# app logic:
#   count = INCR(key); if count == 1: EXPIRE(key, 60)
#   if count > limit: reject (429)

The window id is derived from the timestamp (floor(now / window_size)), so the key rotates automatically and old windows expire via TTL — no sweep job needed. At 500K checks/sec a single Redis node is over budget, so partition counters across a sharded cluster: hash the client key to a shard. Because every counter is scoped to one client key, sharding by client keeps each client's counter on a single node — no cross-shard coordination, no distributed transaction.

scalability note

Sharding by client key means a single very hot client still concentrates on one shard (a hot-key problem). Mitigate with a local in-process limiter in front of Redis for that client (Step 11), or by splitting a hot client's counter across N sub-keys and summing — trading exactness for spread. For most workloads, client-key sharding distributes load evenly because no single client dominates.

Step 10 — Atomicity with Lua Scripts

The naive flow — read the count, compare to the limit, then increment — is a classic read-modify-write race. Two requests for the same key arrive at two gateway nodes simultaneously; both GET count=99, both conclude they're under the limit of 100, both INCR to 101. The limit is breached. INCR alone is atomic, but the token-bucket and sliding-window math involves reading state, computing, and writing back — multiple steps that must not interleave.

Redis solves this by running a Lua script atomically: the entire script executes as one indivisible unit on Redis's single thread — no other command runs in between. The read, the limit check, and the write happen together or not at all.

Lua

-- token bucket, evaluated atomically via EVAL
-- KEYS[1]=bucket key  ARGV: capacity, refill_rate, now, requested
local b = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(b[1]) or tonumber(ARGV[1])   -- start full
local ts     = tonumber(b[2]) or tonumber(ARGV[3])
local cap, rate, now, need = tonumber(ARGV[1]), tonumber(ARGV[2]), tonumber(ARGV[3]), tonumber(ARGV[4])

tokens = math.min(cap, tokens + (now - ts) * rate)   -- lazy refill
local allowed = tokens >= need
if allowed then tokens = tokens - need end

redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], 3600)
return allowed and 1 or 0

The script does the refill, the check, the decrement, and the persistence in one atomic step, returning a single allow/deny. Ship it once with SCRIPT LOAD and invoke by SHA via EVALSHA to avoid resending the body each call. The alternatives — WATCH/MULTI optimistic transactions — work but retry on contention; for a hot key under heavy concurrency, the Lua approach is simpler and faster because it never aborts.

Step 11 — Sync vs Async and Local-Plus-Global

Consulting Redis on every request is the synchronous model: perfectly accurate, but it puts a network round trip on the hot path and makes Redis a throughput bottleneck and a failure dependency. Two refinements relax this.

Local-plus-global (two-tier)

Each gateway node keeps a small in-process token bucket sized to its fair share of the global limit, and only periodically reconciles with Redis. Most requests are admitted or rejected locally in nanoseconds; Redis is consulted to redistribute headroom, not on every call. This slashes Redis traffic and survives brief Redis blips, at the cost of slight over-admission — node-local buckets can collectively exceed the global limit during a sync interval. For a limit of 1,000/s across 10 nodes you might briefly admit 1,050. Usually acceptable; tune the sync interval to trade accuracy for Redis load.

Async write-behind

Admit based on a local read, then push the increment to Redis asynchronously. Lowest latency, weakest accuracy — counts lag and over-admission grows. Reserve this for soft limits where approximate is fine (e.g. analytics throttling), not for hard security limits.

tradeoff

Synchronous Redis = exact but a per-request round trip and a hard dependency. Local-plus-global = near-exact with far less Redis traffic and graceful degradation, at the price of bounded over-admission. The right choice depends on whether the limit is a hard security boundary (sync, fail-closed) or a fairness/cost control (local-plus-global, fail-open). Name the limit's purpose, then pick.

Step 12 — The 429 Response and Headers

When a request is throttled, return 429 Too Many Requests and tell the client how to behave. Good headers turn blind retry storms into polite back-off.

HTTP

HTTP/1.1 429 Too Many Requests
Retry-After: 2                  # seconds until the client may retry
X-RateLimit-Limit: 100           # ceiling for this window
X-RateLimit-Remaining: 0          # tokens left right now
X-RateLimit-Reset: 1719600062     # epoch when the window refills
Content-Type: application/json

{ "error": "rate_limited",
  "message": "Too many requests. Retry after 2s." }

429 is the correct status — distinct from 503 (server overloaded) so clients can special-case throttling.
Retry-After is the single most useful header: it converts client retry logic from guesswork into a precise wait, preventing the retry storms that turn a throttle into an outage.
X-RateLimit-* let well-behaved clients self-pace before hitting the wall — returning them on successful responses too lets clients see remaining budget in real time.
Pair server limits with client-side exponential backoff + jitter so that a fleet of clients hitting the limit simultaneously doesn't retry in a synchronized thundering herd.

Step 13 — Edge Cases and Failure Modes

The interview depth lives here — the cases that separate a textbook answer from a production one.

Counter store is down — decide fail-open (allow traffic; a limiter outage shouldn't be a full outage — the common default) vs fail-closed (reject; for limits protecting fragile or security-critical resources). A robust design falls back to a conservative local in-memory limit when Redis is unreachable, so it neither floods nor blacks out.
Clock skew — token-bucket lazy refill and window ids both depend on time. Use Redis server time (TIME command) inside the Lua script as the single clock, rather than each gateway's local clock, to avoid skew across nodes.
Burst at the window boundary — the fixed-window 2× leak (Step 8); fix with sliding window counter.
Race conditions — concurrent check-then-increment over-admits; fix with the Lua atomic script (Step 10).
Distributed over-counting — local-plus-global tiers over-admit by a bounded amount during sync; acceptable for soft limits, not for hard ones.
Hot keys — one client hammering a single shard; mitigate with a local first-tier bucket or sub-key splitting.
What identity to key on — IP is spoofable and shared (NAT, corporate proxies punish many users under one IP); prefer authenticated API key or user ID, falling back to IP only for unauthenticated traffic.
Retry storms — without Retry-After and jittered backoff, throttled clients hammer in lockstep; the limiter must shape client behavior, not just reject.

Step 14 — Scaling and Fault Tolerance

At platform scale the limiter itself must be highly available and horizontally scalable.

Scaling the check path

Gateways are stateless — scale them horizontally behind the load balancer. The shared state is Redis; scale it by sharding counters across a Redis Cluster keyed by client. Add the local-plus-global tier (Step 11) to cut per-request Redis round trips by an order of magnitude, keeping Redis well under its ops ceiling.

Fault tolerance

Redis failure — run Redis with replicas and automatic failover (Sentinel or Cluster). During a failover window, the limiter follows its on_fail policy: fail-open with a local fallback bucket for most endpoints, fail-closed for security-critical ones.
Gateway failure — stateless, so the load balancer simply routes around a dead node; no counter state is lost because it lives in Redis.
Region failure — run a Redis cluster per region and limit per-region. Truly global limits across regions require cross-region replication and accept eventual consistency; most designs limit per region because cross-region synchronous counting would add unacceptable latency to the hot path.

scalability note

Global-across-regions exactness fights physics: a synchronous global counter means every request waits on a cross-region round trip. The standard compromise is to allocate each region a slice of the global budget (e.g. 60% to the region with most traffic, 40% to the other) and limit locally — approximate globally, exact locally, and zero cross-region latency on the request path.

Step 15 — Key Tradeoffs

Decision	Choice	Trade-off accepted
Algorithm	Token bucket	Allows bursts up to capacity; tune capacity vs refill to shape it
Window fairness	Sliding window counter	Tiny approximation error in exchange for O(1) memory and no boundary leak
State store	Sharded Redis	A dependency on the hot path; mitigated by local fallback and fail-open
Atomicity	Lua script (EVALSHA)	Logic runs inside Redis; harder to debug than app code, but race-free
Consistency model	Local-plus-global	Bounded over-admission during sync windows for far less Redis traffic
Failure behavior	Fail-open (default)	Brief over-admission during a Redis outage beats rejecting all traffic
Placement	API gateway	One choke point; the gateway must scale and reach shared state

takeaway

A rate limiter is a small amount of state guarded by a lot of careful reasoning. Token bucket is the default for its burst-friendly tunability; sliding window counter fixes the fixed-window boundary leak at O(1) cost; Redis holds the shared counter and a Lua script makes the check-and-increment atomic; the API gateway is where it lives; and a deliberate fail-open-vs-fail-closed choice keeps a limiter outage from becoming a service outage. The depth is in the edge cases — clock skew, hot keys, distributed over-counting, retry storms — so name them and state your tradeoffs.

🎯 interview hot-takes

Token bucket or leaky bucket — when do you use each? Token bucket allows short bursts up to the bucket capacity then throttles to the refill rate — ideal for APIs that tolerate bursty traffic. Leaky bucket enforces a constant outflow rate via a FIFO queue, smoothing bursts into a steady stream — ideal when a downstream system needs a stable, predictable request rate.
Why does a fixed window allow double the limit at the boundary, and how does a sliding window fix it? A fixed window resets the counter at each interval boundary, so a client can send the full limit in the last second of one window and again in the first second of the next — up to 2x the limit in a short span. A sliding window log keeps exact timestamps (accurate but memory-heavy); a sliding window counter weights the previous window's count by overlap, fixing the boundary burst at O(1) memory and near-exact accuracy.
Why do you need a Lua script for a Redis rate limiter? Check-then-increment is a read-modify-write race: two concurrent requests can both read count=99, both decide they are under the limit of 100, and both increment to 101 — over-admitting. A Lua script runs atomically inside Redis as a single operation, so the read, the limit check, and the increment happen with no interleaving.
Should the rate limiter fail open or fail closed when Redis is down? Most production systems fail open — if the rate-limit store is unreachable, allow the request rather than reject all traffic, because a limiter outage should not become a full outage. Critical endpoints like login or payments may fail closed or fall back to a conservative local in-memory limit. State the choice explicitly; it is an availability-versus-abuse tradeoff.
Where should the rate limiter live? Centralize it at the API gateway or a dedicated middleware in front of your services so every service inherits protection and the rules live in one place. A client-side limiter is advisory only because clients can bypass it. For ultra-low-latency hot paths, combine a local in-process token bucket as the first line with a shared Redis counter as global truth.