System Design: Rate Limiter
Design a distributed rate limiter that supports token bucket, sliding window, and fixed window algorithms, handling millions of requests per second with Redis-backed counters and minimal latency overhead.
Requirements
Functional Requirements:
- Limit requests per consumer per time window (per second, minute, hour, day)
- Support multiple algorithms: fixed window, sliding window log, sliding window counter, token bucket, leaky bucket
- Apply limits at multiple granularities: per user, per IP, per API key, per endpoint, globally
- Return appropriate HTTP 429 responses with Retry-After and X-RateLimit-* headers
- Allow burst capacity above steady-state rate (token bucket burst parameter)
- Configurable limits per consumer tier (free: 100/min, pro: 10,000/min, enterprise: unlimited)
Non-Functional Requirements:
- Rate limit decision latency under 2ms p99 (must not significantly impact API response time)
- 99.99% availability; if the rate limiter itself is unavailable, it should fail open (allow requests) rather than fail closed
- Handle 5 million requests/second across the system
- Counters must be accurate within 0.1% across a distributed cluster
Scale Estimation
5M requests/sec across 50 API servers. Each request requires one rate limit check. If using centralized Redis: 5M Redis commands/sec — a single Redis node handles ~200K commands/sec, so 25 Redis nodes needed. With local token bucket caches on API servers (absorbing 90% of checks locally): Redis load drops to 500K/sec, manageable with 3-5 Redis nodes. Counter storage: 1M active API keys × 3 windows (minute/hour/day) × 8 bytes = 24 MB — trivially fits in Redis RAM.
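The same arithmetic as runnable code, for a quick sanity check (every input is an assumption stated above, not a measurement):

```python
# Back-of-envelope numbers from the estimate above.
total_rps = 5_000_000              # rate limit checks/sec, system-wide
node_capacity = 200_000            # Redis commands/sec per node
print(total_rps // node_capacity)  # -> 25 nodes if every check hits Redis

local_absorption = 0.90            # share of checks served by local buckets
print(round(total_rps * (1 - local_absorption)))  # -> 500,000 checks/sec to Redis

active_keys, windows, counter_bytes = 1_000_000, 3, 8
print(active_keys * windows * counter_bytes / 1e6)  # -> 24.0 MB of counter state
```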
High-Level Architecture
A two-tier rate limiting architecture balances accuracy and performance. Tier 1: local in-process rate limiting on each API server using token bucket algorithm. This handles the common case (consumer well within their limit) with zero network calls. Tier 2: centralized Redis-backed counters for exact enforcement across the cluster, used when the local token bucket is exhausted or when precise distributed counting is required.
Redis is the source of truth for rate limit counters. API servers periodically synchronize their local token bucket state with Redis, incrementing by the number of requests processed locally and checking whether the global counter exceeds the limit. This sync happens every 100ms, so there are brief windows in which a consumer can exceed their limit by up to (local_rate × 0.1s × num_servers) before Redis reflects the overage, an acceptable approximation for most use cases.
For strict enforcement (financial APIs, security-sensitive endpoints), all requests go through Redis synchronously with Lua scripts to ensure atomicity. For lenient enforcement (general API throttling), the two-tier approach provides sub-millisecond decisions.
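A minimal sketch of the two-tier decision path, assuming a per-server token bucket with a Redis fallback (the class name, key scheme, and fixed-window fallback are illustrative; the 100ms background sync described above is omitted for brevity):

```python
import time
import redis

class TwoTierLimiter:
    """Tier 1: in-process token bucket, zero network calls.
    Tier 2: authoritative Redis counter, consulted only when the
    local bucket is exhausted. Fails open on Redis errors."""

    def __init__(self, redis_client, capacity, refill_rate, global_limit_per_min):
        self.redis = redis_client
        self.capacity = capacity                  # local burst limit C
        self.refill_rate = refill_rate            # tokens/sec R
        self.global_limit = global_limit_per_min  # cluster-wide per-minute cap
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self, consumer_id: str) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True  # Tier 1 hit: the common case, no network round trip
        try:
            # Tier 2: a real deployment would call the atomic Lua script
            # shown under Core Components; a fixed-window INCR stands in.
            key = f"rl:window:{consumer_id}:{int(time.time() // 60)}"
            count = self.redis.incr(key)
            self.redis.expire(key, 120)  # 2x window TTL
            return count <= self.global_limit
        except redis.RedisError:
            return True  # fail open, per the availability requirement
```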
Core Components
Token Bucket Algorithm
Each consumer has a bucket with capacity C (burst limit) and refill rate R (steady-state limit). Tokens are added at rate R continuously (up to capacity C). Each request consumes 1 token; if no tokens remain, the request is rejected. Implementation: store (token_count, last_refill_timestamp) per consumer. On each request: compute tokens_to_add = (now - last_refill) × R; new_count = min(C, current_count + tokens_to_add); if new_count >= 1, subtract 1 and allow; else reject. In Redis: a Lua script reads, computes, and writes atomically in a single round trip. Token bucket naturally handles bursts — a consumer that was idle accumulates tokens up to burst capacity.
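A sketch of this in redis-py, with the read-compute-write done atomically in one Lua round trip (key layout follows the Database Design section below; the hash field names, TTL policy, and helper function are illustrative):

```python
import time
import redis

# Token bucket as a single atomic Lua round trip, as described above.
TOKEN_BUCKET_LUA = """
local key      = KEYS[1]
local capacity = tonumber(ARGV[1])  -- burst limit C
local rate     = tonumber(ARGV[2])  -- refill rate R, tokens/sec
local now      = tonumber(ARGV[3])  -- caller-supplied clock, seconds

local state  = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts     = tonumber(state[2]) or now

-- refill continuously, capped at capacity
tokens = math.min(capacity, tokens + (now - ts) * rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'ts', now)
-- expire idle buckets; a full refill takes capacity/rate seconds
redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)
return allowed
"""

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow(consumer_id: str, capacity: int, rate: float) -> bool:
    key = f"rl:token:{consumer_id}"
    return token_bucket(keys=[key], args=[capacity, rate, time.time()]) == 1
```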
Sliding Window Counter
The fixed window counter has a boundary problem: a consumer with a limit of 100/min can send 100 requests at 00:59 and another 100 at 01:00, i.e. 200 requests in two seconds. The sliding window counter fixes this by blending two adjacent windows. Algorithm: estimated_count = floor(requests_in_current_window + requests_in_previous_window × (1 - elapsed_fraction_of_current_window)). This approximates a true sliding window with O(1) memory instead of the O(requests) of a full sliding log. Redis keys: rl:{consumer}:{window_start_minute} with INCR and EXPIRE set to 2× the window duration. Error vs. a true sliding window: <1% at steady state.
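A redis-py sketch of the blended check, using the rl:{consumer}:{window_start} keys and 2× TTL described above (here rejected requests still consume quota, a common and slightly stricter choice):

```python
import time
import redis

r = redis.Redis()

def sliding_window_allow(consumer_id: str, limit: int, window: int = 60) -> bool:
    """Approximate sliding window: weight the previous fixed window by
    the fraction of the current window that has not yet elapsed."""
    now = time.time()
    current_start = int(now // window)
    elapsed_fraction = (now % window) / window

    current_key = f"rl:{consumer_id}:{current_start}"
    previous_key = f"rl:{consumer_id}:{current_start - 1}"

    pipe = r.pipeline()
    pipe.incr(current_key)
    pipe.expire(current_key, window * 2)  # 2x window TTL for auto-cleanup
    pipe.get(previous_key)
    current, _, previous = pipe.execute()

    previous = int(previous or 0)
    estimated = current + previous * (1 - elapsed_fraction)
    return estimated <= limit
```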
Sliding Window Log (Exact)
For exact sliding window enforcement: maintain a sorted set (Redis ZSET) per consumer, with timestamps as scores. On each request: ZADD the current timestamp; ZREMRANGEBYSCORE to remove entries older than the window; ZCARD to count remaining; compare against limit. This is O(log N) per request and O(N) memory (N = requests per window). Suitable for low-volume, high-accuracy use cases (e.g., login attempt limiting). For high-volume endpoints, the approximate sliding window counter is preferred.
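The ZSET recipe sketched with redis-py (the transaction pipeline wraps the trim/add/count in MULTI/EXEC; the uuid suffix is an added detail so concurrent requests with identical timestamps remain distinct members):

```python
import time
import uuid
import redis

r = redis.Redis()

def sliding_log_allow(consumer_id: str, limit: int, window: int = 60) -> bool:
    """Exact sliding window: one ZSET member per request, scored by
    its timestamp; count whatever falls inside the trailing window."""
    key = f"rl:log:{consumer_id}"
    now = time.time()

    pipe = r.pipeline()  # transaction=True: executed as MULTI/EXEC
    pipe.zremrangebyscore(key, 0, now - window)     # drop expired entries
    pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    pipe.zcard(key)                                 # requests in the window
    pipe.expire(key, window)
    _, _, count, _ = pipe.execute()
    return count <= limit
```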
Database Design
Redis is the primary store for all rate limit state. Data structures by algorithm: the token bucket uses Redis hashes (HGETALL/HSET in a Lua script; HMSET is deprecated in modern Redis); the sliding window counter uses Redis strings (INCR/EXPIRE); the sliding window log uses Redis sorted sets. All keys include consumer_id and a time bucket: rl:token:{consumer_id}, rl:window:{consumer_id}:{minute_epoch}. Key TTLs are set to 2× the window to ensure automatic cleanup of expired counters without a separate GC process.
Limit configurations (which consumer gets which limit, which endpoints have which policies) are stored in PostgreSQL and cached in each API server's memory with a 60-second TTL. Changes propagate within 60 seconds — acceptable for limit configuration changes, which are infrequent.
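A minimal sketch of the per-server config cache with a 60-second TTL (the loader that queries PostgreSQL is a placeholder; eviction is omitted):

```python
import time

class LimitConfigCache:
    """Per-server cache of consumer limit configs. Entries expire after
    ttl seconds, so PostgreSQL changes propagate within one TTL."""

    def __init__(self, loader, ttl: int = 60):
        self.loader = loader   # placeholder: e.g. SELECT from the limits table
        self.ttl = ttl
        self.entries = {}      # consumer_id -> (expires_at, config)

    def get(self, consumer_id: str):
        entry = self.entries.get(consumer_id)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        config = self.loader(consumer_id)
        self.entries[consumer_id] = (time.monotonic() + self.ttl, config)
        return config
```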
API Design
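Per the functional requirements, limit state is surfaced to clients through response headers: each response carries X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset, and a throttled request receives HTTP 429 with a Retry-After header. A representative throttled response (all values illustrative):

```
HTTP/1.1 429 Too Many Requests
Retry-After: 23
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1717430460
```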
Scaling & Bottlenecks
Redis becomes the bottleneck at extreme scale. Sharding by consumer_id distributes load — use consistent hashing to map consumer_id to a Redis shard, with virtual nodes to handle shard rebalancing when adding capacity. Redis Cluster provides automatic sharding but adds latency for cross-slot operations; since each rate limit key is independent, Redis Cluster works well here.
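A minimal consistent-hash ring with virtual nodes, sketching how consumer_id maps to a shard (the shard names, vnode count, and hash function are placeholder choices):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Each shard appears at vnodes points on the ring; a key goes to
    the first point clockwise from its hash. Adding a shard remaps
    only ~1/N of the keyspace."""

    def __init__(self, shards, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def shard_for(self, consumer_id: str) -> str:
        idx = bisect.bisect(self.points, self._hash(consumer_id)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["redis-1", "redis-2", "redis-3"])
print(ring.shard_for("api-key-42"))  # deterministic shard for this consumer
```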
Network round trips to Redis (0.5-1ms each) dominate rate limiter latency. A Lua script collapses the client-side read-modify-write sequence (GET, compute, SET: two round trips plus a race window) into a single atomic execution on Redis. Pipelining multiple rate limit checks over a single Redis connection further reduces overhead for batch checking.
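And a sketch of batched checking over one connection (bare INCRs stand in for the per-algorithm Lua scripts above):

```python
import redis

r = redis.Redis()

def batch_check(keys: list[str]) -> list[int]:
    """Queue many counter increments and flush them in one round trip."""
    pipe = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC
    for key in keys:
        pipe.incr(key)
    return pipe.execute()  # single network round trip for the whole batch
```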
Key Trade-offs
- Token bucket vs. fixed window: Token bucket handles bursts gracefully and is fairer to bursty-but-low-average consumers; fixed window is simpler to reason about and implement but has boundary vulnerability
- Distributed vs. local enforcement: Local enforcement is fast (zero network) but allows overage proportional to (cluster size × sync interval); distributed is accurate but adds Redis latency to every request
- Fail open vs. fail closed on Redis downtime: Fail open (allow all requests when Redis is down) prevents cascading failures but enables abuse; fail closed (reject all when Redis is down) is safer but causes service disruption
- Per-endpoint vs. global limits: Per-endpoint limits prevent a single expensive endpoint from consuming all quota; global limits are simpler; most production systems use both with the stricter limit applying