System Design: API Gateway

Design a production-grade API gateway that handles authentication, rate limiting, request routing, protocol translation, and observability for microservices architectures at scale.

17 min read · Updated Jan 15, 2025
Tags: system-design, api-gateway, microservices, routing, developer-tools

Requirements

Functional Requirements:

  • Route incoming HTTP/gRPC requests to appropriate backend microservices
  • Authenticate requests via JWT, API keys, OAuth2, and mTLS
  • Enforce rate limiting per consumer, per endpoint, and globally
  • Transform requests and responses: header manipulation, payload transformation, protocol translation (HTTP→gRPC)
  • Aggregate multiple backend calls into a single client response (request aggregation/BFF pattern)
  • Provide a developer portal: API documentation, key management, usage analytics

Non-Functional Requirements:

  • P99 latency overhead under 5ms (gateway adds minimal latency to each request)
  • 99.999% availability — the gateway is the single ingress point; its downtime = full outage
  • Handle 1 million requests/second throughput
  • Zero-downtime configuration updates: new routes, auth policies apply without restarting

Scale Estimation

1M requests/sec across a gateway cluster. Each request requires: auth token validation (~0.5ms with cached public key), rate limit check (~0.3ms Redis lookup), routing table lookup (~0.1ms in-memory), and the upstream proxy hop (~2-4ms network). Total gateway overhead: roughly 3-5ms, dominated by the upstream network hop. Gateway nodes: at 50,000 RPS per node (event-driven, non-blocking workers, as in Nginx or Envoy), 1M RPS needs 20 nodes; doubling for headroom gives 40 nodes. Configuration store: ~10,000 routes × 1 KB = 10 MB, which trivially fits in memory on every node. Rate limit counters: 100,000 active consumers × 8 bytes = 800 KB in Redis.
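
As a quick sanity check, the arithmetic above fits in a few lines of Go; every constant is an assumption from this estimate, not a measured value.

```go
package main

import "fmt"

func main() {
	const (
		targetRPS    = 1_000_000 // peak requests/sec
		perNodeRPS   = 50_000    // assumed capacity of one event-driven node
		headroom     = 2         // 2x headroom for failover and spikes
		routes       = 10_000    // routing table entries
		routeBytes   = 1 << 10   // ~1 KB per route entry
		consumers    = 100_000   // active rate-limited consumers
		counterBytes = 8         // one 64-bit counter per consumer
	)

	nodes := targetRPS / perNodeRPS * headroom
	fmt.Printf("gateway nodes: %d\n", nodes)                                // 40
	fmt.Printf("routing table: %d KB\n", routes*routeBytes/1024)            // 10,000 KB ≈ 10 MB
	fmt.Printf("rate-limit counters: %d KB\n", consumers*counterBytes/1024) // ~781 KB
}
```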

High-Level Architecture

The gateway is a reverse proxy cluster sitting between the internet and internal microservices. All external traffic enters through the gateway — no service is directly internet-accessible. The gateway is deployed across multiple availability zones behind a cloud load balancer (AWS ALB/NLB) for HA. Each gateway node is stateless; shared state (rate limit counters, session tokens) lives in Redis.

Configuration is stored in a control plane database (PostgreSQL or etcd). A configuration reconciler watches for changes and pushes updates to all gateway nodes via a pub/sub channel — gateway nodes reload routing tables and policies in-place without dropping connections. This is the same model used by Kong (which wraps Nginx) and Envoy-based gateways (Istio, Ambassador).
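
A minimal sketch of the data-plane side of this reload model, assuming the pub/sub subscriber hands each node a full JSON config snapshot; the types and field names are illustrative, not an actual Kong or Envoy schema.

```go
package gateway

import (
	"encoding/json"
	"sync/atomic"
)

// Config is the routing/policy snapshot a node serves from.
type Config struct {
	Version int              `json:"version"`
	Routes  map[string]Route `json:"routes"`
}

type Route struct {
	Upstream string `json:"upstream"`
}

// current holds the active snapshot; request handlers load it
// lock-free, so a reload never blocks in-flight traffic.
var current atomic.Pointer[Config]

// ApplyUpdate is invoked by the pub/sub subscriber whenever the
// control plane publishes a new snapshot. The new config is fully
// parsed and validated before the single atomic swap.
func ApplyUpdate(payload []byte) error {
	next := new(Config)
	if err := json.Unmarshal(payload, next); err != nil {
		return err // reject malformed config; keep serving the old one
	}
	current.Store(next)
	return nil
}

// ActiveConfig is called on the request path.
func ActiveConfig() *Config { return current.Load() }
```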

For high-throughput scenarios, the gateway uses async I/O (event-driven, non-blocking): a small pool of worker threads handles thousands of concurrent connections. Each node maintains connection pools to the backend services, with configurable pool sizes and health checking.
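
In Go, the per-node pooling could look like the sketch below; the pool sizes and timeouts are illustrative assumptions, not tuned values.

```go
package gateway

import (
	"net"
	"net/http"
	"time"
)

// NewUpstreamClient returns an HTTP client that keeps a pool of warm
// connections per backend, so the proxy path avoids a TCP+TLS
// handshake on every request.
func NewUpstreamClient() *http.Client {
	transport := &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   1 * time.Second, // fail fast on unreachable backends
			KeepAlive: 30 * time.Second,
		}).DialContext,
		MaxIdleConns:        10_000, // total idle connections across all backends
		MaxIdleConnsPerHost: 512,    // per-backend pool size
		IdleConnTimeout:     90 * time.Second,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   5 * time.Second, // end-to-end upstream budget
	}
}
```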

Core Components

Auth Middleware

The auth middleware runs as the first plugin in the request pipeline. For JWT validation: the gateway caches the upstream identity provider's JWKS (public keys), validates the token signature and claims (exp, iss, aud) locally without a network call. For API key auth: keys are hashed (SHA-256) and stored in Redis with associated consumer metadata; a single Redis GET validates the key and returns rate limit policy. For mTLS: the gateway terminates TLS, extracts the client certificate's CN/SAN, and maps it to an internal consumer identity. Auth failures return 401 immediately, short-circuiting all downstream processing.
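
A sketch of how this middleware might short-circuit in Go; validateJWT and lookupAPIKey are hypothetical stand-ins for the JWKS-cached signature check and the Redis key lookup described above.

```go
package gateway

import (
	"context"
	"net/http"
	"strings"
)

// Consumer is the identity attached to the request after auth.
type Consumer struct{ ID string }

type ctxKey struct{}

// AuthMiddleware runs first in the plugin chain: it authenticates the
// request and returns 401 before any routing or rate limiting happens.
func AuthMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var c *Consumer
		switch {
		case strings.HasPrefix(r.Header.Get("Authorization"), "Bearer "):
			tok := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
			c = validateJWT(tok) // verify signature + exp/iss/aud against cached JWKS
		case r.Header.Get("X-Api-Key") != "":
			c = lookupAPIKey(r.Header.Get("X-Api-Key")) // SHA-256 hash, single Redis GET
		}
		if c == nil {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return // short-circuit: downstream plugins never run
		}
		// Attach the consumer identity for downstream plugins (rate limiter, router).
		next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), ctxKey{}, c)))
	})
}

// Hypothetical stubs; real implementations would sit here.
func validateJWT(token string) *Consumer { return nil }
func lookupAPIKey(key string) *Consumer  { return nil }
```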

Rate Limiter

Implements a sliding window counter using Redis INCR + EXPIRE. For each consumer + endpoint combination, the gateway increments a counter keyed by consumer_id:endpoint:window_timestamp. If the counter exceeds the limit, the request is rejected with 429 and a Retry-After header. The sliding window is approximated using two fixed windows (current and previous) weighted by the elapsed time fraction — this reduces memory from O(requests) to O(1) per consumer while closely approximating true sliding window behavior. Local token bucket caches in each gateway node absorb burst traffic without hitting Redis, with periodic synchronization to the shared counter.
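
A sketch of the two-window approximation, assuming the github.com/redis/go-redis/v9 client; the key layout mirrors the consumer_id:endpoint:window_timestamp scheme above, and the local token bucket layer is omitted.

```go
package ratelimit

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// Allow estimates the sliding-window count as the current fixed
// window's count plus the previous window's count weighted by how
// much of it still overlaps the sliding window.
func Allow(ctx context.Context, rdb *redis.Client, consumer, endpoint string,
	limit int64, window time.Duration) (bool, error) {

	now := time.Now()
	curWin := now.Truncate(window).Unix()
	prevWin := curWin - int64(window.Seconds())

	curKey := fmt.Sprintf("%s:%s:%d", consumer, endpoint, curWin)
	prevKey := fmt.Sprintf("%s:%s:%d", consumer, endpoint, prevWin)

	// Count this request in the current fixed window.
	cur, err := rdb.Incr(ctx, curKey).Result()
	if err != nil {
		return false, err
	}
	// Keep the key alive long enough to serve as "previous" next window.
	rdb.Expire(ctx, curKey, 2*window)

	prev, err := rdb.Get(ctx, prevKey).Int64()
	if err != nil && !errors.Is(err, redis.Nil) {
		return false, err
	}

	// Fraction of the sliding window still covered by the previous window.
	elapsed := now.Sub(now.Truncate(window))
	weight := 1 - elapsed.Seconds()/window.Seconds()

	estimated := float64(cur) + weight*float64(prev)
	return estimated <= float64(limit), nil
}
```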

Request Router

The routing table maps (host, path prefix, method) tuples to upstream service clusters. Matching priority: exact path > prefix match > regex. The routing table is an in-memory trie indexed by path segments for O(depth) lookup regardless of table size. Each route entry includes: upstream cluster (a list of backend endpoints), load balancing policy (round-robin, least-connections, weighted), timeout, retry policy (max attempts, retry conditions like 503/timeout), and circuit breaker thresholds. Route updates are applied atomically — the old trie is swapped for the new one after construction, so in-flight requests always use a consistent view.
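
A simplified version of the trie lookup, showing the exact-beats-prefix priority; regex routes, method matching, and the per-route policies are omitted. Combined with an atomic pointer swap (as in the config-reload sketch earlier), updates never expose a half-built trie.

```go
package gateway

import "strings"

// node is one path segment in the routing trie; lookup cost is
// O(path depth), independent of the number of routes.
type node struct {
	children map[string]*node
	route    *RouteEntry // set if a route terminates exactly here
	wildcard *RouteEntry // set if a prefix route matches everything below
}

type RouteEntry struct {
	Upstream string // upstream cluster name; timeouts/retries omitted for brevity
}

// Match walks the trie segment by segment, remembering the deepest
// prefix match seen and preferring an exact terminal match over it.
func (root *node) Match(path string) *RouteEntry {
	best := root.wildcard
	cur := root
	for _, seg := range strings.Split(strings.Trim(path, "/"), "/") {
		next, ok := cur.children[seg]
		if !ok {
			return best // no deeper match; deepest prefix wins
		}
		cur = next
		if cur.wildcard != nil {
			best = cur.wildcard
		}
	}
	if cur.route != nil {
		return cur.route // exact match beats any prefix
	}
	return best
}
```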

Database Design

PostgreSQL stores the gateway configuration:

  • routes (id, host, path, methods, upstream_cluster_id, plugins JSONB)
  • consumers (id, username, auth_credentials JSONB)
  • plugins (id, route_id, plugin_name, config JSONB)
  • upstreams (id, name, targets JSONB, health_check_config JSONB)

The JSONB plugin config supports arbitrary per-route plugin configuration without schema changes.

Request logs are written asynchronously to a log aggregation pipeline (Kafka → ClickHouse). Each log entry captures: timestamp, consumer_id, route_id, upstream_latency_ms, status_code, request_size, response_size. ClickHouse's columnar storage enables fast aggregations for the developer analytics portal (p50/p99 latency per API, error rates, top consumers).
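
A sketch of the fire-and-forget log path, assuming the segmentio/kafka-go client; the struct fields mirror the log entry above, and the bounded buffer deliberately drops entries rather than adding latency to the proxy path.

```go
package gateway

import (
	"context"
	"encoding/json"
	"time"

	"github.com/segmentio/kafka-go"
)

// AccessLog mirrors the fields aggregated in ClickHouse.
type AccessLog struct {
	Timestamp         time.Time `json:"timestamp"`
	ConsumerID        string    `json:"consumer_id"`
	RouteID           string    `json:"route_id"`
	UpstreamLatencyMS int64     `json:"upstream_latency_ms"`
	StatusCode        int       `json:"status_code"`
	RequestSize       int64     `json:"request_size"`
	ResponseSize      int64     `json:"response_size"`
}

var logCh = make(chan AccessLog, 65536) // bounded buffer; drop on overflow

// Emit is called on the request path; it never blocks the proxy.
func Emit(l AccessLog) {
	select {
	case logCh <- l:
	default: // buffer full: drop the entry rather than add latency
	}
}

// pump drains the buffer into Kafka in the background, e.g. with
// w := &kafka.Writer{Addr: kafka.TCP("kafka:9092"), Topic: "gateway-access-logs"}.
func pump(w *kafka.Writer) {
	for l := range logCh {
		b, _ := json.Marshal(l)
		_ = w.WriteMessages(context.Background(), kafka.Message{Value: b})
	}
}
```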

API Design

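The control plane exposes a RESTful admin API over the configuration entities above (routes, consumers, plugins, upstreams): standard CRUD operations that write to PostgreSQL and then fan out through the reconciler's pub/sub channel, so a route created via the API is live on every gateway node moments later. This is the same shape as Kong's Admin API. The data plane itself exposes no management surface, typically only proxied traffic plus health and metrics endpoints.

A sketch of creating a route through such an API; the endpoint path, host name, and request fields are illustrative assumptions mirroring the routes table.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// RouteSpec is an illustrative request body for a hypothetical
// POST /admin/routes endpoint; fields mirror the routes table.
type RouteSpec struct {
	Host            string   `json:"host"`
	Path            string   `json:"path"`
	Methods         []string `json:"methods"`
	UpstreamCluster string   `json:"upstream_cluster_id"`
}

func main() {
	spec := RouteSpec{
		Host:            "api.example.com",
		Path:            "/v1/orders",
		Methods:         []string{"GET", "POST"},
		UpstreamCluster: "orders-svc",
	}
	body, _ := json.Marshal(spec)

	// The write lands in PostgreSQL; the reconciler then fans the new
	// route out to every gateway node via pub/sub.
	resp, err := http.Post("http://gateway-admin.internal/admin/routes",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect 201 Created
}
```
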
Scaling & Bottlenecks

The Redis rate limiter becomes a bottleneck at extreme scale (>500K RPS). Mitigations: shard Redis by consumer_id hash across multiple Redis instances; use local token bucket approximations with periodic Redis sync (trading strict accuracy for throughput); or run a Lua script in Redis to perform an atomic increment-and-check in a single round trip, halving the per-request Redis calls.
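
An atomic increment-and-check via a Lua script, sketched with go-redis; the script and key layout are illustrative.

```go
package ratelimit

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// incrCheck performs the increment and the limit comparison inside
// Redis, so one round trip replaces the separate INCR + GET pair.
var incrCheck = redis.NewScript(`
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
return count <= tonumber(ARGV[1]) and 1 or 0
`)

// Allowed returns true if the consumer is still under its limit.
func Allowed(ctx context.Context, rdb *redis.Client, key string,
	limit, windowSec int) (bool, error) {

	ok, err := incrCheck.Run(ctx, rdb, []string{key}, limit, windowSec).Int()
	if err != nil {
		return false, err
	}
	return ok == 1, nil
}
```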

TLS termination is CPU-intensive. At 1M RPS with TLS 1.3, session resumption (PSK-based, optionally with 0-RTT) skips the full handshake for returning clients. Hardware SSL offload (a dedicated TLS termination layer, or crypto-accelerated NICs) keeps TLS from bottlenecking the gateway workers.
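
For a Go-based gateway node, the resumption side might look like the sketch below; note that Go's crypto/tls does not implement 0-RTT early data, so that particular optimization assumes a TLS stack that supports it (e.g., Nginx or Envoy builds on OpenSSL/BoringSSL).

```go
package gateway

import "crypto/tls"

// ServerTLSConfig enables TLS 1.3 with session tickets, so returning
// clients resume with an abbreviated handshake instead of a full one.
func ServerTLSConfig(cert tls.Certificate) *tls.Config {
	return &tls.Config{
		MinVersion:   tls.VersionTLS13,
		Certificates: []tls.Certificate{cert},
		// Session tickets are on by default in crypto/tls; rotate the
		// ticket keys on long-running nodes (SetSessionTicketKeys).
	}
}
```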

Key Trade-offs

  • Centralized vs. sidecar gateway: Centralized gateway is simpler to operate but is a single failure domain; sidecar (service mesh) pushes routing to each service, eliminating the single ingress bottleneck but multiplying operational surface area
  • Synchronous auth vs. async auth: Inline JWT validation adds <1ms but requires caching JWKS; async auth (callback to identity service) is flexible but adds 5-20ms per request
  • Strict vs. approximate rate limiting: Exact counting requires distributed coordination (Redis); approximate local counting (token bucket per node) is faster but allows brief overages during traffic spikes
  • Eager vs. lazy config propagation: Pushing config to all nodes immediately ensures consistency but increases control plane load; lazy pull-on-change reduces load but means nodes may briefly serve stale config
