
System Design: API Key Management

Design a secure API key management system that issues, validates, rotates, and revokes API keys for developer platforms. Covers key hashing, rate limiting per key, quota enforcement, key scoping, and audit trails for enterprise compliance.

12 min read · Updated Jan 15, 2025
Tags: system-design, api-keys, authentication, rate-limiting, security

Requirements

Functional Requirements:

  • Issue API keys with configurable scopes (read-only, write, admin), expiry dates, and rate limits
  • Validate API keys on every API request within the authentication middleware
  • Support key rotation: issue a replacement key while the old key remains valid during a transition period
  • Revoke keys immediately: a revoked key must be rejected within 1 second across all API servers
  • Maintain an audit log of key creation, usage, and revocation events
  • Support workspace/organization-level key management: keys belong to orgs, not individuals

Non-Functional Requirements:

  • Key validation latency under 2ms (in the hot path of every API request)
  • Support 500,000 API requests per second across all API servers
  • Store key metadata for 7 years after revocation for compliance auditing
  • Keys must never be stored in plaintext; only HMAC-SHA256 hashes stored in the database
  • 99.999% availability for key validation

Scale Estimation

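500,000 API requests/second, each requiring one key validation, so the validation layer must sustain 500,000 lookups/second inside the 2 ms budget (round-trip fetch plus comparison). If the per-key rate-limit increment is counted as a second Redis operation, the total rises to as much as 1 million Redis ops/second. Assuming each Redis node handles about 100,000 ops/second, a 5-node cluster only just matches the raw lookup rate and leaves no headroom; plan for roughly twice the expected load (10 or more primaries), or rely on the local cache described under Scaling & Bottlenecks to absorb most lookups. Active keys: 10 million across all customers. Redis memory: 10 million keys × 200 bytes per key record = 2 GB, which fits comfortably in a Redis Cluster.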
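The sizing arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the request rate, per-node throughput, key count, and record size come from the text, while the `headroom` multiplier and the function name are assumptions added for illustration.

```python
import math

REQUESTS_PER_SEC = 500_000   # from the requirements
OPS_PER_NODE = 100_000       # per-node Redis throughput assumed in the text
ACTIVE_KEYS = 10_000_000
BYTES_PER_RECORD = 200


def redis_nodes_needed(ops_per_request: int, headroom: float = 1.0) -> int:
    """Primaries needed if every request issues `ops_per_request` Redis ops.

    `headroom` is a hypothetical safety multiplier (1.0 = no spare capacity).
    """
    total_ops = REQUESTS_PER_SEC * ops_per_request * headroom
    return math.ceil(total_ops / OPS_PER_NODE)


# Memory for the key records alone, ignoring Redis per-key overhead.
memory_gb = ACTIVE_KEYS * BYTES_PER_RECORD / 1e9

print(redis_nodes_needed(1))                # lookup only, no headroom
print(redis_nodes_needed(2))                # lookup + rate-limit increment
print(redis_nodes_needed(1, headroom=2.0))  # lookup only, 2x headroom
print(memory_gb)
```

Changing `ops_per_request` or `headroom` shows how quickly the node count moves once the rate-limit increment or spare capacity is included.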

High-Level Architecture

API keys are structured as {prefix}_{random_payload}. The prefix (e.g., sk_live) is a human-readable discriminator that identifies the key type and environment. The random_payload is 32 bytes of CSPRNG entropy, base58-encoded to 43 or 44 characters (base58 output length varies slightly with the value being encoded). The full key (prefix + payload) is shown to the user only once at creation; only HMAC-SHA256(key, server_secret) is stored in the database.
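A minimal sketch of the key format and hashing described above, using only the Python standard library. The function names and the default `sk_live` prefix are illustrative assumptions, and the base58 alphabet is the common Bitcoin-style one, which the text does not specify.

```python
import hashlib
import hmac
import secrets

# Bitcoin-style base58 alphabet (assumed; omits 0, O, I and l).
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes as base58, mapping each leading zero byte to '1'."""
    n = int.from_bytes(data, "big")
    chars = []
    while n > 0:
        n, rem = divmod(n, 58)
        chars.append(B58_ALPHABET[rem])
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + "".join(reversed(chars))


def generate_api_key(prefix: str = "sk_live") -> str:
    """Return a new key: prefix, underscore, base58 of 32 CSPRNG bytes."""
    return f"{prefix}_{base58_encode(secrets.token_bytes(32))}"


def hash_api_key(key: str, server_secret: bytes) -> str:
    """HMAC-SHA256 of the key under the server secret, hex-encoded (64 chars)."""
    return hmac.new(server_secret, key.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = generate_api_key()
    print(key)                                   # shown to the user once
    print(hash_api_key(key, b"demo-secret"))     # the only value persisted
```

The hex digest is 64 characters long, which matches a VARCHAR(64) hash column.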

On every API request, the authentication middleware extracts the key from the Authorization: Bearer header (or X-API-Key header), computes HMAC-SHA256(presented_key, server_secret), and looks up the hash in Redis. The Redis value contains: org_id, key_id, scopes, rate_limit_rpm, expiry_timestamp, revocation_flag. If the hash is not found or the key is expired/revoked, the request is rejected with 401. If valid, the org_id and scopes are injected into the request context for downstream authorization checks.
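The validation path above might look roughly like the following sketch. A plain dict stands in for Redis, and the function names, the `revoked` field name, and the epoch-seconds representation of the expiry are assumptions made for illustration.

```python
import hashlib
import hmac
import json
import time
from typing import Optional, Tuple


def hash_key(key: str, server_secret: bytes) -> str:
    return hmac.new(server_secret, key.encode("utf-8"), hashlib.sha256).hexdigest()


def extract_key(headers: dict) -> Optional[str]:
    """Read the key from 'Authorization: Bearer ...' or 'X-API-Key'.

    Header names are matched exactly here; real frameworks normalize case.
    """
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):].strip() or None
    return headers.get("X-API-Key") or None


def validate_request(headers: dict, store: dict, server_secret: bytes,
                     now: Optional[float] = None) -> Tuple[int, Optional[dict]]:
    """Return (401, None) on failure, or (200, context) for a valid key."""
    key = extract_key(headers)
    if key is None:
        return 401, None
    raw = store.get(f"apikey:{hash_key(key, server_secret)}")  # Redis GET stand-in
    if raw is None:
        return 401, None
    record = json.loads(raw)
    now = time.time() if now is None else now
    if record.get("revoked") or record["expires_at"] <= now:
        return 401, None
    # Context handed to downstream authorization checks.
    context = {k: record[k] for k in ("org_id", "key_id", "scopes")}
    return 200, context
```

Because the lookup is by the HMAC of the presented key, no comparison against a stored plaintext key is ever needed.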

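Rate limiting is enforced per key using a fixed-window counter in Redis (one counter per key per minute). A Lua script atomically increments the per-key counter (setting a 60-second TTL when the counter is first created, for per-minute limits) and compares it against the key's rate_limit_rpm. If the counter exceeds the limit, the script signals rejection and the middleware responds with 429 and a Retry-After header. Running the increment and comparison inside one Lua script makes them atomic, which prevents race conditions between API servers. A fixed window can admit up to twice the limit across a window boundary; a true sliding window would need a sliding-window log or a weighted two-window counter.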
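The counter described above (one counter per key per minute, expiring after 60 seconds) can be sketched as follows. The Lua text is an illustrative script built only from the standard INCR, EXPIRE, and TTL commands and is not taken from the source; the Python class is an in-memory stand-in with the same per-minute window logic so the behavior can be exercised without a Redis server. All names are assumptions.

```python
import time

# Illustrative Redis script: KEYS[1] = counter key, ARGV[1] = limit, ARGV[2] = TTL.
# Returns {allowed, retry_after_seconds}.
RATE_LIMIT_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
  return {0, redis.call('TTL', KEYS[1])}
end
return {1, 0}
"""


class InMemoryWindowLimiter:
    """Single-process equivalent of the per-minute counter, for illustration."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._counters = {}  # (key_id, window_minute) -> request count

    def allow(self, key_id: str, limit_rpm: int):
        """Return (allowed, retry_after_seconds) for one request."""
        now = self._clock()
        bucket = (key_id, int(now // 60))
        count = self._counters.get(bucket, 0) + 1
        self._counters[bucket] = count
        if count > limit_rpm:
            # Seconds until the next minute boundary; used for Retry-After.
            return False, 60 - int(now % 60)
        return True, 0
        # Old buckets are never pruned here; in Redis the TTL removes them.


if __name__ == "__main__":
    limiter = InMemoryWindowLimiter()
    for _ in range(3):
        print(limiter.allow("key_123", limit_rpm=2))
```

In a single process the dict update is trivially atomic; the Lua script exists to give the same guarantee when many API servers share one counter.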

Core Components

Key Issuance Service

Key generation: prefix + base58(CSPRNG(32 bytes)). The full key is returned to the client once and never stored. The server stores: (key_id UUID, org_id, key_hash = HMAC-SHA256(key, secret), prefix, name, scopes TEXT[], rate_limit_rpm INT, expires_at TIMESTAMP, created_at, created_by, revoked_at). The HMAC secret is protected by AWS KMS: it is kept encrypted under a KMS key (for example, as a KMS data key) and decrypted only inside the key service, so it never sits in plaintext on disk or in configuration. The decrypted secret still has to be held in process memory in order to hash at request rate. A key derivation function (HKDF) can derive separate HMAC subkeys from the master secret (for example, one per environment or key prefix), limiting the blast radius of any single subkey compromise.
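The HKDF step mentioned above can be sketched with the standard library alone. This follows the extract-then-expand construction of RFC 5869 with SHA-256; the function name and the use of a context label such as `b"sk_live"` as the `info` input are assumptions for illustration, and the master secret would come from KMS rather than a literal.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(master_secret: bytes, info: bytes, length: int = 32,
                salt: bytes = b"") -> bytes:
    """Derive `length` bytes from `master_secret`, bound to the label `info`."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output is too long for HKDF-SHA256")
    # Extract: condense the input secret into a pseudorandom key.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, master_secret, hashlib.sha256).digest()
    # Expand: chain HMAC blocks, each bound to the label and a counter byte.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


if __name__ == "__main__":
    master = b"demo-master-secret"          # placeholder, not a real secret
    live = hkdf_sha256(master, b"sk_live")  # hypothetical context labels
    test = hkdf_sha256(master, b"sk_test")
    print(live.hex())
    print(test.hex())
```

Each label yields an independent subkey, so leaking one derived secret does not reveal the master secret or the other subkeys.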

Revocation Propagation

When a key is revoked, the API key service: (1) sets revoked_at in PostgreSQL, (2) deletes the key hash from Redis immediately, (3) publishes a key_revoked event to Kafka for durable consumers such as the audit log, and (4) publishes the same event on a Redis Pub/Sub channel. All API servers subscribe to that channel and remove the key from their local LRU cache (if any). The combination of Redis deletion and Pub/Sub means revocation typically reaches all API servers within 100 ms, well within the 1-second SLA.
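The revocation sequence above can be sketched with in-process stand-ins: dicts for PostgreSQL and Redis, and a small callback bus in place of the Pub/Sub channel. Every class and function name here is hypothetical; the point is only the ordering of the steps and the fan-out to each server's local cache.

```python
class RevocationBus:
    """In-process stand-in for the revocation Pub/Sub channel."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key_hash: str):
        for callback in self._subscribers:
            callback(key_hash)


class ApiServer:
    """Holds a local cache and evicts entries when a revocation arrives."""

    def __init__(self, bus: RevocationBus):
        self.local_cache = {}
        bus.subscribe(self.on_revoked)

    def on_revoked(self, key_hash: str):
        self.local_cache.pop(key_hash, None)


def revoke_key(key_hash: str, db: dict, shared_store: dict,
               bus: RevocationBus, now: float) -> None:
    db[key_hash]["revoked_at"] = now    # durable record of the revocation
    shared_store.pop(key_hash, None)    # shared-cache delete (Redis stand-in)
    bus.publish(key_hash)               # tell every API server to evict


if __name__ == "__main__":
    bus = RevocationBus()
    servers = [ApiServer(bus), ApiServer(bus)]
    db = {"h1": {"revoked_at": None}}
    shared = {"h1": {"org_id": "org1"}}
    for s in servers:
        s.local_cache["h1"] = shared["h1"]
    revoke_key("h1", db, shared, bus, now=1_700_000_000.0)
    print([s.local_cache for s in servers])
```

Writing the durable record first means a crash between steps leaves the key revoked in the source of truth, even if a cache entry briefly survives.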

Key Rotation Workflow

Key rotation is a multi-step process: (1) the user creates a new key via the management UI or the rotate endpoint, (2) the system marks the new key as PENDING with the same scopes as the old key (a PENDING key must already authenticate requests, so that step 3 can be deployed and tested), (3) the user updates their application to use the new key, (4) the user confirms rotation is complete, (5) the system transitions the old key to ROTATING status (still valid but flagged), (6) after a configurable grace period (24–72 hours), the old key is automatically revoked. Both keys remain valid until the grace period ends, minimizing downtime risk.
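The rotation steps above can be modeled as a small state machine. This is a sketch under stated assumptions: the `Key` dataclass, the function names, and the choice to promote the new key to ACTIVE at confirmation are illustrative, and the 24 to 72 hour bounds are taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

VALID_STATUSES = {"PENDING", "ACTIVE", "ROTATING"}  # all of these authenticate


@dataclass
class Key:
    key_id: str
    status: str
    scopes: tuple
    rotation_parent_key_id: Optional[str] = None
    revoke_after: Optional[datetime] = None


def start_rotation(old: Key, new_key_id: str) -> Key:
    """Steps 1-2: issue a PENDING replacement with the old key's scopes."""
    return Key(new_key_id, "PENDING", old.scopes, rotation_parent_key_id=old.key_id)


def confirm_rotation(old: Key, new: Key, now: datetime,
                     grace: timedelta = timedelta(hours=24)) -> None:
    """Steps 4-5: promote the new key and start the old key's grace period."""
    if not timedelta(hours=24) <= grace <= timedelta(hours=72):
        raise ValueError("grace period must be between 24 and 72 hours")
    new.status = "ACTIVE"
    old.status = "ROTATING"
    old.revoke_after = now + grace


def expire_if_due(old: Key, now: datetime) -> None:
    """Step 6: revoke the old key once its grace period has elapsed."""
    if old.status == "ROTATING" and old.revoke_after is not None and now >= old.revoke_after:
        old.status = "REVOKED"


def is_valid(key: Key) -> bool:
    return key.status in VALID_STATUSES
```

A periodic job would call `expire_if_due` for every key in ROTATING status; the function is idempotent, so running it more often than needed is harmless.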

Database Design

PostgreSQL:

  • api_keys (key_id UUID PK, org_id UUID, key_hash VARCHAR(64), prefix VARCHAR(20), name VARCHAR, scopes TEXT[], rate_limit_rpm INT, expires_at TIMESTAMP, status ENUM(PENDING, ACTIVE, ROTATING, REVOKED), created_at, created_by, revoked_at, revoked_by, rotation_parent_key_id UUID)
  • api_key_audit_log (event_id, key_id, org_id, event_type ENUM(CREATED, VALIDATED, ROTATED, REVOKED), ip_address, user_agent, request_id, occurred_at)

Redis:

  • apikey:{key_hash} → JSON of org_id, key_id, scopes, rate_limit_rpm, expires_at
  • ratelimit:{key_id}:{window_minute} → counter
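The Redis key layout above can be illustrated with a few helpers. The function names, the inclusion of key_id in the cached record, and the epoch-seconds representation of expires_at are assumptions; the key patterns themselves come from the text.

```python
import json


def record_key(key_hash: str) -> str:
    """Redis key under which a key's cached metadata is stored."""
    return f"apikey:{key_hash}"


def ratelimit_key(key_id: str, epoch_seconds: float) -> str:
    """Redis key for the rate-limit counter of the minute containing the timestamp."""
    window_minute = int(epoch_seconds // 60)
    return f"ratelimit:{key_id}:{window_minute}"


def record_value(row: dict) -> str:
    """Serialize only the fields needed on the hot path."""
    fields = ("org_id", "key_id", "scopes", "rate_limit_rpm", "expires_at")
    return json.dumps({f: row[f] for f in fields}, sort_keys=True)


if __name__ == "__main__":
    row = {
        "org_id": "org_1", "key_id": "key_1", "scopes": ["read"],
        "rate_limit_rpm": 600, "expires_at": 1_900_000_000,
        "name": "ci-bot",  # stays in PostgreSQL only
    }
    print(record_key("ab12"), record_value(row))
    print(ratelimit_key("key_1", 125))
```

Keeping management-only columns such as name and created_by out of the cached value is what keeps the per-key record small.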

API Design

  • POST /api-keys: Create a new API key with name, scopes, expiry, and rate limit; returns the full key once.
  • DELETE /api-keys/{key_id}: Immediately revoke an API key.
  • POST /api-keys/{key_id}/rotate: Initiate key rotation; returns a new key while keeping the old one valid during the grace period.
  • GET /api-keys/{key_id}/usage: Return usage statistics: requests in last 24h, rate limit hits, unique IPs.

Scaling & Bottlenecks

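At 500,000 RPS, a local in-process cache on each API server (Caffeine on the JVM, or a TTL-aware LRU cache in Go; a bare sync.Map has neither a capacity bound nor expiry), sized at 100,000 keys with a 30-second TTL, limits Redis key lookups to cache misses only (~5% of traffic when most requests come from frequently used keys). Cache invalidation on revocation is handled via Redis Pub/Sub: a revocation event triggers immediate cache eviction on all API servers, so revoked keys are normally rejected within 100 ms even from the local cache. Redis Pub/Sub delivery is at-most-once, so a server that misses the message keeps serving its cached entry until the TTL expires; if the 1-second revocation requirement must hold in that case as well, the TTL has to be 1 second or less.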
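A local cache with a capacity bound, a TTL, and explicit eviction, as described above, can be sketched in a few lines. The class and method names are assumptions; the defaults mirror the 100,000-entry capacity and 30-second TTL from the text, and the clock is injectable so expiry can be exercised without waiting.

```python
import time
from collections import OrderedDict


class LocalKeyCache:
    """LRU cache with per-entry TTL and explicit eviction (single-threaded sketch)."""

    def __init__(self, capacity: int = 100_000, ttl_seconds: float = 30.0,
                 clock=time.monotonic):
        self._capacity = capacity
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key_hash -> (expires_at, record)

    def get(self, key_hash: str):
        entry = self._entries.get(key_hash)
        if entry is None:
            return None
        expires_at, record = entry
        if self._clock() >= expires_at:
            del self._entries[key_hash]       # expired: treat as a miss
            return None
        self._entries.move_to_end(key_hash)   # mark as most recently used
        return record

    def put(self, key_hash: str, record) -> None:
        self._entries[key_hash] = (self._clock() + self._ttl, record)
        self._entries.move_to_end(key_hash)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop least recently used

    def evict(self, key_hash: str) -> None:
        """Called when a revocation event arrives for this key."""
        self._entries.pop(key_hash, None)
```

On a miss the caller would fall back to Redis and `put` the result; a multi-threaded server would also need a lock around these methods.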

HMAC computation at 500,000 ops/second: HMAC-SHA256 over a key of about 50 bytes takes ~1 microsecond on modern CPUs, so 500,000 ops/second × 1 microsecond = 0.5 CPU-seconds per second, or about half a core in total across the fleet, which is negligible. The hash lookup in Redis is the dominant latency contributor at 0.5–1ms per lookup.

Key Trade-offs

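  • HMAC-SHA256 hash vs. bcrypt for key storage: HMAC-SHA256 with a server secret is fast (about 1 microsecond), enabling 500,000 RPS validation; bcrypt is too slow (on the order of 100ms at typical cost factors) for hot-path validation. Slow hashes exist to protect low-entropy passwords; API keys carry 256 bits of random entropy, so a fast keyed hash is sufficient, and the 256-bit server secret means a leaked database alone cannot be used to test candidate keys or build lookup tables.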
  • Per-request Redis lookup vs. local cache: Per-request Redis lookup guarantees fresh revocation status but adds 1ms; local cache with Pub/Sub invalidation provides <0.1ms lookup with 100ms revocation latency — the right trade-off for most use cases.
  • Scoped vs. unscoped keys: Scoped keys limit blast radius (a compromised read-only key cannot delete resources) but add complexity to the authorization layer; unscoped keys are simpler to implement but give attackers full access on compromise.
  • Short vs. long key length: 256 bits (32 bytes) of entropy is cryptographically sufficient and unguessable; longer keys don't improve security but increase storage and transmission size.
