
System Design: Webhook Delivery System

System design of a webhook delivery system covering reliable event delivery, retry with exponential backoff, signature verification, dead letter queues, and delivery monitoring for API platforms.


Requirements

Functional Requirements:

  • Producers (internal services) publish events; the system delivers them as HTTP POST requests to subscriber-registered endpoints
  • Subscribers register webhook endpoints with event type filtering (e.g., subscribe to "payment.completed" but not "payment.pending")
  • Guaranteed at-least-once delivery with configurable retry policies (exponential backoff, max retries)
  • Cryptographic signature on each delivery (HMAC-SHA256) for endpoint verification
  • Dead letter queue for events that exhaust all retry attempts
  • Delivery logs with request/response details, latency, and retry history

Non-Functional Requirements:

  • Deliver 100K events/sec with P95 delivery latency under 5 seconds from event publication
  • Support 1M registered webhook endpoints across 100K customers
  • 99.95% successful delivery rate (within retry window) to healthy endpoints
  • Retry window: up to 72 hours with exponential backoff
  • Isolated delivery: a slow/failing endpoint cannot affect delivery to other endpoints

Scale Estimation

  • Throughput: 100K events/sec published; average fan-out of 3 endpoints per event = 300K deliveries/sec
  • Concurrency: each delivery is an HTTP POST with a ~2KB payload at ~500ms average latency, so 300K/sec × 0.5s = 150K concurrent outbound HTTP connections
  • Retries: assuming a 5% failure rate and 3 retries per failed delivery on average, 300K × 0.05 × 3 = 45K additional deliveries/sec, for 345K total delivery attempts/sec
  • Delivery logs: 345K/sec × 500 bytes = 172.5MB/sec ≈ 14.9TB/day
  • Dead letter events (after exhausting retries): 300K/sec × 0.1% permanently failing = 300/sec ≈ 26M/day
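
The concurrency figure is Little's law (in-flight requests = arrival rate × latency). A quick sketch to reproduce the arithmetic:

```go
// Back-of-envelope check of the figures above (rates are per second).
package main

import "fmt"

func main() {
	const (
		eventsPerSec   = 100_000
		fanOut         = 3    // endpoints per event
		latencySec     = 0.5  // average delivery latency
		failureRate    = 0.05 // share of deliveries needing retries
		retriesPerFail = 3    // average retries per failed delivery
		logBytes       = 500  // bytes per delivery-log row
	)
	deliveries := eventsPerSec * fanOut                                // 300K/sec
	concurrent := float64(deliveries) * latencySec                     // 150K in-flight (Little's law)
	attempts := float64(deliveries) * (1 + failureRate*retriesPerFail) // 345K/sec
	logTBPerDay := attempts * logBytes * 86_400 / 1e12                 // ≈14.9 TB/day
	fmt.Println(deliveries, concurrent, attempts, logTBPerDay)
}
```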

High-Level Architecture

The system has four components: Event Ingestion, Routing, Delivery, and Observability. The Event Ingestion Layer receives events from internal producers via a Kafka topic (events topic). Each event contains: event_type, payload (JSON), event_id (for idempotency), and timestamp. The Routing Layer consumes events from Kafka and fans them out to delivery tasks. For each event, the Router queries the Subscription Store (Redis cache backed by PostgreSQL) to find all endpoints subscribed to that event_type. For each matching endpoint, it creates a Delivery Task: {event_id, endpoint_url, payload, signing_secret, attempt_number: 1} and publishes it to a delivery queue (a separate Kafka topic partitioned by endpoint_id for ordering guarantees per endpoint).
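
As a concrete shape, here is a minimal Go sketch of the delivery task; the struct and its JSON tags are illustrative, following the fields listed above:

```go
package delivery

import "encoding/json"

// DeliveryTask is one unit of work on the delivery topic: one message per
// (event, endpoint) pair, keyed by endpoint_id for per-endpoint ordering.
type DeliveryTask struct {
	EventID       string          `json:"event_id"`
	EndpointURL   string          `json:"endpoint_url"`
	Payload       json.RawMessage `json:"payload"`        // denormalized event payload
	SigningSecret string          `json:"signing_secret"` // per-endpoint HMAC key
	AttemptNumber int             `json:"attempt_number"` // starts at 1
}
```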

The Delivery Layer is a fleet of HTTP workers that consume delivery tasks and execute HTTP POST requests. Each worker: (1) constructs the HTTP request with the event payload as the body, (2) computes the HMAC-SHA256 signature using the endpoint's signing secret and adds it to the X-Webhook-Signature header, (3) sends the request with a 30-second timeout, (4) evaluates the response: 2xx = success, 4xx = permanent failure (except 429), 5xx or timeout = transient failure. Transient failures are requeued with exponential backoff delay. The worker records the delivery attempt in the delivery log.
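
A minimal sketch of a single attempt under these rules, using Go's net/http (one of the client stacks the Core Components section mentions); signPayload, attemptDelivery, and the outcome type are illustrative names:

```go
package main

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

type outcome int

const (
	success outcome = iota
	permanentFailure // do not retry
	transientFailure // requeue with exponential backoff
)

// signPayload computes HMAC-SHA256(secret, timestamp + "." + payload).
func signPayload(secret, timestamp string, payload []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(timestamp + "."))
	mac.Write(payload)
	return hex.EncodeToString(mac.Sum(nil))
}

func attemptDelivery(client *http.Client, url, secret string, payload []byte) outcome {
	ts := strconv.FormatInt(time.Now().Unix(), 10)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return permanentFailure // malformed URL: no point retrying
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Webhook-Timestamp", ts)
	req.Header.Set("X-Webhook-Signature", "v1="+signPayload(secret, ts, payload))

	resp, err := client.Do(req)
	if err != nil {
		return transientFailure // timeout or connection error
	}
	defer resp.Body.Close()
	switch {
	case resp.StatusCode >= 200 && resp.StatusCode < 300:
		return success
	case resp.StatusCode == http.StatusTooManyRequests:
		return transientFailure // 429: retry with backoff
	case resp.StatusCode >= 400 && resp.StatusCode < 500:
		return permanentFailure // other 4xx: endpoint rejected the event
	default:
		return transientFailure // 5xx
	}
}

func main() {
	client := &http.Client{Timeout: 30 * time.Second} // per-delivery timeout
	fmt.Println(attemptDelivery(client, "https://example.com/hook", "secret", []byte(`{"ok":true}`)))
}
```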

The Observability Layer tracks delivery health per endpoint. An Endpoint Health Service monitors delivery success rates, latency trends, and failure patterns. Endpoints with >50% failure rate over 1 hour are flagged as degraded; endpoints failing 100% for 24 hours are auto-disabled with a notification to the subscriber. This prevents wasted resources on dead endpoints.
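
The policy might be encoded as simply as this (an illustrative sketch; the status names match the endpoints.status enum under Database Design):

```go
// healthStatus applies the thresholds described above.
func healthStatus(failureRate1h float64, hoursAtTotalFailure int) string {
	switch {
	case hoursAtTotalFailure >= 24:
		return "disabled" // auto-disable and notify the subscriber
	case failureRate1h > 0.5:
		return "degraded"
	default:
		return "active"
	}
}
```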

Core Components

Event Fan-Out & Routing

The Router consumes events from the events Kafka topic (100K events/sec across 50 partitions). For each event, it performs a subscription lookup: a Redis set per event_type holds the subscribed endpoint_ids (see Database Design). The Router then creates individual delivery tasks. Fan-out amplification (1 event → N endpoints) is handled by producing N messages to the delivery topic. To avoid overwhelming the delivery topic during high-fan-out events (e.g., a global event subscribed to by 100K endpoints), the Router rate-limits fan-out to 10K messages/sec per event_type, spreading delivery over seconds rather than creating an instantaneous burst. Each delivery task carries the full event payload (denormalized) to avoid a database lookup in the delivery worker.
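
A fan-out sketch under these rules, assuming the go-redis and golang.org/x/time/rate libraries and the DeliveryTask struct from the earlier sketch; the subs:<event_type> key scheme and lookupEndpoint helper are hypothetical:

```go
package delivery

import (
	"context"

	"github.com/redis/go-redis/v9"
	"golang.org/x/time/rate"
)

// Event is the unit consumed from the events topic (illustrative shape).
type Event struct {
	ID      string
	Type    string
	Payload []byte
}

// lookupEndpoint is a stand-in for a cached endpoint lookup.
func lookupEndpoint(id string) (url, secret string) {
	return "https://example.com/hook", "secret"
}

// fanOut expands one event into per-endpoint delivery tasks. A real router
// would keep one limiter per event_type rather than take it as a parameter.
func fanOut(ctx context.Context, rdb *redis.Client, limiter *rate.Limiter,
	produce func(key string, t DeliveryTask), ev Event) error {
	endpointIDs, err := rdb.SMembers(ctx, "subs:"+ev.Type).Result()
	if err != nil {
		return err
	}
	for _, id := range endpointIDs {
		// Throttle to 10K tasks/sec per event_type to avoid burst fan-out.
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		url, secret := lookupEndpoint(id)
		// Keyed by endpoint_id so the delivery topic preserves per-endpoint order.
		produce(id, DeliveryTask{
			EventID:       ev.ID,
			EndpointURL:   url,
			Payload:       ev.Payload,
			SigningSecret: secret,
			AttemptNumber: 1,
		})
	}
	return nil
}
```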

Delivery Workers & Retry Logic

Delivery workers are async HTTP clients (Go with net/http or Node.js with undici) optimized for high concurrency. Each worker handles 10K concurrent outbound connections using connection pooling per endpoint hostname (max 100 connections per host to avoid overwhelming a single endpoint). Retry schedule, as offsets from the first attempt: attempt 1 immediate, attempt 2 at +1 minute, attempt 3 at +5 minutes, attempt 4 at +30 minutes, attempt 5 at +2 hours, attempt 6 at +8 hours, attempt 7 at +24 hours, attempt 8 at +72 hours. Delayed retries are implemented with tiered Kafka retry topics (one topic per delay tier, whose consumers pause until messages come due, since Kafka has no native delayed delivery) or a Redis sorted set (score = the delivery-due timestamp, polled every second). After attempt 8, the event is moved to the dead letter queue and the subscriber is notified.
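
Expressed as a lookup table of offsets from the first attempt (a sketch with illustrative names), including the hand-off to the dead letter queue after attempt 8:

```go
package delivery

import "time"

// attemptOffsets holds each attempt's offset from the first attempt.
var attemptOffsets = []time.Duration{
	0,                // attempt 1: immediate
	1 * time.Minute,  // attempt 2
	5 * time.Minute,  // attempt 3
	30 * time.Minute, // attempt 4
	2 * time.Hour,    // attempt 5
	8 * time.Hour,    // attempt 6
	24 * time.Hour,   // attempt 7
	72 * time.Hour,   // attempt 8
}

// nextAttemptAt returns when attempt n is due, or ok=false once the
// schedule is exhausted and the task should move to the dead letter queue.
func nextAttemptAt(firstAttempt time.Time, n int) (at time.Time, ok bool) {
	if n < 1 || n > len(attemptOffsets) {
		return time.Time{}, false
	}
	return firstAttempt.Add(attemptOffsets[n-1]), true
}
```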

Signature Verification

Each webhook delivery includes a cryptographic signature enabling the receiver to verify authenticity. The signature is computed as: HMAC-SHA256(signing_secret, timestamp + "." + payload). The timestamp is included to prevent replay attacks; receivers should reject signatures older than 5 minutes. The signature and timestamp are sent in HTTP headers: X-Webhook-Signature: v1={signature} and X-Webhook-Timestamp: {unix_seconds}. Each endpoint has a unique signing_secret generated on registration (256-bit random, stored encrypted in PostgreSQL). Secret rotation is supported: during rotation, both old and new secrets are valid for 24 hours, and deliveries include signatures for both (X-Webhook-Signature: v1={new_sig},v0={old_sig}).
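
A receiver-side verification sketch for this scheme; hmac.Equal provides a constant-time comparison, and the loop accepts either signature version during secret rotation:

```go
package receiver

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"strconv"
	"strings"
	"time"
)

// verifySignature checks a delivery against the scheme above. header is
// X-Webhook-Signature (possibly "v1=...,v0=..." during rotation) and
// tsHeader is X-Webhook-Timestamp.
func verifySignature(secret, header, tsHeader string, body []byte, now time.Time) bool {
	ts, err := strconv.ParseInt(tsHeader, 10, 64)
	if err != nil || now.Sub(time.Unix(ts, 0)).Abs() > 5*time.Minute {
		return false // malformed or stale timestamp: reject to block replays
	}
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(tsHeader + "."))
	mac.Write(body)
	expected := hex.EncodeToString(mac.Sum(nil))
	for _, part := range strings.Split(header, ",") {
		// Strip the "v1="/"v0=" version prefix before comparing.
		_, sig, found := strings.Cut(strings.TrimSpace(part), "=")
		if found && hmac.Equal([]byte(sig), []byte(expected)) {
			return true // hmac.Equal is a constant-time comparison
		}
	}
	return false
}
```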

Database Design

PostgreSQL stores subscription and endpoint data: endpoints (endpoint_id UUID PK, customer_id, url VARCHAR, signing_secret_encrypted VARCHAR, event_types ARRAY, status ENUM(active, degraded, disabled), created_at, updated_at) and subscriptions (subscription_id, endpoint_id, event_type, created_at). Redis caches the subscription index: per event_type, a SET of endpoint_ids. The cache is refreshed on subscription changes via CDC from PostgreSQL.

Delivery logs are stored in ClickHouse (columnar, optimized for analytical queries): delivery_logs (event_id, endpoint_id, customer_id, event_type, attempt_number, status ENUM(success, failed, retrying), http_status INT nullable, response_body TEXT nullable truncated to 1KB, latency_ms INT, error_message nullable, delivered_at DateTime). Partitioned by delivered_at (daily) with TTL retention of 30 days. ClickHouse enables fast analytical queries: "show me all failed deliveries to endpoint X in the last 24 hours" or "what is the average delivery latency for customer Y?".

Dead letter queue: dlq_events (dlq_id UUID PK, event_id, endpoint_id, customer_id, event_type, payload JSONB, final_error VARCHAR, total_attempts INT, first_attempt_at, last_attempt_at) stored in PostgreSQL. Subscribers can view DLQ events via API and replay them (re-enqueue for delivery).

API Design

  • POST /api/v1/endpoints — Register a webhook endpoint; body contains url, event_types; returns endpoint_id and signing_secret
  • POST /api/v1/events — Publish an event; body contains event_type, payload; returns event_id
  • GET /api/v1/endpoints/{endpoint_id}/deliveries?status=failed&limit=50 — Fetch delivery log for an endpoint with filtering
  • POST /api/v1/endpoints/{endpoint_id}/dlq/{dlq_id}/replay — Replay a dead-letter event to the endpoint
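
For illustration, a hypothetical Go client for the registration call; the field names follow the API description above:

```go
package client

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// registerEndpoint calls POST /api/v1/endpoints; the helper is hypothetical.
func registerEndpoint(base, url string, eventTypes []string) (endpointID, signingSecret string, err error) {
	body, err := json.Marshal(map[string]any{"url": url, "event_types": eventTypes})
	if err != nil {
		return "", "", err
	}
	resp, err := http.Post(base+"/api/v1/endpoints", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", "", err
	}
	defer resp.Body.Close()
	var out struct {
		EndpointID    string `json:"endpoint_id"`
		SigningSecret string `json:"signing_secret"` // shown once; store it securely
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", "", err
	}
	return out.EndpointID, out.SigningSecret, nil
}
```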

Scaling & Bottlenecks

Outbound HTTP connections are the primary bottleneck. At 150K concurrent connections, the delivery fleet must manage connection pools efficiently. Each worker maintains per-host connection pools (with HTTP/2 multiplexing where supported, reducing connection count). A fleet of 20 workers, each handling 10K concurrent connections with async I/O, provides the required capacity. The key isolation requirement is that a slow endpoint (up to the 30-second timeout) must not affect delivery to other endpoints. Because the delivery topic is partitioned by endpoint_id, all of an endpoint's tasks flow through a single worker, so the per-host limit of 100 connections bounds a slow endpoint to at most 100 slots across the fleet, leaving 149,900 for healthy endpoints.
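
With Go's net/http (one of the worker stacks named earlier), these limits map onto Transport fields; the values below mirror the numbers in this section and are otherwise illustrative:

```go
package delivery

import (
	"net/http"
	"time"
)

// newDeliveryClient sketches the pool tuning described above.
func newDeliveryClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second, // per-delivery timeout
		Transport: &http.Transport{
			MaxConnsPerHost:     100,    // isolation cap per endpoint host
			MaxIdleConnsPerHost: 100,    // keep warm connections to busy endpoints
			MaxIdleConns:        10_000, // sized for a 10K-connection worker
			IdleConnTimeout:     90 * time.Second,
			ForceAttemptHTTP2:   true, // multiplex where the endpoint supports it
		},
	}
}
```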

The retry system must manage a growing queue of delayed deliveries. During an endpoint outage lasting 72 hours, the system accumulates 300 deliveries/sec for that endpoint × 259,200 seconds ≈ 77.8M pending deliveries. The delayed retry mechanism (Redis sorted set or tiered Kafka retry topics) must handle this volume. Using Redis sorted sets with ZRANGEBYSCORE polling every second, the system processes due retries efficiently, as sketched below. Multiple retry workers partition the sorted set by endpoint_id hash to parallelize processing.
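
A poller sketch for the sorted-set variant, assuming go-redis; the retry:<shard> key scheme and batch size are illustrative:

```go
package delivery

import (
	"context"
	"strconv"
	"time"

	"github.com/redis/go-redis/v9"
)

// pollDueRetries drains retry tasks whose score (unix due time) has passed.
func pollDueRetries(ctx context.Context, rdb *redis.Client, shard string, enqueue func(task string)) error {
	key := "retry:" + shard
	now := strconv.FormatInt(time.Now().Unix(), 10)
	due, err := rdb.ZRangeByScore(ctx, key, &redis.ZRangeBy{
		Min: "-inf", Max: now, Count: 1000, // batch to bound each poll
	}).Result()
	if err != nil {
		return err
	}
	for _, task := range due {
		// ZRem returns 1 only for the poller that wins the removal, so
		// overlapping pollers cannot enqueue the same task twice.
		if n, err := rdb.ZRem(ctx, key, task).Result(); err == nil && n == 1 {
			enqueue(task)
		}
	}
	return nil
}
```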

Key Trade-offs

  • At-least-once vs at-most-once delivery: At-least-once (retries on failure) can deliver duplicates, requiring receivers to implement idempotency using the event_id (see the sketch after this list) — the alternative (at-most-once) risks lost events, which is unacceptable for payment and order webhooks
  • Denormalized payload in delivery task vs event_id reference: Including the full payload in each delivery task avoids a database lookup in the hot delivery path but increases Kafka message size and storage — the latency reduction (eliminating a DB call per delivery) justifies the storage cost
  • Exponential backoff over 72 hours vs shorter retry window: 72 hours accommodates weekend outages and deployment issues, but accumulates retry volume — the exponential schedule (1min → 5min → 30min → 2h → 8h → 24h → 72h) keeps retry frequency manageable
  • Auto-disable failing endpoints vs unlimited retries: Auto-disabling after 24 hours of 100% failure prevents resource waste on dead endpoints but risks disabling endpoints experiencing temporary DNS issues — the subscriber notification and easy re-enable mechanism balance automation with user control
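
A receiver-side idempotency sketch for the first trade-off, assuming go-redis; the handleOnce helper and seen: key scheme are illustrative:

```go
package receiver

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// handleOnce processes an event at most once, keyed by event_id.
func handleOnce(ctx context.Context, rdb *redis.Client, eventID string, process func() error) error {
	// SetNX marks the event as seen; it returns false for duplicates.
	fresh, err := rdb.SetNX(ctx, "seen:"+eventID, 1, 7*24*time.Hour).Result()
	if err != nil {
		return err
	}
	if !fresh {
		return nil // duplicate delivery: acknowledge without reprocessing
	}
	if err := process(); err != nil {
		rdb.Del(ctx, "seen:"+eventID) // clear the marker so a retry can succeed
		return err
	}
	return nil
}
```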
