System Design: Service Discovery System
Design a service discovery system like Consul or Eureka that enables microservices to dynamically find and communicate with each other using health-checked registries and DNS or HTTP lookup interfaces.
Requirements
Functional Requirements:
- Services register their address (IP:port) on startup and deregister on graceful shutdown
- Health checks determine which instances are available: HTTP, TCP, and script-based checks
- Clients query for healthy instances of a service by name, optionally filtered by tags or datacenter
- Support both client-side load balancing (client picks from returned list) and server-side (returns single endpoint)
- Propagate health status changes to all clients within 10 seconds of a service becoming unhealthy
- Support multi-datacenter federation: services in DC-A can discover services in DC-B
Non-Functional Requirements:
- High availability: the server cluster must survive the loss of ⌊(n-1)/2⌋ of its n nodes (1 of 3, 2 of 5)
- Query latency under 5ms for cached lookups; under 50ms for uncached
- Support 100,000 registered service instances across 10,000 services
- Handle 500,000 discovery queries/sec across all clients
Scale Estimation
100,000 service instances, each sending a heartbeat every 10 seconds = 10,000 heartbeats/sec to the registry. Each heartbeat is a small HTTP request (~200 bytes). 500,000 queries/sec: most are DNS lookups (~50 bytes each) served by local caching DNS resolvers. The registry itself handles far fewer queries — clients cache results with short TTLs (5-30 seconds). Registry state: 100,000 instances × 500 bytes metadata = 50 MB, trivially fits in memory on all server nodes.
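A quick sanity check of these numbers in Go; the constants mirror the estimates above, with 500 bytes as the assumed per-instance metadata size:

```go
package main

import "fmt"

func main() {
	const (
		instances        = 100_000
		heartbeatPeriodS = 10  // seconds between heartbeats per instance
		bytesPerInstance = 500 // assumed metadata size per instance
	)
	// Heartbeat write rate seen by the registry.
	fmt.Println("heartbeats/sec:", instances/heartbeatPeriodS) // 10000
	// Total in-memory registry state across all instances.
	fmt.Printf("registry state: %d MB\n", instances*bytesPerInstance/1_000_000) // 50
}
```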
High-Level Architecture
The service discovery system runs as a cluster of server nodes (typically 3 or 5) using Raft consensus to maintain a consistent view of the service registry. All writes (registrations, deregistrations, health status changes) go through the Raft leader and are replicated before acknowledgment. Reads can be served by any node — with stale-read semantics (bounded by replication lag, typically <100ms) or strong consistency (always through the leader).
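A sketch of how a node might route reads by consistency mode. The `ConsistencyMode` type, `Node` struct, and `ErrNotLeader` retry contract are illustrative, not the API of any real system:

```go
package registry

import "errors"

// ConsistencyMode selects how a read is served (names are illustrative).
type ConsistencyMode int

const (
	Stale      ConsistencyMode = iota // any node; bounded by replication lag
	Consistent                        // leader only; linearizable
)

var ErrNotLeader = errors.New("not the raft leader; retry against leader")

type Node struct {
	isLeader bool
	catalog  map[string][]string // service name -> instance addresses
}

// Lookup serves a read according to the requested consistency mode.
func (n *Node) Lookup(service string, mode ConsistencyMode) ([]string, error) {
	if mode == Consistent && !n.isLeader {
		// Strong reads must go through the leader, which confirms
		// leadership before answering.
		return nil, ErrNotLeader
	}
	// Stale reads answer from local state; may lag the leader by <100ms.
	return n.catalog[service], nil
}
```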
Clients interact via two interfaces: an HTTP/gRPC API for rich queries (filter by tag, datacenter, health status) and a DNS interface for simple service name resolution. The DNS interface returns A records (IP addresses) for healthy instances. Clients cache DNS responses for the TTL (5 seconds by default) and re-query after expiry. The DNS interface enables zero-code-change service discovery for legacy applications.
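A minimal sketch of the client-side caching behavior: resolve through DNS, hold the answer for the TTL, and re-query on expiry. The service name is illustrative; the resolution itself uses Go's standard library:

```go
package discovery

import (
	"net"
	"sync"
	"time"
)

// ttlCache caches DNS answers for a fixed TTL, mirroring how clients
// re-query the discovery DNS interface after expiry. A 5s TTL matches
// the default described above.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

type cacheEntry struct {
	addrs   []string
	expires time.Time
}

func newTTLCache(ttl time.Duration) *ttlCache {
	return &ttlCache{ttl: ttl, entries: make(map[string]cacheEntry)}
}

// Resolve returns cached addresses if still fresh, otherwise re-queries DNS.
func (c *ttlCache) Resolve(name string) ([]string, error) {
	c.mu.Lock()
	e, ok := c.entries[name]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.addrs, nil
	}
	addrs, err := net.LookupHost(name) // e.g. "web.service.dc1.consul"
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.entries[name] = cacheEntry{addrs: addrs, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return addrs, nil
}
```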
Health checks run server-side (the registry server polls service health endpoints) or client-side (the service agent sends heartbeats). Server-side checks: the registry sends an HTTP GET to http://service:8080/health every 10 seconds; two consecutive failures mark the instance unhealthy. Client-side checks (TTL-based): the service must call /agent/check/pass/<check_id> within a configured interval; if no call arrives, the check fails. Client-side checks are better for application-level health (checking DB connections, processing queue depth) that only the application itself knows.
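A sketch of the server-side HTTP check loop described above; `markCritical` and `markPassing` are hypothetical registry callbacks, and the two-failure threshold matches the default:

```go
package health

import (
	"net/http"
	"time"
)

// httpCheck polls a service's health endpoint on an interval and marks it
// critical after `threshold` consecutive failures (default 2, as above).
// markCritical/markPassing are hypothetical registry callbacks.
func httpCheck(url string, interval time.Duration, threshold int,
	markCritical, markPassing func()) {
	client := &http.Client{Timeout: 5 * time.Second}
	failures := 0
	for range time.Tick(interval) {
		resp, err := client.Get(url) // e.g. http://service:8080/health
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			failures = 0
			markPassing()
			continue
		}
		if resp != nil {
			resp.Body.Close()
		}
		if failures++; failures >= threshold {
			markCritical() // excluded from discovery results
		}
	}
}
```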
Core Components
Raft Consensus Cluster
Server nodes run Raft for strong consistency. The leader handles all writes from clients (direct registration calls) and from agents (forwarded from the local agent on each service host). Followers participate in consensus and can serve stale reads. Leader election uses randomized timeouts (150-300ms) — if a follower doesn't hear from the leader, it starts an election. With 3 nodes, the system tolerates 1 failure; with 5 nodes, 2 simultaneous failures. Raft log entries for service registrations are compacted via snapshots (a point-in-time dump of the entire registry) to bound WAL size.
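A sketch of the randomized election timeout; the channel plumbing and `startElection` callback are illustrative, not a real Raft library's API:

```go
package raft

import (
	"math/rand"
	"time"
)

// electionTimeout returns a randomized timeout in [150ms, 300ms). The
// randomization makes it unlikely that two followers time out together
// and split the vote.
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Int63n(int64(150*time.Millisecond)))
}

// A follower resets its timer on every leader heartbeat; if the timer
// fires first, it becomes a candidate and starts an election.
func runFollower(heartbeats <-chan struct{}, startElection func()) {
	for {
		select {
		case <-heartbeats:
			// Leader is alive; restart the countdown.
		case <-time.After(electionTimeout()):
			startElection()
			return
		}
	}
}
```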
Health Check Engine
The health check engine runs on each server node for server-side checks. To avoid thundering herd (all health checks expiring at the same moment), checks are jitter-distributed: a 10-second interval check fires at a random offset within the first interval. Failed checks are retried immediately; after N consecutive failures (configurable, default 2), the instance is marked critical. Critical instances are excluded from discovery results but not immediately deregistered — they may recover. Instances deregister via explicit API call (graceful shutdown) or via a separate deregistration timeout (if the service process crashes without deregistering).
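The jitter distribution is simple to implement: delay each check's first firing by a random offset within one interval, then run on the fixed period. A minimal sketch:

```go
package health

import (
	"math/rand"
	"time"
)

// scheduleWithJitter starts a periodic check at a random offset within
// the first interval, so checks registered at the same moment don't all
// fire at the same instant (avoiding a thundering herd).
func scheduleWithJitter(interval time.Duration, run func()) {
	go func() {
		time.Sleep(time.Duration(rand.Int63n(int64(interval))))
		for {
			run()
			time.Sleep(interval)
		}
	}()
}
```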
DNS Interface
The DNS server returns SRV records (IP + port + priority + weight) and A records for service names following the pattern <tag>.<service>.service.<datacenter>.consul. SRV records enable weighted load balancing at the DNS level — instances with more capacity get higher weight. The DNS interface uses UDP for queries under 512 bytes and TCP for larger responses. EDNS0 support extends the UDP payload limit to 4096 bytes, allowing larger service instance lists in a single response. The DNS TTL is kept short (5 seconds) to ensure health state propagates quickly; negative caching is disabled to avoid caching "service not found" responses during transient registration lag.
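From a client's perspective, SRV resolution works with Go's standard resolver; the service name below is illustrative:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Resolve SRV records for a service; the discovery DNS server fills in
	// port, priority, and weight per healthy instance. With empty service
	// and proto arguments, LookupSRV queries the name directly.
	_, srvs, err := net.LookupSRV("", "", "web.service.dc1.consul")
	if err != nil {
		panic(err)
	}
	for _, s := range srvs {
		fmt.Printf("%s:%d (priority=%d weight=%d)\n",
			s.Target, s.Port, s.Priority, s.Weight)
	}
}
```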
Database Design
All registry state is in-memory on server nodes, replicated via Raft. The backing store is a radix tree (for fast prefix-based service lookups) layered on a time-ordered index (for efficient health check scheduling and TTL-based expiry). No disk-based database — state is reconstructed from the Raft log and snapshots on restart. Snapshots are written to disk periodically (when the log exceeds 16,384 entries) and on graceful shutdown.
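A sketch of the snapshot trigger under these assumptions; `persist` and `truncateLog` are hypothetical callbacks, and the threshold matches the 16,384-entry figure above:

```go
package registry

// snapshotThreshold bounds Raft log growth between snapshots.
const snapshotThreshold = 16384

// maybeSnapshot persists a point-in-time dump of the registry once the
// Raft log grows past the threshold, then lets the log be truncated.
func maybeSnapshot(logLen int, persist func() error, truncateLog func()) error {
	if logLen < snapshotThreshold {
		return nil
	}
	if err := persist(); err != nil { // write snapshot to disk
		return err
	}
	truncateLog() // entries up to the snapshot index can be discarded
	return nil
}
```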
For long-term analytics (registration events, health check failure history), events are streamed to an external store (Kafka → ClickHouse). The registry itself only retains current state and a short event buffer for watch/blocking queries.
API Design
The HTTP API exposes the operations described above (paths are illustrative, modeled on Consul's conventions):
- PUT /v1/agent/service/register: register an instance (name, address, port, tags, check definitions)
- PUT /v1/agent/service/deregister/<service_id>: graceful deregistration on shutdown
- PUT /agent/check/pass/<check_id>: refresh a TTL-based (client-side) check, as described under health checks
- GET /v1/health/service/<name>?tag=<tag>&dc=<dc>: list healthy instances, filtered by tag or datacenter
- GET with index=<n>&wait=30s appended: blocking query; the server holds the request open until the result changes or the wait expires
Writes are forwarded to the Raft leader; reads accept a stale or consistent mode as described under the Raft cluster.
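A registration call might look like the following; the endpoint path and payload schema are assumptions for illustration, not a fixed contract:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// serviceRegistration is an illustrative payload shape.
type serviceRegistration struct {
	Name    string   `json:"name"`
	Address string   `json:"address"`
	Port    int      `json:"port"`
	Tags    []string `json:"tags"`
}

func main() {
	reg := serviceRegistration{
		Name:    "web",
		Address: "10.0.1.17",
		Port:    8080,
		Tags:    []string{"primary", "v2"},
	}
	body, _ := json.Marshal(reg)
	// The local agent forwards the write to the Raft leader.
	req, _ := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:8500/v1/agent/service/register", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```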
Scaling & Bottlenecks
At 500,000 queries/sec, the server cluster would be overwhelmed if every client queried directly. The solution is a local agent on every host: each service host runs an agent that caches service catalog data and serves local queries without hitting the server cluster. The agent maintains a local copy of the catalog, updated via blocking queries (long-poll watches) against the server cluster. This reduces server query load to ~1 query per service name per agent per TTL period — roughly 10,000 agents × 100 watched services ÷ 30 second TTL = 33,000 queries/sec to the server cluster.
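A sketch of the agent's blocking-query loop; the index/wait parameters and response shape are assumptions about the wire format, but the long-poll pattern is as described: an idle watch costs one request per wait period rather than constant polling:

```go
package agent

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// watch long-polls the server cluster for changes to one service. The
// server holds the request open until its state advances past `index`
// or the wait duration elapses.
func watch(server, service string, onUpdate func([]string)) error {
	index := 0
	for {
		url := fmt.Sprintf("%s/v1/health/service/%s?index=%d&wait=30s",
			server, service, index)
		resp, err := http.Get(url)
		if err != nil {
			return err
		}
		var result struct {
			Index     int      `json:"index"`
			Instances []string `json:"instances"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
			resp.Body.Close()
			return err
		}
		resp.Body.Close()
		if result.Index != index {
			index = result.Index
			onUpdate(result.Instances) // refresh the agent's local cache
		}
	}
}
```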
Raft write throughput limits registration rate. At peak (a mass deployment), 10,000 services registering at once means 10,000 Raft writes within a few seconds. Mitigations: batch registrations into fewer Raft proposals (sketched below), increase Raft heartbeat frequency to detect leader failures faster during high-write periods, and scale the Raft cluster to 5 nodes for better fault tolerance without significantly impacting write latency.
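A sketch of the batching idea, assuming a hypothetical `propose` callback that commits one Raft entry per batch:

```go
package registry

import "time"

// Registration is an illustrative write record.
type Registration struct{ Name, Addr string }

// batchWrites coalesces registrations arriving within a window into a
// single Raft proposal, so a mass deployment costs hundreds of log
// entries instead of 10,000.
func batchWrites(in <-chan Registration, window time.Duration,
	propose func([]Registration) error) {
	var batch []Registration
	timer := time.NewTimer(window)
	for {
		select {
		case r := <-in:
			batch = append(batch, r)
		case <-timer.C:
			if len(batch) > 0 {
				_ = propose(batch) // one Raft entry for the whole batch
				batch = nil
			}
			timer.Reset(window)
		}
	}
}
```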
Key Trade-offs
- Client-side vs. server-side load balancing: Client-side (client picks from the returned list; see the sketch after this list) distributes load naturally but requires each client to implement balancing logic; server-side (single endpoint returned) centralizes balancing but creates a single proxy bottleneck
- DNS vs. HTTP API: DNS requires no client library and works with all languages, but is limited to IP:port (no rich metadata, no health filtering beyond basic); HTTP API provides rich filtering but requires integration
- Push vs. pull health status: Pushing health changes to all clients immediately reduces propagation latency but requires maintaining client connections; pull (clients re-query after TTL) is simpler but means stale data for the TTL duration
- Strong vs. eventual consistency: Strong consistency (all reads through leader) ensures clients never see stale data but limits read throughput; eventual consistency (follower reads) scales reads but clients may briefly see deregistered instances
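A minimal sketch of the client-side strategy from the first trade-off: the client receives the full healthy-instance list and chooses one itself (here uniformly at random; round-robin or least-loaded are common alternatives):

```go
package lb

import "math/rand"

// pick selects one instance from the healthy list returned by discovery.
func pick(instances []string) (string, bool) {
	if len(instances) == 0 {
		return "", false
	}
	return instances[rand.Intn(len(instances))], true
}
```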