System Design: Distributed Tracing System
Design a distributed tracing system like Jaeger or Zipkin that correlates requests across microservices, identifies performance bottlenecks, and provides flame graph visualizations at high-throughput scale.
Requirements
Functional Requirements:
- Correlate a single user request across all microservices it touches using a trace ID
- Record spans (units of work within a service) with start time, duration, tags, and logs
- Propagate trace context across service boundaries via HTTP headers (W3C TraceContext) and message queue metadata
- Search traces by service, operation, duration, error status, and custom tags
- Visualize traces as waterfall/Gantt charts showing span timing relationships
- Compute service dependency graphs and latency percentiles per operation
Non-Functional Requirements:
- Ingest 1 million spans/sec with end-to-end latency from emission to queryability under 30 seconds
- Sampling: collect 100% of error traces, 1% of success traces — reducing storage by 100x
- Query response time under 3 seconds for a trace lookup by trace ID
- Retain traces for 7 days hot, 30 days warm
Scale Estimation
A system with 1,000 services handling an aggregate of 1 million requests/sec, with an average of 10 spans per trace, generates 10 million spans/sec before sampling. At a 1% sample rate: 100,000 spans/sec stored. Each span: ~1 KB (operation name, tags, logs, timestamps) = 100 MB/sec = 8.6 TB/day. At 7-day retention: 60 TB of hot trace storage. Error traces (100% sampled): at 1 million traces/sec, a 0.1% error rate yields 1,000 error traces/sec × 10 spans/trace = 10,000 additional spans/sec, about a 10% overhead on top of the success sample. Query load: 1,000 engineers × 10 trace lookups/minute = ~167 lookups/sec.
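A quick sanity check of this arithmetic (the constants are this estimate's assumptions; the last two figures include the error-trace overhead, so they land slightly above the success-only 8.6 TB/day):

```go
package main

import "fmt"

func main() {
	const (
		tracesPerSec  = 1_000_000 // aggregate requests/sec (assumed)
		spansPerTrace = 10
		spanBytes     = 1_000 // ~1 KB per span
		successSample = 0.01  // 1% of success traces kept
		errorRate     = 0.001 // 0.1% of traces fail, kept at 100%
	)

	rawSpans := tracesPerSec * spansPerTrace                        // 10M spans/sec before sampling
	sampled := float64(rawSpans) * successSample                    // 100K spans/sec from the 1% sample
	errorSpans := float64(tracesPerSec) * errorRate * spansPerTrace // 10K spans/sec from error traces

	bytesPerSec := (sampled + errorSpans) * spanBytes
	tbPerDay := bytesPerSec * 86_400 / 1e12

	fmt.Printf("raw: %d spans/s  stored: %.0f spans/s  %.1f TB/day  %.0f TB hot (7d)\n",
		rawSpans, sampled+errorSpans, tbPerDay, tbPerDay*7)
}
```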
High-Level Architecture
The tracing system has three phases: instrumentation, collection, and storage/query. Instrumentation is handled by OpenTelemetry SDKs embedded in each service. The SDK creates spans, samples them, and batches/exports them to a local OpenTelemetry Collector (running as a sidecar or DaemonSet). The Collector performs local processing (attribute filtering, sampling decisions) and forwards to the central tracing backend.
The central backend receives spans from thousands of Collectors, writes them to a message queue (Kafka) for buffering, processes them (span aggregation, trace assembly), and writes complete traces to the storage backend. Cassandra is a common storage choice (Jaeger uses it): its wide-column model maps naturally to trace data (partition key = trace_id, clustering columns = span timing). Elasticsearch is used for the search index (find traces by service/operation/tag).
Tail-based sampling is the key architectural challenge. Head-based sampling (decide at trace start whether to sample) is simple but can't selectively sample error traces (the error hasn't happened yet). Tail-based sampling (decide after the trace is complete) requires buffering all spans for an in-flight trace until it's complete — a collector must see all spans for a trace to make the sampling decision. This requires routing all spans for a trace_id to the same collector (using consistent hashing on trace_id).
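A sketch of that routing step, assuming a simple hash ring with virtual nodes (the shard names and vnode count are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring maps trace IDs to sampler shards via consistent hashing,
// so every span of a trace lands on the same shard.
type Ring struct {
	points []uint32          // sorted hash positions on the ring
	owner  map[uint32]string // hash position -> shard name
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places each shard at vnodes positions so load stays even
// when shards join or leave.
func NewRing(shards []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, s := range shards {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("%s#%d", s, v))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// ShardFor returns the shard owning the first ring point at or after
// the trace ID's hash, wrapping around at the end of the ring.
func (r *Ring) ShardFor(traceID string) string {
	h := hash32(traceID)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"sampler-0", "sampler-1", "sampler-2"}, 64)
	fmt.Println(ring.ShardFor("4bf92f3577b34da6a3ce929d0e0e4736"))
}
```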
Core Components
OpenTelemetry SDK & Collector
The OTel SDK in each service creates spans via a Tracer. Spans are nested: a root span per request, child spans for each downstream call or significant operation. Context propagation: when service A calls service B, A injects the trace context (traceparent header with trace_id and span_id) into the HTTP/gRPC request; B's SDK extracts it and creates a child span with A's span as the parent. The OTel Collector receives spans via OTLP (OpenTelemetry Protocol), applies processors (batch, attribute filtering, tail sampling), and exports to the backend. Running a Collector per node (DaemonSet) keeps trace data local until batched, reducing network calls.
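A minimal sketch of span creation and W3C context propagation using the OpenTelemetry Go SDK (service names and the endpoint are placeholders):

```go
package tracing

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Use W3C TraceContext (the traceparent header) for propagation.
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// callServiceB starts a child span in service A and injects the trace
// context into the outbound request headers so B can continue the trace.
func callServiceB(ctx context.Context) error {
	ctx, span := otel.Tracer("service-a").Start(ctx, "call-service-b")
	defer span.End()

	req, err := http.NewRequestWithContext(ctx, "GET", "http://service-b/api", nil)
	if err != nil {
		return err
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

	_, err = http.DefaultClient.Do(req)
	return err
}

// handleB extracts the incoming context in service B before starting its
// own span, which makes it a child of A's span in the same trace.
func handleB(w http.ResponseWriter, r *http.Request) {
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	_, span := otel.Tracer("service-b").Start(ctx, "handle-request")
	defer span.End()

	w.WriteHeader(http.StatusOK)
}
```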
Tail-Based Sampler
The tail sampler buffers all spans for each in-flight trace (for up to 30 seconds, the max expected trace duration). When the root span arrives (completing the trace), the sampler evaluates sampling rules: sample 100% if any span has error=true or status=ERROR; sample 100% if trace duration > 5 seconds (slow traces); otherwise sample 1%. The challenge: spans for a single trace arrive from many services and must be routed to the same sampler instance. Consistent hashing on trace_id routes spans to the correct sampler shard. If a sampler shard crashes, its in-flight trace data is lost — acceptable, as these are sampled traces, not critical data.
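The decision rules themselves are small; the operational difficulty is the buffering and routing around them. A sketch of the policy described above, with this design's thresholds:

```go
package sampler

import (
	"math/rand"
	"time"
)

// Span is the minimal view the sampler needs per buffered span.
type Span struct {
	Error    bool
	Start    time.Time
	Duration time.Duration
}

// keepTrace applies the tail-sampling policy once all spans for a
// trace have arrived (or the 30 s buffer timeout fires).
func keepTrace(spans []Span) bool {
	var start, end time.Time
	for i, s := range spans {
		if s.Error {
			return true // 100% of error traces
		}
		if i == 0 || s.Start.Before(start) {
			start = s.Start
		}
		if e := s.Start.Add(s.Duration); e.After(end) {
			end = e
		}
	}
	if end.Sub(start) > 5*time.Second {
		return true // 100% of slow traces
	}
	return rand.Float64() < 0.01 // 1% of everything else
}
```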
Trace Storage (Cassandra)
Cassandra's data model for traces: a primary traces table partitioned by trace_id, with one row per span clustered by start_time. A lookup by trace_id fetches all spans for the trace in a single partition scan, O(spans_in_trace) I/O. A secondary table serves service search: service_operations, partitioned by (service_name, operation_name) with start_time and trace_id as clustering columns, enables efficient queries like "find all traces for service=auth, operation=login in the last 1 hour." A separate dependencies table stores computed service dependency graphs (aggregated periodically from trace data).
Database Design
Cassandra stores raw span data, optimized for write throughput and trace_id lookup. Elasticsearch stores a search index with one document per trace (not per span): {trace_id, root_service, root_operation, duration_ms, error, start_time, services[], tags{}}. This index enables multi-field search but doesn't store the full span tree — trace details come from Cassandra after the trace_id is found via Elasticsearch. This two-store architecture separates search (Elasticsearch) from retrieval (Cassandra), each optimized for its workload.
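As a concrete shape, a minimal sketch of that per-trace document (field names mirror the JSON above; the actual index mapping is omitted):

```go
package search

import "time"

// TraceDoc is the one-document-per-trace record indexed in Elasticsearch.
// It carries enough to search and rank traces; full span trees stay in Cassandra.
type TraceDoc struct {
	TraceID       string            `json:"trace_id"`
	RootService   string            `json:"root_service"`
	RootOperation string            `json:"root_operation"`
	DurationMs    int64             `json:"duration_ms"`
	Error         bool              `json:"error"`
	StartTime     time.Time         `json:"start_time"`
	Services      []string          `json:"services"`
	Tags          map[string]string `json:"tags"`
}
```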
Span data is written to Cassandra with TTL matching the retention policy (7 days = 604,800 seconds). Cassandra's compaction strategy is TWCS (TimeWindowCompactionStrategy), which groups SSTables by time window and compacts within windows — ideal for time-series data where old data is never updated and expires via TTL.
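Putting the schema and the storage options together, a sketch of the DDL applied through the gocql driver (column names and the one-hour compaction window are assumptions, not Jaeger's production schema):

```go
package storage

import "github.com/gocql/gocql"

// Illustrative DDL for the two tables described above. TWCS compacts
// SSTables within one-hour windows; default_time_to_live enforces the
// 7-day retention (604,800 s) without explicit deletes.
const (
	createTraces = `
CREATE TABLE IF NOT EXISTS traces (
    trace_id    text,
    span_id     text,
    service     text,
    operation   text,
    start_time  timestamp,
    duration_us bigint,
    tags        map<text, text>,
    PRIMARY KEY (trace_id, start_time, span_id)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'HOURS',
                     'compaction_window_size': '1'}
  AND default_time_to_live = 604800`

	createSearch = `
CREATE TABLE IF NOT EXISTS service_operations (
    service_name   text,
    operation_name text,
    start_time     timestamp,
    trace_id       text,
    PRIMARY KEY ((service_name, operation_name), start_time, trace_id)
) WITH CLUSTERING ORDER BY (start_time DESC, trace_id ASC)`
)

func migrate(session *gocql.Session) error {
	for _, stmt := range []string{createTraces, createSearch} {
		if err := session.Query(stmt).Exec(); err != nil {
			return err
		}
	}
	return nil
}
```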
API Design
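The read path exposes two query surfaces matching the two stores. A minimal sketch of the endpoints (paths and parameters are illustrative, loosely modeled on Jaeger's query API):
- GET /api/traces/{trace_id}: fetch all spans for one trace (a Cassandra point lookup)
- GET /api/traces?service=&operation=&tag=&minDuration=&start=&end=&limit=: search the Elasticsearch index for matching trace IDs, then hydrate full span trees from Cassandra
- GET /api/dependencies?start=&end=: return the aggregated service dependency graph
- Ingest is not a public API: Collectors push spans to the backend over OTLP (gRPC/HTTP)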
Scaling & Bottlenecks
The tail-based sampler is the most operationally complex component. Each sampler instance must buffer every span for the in-flight traces in its hash range before the sampling decision fires, so memory scales with (concurrent_traces × avg_spans × avg_span_size) at the full ingest rate, not the post-sampling rate. At 1 million spans/sec ingest × 30 s max trace duration × 1 KB = ~30 GB of buffered spans across the sampler tier, or ~3 GB per instance with 10 shards. Mitigation: evict traces after a configurable timeout; use off-heap storage (RocksDB) for the span buffer to avoid GC pressure.
Cassandra write throughput scales linearly with nodes. At 100,000 spans/sec × 1 KB = 100 MB/sec, a 10-node Cassandra cluster with NVMe SSDs handles this comfortably. The bottleneck shifts to read throughput during high query load; raising the replication factor from 3 to 5 gives each partition more replicas that can serve reads, at the cost of roughly 67% more storage and extra write amplification.
Key Trade-offs
- Head-based vs. tail-based sampling: Head-based is simple (no buffering needed, decision at trace start) but can't selectively sample slow/error traces; tail-based enables intelligent sampling but requires buffering and consistent routing
- Cassandra vs. ClickHouse for span storage: Cassandra excels at trace_id point lookups; ClickHouse's columnar storage is better for analytics ("p99 latency of operation X across all traces") but slower for single-trace retrieval
- Span granularity: Too many spans (every function call) overwhelms storage and adds SDK overhead; too few misses performance bottlenecks; teams should instrument at service boundaries and explicit performance-critical operations
- Agent vs. agentless instrumentation: OTel SDK (in-process) captures rich application context but requires code changes; eBPF-based tracing (no code changes, kernel-level) captures network calls but misses application-level context like SQL queries and cache operations