System Design: Distributed Tracing System
Design a distributed tracing system like Jaeger or Zipkin that correlates requests across microservices, identifies performance bottlenecks, and provides flame graph visualizations at high-throughput scale.
Requirements
Functional Requirements:
- Correlate a single user request across all microservices it touches using a trace ID
- Record spans (units of work within a service) with start time, duration, tags, and logs
- Propagate trace context across service boundaries via HTTP headers (W3C TraceContext) and message queue metadata
- Search traces by service, operation, duration, error status, and custom tags
- Visualize traces as waterfall/Gantt charts showing span timing relationships
- Compute service dependency graphs and latency percentiles per operation
Non-Functional Requirements:
- Ingest 1 million spans/sec with end-to-end latency from emission to queryability under 30 seconds
- Sampling: collect 100% of error traces, 1% of success traces — reducing storage by 100x
- Query response time under 3 seconds for a trace lookup by trace ID
- Retain traces for 7 days hot, 30 days warm
Scale Estimation
A system with 1,000 services handling an aggregate of 1 million requests/sec, with an average of 10 spans per trace, generates 10 million spans/sec before sampling. At a 1% sample rate: 100,000 spans/sec stored. Each span: ~1 KB (operation name, tags, logs, timestamps) = 100 MB/sec = 8.6 TB/day. At 7-day retention: 60 TB of hot trace storage. Error traces (100% sampled): at 1 million traces/sec, a 0.1% error rate yields 1,000 error traces/sec × 10 spans/trace = 10,000 additional spans/sec, about a 10% overhead on top of the success sample. Query load: 1,000 engineers × 10 trace lookups/minute = ~167 lookups/sec.
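A quick sanity check of this arithmetic (the constants are this estimate's assumptions; the last two figures include the error-trace overhead, so they land slightly above the success-only 8.6 TB/day):

```go
package main

import "fmt"

func main() {
	const (
		tracesPerSec  = 1_000_000 // aggregate requests/sec (assumed)
		spansPerTrace = 10
		spanBytes     = 1_000 // ~1 KB per span
		successSample = 0.01  // 1% of success traces kept
		errorRate     = 0.001 // 0.1% of traces fail, kept at 100%
	)

	rawSpans := tracesPerSec * spansPerTrace                        // 10M spans/sec before sampling
	sampled := float64(rawSpans) * successSample                    // 100K spans/sec from the 1% sample
	errorSpans := float64(tracesPerSec) * errorRate * spansPerTrace // 10K spans/sec from error traces

	bytesPerSec := (sampled + errorSpans) * spanBytes
	tbPerDay := bytesPerSec * 86_400 / 1e12

	fmt.Printf("raw: %d spans/s  stored: %.0f spans/s  %.1f TB/day  %.0f TB hot (7d)\n",
		rawSpans, sampled+errorSpans, tbPerDay, tbPerDay*7)
}
```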
High-Level Architecture
The tracing system has three phases: instrumentation, collection, and storage/query. Instrumentation is handled by OpenTelemetry SDKs embedded in each service. The SDK creates spans, samples them, and batches/exports them to a local OpenTelemetry Collector (running as a sidecar or DaemonSet). The Collector performs local processing (attribute filtering, sampling decisions) and forwards to the central tracing backend.
The central backend receives spans from thousands of Collectors, writes them to a message queue (Kafka) for buffering, processes them (span aggregation, trace assembly), and writes complete traces to the storage backend. Cassandra is a common storage choice (Jaeger uses it): its wide-column model maps naturally to trace data (partition key = trace_id, clustering columns = span timing). Elasticsearch is used for the search index (find traces by service/operation/tag).
Tail-based sampling is the key architectural challenge. Head-based sampling (decide at trace start whether to sample) is simple but can't selectively sample error traces (the error hasn't happened yet). Tail-based sampling (decide after the trace is complete) requires buffering all spans for an in-flight trace until it's complete — a collector must see all spans for a trace to make the sampling decision. This requires routing all spans for a trace_id to the same collector (using consistent hashing on trace_id).
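A sketch of that routing step, assuming a simple hash ring with virtual nodes (the shard names and vnode count are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring maps trace IDs to sampler shards via consistent hashing,
// so every span of a trace lands on the same shard.
type Ring struct {
	points []uint32          // sorted hash positions on the ring
	owner  map[uint32]string // hash position -> shard name
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places each shard at vnodes positions so load stays even
// when shards join or leave.
func NewRing(shards []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, s := range shards {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("%s#%d", s, v))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// ShardFor returns the shard owning the first ring point at or after
// the trace ID's hash, wrapping around at the end of the ring.
func (r *Ring) ShardFor(traceID string) string {
	h := hash32(traceID)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"sampler-0", "sampler-1", "sampler-2"}, 64)
	fmt.Println(ring.ShardFor("4bf92f3577b34da6a3ce929d0e0e4736"))
}
```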
Core Components
OpenTelemetry SDK & Collector
The OTel SDK in each service creates spans via a Tracer. Spans are nested: a root span per request, child spans for each downstream call or significant operation. Context propagation: when service A calls service B, A injects the trace context (traceparent header with trace_id and span_id) into the HTTP/gRPC request; B's SDK extracts it and creates a child span with A's span as the parent. The OTel Collector receives spans via OTLP (OpenTelemetry Protocol), applies processors (batch, attribute filtering, tail sampling), and exports to the backend. Running a Collector per node (DaemonSet) keeps trace data local until batched, reducing network calls.
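A minimal sketch of span creation and W3C context propagation using the OpenTelemetry Go SDK (service names and the endpoint are placeholders):

```go
package tracing

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Use W3C TraceContext (the traceparent header) for propagation.
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// callServiceB starts a child span in service A and injects the trace
// context into the outbound request headers so B can continue the trace.
func callServiceB(ctx context.Context) error {
	ctx, span := otel.Tracer("service-a").Start(ctx, "call-service-b")
	defer span.End()

	req, err := http.NewRequestWithContext(ctx, "GET", "http://service-b/api", nil)
	if err != nil {
		return err
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

	_, err = http.DefaultClient.Do(req)
	return err
}

// handleB extracts the incoming context in service B before starting its
// own span, which makes it a child of A's span in the same trace.
func handleB(w http.ResponseWriter, r *http.Request) {
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	_, span := otel.Tracer("service-b").Start(ctx, "handle-request")
	defer span.End()

	w.WriteHeader(http.StatusOK)
}
```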
Tail-Based Sampler
The tail sampler buffers all spans for each in-flight trace (for up to 30 seconds, the max expected trace duration). When the root span arrives (completing the trace), the sampler evaluates sampling rules: sample 100% if any span has error=true or status=ERROR; sample 100% if trace duration > 5 seconds (slow traces); otherwise sample 1%. The challenge: spans for a single trace arrive from many services and must be routed to the same sampler instance. Consistent hashing on trace_id routes spans to the correct sampler shard. If a sampler shard crashes, its in-flight trace data is lost — acceptable, as these are sampled traces, not critical data.
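The decision rules themselves are small; the operational difficulty is the buffering and routing around them. A sketch of the policy described above, with this design's thresholds:

```go
package sampler

import (
	"math/rand"
	"time"
)

// Span is the minimal view the sampler needs per buffered span.
type Span struct {
	Error    bool
	Start    time.Time
	Duration time.Duration
}

// keepTrace applies the tail-sampling policy once all spans for a
// trace have arrived (or the 30 s buffer timeout fires).
func keepTrace(spans []Span) bool {
	var start, end time.Time
	for i, s := range spans {
		if s.Error {
			return true // 100% of error traces
		}
		if i == 0 || s.Start.Before(start) {
			start = s.Start
		}
		if e := s.Start.Add(s.Duration); e.After(end) {
			end = e
		}
	}
	if end.Sub(start) > 5*time.Second {
		return true // 100% of slow traces
	}
	return rand.Float64() < 0.01 // 1% of everything else
}
```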
Trace Storage (Cassandra)
Cassandra's data model for traces: a primary traces table partitioned by trace_id, with one row per span clustered by start_time. A lookup by trace_id fetches all spans for the trace in a single partition scan, O(spans_in_trace) I/O. A secondary table serves service search: service_operations, partitioned by (service_name, operation_name) with start_time and trace_id as clustering columns, enables efficient queries like "find all traces for service=auth, operation=login in the last 1 hour." A separate dependencies table stores computed service dependency graphs (aggregated periodically from trace data).
Database Design
Cassandra stores raw span data, optimized for write throughput and trace_id lookup. Elasticsearch stores a search index with one document per trace (not per span): {trace_id, root_service, root_operation, duration_ms, error, start_time, services[], tags{}}. This index enables multi-field search but doesn't store the full span tree — trace details come from Cassandra after the trace_id is found via Elasticsearch. This two-store architecture separates search (Elasticsearch) from retrieval (Cassandra), each optimized for its workload.
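As a concrete shape, a minimal sketch of that per-trace document (field names mirror the JSON above; the actual index mapping is omitted):

```go
package search

import "time"

// TraceDoc is the one-document-per-trace record indexed in Elasticsearch.
// It carries enough to search and rank traces; full span trees stay in Cassandra.
type TraceDoc struct {
	TraceID       string            `json:"trace_id"`
	RootService   string            `json:"root_service"`
	RootOperation string            `json:"root_operation"`
	DurationMs    int64             `json:"duration_ms"`
	Error         bool              `json:"error"`
	StartTime     time.Time         `json:"start_time"`
	Services      []string          `json:"services"`
	Tags          map[string]string `json:"tags"`
}
```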
Span data is written to Cassandra with TTL matching the retention policy (7 days = 604,800 seconds). Cassandra's compaction strategy is TWCS (TimeWindowCompactionStrategy), which groups SSTables by time window and compacts within windows — ideal for time-series data where old data is never updated and expires via TTL.
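Putting the schema and the storage options together, a sketch of the DDL applied through the gocql driver (column names and the one-hour compaction window are assumptions, not Jaeger's production schema):

```go
package storage

import "github.com/gocql/gocql"

// Illustrative DDL for the two tables described above. TWCS compacts
// SSTables within one-hour windows; default_time_to_live enforces the
// 7-day retention (604,800 s) without explicit deletes.
const (
	createTraces = `
CREATE TABLE IF NOT EXISTS traces (
    trace_id    text,
    span_id     text,
    service     text,
    operation   text,
    start_time  timestamp,
    duration_us bigint,
    tags        map<text, text>,
    PRIMARY KEY (trace_id, start_time, span_id)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'HOURS',
                     'compaction_window_size': '1'}
  AND default_time_to_live = 604800`

	createSearch = `
CREATE TABLE IF NOT EXISTS service_operations (
    service_name   text,
    operation_name text,
    start_time     timestamp,
    trace_id       text,
    PRIMARY KEY ((service_name, operation_name), start_time, trace_id)
) WITH CLUSTERING ORDER BY (start_time DESC, trace_id ASC)`
)

func migrate(session *gocql.Session) error {
	for _, stmt := range []string{createTraces, createSearch} {
		if err := session.Query(stmt).Exec(); err != nil {
			return err
		}
	}
	return nil
}
```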
API Design
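The read path exposes two query surfaces matching the two stores. A minimal sketch of the endpoints (paths and parameters are illustrative, loosely modeled on Jaeger's query API):
- GET /api/traces/{trace_id}: fetch all spans for one trace (a Cassandra point lookup)
- GET /api/traces?service=&operation=&tag=&minDuration=&start=&end=&limit=: search the Elasticsearch index for matching trace IDs, then hydrate full span trees from Cassandra
- GET /api/dependencies?start=&end=: return the aggregated service dependency graph
- Ingest is not a public API: Collectors push spans to the backend over OTLP (gRPC/HTTP)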
Scaling & Bottlenecks
The tail-based sampler is the most operationally complex component. Each sampler instance must buffer every span for the in-flight traces in its hash range before the sampling decision fires, so memory scales with (concurrent_traces × avg_spans × avg_span_size) at the full ingest rate, not the post-sampling rate. At 1 million spans/sec ingest × 30 s max trace duration × 1 KB = ~30 GB of buffered spans across the sampler tier, or ~3 GB per instance with 10 shards. Mitigation: evict traces after a configurable timeout; use off-heap storage (RocksDB) for the span buffer to avoid GC pressure.
Cassandra write throughput scales linearly with nodes. At 100,000 spans/sec × 1 KB = 100 MB/sec, a 10-node Cassandra cluster with NVMe SSDs handles this comfortably. The bottleneck shifts to read throughput during high query load; raising the replication factor from 3 to 5 gives each partition more replicas that can serve reads, at the cost of roughly 67% more storage and extra write amplification.
Key Trade-offs
- Head-based vs. tail-based sampling: Head-based is simple (no buffering needed, decision at trace start) but can't selectively sample slow/error traces; tail-based enables intelligent sampling but requires buffering and consistent routing
- Cassandra vs. ClickHouse for span storage: Cassandra excels at trace_id point lookups; ClickHouse's columnar storage is better for analytics ("p99 latency of operation X across all traces") but slower for single-trace retrieval
- Span granularity: Too many spans (every function call) overwhelms storage and adds SDK overhead; too few misses performance bottlenecks; teams should instrument at service boundaries and explicit performance-critical operations
- Agent vs. agentless instrumentation: OTel SDK (in-process) captures rich application context but requires code changes; eBPF-based tracing (no code changes, kernel-level) captures network calls but misses application-level context like SQL queries and cache operations