
System Design: Streaming Analytics (Viewer Metrics)

System design of a streaming analytics platform for video viewer metrics, covering real-time event ingestion, stream processing, time-series storage, and dashboard visualization.


Requirements

Functional Requirements:

  • Ingest viewer events: play, pause, seek, buffer, quality change, complete, and heartbeat (every 30 seconds)
  • Real-time dashboards showing concurrent viewers, play count, average watch duration, and rebuffer rate
  • Historical analytics: daily/weekly/monthly aggregates per video, channel, and platform-wide
  • Audience demographics: views by country, device type, OS, and browser
  • Content performance: engagement rate, average percentage watched, peak concurrent viewers
  • Alerting: notify creators and ops when metrics cross thresholds (e.g., rebuffer rate > 2%)

Non-Functional Requirements:

  • Ingest 10 billion events per day (115K events/sec average, 500K/sec peak)
  • Real-time metrics available within 30 seconds of the event
  • Historical queries return results within 2 seconds for any time range
  • 99.9% data completeness — no more than 0.1% event loss
  • Retain raw events for 30 days, aggregated metrics for 3 years
  • Multi-tenant: each creator sees only their own data

Scale Estimation

  • Raw events: 10 billion events/day at an average of 500 bytes per event = 5TB/day of raw event data; 30-day retention of raw events = 150TB
  • Aggregated metrics (pre-computed rollups): 500M videos × 365 days × 50 metrics × 8 bytes = 73TB/year
  • Concurrent viewers counter: updated 500K times/sec (heartbeat events)
  • Dashboard queries: 10M creator dashboard loads/day = 116 queries/sec, each scanning aggregated data
  • Alerting: 500K active channels monitored for 10 alert rules each = 5M alert evaluations per minute

High-Level Architecture

The analytics platform follows a Lambda-like architecture with a speed layer (real-time) and a batch layer (historical), unified by a serving layer. The Ingestion Layer collects events from video players via a lightweight HTTP endpoint (the browser Beacon API, i.e. navigator.sendBeacon). Events are batched client-side (up to 10 events or 30 seconds, whichever comes first) and sent as a JSON array. The Beacon Collector (a horizontally scaled Go service behind an ALB) validates events, enriches them with server-side context (IP geolocation, user-agent parsing), and publishes to Kafka.

The Speed Layer processes events in real time using Apache Flink. Flink jobs compute windowed aggregations: concurrent viewers per video (derived from heartbeat events and refreshed every 10 seconds), play count per minute, average buffer ratio over a sliding 5-minute window, and quality-of-experience (QoE) scores. Results are written to Redis for real-time dashboard serving and to a time-series database (Apache Druid) for queryable real-time analytics.

The Batch Layer runs hourly and daily Spark jobs that read raw events from S3 (Kafka → S3 via a Kafka Connect S3 sink). Spark computes the authoritative aggregations (unique-viewer counts via HyperLogLog sketches, percentile watch times, cohort analysis) and writes results to Druid's historical segments. The Serving Layer provides a unified query API that transparently merges real-time (Druid real-time segments) and historical (Druid historical segments) data for seamless time-range queries.

Core Components

Event Ingestion Pipeline

The Beacon Collector is a stateless Go service optimized for high-throughput event ingestion. Each instance handles 50K events/sec using a non-blocking architecture (goroutines + channel-based buffering). Events are validated (schema check, timestamp sanity, required fields), enriched (MaxMind GeoIP lookup, device/OS parsing via ua-parser), and serialized to Avro (schema-registered with Confluent Schema Registry for evolution). Batches of 1,000 events are published to Kafka using the async producer with acks=1 (leader acknowledgment); this trades a small durability risk during leader failover for throughput, a risk accepted within the 0.1% event-loss budget. A separate validation pipeline samples 1% of events for quality monitoring.
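
A minimal sketch of the collector's producer setup, assuming the sarama Kafka client; the topic name (viewer-events), broker list, and flush thresholds are illustrative, and the validation/enrichment/Avro steps are elided:

```go
package ingest

import (
	"log"
	"time"

	"github.com/IBM/sarama"
)

// NewEventProducer configures an async producer tuned for throughput:
// leader-only acks and large flush batches, matching the acks=1 setup above.
func NewEventProducer(brokers []string) (sarama.AsyncProducer, error) {
	cfg := sarama.NewConfig()
	cfg.Producer.RequiredAcks = sarama.WaitForLocal // acks=1: leader acknowledgment only
	cfg.Producer.Compression = sarama.CompressionSnappy
	cfg.Producer.Flush.Messages = 1000 // flush in batches of ~1,000 events
	cfg.Producer.Flush.Frequency = 100 * time.Millisecond
	return sarama.NewAsyncProducer(brokers, cfg)
}

// Publish keys each event by video_id so all events for a video land on the
// same partition, preserving per-video ordering.
func Publish(p sarama.AsyncProducer, videoID string, avroPayload []byte) {
	p.Input() <- &sarama.ProducerMessage{
		Topic: "viewer-events", // assumed topic name
		Key:   sarama.StringEncoder(videoID),
		Value: sarama.ByteEncoder(avroPayload),
	}
}

// DrainErrors consumes delivery failures from the async producer; they must
// be drained or the producer eventually blocks.
func DrainErrors(p sarama.AsyncProducer) {
	for err := range p.Errors() {
		log.Printf("event publish failed: %v", err)
	}
}
```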

Real-Time Stream Processing (Flink)

The Flink cluster runs four primary jobs:

  • Concurrent Viewers: consumes heartbeat events and maintains a session window per (video_id, user_id) with a 60-second gap timeout; the count of active sessions per video_id is the concurrent viewer count, written to Redis every 10 seconds
  • Play Counter: tumbling 1-minute window counting distinct play events per video_id; results written to Druid for minute-granularity time series
  • QoE Aggregator: sliding 5-minute window computing average rebuffer ratio, average startup time, and bitrate distribution per video_id; results written to Druid and Redis (for alerting)
  • Geographic Heatmap: tumbling 1-minute window aggregating viewer counts by country and city per video_id; written to Druid
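
The production jobs run in Flink; purely as an illustration of the session-window logic behind the Concurrent Viewers job, here is a minimal in-memory Go sketch (the type names and single-process state are assumptions, not the Flink implementation):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type sessionKey struct{ videoID, userID string }

// viewerCounter mimics the session-window logic: a session is active if its
// last heartbeat arrived within the 60-second gap timeout.
type viewerCounter struct {
	mu            sync.Mutex
	lastHeartbeat map[sessionKey]time.Time
}

func newViewerCounter() *viewerCounter {
	return &viewerCounter{lastHeartbeat: map[sessionKey]time.Time{}}
}

// OnHeartbeat is called for every heartbeat event consumed from Kafka.
func (c *viewerCounter) OnHeartbeat(videoID, userID string, at time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastHeartbeat[sessionKey{videoID, userID}] = at
}

// Snapshot expires sessions older than the gap and returns concurrent viewers
// per video; in the real pipeline this result is flushed to Redis every 10s.
func (c *viewerCounter) Snapshot(now time.Time, gap time.Duration) map[string]int {
	c.mu.Lock()
	defer c.mu.Unlock()
	counts := map[string]int{}
	for k, ts := range c.lastHeartbeat {
		if now.Sub(ts) > gap {
			delete(c.lastHeartbeat, k) // session ended: gap timeout exceeded
			continue
		}
		counts[k.videoID]++
	}
	return counts
}

func main() {
	c := newViewerCounter()
	c.OnHeartbeat("video-42", "user-1", time.Now())
	c.OnHeartbeat("video-42", "user-2", time.Now())
	fmt.Println(c.Snapshot(time.Now(), 60*time.Second)) // map[video-42:2]
}
```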

Time-Series Analytics (Druid)

Apache Druid serves as the OLAP engine for both real-time and historical queries. Druid's architecture (real-time nodes ingesting from Kafka + historical nodes serving batch-loaded segments from S3) provides sub-second query latency across time ranges from seconds to years. The primary datasource schema: timestamp, video_id, channel_id, event_type, country, device_type, os, browser, bitrate_kbps, buffer_ratio, watch_duration_ms. Dimensions are indexed for fast filtering; metrics (count, sum_watch_duration, sum_buffer_events) are pre-aggregated at ingestion. Rollup reduces raw event volume by 10x for historical storage.
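
For example, a minute-granularity play-count series for one video can be fetched with a Druid native timeseries query against the broker's /druid/v2 endpoint; the datasource name (viewer_events) and the rollup count metric referenced below are assumptions based on the schema above:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// queryPlaysPerMinute issues a Druid native "timeseries" query; the broker
// transparently merges real-time and historical segments for the interval.
func queryPlaysPerMinute(brokerURL, videoID string) (string, error) {
	body := fmt.Sprintf(`{
	  "queryType": "timeseries",
	  "dataSource": "viewer_events",
	  "granularity": "minute",
	  "intervals": ["2025-01-15T00:00:00Z/2025-01-15T01:00:00Z"],
	  "filter": {
	    "type": "and",
	    "fields": [
	      { "type": "selector", "dimension": "video_id", "value": %q },
	      { "type": "selector", "dimension": "event_type", "value": "play" }
	    ]
	  },
	  "aggregations": [
	    { "type": "longSum", "name": "plays", "fieldName": "count" }
	  ]
	}`, videoID)

	resp, err := http.Post(brokerURL+"/druid/v2", "application/json", bytes.NewBufferString(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	return string(out), err
}

func main() {
	res, err := queryPlaysPerMinute("http://druid-broker:8082", "video-42")
	if err != nil {
		panic(err)
	}
	fmt.Println(res) // one row per minute bucket: {"timestamp": ..., "result": {"plays": ...}}
}
```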

Database Design

Kafka serves as the durable event log with 30-day retention across 200 partitions (partitioned by video_id hash for ordered processing). Raw events are archived to S3 in Parquet format via Kafka Connect S3 sink (hourly flushes, partitioned by date/hour). S3 storage uses Glacier transition after 30 days for cost optimization.

Druid stores aggregated metrics in segments. Real-time segments (last 2 hours) are held in memory on real-time nodes. Historical segments (older than 2 hours) are persisted to S3 (deep storage) and pre-loaded onto historical nodes for query serving. Segment granularity: MINUTE for the last 24 hours, HOUR for the last 30 days, DAY for older data. This tiered granularity balances query resolution with storage efficiency. Redis stores ephemeral real-time counters: concurrent:{video_id} (integer counter), plays:minute:{video_id}:{minute_ts} (counter with 2-hour TTL), and channel-level aggregates for the creator dashboard.
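
A sketch of the Redis writes behind these keys, assuming the go-redis client; the concurrent gauge is overwritten by the Flink job every 10 seconds and the per-minute play counter carries the 2-hour TTL:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// publishRealtimeCounters writes the ephemeral counters the dashboard reads.
func publishRealtimeCounters(ctx context.Context, rdb *redis.Client, videoID string,
	concurrent, playsThisMinute, minuteTS int64) error {

	// Gauge: overwritten every 10 seconds by the concurrent-viewers job.
	if err := rdb.Set(ctx, fmt.Sprintf("concurrent:%s", videoID), concurrent, 0).Err(); err != nil {
		return err
	}
	// Counter: incremented per minute bucket, expired after 2 hours.
	key := fmt.Sprintf("plays:minute:%s:%d", videoID, minuteTS)
	if err := rdb.IncrBy(ctx, key, playsThisMinute).Err(); err != nil {
		return err
	}
	return rdb.Expire(ctx, key, 2*time.Hour).Err()
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	minute := time.Now().Truncate(time.Minute).Unix()
	if err := publishRealtimeCounters(context.Background(), rdb, "video-42", 1523, 87, minute); err != nil {
		panic(err)
	}
}
```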

API Design

  • POST /api/v1/events/beacon — Ingest a batch of viewer events; body is a JSON array of event objects; returns 204 No Content (fire-and-forget)
  • GET /api/v1/analytics/realtime/{video_id} — Fetch real-time metrics (concurrent viewers, plays last 5 min, QoE score); served from Redis
  • POST /api/v1/analytics/query — Query historical analytics; the request body contains video_id or channel_id, time_range, metrics[], dimensions[], granularity; returns Druid query results (see the example body after this list)
  • POST /api/v1/alerts/rules — Create an alert rule; body contains channel_id, metric, threshold, operator, notification_channel (email/webhook)
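
A sketch of an assumed JSON body for the analytics query endpoint, expressed as Go types; the exact field names and shape are illustrative, not a confirmed contract:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AnalyticsQuery is an assumed request body for POST /api/v1/analytics/query.
type AnalyticsQuery struct {
	ChannelID   string    `json:"channel_id,omitempty"`
	VideoID     string    `json:"video_id,omitempty"`
	TimeRange   [2]string `json:"time_range"`           // [start, end] in ISO-8601
	Metrics     []string  `json:"metrics"`              // e.g. ["plays", "avg_watch_duration_ms"]
	Dimensions  []string  `json:"dimensions,omitempty"` // e.g. ["country", "device_type"]
	Granularity string    `json:"granularity"`          // "minute" | "hour" | "day"
}

func main() {
	q := AnalyticsQuery{
		VideoID:     "video-42",
		TimeRange:   [2]string{"2025-01-01T00:00:00Z", "2025-01-15T00:00:00Z"},
		Metrics:     []string{"plays", "rebuffer_rate"},
		Dimensions:  []string{"country"},
		Granularity: "day",
	}
	body, _ := json.MarshalIndent(q, "", "  ")
	fmt.Println(string(body))
}
```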

Scaling & Bottlenecks

Kafka ingestion at 500K events/sec is the primary throughput challenge. With 200 partitions, each partition handles 2,500 events/sec — well within Kafka's per-partition throughput. The Kafka cluster uses 20 brokers with NVMe SSDs for low-latency writes. Consumer lag monitoring (Burrow) ensures Flink consumers keep up; if lag exceeds 1 minute, an auto-scaling trigger adds Flink task managers. The Beacon Collector fleet scales horizontally behind an ALB; each instance's throughput is bounded by Kafka producer throughput (~50K events/sec per instance), so 10 instances handle the 500K/sec peak.

Druid query performance is the read-path bottleneck. Complex queries spanning months of data across millions of videos can be slow. The solution is pre-computed rollups: a Spark job computes daily/weekly/monthly aggregates per channel and writes them to a separate Druid datasource with coarser granularity. Dashboard queries hit the rollup datasource for long time ranges and the real-time datasource for the last 24 hours. A query cache (Redis, 5-minute TTL) prevents redundant Druid queries for frequently accessed dashboards.
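
A sketch of the cache-aside pattern for the query cache, assuming go-redis; the key derivation (SHA-256 of the query body) and the runDruidQuery callback are illustrative:

```go
package serving

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"time"

	"github.com/redis/go-redis/v9"
)

// cachedQuery implements cache-aside: return a cached Druid result if present,
// otherwise run the query and store the result with a 5-minute TTL.
func cachedQuery(ctx context.Context, rdb *redis.Client, queryBody string,
	runDruidQuery func(context.Context, string) (string, error)) (string, error) {

	sum := sha256.Sum256([]byte(queryBody))
	key := "querycache:" + hex.EncodeToString(sum[:])

	if cached, err := rdb.Get(ctx, key).Result(); err == nil {
		return cached, nil // cache hit: skip Druid entirely
	}

	result, err := runDruidQuery(ctx, queryBody)
	if err != nil {
		return "", err
	}
	// Best-effort write; a failed cache write should not fail the request.
	_ = rdb.Set(ctx, key, result, 5*time.Minute).Err()
	return result, nil
}
```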

Key Trade-offs

  • Flink over Spark Streaming for real-time: Flink's true streaming model (event-at-a-time with exactly-once semantics) provides lower latency than Spark Streaming's micro-batch approach — critical for sub-30-second metric freshness
  • Druid over ClickHouse: Druid's native real-time ingestion from Kafka and time-partitioned architecture is purpose-built for this use case; ClickHouse offers better SQL support but requires a separate real-time ingestion layer — Druid wins for OLAP on streaming data
  • Client-side event batching vs per-event POST: Batching 10 events reduces HTTP request volume by 10x, saving server and network resources — the trade-off is up to 30 seconds of additional latency for the last events in a batch, acceptable for analytics
  • HyperLogLog for unique viewers vs exact count: HLL provides 99.7% accuracy with 12KB memory per counter vs exact distinct requiring unbounded memory — the 0.3% error is acceptable for analytics dashboards; exact counts are used only for billing-critical metrics
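
As a concrete illustration of the HyperLogLog approach (here using Redis's built-in HLL commands, which also keep each counter at roughly 12KB; the key pattern is an assumption):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ctx := context.Background()

	// Each viewer ID is added to a per-video HyperLogLog; duplicates are
	// absorbed, and the counter stays ~12KB regardless of cardinality.
	key := "unique_viewers:video-42:2025-01-15" // assumed key pattern
	if err := rdb.PFAdd(ctx, key, "user-1", "user-2", "user-1").Err(); err != nil {
		panic(err)
	}

	// Approximate count of distinct viewers.
	count, err := rdb.PFCount(ctx, key).Result()
	if err != nil {
		panic(err)
	}
	fmt.Println("approx unique viewers:", count) // 2
}
```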
