System Design: Real-Time Analytics Dashboard
Design a real-time analytics dashboard that displays live metrics, KPIs, and event-driven charts with roughly one-second update latency for a million concurrent viewers. Covers streaming aggregation, WebSocket push, caching, and multi-tenant data isolation.
Requirements
Functional Requirements:
- Display real-time metrics: page views, active users, revenue, error rates updated every second
- Support configurable chart types: line charts, bar charts, heat maps, and funnel visualizations
- Allow users to define custom metrics using a SQL-like query builder
- Multi-tenant: each organization sees only its own data
- Support date range filtering, dimension drilldowns, and segment comparisons
- Alert users via UI notification when a metric crosses a user-defined threshold
Non-Functional Requirements:
- Dashboard updates delivered to clients within 1 second of event occurrence
- Support 1 million concurrent WebSocket connections
- Dashboard API responds in under 100ms for pre-aggregated metrics
- Historical query API responds in under 3 seconds for 30-day window queries
- 99.9% availability; partial degradation acceptable (show cached data if real-time pipeline is down)
Scale Estimation
With 1 million concurrent viewers each receiving updates every second, the WebSocket layer pushes 1 million messages/second. Each message averages 500 bytes, or roughly 500 MB/s of outbound bandwidth. Backend event ingestion at 500,000 events/second from tracked applications feeds streaming aggregation. Pre-aggregated metric storage: 10,000 organizations * 100 metrics each * 86,400 seconds/day = 86.4 billion 1-second metric points/day, which requires aggressive downsampling for storage efficiency.
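A quick back-of-envelope check of these figures (Python used purely as a calculator; all inputs come from the estimates above):

```python
# Outbound WebSocket push bandwidth.
concurrent_viewers = 1_000_000           # concurrent WebSocket connections
msg_size_bytes = 500                     # average push message size
outbound_mb_per_s = concurrent_viewers * msg_size_bytes / 1e6
print(outbound_mb_per_s)                 # 500.0 MB/s

# Daily volume of 1-second pre-aggregated metric points.
orgs, metrics_per_org, seconds_per_day = 10_000, 100, 86_400
points_per_day = orgs * metrics_per_org * seconds_per_day
print(f"{points_per_day:,}")             # 86,400,000,000 points/day
```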
High-Level Architecture
The architecture has four layers: Event Ingestion, Stream Aggregation, Query Serving, and Client Push. Events from web/mobile clients hit a fleet of ingestion servers that validate, enrich (add geo, device type), and publish to Kafka. A Flink job consumes events and maintains real-time aggregations (counts, sums, distinct counts using HyperLogLog) in 1-second tumbling windows, writing results to a low-latency serving store (Apache Druid or ClickHouse).
The Query Serving layer handles two types of requests: real-time pushes for live metrics and historical queries for time-range analysis. Pre-aggregated 1-second rollups are stored in Redis (for the last 10 minutes) and ClickHouse (for longer history). A Fanout Service subscribes to real-time metric updates from Kafka and pushes them via WebSocket to all connected clients subscribed to those metrics.
A Dashboard Configuration Service stores chart definitions and metric queries in PostgreSQL. When a user opens a dashboard, the client loads the configuration, fetches the last 10 minutes of history from ClickHouse, and subscribes to WebSocket channels for live updates. The client-side rendering engine merges historical data with live updates to display smooth continuous charts.
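A minimal sketch of that client-side merge step, assuming each point carries its 1-second window timestamp (field names are illustrative, not part of the design above):

```python
def merge_series(history, live_updates):
    """Merge historical points with live WebSocket points, keyed by timestamp.

    Live points overwrite any overlapping historical points, so the chart
    stays continuous where the WebSocket stream catches up with the last
    few seconds returned by the history query.
    """
    merged = {p["ts"]: p["value"] for p in history}
    for p in live_updates:
        merged[p["ts"]] = p["value"]
    return [{"ts": ts, "value": v} for ts, v in sorted(merged.items())]
```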
Core Components
Stream Aggregation (Flink)
A Flink job maintains keyed state per (org_id, metric_name, dimension_values) combination. Every second, it emits aggregated metric snapshots to a Kafka output topic. HyperLogLog sketches compute distinct user counts with roughly 1% error at a small, fixed fraction of the memory an exact set would need. A Count-Min Sketch, paired with a small heap, tracks approximate top-N dimension values (top pages, top countries) in fixed memory. Results are written to Redis (SORTED SET keyed by metric, with the window timestamp as score) and asynchronously to ClickHouse for persistence.
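The Flink job itself is not reproduced here; the sketch below shows the shape of its per-key, per-window state in plain Python. The exact set stands in for the HyperLogLog sketch described above, and all names are illustrative assumptions:

```python
from collections import defaultdict

class WindowAggregator:
    """Simplified stand-in for the Flink keyed state: one entry per
    (org_id, metric_name, dimensions) key and 1-second tumbling window."""

    def __init__(self):
        # key -> {"sum": float, "count": int, "users": set}
        self.state = defaultdict(lambda: {"sum": 0.0, "count": 0, "users": set()})

    def add_event(self, org_id, metric, dims, value, user_id, event_ts):
        # dims is a hashable tuple of (dimension, value) pairs.
        window_start = int(event_ts)          # 1-second tumbling window
        key = (org_id, metric, dims, window_start)
        agg = self.state[key]
        agg["sum"] += value
        agg["count"] += 1
        agg["users"].add(user_id)             # a real job would use HyperLogLog here

    def emit_window(self, window_start):
        """Called once per second: pop and return all closed windows so the
        caller can publish them to the Kafka output topic."""
        closed = {k: v for k, v in self.state.items() if k[3] == window_start}
        for k in closed:
            del self.state[k]
        return closed
```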
WebSocket Fanout Service
Each WebSocket server maintains a subscription registry: a mapping from metric keys to the set of connected client WebSocket connections subscribed to them. The Fanout Service reads metric update events from Kafka and publishes them to Redis Pub/Sub channels keyed by metric; each WebSocket server subscribes to the channels for its connected clients' metric keys, looks up the subscribed connections in its registry, and pushes the update. Horizontal scaling adds WebSocket servers; each server handles 100,000 concurrent connections using async I/O (Node.js or Java Netty).
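A minimal sketch of the per-server registry and fanout step, with conn.send standing in for whatever the WebSocket framework exposes (asyncio-based; all names are assumptions):

```python
import asyncio
from collections import defaultdict

class FanoutRegistry:
    """In-memory subscription registry on a single WebSocket server:
    metric key -> set of live client connections."""

    def __init__(self):
        self.subscribers = defaultdict(set)

    def subscribe(self, metric_key, conn):
        self.subscribers[metric_key].add(conn)

    def drop_connection(self, conn):
        # Called when a client disconnects.
        for conns in self.subscribers.values():
            conns.discard(conn)

    async def push_update(self, metric_key, payload):
        # Fan one metric update out to every subscribed connection;
        # a dead socket must not block delivery to the others.
        conns = list(self.subscribers.get(metric_key, ()))
        await asyncio.gather(
            *(conn.send(payload) for conn in conns),
            return_exceptions=True,
        )
```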
ClickHouse for Historical Queries
ClickHouse stores pre-aggregated metric rollups at multiple granularities: 1-second (10 min retention), 1-minute (7 day retention), 1-hour (1 year retention), 1-day (indefinite). TTL rules aggregate older rows into coarser granularities on expiry. Queries selecting 30 days of hourly data scan only the 1-hour rollup table, reading ~720 rows per metric and returning in milliseconds. The MergeTree engine, with its primary key sorted by (org_id, metric_name, timestamp), lets such queries skip directly to the range for a single org and metric.
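How a history query might route to the right rollup can be sketched as a simple granularity picker (table names and the point budget are assumptions, not part of the design above):

```python
# Rollup tables: (name, row granularity in seconds, retention in seconds).
ROLLUPS = [
    ("metrics_1s", 1,      10 * 60),
    ("metrics_1m", 60,     7 * 86_400),
    ("metrics_1h", 3_600,  365 * 86_400),
    ("metrics_1d", 86_400, None),            # kept indefinitely
]

def pick_rollup(window_seconds, max_points=2_000):
    """Pick the finest rollup whose retention covers the window and whose
    row count per metric stays under the point budget."""
    for table, step, retention in ROLLUPS:
        covers = retention is None or window_seconds <= retention
        if covers and window_seconds / step <= max_points:
            return table, step
    return ROLLUPS[-1][:2]

# A 30-day query lands on the 1-hour table: 30 * 24 = 720 rows per metric.
print(pick_rollup(30 * 86_400))              # ('metrics_1h', 3600)
```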
Database Design
- Redis HASH for current metric values: key = metrics:{org_id}:{metric_name}, fields = dimension values.
- Redis SORTED SET for time series: key = ts:{org_id}:{metric_name}, score = timestamp, value = serialized aggregate.
- ClickHouse table: (org_id UInt32, metric_name LowCardinality(String), dimensions Map(String, String), window_start DateTime, value Float64, count UInt64), partitioned by toYYYYMM(window_start) and ordered by (org_id, metric_name, window_start).
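A sketch of writing one aggregation snapshot into those Redis structures, assuming the redis-py client (key shapes follow the schema above; everything else is illustrative):

```python
import json
import time

import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

def write_snapshot(org_id, metric, current_by_dimension, aggregate):
    """Write the latest 1-second aggregate for one (org, metric)."""
    now = int(time.time())

    # Current value per dimension: HASH metrics:{org_id}:{metric_name}
    r.hset(f"metrics:{org_id}:{metric}", mapping=current_by_dimension)

    # 1-second time series: SORTED SET ts:{org_id}:{metric_name}, score = timestamp
    ts_key = f"ts:{org_id}:{metric}"
    r.zadd(ts_key, {json.dumps({"ts": now, **aggregate}): now})

    # Redis keeps only the last 10 minutes; older history lives in ClickHouse.
    r.zremrangebyscore(ts_key, "-inf", now - 600)
```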
API Design
GET /dashboards/{dashboard_id} — Return dashboard configuration including chart definitions and metric queries.
GET /metrics/{metric_id}/history?from={ts}&to={ts}&granularity=1m — Return historical metric values at specified granularity.
WebSocket /ws/subscribe — Establish a WebSocket connection; the client sends subscription messages for metric keys and the server pushes real-time updates (example payloads follow this list).
POST /alerts — Define a threshold alert: notify when metric crosses a value for more than N consecutive seconds.
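The payload schema is not specified above, so the shapes below are assumptions meant only to make the subscribe/push exchange concrete:

```python
# Client -> server: subscribe to a set of metric keys after connecting.
subscribe_msg = {
    "action": "subscribe",
    "metrics": ["metrics:4217:page_views", "metrics:4217:active_users"],
}

# Server -> client: one push per 1-second window per subscribed metric.
update_msg = {
    "metric": "metrics:4217:page_views",
    "window_start": 1718040000,
    "value": 18342,
    "dimensions": {"country": "US"},
}
```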
Scaling & Bottlenecks
WebSocket connection management is the primary scalability challenge. A single server handles 100K connections; 1 million connections require 10 WebSocket servers. Routing updates to the correct server uses Redis Pub/Sub: the Fanout Service publishes metric updates to Redis channels, and every WebSocket server subscribes to channels for its connected clients. Redis Cluster handles 1 million channel subscriptions with sub-millisecond publish latency.
Flink backpressure during traffic spikes (e.g., marketing campaigns driving 10x normal event volume) can delay metric updates. Adaptive scaling of Flink task managers (via KEDA watching Kafka consumer lag) adds capacity within 60 seconds. A circuit breaker degrades gracefully: if real-time metrics fall more than 5 seconds behind, the dashboard UI displays a staleness indicator rather than showing potentially misleading data.
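The staleness rule itself is simple; a sketch of the check, assuming window timestamps in Unix seconds and the 5-second threshold named above:

```python
import time

STALENESS_THRESHOLD_S = 5  # matches the 5-second degradation rule above

def staleness_state(latest_window_start: float) -> str:
    """Decide what the dashboard should render for a metric stream, given the
    newest 1-second window received over WebSocket (Unix seconds)."""
    lag = time.time() - latest_window_start
    # Show the staleness indicator and keep the last known data rather than
    # rendering a misleadingly flat or partial live chart.
    return "stale" if lag > STALENESS_THRESHOLD_S else "live"
```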
Key Trade-offs
- Push (WebSocket) vs. pull (polling): WebSocket push achieves 1-second update latency with minimal bandwidth overhead; polling every second per client creates thundering herd load on the API layer and increases latency by up to 1 polling interval.
- Pre-aggregated vs. raw query: Pre-aggregated metrics answer common queries in milliseconds but cannot answer arbitrary ad hoc queries; a hybrid approach serves pre-aggregated for standard charts and raw ClickHouse queries for custom exploration.
- Exact vs. approximate distinct counts: Exact distinct counting needs memory proportional to the number of unique users; HyperLogLog provides roughly 1% error with about 12 KB per counter, enabling millions of counters in memory.
- Multi-tenant shared cluster vs. dedicated clusters: Shared ClickHouse cluster reduces cost but risks noisy neighbor; resource groups and query quotas per org_id mitigate this at the cost of added management complexity.