
System Design: Monitoring & Alerting System

Design a monitoring and alerting system like Prometheus and Grafana that collects metrics from thousands of services, evaluates alert rules, and delivers notifications with minimal false positives.

17 min read · Updated Jan 15, 2025
Tags: system-design, monitoring, prometheus, alerting, infrastructure

Requirements

Functional Requirements:

  • Collect time-series metrics from services via pull (scraping) and push (StatsD, remote write) models
  • Store metrics with configurable retention (15 days raw, 1 year downsampled)
  • Evaluate alert rules periodically against stored metrics
  • Route alerts to appropriate channels (PagerDuty, Slack, email) based on severity and team
  • Suppress redundant alerts: group related alerts, apply inhibitions, and silence during maintenance
  • Provide a query interface for dashboards: PromQL or similar time-series query language

Non-Functional Requirements:

  • Handle ~5 million active metric time series across the system (≈333,000 samples/sec at a 15-second scrape interval)
  • Alert evaluation latency under 30 seconds from metric change to notification delivery
  • Query response time under 3 seconds for 24-hour range dashboard queries
  • 99.9% availability for metric ingestion; alert delivery must not be lost

Scale Estimation

10,000 services × 500 metrics each = 5M metric time series. Each series is scraped every 15 seconds, so ingest is 5M samples / 15s ≈ 333,000 samples/sec. Budget ~100 bytes per sample uncompressed (8 bytes for the float64 value plus a share of timestamp and label overhead). Storage: 333,000 samples/sec × 100 bytes ≈ 33 MB/sec raw. For 15-day retention: 33 MB/sec × 86,400 × 15 ≈ 43 TB. Compressed with Prometheus's delta-of-delta timestamp + XOR float compression (~10x): ~4.3 TB. Query load: 1,000 engineers × 10 dashboard queries/hour = 10,000 queries/hour ≈ 3 queries/sec, though each dashboard refresh fans out into many per-panel queries.
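
The arithmetic above is easy to sanity-check in a few lines. The sketch below is plain Python with the values copied from the estimate (the constants are assumptions, not measurements) and reproduces the ingest, storage, and query numbers.

    SERVICES = 10_000
    METRICS_PER_SERVICE = 500
    SCRAPE_INTERVAL_S = 15
    BYTES_PER_SAMPLE = 100        # rough uncompressed size: value + overhead
    RETENTION_DAYS = 15
    COMPRESSION_RATIO = 10        # delta-of-delta + XOR float compression

    series = SERVICES * METRICS_PER_SERVICE                   # 5,000,000
    samples_per_sec = series / SCRAPE_INTERVAL_S              # ~333,333
    raw_bytes_per_sec = samples_per_sec * BYTES_PER_SAMPLE    # ~33 MB/s
    raw_retention_tb = raw_bytes_per_sec * 86_400 * RETENTION_DAYS / 1e12   # ~43 TB
    compressed_tb = raw_retention_tb / COMPRESSION_RATIO      # ~4.3 TB

    engineers, queries_per_hour = 1_000, 10
    dashboard_qps = engineers * queries_per_hour / 3600       # ~2.8 queries/sec

    print(f"series={series:,}  samples/sec={samples_per_sec:,.0f}")
    print(f"raw 15-day storage={raw_retention_tb:.1f} TB  compressed={compressed_tb:.2f} TB")
    print(f"dashboard queries/sec={dashboard_qps:.1f}")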

High-Level Architecture

The design follows Prometheus's pull model: the Prometheus server scrapes HTTP /metrics endpoints on targets at a configurable interval (typically 15 seconds). Targets are discovered via service discovery (Kubernetes API, Consul, EC2 tags); when a new pod starts, it is automatically added to the scrape targets without any configuration change. This contrasts with push-based systems (StatsD, InfluxDB line protocol), where services send metrics to a central collector. Pull has a key advantage: if a target is unreachable, the failure is immediately visible as a failed scrape, whereas push-based systems may silently drop metrics if the network path fails.
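
To make the pull model concrete, here is a minimal sketch of a scrape target built with the official Python client library, prometheus_client; the port and metric names are illustrative. Prometheus would scrape http://<host>:8000/metrics on its interval.

    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    # Illustrative metrics; real services expose whatever they instrument.
    REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
    IN_FLIGHT = Gauge("inflight_requests", "Requests currently being handled")

    if __name__ == "__main__":
        start_http_server(8000)     # serves /metrics for Prometheus to scrape
        while True:
            IN_FLIGHT.set(random.randint(0, 20))
            REQUESTS.labels(status="200").inc()
            if random.random() < 0.05:
                REQUESTS.labels(status="500").inc()
            time.sleep(1)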

At 5M time series, a single Prometheus server hits memory limits (roughly 1 GB per 100K time series, or ~50 GB for 5M series: manageable on a large instance, but with no headroom for horizontal scaling). To scale out, shard with Prometheus federation (hierarchical) or adopt Thanos, Cortex, or Mimir for horizontally scalable, long-term storage. Thanos runs a sidecar alongside each Prometheus server that uploads completed data blocks to object storage (S3/GCS); Cortex and Mimir instead ingest samples via remote write. Both approaches provide a global query layer that fans out queries to the Prometheus shards and the long-term store and merges the results.

Alertmanager is a separate component that receives firing alerts from Prometheus, groups them (bundle 50 alerts about the same service into one notification), deduplicates (don't resend the same alert every 15 seconds), applies inhibition (if a data center is down, suppress all pod-level alerts from that DC), and routes to notification channels based on label-based routing trees.
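
As a rough illustration of grouping, the sketch below bundles alerts that share the same alertname and service labels into a single notification. The labels and alerts are made up, and real Alertmanager grouping is configured per route; this only shows the mechanism.

    from collections import defaultdict

    GROUP_BY = ("alertname", "service")    # assumed group_by labels

    def group_key(labels: dict) -> tuple:
        """Alerts that agree on these labels end up in one notification."""
        return tuple(labels.get(name, "") for name in GROUP_BY)

    incoming = [
        {"alertname": "HighErrorRate", "service": "checkout", "pod": "checkout-1"},
        {"alertname": "HighErrorRate", "service": "checkout", "pod": "checkout-2"},
        {"alertname": "HighLatency", "service": "search", "pod": "search-9"},
    ]

    groups = defaultdict(list)
    for alert in incoming:
        groups[group_key(alert)].append(alert)

    # Two notifications instead of three: both checkout alerts are bundled.
    for key, alerts in groups.items():
        print(key, "->", len(alerts), "alert(s)")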

Core Components

Time-Series Storage (TSDB)

Prometheus's TSDB stores data in 2-hour blocks on disk. Each block is immutable after creation: it contains compressed chunks for each time series in that 2-hour window, an inverted index (mapping label sets to series IDs), and tombstones (for deleted series). In memory, a head block holds the most recent two hours of data in mutable structures for fast ingestion; writes first go to a write-ahead log (WAL) for durability before being applied to the head block. On query, the TSDB merges data from the head block and the disk blocks covering the queried time range. The XOR float compression (Gorilla encoding) achieves ~1.37 bytes per sample for typical well-behaved series such as slowly changing gauges and monotonically increasing counters.
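
The intuition behind the XOR compression is that consecutive samples of the same series usually share most of their bits. The sketch below is plain Python, not Prometheus code, and the sample values are made up; it XORs the IEEE-754 bit patterns of adjacent samples and counts how many meaningful bits remain.

    import struct

    def xor_bits(prev: float, curr: float) -> int:
        """XOR of the IEEE-754 bit patterns of two consecutive samples."""
        a = struct.unpack("<Q", struct.pack("<d", prev))[0]
        b = struct.unpack("<Q", struct.pack("<d", curr))[0]
        return a ^ b

    def meaningful_bits(x: int) -> int:
        """Bits between the leading and trailing zero runs of the XOR result.
        Identical samples XOR to zero and cost only a single control bit."""
        if x == 0:
            return 0
        trailing_zeros = (x & -x).bit_length() - 1
        return x.bit_length() - trailing_zeros

    # A slowly moving gauge: most adjacent samples differ in only a few bits,
    # which is why Gorilla-style encoding averages ~1.37 bytes per sample.
    samples = [100.0, 100.0, 100.5, 101.0, 101.0, 101.25]
    for prev, curr in zip(samples, samples[1:]):
        print(prev, "->", curr, ":", meaningful_bits(xor_bits(prev, curr)), "meaningful bits")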

Alert Rule Engine

Alert rules are PromQL expressions evaluated periodically (every evaluation_interval, typically 15-30 seconds). A rule named HighErrorRate with the expression rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05 and for: 2m fires when the error rate exceeds 5% for 2 consecutive minutes (the for clause prevents flapping on transient spikes). Rules are evaluated against the local TSDB; the rule engine is part of the Prometheus server. Pending state (condition met but the for duration not yet elapsed) is tracked in memory; firing state (condition held for the full duration) produces an alert payload that is sent to Alertmanager.
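
A simplified sketch of that pending-to-firing state machine follows; the class and field names are hypothetical, and query stands in for evaluating the PromQL expression against the local TSDB.

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class AlertRule:
        name: str
        threshold: float     # e.g. 0.05 for a 5% error-rate ceiling
        for_seconds: float   # how long the condition must hold before firing

    class RuleEvaluator:
        """Runs once per evaluation_interval and tracks pending/firing state."""

        def __init__(self, rule: AlertRule, query: Callable[[], float]):
            self.rule = rule
            self.query = query                   # stand-in for a PromQL evaluation
            self.pending_since: Optional[float] = None

        def evaluate(self, now: float) -> str:
            value = self.query()
            if value <= self.rule.threshold:
                self.pending_since = None        # condition cleared: reset
                return "inactive"
            if self.pending_since is None:
                self.pending_since = now         # condition just became true
            if now - self.pending_since >= self.rule.for_seconds:
                return "firing"                  # would POST an alert to Alertmanager
            return "pending"                     # for duration not yet elapsed

    rule = AlertRule("HighErrorRate", threshold=0.05, for_seconds=120)
    # evaluator = RuleEvaluator(rule, query=lambda: current_error_rate())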

Alertmanager Routing

Alertmanager receives alerts as HTTP POSTs from Prometheus. The routing tree maps label matchers to receivers: match: severity=critical AND team=payments → PagerDuty; match: severity=warning → Slack #ops-alerts. Grouping: alerts with the same alertname and service labels are grouped into a single notification, with group_wait=30s (delay before the first notification, to accumulate related alerts) and group_interval=5m (how long to wait before notifying about new alerts added to an already-notified group; a separate repeat_interval controls re-sending for alerts that are still firing). Silences: operators create time-limited silences (e.g., silence all payments alerts for 2 hours during scheduled maintenance); any alert matching the silence's label matchers is suppressed.
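
A minimal sketch of label-based routing follows; the tree mirrors the example above, while the receiver names and the first-match-wins semantics are a simplification of Alertmanager's routing configuration rather than its actual code.

    from dataclasses import dataclass, field

    @dataclass
    class Route:
        """One node of a simplified routing tree: the first matching child wins;
        otherwise the alert falls through to this node's receiver."""
        receiver: str
        match: dict = field(default_factory=dict)      # label equality matchers
        children: list = field(default_factory=list)

        def matches(self, labels: dict) -> bool:
            return all(labels.get(k) == v for k, v in self.match.items())

        def route(self, labels: dict) -> str:
            for child in self.children:
                if child.matches(labels):
                    return child.route(labels)
            return self.receiver

    tree = Route(receiver="slack-ops-alerts", children=[
        Route(receiver="pagerduty-payments",
              match={"severity": "critical", "team": "payments"}),
        Route(receiver="slack-ops-alerts", match={"severity": "warning"}),
    ])

    print(tree.route({"severity": "critical", "team": "payments"}))  # pagerduty-payments
    print(tree.route({"severity": "warning", "team": "search"}))     # slack-ops-alerts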

Database Design

Prometheus TSDB is the primary storage; it is not a general-purpose database. For long-term storage, Thanos or Cortex writes Prometheus TSDB blocks (compressed chunks plus index) to S3-compatible object storage, enabling cost-effective retention for 1+ years. Queries against long-term storage (>15 days) are served by the global query layer, which retrieves blocks from S3 on demand.

Alert history (when alerts fired, when they resolved, who acknowledged) is stored in a separate operational database (PostgreSQL). This enables SLA/SLO reporting: total time a service spent in an alerting state, MTTR trends. Inhibition rules and the routing tree live in Alertmanager's configuration file (often mounted from a Kubernetes ConfigMap), while silences are runtime state that Alertmanager persists to local disk and, when clustered, replicates between replicas via gossip, so they survive Alertmanager restarts.
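
With alert history in a relational store, SLO-style reporting reduces to simple aggregation. The sketch below computes time-in-alert and MTTR from hypothetical (fired_at, resolved_at) rows; the timestamps are made up.

    from datetime import datetime, timedelta

    # Hypothetical rows from the alert-history table for one service.
    history = [
        (datetime(2025, 1, 10, 3, 0), datetime(2025, 1, 10, 3, 25)),
        (datetime(2025, 1, 12, 14, 5), datetime(2025, 1, 12, 14, 50)),
    ]

    durations = [resolved - fired for fired, resolved in history]
    time_in_alert = sum(durations, timedelta())   # total time in an alerting state
    mttr = time_in_alert / len(durations)         # mean time to resolution
    print(f"time in alert: {time_in_alert}, MTTR: {mttr}")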

API Design
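
The dashboard read path can follow Prometheus's HTTP API: instant queries via GET /api/v1/query and range queries via GET /api/v1/query_range, while Alertmanager exposes its own API for operations such as creating silences (POST /api/v2/silences). Below is a small sketch of a range query using the requests library; the host name and expression are placeholders, not part of the design above.

    import time

    import requests

    PROM_URL = "http://prometheus.example.internal:9090"   # placeholder host

    def query_range(expr: str, start: float, end: float, step: str = "60s") -> list:
        """Range query via Prometheus's /api/v1/query_range endpoint."""
        resp = requests.get(
            f"{PROM_URL}/api/v1/query_range",
            params={"query": expr, "start": start, "end": end, "step": step},
            timeout=10,
        )
        resp.raise_for_status()
        body = resp.json()
        if body["status"] != "success":
            raise RuntimeError(body.get("error", "query failed"))
        return body["data"]["result"]      # one entry per matching time series

    if __name__ == "__main__":
        # Example: data for a 24-hour error-rate panel.
        now = time.time()
        series = query_range(
            'rate(http_requests_total{status=~"5.."}[5m])',
            start=now - 24 * 3600,
            end=now,
        )
        print(len(series), "series returned")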

Scaling & Bottlenecks

Prometheus's single-node TSDB limits horizontal write scaling: all scrapes for a given shard go to one server. Sharding by consistent hashing (each Prometheus instance scrapes the subset of targets determined by a hashmod on target labels) distributes load but requires a global query layer to merge results. Thanos Query acts as this layer, fanning out queries to all shards and merging the responses, at the cost of query latency bounded by the slowest shard.
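
A rough sketch of how targets could be assigned to shards follows; the MD5-then-modulus scheme mirrors the spirit of the hashmod relabel action, and the target names and shard count are made up.

    import hashlib

    NUM_SHARDS = 4    # assumed number of Prometheus instances

    def shard_for_target(target: str, num_shards: int = NUM_SHARDS) -> int:
        """Stable hash of the target address modulo the shard count; each
        Prometheus instance keeps only the targets whose shard matches its ID."""
        digest = hashlib.md5(target.encode()).hexdigest()
        return int(digest, 16) % num_shards

    targets = ["payments-7f9c:9100", "checkout-1b2a:9100", "search-44de:9100"]
    print({t: shard_for_target(t) for t in targets})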

High-cardinality metrics (metrics with many unique label combinations — e.g., per-user-ID metrics) create memory pressure: each unique label set is a separate time series. 1M unique users × 5 metrics each = 5M series just from user-level metrics. Mitigation: aggregate high-cardinality metrics before storage (use recording rules to pre-compute aggregations), or use a purpose-built high-cardinality store (ClickHouse, Druid) for user-level metrics and Prometheus for service-level metrics.

Key Trade-offs

  • Pull vs. push model: Pull (Prometheus default) is operationally simpler (targets just expose /metrics, no push config needed) but requires the monitoring system to be able to reach all targets; push is better for ephemeral jobs (batch jobs that start and stop faster than the scrape interval)
  • Local TSDB vs. remote storage: Local TSDB is fast and simple but single-node; remote storage (Cortex, Thanos) is horizontally scalable and globally queryable but operationally complex
  • Alert sensitivity vs. noise: Low thresholds and short FOR durations detect issues quickly but produce alert fatigue from transient spikes; high thresholds and long durations reduce noise but delay detection of real issues
  • Pull interval granularity: 15-second scrape intervals capture transient spikes but produce 4x the data volume vs. 60-second intervals; for most SLO metrics, 60-second resolution is sufficient
