System Design: API Analytics Platform
System design of an API analytics platform covering request logging, real-time metrics aggregation, usage-based billing, anomaly detection, and developer dashboards for API providers.
Requirements
Functional Requirements:
- Capture every API request/response with metadata: endpoint, method, status code, latency, request/response size, API key, IP address, user agent
- Real-time dashboards showing request volume, error rates, latency percentiles (P50, P95, P99), and geographic distribution
- Usage-based billing: aggregate request counts and data transfer per API key per billing period
- Alerting on anomalies: spike in error rate, latency degradation, unusual traffic patterns, rate limit exhaustion
- Historical analytics with drill-down: filter by endpoint, time range, status code, API key; export to CSV
- API key management with per-key rate limits and usage quotas
Non-Functional Requirements:
- Ingest 1 million API events/sec with no sampling (100% capture)
- Dashboard latency under 2 seconds for real-time metrics (last 5 minutes)
- 99.9% availability for the ingestion pipeline (dropping events means inaccurate billing)
- Store raw events for 90 days, aggregated metrics for 2 years
- Billing accuracy: request counts must be exact (no approximation)
Scale Estimation
- Ingest: 1M events/sec × 1KB per event = 1GB/sec throughput; daily volume: 86.4B events, 86.4TB raw data; 90-day raw retention: ~7.8PB
- Pre-aggregated metrics (per endpoint, per minute): 10K endpoints × 1,440 minutes/day × 100 bytes ≈ 1.4GB/day; 2-year retention: ~1TB
- Billing aggregation: 1M API keys × daily rollups = 1M rows/day
- Dashboard queries: 10K users × 5 queries/minute = 50K queries/minute ≈ 833/sec
- Alerting: 10K alert rules evaluated every 60 seconds
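These figures are easy to sanity-check with a few lines of Python (a throwaway calculation mirroring the numbers above):

```python
# Back-of-envelope check of the scale estimates above.
EVENTS_PER_SEC = 1_000_000
EVENT_BYTES = 1_000            # ~1KB per event
SECONDS_PER_DAY = 86_400

ingest_gb_per_sec = EVENTS_PER_SEC * EVENT_BYTES / 1e9
daily_events = EVENTS_PER_SEC * SECONDS_PER_DAY
raw_90d_pb = daily_events * EVENT_BYTES * 90 / 1e15

# Pre-aggregated metrics: 10K endpoints x 1,440 minutes/day x 100 bytes
agg_daily_gb = 10_000 * 1_440 * 100 / 1e9
agg_2y_tb = agg_daily_gb * 730 / 1e3

print(f"ingest: {ingest_gb_per_sec:.1f} GB/s; daily: {daily_events / 1e9:.1f}B events")
print(f"90-day raw: {raw_90d_pb:.2f} PB; 2-year aggregates: {agg_2y_tb:.2f} TB")
```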
High-Level Architecture
The platform uses a lambda architecture combining real-time (speed layer) and batch processing. The Ingestion Layer uses lightweight agents (SDKs or reverse proxy sidecars) deployed alongside the customer's API servers. Agents capture request/response metadata without impacting API latency (async, non-blocking capture). Events are sent to an Ingestion Gateway (a fleet of stateless HTTP/gRPC receivers) that validates events, enriches them (geo-IP lookup, API key resolution), and publishes to Kafka (partitioned by API key for ordering). Backpressure is absorbed on the agent side: agents buffer events locally and send them in batches (every 1 second or 1,000 events, whichever comes first), as sketched below.
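A minimal sketch of that agent-side buffering, assuming the caller supplies the function that actually posts batches to the gateway (all names are illustrative; a production agent would add retries, backpressure signals, and a background flush thread):

```python
import threading
import time

class EventBuffer:
    """Buffers events and flushes every 1 second or 1,000 events,
    whichever comes first (the batching policy described above)."""

    def __init__(self, send_batch, max_events=1_000, max_age_sec=1.0):
        self._send_batch = send_batch   # callable that posts to the ingestion gateway
        self._max_events = max_events
        self._max_age = max_age_sec
        self._events = []
        self._lock = threading.Lock()
        self._last_flush = time.monotonic()

    def add(self, event: dict) -> None:
        with self._lock:
            self._events.append(event)
            if len(self._events) >= self._max_events:
                self._flush_locked()

    def tick(self) -> None:
        # Called periodically by a background timer thread.
        with self._lock:
            if self._events and time.monotonic() - self._last_flush >= self._max_age:
                self._flush_locked()

    def _flush_locked(self) -> None:
        batch, self._events = self._events, []
        self._last_flush = time.monotonic()
        self._send_batch(batch)   # async / non-blocking in a real agent
```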
The Speed Layer consumes events from Kafka and computes real-time aggregations using Apache Flink (or a custom stream processor). Aggregations are computed in tumbling windows (1-minute resolution): request count, error count, latency percentiles (using t-digest for approximate percentile computation), and bytes transferred, grouped by (api_key, endpoint, status_code_class). Aggregated metrics are written to a time-series database (TimescaleDB or ClickHouse) for dashboard queries. The speed layer also feeds the Alerting Engine, which evaluates alert rules against the streaming aggregations.
The Batch Layer writes raw events from Kafka to a data lake (S3 in Parquet format, partitioned by date and api_key). Nightly batch jobs (Spark) compute exact billing aggregates, reconcile with speed layer approximations, and generate billing reports. Historical analytics queries (beyond the 5-minute real-time window) run against the data lake using Presto/Trino for interactive queries or Spark for complex analyses.
Core Components
Stream Processing (Real-time Aggregation)
The Flink stream processor runs 50 task manager instances consuming from 200 Kafka partitions. Each event is keyed by (api_key, endpoint) and aggregated in 1-minute tumbling windows. Metrics computed per window: count, error_count (status >= 400), latency_p50/p95/p99 (t-digest with compression=200 for 1% accuracy), total_request_bytes, total_response_bytes. The t-digest data structure enables accurate percentile estimation with O(1) memory per key, critical when tracking millions of unique (api_key, endpoint) combinations. Window results are emitted at window close and written to TimescaleDB with a composite hypertable partitioned by time and api_key. Late events (within a 5-minute allowed lateness) trigger window re-computation.
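The following is not the production Flink job, just a plain-Python sketch of the per-window state it maintains. Where a real deployment would hold a t-digest, this sketch keeps a bounded latency sample (event field names are assumptions):

```python
import math
from collections import defaultdict

WINDOW_SEC = 60  # 1-minute tumbling windows

class WindowAggregate:
    """Per (api_key, endpoint) state for one tumbling window."""
    def __init__(self):
        self.count = 0
        self.error_count = 0
        self.bytes_in = 0
        self.bytes_out = 0
        self.latencies = []          # stand-in for a t-digest

    def update(self, event: dict) -> None:
        self.count += 1
        if event["status_code"] >= 400:
            self.error_count += 1
        self.bytes_in += event["request_bytes"]
        self.bytes_out += event["response_bytes"]
        if len(self.latencies) < 10_000:   # crude memory cap; t-digest is fixed-size
            self.latencies.append(event["latency_ms"])

    def percentile(self, q: float) -> float:
        s = sorted(self.latencies)
        return s[min(len(s) - 1, math.ceil(q * len(s)) - 1)] if s else 0.0

# windows[(window_start, api_key, endpoint)] -> WindowAggregate
windows: dict[tuple, WindowAggregate] = defaultdict(WindowAggregate)

def process(event: dict) -> None:
    window_start = event["ts"] - event["ts"] % WINDOW_SEC
    windows[(window_start, event["api_key"], event["endpoint"])].update(event)
```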
Billing Aggregation Pipeline
Billing requires exact counts (not approximations) because revenue depends on it. The batch pipeline (Spark) runs nightly, reading raw events from the S3 data lake and computing exact aggregates per api_key per day: total_requests, requests_by_tier (different endpoints may have different pricing), total_data_transfer_bytes. These aggregates are written to PostgreSQL (billing_usage table) and reconciled against the real-time speed layer's counts. Discrepancies (typically < 0.1%, caused mainly by events arriving after the speed layer's 5-minute allowed lateness) are resolved in favor of the batch count. The billing system reads from PostgreSQL to generate invoices. Metered billing events are published to Stripe or a custom billing engine for usage-based pricing.
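A sketch of the nightly rollup in PySpark, assuming the Parquet layout described under Database Design (the bucket name, JDBC URL, and staging table are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("billing-rollup").getOrCreate()

# Read one day's raw events; the dt= partition column prunes the scan.
events = (spark.read.parquet("s3://api-analytics-lake/events/")
          .where(F.col("dt") == "2025-01-15"))

daily_usage = (
    events.groupBy("api_key")
    .agg(
        F.count("*").alias("total_requests"),          # exact count, not approximate
        F.sum("request_bytes").alias("bytes_in"),
        F.sum("response_bytes").alias("bytes_out"),
    )
)

# Land the exact aggregates in PostgreSQL for reconciliation and invoicing.
(daily_usage.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://billing-db:5432/billing")
    .option("dbtable", "billing_usage_daily")
    .mode("append")
    .save())
```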
Anomaly Detection & Alerting
The alerting engine evaluates rules against the streaming aggregation output. Rule types include: static threshold (error rate > 5%), dynamic threshold (error rate > 2 standard deviations from 7-day rolling average), rate of change (latency P99 increased > 50% in last 5 minutes), and volume anomaly (request count < 50% of same hour last week, indicating potential outage). Rules are evaluated every 60 seconds against the latest 1-minute aggregation windows. A suppression mechanism prevents alert storms: after an alert fires, it enters a cooldown period (configurable, default 15 minutes) during which repeated threshold violations do not generate new alerts. Notifications are delivered via email, Slack webhook, PagerDuty, or custom webhook.
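A minimal sketch of the threshold checks and cooldown suppression (the rule representation and names are assumptions):

```python
import time

COOLDOWN_SEC = 15 * 60   # default cooldown from the text

def breaches_static(metric_value: float, threshold: float) -> bool:
    # e.g. error rate > 5%
    return metric_value > threshold

def breaches_dynamic(metric_value: float, rolling_mean: float, rolling_std: float) -> bool:
    # e.g. error rate > 2 standard deviations above the 7-day rolling average
    return metric_value > rolling_mean + 2 * rolling_std

class AlertSuppressor:
    """Drops repeat firings of the same rule during its cooldown window."""
    def __init__(self):
        self._last_fired: dict[str, float] = {}

    def should_notify(self, rule_id: str, cooldown_sec: float = COOLDOWN_SEC) -> bool:
        now = time.monotonic()
        last = self._last_fired.get(rule_id)
        if last is not None and now - last < cooldown_sec:
            return False          # still in cooldown: suppress
        self._last_fired[rule_id] = now
        return True
```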
Database Design
TimescaleDB stores real-time aggregated metrics: api_metrics (time TIMESTAMPTZ, api_key, endpoint, method, status_class, request_count BIGINT, error_count BIGINT, latency_p50 FLOAT, latency_p95 FLOAT, latency_p99 FLOAT, bytes_in BIGINT, bytes_out BIGINT). This is a hypertable partitioned by time (1-hour chunks) with compression enabled on chunks older than 2 hours (10x storage reduction). Continuous aggregates provide pre-computed hourly and daily rollups for dashboard queries spanning longer time ranges.
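In TimescaleDB terms, the hypertable and compression policy described above might be created as follows (a sketch: the space-partition count and connection string are assumptions):

```python
import psycopg2

DDL = """
CREATE TABLE api_metrics (
    time          TIMESTAMPTZ NOT NULL,
    api_key       TEXT        NOT NULL,
    endpoint      TEXT        NOT NULL,
    method        TEXT,
    status_class  SMALLINT,
    request_count BIGINT,
    error_count   BIGINT,
    latency_p50   DOUBLE PRECISION,
    latency_p95   DOUBLE PRECISION,
    latency_p99   DOUBLE PRECISION,
    bytes_in      BIGINT,
    bytes_out     BIGINT
);
-- 1-hour chunks, space-partitioned by api_key as described above.
SELECT create_hypertable('api_metrics', 'time',
                         partitioning_column => 'api_key',
                         number_partitions   => 16,
                         chunk_time_interval => INTERVAL '1 hour');
-- Compress chunks older than 2 hours.
ALTER TABLE api_metrics SET (timescaledb.compress,
                             timescaledb.compress_segmentby = 'api_key');
SELECT add_compression_policy('api_metrics', INTERVAL '2 hours');
"""

with psycopg2.connect("dbname=metrics") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```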
Raw events are stored in S3 as Parquet files: partitioned by dt={date}/api_key={key}/. Each Parquet file contains columns: timestamp, endpoint, method, status_code, latency_ms, request_bytes, response_bytes, ip_address, user_agent, geo_country, api_key. Columnar storage with Snappy compression achieves 10:1 compression ratio (7.8PB raw → ~800TB stored). PostgreSQL stores billing data: billing_usage (api_key, billing_period_start DATE, billing_period_end DATE, total_requests BIGINT, requests_by_tier JSONB, total_bytes BIGINT, computed_at). Alert configurations: alert_rules (rule_id, api_key, metric, condition, threshold, cooldown_seconds, notification_channels JSONB).
API Design
- POST /api/v1/events/batch — Ingest a batch of API events; body contains an array of event objects; returns accepted count
- GET /api/v1/analytics/realtime?api_key={key}&window=5m — Fetch real-time metrics for the last N minutes
- GET /api/v1/analytics/historical?api_key={key}&start={date}&end={date}&endpoint={path}&granularity=1h — Fetch historical metrics with drill-down filters
- GET /api/v1/billing/usage?api_key={key}&period=2025-01 — Fetch billing usage for a period; returns exact request counts and data transfer
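A hypothetical client call against the real-time endpoint (the base URL, auth scheme, and response shape are all assumptions):

```python
import requests

BASE = "https://analytics.example.com"   # placeholder base URL

resp = requests.get(
    f"{BASE}/api/v1/analytics/realtime",
    params={"api_key": "key_123", "window": "5m"},
    headers={"Authorization": "Bearer <token>"},   # auth scheme assumed
    timeout=5,
)
resp.raise_for_status()
for point in resp.json()["metrics"]:               # response shape assumed
    print(point["time"], point["request_count"], point["latency_p99"])
```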
Scaling & Bottlenecks
Kafka ingestion at 1M events/sec (1GB/sec) requires a well-provisioned cluster: 50 brokers serving a 200-partition events topic (4 partitions per broker), with each broker handling ~20K events/sec. The ingestion gateway fleet (stateless HTTP servers) scales horizontally; 100 instances handle peak load with headroom. Agent-side batching (1-second or 1,000-event buffers) reduces the number of HTTP requests by roughly 100-1,000x compared to per-event sending, depending on how full batches are at flush time.
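On the gateway side, a throughput-oriented producer configuration might look like the following (broker addresses and topic name are placeholders; the settings are standard confluent-kafka/librdkafka options):

```python
from confluent_kafka import Producer

# Gateway-side producer tuned for throughput: batch aggressively, compress,
# and require full acknowledgement so billing events are not silently lost.
producer = Producer({
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092",
    "acks": "all",                  # durability over latency: this is billing data
    "compression.type": "lz4",
    "linger.ms": 50,                # wait up to 50ms to fill batches
    "batch.size": 1_048_576,        # 1MB batches at 1GB/s aggregate
    "enable.idempotence": True,     # avoid duplicate counts on retry
})

def publish(event_bytes: bytes, api_key: str) -> None:
    # Keyed by api_key so one key's events land on one partition (ordering).
    producer.produce("api-events", key=api_key, value=event_bytes)
```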
TimescaleDB query performance is the dashboard bottleneck. Real-time queries (last 5 minutes) touch only the newest 1-hour chunk (two if the window spans a chunk boundary) and read at most five 1-minute rows per series — sub-second at the per-api_key level. Historical queries spanning months require continuous aggregate materialization (pre-computed hourly/daily rollups). Without continuous aggregates, a query spanning 90 days would scan 2,160 hourly chunks; with daily aggregates, it scans 90 rows. An index on (api_key, time DESC) ensures single-customer queries are efficient. Cross-customer analytics (platform-wide metrics) use a separate materialized view aggregated without api_key grouping.
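A sketch of the daily rollup as a TimescaleDB continuous aggregate. Note that per-window percentiles do not compose, so the rollup below takes a conservative max over p99; the refresh policy values are assumptions:

```python
import psycopg2

CAGG = """
CREATE MATERIALIZED VIEW api_metrics_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS bucket,
       api_key,
       sum(request_count) AS request_count,
       sum(error_count)   AS error_count,
       max(latency_p99)   AS latency_p99_max  -- percentiles don't re-aggregate exactly
FROM api_metrics
GROUP BY bucket, api_key;
"""

conn = psycopg2.connect("dbname=metrics")
conn.autocommit = True   # continuous aggregates cannot be created in a transaction
with conn.cursor() as cur:
    cur.execute(CAGG)
    # Refresh hourly, keeping the most recent hour out of the materialized window.
    cur.execute("""SELECT add_continuous_aggregate_policy('api_metrics_daily',
                     start_offset      => INTERVAL '3 days',
                     end_offset        => INTERVAL '1 hour',
                     schedule_interval => INTERVAL '1 hour');""")
```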
Key Trade-offs
- 100% event capture vs sampling: Full capture ensures billing accuracy and enables drill-down to individual requests, but requires 7.8PB storage for 90-day retention — Parquet columnar storage with compression makes this economically viable
- t-digest approximate percentiles vs exact percentiles: t-digest provides 1% accuracy with O(1) memory per key, enabling real-time P99 computation across millions of keys — exact percentiles would require storing all latency values, which is infeasible at 1M events/sec
- Lambda architecture (speed + batch layers) vs kappa (stream only): The batch layer provides billing-grade accuracy that reconciles stream processing approximations — for non-billing use cases, a pure streaming approach would be simpler
- TimescaleDB vs ClickHouse for metrics storage: TimescaleDB provides PostgreSQL compatibility (easier operations, SQL familiarity) with good compression; ClickHouse offers 2-5x better query performance on analytical workloads but requires specialized operational expertise — TimescaleDB chosen for operational simplicity