System Design: Social Media Analytics Platform
End-to-end system design of a social media analytics platform covering real-time dashboards, engagement metrics aggregation, audience insights, and competitive benchmarking at enterprise scale.
Requirements
Functional Requirements:
- Track engagement metrics (likes, comments, shares, reach, impressions) across multiple social platforms
- Provide real-time dashboards with customizable date ranges and filters
- Generate audience demographic insights (age, gender, location, interests)
- Support scheduled report generation and export (PDF, CSV)
- Competitive benchmarking: compare metrics against competitor accounts
- Sentiment analysis on comments and mentions
Non-Functional Requirements:
- Ingest 10 billion events per day from platform APIs and webhooks
- Dashboard queries must return within 3 seconds for any date range
- Support 100K concurrent dashboard users
- Data freshness: real-time metrics updated within 60 seconds; historical reports within 1 hour
- 99.9% availability for dashboard reads
Scale Estimation
10 billion events/day equals approximately 115K events/sec ingested. Each event averages 500 bytes of structured data (platform, account_id, post_id, metric_type, value, timestamp), totaling 5TB raw data per day. After aggregation into hourly and daily rollups, the compressed data is approximately 500GB/day. With 2 years of retention, total storage is roughly 365TB compressed. Dashboard queries scan pre-aggregated data — a typical dashboard loads 30 metrics across 30 days = 900 rollup rows, well within sub-second query time on a columnar store. The sentiment analysis pipeline processes approximately 2 billion text items/day (comments + mentions), requiring roughly 40 GPU inference nodes running a DistilBERT classifier.
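As a quick sanity check, the arithmetic behind these estimates (constants taken directly from the figures above):

```python
# Back-of-envelope check of the figures above.
EVENTS_PER_DAY = 10_000_000_000
EVENT_SIZE_BYTES = 500
ROLLUP_BYTES_PER_DAY = 500e9          # ~500 GB/day after rollups and compression
RETENTION_DAYS = 2 * 365

events_per_sec = EVENTS_PER_DAY / 86_400                      # ~115,740 events/sec
raw_tb_per_day = EVENTS_PER_DAY * EVENT_SIZE_BYTES / 1e12     # 5.0 TB/day of raw events
retained_tb = ROLLUP_BYTES_PER_DAY * RETENTION_DAYS / 1e12    # ~365 TB over 2 years
dashboard_rows = 30 * 30                                      # 30 metrics x 30 days = 900 rollup rows

print(f"{events_per_sec:,.0f} events/sec, {raw_tb_per_day:.1f} TB/day raw, "
      f"{retained_tb:.0f} TB retained, {dashboard_rows} rows per dashboard load")
```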
High-Level Architecture
The analytics platform follows a Lambda architecture with a speed layer for real-time metrics and a batch layer for historical aggregation. The ingestion layer consists of platform-specific Connector Services that poll social media APIs (Facebook Graph API, Twitter API v2, Instagram Basic Display API, TikTok Research API) on configurable intervals (every 60 seconds for real-time accounts, every 15 minutes for standard). Each connector normalizes the platform-specific response into a unified event schema and publishes to a Kafka cluster partitioned by account_id.
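A minimal sketch of a connector's normalize-and-publish step. The topic name, field names, and the shape of the platform response are illustrative assumptions, not the exact schema:

```python
import json
from kafka import KafkaProducer  # kafka-python

def normalize_facebook_insight(account_id: int, post_id: str, insight: dict) -> dict:
    """Map one platform-specific insight record onto the unified event schema (assumed shape)."""
    return {
        "platform": "facebook",
        "account_id": account_id,
        "post_id": post_id,
        "metric_type": insight["name"],                       # e.g. "post_impressions"
        "value": float(insight["values"][0]["value"]),
        "timestamp": insight["values"][0].get("end_time"),    # response shape varies by endpoint
    }

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    key_serializer=lambda k: str(k).encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish(event: dict) -> None:
    # Keying by account_id keeps all events for an account on one partition,
    # so downstream aggregation sees them in order.
    producer.send("engagement-events", key=event["account_id"], value=event)
```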
The speed layer: a Flink streaming job consumes from Kafka, computes real-time counters (running totals of likes, comments, shares per post and per account per hour), and writes to a Redis cluster. The batch layer: an hourly Spark job reads from the Kafka-backed data lake (S3 in Parquet format), computes aggregated rollups (hourly, daily, weekly, monthly) grouped by account, post, platform, and metric type, and writes to ClickHouse — a columnar analytical database optimized for fast aggregation queries. The serving layer: a Dashboard API reads from Redis for real-time data (last 24 hours) and ClickHouse for historical data, merging results seamlessly.
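A sketch of the hourly batch rollup in PySpark, assuming raw events land in the S3 data lake as Parquet with the unified schema; paths are placeholders, and the final load into ClickHouse (e.g. via its JDBC driver) is left out:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hourly-rollup").getOrCreate()

# Raw events for one day, written by the Kafka -> S3 sink (placeholder path).
events = spark.read.parquet("s3://analytics-lake/raw/engagement/dt=2024-01-01/")

hourly = (
    events
    .withColumn("hour", F.date_trunc("hour", F.col("timestamp")))
    .groupBy("account_id", "post_id", "platform", "metric_type", "hour")
    .agg(F.sum("value").alias("value"), F.count(F.lit(1)).alias("event_count"))
)

# Stage the rollup back to the lake; a separate loader inserts it into ClickHouse.
hourly.write.mode("overwrite").parquet("s3://analytics-lake/rollups/hourly/dt=2024-01-01/")
```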
Core Components
Multi-Platform Ingestion Service
Each social platform has a dedicated Connector Service handling API authentication (OAuth 2.0 token refresh), rate limit management (respecting per-endpoint rate limits with token bucket algorithms), pagination (cursor-based for large result sets), and error handling (exponential backoff with jitter for transient failures). The connectors run as Kubernetes deployments with horizontal scaling per platform. A Connector Manager Service orchestrates which accounts to poll and at what frequency, backed by a PostgreSQL configuration database. Webhook receivers (for platforms that support push delivery, such as Facebook) bypass polling entirely and ingest events directly.
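A compact sketch of the two rate-control pieces a connector needs: a per-endpoint token bucket and exponential backoff with full jitter. The exception type and limits are assumptions for illustration:

```python
import random
import time

class TokenBucket:
    """Per-endpoint rate limiter: refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)   # wait until the next token is due

class TransientAPIError(Exception):
    """Retryable platform API error (assumed; map HTTP 429/5xx onto this)."""

def call_with_backoff(fn, max_retries: int = 5):
    """Exponential backoff with full jitter for transient failures."""
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientAPIError:
            time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError("platform API call failed after retries")
```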
Real-Time Aggregation Pipeline
The Flink streaming job maintains windowed counters using RocksDB state. For each event, it updates multiple materialized views: per-post metrics (total likes/comments/shares), per-account hourly metrics (engagement rate, reach, impressions), and cross-account trending metrics. The job uses event-time processing with watermarks (allowing 5 minutes of late data) to handle out-of-order events from different platform APIs. Results are written to Redis hashes keyed by metrics:{account_id}:{date}:{hour} with fields for each metric type. A separate Flink job runs the sentiment analysis pipeline: text events are batched (batch size 64) and sent to a Triton Inference Server hosting a DistilBERT model, with results aggregated into sentiment scores per post and per account.
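The Redis write path, shown here as plain Python with redis-py rather than the Flink sink itself; the key scheme follows metrics:{account_id}:{date}:{hour} above, and the 48-hour TTL matches the cleanup policy described in the database design:

```python
import redis
from datetime import datetime, timezone

r = redis.Redis(host="redis", port=6379)

def apply_event(account_id: int, metric_type: str, value: float, ts: datetime) -> None:
    """Bump the hourly counter for one metric, mirroring what the Flink sink writes."""
    key = f"metrics:{account_id}:{ts:%Y-%m-%d}:{ts:%H}"
    r.hincrbyfloat(key, metric_type, value)     # one hash per account-hour, one field per metric
    r.expire(key, 48 * 3600)                    # auto-cleanup once the batch layer has caught up

apply_event(42, "likes", 1, datetime.now(timezone.utc))
```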
Dashboard Query Engine
The Dashboard API implements a query planner that routes requests to the optimal data source based on the requested time range. Last 24 hours: query Redis for real-time counters. From 24 hours back to 90 days: query ClickHouse pre-aggregated daily rollups. Beyond 90 days: query ClickHouse monthly rollups. The query planner merges results from multiple sources transparently. For complex queries (e.g., engagement rate trend with audience demographics overlay), the API constructs a ClickHouse SQL query with JOINs across the metrics and demographics tables. Results are cached in a Memcached layer with a 60-second TTL for identical query parameters.
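A sketch of the routing decision inside the query planner; the source names and the exact boundary handling are assumptions:

```python
from datetime import date, timedelta

def plan_sources(start: date, end: date) -> list[str]:
    """Pick data sources for a dashboard query based on the requested range (sketch)."""
    today = date.today()
    sources = []
    if end >= today - timedelta(days=1):
        sources.append("redis")                 # real-time counters, last 24 hours
    if start < today - timedelta(days=1):
        oldest_age = (today - start).days
        sources.append("clickhouse_daily" if oldest_age <= 90 else "clickhouse_monthly")
    return sources

# A 60-day trend touches both the real-time store and the daily rollups:
assert plan_sources(date.today() - timedelta(days=60), date.today()) == ["redis", "clickhouse_daily"]
```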
Database Design
ClickHouse is the primary analytical store with a MergeTree engine. The core table engagement_metrics has columns: account_id (UInt64), platform (Enum8), post_id (String), metric_type (Enum8), value (Float64), timestamp (DateTime), and is partitioned by toYYYYMM(timestamp) with an ORDER BY (account_id, platform, metric_type, timestamp) for efficient range scans. A separate audience_demographics table stores daily snapshots of follower demographics per account. A posts table stores post metadata (content text, media type, publish time) for correlation analysis. ClickHouse's columnar compression achieves 10:1 compression on this data.
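The core table DDL, expressed here as it might be issued from Python with the clickhouse-driver client; the Enum member lists are illustrative assumptions:

```python
from clickhouse_driver import Client  # clickhouse-driver package

client = Client(host="clickhouse")

client.execute("""
CREATE TABLE IF NOT EXISTS engagement_metrics (
    account_id  UInt64,
    platform    Enum8('facebook' = 1, 'twitter' = 2, 'instagram' = 3, 'tiktok' = 4),
    post_id     String,
    metric_type Enum8('likes' = 1, 'comments' = 2, 'shares' = 3, 'reach' = 4, 'impressions' = 5),
    value       Float64,
    timestamp   DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (account_id, platform, metric_type, timestamp)
""")
```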
Redis stores real-time counters using hashes with hourly granularity. A TTL of 48 hours on Redis keys ensures automatic cleanup after the batch layer has processed the data. PostgreSQL stores user accounts, dashboard configurations, scheduled reports, and platform API credentials (encrypted with AES-256). A separate Elasticsearch cluster indexes post content and comments for full-text search and sentiment filtering.
API Design
- GET /api/v1/metrics?account_id={id}&metrics=likes,comments,reach&start={date}&end={date}&granularity=day — Fetch aggregated metrics for an account over a date range
- GET /api/v1/posts/{post_id}/metrics?metrics=likes,comments,shares,impressions — Fetch metrics for a specific post
- GET /api/v1/audience?account_id={id}&dimensions=age,gender,location — Fetch audience demographic breakdown
- POST /api/v1/reports/schedule — Schedule a recurring report; body contains account_ids, metrics, frequency, format (PDF/CSV), recipients
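For illustration, fetching a month of daily metrics through the first endpoint; the host, auth scheme, and response envelope are assumptions:

```python
import requests

resp = requests.get(
    "https://api.example.com/api/v1/metrics",
    params={
        "account_id": 42,
        "metrics": "likes,comments,reach",
        "start": "2024-01-01",
        "end": "2024-01-31",
        "granularity": "day",
    },
    headers={"Authorization": "Bearer <token>"},
    timeout=10,
)
resp.raise_for_status()
for row in resp.json()["data"]:     # assumed response envelope
    print(row["date"], row["metric"], row["value"])
```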
Scaling & Bottlenecks
The ingestion layer is bottlenecked by social platform API rate limits, not by the system's own throughput. Facebook's Graph API allows 200 calls/hour per user token; at 100K tracked accounts, the connectors must keep 100K tokens refreshed and rotate calls across 500+ IP addresses to avoid throttling. A Token Pool Service manages the OAuth token lifecycle and distributes API calls across tokens. For platforms with webhook support, the system can handle 500K events/sec through the webhook receiver fleet without rate limit concerns.
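A minimal sketch of how the Token Pool Service might hand out the least-used token under the 200 calls/hour budget; the in-memory heap and hourly window reset are simplifying assumptions, and a real service would persist usage and coordinate across connector replicas:

```python
import heapq
import time

class TokenPool:
    """Hands out the OAuth token with the most remaining hourly budget (sketch)."""

    def __init__(self, tokens: list[str], calls_per_hour: int = 200):
        self.calls_per_hour = calls_per_hour
        # Min-heap of (calls_used_this_hour, token); least-used token comes out first.
        self.heap = [(0, t) for t in tokens]
        heapq.heapify(self.heap)
        self.window_start = time.monotonic()

    def acquire(self) -> str:
        if time.monotonic() - self.window_start > 3600:   # reset usage each hour
            self.heap = [(0, t) for _, t in self.heap]
            heapq.heapify(self.heap)
            self.window_start = time.monotonic()
        used, token = heapq.heappop(self.heap)
        if used >= self.calls_per_hour:
            raise RuntimeError("all tokens exhausted for this window")
        heapq.heappush(self.heap, (used + 1, token))
        return token
```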
ClickHouse query performance degrades on high-cardinality GROUP BY queries (e.g., metrics grouped by 10M unique post_ids). The solution is materialized views in ClickHouse that pre-aggregate data at common query dimensions (account + day + metric_type) during insertion. These materialized views reduce query scan from millions of rows to thousands. For cross-account competitive benchmarking queries that scan data across multiple accounts, a dedicated ClickHouse cluster with higher memory allocation (256GB per node) handles these heavy queries separately from the dashboard serving cluster.
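A sketch of one such materialized view, pre-aggregating at (account, day, metric_type) on insert, again issued via clickhouse-driver; the view name and the SummingMergeTree engine choice are illustrative:

```python
from clickhouse_driver import Client

client = Client(host="clickhouse")

client.execute("""
CREATE MATERIALIZED VIEW IF NOT EXISTS engagement_daily_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (account_id, metric_type, day)
AS
SELECT
    account_id,
    metric_type,
    toDate(timestamp) AS day,
    sum(value)        AS value
FROM engagement_metrics
GROUP BY account_id, metric_type, day
""")
```

Dashboard queries grouped by account and day can then scan this view instead of the raw metrics table, reducing the scanned row count from millions to thousands as described above.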
Key Trade-offs
- Lambda architecture vs. Kappa: Lambda (separate batch + speed layers) adds operational complexity but guarantees correctness through batch recomputation — pure streaming (Kappa) is simpler but harder to debug and correct data quality issues retroactively
- ClickHouse vs. Druid for OLAP: ClickHouse offers simpler SQL-based querying and easier operations; Druid provides better real-time ingestion — ClickHouse wins for this use case where real-time is handled by Redis
- Polling vs. webhooks: Webhooks provide instant data but not all platforms support them and they require idempotent event handling; polling is universal but introduces latency proportional to poll interval
- Pre-aggregated materialized views vs. raw queries: Materialized views trade storage for query speed — at 10:1 raw-to-aggregated ratio, the additional storage cost is minimal compared to the query performance improvement