
System Design: Viral Content Detection

System design for detecting viral content in real time, covering engagement velocity tracking, predictive virality models, and automated response systems for social platforms handling billions of interactions.

16 min read · Updated Jan 15, 2025
Tags: system-design, viral-detection, real-time, machine-learning

Requirements

Functional Requirements:

  • Detect posts that are going viral within minutes of the initial surge
  • Classify viral content by type (organic, coordinated, spam-driven, breaking news)
  • Trigger automated responses: CDN pre-warming, content moderation escalation, and creator notifications
  • Provide a real-time viral content dashboard for the operations team
  • Generate post-mortem virality reports with engagement curves and amplification paths
  • Support configurable virality thresholds per content type and region

Non-Functional Requirements:

  • Monitor engagement signals across 2 billion posts in the active window
  • Detect virality within 3 minutes of the engagement surge beginning
  • Process 500K engagement events/sec (likes, shares, comments, views)
  • False positive rate below 5% for viral classification
  • 99.9% availability — missed viral detection has direct revenue and safety impact

Scale Estimation

500K engagement events/sec translates to roughly 43 billion events/day. Each event is approximately 200 bytes (post_id, user_id, event_type, timestamp, metadata), totaling about 8.6TB of raw event data per day. The active monitoring window contains 2 billion posts; maintaining per-post counters in memory requires approximately 64GB (2B posts × 32 bytes for a counter struct with post_id, count, first_event_ts, last_event_ts). Only 0.01% of posts go viral (200K posts/day), so the system must efficiently filter out the 99.99% of posts that show normal engagement patterns. The predictive model runs inference only on candidate posts: approximately 1M posts/day enter the candidate pool, which works out to about 12 inferences/sec.
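These estimates are easy to sanity-check. A minimal sketch of the arithmetic, using only the constants stated above:

```python
# Back-of-envelope check of the scale estimates above.
EVENTS_PER_SEC = 500_000
EVENT_BYTES = 200                   # post_id, user_id, event_type, timestamp, metadata
ACTIVE_POSTS = 2_000_000_000
COUNTER_BYTES = 32                  # post_id, count, first_event_ts, last_event_ts
CANDIDATES_PER_DAY = 1_000_000
SECONDS_PER_DAY = 86_400

events_per_day = EVENTS_PER_SEC * SECONDS_PER_DAY       # ~43.2 billion
raw_tb_per_day = events_per_day * EVENT_BYTES / 1e12    # ~8.6 TB
counter_gb = ACTIVE_POSTS * COUNTER_BYTES / 1e9         # ~64 GB
inference_qps = CANDIDATES_PER_DAY / SECONDS_PER_DAY    # ~11.6 per second

print(f"{events_per_day/1e9:.1f}B events/day, {raw_tb_per_day:.1f}TB/day, "
      f"{counter_gb:.0f}GB of counters, {inference_qps:.0f} inferences/sec")
```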

High-Level Architecture

The viral detection system is a three-stage pipeline: signal collection, velocity computation, and classification. Stage 1 (Signal Collection): all engagement events from the platform flow into a Kafka cluster on an engagements topic partitioned by post_id. A Flink streaming job consumes these events and maintains per-post engagement counters in RocksDB-backed state. The counters track total engagements, engagements per minute (sliding 5-minute window), unique user count (HyperLogLog sketch), and share depth (how many levels of resharing).
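A minimal sketch of the event record and per-post counter state, written as plain Python rather than actual Flink operator code; the field names are illustrative, and datasketch's HyperLogLog stands in for the sketch Flink would keep in state:

```python
from dataclasses import dataclass, field
from typing import Optional
from datasketch import HyperLogLog  # pip install datasketch

@dataclass
class EngagementEvent:
    post_id: str
    user_id: str
    event_type: str                        # "like" | "share" | "comment" | "view"
    timestamp_ms: int
    share_parent_id: Optional[str] = None  # set on reshares; used for share depth

@dataclass
class PostCounters:
    total_engagements: int = 0
    unique_users: HyperLogLog = field(default_factory=lambda: HyperLogLog(p=14))
    share_tree_depth: int = 0
    first_event_ts: Optional[int] = None

def update(state: PostCounters, ev: EngagementEvent, parent_depth: int = 0) -> None:
    """Apply one engagement event to the per-post counters.

    parent_depth is the share depth of the reshared post, which a real job
    would look up from the parent post's state.
    """
    state.total_engagements += 1
    state.unique_users.update(ev.user_id.encode())
    if ev.share_parent_id is not None:
        state.share_tree_depth = max(state.share_tree_depth, parent_depth + 1)
    if state.first_event_ts is None:
        state.first_event_ts = ev.timestamp_ms
```

The per-minute counters are covered in the velocity engine sketch further down.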

Stage 2 (Velocity Detection): every 30 seconds, the Flink job evaluates each active post against velocity thresholds. A post enters the candidate pool if its engagement rate exceeds 10x its predicted baseline (based on the creator's historical average engagement rate and the post's age). The baseline prediction comes from a lightweight gradient boosting model that takes creator follower count, post age, content type, and time of day as features. Candidates are published to a viral-candidates Kafka topic.
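A sketch of the Stage 2 gate. The decay formula below is a hand-rolled stand-in for the gradient boosting baseline model, not the article's actual model; in production the baseline would come from a model.predict() call over the same features:

```python
import math

def predicted_baseline_rate(creator_avg_rate: float, post_age_min: float,
                            content_type_boost: float, time_of_day_boost: float) -> float:
    # Stand-in for the lightweight gradient boosting model: engagement rate
    # decays as the post ages, scaled by content type and time of day.
    age_decay = 1.0 / (1.0 + math.log1p(post_age_min))
    return creator_avg_rate * age_decay * content_type_boost * time_of_day_boost

def is_candidate(observed_rate_per_min: float, baseline_rate_per_min: float,
                 multiplier: float = 10.0) -> bool:
    # Stage 2 rule from the text: observed rate must exceed 10x the baseline.
    return observed_rate_per_min > multiplier * max(baseline_rate_per_min, 1e-6)
```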

Stage 3 (Classification & Response): a Viral Classification Service consumes candidates and runs a deeper analysis — a neural network model examines the engagement graph (who is sharing, are they real accounts, geographic spread pattern) to classify the virality type. Based on classification, automated responses are triggered: CDN Pre-Warmer pre-caches the content at all edge PoPs; Moderation Escalation routes the content for priority human review; Creator Notification Service sends a push notification to the content creator.

Core Components

Engagement Velocity Engine

The velocity engine is the core detection component, implemented as a Flink stateful streaming application. For each post_id, it maintains a state object containing: total_engagements (Long), minute_buckets (a circular buffer of 60 one-minute counters covering the last hour), unique_users (HyperLogLog with 14-bit precision, ~0.8% error), share_tree_depth (Int), and first_engagement_timestamp. The velocity metric is computed as: velocity = engagements_last_5min / max(1, engagements_previous_5min). A velocity above 3.0 for two consecutive evaluations, combined with the 10x-baseline check from Stage 2, triggers candidate emission. State for posts older than 48 hours is evicted via TTL to bound memory usage.
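A plain-Python sketch of the minute-bucket ring and the velocity formula above, standing in for the Flink state object:

```python
from collections import deque

class MinuteBuckets:
    """Circular buffer of 60 one-minute engagement counters (the last hour)."""

    def __init__(self) -> None:
        self.buckets = deque([0] * 60, maxlen=60)
        self.current_minute = -1

    def add(self, event_minute: int) -> None:
        if self.current_minute < 0:
            self.current_minute = event_minute
        # Rotate forward, zero-filling any minutes with no events.
        for _ in range(min(60, max(0, event_minute - self.current_minute))):
            self.buckets.append(0)
        self.current_minute = max(self.current_minute, event_minute)
        self.buckets[-1] += 1

    def velocity(self) -> float:
        last_5 = sum(list(self.buckets)[-5:])
        prev_5 = sum(list(self.buckets)[-10:-5])
        # velocity = engagements_last_5min / max(1, engagements_previous_5min)
        return last_5 / max(1, prev_5)
```

The evaluator would call velocity() every 30 seconds and emit a candidate once it stays above 3.0 on two consecutive evaluations.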

Virality Prediction Model

The prediction model runs on candidate posts to forecast whether the current surge will sustain. Features include: current velocity (engagements/min), acceleration (second derivative of engagement rate), unique user ratio (unique engagers / total engagements — low ratio indicates bot activity), geographic entropy (Shannon entropy of engager locations — high entropy suggests organic spread), share depth distribution, and creator historical virality rate. The model is an XGBoost ensemble trained on 6 months of historical viral events (~120K positive examples). It outputs a virality probability and an estimated peak engagement count. Posts with probability > 0.7 are classified as viral.
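Two of these features are easy to show concretely. A sketch of the unique user ratio and geographic entropy computation, followed by a hedged outline of the inference call; the feature order and model file name are assumptions:

```python
import math
from collections import Counter
from typing import List

def geographic_entropy(engager_countries: List[str]) -> float:
    """Shannon entropy of engager locations; high entropy suggests organic spread."""
    counts = Counter(engager_countries)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def build_features(velocity: float, acceleration: float, unique_users: int,
                   total_engagements: int, countries: List[str],
                   creator_viral_rate: float) -> List[float]:
    return [
        velocity,
        acceleration,
        unique_users / max(1, total_engagements),  # low ratio hints at bot activity
        geographic_entropy(countries),
        creator_viral_rate,
    ]

# Inference with the trained ensemble (model file name is a placeholder):
#   import numpy as np, xgboost as xgb
#   booster = xgb.Booster(); booster.load_model("virality_model.json")
#   prob = booster.predict(xgb.DMatrix(np.asarray([features])))[0]
#   is_viral = prob > 0.7
```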

Automated Response Orchestrator

When a post is confirmed viral, the Orchestrator triggers a configurable sequence of actions. For CDN pre-warming: it calls the CDN API (Cloudflare or Fastly) to push the content to all 200+ edge PoPs, reducing origin load before the traffic spike hits. For content moderation: it creates a Priority 1 review ticket in the moderation queue, attaching the virality classification and engagement graph visualization. For creator notifications: it sends a real-time push via Firebase Cloud Messaging with engagement stats. For ad systems: it signals the ad ranking pipeline to increase ad load on the viral post's page/screen. Each action is idempotent and tracked in a PostgreSQL actions log to prevent duplicate triggers.
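One way to get that idempotency is to let the actions log itself arbitrate: each action first claims a (post_id, action) row, and only the claimant runs the side effect. A sketch, where the table, columns, and dispatch() helper are assumptions:

```python
import psycopg2

ACTIONS = ("cdn_prewarm", "moderation_escalate", "notify_creator", "signal_ads")

def claim_action(conn, post_id: str, action: str) -> bool:
    """Insert-first idempotency: a unique constraint on (post_id, action)
    makes duplicate triggers no-ops."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO viral_actions_log (post_id, action, triggered_at)
            VALUES (%s, %s, now())
            ON CONFLICT (post_id, action) DO NOTHING
            RETURNING post_id
            """,
            (post_id, action),
        )
        claimed = cur.fetchone() is not None
    conn.commit()
    return claimed  # True only for the first caller

def orchestrate(conn, post_id: str) -> None:
    for action in ACTIONS:
        if claim_action(conn, post_id, action):
            dispatch(action, post_id)  # hypothetical: calls the CDN/moderation/FCM APIs
```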

Database Design

Real-time engagement state lives in Flink's RocksDB state backend, checkpointed to S3 every 60 seconds. This is the primary data store for the velocity engine — no external database is in the hot path. Viral event records (confirmed viral posts with classification, peak metrics, timeline) are written to PostgreSQL with columns: post_id, detected_at, peak_velocity, total_engagements_at_peak, virality_type, classification_confidence, response_actions_taken (JSONB), and resolved_at. A TimescaleDB hypertable stores the minute-by-minute engagement time series for each viral post, enabling post-mortem curve analysis.
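A sketch of the two tables as DDL, executed from Python to match the other examples; the column types beyond the names given in the text, and the time-series table's name, are assumptions:

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS viral_events (
    post_id                   TEXT PRIMARY KEY,
    detected_at               TIMESTAMPTZ NOT NULL,
    peak_velocity             DOUBLE PRECISION,
    total_engagements_at_peak BIGINT,
    virality_type             TEXT,
    classification_confidence DOUBLE PRECISION,
    response_actions_taken    JSONB,
    resolved_at               TIMESTAMPTZ
);

CREATE TABLE IF NOT EXISTS viral_engagement_ts (
    post_id     TEXT NOT NULL,
    minute      TIMESTAMPTZ NOT NULL,
    engagements BIGINT NOT NULL
);
-- TimescaleDB: turn the time series table into a hypertable partitioned on time.
SELECT create_hypertable('viral_engagement_ts', 'minute', if_not_exists => TRUE);
"""

with psycopg2.connect("dbname=viral") as conn:  # DSN is a placeholder
    with conn.cursor() as cur:
        cur.execute(DDL)
```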

The creator baseline model uses a Redis hash per creator: creator_stats:{user_id} with fields avg_engagement_rate, median_engagement_rate, post_count, and last_viral_post_date. These are updated by a daily batch job that processes the previous day's engagement data from the data warehouse. Historical viral events for model training are stored in Parquet files on S3, partitioned by date, and processed by a weekly Spark training pipeline that retrains the XGBoost model.
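A redis-py sketch of the per-creator hash; the connection details and the shape of the stats dict are assumptions:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def update_creator_baseline(user_id: str, stats: dict) -> None:
    # Written by the daily batch job; field names match the text.
    r.hset(f"creator_stats:{user_id}", mapping={
        "avg_engagement_rate": stats["avg_engagement_rate"],
        "median_engagement_rate": stats["median_engagement_rate"],
        "post_count": stats["post_count"],
        "last_viral_post_date": stats["last_viral_post_date"],
    })

def load_creator_baseline(user_id: str) -> dict:
    # Read by the velocity engine's baseline check.
    return r.hgetall(f"creator_stats:{user_id}")
```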

API Design

  • GET /api/v1/viral/active?limit=50&sort=velocity — Fetch currently viral posts sorted by engagement velocity; returns post_id, velocity, total_engagements, virality_type, detected_at (example call after this list)
  • GET /api/v1/viral/{post_id}/timeline?granularity=minute — Fetch minute-by-minute engagement timeline for a viral post
  • POST /api/v1/viral/thresholds — Update virality detection thresholds; body contains velocity_multiplier, min_engagements, candidate_confidence; applied without pipeline restart
  • GET /api/v1/viral/report?start={date}&end={date} — Generate a virality report for the date range with top viral posts, categories, and response effectiveness metrics
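A sketch of calling the first endpoint; the host is a placeholder and the response shape simply mirrors the fields listed above, with invented values:

```python
import requests

resp = requests.get(
    "https://api.example.com/api/v1/viral/active",  # placeholder host
    params={"limit": 50, "sort": "velocity"},
    timeout=5,
)
resp.raise_for_status()
top_posts = resp.json()
# Illustrative shape (values invented):
# [{"post_id": "p_8841", "velocity": 7.4, "total_engagements": 412000,
#   "virality_type": "organic", "detected_at": "2025-01-15T09:32:11Z"}, ...]
```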

Scaling & Bottlenecks

The primary bottleneck is maintaining per-post state for 2 billion active posts in the Flink job. RocksDB handles this efficiently with on-disk state and an LRU block cache — only hot posts (those receiving engagements recently) are in memory. With 64GB block cache per TaskManager and 100 TaskManagers, the system keeps the most active 500M posts' state in memory. Cold posts (no engagement in the last hour) are served from SSD-backed RocksDB with sub-millisecond reads. Checkpointing 2 billion state entries takes approximately 45 seconds with incremental checkpoints (only changed state since the last checkpoint is uploaded to S3).
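In PyFlink terms, the state backend and checkpointing setup described above looks roughly like the sketch below; the checkpoint path is a placeholder, and checkpoint storage is usually configured in flink-conf.yaml rather than in code:

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(500)  # one subtask per Kafka partition (see below)

# RocksDB keeps state on local SSD with an in-memory block cache for hot posts;
# incremental checkpointing uploads only files changed since the last checkpoint.
env.set_state_backend(EmbeddedRocksDBStateBackend(enable_incremental_checkpointing=True))
env.enable_checkpointing(60_000)  # checkpoint every 60 seconds, as described above

# In flink-conf.yaml (placeholder bucket):
#   state.checkpoints.dir: s3://viral-detection/checkpoints
```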

Kafka throughput at 500K events/sec requires careful partition design. With 500 partitions on the engagements topic and an average message size of 200 bytes, each partition handles 1K events/sec — well within Kafka's single-partition throughput. The Flink job's parallelism matches the partition count (500 task slots) to ensure each partition is processed by exactly one Flink subtask, maintaining per-post state locality.
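A producer-side sketch of that keying scheme using kafka-python; the broker address is a placeholder:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],  # placeholder broker
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish(event: dict) -> None:
    # Keying by post_id hashes all of a post's events to the same partition,
    # so exactly one Flink subtask owns that post's state.
    producer.send("engagements", key=event["post_id"], value=event)
```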

Key Trade-offs

  • Velocity-based detection vs. absolute thresholds: Velocity (rate of change) detects emerging virality for both small and large accounts, while absolute thresholds would miss viral posts from small creators — the trade-off is higher computational cost for maintaining per-post rolling windows
  • HyperLogLog for unique users vs. exact count: HLL provides 0.8% error with 16KB memory per post vs. a full set that could require MB per post — the approximation is acceptable for virality detection
  • Flink state in RocksDB vs. external Redis: Co-locating state with the processing engine eliminates network round-trips and provides exactly-once guarantees — the trade-off is that state is not queryable externally without a separate sink
  • 3-minute detection window vs. faster: Faster detection (sub-minute) requires higher evaluation frequency and more false positives from transient spikes — 3 minutes balances detection speed with confidence
