System Design: Trending Hashtags System
System design of a real-time trending hashtags service covering stream processing, anomaly detection, geo-aware trending, and abuse prevention at Twitter/Instagram scale.
Requirements
Functional Requirements:
- Detect trending hashtags in real-time from user posts across the platform
- Surface trends at global, country, and city levels
- Display trending hashtags with context (related topic, tweet/post count)
- Detect and filter manipulated/spam trends
- Provide a historical trends API for analytics
- Support custom trending for events (sports, elections, product launches)
Non-Functional Requirements:
- Process 50,000 posts/sec, each potentially containing multiple hashtags
- Trending list must update within 60 seconds of a real shift in conversation
- 99.9% availability; trending must never show stale data beyond 2 minutes
- Spam/manipulation detection must block 99% of coordinated trending attacks
Scale Estimation
At 50,000 posts/sec with an average of 1.5 hashtags per post, the system processes 75,000 hashtag events/sec. Over a 1-hour trending window, that is 270M hashtag events to aggregate. The hashtag cardinality (unique hashtags in any window) is approximately 10M. Storage for the count-min sketch at this cardinality: with 5 hash functions and 1M 4-byte counters per row, each sketch is roughly 20MB and fits trivially in memory. Historical trend data (hourly snapshots of the top 1,000 trends for each of 200 countries) generates approximately 200K records/hour, or 1.7B records/year, requiring roughly 500GB in compressed columnar storage.
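A quick back-of-the-envelope check of these figures (assuming 4-byte counters and the volumes stated above):

```python
# Back-of-the-envelope sizing using the numbers above (4-byte counters assumed).
posts_per_sec = 50_000
hashtags_per_post = 1.5
events_per_sec = posts_per_sec * hashtags_per_post        # 75,000 hashtag events/sec
events_per_hour = events_per_sec * 3600                   # 270M events per 1-hour window

rows, counters_per_row, bytes_per_counter = 5, 1_000_000, 4
sketch_bytes = rows * counters_per_row * bytes_per_counter  # ~20 MB per sketch

snapshot_records = 1_000 * 200                             # top 1,000 trends x 200 countries
records_per_year = snapshot_records * 24 * 365             # ~1.75B rows/year

print(events_per_hour, sketch_bytes / 1e6, records_per_year)
```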
High-Level Architecture
The trending system is a stream processing pipeline with three stages: ingestion, aggregation, and serving. The ingestion layer receives hashtag events from the main Content Service via Kafka. A Hashtag Extraction Service parses posts to extract hashtags, normalizes them (lowercase, remove special characters, merge known aliases like #COVID and #Covid19), and publishes normalized hashtag events to a hashtags-normalized Kafka topic partitioned by hashtag (key-based partitioning ensures all events for the same hashtag land on the same partition).
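A minimal sketch of the extraction and normalization step; the alias dictionary and partition count below are illustrative stand-ins for the PostgreSQL-backed mapping described later:

```python
import re

# Illustrative alias table; the real system loads this from PostgreSQL.
ALIASES = {"#covid19": "#covid", "#coronavirus": "#covid"}

HASHTAG_RE = re.compile(r"#\w+")

def extract_hashtags(post_text: str) -> list[str]:
    """Extract hashtags, lowercase them, drop special characters, and map known aliases."""
    tags = []
    for raw in HASHTAG_RE.findall(post_text):
        tag = raw.lower()
        tags.append(ALIASES.get(tag, tag))
    return tags

def partition_for(tag: str, num_partitions: int = 64) -> int:
    """Key-based partitioning: the same hashtag always lands on the same partition.
    A real Kafka producer would use a stable hash (e.g. murmur2), not Python's salted hash."""
    return hash(tag) % num_partitions

print(extract_hashtags("Stay safe! #COVID19 #StaySafe"))  # ['#covid', '#staysafe']
```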
The aggregation layer runs on Apache Flink with tumbling and sliding windows. A Flink job maintains a count-min sketch per geographic region per time window. Every 30 seconds, the job emits the top-K hashtags (using a min-heap of size K=50) for each region. These candidates are passed to an Anomaly Detection Service that compares current velocity against the hashtag's historical baseline (7-day moving average at the same hour) — a hashtag is trending only if its current rate exceeds 3x the baseline. The serving layer stores the final trending lists in Redis sorted sets keyed by region, refreshed every 30 seconds, and served via a lightweight Trending API.
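A self-contained sketch of the aggregation primitive: a count-min sketch for approximate per-hashtag counts plus a size-K min-heap for the top-K emission. In the real job this state is keyed by region and window inside Flink; the hash construction here is a simplification:

```python
import hashlib
import heapq

class CountMinSketch:
    """Minimal count-min sketch: depth rows of width counters; estimates never undercount."""
    def __init__(self, width: int = 1_000_000, depth: int = 5):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, key: str):
        for i in range(self.depth):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield i, int(digest, 16) % self.width

    def add(self, key: str, count: int = 1) -> None:
        for i, j in self._indexes(key):
            self.rows[i][j] += count

    def estimate(self, key: str) -> int:
        return min(self.rows[i][j] for i, j in self._indexes(key))

def top_k(sketch: CountMinSketch, candidates: set[str], k: int = 50):
    """Keep a size-k min-heap of (estimated count, hashtag) over the candidate set."""
    heap: list[tuple[int, str]] = []
    for tag in candidates:
        heapq.heappush(heap, (sketch.estimate(tag), tag))
        if len(heap) > k:
            heapq.heappop(heap)
    return sorted(heap, reverse=True)
```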
Core Components
Stream Aggregation Engine (Flink)
The Flink cluster runs a stateful streaming job that maintains per-hashtag counters in RocksDB-backed state. The job uses sliding windows of 1 hour with a slide interval of 30 seconds. For each window evaluation, the job extracts the top 200 hashtags per region using a priority queue. Flink's checkpointing (every 30 seconds to S3) ensures exactly-once processing semantics — if a worker fails, it resumes from the last checkpoint without double-counting. The job is parallelized with 200 task slots across 50 TaskManagers, partitioned by hashtag hash to ensure co-location of counts.
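One way to picture the 1-hour window with a 30-second slide is a ring of 120 thirty-second buckets whose sum is the window count. This illustrates the windowing semantics only; it is not Flink's actual RocksDB state layout:

```python
from collections import deque

class SlidingCounter:
    """1-hour window that slides every 30s: 120 buckets of 30 seconds each."""
    def __init__(self, window_s: int = 3600, slide_s: int = 30):
        n_buckets = window_s // slide_s
        self.buckets = deque([0] * n_buckets, maxlen=n_buckets)

    def add(self, count: int = 1) -> None:
        self.buckets[-1] += count      # accumulate into the current 30s bucket

    def advance(self) -> None:
        self.buckets.append(0)         # every 30s: drop the oldest bucket, open a new one

    def total(self) -> int:
        return sum(self.buckets)       # count over the trailing hour
```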
Anomaly Detection & Velocity Scoring
Not every high-count hashtag is trending — a consistently popular hashtag like #love should not appear on the trending list. The Anomaly Detection Service computes a trending score using: score = (current_rate - baseline_rate) / stddev(baseline). A score above 3.0 (z-score) qualifies as trending. The baseline is computed from a 7-day lookback stored in a time-series database (TimescaleDB). The service also applies a minimum volume threshold (at least 1,000 mentions in the window) to filter noise. For breaking events (no historical baseline), a separate heuristic triggers if a previously unseen hashtag exceeds 5,000 mentions in 15 minutes.
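A minimal sketch of the scoring rule, assuming the baseline is a list of same-hour rates from the past 7 days; the thresholds mirror the numbers in the text:

```python
from statistics import mean, stdev

Z_THRESHOLD = 3.0
MIN_VOLUME = 1_000           # minimum mentions per window to be considered at all
BREAKOUT_MENTIONS = 5_000    # unseen-hashtag heuristic: mentions within 15 minutes

def trending_score(current_rate: float, baseline_rates: list[float]) -> float:
    """z-score of the current rate against the 7-day, same-hour baseline."""
    mu, sigma = mean(baseline_rates), stdev(baseline_rates)
    if sigma == 0:
        return float("inf") if current_rate > mu else 0.0
    return (current_rate - mu) / sigma

def is_trending(current_rate: float, window_mentions: int,
                baseline_rates: list[float] | None,
                mentions_last_15m: int = 0) -> bool:
    if window_mentions < MIN_VOLUME:
        return False
    if not baseline_rates:                       # breaking event: no history yet
        return mentions_last_15m >= BREAKOUT_MENTIONS
    return trending_score(current_rate, baseline_rates) >= Z_THRESHOLD
```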
Spam & Manipulation Filter
Coordinated trending attacks (botnets posting the same hashtag) are detected using several signals: account age distribution (if >50% of posts come from accounts created in the last 7 days, flag as suspicious), posting rate per account (more than 10 posts with the same hashtag in an hour), text similarity (identical or near-identical post text using SimHash), and network analysis (accounts following each other in tight clusters detected via graph community algorithms). A random forest classifier trained on historical manipulation campaigns scores each trending candidate — scores above 0.7 trigger automatic suppression, with the candidate logged for human review.
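The text-similarity signal can be sketched with SimHash as below; the whitespace tokenization and the Hamming-distance cutoff are simplifying assumptions, not the production feature set:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash: near-identical texts yield fingerprints with a small Hamming distance."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Posts whose fingerprints differ in <= 3 bits are treated as near-duplicates
# (the cutoff is an assumption for illustration).
```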
Database Design
Real-time trending data lives entirely in Redis. Each region has a sorted set trending:{region_code} with hashtag as member and trending score as the score. A separate hash trending_meta:{hashtag} stores metadata: post_count, related_topic (extracted via LDA topic modeling), representative_post_ids, and first_seen_timestamp. This Redis data is refreshed every 30 seconds by the Flink output consumer. Historical trends are stored in TimescaleDB with a hypertable partitioned by time (1-day chunks) and indexed by region and hashtag. Schema: timestamp, region, hashtag, post_count, trending_score, velocity.
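A sketch of the Redis write and read paths using the redis-py client; the key names follow the schema above, while the connection details and the delete-then-rewrite refresh are assumptions:

```python
import redis  # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def publish_trends(region: str, scored: dict[str, float], meta: dict[str, dict]) -> None:
    """Overwrite a region's trending list; called by the Flink output consumer every 30s."""
    pipe = r.pipeline()
    pipe.delete(f"trending:{region}")
    pipe.zadd(f"trending:{region}", scored)          # member=hashtag, score=trending score
    for tag, fields in meta.items():
        pipe.hset(f"trending_meta:{tag}", mapping=fields)
    pipe.execute()

def top_trends(region: str, limit: int = 20):
    """Read path for the Trending API: highest scores first."""
    return r.zrevrange(f"trending:{region}", 0, limit - 1, withscores=True)
```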
The hashtag alias mapping (merging #Covid, #COVID19, #coronavirus into a canonical form) is stored in a PostgreSQL table with columns: alias_hashtag, canonical_hashtag, confidence_score. This table is loaded into an in-memory Trie structure in the Hashtag Extraction Service for O(k) lookup where k is hashtag length. New aliases are detected automatically by a batch job that clusters co-occurring hashtags using Jaccard similarity on their posting user sets.
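A minimal trie for the alias lookup; the entries are illustrative and would be loaded from the PostgreSQL alias table at service start-up:

```python
class AliasTrie:
    """Character trie mapping alias hashtags to canonical forms; lookup is O(k) in tag length."""
    def __init__(self):
        self.root: dict = {}

    def insert(self, alias: str, canonical: str) -> None:
        node = self.root
        for ch in alias:
            node = node.setdefault(ch, {})
        node["$"] = canonical                  # terminal marker holds the canonical hashtag

    def canonicalize(self, tag: str) -> str:
        node = self.root
        for ch in tag:
            node = node.get(ch)
            if node is None:
                return tag                     # unknown alias: keep as-is
        return node.get("$", tag)

trie = AliasTrie()
trie.insert("#covid19", "#covid")
print(trie.canonicalize("#covid19"))           # '#covid'
```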
API Design
- GET /api/v1/trends?region={country_code}&limit=20 — Fetch current trending hashtags for a region; returns hashtag, score, post_count, context
- GET /api/v1/trends/history?hashtag={tag}&start={ts}&end={ts}&granularity=hour — Historical trend data for a specific hashtag
- GET /api/v1/trends/related?hashtag={tag}&limit=10 — Fetch hashtags frequently co-occurring with the given hashtag
- POST /api/v1/trends/custom — Create a custom promoted trend (for advertisers); body contains hashtag, region, start_time, end_time
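An example client call against the first endpoint; the host and the exact response shape are assumptions for illustration:

```python
import requests  # hypothetical client call against the Trending API

resp = requests.get(
    "https://api.example.com/api/v1/trends",
    params={"region": "US", "limit": 20},
    timeout=2,
)
resp.raise_for_status()
for trend in resp.json()["trends"]:        # response field names are an assumption
    print(trend["hashtag"], trend["score"], trend["post_count"])
```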
Scaling & Bottlenecks
The Flink aggregation job must process 75K events/sec with stateful windowed computation. RocksDB state backend handles the 10M unique hashtags per window efficiently, but checkpoint size grows with state — at 10M keys with 100 bytes each, checkpoints are approximately 1GB, completing within the 30-second checkpoint interval. During major global events (Super Bowl, elections), hashtag volume can spike 10x to 750K events/sec. The Flink job auto-scales by monitoring Kafka consumer lag via a Prometheus metric; when lag exceeds 100K events, Kubernetes HPA scales the TaskManager deployment.
The Redis serving layer handles read traffic from the trending API at approximately 50K requests/sec. A single Redis node handles this comfortably, but for geographic distribution, Redis Cluster with read replicas in 5 regions ensures sub-10ms latency globally. Cache invalidation is not a concern because the data is overwritten every 30 seconds — there is no stale cache problem, only a bounded staleness of 30 seconds by design.
Key Trade-offs
- Count-min sketch vs. exact counts: The sketch answers count queries from a few fixed megabytes of memory with a bounded overcount error (see the bound sketched after this list); exact counts would require O(n) memory for 10M unique hashtags, which is feasible but the sketch is simpler and faster
- Sliding window vs. tumbling window: Sliding windows provide smoother trend transitions but require more state (overlapping windows); tumbling windows are cheaper but create abrupt trend changes at window boundaries
- Z-score anomaly detection vs. absolute thresholds: Z-score adapts to each hashtag's natural volume, preventing always-popular hashtags from dominating — the trade-off is requiring 7 days of historical data before reliable detection
- Flink over Kafka Streams: Flink provides richer windowing semantics and exactly-once processing via checkpoints, but adds operational complexity of managing a separate cluster vs. Kafka Streams running within the application
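For reference, the standard count-min sketch guarantee that motivates the width/depth choice (textbook material, not stated in the text above): with width w and depth d over a stream of N total events,

```latex
% Choosing w = \lceil e/\varepsilon \rceil and d = \lceil \ln(1/\delta) \rceil gives, for every item i,
\hat{a}_i \;\le\; a_i + \varepsilon N \quad \text{with probability at least } 1 - \delta .
% With w = 10^6 and d = 5 as above: \varepsilon \approx 2.7 \times 10^{-6}, \; \delta = e^{-5} \approx 6.7 \times 10^{-3}.
```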