System Design: News Aggregator (Google News-style)
Design a scalable news aggregator like Google News that crawls thousands of news sources, deduplicates and clusters related articles, personalizes the news feed for each user, and delivers breaking news in real time.
Requirements
Functional Requirements:
- Crawl and index news articles from 50,000 news sources (RSS feeds, web pages, APIs)
- Deduplicate and cluster related articles covering the same story (story grouping)
- Personalized news feed: rank articles based on user preferences, reading history, and trending topics
- Breaking news detection: surface rapidly developing stories within 5 minutes of first publication
- Full-text search across all indexed articles
- Support multiple languages with language detection and cross-language story clustering
Non-Functional Requirements:
- Index 5 million new articles per day from 50,000 sources
- New articles from top sources appear in feeds within 5 minutes of publication
- Search results for trending topics returned in under 200ms
- Personalized feed generated in under 500ms for each user
- Support 100 million daily active readers
Scale Estimation
50k sources × 100 articles/day average = 5M articles/day ≈ 58 articles/second. Each article averages 5 KB of text (images excluded) ≈ 290 KB/second of text ingestion. Articles retained for 5 years: 5M × 365 × 5 × 5 KB ≈ 46 TB of text — very manageable. For 100M DAU, each checking news ~10 times/day, personalized feeds are requested ~1B times/day ≈ 11,600 feed requests/second. Each feed generation queries a ranking model over ~50 candidate articles in <500 ms. User reading history: 100M users × 20 articles/day = 2B read events/day ≈ 23k events/second — written to Kafka for feed model updates.
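The back-of-envelope figures above can be checked with a few lines of arithmetic (constants are the section's assumptions, not measured values):

```python
# Scale estimation from the stated assumptions: 50k sources, 100 articles/day
# each, 5 KB/article, 100M DAU, 10 feed checks and 20 reads per user per day.
SECONDS_PER_DAY = 86_400

articles_per_day = 50_000 * 100                      # 5M articles/day
articles_per_sec = articles_per_day / SECONDS_PER_DAY
ingest_kb_per_sec = articles_per_sec * 5             # 5 KB average article

storage_tb = articles_per_day * 365 * 5 * 5 / 1e9    # 5 years of 5 KB articles

feed_requests_per_sec = 100_000_000 * 10 / SECONDS_PER_DAY
read_events_per_sec = 100_000_000 * 20 / SECONDS_PER_DAY

print(f"{articles_per_sec:.1f} articles/s, {ingest_kb_per_sec:.0f} KB/s ingest")
print(f"{storage_tb:.1f} TB of text over 5 years")
print(f"{feed_requests_per_sec:,.0f} feed req/s, {read_events_per_sec:,.0f} read events/s")
```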
High-Level Architecture
The platform separates ingestion, processing, serving, and personalization into independent services. Ingestion: a distributed web crawler (Scrapy-based, 1,000 crawler nodes) fetches RSS feeds on a schedule (popular sources every 15 minutes, others every 2 hours) and full article pages from linked URLs. Fetched articles are queued in Kafka. A deduplication service (Bloom filter + simhash) detects near-duplicate articles (same story, different source) before indexing.
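The simhash half of the dedup step can be sketched as follows — a minimal, dependency-free version in which token weighting, tokenization, and the Hamming-distance threshold of 3 bits are all illustrative choices, not part of the design above:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit simhash fingerprint over whitespace tokens: each token's hash
    votes +1/-1 per bit position; the sign of each column forms the print.
    Near-duplicate articles end up with fingerprints differing in few bits."""
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicate(a: str, b: str, max_bits: int = 3) -> bool:
    """Same story from different sources tends to land within a few bits."""
    return hamming(simhash(a), simhash(b)) <= max_bits
```

In production the fingerprints would be stored alongside the Bloom filter so a new article can be rejected both on exact content hash and on near-duplicate similarity before entering the NLP pipeline.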
Processing: an NLP pipeline processes each article — language detection (fastText), named entity extraction (spaCy), topic classification (BERT fine-tuned on news categories), and sentiment analysis. Processed articles are stored in Elasticsearch (for full-text search) and PostgreSQL (for structured metadata). A story clustering service groups related articles using cosine similarity on SBERT article embeddings — articles within a similarity threshold are assigned to the same story cluster. Breaking news detection: articles publishing at a rate >3× baseline within 15 minutes are flagged as breaking.
Personalization: the ranking model (LambdaMART or a neural ranking model) scores candidate articles for each user based on: topic affinity (learned from reading history), recency (exponential decay), source trust score, and trending score. Candidates are retrieved from a pre-filtered article set (top 1,000 articles per topic per day, cached in Redis). Ranking runs in-process using a TorchServe endpoint for the model.
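The scoring signals above can be illustrated with a simple linear combination. This is a sketch only — the weights and the 6-hour half-life are assumptions, and the section proposes LambdaMART or a neural ranker for the real model — but it shows how topic affinity, recency decay, trust, and trending scores compose:

```python
import math

def score_article(topic_affinity: float, age_hours: float,
                  source_trust: float, trending: float, *,
                  half_life_hours: float = 6.0,
                  w_affinity: float = 0.5, w_trust: float = 0.2,
                  w_trending: float = 0.3) -> float:
    """Illustrative ranking score. Inputs other than age_hours are assumed
    normalized to [0, 1]. Recency is an exponential decay: an article loses
    half its score every half_life_hours."""
    recency = math.exp(-math.log(2) * age_hours / half_life_hours)
    base = (w_affinity * topic_affinity
            + w_trust * source_trust
            + w_trending * trending)
    return base * recency
```

At serving time this function (or the real model) would be applied to the ~50 cached candidates per user, so the cost per feed request stays bounded.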
Core Components
Distributed Crawler
The crawler is a distributed system with 1,000 Scrapy worker nodes coordinated by a central task scheduler (Celery + Redis). The scheduler maintains a priority queue of source URLs ordered by: freshness urgency (top news sources fetched every 15 min), last successful fetch time, and source traffic importance. Crawl politeness: respect robots.txt, rate-limit requests to each domain (max 1 request/5 seconds per domain), and use conditional GET (If-Modified-Since header) to skip unchanged feeds. Crawled content is deduplicated using a distributed Bloom filter (100M entries, 1% false positive rate) to avoid reprocessing identical content. New articles pass to the NLP pipeline via Kafka.
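The politeness rules can be made concrete with two small helpers — a per-domain rate gate implementing the 1-request-per-5-seconds rule, and a builder for the If-Modified-Since header used for conditional GET. Class and function names here are illustrative, not part of the design:

```python
import time
from datetime import datetime, timezone
from email.utils import format_datetime

class PolitenessGate:
    """Per-domain rate limiter: at most one request every min_interval
    seconds per domain (the 1 request / 5 s politeness rule)."""
    def __init__(self, min_interval: float = 5.0):
        self.min_interval = min_interval
        self.last_fetch: dict = {}

    def wait_time(self, domain: str, now=None) -> float:
        """Seconds to wait before this domain may be fetched again (0 = go)."""
        now = time.monotonic() if now is None else now
        last = self.last_fetch.get(domain)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (now - last))

    def record(self, domain: str, now=None) -> None:
        self.last_fetch[domain] = time.monotonic() if now is None else now

def conditional_headers(last_modified) -> dict:
    """Build an If-Modified-Since header (RFC format) so unchanged feeds
    come back as 304 Not Modified and are skipped cheaply."""
    if last_modified is None:
        return {}
    return {"If-Modified-Since": format_datetime(last_modified, usegmt=True)}
```

A worker would call `wait_time` before each fetch, sleep if needed, then `record` after the request; `last_modified` is whatever timestamp was saved from the previous successful fetch of that feed.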
Story Clustering Service
Story clustering identifies that "Biden signs infrastructure bill" from CNN and "President signs $1T infrastructure legislation" from Reuters are the same story. The service uses sentence-BERT (SBERT) to generate 768-dimension article embeddings. New articles are compared against embeddings of articles published in the last 72 hours using an ANN index (FAISS) — find the top-5 most similar articles by cosine similarity. If similarity > 0.85, assign to the same story cluster; otherwise, create a new cluster. The canonical story headline is the most-read version within the cluster. Story clusters are updated as new articles join. The FAISS index is rebuilt every 30 minutes from the last 72 hours of embeddings — at 5M articles/day that is ~15M articles × 768 dims × 4 bytes ≈ 46 GB, so the index is sharded, or restricted to clustering candidates from top sources (~350k articles ≈ 1 GB).
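The assignment step can be sketched without FAISS — a brute-force cosine top-k stand-in for the ANN lookup, with the 0.85 threshold and top-5 cut from the section (the data layout is an assumption):

```python
import math

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_cluster(embedding, recent, threshold: float = 0.85, top_k: int = 5):
    """recent: list of (cluster_id, embedding) for articles from the last 72 h.
    Brute-force stand-in for the FAISS ANN query: rank by cosine similarity,
    join the best cluster among the top-k above threshold, else signal that
    the caller should create a new story cluster (None)."""
    ranked = sorted(recent, key=lambda ce: cosine(embedding, ce[1]), reverse=True)
    for cluster_id, emb in ranked[:top_k]:
        if cosine(embedding, emb) >= threshold:
            return cluster_id
    return None  # no close match: start a new story cluster
```

With FAISS the same logic holds, but the sort is replaced by an index search over normalized vectors (inner product on unit vectors equals cosine similarity).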
Breaking News Detection Service
The service monitors article publication rates per story cluster. A sliding window counter (1-minute and 15-minute windows) tracks the publication rate per topic and per story cluster in Redis. A story cluster whose publication rate spikes to >3× its 7-day average triggers a "breaking" flag. Breaking stories are injected into all users' feeds at the top position, overriding personalization for 30 minutes. The service also monitors social media velocity (Twitter API) as a leading indicator — a topic trending on Twitter with matching articles in the index is pre-emptively flagged as emerging. Breaking news webhooks notify mobile apps for push notification dispatch.
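The spike rule can be sketched with an in-memory sliding window (a stand-in for the Redis implementation; names and the eviction strategy are illustrative):

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events in the last window_secs seconds. In production this
    would be a Redis sorted set keyed by story cluster; here, a deque."""
    def __init__(self, window_secs: float):
        self.window_secs = window_secs
        self.events = deque()

    def add(self, ts: float) -> None:
        self.events.append(ts)

    def count(self, now: float) -> int:
        # Evict timestamps that have slid out of the window, then count.
        while self.events and self.events[0] <= now - self.window_secs:
            self.events.popleft()
        return len(self.events)

def is_breaking(window_count: int, window_secs: float,
                baseline_per_sec: float, spike_factor: float = 3.0) -> bool:
    """Flag a story cluster whose publication rate in the window exceeds
    spike_factor × its 7-day average rate (the section's 3× rule)."""
    return window_count > spike_factor * baseline_per_sec * window_secs
```

A detector would call `count` on the 15-minute window for each active cluster and compare against that cluster's baseline rate computed from the past 7 days.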
Database Design
PostgreSQL: sources (source_id, name, rss_url, language, country, trust_score, fetch_frequency_mins), articles (article_id, source_id, url, title, published_at, language, topic_ids[], story_cluster_id, embedding_s3_key, content_hash), story_clusters (cluster_id, headline, topic, first_published_at, article_count, is_breaking). Elasticsearch: article_content (article_id, title, body_text, entities[], topics[], published_at) — for full-text search and topic aggregations. Redis: topic:{topic_id}:top_articles (sorted set of article_ids by combined recency+engagement score, refreshed every 5 min), user:{user_id}:feed_cache (pre-generated feed, TTL 15 min), breaking:active (set of currently breaking story_cluster_ids). Cassandra: user_reads (user_id, article_id, read_at, time_spent_ms) — append-only reading history.
API Design
- GET /feed?user_id={id}&page={n} — returns personalized feed; served from Redis feed_cache (TTL 15 min) or generated fresh using the ranking model
- GET /stories/{story_cluster_id} — returns all articles in a story cluster, sorted by source trust score and recency
- GET /search?q={query}&lang={lang}&from={date}&to={date} — full-text search via Elasticsearch, paginated
- GET /topics/{topic_id}/feed — returns top articles for a topic; served from Redis sorted set
- POST /users/{user_id}/reads — body: {article_id, time_spent_ms}; records reading event to Kafka and updates the user preference model asynchronously
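The cache-or-generate path behind the feed endpoint can be sketched as a TTL cache (an in-memory stand-in for the Redis user feed cache; the class and its 900-second default mirror the 15-minute TTL above):

```python
import time

class FeedCache:
    """TTL cache sketch of the Redis user:{id}:feed_cache entry.
    On a hit, return the cached feed; on a miss or expiry, call the
    ranking path and cache its result."""
    def __init__(self, ttl_secs: float = 900.0):
        self.ttl_secs = ttl_secs
        self.store = {}  # user_id -> (generated_at, feed)

    def get_or_generate(self, user_id: str, generate, now=None):
        now = time.monotonic() if now is None else now
        hit = self.store.get(user_id)
        if hit is not None and now - hit[0] < self.ttl_secs:
            return hit[1]            # cache hit: no ranking call
        feed = generate(user_id)     # cache miss: run the ranker
        self.store[user_id] = (now, feed)
        return feed
```

The `generate` callback is where candidate retrieval from the per-topic Redis sets and the ranking model call would go.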
Scaling & Bottlenecks
Crawler throughput: 50k sources × fetch_every_15_min = 50k / 900 = 55.6 fetches/second average. With 1,000 crawler nodes, each handles 0.056 fetches/second — far below capacity. The real bottleneck is RSS parsing and NLP pipeline throughput: 57.8 articles/second × ~100 ms of NLP processing per article (language detection + NER + topic classification + SBERT embedding) = 5.78 parallel NLP workers needed. A 10-node NLP cluster provides ~3× headroom for burst traffic.
Feed generation at ~11,600 requests/second: with a 15-minute Redis cache, repeated checks within the same session hit the cache, so only users returning after their cached feed expires trigger fresh generation. If each user's ~10 daily checks collapse into one or two distinct sessions, fresh generation runs at roughly 160M/day ≈ 110k/minute ≈ 1,833/second. A 20-node ranking API fleet handles this. Pre-generating feeds for the top 1M most active users during off-peak hours (3-4 AM local time) further reduces live generation load.
Key Trade-offs
- Aggressive crawling vs. politeness: More frequent crawling ensures fresher content but risks overloading news source servers and getting IP-blocked; respecting crawl delays (and using conditional GET to skip unchanged feeds) is both ethical and practical for sustainable access.
- Story clustering threshold: A high similarity threshold (0.9) creates many small clusters (accurate but misses related coverage); a low threshold (0.7) merges loosely related stories. Domain-specific thresholds (0.85 for politics, 0.75 for sports) better match editorial judgment.
- Real-time personalization vs. cached feeds: Real-time ranking on every feed request provides the freshest personalization signal but adds 100-200ms latency and 20× server load; 15-minute cached feeds are 10× cheaper and imperceptibly less fresh for typical users who check news every few hours.
- Aggregator liability for source content: Displaying article snippets in the aggregator raises copyright concerns ("hot news" doctrine); showing only headlines with deep links to source pages is legally safer but reduces user engagement. Most platforms use a conservative snippet policy (2-3 sentences max) with prominent source attribution.