System Design: Web Crawler
Design a scalable distributed web crawler capable of crawling billions of pages. Deep dive into URL frontier management, politeness policies, deduplication, content extraction, and scheduling strategies for large-scale web indexing.
Requirements
Functional Requirements:
- Crawl the open web starting from seed URLs, following links to discover new pages
- Respect robots.txt rules and crawl-delay directives per domain
- Deduplicate pages: avoid re-crawling identical content regardless of URL
- Extract and index page content, metadata, and outgoing links
- Support recrawl scheduling: important pages crawled frequently, others less so
- Store crawled content for downstream indexing (search engine, ML training corpus)
Non-Functional Requirements:
- Crawl at least 1B pages, targeting 100M pages/day throughput
- Politeness: no more than 1 request/second per domain by default; respect crawl-delay
- Deduplication must scale to 100B URL hashes with no false negatives (a URL already crawled must never be reported as new, which would trigger re-crawls)
- Fault-tolerant: worker failures must not cause URL loss or re-crawl storms
- Crawled content stored durably with original URL, timestamp, and HTTP metadata
Scale Estimation
- Throughput: 100M pages/day = ~1,157 pages/second.
- Bandwidth: average page size 200KB of HTML = 20TB of HTML/day.
- Link extraction: ~50 outgoing links per page = 5B URL discoveries/day.
- URL deduplication set: 100B URLs × 8 bytes per hash = 800GB, too large for pure in-memory storage.
- Net new frontier URLs after deduplication: ~10% of discoveries = 500M new URLs/day.
- DNS lookups: 1,157 pages/second × 100% new domains (worst case) = 1,157 lookups/second, cached heavily in practice.
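The same arithmetic as a quick sanity check, with the figures taken directly from the estimates above:

```python
# Back-of-the-envelope numbers for the crawl pipeline.
pages_per_day = 100_000_000
pages_per_sec = pages_per_day / 86_400             # ~1,157 pages/second
html_tb_per_day = pages_per_day * 200_000 / 1e12   # 200KB avg page -> 20 TB/day
links_per_day = pages_per_day * 50                 # 5B URL discoveries/day
new_urls_per_day = int(links_per_day * 0.10)       # ~10% survive dedup -> 500M/day
hash_store_gb = 100e9 * 8 / 1e9                    # 100B hashes x 8 bytes -> 800 GB

print(f"{pages_per_sec:,.0f} pages/s | {html_tb_per_day:.0f} TB HTML/day | "
      f"{new_urls_per_day:,} new URLs/day | {hash_store_gb:,.0f} GB hash store")
```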
High-Level Architecture
The crawler is organized around three core systems: the URL Frontier, the Fetcher Fleet, and the Content Pipeline. The URL Frontier manages the prioritized queue of URLs to crawl. The Fetcher Fleet is a distributed set of workers that fetch URLs from the Frontier, obey politeness rules, and emit fetched pages to the Content Pipeline. The Content Pipeline extracts links, deduplicates content, and writes to durable storage.
URL Frontier design is the most critical architectural decision. The Frontier must prioritize high-value URLs (PageRank-estimated, recency signals), enforce politeness (rate-limit per domain), and scale to billions of pending URLs. The design uses a two-level queue structure: front queues (priority queues ordered by URL importance) and back queues (one per domain, to enforce per-domain rate limiting). The URL Dispatcher routes URLs from front queues to back queues by domain, ensuring each domain's back queue is drained at the configured rate.
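A toy in-memory version of the two-level structure makes the routing concrete; the production design maps front queues to Redis and back queues to Kafka, as described under Core Components:

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse

front = []                  # front queue: max-heap ordered by priority score
back = defaultdict(deque)   # back queues: one FIFO per domain

def add_url(url: str, score: float):
    heapq.heappush(front, (-score, url))        # negate: heapq is a min-heap

def dispatch(batch: int = 100):
    """Drain the highest-priority URLs into per-domain back queues."""
    while front and batch > 0:
        _, url = heapq.heappop(front)
        back[urlparse(url).netloc].append(url)  # politeness applies per back queue
        batch -= 1
```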
Politeness is enforced at the back queue level: each domain has a back queue and a last-fetch timestamp. The Fetcher only pulls from a domain's queue when now() - last_fetch_timestamp >= crawl_delay. The robots.txt for each domain is fetched once per crawl cycle and cached; all URL candidates for a domain are filtered against its robots.txt rules before being added to the Frontier.
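A minimal sketch of the politeness gate, assuming per-domain back queues and a shared last-fetch map (a Redis hash in the real design):

```python
import time
from collections import deque

last_fetch = {}    # domain -> unix timestamp of the last request to it
crawl_delay = {}   # domain -> seconds, parsed from robots.txt

def try_pull(back_queues: dict, domain: str):
    """Return the next URL for `domain` only if its politeness window has elapsed."""
    delay = crawl_delay.get(domain, 1.0)          # default: 1 request/second
    if time.time() - last_fetch.get(domain, 0.0) < delay:
        return None                               # too soon; try another domain
    queue = back_queues.get(domain)
    if not queue:
        return None                               # nothing pending for this domain
    last_fetch[domain] = time.time()
    return queue.popleft()
```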
Core Components
URL Frontier
A distributed system combining Redis (for hot/small priority queues) and Kafka (for large URL queues with durable storage). Front queues are Redis sorted sets scored by URL priority (computed from estimated PageRank, domain authority, and last-modified hints). Back queues (per-domain rate-limiting queues) use Kafka topics partitioned by domain hash, with consumer groups for each Fetcher region. A dedicated Frontier Manager service reads from front queues, enforces robots.txt and politeness, and routes URLs to the appropriate domain partition in Kafka. The frontier state (pending URL count, per-domain last-crawl timestamps) is persisted to PostgreSQL for recovery after system restarts.
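A sketch of the dispatch step using redis-py and kafka-python; the key name front:high and the topic frontier.urls are illustrative choices, not fixed by the design:

```python
import redis
from urllib.parse import urlparse
from kafka import KafkaProducer   # kafka-python client

r = redis.Redis(host="localhost", port=6379)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def dispatch_batch(batch_size: int = 500):
    # Pop the highest-scored URLs from a front queue (a Redis sorted set
    # scored by estimated page value).
    for member, _score in r.zpopmax("front:high", batch_size):
        domain = urlparse(member.decode()).netloc
        # Keying by domain makes Kafka's default partitioner send all of a
        # domain's URLs to the same partition, so a single consumer owns
        # that domain's rate limiting.
        producer.send("frontier.urls", key=domain.encode(), value=member)
    producer.flush()
```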
Fetcher Fleet
Stateless worker processes that consume URL assignments from Kafka, perform HTTP fetches, and emit results. Each Fetcher resolves DNS (with a local DNS cache to reduce resolver load), checks the domain's politeness timer (stored in a shared Redis hash, keyed by domain or by resolved IP to cover shared hosting), sends the HTTP request with an identifying crawler User-Agent (the same token the crawler matches against robots.txt rules), and handles redirects (following up to 3 hops). Fetch results (HTML body, HTTP status, headers, final URL after redirects, fetch timestamp) are published to a Kafka crawled_pages topic. Failed fetches (timeouts, 5xx responses, DNS failures) are reported to the Retry Manager with a back-off schedule.
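A minimal Fetcher sketch using the requests library; the User-Agent string is a placeholder, and robots.txt filtering is assumed to have happened upstream in the Frontier:

```python
import requests

USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)"   # hypothetical identity

session = requests.Session()
session.max_redirects = 3                  # follow at most 3 redirect hops
session.headers["User-Agent"] = USER_AGENT

def fetch(url: str, timeout: float = 10.0) -> dict:
    """Fetch one URL; return a result record or an error marker for the Retry Manager."""
    try:
        resp = session.get(url, timeout=timeout, allow_redirects=True)
        return {
            "url": url,
            "final_url": resp.url,         # URL after redirects
            "status": resp.status_code,
            "headers": dict(resp.headers),
            "body": resp.content,
        }
    except requests.TooManyRedirects:
        return {"url": url, "error": "redirect_loop"}
    except (requests.Timeout, requests.ConnectionError) as exc:
        return {"url": url, "error": type(exc).__name__}   # retried with back-off
```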
Deduplication Service
Operates at two levels. (1) URL deduplication: before adding a URL to the Frontier, check whether it has been crawled or is pending. A distributed Bloom filter (partitioned across Redis nodes, 100B capacity, 1% false positive rate, roughly 9.6 bits per element or ~120GB total) provides O(1) URL seen-checks. False positives cause missed crawls of rare URLs, which is acceptable. Definitive deduplication uses a key-value store (DynamoDB or RocksDB) mapping URL hash → last_crawl_timestamp. (2) Content deduplication: after fetching, compute the SimHash of the page content. Near-duplicate pages (Hamming distance between SimHashes < 3) are identified by a SimHash index and marked as duplicates without being indexed, preventing mirror sites and boilerplate pages from polluting the index.
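A minimal 64-bit SimHash sketch; hashing tokens with MD5 is an arbitrary illustrative choice, and real implementations typically weight tokens (e.g., by TF-IDF):

```python
import hashlib

def simhash64(tokens) -> int:
    """64-bit SimHash: near-identical token streams yield hashes that differ
    in only a few bit positions."""
    v = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

a = simhash64("the quick brown fox jumps over the lazy dog".split())
b = simhash64("the quick brown fox jumped over the lazy dog".split())
print(hamming(a, b))   # small for near-duplicates; ~32 on average for unrelated pages
```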
Database Design
Frontier state in PostgreSQL: crawl_frontier (url_hash BIGINT PK, url TEXT, domain, priority FLOAT, added_at, last_crawled_at nullable, crawl_count INT, status ENUM(pending, in_flight, crawled, failed)). This table is the authoritative URL registry; the Bloom filter and Kafka queues are derived from it. At 100B URLs the full registry is far too large for PostgreSQL, so only active and pending URLs (< 1B) are kept here; completed and cold URLs are archived to S3 Parquet files partitioned by crawl date.
Crawled content: raw HTML stored in S3 (s3://crawl-data/{date}/{url_hash}.html.gz). Metadata in a Parquet store (Hive/S3): {url, final_url, fetch_timestamp, http_status, content_type, content_length, simhash, outlinks[]}. This Parquet store is consumed by downstream indexing pipelines. robots.txt cache: robots_cache (domain, fetched_at, rules JSONB, crawl_delay_seconds) in Redis with a 24-hour TTL.
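A sketch of writing one batch of fetch metadata with pyarrow; the column set mirrors the schema above, while the values and output filename are illustrative (the real layout partitions files by crawl date):

```python
import pyarrow as pa
import pyarrow.parquet as pq

batch = pa.table({
    "url": ["https://example.com/a"],
    "final_url": ["https://example.com/a"],
    "fetch_timestamp": [1700000000],
    "http_status": [200],
    "content_type": ["text/html"],
    "content_length": [18_234],
    "simhash": [0x5F3A6C21D4E570B8],   # 64-bit SimHash of the page content
    "outlinks": [["https://example.com/b", "https://example.com/c"]],
})
pq.write_table(batch, "part-000.parquet")
```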
API Design
POST /api/v1/frontier/seed — internal API to add seed URLs; body: {urls[], priority}; idempotent (duplicates ignored).
GET /api/v1/frontier/stats — returns current frontier size, per-domain queue depths, and crawl rate metrics.
GET /api/v1/crawl/{urlHash}/content — returns the most recently crawled content and metadata for a URL hash.
PUT /api/v1/frontier/recrawl-policy — updates the recrawl frequency policy for a domain or URL pattern.
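Example client calls against these endpoints, assuming a hypothetical internal host:

```python
import requests

BASE = "http://frontier.internal:8080"   # hypothetical deployment address

# Seed the frontier; the endpoint is idempotent, so retries are safe.
resp = requests.post(f"{BASE}/api/v1/frontier/seed",
                     json={"urls": ["https://en.wikipedia.org/"], "priority": 0.9})
resp.raise_for_status()

# Inspect frontier size and crawl-rate metrics.
print(requests.get(f"{BASE}/api/v1/frontier/stats").json())
```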
Scaling & Bottlenecks
The Frontier's URL dispatch is the central bottleneck. Routing 1,157 URLs/second from priority queues to per-domain back queues while respecting politeness requires the Frontier Manager to maintain per-domain state in memory. A single Frontier Manager handles ~10k active domains; for 100M unique domains, multiple Frontier Manager instances partition the domain space using consistent hashing.
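A minimal consistent-hash ring for assigning domains to Frontier Manager instances; instance names and the virtual-node count are illustrative:

```python
import bisect
import hashlib

class Ring:
    """Consistent hashing: adding or removing a manager remaps only ~1/N of domains."""
    def __init__(self, managers, vnodes: int = 100):
        # Place each manager at many points on the ring to smooth the load.
        self._points = sorted(
            (self._h(f"{m}#{i}"), m) for m in managers for i in range(vnodes)
        )
        self._keys = [k for k, _ in self._points]

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def owner(self, domain: str) -> str:
        idx = bisect.bisect(self._keys, self._h(domain)) % len(self._keys)
        return self._points[idx][1]

ring = Ring(["fm-1", "fm-2", "fm-3"])
print(ring.owner("example.com"))   # same answer from every caller, no coordination
```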
DNS resolution is a hidden bottleneck. At 1,157 fetches/second across diverse domains, each uncached lookup can take 20-100ms. The Fetcher fleet uses a local caching DNS resolver (BIND or dnscache) with aggressive TTL caching. Most crawl traffic is concentrated on a small number of high-link-count domains (Wikipedia, Reddit, news sites), so cache hit rates are high in practice despite the large domain namespace.
Content deduplication via SimHash requires comparing against a large SimHash corpus. A SimHash index using Hamming distance lookup (implemented as a hash table of 64-bit values with bit-flip enumeration) handles billions of hashes. This is partitioned across multiple SimHash servers by the high-order bits of the hash. A 1% miss rate on near-duplicate detection is acceptable — some duplicate pages in the index are tolerable.
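A sketch of the bit-flip enumeration lookup: for Hamming distance < 3, probe the exact hash plus every 1- and 2-bit variant, i.e. 1 + 64 + C(64,2) = 2,081 set lookups per query:

```python
from itertools import combinations

index = set()   # in production: partitioned across servers by high-order hash bits

def add(h: int):
    index.add(h)

def near_duplicates(h: int) -> list:
    """Find stored 64-bit SimHashes within Hamming distance 2 of `h`."""
    candidates = [h]
    candidates += [h ^ (1 << i) for i in range(64)]
    candidates += [h ^ (1 << i) ^ (1 << j) for i, j in combinations(range(64), 2)]
    return [c for c in candidates if c in index]

add(0b1011)
print(near_duplicates(0b1001))   # distance 1 -> [11], reported as a near-duplicate
```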
Key Trade-offs
- Breadth-first vs. priority-based crawl: Pure BFS processes URLs in discovery order, which may spend too much time on low-quality pages; priority-based crawling (PageRank estimate, domain authority) focuses on high-value pages first but requires periodic re-prioritization as the web graph is discovered.
- Bloom filter false positives: A 1% false positive rate on the URL Bloom filter means roughly 1% of genuinely new URLs are falsely reported as seen and skipped. This is acceptable for a general web crawler but not for a compliance crawler (e.g., a broken-link checker); the rate is configurable and traded against memory cost.
- Centralized vs. distributed frontier: A centralized frontier (single master) is simple but limits throughput to one machine's capacity; a partitioned frontier (domain-sharded) scales horizontally but requires consistent hashing and adds complexity to politeness enforcement (a domain's rate limit state must be co-located with its queue).
- Recrawl freshness vs. crawl efficiency: Recrawling all pages frequently keeps the index fresh but wastes bandwidth on rarely updated content; an adaptive recrawl schedule (change detection from Last-Modified headers, ETags, and historical change frequency) directs recrawl resources efficiently but requires tracking page change history.