
System Design: Price Comparison Engine

System design of a price comparison engine that crawls, normalizes, and ranks product prices across thousands of retailers with real-time price tracking and alerts.

16 min read · Updated Jan 15, 2025
system-design · price-comparison · web-crawling · data-pipeline

Requirements

Functional Requirements:

  • Crawl product prices from 5,000+ online retailers
  • Match and deduplicate products across retailers (entity resolution)
  • Display price comparisons sorted by total cost (price + shipping)
  • Price history charts showing trends over 90 days
  • Price drop alerts: notify users when a tracked product drops below threshold
  • Affiliate link integration for revenue generation

Non-Functional Requirements:

  • Crawl 50M product pages daily; update prices at least every 4 hours
  • Search across 200M matched products with sub-200ms latency
  • Price data freshness: 95% of prices updated within the last 6 hours
  • 99.9% availability for the consumer-facing search and comparison pages
  • Handle retailer anti-bot measures gracefully (respect robots.txt, rotate proxies)

Scale Estimation

  • Crawl throughput: 50M pages crawled daily = 579 pages/sec sustained
  • Raw price data: 50M pages × 2KB parsed per page = 100GB/day
  • Matched catalog: 200M unique products × 5 retailer prices on average = 1B price records
  • Storage: 200M products × 3KB metadata = 600GB; 1B price records × 100 bytes = 100GB
  • Price history: 90 days × 6 updates/day × 200M products = 108B data points at 20 bytes each = 2.16TB
  • Search load: 5M DAU × 3 searches = 15M searches/day = 174 QPS
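
As a quick sanity check, the figures above can be reproduced in a few lines of Python (a back-of-envelope sketch using only the numbers already stated in this section):

```python
SECONDS_PER_DAY = 86_400

pages_per_day = 50_000_000
print(pages_per_day / SECONDS_PER_DAY)        # ~579 pages/sec sustained

print(pages_per_day * 2_000 / 1e9)            # 2KB parsed per page -> ~100 GB/day

products = 200_000_000
price_records = products * 5                  # 5 retailer prices per product
print(price_records / 1e9)                    # 1B price records
print(products * 3_000 / 1e9)                 # ~600 GB product metadata
print(price_records * 100 / 1e9)              # ~100 GB current price records

history_points = 90 * 6 * products            # 90 days x 6 updates/day per product
print(history_points * 20 / 1e12)             # ~2.16 TB of history at 20 bytes/point

print(5_000_000 * 3 / SECONDS_PER_DAY)        # 5M DAU x 3 searches -> ~174 QPS
```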

High-Level Architecture

The system has three major pipelines: Crawl, Match, and Serve. The Crawl Pipeline: a distributed web crawler (Scrapy cluster on Kubernetes) fetches product pages from 5,000 retailers. Each retailer has a custom parser (CSS/XPath selectors) that extracts product name, price, availability, shipping cost, and product identifiers (UPC, EAN, MPN). Parsed data is published to Kafka topic raw-prices.
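
To make the per-retailer parser idea concrete, here is a minimal Scrapy sketch of what one such extractor could look like. The spider name, URL, selectors, and field names are all hypothetical; in the real pipeline the yielded item would presumably be serialized and published to the raw-prices Kafka topic by an item pipeline rather than written to Scrapy's default output.

```python
import scrapy


class ExampleRetailerSpider(scrapy.Spider):
    """Illustrative retailer-specific extractor; every selector below is hypothetical."""
    name = "example_retailer"
    start_urls = ["https://www.example-retailer.com/product/12345"]  # placeholder

    def parse(self, response):
        # Each retailer gets its own CSS/XPath selectors tuned to its page layout.
        yield {
            "retailer": self.name,
            "url": response.url,
            "title": response.css("h1.product-title::text").get(default="").strip(),
            "price": response.css("span.price::attr(content)").get(),
            "shipping_cost": response.css("span.shipping-cost::text").get(),
            "availability": bool(response.css("button#add-to-cart")),
            "upc": response.xpath("//meta[@itemprop='gtin12']/@content").get(),
        }
```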

The Match Pipeline consumes raw price data and resolves it to canonical products. Entity resolution uses a multi-signal matching algorithm: exact UPC/EAN match (highest confidence), title + brand fuzzy match (TF-IDF cosine similarity > 0.85), and image similarity (perceptual hash comparison). Matched prices are written to PostgreSQL and the Elasticsearch search index.

The Serve Pipeline handles user-facing queries: search requests go to Elasticsearch with custom scoring (price competitiveness, retailer reliability, data freshness). Price history is served from a time-series database (TimescaleDB). Price alerts are processed by a Kafka consumer that checks each new price against user-configured thresholds.

Core Components

Distributed Web Crawler

The crawler runs on a Kubernetes cluster with 200 worker pods. A URL Frontier (Redis sorted set) manages the crawl schedule: each URL is scored by next_crawl_time, computed from the retailer's crawl frequency (high-traffic retailers every 2 hours, long-tail retailers every 12 hours). Workers pull URLs from the frontier, fetch pages via a rotating proxy pool (10,000 residential proxies), parse content using retailer-specific extractors, and publish results to Kafka. Politeness controls: per-domain rate limiting (max 2 requests/sec/domain) enforced by a Redis token bucket; robots.txt compliance checked before each request; crawl delays respected.
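
A minimal sketch of the frontier and politeness mechanics described above, assuming a local Redis instance; the key names, batch size, and the non-atomic token-bucket update are illustrative (a production implementation would wrap the bucket logic in a Lua script so the read-modify-write is atomic).

```python
import time
import redis

r = redis.Redis()  # assumes a local Redis; key names below are illustrative


def schedule_url(url: str, crawl_frequency_hours: float) -> None:
    """Add or refresh a URL in the frontier, scored by its next crawl time."""
    next_crawl_time = time.time() + crawl_frequency_hours * 3600
    r.zadd("url_frontier", {url: next_crawl_time})


def pop_due_urls(batch_size: int = 100) -> list:
    """Pull URLs whose next_crawl_time has passed (lowest score first)."""
    due = r.zrangebyscore("url_frontier", 0, time.time(), start=0, num=batch_size)
    if due:
        r.zrem("url_frontier", *due)
    return due


def acquire_domain_token(domain: str, rate: float = 2.0, burst: float = 2.0) -> bool:
    """Per-domain token bucket capping requests at `rate` per second (non-atomic sketch)."""
    now = time.time()
    key = f"bucket:{domain}"
    state = r.hgetall(key)
    tokens = float(state.get(b"tokens", burst))
    last = float(state.get(b"last", now))
    tokens = min(burst, tokens + (now - last) * rate)  # refill since last request
    if tokens < 1:
        return False  # caller should re-queue the URL and move on
    r.hset(key, mapping={"tokens": tokens - 1, "last": now})
    return True
```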

Entity Resolution Service

Product matching across retailers is the core intellectual property of a price comparison engine. The matching pipeline: (1) Exact match on standardized product identifiers (UPC, EAN, ISBN) — matches ~40% of products instantly; (2) Fuzzy match on normalized title + brand: titles are cleaned (remove retailer-specific prefixes, normalize units), tokenized, and compared using TF-IDF cosine similarity; (3) For remaining unmatched products, image-based matching using perceptual hashing (pHash) identifies visually identical products. Matched products are assigned to a canonical product ID. A manual review queue handles low-confidence matches (similarity 0.7-0.85).
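
A condensed sketch of the fuzzy title-matching stage (step 2), using scikit-learn's TfidfVectorizer. The 0.85 threshold comes from the description above; the normalization rules and example titles are illustrative, and a real pipeline would fit the vectorizer once over the canonical catalog rather than per comparison.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def normalize_title(title: str) -> str:
    """Cheap normalization: lowercase, strip punctuation, collapse whitespace.
    The real cleaner would also strip retailer-specific prefixes and normalize units."""
    title = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def best_fuzzy_match(candidate: str, canonical_titles: list, threshold: float = 0.85):
    """Return (index, score) of the closest canonical title above the threshold, else None."""
    corpus = [normalize_title(t) for t in canonical_titles] + [normalize_title(candidate)]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    scores = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
    best = int(scores.argmax())
    return (best, float(scores[best])) if scores[best] > threshold else None


print(best_fuzzy_match(
    "Sony WH-1000XM5 Wireless Noise Cancelling Headphones - Black",
    ["Sony WH-1000XM5 Wireless Headphones (Black)", "Bose QuietComfort Ultra"],
))
```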

Price Alert System

Users set price alerts: 'notify me when product X drops below $Y.' Alerts are stored in PostgreSQL: (alert_id, user_id, product_id, threshold_price, channel ENUM('email', 'push', 'sms'), active BOOLEAN). When a new price is published to Kafka, a Price Alert Consumer joins the price event against the alerts table (pre-loaded in a Redis hash keyed by product_id for O(1) lookup). Matching alerts are enqueued in a notification queue (SQS) for delivery. To prevent alert fatigue, each alert has a cooldown period of 24 hours between notifications.
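
A sketch of the alert consumer, assuming kafka-python and redis-py; the topic name, message shape, and Redis key layout are assumptions (the text above only specifies that new prices arrive via Kafka and that alerts are pre-loaded into a Redis hash keyed by product_id). The 24-hour cooldown is modeled here as a SET NX key with a TTL.

```python
import json

import redis
from kafka import KafkaConsumer

r = redis.Redis()
COOLDOWN_SECONDS = 24 * 3600


def enqueue_notification(alert: dict, event: dict) -> None:
    """Stand-in for the SQS publish that hands the alert to the delivery workers."""
    print(f"notify user {alert['user_id']} via {alert['channel']}: {event}")


consumer = KafkaConsumer(
    "matched-prices",                      # hypothetical topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=json.loads,
)

for msg in consumer:
    event = msg.value                      # e.g. {"product_id": ..., "price": ..., "retailer_id": ...}
    raw = r.hget("price_alerts", event["product_id"])
    if raw is None:
        continue                           # nobody is tracking this product
    for alert in json.loads(raw):          # all active alerts for this product
        if event["price"] >= alert["threshold_price"]:
            continue
        # SET NX with a TTL doubles as the 24h cooldown: only the first
        # qualifying price drop within the window enqueues a notification.
        if r.set(f"alert_cooldown:{alert['alert_id']}", 1, nx=True, ex=COOLDOWN_SECONDS):
            enqueue_notification(alert, event)
```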

Database Design

PostgreSQL schema: products table (product_id UUID, canonical_name, brand, category_id, upc VARCHAR, ean VARCHAR, image_url, created_at). retailer_prices table (price_id, product_id FK, retailer_id FK, price DECIMAL, shipping_cost DECIMAL, currency, url, affiliate_url, availability BOOLEAN, last_crawled_at, created_at). retailers table (retailer_id, name, domain, crawl_frequency_hours, reliability_score).

TimescaleDB hypertable for price history: price_history (product_id, retailer_id, price, timestamp) with automatic partitioning by time (daily chunks). Retention policy: raw data for 90 days, then downsampled to daily min/max/avg and retained for 2 years. Elasticsearch index: denormalized product documents with nested retailer_prices array, enabling queries like 'find products where any retailer_price < $50 in category Electronics.'
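
To make that concrete, here is a sketch of the kind of nested query the index supports. Field names follow the schema above, while the index name and category value are placeholders.

```python
# "Find products where any retailer price is under $50 in category Electronics."
query = {
    "bool": {
        "must": [
            {"term": {"category_id": "electronics"}},        # placeholder category value
            {
                "nested": {
                    "path": "retailer_prices",               # mapped as a nested type
                    "query": {"range": {"retailer_prices.price": {"lt": 50}}},
                }
            },
        ]
    }
}

# With the elasticsearch-py 8.x client this would be sent roughly as:
#   from elasticsearch import Elasticsearch
#   Elasticsearch("http://localhost:9200").search(index="products", query=query, size=20)
```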

API Design

  • GET /api/v1/products/search?q={query}&category={id}&price_max=100&sort=price_asc — Search products; each matching product is returned with all of its retailer prices sorted by total cost
  • GET /api/v1/products/{product_id}/prices — All current retailer prices for a product, sorted by total cost (price + shipping)
  • GET /api/v1/products/{product_id}/price-history?period=90d&retailer={id} — Price history chart data from TimescaleDB
  • POST /api/v1/alerts — Create a price alert; body contains product_id, threshold_price, notification_channel
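
A hedged example of how a client might use these endpoints end to end; the host, response envelope, and field names are assumptions layered on the endpoint list above.

```python
import requests

BASE = "https://api.example-price-engine.com"  # placeholder host

# Search for headphones under $100, cheapest total cost first.
resp = requests.get(
    f"{BASE}/api/v1/products/search",
    params={"q": "wireless headphones", "price_max": 100, "sort": "price_asc"},
    timeout=5,
)
results = resp.json().get("results", [])       # assumed response envelope

for product in results:
    best = product["retailer_prices"][0]       # prices assumed sorted by total cost
    print(product["canonical_name"], best["price"] + best["shipping_cost"])

# Track the first result and get an email when it drops below $79.99.
if results:
    requests.post(
        f"{BASE}/api/v1/alerts",
        json={
            "product_id": results[0]["product_id"],
            "threshold_price": 79.99,
            "notification_channel": "email",
        },
        timeout=5,
    )
```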

Scaling & Bottlenecks

The crawl pipeline is constrained by politeness limits (2 req/sec/domain), not system throughput. Scaling crawl coverage requires: (1) parallel crawling across many domains, since the per-domain cap limits each retailer rather than aggregate throughput (at 2 req/sec/domain, sustaining ~579 pages/sec means roughly 300 domains being crawled concurrently); (2) prioritizing high-value pages (products with many watchers or high-traffic categories); (3) using retailer APIs (affiliate feeds) where available instead of HTML crawling — 30% of retailers provide structured data feeds that are 10x more efficient to process.

Entity resolution is the computational bottleneck. Fuzzy matching 50M new price records against 200M canonical products requires efficient candidate generation. The system uses locality-sensitive hashing (LSH) to pre-filter candidates: only products in the same LSH bucket (similar title hash) are compared via full cosine similarity. This reduces comparisons from O(n²) to O(n × k) where k is the average bucket size (~50). LSH indexes are rebuilt nightly.
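
One common instantiation of this pre-filter is MinHash LSH over title tokens; below is a condensed sketch using the datasketch library (the threshold, permutation count, and sample titles are illustrative).

```python
from datasketch import MinHash, MinHashLSH


def title_minhash(title: str, num_perm: int = 128) -> MinHash:
    """MinHash over title tokens; the real pipeline would hash normalized titles."""
    m = MinHash(num_perm=num_perm)
    for token in title.lower().split():
        m.update(token.encode("utf8"))
    return m


# Index the canonical catalog once (rebuilt nightly, per the text above).
lsh = MinHashLSH(threshold=0.7, num_perm=128)   # threshold is illustrative
canonical = {
    "prod-1": "sony wh-1000xm5 wireless headphones black",
    "prod-2": "bose quietcomfort ultra headphones",
}
for pid, title in canonical.items():
    lsh.insert(pid, title_minhash(title))

# At match time, only products that land in the same LSH bucket are scored with
# the full TF-IDF cosine similarity: O(n × k) comparisons instead of O(n²).
candidates = lsh.query(title_minhash("sony wh-1000xm5 wireless headphones"))
print(candidates)
```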

Key Trade-offs

  • Custom parsers per retailer over generic extraction: Higher accuracy (99% vs 80%) but requires maintenance as retailers change their HTML — a parser health monitor alerts when extraction rates drop below a threshold
  • 4-hour price refresh cycle: Balances crawl infrastructure cost with data freshness — affiliate revenue depends on sending users to correct prices, so freshness directly impacts revenue
  • LSH for entity resolution over exact matching only: Captures 30% more product matches than UPC-only matching, but introduces false positives (~2%) that require periodic manual review
  • TimescaleDB for price history over plain PostgreSQL: Native time-series features (automatic partitioning, downsampling, retention policies) simplify operations, but adds another database technology to maintain
