
System Design: Search Ranking System

Design a search ranking system that combines text relevance, user signals, and machine learning models to deliver personalized, high-quality results at scale. Covers Learning to Rank, feature engineering, and online experimentation.

15 min read · Updated Jan 15, 2025
Tags: system-design · search-ranking · machine-learning · learning-to-rank · pagerank · recommendation

Requirements

Functional Requirements:

  • Rank search results by combining textual relevance, popularity, recency, and personalization signals
  • Support Learning to Rank (LTR) models trained on click-through data
  • Enable A/B testing of different ranking models simultaneously
  • Re-rank results within 20ms budget
  • Incorporate real-time signals (trending content, breaking news boost)
  • Support business rules (sponsored results, blacklisted URLs, freshness boosts)

Non-Functional Requirements:

  • Re-ranking latency under 20ms p99 for top-200 candidates
  • Model update cycle: daily for batch models, hourly for online models
  • Support 50,000 ranking QPS
  • Zero-downtime model deployment
  • Explainability: log which signals drove each result's score

Scale Estimation

At 50,000 QPS with 200 candidates per query, the ranking system scores 10 million (query, document) pairs per second. Each feature vector has ~200 features. A gradient-boosted tree model (XGBoost, 1,000 trees of depth 6) scores one candidate in ~50 microseconds on CPU, so scoring 200 candidates takes ~10ms per query thread, i.e. ~100 QPS per thread. With 8 ranking threads per server, one server handles ~800 QPS. For 50,000 QPS, ~63 ranking servers suffice, plus headroom.
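
The capacity figures above follow from a quick back-of-envelope calculation (all constants are the assumed estimates from this section, not measurements):

```python
# Back-of-envelope capacity check using the figures assumed above.
qps = 50_000
candidates = 200
pairs_per_sec = qps * candidates                      # scored (query, doc) pairs/s
us_per_candidate = 50                                 # ~50 µs per tree-model score
ms_per_query = candidates * us_per_candidate / 1000   # per-thread time per query
qps_per_thread = 1000 / ms_per_query                  # queries/s one thread sustains
qps_per_server = 8 * qps_per_thread                   # 8 ranking threads per server
servers_needed = qps / qps_per_server                 # fleet size before headroom
```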

High-Level Architecture

Ranking is a multi-stage pipeline layered on top of first-stage retrieval. Stage 1 (retrieval) uses BM25 or a learned dense retrieval model (bi-encoder) to fetch the top-1,000 candidates from the index. Stage 2 (pre-ranking) uses a lightweight point-wise model to reduce 1,000 candidates to 200. Stage 3 (ranking) applies the full LTR model with 200+ features. Stage 4 (post-ranking) applies business rules, diversity filters, and personalization overlays.
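
The four-stage cascade can be sketched as a chain of narrowing functions. This is a toy illustration: the scoring functions are random stand-ins for real models, and the function names are assumptions, not the system's actual interfaces.

```python
# Toy sketch of the four-stage ranking cascade; scorers are stand-ins.
import random

def retrieve(query, k=1000):
    # Stage 1: pretend retrieval returns doc IDs with a coarse BM25-like score.
    return [(f"doc{i}", random.random()) for i in range(k)]

def prerank(candidates, k=200):
    # Stage 2: lightweight point-wise model trims 1,000 -> 200.
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:k]

def ltr_score(candidates):
    # Stage 3: full LTR model re-scores the survivors (stubbed here).
    return [(doc, score + random.random()) for doc, score in candidates]

def post_rank(scored, blacklist=frozenset()):
    # Stage 4: business rules (drop blacklisted URLs), then final sort.
    kept = [c for c in scored if c[0] not in blacklist]
    return sorted(kept, key=lambda c: c[1], reverse=True)

results = post_rank(ltr_score(prerank(retrieve("hiking boots"))))
```

The key design property is that each stage sees fewer candidates than the last, so the most expensive model only ever runs on a small set.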

A Feature Store is central to the architecture. It provides low-latency access to pre-computed document features (PageRank, CTR, domain authority, content freshness) and real-time query features (query intent, user session context). Document features are computed offline (batch jobs) and cached in a Redis cluster. Query features are computed inline. The ranking server fetches features for all candidates in a single batch call to minimize latency. Features are joined against query context and passed to the LTR model.

A Model Serving Service hosts the LTR models. Models are versioned artifacts (ONNX or XGBoost native format) loaded into memory. The serving service supports hot-swap model updates: new models are loaded alongside old ones, traffic is gradually shifted via a canary deployment, and full rollout happens after metric validation. A separate Experiment Framework routes a fraction of traffic to shadow models for offline evaluation before promotion.
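
Canary traffic shifting is often implemented as deterministic hash-based routing; a minimal sketch, where the hash scheme, version names, and the 5% default weight are all assumptions:

```python
# Deterministic canary routing sketch: the same request ID always maps to
# the same model version, and roughly canary_weight of traffic hits the canary.
import hashlib

def pick_model(request_id: str, canary_weight: float = 0.05) -> str:
    """Route ~canary_weight fraction of traffic to the new model version."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "model-v2-canary" if bucket < canary_weight * 10_000 else "model-v1-stable"
```

Deterministic routing matters here: a given request (or user) sees a consistent model during the canary, which keeps experiment metrics clean.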

Core Components

Learning to Rank Model

LambdaMART is the standard LTR algorithm. It optimizes NDCG (Normalized Discounted Cumulative Gain) directly using gradient boosting. Training data consists of (query, document, relevance_label) triples where relevance labels are derived from click-through logs using techniques like propensity scoring to correct for position bias. Features include: TF-IDF score, BM25 score, document PageRank, URL depth, title match score, anchor text match, site quality score, content freshness (days since publish), CTR over 7/30/90 day windows, and user affinity score.
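
Since LambdaMART optimizes NDCG directly, it helps to see the metric itself. A pure-Python sketch using the standard exponential-gain formulation, DCG = Σ (2^rel_i − 1) / log2(i + 2):

```python
# NDCG, the metric LambdaMART optimizes; standard exponential-gain form.
import math

def dcg(relevances):
    # DCG = sum over positions i (0-based) of (2^rel_i - 1) / log2(i + 2)
    return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that places the most relevant documents first scores exactly 1.0; inverting it drops the score because high-relevance documents are discounted at deep positions.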

Feature Store

The Feature Store separates feature computation from feature serving. Offline features (PageRank, historical CTR, domain trust score) are computed by Spark batch jobs nightly and written to a columnar store (Apache Parquet on S3) and a Redis Cluster (for low-latency serving). Online features (session CTR, query-specific popularity spike) are computed by a Flink streaming job and written to Redis with short TTLs. The ranking server queries Redis with a multi-get for all candidate document IDs in parallel, assembling feature vectors in <5ms.
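
The batched fetch-and-assemble step can be sketched as follows. A plain dict stands in for the Redis cluster here (a real implementation would use a pipelined multi-get), and the feature names are illustrative:

```python
# Batched feature fetch + vector assembly; a dict stands in for Redis.
FEATURE_NAMES = ["pagerank", "ctr_7d", "domain_trust", "freshness_days"]

feature_store = {  # in production: a single pipelined Redis multi-get
    "doc1": {"pagerank": 0.8, "ctr_7d": 0.12, "domain_trust": 0.9, "freshness_days": 3},
    "doc2": {"pagerank": 0.4, "ctr_7d": 0.30, "domain_trust": 0.7, "freshness_days": 40},
}

def fetch_feature_vectors(doc_ids):
    # One batched lookup for all candidates; missing features default to 0.0
    # so a cold document still yields a full-length vector for the model.
    rows = [feature_store.get(d, {}) for d in doc_ids]
    return [[row.get(f, 0.0) for f in FEATURE_NAMES] for row in rows]

vectors = fetch_feature_vectors(["doc1", "doc2", "doc_unknown"])
```

The default-to-zero policy is a deliberate choice: the model always receives fixed-length vectors, and cold documents degrade gracefully rather than failing the request.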

A/B Experiment Framework

The experiment framework hashes user IDs into experiment buckets. Each bucket maps to a model configuration (model version, feature weights, business rule settings). Ranking decisions are logged with experiment IDs, enabling offline analysis of NDCG, CTR, and session success rate per experiment. Statistical significance testing (t-test, Mann-Whitney) runs automatically after sufficient traffic accumulates. Guardrail metrics (bounce rate, query reformulation rate) trigger automatic experiment pauses if degraded.
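
Hash-based bucketing can be sketched as below; the bucket count, arm names, and traffic split are assumptions for illustration:

```python
# Deterministic experiment bucketing: hash the user ID into one of
# NUM_BUCKETS buckets, then map bucket ranges to model configurations.
import hashlib

NUM_BUCKETS = 1000
EXPERIMENTS = {                          # bucket range -> model configuration
    range(0, 900): "ltr-v1-control",     # 90% of traffic
    range(900, 950): "ltr-v2-treatment",
    range(950, 1000): "ltr-v2-high-freshness",
}

def assign(user_id: str) -> str:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    for bucket_range, config in EXPERIMENTS.items():
        if bucket in bucket_range:
            return config
    raise AssertionError("unreachable: ranges cover all buckets")
```

Because assignment is a pure function of the user ID, a user stays in the same arm across sessions, which is what makes per-experiment CTR and session-success comparisons valid.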

Database Design

Feature data is stored in two systems: Redis Cluster for low-latency online serving (document features keyed by docID, TTL 24h for batch features, 1h for streaming features) and Apache Hive / BigQuery for offline feature pipeline storage and training data generation. Training datasets are stored as Parquet files on S3, partitioned by date. Model artifacts are stored in an MLflow model registry, versioned by experiment run ID. Ranking logs (query, results, features, scores) are written to an append-only Kafka topic and archived to S3 for training data generation.

API Design
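
One plausible shape for the ranking RPC, shown as request/response payloads. Endpoint semantics and every field name here are assumptions, not a documented interface; the `signal_contributions` field reflects the explainability requirement above, and in this toy example contributions are constructed to sum to the score:

```python
# Illustrative ranking RPC payloads (field names are assumptions).
rank_request = {
    "query": "best hiking boots",
    "user_id": "u-123",
    "candidate_doc_ids": ["doc1", "doc2", "doc3"],   # from first-stage retrieval
    "experiment_id": "ltr-v2-treatment",
    "max_results": 10,
}

rank_response = {
    "results": [
        {"doc_id": "doc2", "score": 2.31,
         "signal_contributions": {"bm25": 1.1, "ctr_7d": 0.8, "freshness": 0.41}},
        {"doc_id": "doc1", "score": 1.87,
         "signal_contributions": {"bm25": 1.5, "ctr_7d": 0.2, "freshness": 0.17}},
    ],
    "model_version": "ltr-v2",
    "latency_ms": 14,
}
```

Logging per-signal contributions alongside scores is what makes the "which signals drove each result" requirement satisfiable downstream.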

Scaling & Bottlenecks

Feature fetching is the dominant latency contributor. Fetching 200 features for 200 candidates requires a Redis multi-get of up to 200 keys. Redis Cluster with pipelining can handle this in 2–5ms. The alternative is embedding features directly in the search index (as doc values), eliminating the Redis round-trip at the cost of index storage and update flexibility. Hybrid approaches store high-volatility features (CTR, inventory) in Redis and low-volatility features (PageRank, domain trust) in the index.

Model inference is CPU-bound. XGBoost can be compiled with AVX2 instructions for SIMD-accelerated tree traversal, reducing inference time by 3–5x. For neural re-rankers (cross-encoders), GPU inference is required to meet latency budgets. Batching multiple queries together on a GPU increases throughput but adds queuing latency. A typical trade-off is to use XGBoost for the primary ranking stage and a small DistilBERT cross-encoder only for the top-20 candidates in a final re-ranking stage.

Key Trade-offs

  • Model complexity vs. latency: Deeper neural models improve NDCG but push re-ranking latency over 20ms; tree models are faster but capture less semantic relevance
  • Feature freshness vs. pipeline complexity: Real-time features improve ranking for trending queries but require streaming infrastructure with strict SLAs
  • Global vs. personalized ranking: Global ranking is simpler and avoids filter bubbles; personalization improves CTR but risks over-specialization
  • Exploration vs. exploitation: Pure exploitation of known good results maximizes short-term CTR but prevents discovering better content; epsilon-greedy or UCB bandit strategies balance both
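
The exploration/exploitation bullet can be made concrete with an epsilon-greedy sketch; the epsilon value, slot-swap policy, and result pools are all illustrative assumptions:

```python
# Epsilon-greedy exploration: with probability epsilon, swap one
# lesser-known candidate into the result page to gather feedback on it.
import random

def epsilon_greedy_rank(exploit_results, explore_pool, epsilon=0.1, k=10):
    """Return the top-k exploit results, occasionally injecting one explore doc."""
    page = list(exploit_results[:k])
    if explore_pool and random.random() < epsilon:
        slot = random.randrange(len(page))           # position to explore in
        page[slot] = random.choice(explore_pool)     # lesser-known candidate
    return page
```

UCB-style bandits refine this by exploring proportionally to uncertainty rather than uniformly at random, but the structural idea is the same: sacrifice a small amount of short-term CTR to keep gathering evidence about unseen content.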
