System Design: Recommendation Engine

Design a large-scale recommendation engine that delivers personalized item recommendations in real time, combining collaborative filtering and content-based methods in the two-stage retrieval-and-ranking architecture used by Netflix, Spotify, and Amazon.


Requirements

Functional Requirements:

  • Return a ranked list of N personalized item recommendations for a given user in real time
  • Handle multiple recommendation surfaces: home feed, similar items, search re-ranking, email campaigns
  • Support cold start: provide reasonable recommendations for new users and new items
  • Incorporate diverse signals: explicit ratings, implicit behavior (clicks, dwell time, purchases), and content features
  • Allow real-time signal incorporation: if a user clicks an item, immediately reduce its probability of being re-recommended
  • Support filtering: exclude already-purchased items, respect content availability, apply business rules (promotion boosts)

Non-Functional Requirements:

  • API latency under 100ms at the 99th percentile for online recommendation serving
  • Support 50 million daily active users and a catalog of 100 million items
  • Recommendation freshness: model retrained at least daily; user interactions incorporated within 1 hour
  • 99.9% availability; degraded mode serves popular-items fallback if personalization is unavailable
  • Offline evaluation: NDCG@10 and recall@100 tracked for every model version

Scale Estimation

50 million users * 10 recommendation requests/day = 500 million requests/day ≈ 5,800 requests/second. Candidate generation must retrieve 1,000 candidates per request from 100 million items in under 30ms, which requires approximate nearest neighbor (ANN) vector search over 100 million item embeddings. Ranking model inference: 1,000 candidates * 5,800 requests/second ≈ 5.8 million scoring operations per second, each over a ~50-feature input, a scale of computation that calls for GPU-accelerated ranking.
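
As a quick sanity check, the arithmetic behind these figures (all inputs are the assumptions stated above):

```python
# Back-of-envelope check of the traffic and scoring estimates above.
DAU = 50_000_000                      # daily active users
REQS_PER_USER = 10                    # recommendation requests per user per day
CANDIDATES = 1_000                    # candidates scored per request

reqs_per_day = DAU * REQS_PER_USER            # 500,000,000
reqs_per_sec = reqs_per_day / 86_400          # ~5,787, round to ~5,800 QPS
scores_per_sec = reqs_per_sec * CANDIDATES    # ~5.8M ranker rows per second

print(f"{reqs_per_sec:,.0f} req/s, {scores_per_sec:,.0f} candidate scores/s")
```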

High-Level Architecture

The recommendation system uses a two-stage architecture: candidate generation (retrieve a few thousand relevant items from 100 million) followed by ranking (score each candidate with a detailed feature-rich model). This decoupling allows the retrieval stage to optimize for recall (find all relevant items cheaply) and the ranking stage to optimize for precision (select the best N from candidates using expensive features).

Candidate generation uses multiple retrieval strategies in parallel: (1) ANN search over item embeddings using user embedding as query (collaborative filtering signals), (2) content-based filtering using item attribute similarity, (3) session-based retrieval using the last 10 items interacted with as context, and (4) popularity-based retrieval for diversity and cold-start. Candidates from all strategies are merged and deduplicated (~1,000 unique candidates) before passing to the ranker.
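
A minimal sketch of the merge-and-dedup step, assuming each strategy returns an ordered candidate list; the round-robin interleave (so no single retriever dominates the pool) is an illustrative choice, not a fixed design:

```python
from typing import Iterable

def merge_candidates(strategy_results: Iterable[list[str]],
                     limit: int = 1000) -> list[str]:
    """Merge candidate lists from parallel retrieval strategies,
    deduplicating while preserving each strategy's internal ordering."""
    seen: set[str] = set()
    merged: list[str] = []
    iterators = [iter(r) for r in strategy_results]
    # Round-robin across strategies so no single retriever dominates.
    while iterators and len(merged) < limit:
        for it in list(iterators):
            item = next(it, None)
            if item is None:
                iterators.remove(it)      # this strategy is exhausted
                continue
            if item not in seen:
                seen.add(item)
                merged.append(item)
                if len(merged) >= limit:
                    break
    return merged
```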

The ranker is a deep neural network (DCN-V2 or DLRM architecture) that takes the (user, item, context) feature vector and outputs a relevance score. It runs as a GPU-accelerated TorchServe instance. The top-N scored items are filtered by business rules (availability, already-seen suppression, diversity constraints) and returned to the client. The entire pipeline — feature fetch, candidate generation, ranking, filtering — completes in under 100ms.
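
The post-ranking filter might look like the following sketch; `ScoredItem`, the per-category cap, and the rule set are hypothetical stand-ins for the business rules named above:

```python
from dataclasses import dataclass

@dataclass
class ScoredItem:
    item_id: str
    category: str
    score: float

def apply_business_rules(scored: list[ScoredItem], purchased: set[str],
                         available: set[str], top_n: int = 20,
                         max_per_category: int = 3) -> list[ScoredItem]:
    """Drop unavailable or already-purchased items and cap items per
    category as a simple diversity constraint (illustrative rules)."""
    per_category: dict[str, int] = {}
    results: list[ScoredItem] = []
    for item in sorted(scored, key=lambda x: x.score, reverse=True):
        if item.item_id in purchased or item.item_id not in available:
            continue
        if per_category.get(item.category, 0) >= max_per_category:
            continue
        per_category[item.category] = per_category.get(item.category, 0) + 1
        results.append(item)
        if len(results) == top_n:
            break
    return results
```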

Core Components

Embedding Generation & ANN Index

User and item embeddings (128-dimensional vectors) are computed nightly via matrix factorization (ALS or Neural CF) on the interaction matrix. Item embeddings are indexed in a FAISS HNSW index (Hierarchical Navigable Small World graph) allowing approximate nearest neighbor search over 100 million items with recall@100 of 98% in under 5ms. The FAISS index is loaded into memory on retrieval servers (100 million items * 128 dims * 4 bytes ≈ 51 GB per index copy). Updates to item embeddings trigger an incremental index rebuild nightly.
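
A sketch of the index build and query with FAISS; the vectors are random stand-in data (scaled down from 100 million for a runnable example), and the M/efSearch values are typical starting points rather than tuned settings:

```python
import faiss                      # pip install faiss-cpu (or faiss-gpu)
import numpy as np

DIM = 128
# Stand-in for the nightly ALS/Neural CF output (1M rows, not 100M).
item_embeddings = np.random.rand(1_000_000, DIM).astype("float32")

# HNSW index; M=32 graph neighbors per node is a common default.
index = faiss.IndexHNSWFlat(DIM, 32)
index.hnsw.efConstruction = 200   # build-time quality/speed trade-off
index.add(item_embeddings)

index.hnsw.efSearch = 128         # higher efSearch -> better recall, more latency
user_embedding = np.random.rand(1, DIM).astype("float32")
distances, item_ids = index.search(user_embedding, k=100)  # top-100 items
```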

Real-Time Feature Service

Online inference requires feature freshness within seconds. A real-time feature service aggregates: user session features (last 10 items viewed, current query context) from Redis, user long-term features (embedding, age, location) from the feature store, and item features (embedding, popularity score, freshness) from a pre-computed feature table. This service responds in under 5ms by reading from in-memory caches for the most frequently accessed entities.
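
A sketch of the feature assembly path, assuming Redis-backed session and feature caches; the key names and JSON schemas here are illustrative, not a fixed contract:

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)  # connection details are placeholders

def fetch_features(user_id: str, item_ids: list[str]) -> dict:
    """Assemble the (user, items, context) feature bundle in one
    pipelined Redis round trip."""
    pipe = r.pipeline()
    pipe.zrevrange(f"session:{user_id}", 0, 9)   # last 10 items viewed
    pipe.get(f"user_features:{user_id}")         # cached long-term features
    for item_id in item_ids:
        pipe.get(f"item_features:{item_id}")     # pre-computed item features
    session, user_blob, *item_blobs = pipe.execute()
    return {
        "session_items": [i.decode() for i in session],
        "user": json.loads(user_blob) if user_blob else {},
        "items": [json.loads(b) if b else {} for b in item_blobs],
    }
```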

Ranking Model Serving

The ranker runs on GPU-equipped TorchServe instances. Batching is critical for GPU efficiency: requests are grouped into batches of 32–128 (user, item) pairs and scored in a single forward pass. Dynamic batching collects requests within a 5ms window before dispatching, trading individual request latency for throughput. The model is served in TorchScript with FP16 quantization, reducing memory footprint by 50% and inference time by 40% vs. FP32.
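
In production, TorchServe's built-in dynamic batcher handles the 5ms collection window; the scoring call itself reduces to something like this sketch (the model path and tensor layout are assumptions):

```python
import torch

# Load the TorchScript ranker and cast to FP16 for GPU serving.
# "ranker.pt" is a placeholder path for the exported model artifact.
model = torch.jit.load("ranker.pt").half().cuda().eval()

@torch.inference_mode()
def score_batch(features: torch.Tensor) -> torch.Tensor:
    """Score a micro-batch of (user, item, context) feature rows in one
    forward pass. `features` is [batch, num_features]; batch sizes of
    32-128 amortize GPU kernel-launch overhead across requests."""
    return model(features.half().cuda()).squeeze(-1).float().cpu()
```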

Database Design

  • User-item interaction log: Kafka topic partitioned by user_id, consumed by Spark batch jobs for daily model retraining and by Flink for real-time feature updates.
  • User embedding table: DynamoDB with key = user_id, value = 128-float embedding vector plus last_updated timestamp.
  • Item embedding table: DynamoDB with key = item_id, value = embedding plus content features.
  • Session interaction history: Redis SORTED SET per user_id with interaction timestamps as scores and item_ids as members, capped at 50 recent items.
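
Maintaining the capped session history is a two-command Redis transaction; the key naming follows the session schema above:

```python
import time
import redis

r = redis.Redis()

def record_interaction(user_id: str, item_id: str, max_items: int = 50) -> None:
    """Append an interaction to the user's session history and cap its
    length, matching the sorted-set schema described above."""
    key = f"session:{user_id}"
    pipe = r.pipeline()
    pipe.zadd(key, {item_id: time.time()})           # score = timestamp
    pipe.zremrangebyrank(key, 0, -(max_items + 1))   # keep 50 most recent
    pipe.execute()
```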

API Design

  • GET /recommendations/users/{user_id}?surface=home_feed&n=20&filters={json} — Return personalized recommendations for a user on a specified surface.
  • GET /recommendations/items/{item_id}/similar?n=20 — Return items similar to a given item (item-to-item recommendations).
  • POST /recommendations/batch — Batch recommendation generation for up to 1,000 users (for email campaigns and offline processing).
  • POST /events/interactions — Log a user interaction (click, purchase, skip) for real-time feature updates.
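
A minimal FastAPI-style handler showing the shape of the primary endpoint; `get_recommendations` is a hypothetical stand-in for the retrieve-rank-filter pipeline described above:

```python
from fastapi import FastAPI, Query  # pip install fastapi

app = FastAPI()

def get_recommendations(user_id: str, surface: str, n: int) -> list[dict]:
    # Hypothetical stand-in for the retrieve -> rank -> filter pipeline.
    return [{"item_id": f"item_{i}", "score": round(1 - i / n, 4)}
            for i in range(n)]

@app.get("/recommendations/users/{user_id}")
def recommend(user_id: str, surface: str = "home_feed",
              n: int = Query(20, le=100)):
    """Return the top-n recommendations for a user on the given surface."""
    return {"user_id": user_id, "surface": surface,
            "items": get_recommendations(user_id, surface, n)}
```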

Scaling & Bottlenecks

The FAISS ANN index is memory-intensive: 51 GB for 100 million items limits the number of retrieval server replicas. Index sharding (partition the 100 million items across 5 servers, search all shards in parallel, merge results) allows horizontal scaling while keeping per-server memory under 12 GB. Index updates require rebuilding the HNSW graph, which takes 2 hours for 100 million items; a shadow index build with atomic swap enables zero-downtime updates.
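
Scatter-gather over shards is straightforward because FAISS returns (distance, id) pairs that merge by distance; this sketch assumes each shard stores global item ids (here via IndexIDMap), with tiny local shards standing in for the five retrieval servers:

```python
from concurrent.futures import ThreadPoolExecutor
import faiss
import numpy as np

def search_shards(shards, query: np.ndarray, k: int = 100):
    """Fan a query out to all index shards in parallel, then keep the
    k globally nearest results across shards."""
    def search_one(index):
        return index.search(query, k)  # (distances, ids), each [1, k]

    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        results = list(pool.map(search_one, shards))

    dists = np.concatenate([d[0] for d, _ in results])
    ids = np.concatenate([i[0] for _, i in results])
    order = np.argsort(dists)[:k]      # smallest L2 distances win
    return dists[order], ids[order]

# Demo: two small shards with disjoint global id ranges.
DIM = 128
shards = []
for offset in (0, 500_000):
    idx = faiss.IndexIDMap(faiss.IndexHNSWFlat(DIM, 32))
    vecs = np.random.rand(1000, DIM).astype("float32")
    idx.add_with_ids(vecs, np.arange(offset, offset + 1000))
    shards.append(idx)

query = np.random.rand(1, DIM).astype("float32")
dists, ids = search_shards(shards, query, k=10)
```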

Ranking model GPU memory constrains the maximum model size. DLRM models with large embedding tables (sparse features with millions of unique values) don't fit in GPU memory; distributing embedding tables across CPU memory with GPU-resident dense layers (embedding table disaggregation) enables models 10x larger than GPU memory. For the latency budget, prefetching features from the feature store concurrently with candidate generation (parallel execution) shaves 15–20ms off the critical path.
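
The prefetch overlap is a standard concurrency pattern; a sketch with asyncio, where the awaited helpers are stand-ins with representative latencies:

```python
import asyncio

async def fetch_user_features(user_id: str) -> dict:
    await asyncio.sleep(0.010)          # stand-in: ~10ms feature-store read
    return {"user_id": user_id}

async def generate_candidates(user_id: str) -> list[str]:
    await asyncio.sleep(0.020)          # stand-in: ~20ms ANN retrieval
    return ["item_1", "item_2"]

async def serve_request(user_id: str) -> list[str]:
    # Overlap the two calls: total wait is max(10, 20) = 20ms, not 30ms.
    features, candidates = await asyncio.gather(
        fetch_user_features(user_id), generate_candidates(user_id)
    )
    return candidates                   # the ranker would consume both here

print(asyncio.run(serve_request("u42")))
```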

Key Trade-offs

  • Two-stage vs. single-stage ranking: Two-stage (retrieve + rank) scales to 100 million items by using cheap retrieval and expensive ranking selectively; single-stage runs one model over all items which is infeasible at scale but avoids retrieval recall errors.
  • Model accuracy vs. latency: Larger ranking models (more layers, wider embeddings) improve recommendation quality but add serving latency; model distillation compresses large teacher models into smaller student models with <5% quality loss.
  • Exploitation vs. exploration: Purely exploitative systems recommend items similar to past behavior, creating filter bubbles; epsilon-greedy or Thompson sampling exploration injects diversity and discovers new user interests at the cost of short-term relevance (see the sketch after this list).
  • Freshness vs. stability: Models retrained more frequently incorporate recent trends but risk training instability from small data shifts; ensemble approaches (blend a fresh model with a stable baseline) balance recency and reliability.
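
As an example of the exploration side, an epsilon-greedy slate mixer; epsilon = 0.1 and the uniform sampling of exploration candidates are illustrative choices:

```python
import random

def epsilon_greedy_slate(ranked: list[str], explore_pool: list[str],
                         n: int = 20, epsilon: float = 0.1) -> list[str]:
    """Fill each slot from the ranker with probability 1 - epsilon,
    otherwise from a random exploration candidate."""
    pool = [i for i in explore_pool if i not in ranked]
    slate: list[str] = []
    ranked_iter = iter(ranked)
    while len(slate) < n:
        explore = pool and random.random() < epsilon
        pick = (pool.pop(random.randrange(len(pool))) if explore
                else next(ranked_iter, None))
        if pick is None:
            break                       # ranker exhausted; return a short slate
        if pick not in slate:
            slate.append(pick)
    return slate
```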
