System Design: Food Recommendation Engine
Design a personalized food recommendation engine that suggests restaurants and menu items based on user preferences, order history, contextual signals, and collaborative filtering.
Requirements
Functional Requirements:
- Personalized restaurant ranking on the home feed for each user
- "You might also like" item suggestions within a restaurant menu
- Cuisine and dish discovery for users exploring new food types
- Re-order suggestions based on past order history and time-of-day patterns
- Diet-aware filtering (vegetarian, vegan, gluten-free, halal, kosher)
- Trending and popular items in the user's area
Non-Functional Requirements:
- Serve personalized recommendations for 50M DAU with sub-100ms latency
- Update user preference models within 30 minutes of new order completion
- Handle cold-start problem for new users with zero order history
- 99.9% availability for the recommendation service
- Support A/B testing of recommendation algorithms with traffic splitting
Scale Estimation
50M DAU × 5 app opens per day = 250M recommendation requests/day ≈ 2,900 requests/sec on average. Each request requires scoring 200-500 candidate restaurants from a pool of ~50K restaurants in the user's delivery radius, i.e. 250M × 300 average candidates = 75B candidate scorings/day. User feature vectors: 50M users × 256 dimensions × 4 bytes (float32) = 51.2GB for the user embedding table. Restaurant feature vectors: 900K restaurants × 256 dimensions × 4 bytes ≈ 920MB. Item-level recommendation (suggesting within a menu) adds another 72M menu items to score across all requests.
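The arithmetic is easy to verify in a few lines (a quick sanity check; the constants mirror the estimates above):

```python
# Back-of-envelope check of the scale estimates above.
DAU = 50_000_000
OPENS_PER_DAY = 5
requests_per_day = DAU * OPENS_PER_DAY              # 250M
requests_per_sec = requests_per_day / 86_400        # ~2,894

AVG_CANDIDATES = 300
scorings_per_day = requests_per_day * AVG_CANDIDATES  # 75B

EMB_DIM, BYTES_PER_FLOAT32 = 256, 4
user_table_gb = DAU * EMB_DIM * BYTES_PER_FLOAT32 / 1e9        # 51.2 GB
rest_table_mb = 900_000 * EMB_DIM * BYTES_PER_FLOAT32 / 1e6    # ~922 MB

print(f"{requests_per_sec:,.0f} req/s, {scorings_per_day / 1e9:.0f}B scorings/day")
print(f"user embeddings: {user_table_gb:.1f} GB, restaurant embeddings: {rest_table_mb:.0f} MB")
```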
High-Level Architecture
The recommendation engine follows the standard retrieval-ranking-reranking pipeline used in industrial recommendation systems. The pipeline executes on every app open or home feed refresh. Stage 1 (Retrieval): multiple retrieval channels run in parallel to generate a broad candidate set of 1,000-2,000 restaurants. Stage 2 (Ranking): a deep learning model scores each candidate on predicted user engagement (click probability, order probability, predicted satisfaction). Stage 3 (Reranking): business logic adjustments for diversity, freshness, sponsored placements, and policy constraints.
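As a shape for the whole flow, a minimal sketch of the three-stage pipeline (the stage bodies here are stubs standing in for the real services described in the rest of this section):

```python
from typing import Dict, List, Tuple

def retrieve_candidates(user_id: str, ctx: Dict) -> List[str]:
    # Stage 1: multi-channel retrieval; normally returns 1,000-2,000 restaurant IDs.
    return ["rest_1", "rest_2", "rest_3"]

def rank(user_id: str, ctx: Dict, candidates: List[str]) -> List[Tuple[str, float]]:
    # Stage 2: model-predicted engagement score per candidate (stubbed).
    return [(c, 1.0 / (i + 1)) for i, c in enumerate(candidates)]

def rerank(scored: List[Tuple[str, float]], ctx: Dict) -> List[str]:
    # Stage 3: diversity, freshness, sponsored placements, policy constraints.
    return [restaurant_id for restaurant_id, _ in scored]

def recommend(user_id: str, ctx: Dict, limit: int = 30) -> List[str]:
    return rerank(rank(user_id, ctx, retrieve_candidates(user_id, ctx)), ctx)[:limit]
```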
The ML infrastructure sits on a standard feature store + model serving stack. Features are divided into real-time features (user's current location, time of day, weather, items in cart) stored in Redis, near-real-time features (user's last 5 orders, rolling 30-day cuisine preferences) updated via Flink and stored in Redis, and batch features (user embedding, long-term preference vector, lifetime order count) computed daily via Spark and stored in a feature store (Redis for online serving, S3/Parquet as the offline store). The ranking model (a multi-task learning model predicting click, order, and satisfaction simultaneously) is served via a TensorFlow Serving cluster with GPU inference.
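At inference time the three tiers are read together. A minimal sketch of the feature assembly, assuming JSON-encoded values and the key prefixes shown here (the key scheme is an assumption, not a fixed contract):

```python
import json
import redis  # redis-py

r = redis.Redis(host="feature-store", decode_responses=True)

def assemble_features(user_id: str) -> dict:
    """Merge the three feature tiers into one dict for model inference."""
    realtime = json.loads(r.get(f"rt:{user_id}") or "{}")     # location, time, weather, cart
    near_rt = json.loads(r.get(f"nrt:{user_id}") or "{}")     # last 5 orders, via Flink
    batch = json.loads(r.get(f"batch:{user_id}") or "{}")     # daily Spark features
    return {**batch, **near_rt, **realtime}                   # fresher tiers win on conflicts
```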
The feedback loop is critical: every impression, click, order, rating, and reorder event is logged to Kafka and consumed by the ML training pipeline (daily retraining on Spark/PyTorch) and the near-real-time feature update pipeline (Flink). This ensures the model adapts to changing user preferences, seasonal trends, and new restaurants.
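A sketch of the producer side of that loop using kafka-python; the topic name and event schema are illustrative:

```python
import json
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def log_event(user_id: str, event_type: str, restaurant_id: str, session_id: str) -> None:
    """Emit one feedback event; consumed by the Spark training pipeline
    and the Flink near-real-time feature pipeline."""
    producer.send("reco-events", {
        "user_id": user_id,
        "event_type": event_type,  # impression | click | order | rating | reorder
        "restaurant_id": restaurant_id,
        "session_id": session_id,
    })
```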
Core Components
Multi-Channel Retrieval System
Retrieval uses four parallel channels, each returning ~500 candidates: (1) Collaborative Filtering — an approximate nearest neighbor (ANN) search using FAISS over user embeddings finds users with similar ordering patterns, then retrieves their recently ordered restaurants; (2) Content-Based — restaurants matching the user's explicit cuisine preferences and dietary filters, scored by menu similarity to past orders using TF-IDF over menu item descriptions; (3) Popularity-Based — trending restaurants in the user's H3 zone, weighted by recency and order volume (handles cold start); (4) Contextual — time-of-day and day-of-week patterns from the user's history (e.g., sushi on Friday nights, coffee shops on weekday mornings). Each channel returns scored candidates that are merged and deduplicated into a unified candidate set.
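A hedged sketch of the merge-and-deduplicate step; the per-channel weights are placeholders for values that would be tuned offline:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def merge_channels(channel_results: Dict[str, List[Tuple[str, float]]],
                   channel_weights: Dict[str, float]) -> List[str]:
    """Merge per-channel (restaurant_id, score) lists into one deduplicated
    candidate set, keeping each restaurant's best weighted score."""
    merged = defaultdict(float)
    for channel, results in channel_results.items():
        weight = channel_weights.get(channel, 1.0)
        for restaurant_id, score in results:
            merged[restaurant_id] = max(merged[restaurant_id], weight * score)
    return sorted(merged, key=merged.get, reverse=True)

# Example: CF and content-based both surface "r1"; it appears once in the output.
candidates = merge_channels(
    {"cf": [("r1", 0.9)], "content": [("r1", 0.7), ("r2", 0.6)]},
    {"cf": 1.0, "content": 0.8},
)
```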
Deep Ranking Model
The ranking model is a multi-task neural network with shared bottom layers and task-specific heads. The shared layers take as input: user features (embedding, demographics, order history summary, active subscription), restaurant features (embedding, cuisine type, price tier, average rating, delivery time estimate, current surge status), contextual features (time of day, day of week, weather, device type), and cross features (user-restaurant interaction history, user-cuisine affinity scores). The model architecture uses a Deep & Cross Network (DCN-V2) structure with 6 cross layers and 4 deep layers (512→256→128→64 units). Three task heads predict: P(click), P(order|click), and P(5-star rating|order). The final ranking score is a weighted combination: 0.3×P(click) + 0.5×P(order|click) + 0.2×P(5-star). The model runs on TensorFlow Serving with batch inference (batching 32 requests), achieving p99 latency of 25ms for scoring 500 candidates.
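A compact Keras sketch of this architecture, assuming a 320-dimensional concatenated input vector (illustrative; the real input spec is wider and includes embeddings looked up upstream):

```python
import tensorflow as tf

class CrossLayer(tf.keras.layers.Layer):
    """DCN-V2 cross layer: x_{l+1} = x_0 * (W x_l + b) + x_l."""
    def build(self, input_shape):
        dim = input_shape[0][-1]
        self.w = self.add_weight(shape=(dim, dim), initializer="glorot_uniform")
        self.b = self.add_weight(shape=(dim,), initializer="zeros")

    def call(self, inputs):
        x0, xl = inputs
        return x0 * (tf.matmul(xl, self.w) + self.b) + xl

def build_ranker(feature_dim: int = 320) -> tf.keras.Model:
    x0 = tf.keras.Input(shape=(feature_dim,))  # user + restaurant + context + cross features
    cross = x0
    for _ in range(6):                         # 6 cross layers
        cross = CrossLayer()([x0, cross])
    deep = x0
    for units in (512, 256, 128, 64):          # 4 deep layers
        deep = tf.keras.layers.Dense(units, activation="relu")(deep)
    shared = tf.keras.layers.Concatenate()([cross, deep])
    heads = {name: tf.keras.layers.Dense(1, activation="sigmoid", name=name)(shared)
             for name in ("click", "order_given_click", "five_star")}
    return tf.keras.Model(x0, heads)

model = build_ranker()
# Final ranking score, per the weighted combination above:
# score = 0.3 * P(click) + 0.5 * P(order|click) + 0.2 * P(5-star)
```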
Cold Start Handler
New users with zero order history cannot leverage collaborative filtering or personal history features. The cold start handler uses a fallback strategy: (1) During onboarding, users select cuisine preferences from a curated list — these are mapped to a synthetic user embedding positioned at the centroid of users with similar stated preferences; (2) For the first 5 orders, the system heavily weights the popularity-based retrieval channel and contextual signals (location, time); (3) After each order, the user embedding is updated via an online learning mechanism (gradient descent on the embedding with the new order as a positive sample), rapidly personalizing the recommendations. The transition from cold start to personalized recommendations is typically complete after 8-10 orders.
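A minimal sketch of the two cold-start mechanisms, assuming precomputed per-cuisine centroid embeddings; the simple pull-toward update stands in for the full gradient step described above:

```python
from typing import Dict, List
import numpy as np

def synthetic_embedding(selected_cuisines: List[str],
                        preference_centroids: Dict[str, np.ndarray]) -> np.ndarray:
    """Onboarding: place the new user at the centroid of users who stated
    the same cuisine preferences (centroids assumed precomputed offline)."""
    return np.mean([preference_centroids[c] for c in selected_cuisines], axis=0)

def online_update(user_emb: np.ndarray, restaurant_emb: np.ndarray,
                  lr: float = 0.1) -> np.ndarray:
    """One step pulling the user embedding toward an ordered restaurant
    (the order is treated as a positive sample)."""
    return user_emb + lr * (restaurant_emb - user_emb)
```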
Database Design
User profiles and preference data are stored in PostgreSQL: users table with user_id, dietary_preferences (JSONB array), cuisine_affinity_scores (JSONB map from cuisine to score), onboarding_selections, created_at. Order history is in a separate orders table joined for feature computation. The ML feature store uses a dual-layer architecture: Redis for online features (sub-millisecond lookups during model inference) and S3/Parquet for offline features (used during daily model retraining). Redis stores user embeddings as byte arrays under user_emb:{user_id} and restaurant embeddings under rest_emb:{restaurant_id}.
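Reading and writing those byte-array embeddings is straightforward with numpy; a minimal sketch following the key scheme above:

```python
import numpy as np
import redis

r = redis.Redis(host="feature-store")

def put_embedding(key: str, emb: np.ndarray) -> None:
    r.set(key, emb.astype(np.float32).tobytes())   # 256 floats -> 1,024 bytes

def get_embedding(key: str) -> np.ndarray:
    return np.frombuffer(r.get(key), dtype=np.float32)

# Keys follow the scheme in the text:
# put_embedding(f"user_emb:{user_id}", emb)
# get_embedding(f"rest_emb:{restaurant_id}")
```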
The FAISS index for ANN search over user embeddings is rebuilt daily on a GPU instance and loaded into memory on the retrieval servers. The index uses IVF-PQ (Inverted File with Product Quantization) for memory efficiency: 50M 256-dimensional float32 embeddings would normally require 51.2GB, but IVF-PQ compresses this to ~3.2GB (64 PQ bytes per vector) with <5% recall degradation at top-100. The index is partitioned into 4096 Voronoi cells, with nprobe=64 during search (querying 64 of 4096 cells) as the accuracy/speed trade-off.
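A sketch of building and querying the IVF-PQ index with FAISS, matching the parameters above (4096 cells, 64 PQ sub-codes, nprobe=64); the training data here is a random stand-in:

```python
import faiss
import numpy as np

DIM, N_CELLS, PQ_M = 256, 4096, 64   # 64 one-byte PQ codes/vector -> ~3.2GB for 50M users

quantizer = faiss.IndexFlatL2(DIM)                          # coarse quantizer over cells
index = faiss.IndexIVFPQ(quantizer, DIM, N_CELLS, PQ_M, 8)  # 8 bits per sub-code

train = np.random.rand(200_000, DIM).astype(np.float32)     # stand-in for real embeddings
index.train(train)
index.add(train)

index.nprobe = 64                                           # probe 64 of the 4096 cells
distances, neighbor_ids = index.search(train[:1], 100)      # top-100 similar users
```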
API Design
- GET /api/v1/recommendations/feed?lat={lat}&lng={lng}&limit=30&cursor={token} — Get the personalized restaurant feed for the authenticated user; returns a ranked restaurant list with scores and explanation tags
- GET /api/v1/recommendations/restaurant/{restaurant_id}/items?context=reorder|explore — Get recommended menu items within a restaurant; the context flag switches between reorder suggestions and exploration
- POST /api/v1/recommendations/feedback — Log explicit user feedback (dismiss, not interested, wrong cuisine); body contains restaurant_id, feedback_type, session_id
- GET /api/v1/recommendations/trending?lat={lat}&lng={lng}&cuisine={type} — Get trending restaurants/items in an area, optionally filtered by cuisine
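A minimal handler sketch for the feed endpoint, assuming FastAPI (the framework choice is an assumption); authentication and the call into the ranking pipeline are elided:

```python
from typing import Optional
from fastapi import FastAPI, Query

app = FastAPI()

@app.get("/api/v1/recommendations/feed")
def feed(lat: float, lng: float,
         limit: int = Query(30, le=100),
         cursor: Optional[str] = None) -> dict:
    # The real handler authenticates the user and calls retrieve -> rank -> rerank;
    # the response shape (scores plus explanation tags) follows the text above.
    restaurants: list = []  # e.g. [{"restaurant_id": ..., "score": ..., "tags": [...]}]
    return {"restaurants": restaurants, "next_cursor": None}
```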
Scaling & Bottlenecks
The ranking model inference is the latency bottleneck. Scoring 500 candidates through a deep neural network with GPU inference takes ~25ms in batch mode. To keep total API latency under 100ms, the retrieval stage must complete within 40ms and reranking within 10ms, leaving ~25ms for network overhead and serialization. The retrieval stage achieves this by running all four channels in parallel with a 40ms timeout — any channel that misses the deadline is excluded, and its candidates are omitted (graceful degradation). GPU inference servers are provisioned at 70% utilization with auto-scaling triggered at 80%.
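The 40ms retrieval deadline with graceful degradation maps naturally onto cancellable parallel tasks; a sketch with asyncio (each channel coroutine is assumed to return a candidate list):

```python
import asyncio
from typing import Awaitable, Dict, List

async def retrieve_with_deadline(channels: Dict[str, Awaitable[List[str]]],
                                 timeout_s: float = 0.040) -> List[str]:
    """Run all retrieval channels in parallel; a channel that misses the
    40ms deadline is cancelled and its candidates are simply omitted."""
    tasks = {name: asyncio.ensure_future(coro) for name, coro in channels.items()}
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout_s)
    for task in pending:
        task.cancel()                      # graceful degradation: drop late channels
    candidates: List[str] = []
    for task in done:
        if task.exception() is None:
            candidates.extend(task.result())
    return candidates
```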
The FAISS index rebuild is a batch process that takes 2 hours for 50M embeddings on a single GPU. During the rebuild window, the old index continues serving (stale by up to 24 hours for the collaborative filtering channel). For users who placed orders since the last rebuild, the near-real-time Flink pipeline updates their embedding in Redis, and a separate lightweight ANN index (HNSW, covering only the 2M users with recent activity) is used to supplement the main FAISS index.
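A sketch of querying both indexes and merging by distance, assuming both are keyed by the same user-ID space (e.g. via IndexIDMap):

```python
from typing import List, Tuple
import faiss
import numpy as np

def dual_ann_search(main_index: faiss.Index, fresh_index: faiss.Index,
                    query: np.ndarray, k: int = 100) -> List[Tuple[int, float]]:
    """Query the daily IVF-PQ index (all 50M users) and the small HNSW index
    (~2M recently active users with fresher embeddings), merge by distance."""
    d_main, i_main = main_index.search(query, k)
    d_fresh, i_fresh = fresh_index.search(query, k)
    pairs = [(d, i) for d, i in zip(d_main[0], i_main[0]) if i != -1]
    pairs += [(d, i) for d, i in zip(d_fresh[0], i_fresh[0]) if i != -1]
    pairs.sort(key=lambda p: p[0])         # L2 distance: smaller is more similar
    seen, merged = set(), []
    for dist, uid in pairs:
        if uid not in seen:                # deduplicate users present in both indexes
            seen.add(uid)
            merged.append((int(uid), float(dist)))
    return merged[:k]
```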
Key Trade-offs
- Multi-channel retrieval over single-model end-to-end: Separate retrieval channels provide interpretability (which channel drove a recommendation), easier debugging, and graceful degradation — but the merging heuristic between channels is manually tuned and suboptimal compared to a learned retrieval model
- Multi-task learning over separate models per objective: A single model predicting click, order, and satisfaction simultaneously shares representations and trains more efficiently, but task conflicts (optimizing for clicks may hurt satisfaction) require careful loss weighting — the 0.3/0.5/0.2 weights are tuned via online A/B tests
- Daily batch retraining over online learning: Daily retraining on the full dataset produces stable, well-calibrated models, but the 24-hour staleness means trending restaurants and new user preferences are not reflected immediately — the near-real-time feature updates via Flink partially compensate
- IVF-PQ over exact nearest neighbor for ANN: Product quantization reduces memory by roughly 16x (51.2GB → ~3.2GB) and search time by ~100x, but introduces recall degradation — at nprobe=64 the recall@100 is 95%, meaning 5% of truly relevant restaurants may be missed in retrieval (mitigated by the other three retrieval channels)