
System Design: DoorDash

Deep dive into DoorDash's system architecture covering marketplace dynamics, Dasher logistics, merchant integration, and the engineering behind delivering 2 billion orders annually.

18 min read · Updated Jan 15, 2025
Tags: system-design, doordash, food-delivery, marketplace

Requirements

Functional Requirements:

  • Three-sided marketplace connecting consumers, merchants (restaurants), and Dashers (delivery drivers)
  • Consumers search for merchants, browse menus, place and track orders
  • Merchants receive orders on tablets, manage menus, and update item availability in real-time
  • Dashers accept delivery offers, navigate routes, and confirm pickups and drop-offs
  • Support for DashPass subscription with free delivery and reduced fees
  • Scheduling orders for future delivery windows

Non-Functional Requirements:

  • 37M consumers, 700K merchants, 7M Dashers; ~6M orders/day
  • Order placement latency under 2 seconds end-to-end
  • 99.99% availability for the ordering critical path
  • Sub-minute Dasher assignment for 95% of orders
  • Accurate ETA predictions within a 5-minute window for 90% of deliveries

Scale Estimation

6M orders/day ≈ 70 orders/sec on average, peaking at ~500 orders/sec during the Friday dinner rush.

  • API traffic: each order involves 3 actors (consumer, merchant, Dasher) making 15+ API calls each over the order lifecycle, i.e. ~45 calls per order, yielding roughly 270M API calls/day = ~3,100 requests/sec sustained
  • Dasher locations: 1M active Dashers pinging every 5 seconds at peak = 200K location updates/sec
  • Menu catalog: 700K merchants × 60 average items = 42M menu items requiring real-time availability tracking
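The arithmetic behind these figures is easy to verify; a quick back-of-envelope check, using only the numbers quoted above:

```python
# Back-of-envelope verification of the scale estimates above.
# All inputs are the figures quoted in the text; only the arithmetic is new.

ORDERS_PER_DAY = 6_000_000
SECONDS_PER_DAY = 86_400

avg_orders_per_sec = ORDERS_PER_DAY / SECONDS_PER_DAY        # ~70 orders/sec
api_calls_per_order = 3 * 15                                 # 3 actors x 15 calls each
api_calls_per_day = ORDERS_PER_DAY * api_calls_per_order     # 270M calls/day
api_calls_per_sec = api_calls_per_day / SECONDS_PER_DAY      # ~3,100 req/sec

active_dashers_peak = 1_000_000
ping_interval_s = 5
location_updates_per_sec = active_dashers_peak // ping_interval_s  # 200K/sec

menu_items = 700_000 * 60                                    # 42M items
```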

High-Level Architecture

DoorDash's architecture is built on a Kotlin/gRPC microservices platform running on Kubernetes (migrated from a Python monolith in 2019-2021). The API Gateway layer uses an in-house service called Unified Gateway that handles authentication, rate limiting, and request routing. Core domain services include: Consumer Service, Merchant Service, Dasher Service, Order Service, Delivery Service (logistics), Search Service, and Payment Service. Inter-service communication uses gRPC for synchronous calls and Apache Kafka for event-driven flows.

The critical ordering path follows: Consumer places order → Order Service validates cart, calculates pricing with the Pricing Service, and authorizes payment → Order is persisted in CockroachDB (DoorDash's primary OLTP database, chosen for strong consistency and horizontal scalability) → Merchant Service pushes the order to the restaurant's tablet via a persistent connection → Delivery Service enters the order into the assignment pool. The Delivery Service runs a matching algorithm called STRP (Smart Time-to-Ready Prediction) that predicts when food will be ready and dispatches a Dasher to arrive at the restaurant just as the food is prepared, minimizing both Dasher wait time and food cooling.
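The critical path above walks an order through a fixed sequence of states, and guarding those transitions explicitly is a common pattern for this kind of flow. A minimal sketch, where the state names and transition table are illustrative assumptions rather than DoorDash's actual status enum:

```python
# Illustrative state machine for the ordering critical path described above.
# State names are hypothetical; the real Order Service enum is not public.

VALID_TRANSITIONS = {
    "cart":      {"placed"},                  # consumer checks out
    "placed":    {"confirmed", "cancelled"},  # payment authorized
    "confirmed": {"preparing"},               # pushed to the merchant tablet
    "preparing": {"ready"},                   # STRP dispatches a Dasher in this window
    "ready":     {"picked_up"},
    "picked_up": {"delivered"},
}

def transition(current: str, target: str) -> str:
    """Guard an order status change; an illegal hop raises instead of corrupting state."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

For example, `transition("placed", "confirmed")` succeeds, while jumping from `cart` straight to `delivered` raises.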

The real-time layer uses Apache Flink for stream processing — consuming Kafka events for live order tracking, ETA recalculation, and anomaly detection (e.g., detecting when a Dasher has been stationary for too long). WebSocket connections to all three actor types flow through a dedicated Gateway Service backed by Redis Pub/Sub for message fan-out.
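The fan-out pattern behind that Gateway Service can be reduced to a few lines: one publish on a delivery's channel reaches every connected subscriber (consumer app, merchant tablet, Dasher app). A minimal in-memory sketch in the spirit of Redis Pub/Sub, not the actual gateway implementation:

```python
# Minimal in-memory publish/subscribe fan-out, illustrating the Redis
# Pub/Sub message distribution described above. In production the handlers
# would be WebSocket connections, not Python callables.

from collections import defaultdict
from typing import Callable

class FanOut:
    def __init__(self) -> None:
        self._subs: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[dict], None]) -> None:
        """Register a handler (e.g. a socket writer) on a channel."""
        self._subs[channel].append(handler)

    def publish(self, channel: str, event: dict) -> int:
        """Deliver the event to every subscriber; return the receiver count."""
        for handler in self._subs[channel]:
            handler(event)
        return len(self._subs[channel])
```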

Core Components

Delivery Optimization (STRP)

DoorDash's core competitive advantage is the STRP system — Smart Time-to-Ready Prediction. Instead of assigning a Dasher immediately when an order is placed, STRP predicts how long the restaurant will take to prepare the food using an ML model trained on historical prep times, current order volume at the restaurant, item complexity, and time of day. It then times the Dasher dispatch so the Dasher arrives within 2 minutes of food being ready. This reduces Dasher idle time by 30% and improves food quality. The model is a gradient-boosted decision tree (XGBoost) retrained daily on Spark, with real-time feature updates flowing through Flink.
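The dispatch-timing decision described above, including the low-confidence fallback from the Key Trade-offs section, can be sketched as a single function. The buffer and confidence threshold below are illustrative assumptions, not published STRP parameters:

```python
# Sketch of STRP-style delayed dispatch: given the model's predicted
# food-ready time and a Dasher's travel time, pick when to send the offer
# so the Dasher arrives slightly before the food is ready. buffer_s and
# min_confidence are made-up values for illustration.

def dispatch_at(predicted_ready_s: float, travel_time_s: float,
                model_confidence: float, now_s: float,
                buffer_s: float = 120.0, min_confidence: float = 0.7) -> float:
    """Return the timestamp at which to dispatch the Dasher offer."""
    if model_confidence < min_confidence:
        # Fallback mode: dispatch immediately when the prediction is shaky.
        return now_s
    # Aim to arrive buffer_s before the predicted ready time.
    leave_at = predicted_ready_s - travel_time_s - buffer_s
    return max(now_s, leave_at)  # never schedule a dispatch in the past
```

With a confident prediction, `dispatch_at(1000, 300, 0.9, 0)` holds the offer until t=580 instead of dispatching at t=0.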

Merchant Integration Platform

Merchants interact with DoorDash through three channels: the DoorDash tablet app, POS integration (direct API connection to restaurant POS systems like Toast, Square, or Clover), and the Merchant Portal web app. The tablet uses a persistent gRPC stream for real-time order push. POS integration uses webhooks — DoorDash sends orders to the POS system via HTTPS POST and receives status updates via callback URLs. A Merchant Availability Service tracks real-time item availability: when a merchant marks an item as sold out on their POS, a webhook fires to update the Search index within 30 seconds.
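The availability webhook above boils down to validating an inbound event and applying it to the search index. A sketch of that handler, with an in-memory dict standing in for the index; the payload field names are assumptions, not DoorDash's actual webhook schema:

```python
# Illustrative handler for a POS availability webhook, as described above.
# `index` is an in-memory stand-in for the search index update path.

def handle_item_webhook(payload: dict, index: dict) -> dict:
    """Apply a sold-out / back-in-stock event; reject malformed payloads."""
    required = {"merchant_id", "item_id", "available"}
    missing = required - payload.keys()
    if missing:
        raise ValueError(f"malformed webhook, missing fields: {sorted(missing)}")
    key = (payload["merchant_id"], payload["item_id"])
    index[key] = bool(payload["available"])
    return {"status": "ok", "indexed": key}
```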

Consumer Search & Discovery

Search is powered by Apache Lucene via a custom search infrastructure (DoorDash moved off Elasticsearch to a bespoke solution for performance). The search index combines merchant metadata, menu items, cuisine tags, and real-time signals (current delivery time estimate, merchant busy status, promotional offers). Ranking uses a two-tower neural network: one tower encodes user preferences (order history embeddings, location, time features), the other encodes merchant features. The dot product of the two towers produces a relevance score. Results are further reranked by a business logic layer that injects sponsored placements and DashPass-eligible merchants.
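The scoring step of the two-tower model reduces to a dot product between two embedding vectors. A toy version, with fixed vectors standing in for the neural towers' outputs:

```python
# Toy two-tower relevance scoring, as described above: each tower emits an
# embedding, and relevance is their dot product. Real towers are neural
# networks; the fixed vectors here are illustrative stand-ins.

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def rank_merchants(user_emb: list[float],
                   merchant_embs: dict[str, list[float]]) -> list[str]:
    """Return merchant ids sorted by two-tower relevance, best first."""
    scored = [(dot(user_emb, emb), mid) for mid, emb in merchant_embs.items()]
    return [mid for _, mid in sorted(scored, reverse=True)]
```

In production this list would then pass through the business-logic reranker for sponsored and DashPass placements.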

Database Design

DoorDash migrated to CockroachDB as its primary OLTP store for strong consistency and multi-region deployment without manual sharding. The Orders table uses a UUID primary key with columns: order_id, consumer_id, merchant_id, dasher_id, status (enum), items (JSONB), subtotal, fees, tip, total, delivery_address (PostGIS geometry), created_at, and scheduled_for. CockroachDB's SERIALIZABLE isolation level eliminates the need for application-level locking on order state transitions.

Dasher locations are stored in a Redis geospatial index (GEOADD/GEORADIUS) for real-time proximity queries, with location history streaming to Apache Druid for analytics. Menu data lives in CockroachDB with a materialized view pattern: the canonical menu is in the Merchant Service's tables, and a denormalized read-optimized copy is maintained in the Search index. A CDC pipeline built on CockroachDB's native changefeeds keeps the search index synchronized within 5 seconds of a menu update.
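The proximity query that Redis answers with GEORADIUS is conceptually simple; a pure-Python stand-in using the haversine distance, for illustration only:

```python
# Stand-in for the Redis GEORADIUS proximity query described above:
# find Dashers within a radius of a pickup point, nearest first.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance between two (lat, lng) points, in kilometers."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def dashers_within(pickup: tuple, dashers: dict, radius_km: float) -> list:
    """dashers: {dasher_id: (lat, lng)}; returns ids within radius, nearest first."""
    lat0, lng0 = pickup
    hits = sorted((haversine_km(lat0, lng0, lat, lng), d)
                  for d, (lat, lng) in dashers.items())
    return [d for dist, d in hits if dist <= radius_km]
```

Redis does the same thing server-side over a geohash-encoded sorted set, which is what makes it fast enough for 200K updates/sec.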

API Design

  • GET /api/v1/search?query={q}&lat={lat}&lng={lng}&filters={cuisine,price}&page_token={token} — Search merchants and menu items with geo-filtering
  • POST /api/v1/orders — Place an order; body contains merchant_id, items, delivery_address, payment_token, tip_amount, scheduled_time (optional)
  • GET /api/v1/deliveries/{delivery_id}/status — SSE stream of delivery status updates including Dasher location, ETA, and order progress
  • PATCH /api/v1/merchants/{merchant_id}/menu/items/{item_id} — Update item availability or price; triggers search index refresh
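Server-side handling of the order endpoint starts with validating the body fields listed above. A sketch of that validation, where the required/optional split and the defaults are assumptions drawn from the endpoint description:

```python
# Illustrative validation for the POST /api/v1/orders body described above.
# Field names come from the endpoint list; defaults are assumptions.

REQUIRED = {"merchant_id", "items", "delivery_address", "payment_token"}
OPTIONAL_DEFAULTS = {"tip_amount": 0, "scheduled_time": None}

def validate_order_body(body: dict) -> dict:
    """Reject incomplete orders; fill defaults for optional fields."""
    missing = REQUIRED - body.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not body["items"]:
        raise ValueError("order must contain at least one item")
    return {**OPTIONAL_DEFAULTS, **body}
```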

Scaling & Bottlenecks

The dinner rush creates extreme load spikes — DoorDash sees 8x traffic amplification between 5 and 7 PM on Fridays compared to a Tuesday 2 PM baseline. The system handles this through aggressive pre-computation: merchant cards (the UI tile showing name, image, delivery time, rating) are pre-rendered and cached in a CDN-backed edge cache (Fastly) with a 2-minute TTL. Delivery time estimates are pre-computed for popular origin-destination pairs using a geospatial grid (H3 hexagons at resolution 7, ~1.2 km edge length) and cached in Redis.

CockroachDB scaling is managed via range-based sharding with automatic rebalancing. Write-hot ranges (e.g., a popular merchant receiving many orders simultaneously) are detected and split automatically. For the Dasher matching service, each metro area runs as an independent shard to bound the matching problem size — a Dasher in Chicago never competes with orders in New York. Cross-region order transfers (when a consumer near a metro boundary orders from a merchant in an adjacent region) are handled by a routing layer that assigns the order to the merchant's region.

Key Trade-offs

  • CockroachDB over PostgreSQL: Strong consistency and built-in horizontal scaling eliminate manual sharding headaches, but CockroachDB has higher per-query latency (~5ms vs ~1ms for local PostgreSQL) — mitigated with aggressive caching
  • Delayed Dasher dispatch (STRP) over immediate assignment: Waiting to dispatch until food is nearly ready improves Dasher utilization and food quality, but risks no Dasher being available when food is actually ready — a fallback immediate-dispatch mode triggers if STRP confidence is low
  • gRPC over REST for internal services: gRPC's binary serialization and HTTP/2 multiplexing reduce inter-service latency by 40% compared to JSON/REST, but debugging is harder without human-readable payloads — solved by gRPC reflection and request logging middleware
  • Custom search over Elasticsearch: Building a bespoke Lucene-based search gave DoorDash fine-grained control over ranking and real-time index updates, at the cost of significant engineering investment to maintain
