System Design: Waze Navigation
Design Waze's community-driven navigation system covering real-time incident crowdsourcing, map editing pipelines, and social routing at scale. Ideal for system design interview practice.
Requirements
Functional Requirements:
- Users receive turn-by-turn navigation with live traffic-aware routing
- Users can report incidents: accidents, police, road hazards, closures
- Community members can edit the map (add roads, fix errors, update speed limits)
- System re-routes automatically when faster alternatives are detected
- A carpooling feature connects drivers heading in the same direction
- Alert notifications for reported hazards on the current route
Non-Functional Requirements:
- Route computation returns results in under 1 second at the 95th percentile
- Incident reports visible to affected drivers within 10 seconds
- 99.9% availability; navigation must degrade gracefully without internet
- Handle 140 million monthly active users; 20 million concurrent at peak
- Traffic model updated continuously from 500,000+ active probe vehicles
Scale Estimation
Waze has ~140 million MAU; on a busy commute morning ~20 million users navigate simultaneously. Each active navigation session emits a ping every 2 seconds (~10 million pings/second). Incident reports: ~3 million/day globally, peaking at ~500/second during rush hour. Map edits: ~50,000/day from the Waze Map Editor community. Recomputing traffic for ~100 million road segments worldwide every 2 minutes works out to ~833,000 segment updates/second during batch recomputation windows.
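A quick back-of-envelope check of these figures; the inputs below are the numbers quoted in this section, not measured values.

```python
# Back-of-envelope check of the estimates above.
CONCURRENT_NAVIGATORS = 20_000_000   # peak simultaneous navigation sessions
PING_INTERVAL_S = 2                  # each session pings every 2 seconds
ROAD_SEGMENTS = 100_000_000          # road segments in the worldwide graph
RECOMPUTE_WINDOW_S = 120             # every segment refreshed every 2 minutes

pings_per_second = CONCURRENT_NAVIGATORS / PING_INTERVAL_S
segment_updates_per_second = ROAD_SEGMENTS / RECOMPUTE_WINDOW_S

print(f"{pings_per_second:,.0f} pings/s")               # 10,000,000
print(f"{segment_updates_per_second:,.0f} updates/s")   # ~833,333
```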
High-Level Architecture
Waze's architecture centers on a bidirectional data flow: client apps both consume navigation data and produce probe/incident data back to the server. This creates a self-improving feedback loop. The platform is organized around three services: the Routing Service, the Real-Time Event Service, and the Map Management Service.
The Routing Service holds the full road graph in memory across a sharded cluster. Routing queries are hashed by origin geography to a specific shard (each shard covers a country or sub-region), ensuring the relevant graph partition is always in RAM. Routes are computed using a modified A* algorithm that incorporates current and predicted traffic speeds, user preferences (avoid tolls, highways), and historical patterns.
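To make the cost model concrete, here is a minimal sketch of A* where each edge's cost is its travel time under the current segment speed. It illustrates the idea rather than Waze's actual router; the toy graph, coordinates, and speeds are made up.

```python
import heapq
import math

nodes = {  # node -> (x_km, y_km) in a flat toy coordinate space
    "A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 0),
}
edges = {  # node -> [(neighbor, length_km, current_speed_kmh)]
    "A": [("B", 1.0, 10.0), ("C", 1.5, 60.0)],   # A->B is congested (10 km/h)
    "B": [("D", 1.0, 60.0)],
    "C": [("D", 1.5, 60.0)],
}
MAX_SPEED_KMH = 120.0  # admissible heuristic: assume free flow at the network max


def heuristic_hours(a, b):
    (x1, y1), (x2, y2) = nodes[a], nodes[b]
    return math.hypot(x2 - x1, y2 - y1) / MAX_SPEED_KMH


def route(origin, destination):
    best = {origin: 0.0}
    frontier = [(heuristic_hours(origin, destination), 0.0, origin, [origin])]
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == destination:
            return path, cost
        for nbr, length_km, speed_kmh in edges.get(node, []):
            new_cost = cost + length_km / speed_kmh  # travel time in hours
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic_hours(nbr, destination),
                                          new_cost, nbr, path + [nbr]))
    return None, float("inf")


path, hours = route("A", "D")
print(path, f"{hours * 60:.1f} min")  # ['A', 'C', 'D'] 3.0 min: detours around the congested A->B
```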
The Real-Time Event Service processes the firehose of incoming pings and incident reports from client apps via Kafka. A stream processing topology (Apache Flink) performs map-matching to snap GPS traces to road segments, aggregates speeds per segment, and flags a slowdown when a segment's aggregated speed drops below a threshold relative to its historical baseline. Confirmed incidents (validated by multiple independent reporters) are broadcast via Server-Sent Events to all clients currently navigating near the incident.
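A minimal sketch of that per-segment slowdown check; the 50% ratio and the minimum probe count are illustrative assumptions, not Waze's tuning.

```python
from statistics import median

SLOWDOWN_RATIO = 0.5      # alert when below 50% of the historical baseline (assumed)
MIN_PROBES = 5            # require several vehicles before raising an alert (assumed)


def detect_slowdown(window_speeds_kmh, baseline_kmh):
    """window_speeds_kmh: map-matched probe speeds from the last few minutes."""
    if len(window_speeds_kmh) < MIN_PROBES:
        return False
    return median(window_speeds_kmh) < SLOWDOWN_RATIO * baseline_kmh


print(detect_slowdown([18, 22, 15, 20, 19], baseline_kmh=65))  # True: heavy slowdown
print(detect_slowdown([60, 58, 63, 61, 59], baseline_kmh=65))  # False: free flow
```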
Core Components
Probe Data Pipeline
Every active Waze client sends compressed GPS traces (lat, lng, speed, heading, timestamp) every 2 seconds. These land in Kafka partitioned by geohash prefix. A Flink job reads each partition, snaps traces to the road graph using a Hidden Markov Model map-matcher, and updates a Redis sorted set per road segment with a sliding 5-minute speed window. The P50 speed in that window becomes the current segment speed used for routing ETA calculations.
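A sketch of that per-segment sliding window using a Redis sorted set scored by ping timestamp, assuming redis-py; the key layout and member encoding are illustrative choices, not a documented schema.

```python
import time
from statistics import median

import redis  # pip install redis

WINDOW_S = 300  # 5-minute sliding window
r = redis.Redis()


def record_probe_speed(segment_id: str, speed_kmh: float, ts: float | None = None):
    ts = ts or time.time()
    key = f"segspeed:{segment_id}"                # assumed key layout
    member = f"{ts:.3f}:{speed_kmh:.1f}"          # unique member with speed embedded
    r.zadd(key, {member: ts})
    r.zremrangebyscore(key, 0, ts - WINDOW_S)     # drop pings older than 5 minutes
    r.expire(key, WINDOW_S * 2)                   # let idle segments age out


def current_segment_speed(segment_id: str) -> float | None:
    key = f"segspeed:{segment_id}"
    members = r.zrangebyscore(key, time.time() - WINDOW_S, "+inf")
    speeds = [float(m.decode().split(":")[1]) for m in members]
    return median(speeds) if speeds else None     # P50 used for routing ETAs
```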
Incident Crowdsourcing Engine
Users report incidents via a one-tap UI. Reports carry type (accident, hazard, police, closure), exact GPS coordinates, and direction of travel. The Incident Service stores reports in Cassandra with a TTL matching the incident type's typical duration (accident = 2 hours, police = 45 minutes). A report's confidence increases when multiple independent users report the same incident within a geofenced radius, and a machine learning classifier filters out duplicate and spam reports. Validated incidents are pushed over the Server-Sent Events channel to clients within the incident's affected radius.
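A simplified sketch of the confirmation rule: an incident is treated as validated once enough distinct users report the same type nearby within a recent window. The radius, window, and threshold of three reporters are assumptions for illustration.

```python
import math
import time

CONFIRM_RADIUS_M = 300    # geofence radius (assumed)
CONFIRM_WINDOW_S = 600    # only count reports from the last 10 minutes (assumed)
MIN_REPORTERS = 3         # distinct users required (assumed)


def haversine_m(lat1, lng1, lat2, lng2):
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def is_validated(candidate, recent_reports, now=None):
    """candidate/report: dicts with user_id, type, lat, lng, ts."""
    now = now or time.time()
    reporters = {
        rep["user_id"]
        for rep in recent_reports
        if rep["type"] == candidate["type"]
        and now - rep["ts"] <= CONFIRM_WINDOW_S
        and haversine_m(candidate["lat"], candidate["lng"],
                        rep["lat"], rep["lng"]) <= CONFIRM_RADIUS_M
    }
    return len(reporters) >= MIN_REPORTERS
```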
Map Editor Pipeline
Waze Map Editor is a web app where community volunteers edit the road graph. Edits go through a validation queue: automated checks (topology consistency, speed limit ranges, one-way logic) run first; large edits require human moderator approval. Approved edits are applied to a staging graph, tested with golden-set routing queries, and if no regression is detected, the updated graph is exported to a binary format and distributed to routing shards via a rolling deployment with zero-downtime swap.
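A sketch of the golden-set gate at the end of that pipeline, assuming a hypothetical compute_route(graph, origin, destination) helper and a 10% ETA tolerance; both are illustrative, not Waze's actual criteria.

```python
ETA_TOLERANCE = 0.10  # allow up to 10% ETA drift against production (assumed)


def passes_golden_set(staging_graph, golden_queries, compute_route):
    """golden_queries: [(origin, destination, production_eta_seconds), ...]"""
    for origin, destination, prod_eta in golden_queries:
        result = compute_route(staging_graph, origin, destination)
        if result is None:                        # route no longer exists: hard fail
            return False
        if abs(result.eta_seconds - prod_eta) > ETA_TOLERANCE * prod_eta:
            return False                          # ETA regressed beyond tolerance
    return True
```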
Database Design
The road graph is stored as a custom compressed binary adjacency list loaded into routing-server memory (~50 GB per country shard). Postgres stores the canonical map data with PostGIS extensions for spatial operations during the edit/export pipeline. Incident data lives in Cassandra: partition key is geohash (precision 5, ~5 km cells), clustering key is incident_timestamp DESC, allowing range scans by area and recency. User profiles and gamification points use MySQL with read replicas. Historical probe data is archived to S3 Parquet for traffic model training (TensorFlow-based LSTM for predicting future segment speeds).
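A sketch of what the incident table implied by those keys could look like, using the Python cassandra-driver; the keyspace, table, and column names are assumptions consistent with the partition/clustering keys described above.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("waze")  # placeholder host and keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS incidents_by_cell (
        geohash5     text,        -- precision-5 geohash, ~5 km cell
        incident_ts  timestamp,
        incident_id  uuid,
        type         text,
        lat          double,
        lng          double,
        PRIMARY KEY ((geohash5), incident_ts, incident_id)
    ) WITH CLUSTERING ORDER BY (incident_ts DESC, incident_id ASC)
""")

-- = None  # (ignore) --
# Per-type expiry is applied at write time, e.g. a 2-hour TTL for accidents.
insert = session.prepare(
    "INSERT INTO incidents_by_cell (geohash5, incident_ts, incident_id, type, lat, lng) "
    "VALUES (?, ?, ?, ?, ?, ?) USING TTL 7200"
)

# "Active incidents in this cell, newest first" is a single-partition range scan.
recent = session.prepare(
    "SELECT * FROM incidents_by_cell WHERE geohash5 = ? AND incident_ts > ? LIMIT 100"
)
```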
API Design
- POST /v1/navigation/route — Accepts origin, destination, departure time, and preferences; returns route polyline, maneuver list, ETA, and hazards on route (example request after this list)
- POST /v1/events/report — Driver submits incident type, coordinates, and direction; returns confirmation and points earned
- GET /v1/events/nearby?lat={}&lng={}&radius={} — Returns all active validated incidents within radius; polled or streamed via SSE
- POST /v1/navigation/ping — Client sends current position + speed every 2 seconds during active navigation; server returns any new hazards or re-route suggestion
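An illustrative request against the route endpoint; the base URL is a placeholder, and any field names beyond those listed above (origin, destination, departure time, preferences, and the response's polyline/maneuvers/ETA/hazards) are assumptions.

```python
import requests  # pip install requests

resp = requests.post(
    "https://api.example.com/v1/navigation/route",   # placeholder base URL
    json={
        "origin": {"lat": 37.7749, "lng": -122.4194},
        "destination": {"lat": 37.3382, "lng": -121.8863},
        "departure_time": "2024-05-01T08:00:00Z",
        "preferences": {"avoid_tolls": True, "avoid_highways": False},
    },
    timeout=2,
)
route = resp.json()
# Expected shape per the endpoint description: polyline, maneuvers, ETA, hazards.
print(route["eta_seconds"], len(route["maneuvers"]), route["hazards_on_route"])
```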
Scaling & Bottlenecks
The routing shards are memory-bound: each holds a full country graph in RAM. Horizontal scaling means adding more shards (splitting large countries like the US into regions). Cross-shard routing (e.g., driving from the US to Canada) requires a hierarchical approach: a meta-shard holds a coarse international graph, and border-crossing routes are assembled from two shard computations.
The probe data ingestion pipeline (10 million pings/second) sits within Kafka's throughput sweet spot given a suitably large partition count. The bottleneck shifts to the Flink map-matching jobs, which are CPU-intensive. Waze mitigates this by fully map-matching only 20% of pings (sampled) and using dead reckoning for the rest; GPS drift is corrected when the next map-matched fix arrives.
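A sketch of how the 20% sampling and the dead-reckoning fallback might look; the hash-based selector and the flat-earth position projection are simplifying assumptions.

```python
import hashlib
import math

SAMPLE_RATE = 0.20  # fraction of pings that get full HMM map-matching


def should_full_map_match(session_id: str, ping_seq: int) -> bool:
    # Deterministic ~20% selection so the same ping always gets the same decision.
    h = hashlib.sha256(f"{session_id}:{ping_seq}".encode()).digest()
    return (h[0] / 255) < SAMPLE_RATE


def dead_reckon(last_fix, elapsed_s):
    """Project forward from the last map-matched fix using its speed and heading."""
    dist_m = last_fix["speed_mps"] * elapsed_s
    heading = math.radians(last_fix["heading_deg"])      # 0 deg = due north
    dlat = (dist_m * math.cos(heading)) / 111_320        # metres per degree latitude
    dlng = (dist_m * math.sin(heading)) / (
        111_320 * math.cos(math.radians(last_fix["lat"]))
    )
    return last_fix["lat"] + dlat, last_fix["lng"] + dlng
```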
Key Trade-offs
- Community editing vs. data quality — crowdsourced edits improve coverage rapidly but introduce errors; tiered trust levels (new editors require approval, veterans commit directly) balance speed and quality
- Frequent re-routing vs. user annoyance — re-routing too aggressively frustrates drivers; Waze only re-routes if the new route saves ≥2 minutes and the divergence point is ≥500m ahead (see the sketch after this list)
- Incident broadcast radius — too large wastes bandwidth and causes irrelevant alerts; too small misses drivers on approach; Waze uses segment-travel-time to calculate a 90-second lookahead zone
- Offline maps vs. real-time accuracy — Waze sacrifices full offline support to maintain real-time crowdsourced accuracy; a lightweight cached route is available for 30 minutes of connectivity loss
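The re-route rule from the second trade-off, expressed as a small sketch; the thresholds come from the bullet above, while the function shape is an illustrative assumption.

```python
MIN_SAVINGS_S = 120               # new route must save at least 2 minutes
MIN_DIVERGENCE_AHEAD_M = 500      # divergence point must be at least 500 m ahead


def should_reroute(current_eta_s: float, alternative_eta_s: float,
                   divergence_distance_ahead_m: float) -> bool:
    saves_enough = (current_eta_s - alternative_eta_s) >= MIN_SAVINGS_S
    enough_warning = divergence_distance_ahead_m >= MIN_DIVERGENCE_AHEAD_M
    return saves_enough and enough_warning


print(should_reroute(1800, 1620, 900))   # True: saves 3 min, diverges 900 m ahead
print(should_reroute(1800, 1720, 900))   # False: only saves 80 s
```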