System Design: Driver Location Tracking
Design a large-scale driver location tracking system using geohashing, WebSockets, and spatial indexing to process millions of GPS updates per second for real-time fleet visibility.
Requirements
Functional Requirements:
- Ingest real-time GPS location updates from millions of active drivers
- Maintain a queryable spatial index of all active driver positions
- Support proximity queries: find all drivers within N kilometers of a given point
- Stream live location updates to consumers tracking their delivery
- Detect geofence entry/exit events (driver arriving at restaurant or delivery address)
- Store location history for route replay, dispute resolution, and analytics
Non-Functional Requirements:
- Handle 2M concurrent active drivers sending updates every 4 seconds = 500K updates/sec
- Proximity query latency under 10ms for 95th percentile
- Location update propagation to tracking consumers within 3 seconds end-to-end
- 99.99% availability for the ingestion and query paths
- Location data accuracy: filter out GPS noise while preserving real movement
Scale Estimation
2M active drivers × 1 update every 4 seconds = 500K location ingestion events/sec. Each event payload is ~200 bytes (driver_id, lat, lng, heading, speed, accuracy, timestamp, battery_level) = 100MB/sec raw ingestion throughput. Proximity queries: the dispatch system issues ~50K proximity queries/sec during peak (each order needing nearby drivers). Geofence checks: with 5M active geofences (restaurant locations + delivery addresses), each location update must be checked against relevant geofences = 500K geofence evaluations/sec. Historical storage: 500K events/sec × 200 bytes × 86,400 seconds = 8.6TB/day of location breadcrumbs.
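The arithmetic behind these figures is simple enough to verify in a few lines (a minimal sketch using the estimates above):

```python
# Minimal back-of-the-envelope check of the figures above.
ACTIVE_DRIVERS = 2_000_000
UPDATE_INTERVAL_S = 4
EVENT_BYTES = 200  # driver_id, lat, lng, heading, speed, accuracy, timestamp, battery_level

updates_per_sec = ACTIVE_DRIVERS / UPDATE_INTERVAL_S       # 500,000 events/sec
ingest_bytes_per_sec = updates_per_sec * EVENT_BYTES       # 100 MB/sec raw ingestion
daily_storage_tb = ingest_bytes_per_sec * 86_400 / 1e12    # ~8.6 TB/day of breadcrumbs

print(f"{updates_per_sec:,.0f} updates/sec, "
      f"{ingest_bytes_per_sec / 1e6:.0f} MB/sec, "
      f"{daily_storage_tb:.2f} TB/day")
```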
High-Level Architecture
The location tracking system uses a lambda-style architecture with a real-time serving layer and a batch storage layer. The ingestion path receives GPS updates from driver apps, processes them through a streaming pipeline, and updates both the real-time spatial index and the historical store. The architecture separates hot (real-time) and cold (historical) data paths for optimal performance.
The real-time path: Driver App → Location Gateway (UDP/HTTPS) → Kafka topic driver-locations-raw → Flink Stream Processor (noise filtering, geofence detection, geohash computation) → Kafka topic driver-locations-processed → two consumers: (1) Spatial Index Updater writes to Redis geospatial index, and (2) Tracking Fanout Service pushes updates to consumers via WebSocket. The spatial index in Redis is the single source of truth for current driver positions, supporting sub-10ms proximity queries.
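As a sketch of the Spatial Index Updater consumer (assuming kafka-python and redis-py clients, JSON-encoded events, and illustrative hostnames; none of these choices are prescribed by the design itself):

```python
import json

import redis
from kafka import KafkaConsumer  # assumed: kafka-python client

# Sketch of the Spatial Index Updater: consume processed location events
# and upsert each driver's current position into the Redis GEO index.
r = redis.Redis(host="redis-spatial", port=6379)  # hostname is illustrative
consumer = KafkaConsumer(
    "driver-locations-processed",
    bootstrap_servers=["kafka:9092"],
    group_id="spatial-index-updater",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for msg in consumer:
    event = msg.value  # e.g. {"driver_id": 42, "lat": 40.75, "lng": -73.99, ...}
    # GEOADD takes longitude first, then latitude, then the member name.
    r.geoadd("drivers", (event["lng"], event["lat"], str(event["driver_id"])))
```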
The cold path: a Kafka Connect sink writes raw location events to a TimescaleDB cluster (PostgreSQL extension optimized for time-series data) partitioned by day and driver_id. A separate sink writes to S3 in Parquet format for long-term archival and batch analytics. This dual-write pattern ensures both fast queries on recent data and cost-effective long-term storage.
Core Components
Geohash-Based Spatial Index
The spatial index uses Redis with two complementary data structures. Primary: Redis GEO commands (GEOADD/GEOSEARCH) which internally use a sorted set with geohash-encoded scores. The geohash encodes latitude and longitude into a single 52-bit integer using interleaved bit encoding, enabling range queries via sorted set operations. GEOADD drivers {lng} {lat} {driver_id} updates a driver's position; GEOSEARCH drivers FROMLONLAT {lng} {lat} BYRADIUS {km} km COUNT 100 ASC finds the nearest 100 drivers. Secondary: a separate sorted set per geohash prefix at precision level 6 (~1.2 km × 0.6 km cells) enables cell-based lookups for the dispatch system. When a driver's geohash-6 prefix changes, they are removed from the old cell's set and added to the new one. This dual structure supports both radius queries (primary) and grid-based partitioned queries (secondary) with different performance characteristics.
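A minimal sketch of both structures using redis-py (an assumed client; the geohash-6 prefix is taken from Redis's GEOHASH reply rather than computed client-side, and using the last-update timestamp as the sorted-set score is an illustrative choice):

```python
import time

import redis

r = redis.Redis(host="redis-spatial", port=6379, decode_responses=True)

def update_driver_position(driver_id: str, lng: float, lat: float) -> None:
    """Upsert the primary GEO index and maintain the secondary per-cell set."""
    r.geoadd("drivers", (lng, lat, driver_id))
    # GEOHASH returns an 11-character geohash; the first 6 characters name
    # the ~1.2 km x 0.6 km cell used by the dispatch system.
    cell = r.geohash("drivers", driver_id)[0][:6]
    previous = r.getset(f"driver-cell:{driver_id}", cell)
    if previous and previous != cell:
        r.zrem(f"cell:{previous}", driver_id)
    # Score by last-update time so dispatch can ignore stale entries (illustrative choice).
    r.zadd(f"cell:{cell}", {driver_id: time.time()})

def nearest_drivers(lng: float, lat: float, radius_km: float, limit: int = 100):
    """Primary path: radius query via GEOSEARCH, nearest first, with distances."""
    return r.geosearch(
        "drivers",
        longitude=lng,
        latitude=lat,
        radius=radius_km,
        unit="km",
        sort="ASC",
        count=limit,
        withdist=True,
    )
```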
GPS Noise Filter & Map Matching
Raw GPS data from mobile devices contains significant noise — urban canyons cause multipath errors, tunnels create gaps, and building interiors produce wild jumps. The Flink stream processor applies a Kalman filter to each driver's location stream, maintaining per-driver state (estimated position, velocity, and uncertainty covariance matrix). The filter rejects updates that imply physically impossible movement (acceleration >15 m/s², speed >200 km/h for cars, >60 km/h for bikes). After filtering, a lightweight map-matching algorithm snaps the position to the nearest road segment using a local R-tree index of the road network. The map-matched position is used for ETA calculation; the raw (filtered) position is used for the consumer-facing tracking dot (since map-matched positions can look unnatural on the map).
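The full Kalman filter is too long for a short sketch, but the physical-plausibility gate that accompanies it can be shown directly (thresholds taken from the text; distance between consecutive fixes computed with the haversine formula):

```python
import math

MAX_SPEED_KMH = {"car": 200, "bike": 60}   # per-vehicle hard limits from the text
MAX_ACCEL_MS2 = 15.0
EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two GPS fixes."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def is_plausible(prev, curr, vehicle_type="car"):
    """Reject a fix that implies physically impossible movement since the last fix.

    prev/curr are dicts with lat, lng, speed (m/s) and timestamp (epoch seconds).
    """
    dt = curr["timestamp"] - prev["timestamp"]
    if dt <= 0:
        return False  # out-of-order or duplicate fix
    implied_speed_kmh = haversine_m(prev["lat"], prev["lng"], curr["lat"], curr["lng"]) / dt * 3.6
    implied_accel_ms2 = abs(curr["speed"] - prev["speed"]) / dt
    return (implied_speed_kmh <= MAX_SPEED_KMH[vehicle_type]
            and implied_accel_ms2 <= MAX_ACCEL_MS2)
```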
Geofence Detection Engine
Geofences are circular regions defined around restaurant pickup zones (50m radius) and delivery drop-off zones (100m radius). The engine maintains an in-memory R-tree index of all active geofences (roughly 5M at any time). When a processed location update arrives, the engine queries the R-tree for geofences within the driver's vicinity (200m radius query). For each nearby geofence, it checks if the driver has crossed the boundary since the last update using a simple point-in-circle test. Entry/exit events are emitted to a Kafka topic consumed by the Order Service to trigger automatic status transitions (e.g., driver entered restaurant zone → update order status to DRIVER_AT_RESTAURANT). The R-tree is partitioned by city and loaded into each Flink task manager for the corresponding geographic partition.
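A hedged sketch of the detection loop, assuming the rtree Python wrapper around libspatialindex for the geofence index and a caller-supplied emit function standing in for the Kafka producer:

```python
import math
from dataclasses import dataclass

from rtree import index  # assumed: the rtree wrapper around libspatialindex

@dataclass
class Geofence:
    geofence_id: int
    lat: float
    lng: float
    radius_m: float

fences = {}                # geofence_id -> Geofence
rtree_idx = index.Index()  # in-memory R-tree over geofence bounding boxes (degrees)

def haversine_m(lat1, lng1, lat2, lng2):
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lng2 - lng1) / 2) ** 2)
    return 2 * 6_371_000 * math.asin(math.sqrt(a))

def add_geofence(f: Geofence) -> None:
    dlat = f.radius_m / 111_320  # rough metres-per-degree-of-latitude conversion
    dlng = dlat / max(math.cos(math.radians(f.lat)), 1e-6)
    fences[f.geofence_id] = f
    rtree_idx.insert(f.geofence_id, (f.lng - dlng, f.lat - dlat, f.lng + dlng, f.lat + dlat))

def check_geofences(driver_id, lat, lng, inside, emit) -> None:
    """Point-in-circle test against nearby candidates; emit entry/exit events.

    `inside` is the per-driver set of geofence ids the driver was last inside;
    `emit` stands in for whatever publishes events to the Kafka topic.
    """
    d = 200 / 111_320  # ~200 m vicinity query, expressed in degrees
    for fid in rtree_idx.intersection((lng - d, lat - d, lng + d, lat + d)):
        fence = fences[fid]
        is_inside = haversine_m(lat, lng, fence.lat, fence.lng) <= fence.radius_m
        if is_inside and fid not in inside:
            inside.add(fid)
            emit({"type": "GEOFENCE_ENTER", "driver_id": driver_id, "geofence_id": fid})
        elif not is_inside and fid in inside:
            inside.remove(fid)
            emit({"type": "GEOFENCE_EXIT", "driver_id": driver_id, "geofence_id": fid})
```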
Database Design
The real-time spatial index in Redis uses a single GEO key active-drivers containing all 2M active driver positions. Redis GEO internally stores each member in a sorted set scored by its 52-bit geohash, making radius queries run in O(N + log(M)), where N is the number of entries inside the bounding box around the search area and M is the total number of indexed items. Driver metadata (status, vehicle type, current orders) is stored in a separate Redis hash driver:{id} co-located on the same Redis cluster for locality.
Historical locations are stored in TimescaleDB with a hypertable driver_locations partitioned by time (1-day chunks) and space (hashed driver_id). Columns: timestamp (TIMESTAMPTZ), driver_id (BIGINT), location (PostGIS GEOMETRY POINT), heading (SMALLINT), speed (REAL), accuracy (REAL), battery (SMALLINT), raw_lat (DOUBLE PRECISION), raw_lng (DOUBLE PRECISION). A compression policy compresses chunks older than 7 days using TimescaleDB's columnar compression, achieving ~10x compression ratio. A retention policy drops chunks older than 90 days; data older than 90 days lives in S3 Parquet files for regulatory compliance (retained for 3 years).
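One way the hypertable and its policies might be declared, assuming the timescaledb and postgis extensions are installed and psycopg2 as the driver (the partition count and segment-by column are illustrative):

```python
import psycopg2  # assumed driver; requires the timescaledb and postgis extensions

DDL = """
CREATE TABLE IF NOT EXISTS driver_locations (
    timestamp  TIMESTAMPTZ       NOT NULL,
    driver_id  BIGINT            NOT NULL,
    location   GEOMETRY(POINT, 4326),
    heading    SMALLINT,
    speed      REAL,
    accuracy   REAL,
    battery    SMALLINT,
    raw_lat    DOUBLE PRECISION,
    raw_lng    DOUBLE PRECISION
);
-- 1-day time chunks plus space partitioning on driver_id (partition count illustrative).
SELECT create_hypertable('driver_locations', 'timestamp',
                         partitioning_column => 'driver_id',
                         number_partitions   => 16,
                         chunk_time_interval => INTERVAL '1 day',
                         if_not_exists       => TRUE);
-- Columnar compression for chunks older than 7 days; drop chunks older than 90 days.
ALTER TABLE driver_locations
    SET (timescaledb.compress, timescaledb.compress_segmentby = 'driver_id');
SELECT add_compression_policy('driver_locations', INTERVAL '7 days', if_not_exists => TRUE);
SELECT add_retention_policy('driver_locations', INTERVAL '90 days', if_not_exists => TRUE);
"""

with psycopg2.connect("dbname=locations") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```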
API Design
- POST /api/v1/locations/batch: Driver submits buffered location updates; body is an array of {lat, lng, heading, speed, accuracy, timestamp} objects; returns an acknowledgment with server_timestamp for clock sync
- GET /api/v1/locations/nearby?lat={lat}&lng={lng}&radius_km={r}&limit={n}&vehicle_type={type}: Find nearby available drivers; returns an array of {driver_id, lat, lng, distance_km, heading, speed}
- GET /api/v1/locations/driver/{driver_id}/history?start={ts}&end={ts}: Retrieve location history for a specific driver in a time range; returns a GeoJSON LineString
- POST /api/v1/geofences: Create a geofence; body contains center_lat, center_lng, radius_m, type (pickup/dropoff), delivery_id, expiry_timestamp
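A sketch of how the nearby-drivers endpoint could sit on top of the Redis index (FastAPI is an assumed choice for the HTTP layer; per-city key routing and the vehicle_type filter are elided):

```python
import redis
from fastapi import FastAPI

app = FastAPI()
r = redis.Redis(host="redis-spatial", port=6379, decode_responses=True)

@app.get("/api/v1/locations/nearby")
def nearby(lat: float, lng: float, radius_km: float = 3.0, limit: int = 20):
    # GEOSEARCH against the spatial index; city routing is elided in this sketch.
    hits = r.geosearch("drivers", longitude=lng, latitude=lat,
                       radius=radius_km, unit="km",
                       sort="ASC", count=limit,
                       withdist=True, withcoord=True)
    return [{"driver_id": member, "distance_km": dist, "lng": coord[0], "lat": coord[1]}
            for member, dist, coord in hits]
```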
Scaling & Bottlenecks
The Redis spatial index is the critical bottleneck. A single Redis instance can handle approximately 100K GEOADD operations/sec and 50K GEOSEARCH operations/sec. With 500K location updates/sec and 50K proximity queries/sec, the system requires a Redis Cluster with at least 10 shards. However, Redis GEO commands operate on a single key and cannot be sharded across nodes — the entire active-drivers key must reside on one shard. The solution is geographic partitioning: each city gets its own Redis key drivers:{city_id}, and the application layer routes queries to the correct key based on the geographic context. Large cities (NYC, London) are further subdivided into zones.
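A small routing helper illustrates the per-city key scheme (the key naming and zone handling here are illustrative, not prescribed by the design):

```python
from typing import Optional

import redis

r = redis.Redis(host="redis-spatial", port=6379, decode_responses=True)

def city_key(city_id: str, zone: Optional[str] = None) -> str:
    """Map a geographic context to its GEO key, e.g. drivers:nyc or drivers:nyc:manhattan."""
    return f"drivers:{city_id}:{zone}" if zone else f"drivers:{city_id}"

def nearby_in_city(city_id: str, lng: float, lat: float, radius_km: float,
                   zone: Optional[str] = None):
    # Each command targets one per-city key, so keys (and their load) can be
    # spread across different Redis Cluster shards.
    return r.geosearch(city_key(city_id, zone), longitude=lng, latitude=lat,
                       radius=radius_km, unit="km", sort="ASC", count=100)
```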
The Flink stream processor maintains per-driver Kalman filter state (position, velocity, covariance matrix = ~200 bytes per driver). With 2M drivers, this is roughly 400MB of keyed state in total, partitioned across Flink task managers by driver_id. Flink's RocksDB state backend handles this efficiently, but checkpoint serialization during state snapshots can cause processing pauses. This is mitigated by using incremental checkpoints (only changed state is serialized) and a checkpoint interval of 30 seconds.
Key Trade-offs
- Geohash over QuadTree for spatial indexing: Geohash encodes 2D coordinates into a 1D sortable integer compatible with Redis sorted sets, enabling O(log n) lookups — QuadTree provides better spatial partitioning but requires a custom data structure not available in Redis
- Kalman filter over simple windowed average for noise reduction: Kalman filter incorporates velocity and uncertainty estimates for much better tracking accuracy, but requires per-driver state maintenance (400MB across fleet) — the accuracy improvement is worth the memory cost
- UDP for location ingestion over TCP: UDP eliminates connection overhead and head-of-line blocking, reducing battery drain and latency — packet loss is acceptable since the next update arrives in 4 seconds; a small buffer on the client resends critical updates
- TimescaleDB over pure Cassandra for historical data: TimescaleDB provides SQL query compatibility and excellent compression for time-series location data, but its write throughput (200K inserts/sec per node) is lower than Cassandra — mitigated by buffering writes in Kafka and using a 5-node cluster