System Design: Connected Vehicle Platform
Design a scalable connected vehicle platform that ingests real-time telemetry from millions of vehicles, supports remote commands, enables OTA firmware updates, and powers fleet management and predictive maintenance analytics.
Requirements
Functional Requirements:
- Ingest real-time telemetry from vehicle ECUs: GPS location, speed, fuel/battery level, engine diagnostics (OBD-II), and sensor data
- Remote commands: lock/unlock doors, pre-condition cabin temperature, emergency stop (fleet context)
- Over-the-air (OTA) firmware and software updates for vehicle ECUs and infotainment systems
- Fleet management: real-time vehicle tracking, trip history, driver behavior scoring
- Predictive maintenance: alert operators when vehicle components are predicted to fail based on telemetry trends
- Regulatory compliance: data retention, privacy controls (GDPR), and cybersecurity standards (ISO/SAE 21434)
Non-Functional Requirements:
- Support 10 million connected vehicles publishing telemetry at 1-10 Hz
- Command delivery latency under 2 seconds to vehicle over LTE/5G
- OTA update delivery to 1 million vehicles within 4 hours (staged rollout)
- Vehicle telemetry retained for 24 months for compliance
- Security: end-to-end encryption, certificate-based device identity, zero-trust architecture
Scale Estimation
10M vehicles × 5 messages/second average = 50M messages/second. Each message averages 300 bytes (GPS + 10 OBD signals), for 15 GB/second of ingestion. Over 24 months of retention: 15 GB/s × 86,400 s/day × 730 days ≈ 946 PB — only feasible with aggressive compression and downsampling. In practice, downsample 5 Hz telemetry to 1 Hz after 30 days and to one sample per minute after 6 months, reducing storage to ~50 PB at 24 months. MQTT connection state: 10M vehicles × 2 KB = 20 GB of session state. OTA updates: 500 MB average firmware × 1M vehicles = 500 TB of download traffic per campaign.
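A quick sanity check of this arithmetic, as a minimal Python sketch; the constants simply mirror the estimates above:

```python
# Back-of-envelope check of the ingestion and retention numbers above.
VEHICLES = 10_000_000
MSG_RATE_HZ = 5            # average messages per vehicle per second
MSG_BYTES = 300            # GPS + ~10 OBD signals, Protobuf-encoded

ingest_bps = VEHICLES * MSG_RATE_HZ * MSG_BYTES
print(f"ingest: {ingest_bps / 1e9:.1f} GB/s")            # ~15.0 GB/s

SECONDS_PER_DAY = 86_400
RETENTION_DAYS = 730       # 24 months
raw_pb = ingest_bps * SECONDS_PER_DAY * RETENTION_DAYS / 1e15
print(f"raw 24-month retention: {raw_pb:.0f} PB")        # ~946 PB before downsampling
```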
High-Level Architecture
The vehicle platform uses a cellular (LTE/5G) + MQTT architecture for telemetry and command channels. Each vehicle has a Telematics Control Unit (TCU) that maintains a persistent MQTT connection to the cloud. Telemetry is published by the vehicle at regular intervals; commands are subscribed to by the vehicle on a dedicated command topic.
The ingestion tier: a massively scaled MQTT broker cluster (EMQX or VerneMQ, 1,000+ nodes) receives vehicle connections. The cluster uses MQTT v5 with QoS 1 (at-least-once) for telemetry and QoS 2 (exactly-once) for commands. Inbound telemetry is published to Apache Kafka (one topic per telemetry type: GPS, diagnostics, events). Downstream consumers: a GPS track writer (TimescaleDB), a real-time analytics engine (Flink), and an S3 archive writer (Parquet).
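A minimal sketch of the broker-to-Kafka routing, assuming paho-mqtt 2.x and confluent-kafka; in practice a broker like EMQX would typically handle this bridging natively, so this only illustrates the topic-mapping logic, and all hostnames are placeholders:

```python
# Sketch: route vehicles/{vin}/telemetry/{type} MQTT messages into
# per-type Kafka topics, keyed by VIN so a vehicle's data stays ordered.
import paho.mqtt.client as mqtt
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})  # placeholder host

def on_connect(client, userdata, flags, reason_code, properties):
    client.subscribe("vehicles/+/telemetry/+", qos=1)     # QoS 1: at-least-once

def on_message(client, userdata, msg):
    # MQTT topic: vehicles/{vin}/telemetry/{type} -> Kafka topic telemetry.{type}
    _, vin, _, telemetry_type = msg.topic.split("/")
    producer.produce(f"telemetry.{telemetry_type}", key=vin, value=msg.payload)
    producer.poll(0)  # serve delivery callbacks without blocking

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_connect = on_connect
client.on_message = on_message
client.connect("mqtt-broker", 1883)  # placeholder; production uses TLS
client.loop_forever()
```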
The command path is a separate critical-path service. Vehicle commands (lock, pre-condition, OTA trigger) are received from the fleet management API, written to a PostgreSQL command log (for audit), and published to the vehicle's MQTT command topic. Command acknowledgment: the vehicle publishes a command_ack event within 30 seconds (timeout triggers a retry with exponential backoff, up to 3 retries). If the vehicle is offline, the command is queued in Redis with a 24-hour TTL and delivered on next connection.
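A sketch of that dispatch logic, assuming redis-py and paho-mqtt; the vehicles:online set (populated by broker connect/disconnect hooks) and the key names are illustrative assumptions:

```python
# Sketch of the command dispatch path: audit log, then MQTT publish if the
# vehicle is online, else a Redis queue with a 24-hour TTL.
import json, time, uuid
import redis

r = redis.Redis(host="redis", port=6379)  # placeholder host

def dispatch_command(mqtt_client, vin: str, cmd_type: str, params: dict) -> str:
    command_id = str(uuid.uuid4())
    payload = json.dumps({"command_id": command_id, "type": cmd_type,
                          "parameters": params, "issued_at": time.time()})
    # 1. Audit write (INSERT into the PostgreSQL commands table) would go here.
    if r.sismember("vehicles:online", vin):   # assumed broker-maintained set
        mqtt_client.publish(f"vehicles/{vin}/commands", payload, qos=2)
    else:
        # Vehicle offline: queue in Redis, replayed on next MQTT connection.
        r.rpush(f"vehicle:{vin}:command_queue", payload)
        r.expire(f"vehicle:{vin}:command_queue", 24 * 3600)  # 24-hour TTL
    return command_id
```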
Core Components
TCU Telemetry Publisher
The TCU is the vehicle's cloud connectivity gateway. It runs a lightweight MQTT client (Paho C library) on a dedicated embedded Linux module. The TCU aggregates signals from the vehicle CAN bus (J1939/OBD-II) at 100 Hz, downsamples to 5 Hz, serializes with Protobuf (a compact binary format, roughly 5× smaller than JSON), and publishes to vehicles/{vin}/telemetry/gps, vehicles/{vin}/telemetry/diagnostics, and vehicles/{vin}/events. The TCU keeps a local queue (128 MB of NVMe storage) for messages while cellular connectivity is lost, draining it in order on reconnection. Certificate-based authentication: the TCU holds a device certificate in an embedded Hardware Security Module (HSM) — private keys are never exposed to software.
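The production TCU uses the Paho C client; this Python sketch (paho-mqtt 2.x) shows the same store-and-forward pattern, with the in-memory queue standing in for the on-disk buffer and the queue sizing and client ID as illustrative assumptions:

```python
# Store-and-forward sketch: buffer telemetry while cellular is down, drain
# the backlog in order once the MQTT connection is back.
import collections
import paho.mqtt.client as mqtt

local_queue = collections.deque(maxlen=400_000)   # ~128 MB at ~300 B/message

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2, client_id="tcu-VIN123")
client.connect("mqtt-broker", 1883)  # production: TLS with HSM-held client cert
client.loop_start()

def publish_sample(topic: str, payload: bytes) -> None:
    if client.is_connected():
        while local_queue:                        # drain backlog in order first
            t, p = local_queue.popleft()
            client.publish(t, p, qos=1)
        client.publish(topic, payload, qos=1)
    else:
        local_queue.append((topic, payload))      # buffer while offline
```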
OTA Update Service
OTA updates use a delta update strategy: only the diff between the current and new firmware version is transmitted (binary diffing via BSDiff), reducing download size by 60-80% for incremental updates. The update service: (1) operator uploads new firmware to S3 and creates an update campaign targeting a device group; (2) the service signs the firmware package with the platform's code-signing key; (3) a staged rollout distributes the update to 1% of vehicles first, monitors for error rates, and automatically progresses to 10%, 50%, 100% after 4-hour observation windows; (4) the vehicle TCU receives the update trigger via MQTT, downloads from S3 (via pre-signed URL), verifies the signature, and applies the update in a background thread; (5) the TCU reboots into the new firmware and reports the update result.
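The staged rollout reduces to a small gating loop; a sketch following the stage ladder above, where the 2% abort threshold and the injected helper functions are illustrative assumptions:

```python
# Sketch of the staged-rollout controller: push to a cohort, observe for
# 4 hours, abort if update failures exceed the threshold, else widen.
import time

STAGES = [0.01, 0.10, 0.50, 1.00]     # 1% -> 10% -> 50% -> 100%
OBSERVATION_WINDOW_S = 4 * 3600
MAX_FAILURE_RATE = 0.02               # illustrative abort threshold

def run_campaign(campaign, get_failure_rate, push_to_fraction) -> str:
    for stage in STAGES:
        push_to_fraction(campaign, stage)   # OTA trigger via MQTT for this cohort
        time.sleep(OBSERVATION_WINDOW_S)    # observe update_result events
        if get_failure_rate(campaign) > MAX_FAILURE_RATE:
            return "halted"                 # stop rollout, page the operator
    return "complete"
```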
Predictive Maintenance Engine
The engine consumes telemetry from Kafka and maintains a per-vehicle feature store (Redis: rolling statistics like battery cycle count, average engine temperature, brake pressure variance). An ML model (gradient boosted trees, retrained weekly on labeled failure data) scores the probability of component failure in the next 30 days. Features include: mileage, engine temperature statistics, vibration anomaly scores, oil pressure trends, and battery capacity degradation rate. Predictions above a threshold trigger a maintenance alert: a push notification to the fleet operator's mobile app and an entry in the maintenance scheduling system. Prediction accuracy is monitored by comparing alerts against actual repair records in the maintenance log.
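A sketch of one scoring pass, assuming the rolling features live in a Redis hash and the weekly-retrained model is a scikit-learn gradient boosted tree persisted with joblib; the feature names, key names, and 0.7 alert threshold are illustrative:

```python
# Sketch: assemble the per-vehicle feature vector from Redis, score 30-day
# failure probability, store the score, and alert above a threshold.
import joblib
import redis

r = redis.Redis(host="redis")                      # placeholder host
model = joblib.load("failure_model_gbt.joblib")    # weekly-retrained GBT

FEATURES = ["mileage_km", "engine_temp_mean", "engine_temp_p95",
            "vibration_anomaly_score", "oil_pressure_trend",
            "battery_capacity_fade_pct"]

def score_vehicle(vin: str) -> float:
    raw = r.hgetall(f"vehicle:{vin}:features")     # rolling stats from Flink
    x = [[float(raw[f.encode()]) for f in FEATURES]]
    p_fail_30d = model.predict_proba(x)[0][1]      # P(failure in next 30 days)
    r.set(f"maintenance:{vin}:score", p_fail_30d)
    if p_fail_30d > 0.7:                           # illustrative alert threshold
        print(f"maintenance alert for {vin}: p={p_fail_30d:.2f}")
    return p_fail_30d
```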
Database Design
- PostgreSQL: vehicles (vin, make, model, year, fleet_id, tcu_cert_serial, software_version, status), fleets (fleet_id, org_id, name, vehicle_count), commands (command_id, vin, type, payload_json, status, issued_at, acked_at, issued_by), ota_campaigns (campaign_id, fleet_id, firmware_version, rollout_pct, status, created_at)
- TimescaleDB: vehicle_gps (vin, lat, lon, speed_kmh, heading, accuracy_m, recorded_at) — hypertable, 90-day hot retention, then S3 export; vehicle_diagnostics (vin, battery_pct, fuel_pct, engine_temp_c, fault_codes[], recorded_at) — compressed after 30 days
- Redis: vehicle:{vin}:shadow (last known state), vehicle:{vin}:command_queue (pending offline commands), maintenance:{vin}:score (latest ML prediction score)
- S3: long-term telemetry archive (Parquet), firmware packages (signed, versioned), trip summaries (pre-aggregated for reporting)
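For instance, the vehicle-shadow access pattern from the Redis key layout above, assuming redis-py; the field names are illustrative:

```python
# Sketch: the shadow is a small JSON blob keyed by VIN, overwritten on each
# telemetry update and read by the location API before any database query.
import json, time
import redis

r = redis.Redis(host="redis")  # placeholder host

def update_shadow(vin: str, lat: float, lon: float, battery_pct: float) -> None:
    r.set(f"vehicle:{vin}:shadow", json.dumps({
        "lat": lat, "lon": lon, "battery_pct": battery_pct,
        "updated_at": time.time(),
    }))

def last_known_state(vin: str) -> dict | None:
    raw = r.get(f"vehicle:{vin}:shadow")
    return json.loads(raw) if raw else None   # None -> fall back to TimescaleDB
```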
API Design
- GET /vehicles/{vin}/location — returns current GPS location from the Redis shadow; falls back to TimescaleDB for the last known position if the vehicle is offline
- POST /vehicles/{vin}/commands — body: {type, parameters} (e.g., {type: "lock_doors"}); writes to the command log, publishes to the vehicle's MQTT command topic, returns command_id
- GET /commands/{command_id}/status — returns command delivery status (pending/delivered/acknowledged/timed_out)
- POST /ota/campaigns — body: {fleet_id, firmware_version, rollout_schedule}; creates a staged OTA campaign
- GET /vehicles/{vin}/trips?from={ts}&to={ts} — returns trip history (start/end location, distance, duration, average speed) from TimescaleDB
- GET /vehicles/{vin}/maintenance/prediction — returns component health scores and failure probability forecasts
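An illustrative client flow for the command endpoints, with a placeholder base URL, token, and VIN:

```python
# Issue a lock_doors command, then poll status until the vehicle acks or
# the 30-second ack window elapses.
import time
import requests

BASE = "https://api.vehicle-platform.example.com"   # placeholder
HEADERS = {"Authorization": "Bearer <token>"}       # placeholder

resp = requests.post(f"{BASE}/vehicles/1HGCM82633A004352/commands",
                     json={"type": "lock_doors", "parameters": {}},
                     headers=HEADERS)
command_id = resp.json()["command_id"]

for _ in range(30):                                 # poll for up to ~30 s
    status = requests.get(f"{BASE}/commands/{command_id}/status",
                          headers=HEADERS).json()["status"]
    if status in ("acknowledged", "timed_out"):
        break
    time.sleep(1)
```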
Scaling & Bottlenecks
MQTT broker fleet: 10M vehicles × 2 KB session state = 20 GB of in-memory session state distributed across 1,000 broker nodes. A broker node failure requires session state recovery from the distributed store (Redis) — recovery time under 10 seconds per node. MQTT connection rebalancing on node failure: use consistent hashing to limit reconnection storms to the fraction of vehicles assigned to the failed node (1/1,000 = 0.1% ≈ 10k vehicles reconnecting simultaneously — manageable).
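A minimal consistent-hash ring showing why a node failure reassigns only ~1/N of the vehicles; the virtual-node count and MD5 hash are illustrative choices:

```python
# Sketch: map VINs onto a hash ring of broker nodes. Removing one node only
# moves the keys that pointed at it; everything else keeps its assignment.
import bisect, hashlib

def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # 100 virtual points per node smooth out load imbalance.
        self._points = sorted((_h(f"{n}#{i}"), n)
                              for n in nodes for i in range(vnodes))
        self._keys = [p for p, _ in self._points]

    def node_for(self, vin: str) -> str:
        i = bisect.bisect(self._keys, _h(vin)) % len(self._points)
        return self._points[i][1]

ring = Ring([f"broker-{i}" for i in range(1000)])
print(ring.node_for("1HGCM82633A004352"))  # stable unless that broker leaves
```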
Kafka ingestion at 50M messages/second requires ~250 Kafka brokers (each handling 200k messages/second) with 1,500 partitions across 3 AZs. This is at the upper end of Kafka's practical scale — consider Apache Pulsar (which separates compute and storage, enabling more horizontal scaling) at this message rate. The TimescaleDB GPS write path: 5 Hz × 10M vehicles = 50M rows/second — batching is critical. Grouping one second of rows per consumer into multi-row bulk inserts cuts per-statement overhead enough that 50M rows/second is achievable on a 100-node TimescaleDB cluster (~500k rows/second per node).
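A sketch of that batched write path, assuming psycopg2; one multi-row INSERT per consumer per flush replaces millions of single-row statements, with the DSN as a placeholder:

```python
# Sketch: bulk-insert a buffered batch of GPS rows into the hypertable.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=telemetry host=timescale")  # placeholder DSN

def flush_gps_batch(rows) -> None:
    # rows: [(vin, lat, lon, speed_kmh, heading, accuracy_m, recorded_at), ...]
    with conn.cursor() as cur:
        execute_values(cur,
            """INSERT INTO vehicle_gps
               (vin, lat, lon, speed_kmh, heading, accuracy_m, recorded_at)
               VALUES %s""",
            rows, page_size=10_000)
    conn.commit()
```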
Key Trade-offs
- MQTT vs. HTTP for vehicle telemetry: MQTT's persistent connections avoid TCP handshake overhead for high-frequency telemetry but require stable cellular connections; HTTP/2 with batched uploads (and server-sent events for the cloud-to-vehicle direction) is a viable alternative for fleets with intermittent connectivity.
- 5 Hz telemetry vs. on-event reporting: Constant 5 Hz telemetry provides rich behavioral data for ML models but generates 50M messages/second; event-driven reporting (publish only when a signal changes beyond a threshold — see the dead-band sketch after this list) reduces volume by ~90% under stable driving conditions but loses time-series fidelity for analytics.
- Edge processing on TCU vs. cloud-only: Running anomaly detection on the TCU reduces cloud telemetry volume and enables immediate local alerts (engine overheat warning before cloud ack) but limits model complexity; a hybrid approach (simple threshold detection on TCU, complex ML in cloud) balances responsiveness and accuracy.
- OTA staged rollout vs. fleet-wide push: Staged rollout (1% → 10% → 100%) catches critical bugs before fleet-wide impact but means some vehicles run old software for hours — acceptable for feature updates, potentially unacceptable for critical security patches (fast-track security patches to 100% within 1 hour).
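A sketch of the dead-band reporting referenced in the telemetry trade-off above; the per-signal thresholds are illustrative:

```python
# Sketch: publish a signal only when it moves beyond a per-signal dead-band
# relative to the last published value, suppressing steady-state chatter.
DEADBANDS = {"speed_kmh": 2.0, "fuel_pct": 0.5, "engine_temp_c": 1.0}
_last_sent: dict[str, float] = {}

def should_publish(signal: str, value: float) -> bool:
    prev = _last_sent.get(signal)
    if prev is None or abs(value - prev) >= DEADBANDS[signal]:
        _last_sent[signal] = value
        return True
    return False
```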