System Design: IoT Device Management System

Requirements

Functional Requirements:

Manage device lifecycle: register, provision, activate, decommission, and replace devices
Remote configuration: push configuration updates to individual devices or device groups
Firmware/software OTA updates with rollback capability
Device health monitoring: connectivity status, uptime, error rates, resource utilization
Certificate lifecycle management: issue, renew, and revoke X.509 device certificates
Audit logging: complete history of all management operations on each device

Non-Functional Requirements:

Support 50 million devices across 10,000 enterprise customers (multi-tenant)
Configuration push to 1 million devices within 5 minutes
Certificate expiry notifications 30 days in advance; auto-renewal for online devices
Device connectivity status update latency under 30 seconds (device goes offline → platform shows offline)
All management operations are multi-tenant isolated

Scale Estimation

50M devices with 10% online at any moment = 5M connected devices maintaining MQTT or WebSocket connections. Configuration updates: a mass configuration push to 1M devices in 5 minutes requires delivering 1M/300 ≈ 3,333 configuration messages/second. Each config message averages 5 KB (JSON config document), so outbound configuration traffic is about 16.7 MB/second, which is trivial. OTA updates are the large bandwidth events: 500 MB × 1M devices = 500 TB per campaign, delivered via CDN. Certificate management: 50M × 1 cert/year renewal = 50M / 31.5M seconds ≈ 1.6 certs/second, which is minimal (and roughly half that if each certificate is renewed only once per 2-year validity period). Device health telemetry: 5M online devices × 1 heartbeat/30 seconds ≈ 167k heartbeats/second.
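The arithmetic above can be reproduced with a short script. This is only a back-of-envelope sketch: every input is a figure stated in the estimate, decimal units are assumed (1 MB = 1,000 KB, 1 TB = 1,000,000 MB), and the function and constant names are illustrative.

```python
# Back-of-envelope capacity figures; all inputs come from the estimate in the text.
TOTAL_DEVICES = 50_000_000
ONLINE_FRACTION = 0.10
PUSH_TARGETS = 1_000_000
PUSH_WINDOW_S = 5 * 60
CONFIG_SIZE_KB = 5
FIRMWARE_SIZE_MB = 500
HEARTBEAT_INTERVAL_S = 30
SECONDS_PER_YEAR = 365 * 24 * 3600


def estimate() -> dict:
    """Return the derived rates, using decimal units throughout."""
    online = TOTAL_DEVICES * ONLINE_FRACTION
    push_rate = PUSH_TARGETS / PUSH_WINDOW_S
    return {
        "online_devices": online,
        "config_msgs_per_s": push_rate,
        "config_mb_per_s": push_rate * CONFIG_SIZE_KB / 1000,
        "ota_campaign_tb": FIRMWARE_SIZE_MB * PUSH_TARGETS / 1_000_000,
        # Assumes one renewal per device per year, as in the estimate.
        "cert_renewals_per_s": TOTAL_DEVICES / SECONDS_PER_YEAR,
        "heartbeats_per_s": online / HEARTBEAT_INTERVAL_S,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.2f}")
```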

High-Level Architecture

The device management system is a multi-tenant SaaS platform. Each enterprise customer (tenant) has an isolated namespace for their devices. The platform provides: a REST API for operators, a device-facing MQTT/HTTP interface for devices, a background job system for bulk operations, and a monitoring dashboard.

Device connectivity layer: devices connect via MQTT using X.509 certificates for mutual TLS authentication. A distributed MQTT broker cluster routes device messages to the appropriate tenant's topic namespace. A heartbeat monitor service tracks connection/disconnection events from MQTT and updates device connectivity status in Redis (with a 30-second TTL per device — if a heartbeat is missed, the device is considered offline after TTL expiry). Status changes are published to Kafka for consumption by the health monitoring and alerting services.

Configuration management: configuration documents are versioned in PostgreSQL. When an operator updates a device group's configuration, a diff is computed between the current and new configuration, and a configuration job is created. The job dispatcher publishes configuration push messages to each target device's MQTT topic. For large group pushes (1M+ devices), the dispatcher uses a fan-out queue (SQS FIFO) to rate-limit delivery and track completion per device; FIFO queues have much lower default throughput limits than standard queues, so sustaining roughly 3,333 messages/second may require high-throughput mode, batching, or several queues. Each device acknowledges receipt by publishing a config_ack message carrying the new configuration version.
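The diff step might look like the sketch below. It assumes configuration documents are nested JSON objects and that a diff is expressed as dotted paths to set and unset; the function name diff_config and the output shape are hypothetical, not something the design above prescribes.

```python
from typing import Any


def diff_config(current: dict[str, Any], new: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Return {"set": {path: value}, "unset": [path, ...]} that turns `current` into `new`."""
    changes: dict[str, Any] = {"set": {}, "unset": []}
    for key in current.keys() | new.keys():
        path = f"{prefix}{key}"
        if key not in new:
            changes["unset"].append(path)          # key was removed
        elif key not in current:
            changes["set"][path] = new[key]        # key was added
        elif isinstance(current[key], dict) and isinstance(new[key], dict):
            # Recurse so that only the changed leaves are sent to the device.
            nested = diff_config(current[key], new[key], prefix=f"{path}.")
            changes["set"].update(nested["set"])
            changes["unset"].extend(nested["unset"])
        elif current[key] != new[key]:
            changes["set"][path] = new[key]        # value changed
    changes["unset"].sort()                        # stable output for logging and tests
    return changes


if __name__ == "__main__":
    old = {"report_interval_s": 60, "wifi": {"ssid": "a", "power_save": True}, "debug": True}
    new = {"report_interval_s": 30, "wifi": {"ssid": "a", "power_save": False}}
    print(diff_config(old, new))
```

Sending only the changed paths keeps push messages small, at the cost of requiring the device to hold the previous version; a device whose config_version does not match the diff's base version would need the full document instead.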

Core Components

Device Registry

The device registry is the authoritative source of truth for all device metadata. PostgreSQL schema: devices (device_id, tenant_id, serial_number, model_id, group_ids[], firmware_version, config_version, status, registration_status, first_seen_at, last_seen_at), device_models (model_id, manufacturer, hw_revision, supported_protocols[], default_config_json), device_groups (group_id, tenant_id, name, config_template_json, firmware_target_version). The registry is queried by all other services. To avoid making it a bottleneck, hot device metadata (connectivity status, firmware version, config version) is cached in Redis per device with a 5-minute TTL, updated on each status change event.
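The caching behavior described above is a cache-aside read. The sketch below uses an in-memory TTLCache as a stand-in for Redis so that it runs without a server; TTLCache, get_device_meta, and the load_from_registry callback are illustrative names, and only the key format and the 5-minute TTL come from the text.

```python
import time
from typing import Callable, Optional

META_TTL_S = 5 * 60  # 5-minute TTL for hot device metadata, as stated in the text


class TTLCache:
    """In-memory stand-in for Redis SET with an expiry; for illustration only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._items: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._items.get(key)
        if entry is None or entry[0] <= self._clock():
            self._items.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]

    def set(self, key: str, value: dict, ttl_s: float) -> None:
        self._items[key] = (self._clock() + ttl_s, value)


def get_device_meta(device_id: str, cache: TTLCache,
                    load_from_registry: Callable[[str], dict]) -> dict:
    """Serve from cache when possible; fall back to the registry and repopulate."""
    key = f"device:{device_id}:meta"
    cached = cache.get(key)
    if cached is not None:
        return cached
    meta = load_from_registry(device_id)  # hypothetical PostgreSQL lookup
    cache.set(key, meta, META_TTL_S)
    return meta
```

A status change event would overwrite the same key, so the TTL mainly bounds staleness for devices whose events were missed.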

Certificate Management Service

The service operates a private PKI for each tenant. On device registration, the service issues a device X.509 certificate signed by the tenant's intermediate CA (which in turn is signed by the platform root CA). Certificate issuance uses a 2-year validity period with a 90-day renewal window. Renewal process: 30 days before expiry, the service sends a renewal command to the device's MQTT command topic; the device generates a new key pair, sends a CSR (Certificate Signing Request) to the service's HTTPS endpoint, and receives the new certificate. Revocation: on device decommission, the service adds the certificate serial to the tenant's CRL (Certificate Revocation List) and publishes the updated CRL to S3. MQTT brokers check the CRL on each TLS handshake, using a locally cached copy that is refreshed periodically rather than fetched from S3 for every connection. An append-only log of issued and revoked certificates, similar in spirit to certificate transparency logs, is maintained for audit.
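The renewal decision can be sketched as a pure function over a certificate record. The 30-day lead time and the split between online devices (auto-renewal) and offline devices (notification) come from the requirements; the CertRecord type, the action names, and the handling of already expired certificates are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RENEWAL_LEAD = timedelta(days=30)  # renewal is triggered 30 days before expiry


@dataclass(frozen=True)
class CertRecord:
    device_id: str
    expires_at: datetime
    revoked: bool = False


def renewal_action(cert: CertRecord, online: bool, now: datetime) -> str:
    """Decide what the certificate service should do for one certificate."""
    if cert.revoked:
        return "none"                      # revoked certificates are never renewed
    if now >= cert.expires_at:
        return "expired"                   # assumption: needs out-of-band re-provisioning
    if cert.expires_at - now > RENEWAL_LEAD:
        return "none"                      # not yet inside the 30-day window
    # Inside the window: online devices renew themselves, offline ones need a human.
    return "send_renewal_command" if online else "notify_operator"


if __name__ == "__main__":
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    cert = CertRecord("dev-1", expires_at=now + timedelta(days=20))
    print(renewal_action(cert, online=True, now=now))
```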

Bulk Operation Dispatcher

For operations targeting large device groups (mass config push, fleet-wide OTA), the dispatcher manages throttled, resumable bulk operations. A bulk operation is created with a target query (e.g., all devices in group G running firmware v1.2) and resolves to a list of target device IDs at job creation time. The dispatcher uses a cursor-based batch processor: process 1,000 device IDs per batch, publish messages to device MQTT topics, wait for ACKs (with timeout), mark completed devices, and advance the cursor. Job state (cursor position, per-device status: pending/delivered/acknowledged/failed) is stored in Redis (for fast updates) and flushed to PostgreSQL hourly. Operators see real-time progress (e.g., "750,000 / 1,000,000 devices updated") on the dashboard.
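A minimal sketch of the cursor-based batch loop follows. It keeps job state in a plain dict where the design uses Redis, and it takes publish and wait_for_acks as injected callbacks; both callbacks and the function name run_bulk_job are hypothetical. Here "delivered" means the message was published to the device topic, not that the device received it.

```python
from typing import Callable

BATCH_SIZE = 1_000  # device IDs per batch, as in the text


def run_bulk_job(
    device_ids: list[str],
    state: dict,
    publish: Callable[[str], None],
    wait_for_acks: Callable[[list[str]], set[str]],
) -> dict:
    """Process device_ids from state["cursor"] onward, one batch at a time.

    `state` holds the cursor and per-device status, so passing a saved state
    back in resumes the job where it stopped.
    """
    state.setdefault("cursor", 0)
    state.setdefault("status", {})
    while state["cursor"] < len(device_ids):
        start = state["cursor"]
        batch = device_ids[start:start + BATCH_SIZE]
        for device_id in batch:
            publish(device_id)
            state["status"][device_id] = "delivered"
        # Blocks until the ACK timeout and returns the devices that acknowledged.
        acked = wait_for_acks(batch)
        for device_id in batch:
            state["status"][device_id] = "acknowledged" if device_id in acked else "failed"
        state["cursor"] = start + len(batch)  # advance only after the batch is settled
    return state
```

Because the cursor advances only after a batch is settled, a crash mid-batch replays at most one batch on resume, so device-side handling of a repeated message should be idempotent.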

Database Design

PostgreSQL (per-tenant schema for isolation):

devices, device_models, device_groups
configurations (config_id, device_id or group_id, config_json, version, created_by, created_at)
certificates (cert_id, device_id, serial, issued_at, expires_at, revoked_at)
bulk_jobs (job_id, type [ota/config/command], target_query_json, total_count, completed_count, failed_count, status, created_at)
audit_log (log_id, tenant_id, device_id, action, actor_id, details_json, occurred_at), append-only

Redis Cluster:

device:{device_id}:status (online/offline, TTL 30s)
device:{device_id}:meta (firmware_version, config_version, TTL 5m)
job:{job_id}:progress (hash of device_id → status for active bulk jobs)

S3:

firmware binaries (tenant-scoped)
CRL files (public, per tenant CA)
audit log exports

API Design

POST /devices/register: body {serial_number, model_id, group_ids[], csr_pem}; validates the CSR, issues a certificate, creates the device record, and returns {device_id, certificate_pem, mqtt_endpoint}
GET /devices?group_id={g}&status={online/offline}&firmware_version={v}: filtered device list with pagination; served from PostgreSQL with a Redis status overlay
POST /devices/{device_id}/config: body {config_json}; creates a versioned config, queues delivery to the device, and returns config_id
POST /jobs/ota: body {target_group_id, firmware_version, rollout_pct, schedule}; creates an OTA bulk job with staged rollout
GET /jobs/{job_id}/progress: returns bulk job progress from Redis (real-time) or PostgreSQL (historical)
POST /devices/{device_id}/decommission: revokes the certificate, archives the device record, and sends a disconnect command
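The text does not say how rollout_pct selects devices for a staged OTA rollout. One common approach, shown here purely as an assumption, is to hash each device ID into a stable bucket so that raising the percentage only ever adds devices to the cohort. The function name in_rollout and the use of the job ID as a salt are illustrative.

```python
import hashlib


def in_rollout(device_id: str, job_id: str, rollout_pct: float) -> bool:
    """Return True if this device falls inside the first rollout_pct percent.

    The bucket depends only on (job_id, device_id), so the decision is stable
    across calls, and a device selected at 10% stays selected at 50%.
    """
    digest = hashlib.sha256(f"{job_id}:{device_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, i.e. 0.01% steps
    return bucket < rollout_pct * 100


if __name__ == "__main__":
    fleet = [f"dev-{i}" for i in range(10_000)]
    selected = sum(in_rollout(d, "job-1", 10) for d in fleet)
    print(f"{selected} of {len(fleet)} devices in the 10% stage")
```

Salting with the job ID means different campaigns pick different early cohorts, which avoids always testing new firmware on the same devices.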

Scaling & Bottlenecks

Configuration push fan-out to 1M devices in 5 minutes: the dispatcher must publish about 3,333 MQTT messages/second to unique device topics. A single MQTT broker handles ~100k publishes/second, so publish throughput is not the constraint, and the dispatcher can additionally spread publishes across multiple broker nodes in parallel. Job state updates in Redis are not a bottleneck either: marking 3,333 devices/second as "delivered" is 3,333 Redis writes/second, which is trivial. The remaining concern is subscription routing: the broker cluster must match publishes against the individual topic subscriptions of 5M connected devices, so use wildcard subscriptions sparingly, as they have O(N) matching cost.
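Holding the dispatcher to a steady publish rate is typically done with a token bucket. The sketch below is one way to pace 1M publishes at 1M/300 per second; the class, the one-second burst allowance, and the 100 ms tick are assumptions, and time is passed in explicitly so the behavior can be simulated without sleeping.

```python
class TokenBucket:
    """Token bucket with an explicit clock, so it can be simulated deterministically."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.updated = now

    def take(self, now: float, wanted: int) -> int:
        """Refill for the elapsed time, then grant up to `wanted` whole tokens."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        granted = min(wanted, int(self.tokens))
        self.tokens -= granted
        return granted


def seconds_to_dispatch(total: int, rate_per_s: float, tick_s: float = 0.1) -> float:
    """Simulate a dispatcher that wakes every tick_s and publishes what the bucket allows."""
    bucket = TokenBucket(rate_per_s, burst=rate_per_s)  # allow one second of burst
    sent, tick = 0, 0
    while True:
        sent += bucket.take(now=tick * tick_s, wanted=total - sent)
        if sent >= total:
            return tick * tick_s
        tick += 1


if __name__ == "__main__":
    elapsed = seconds_to_dispatch(1_000_000, 1_000_000 / 300)
    print(f"1M publishes paced over about {elapsed:.1f} s")
```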

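Heartbeat processing at 167k events/second: each heartbeat resets the TTL on the device's online status key. A Redis Cluster can handle 1M+ SET/EXPIRE commands/second, so 167k/second is well within capacity. The challenge is offline detection: Redis keyspace expiry notifications are disabled by default, are delivered fire-and-forget over Pub/Sub, and are generated only when the key is actually deleted, which can lag the TTL. Instead, use a sorted set scored by deadline (device_id → expected_next_heartbeat_timestamp) and a scanner job that queries for expired entries every 10 seconds, detecting offline devices within 10 + TTL = 40 seconds of the last heartbeat. That worst case exceeds the 30-second target in the requirements; meeting it means shortening the heartbeat interval and scan period, or relying on the broker's disconnect events for devices that drop their connection.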
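The sorted-set scan can be sketched without a Redis server. The class below keeps device deadlines in a dict; in Redis, heartbeat would map to ZADD with the deadline as the score, and scan_expired to ZRANGEBYSCORE from -inf to the current time followed by ZREM. The class and method names are illustrative.

```python
class HeartbeatTracker:
    """In-memory model of the deadline sorted set (device_id -> next expected heartbeat)."""

    def __init__(self, ttl_s: float = 30.0):
        self.ttl_s = ttl_s
        self._deadline: dict[str, float] = {}

    def heartbeat(self, device_id: str, now: float) -> None:
        # Redis equivalent: ZADD <key> <now + ttl> <device_id>
        self._deadline[device_id] = now + self.ttl_s

    def scan_expired(self, now: float) -> list[str]:
        """Return devices whose deadline has passed and remove them from the set.

        Redis equivalent: ZRANGEBYSCORE <key> -inf <now>, then ZREM the results.
        A real sorted set answers this in O(log N + M); the dict scan here is O(N).
        """
        expired = sorted(d for d, t in self._deadline.items() if t <= now)
        for device_id in expired:
            del self._deadline[device_id]
        return expired


if __name__ == "__main__":
    tracker = HeartbeatTracker(ttl_s=30)
    tracker.heartbeat("dev-a", now=0)
    tracker.heartbeat("dev-b", now=0)
    tracker.heartbeat("dev-a", now=25)     # dev-a checks in again, dev-b goes silent
    print(tracker.scan_expired(now=30))    # the scanner reports dev-b as offline
```

Each expired device would then be published to Kafka as a status change, and removing it from the set ensures the offline event fires once rather than on every scan.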

Key Trade-offs

Schema-per-tenant vs. row-level security: Schema per tenant provides the strongest isolation for regulated industries (healthcare IoT, industrial) but complicates cross-tenant operations (e.g., platform-wide firmware compatibility queries); RLS with a tenant_id column is simpler operationally but requires rigorous policy enforcement.
Certificate auto-renewal vs. manual: Auto-renewal eliminates the operational burden of tracking certificate expiry for 50M devices but requires devices to be online at renewal time; manual renewal with 30-day notice works for fleet devices with scheduled maintenance windows.
Bulk job fan-out rate: Pushing to all 1M devices simultaneously maximizes speed but creates an MQTT broker burst that can overload the cluster; throttled dispatch (3,333/second) completes in 5 minutes while keeping broker load predictable.
MQTT vs. LWM2M: MQTT is a general-purpose messaging protocol; LightweightM2M (LWM2M) is an IoT-specific device management protocol with built-in primitives for firmware updates and device configuration. LWM2M reduces custom protocol development at the cost of maturity and ecosystem breadth compared to MQTT.
