
System Design: IoT Device Management System

Design a scalable IoT device management system for enterprise fleets that handles device lifecycle management, remote configuration, firmware updates, health monitoring, and certificate management for millions of devices.

15 min read · Updated Jan 15, 2025
Tags: system-design, iot, device-management, certificate-management, fleet-management

Requirements

Functional Requirements:

  • Manage device lifecycle: register, provision, activate, decommission, and replace devices
  • Remote configuration: push configuration updates to individual devices or device groups
  • Firmware/software OTA updates with rollback capability
  • Device health monitoring: connectivity status, uptime, error rates, resource utilization
  • Certificate lifecycle management: issue, renew, and revoke X.509 device certificates
  • Audit logging: complete history of all management operations on each device

Non-Functional Requirements:

  • Support 50 million devices across 10,000 enterprise customers (multi-tenant)
  • Configuration push to 1 million devices within 5 minutes
  • Certificate expiry notifications 30 days in advance; auto-renewal for online devices
  • Device connectivity status update latency under 30 seconds (device goes offline → platform shows offline)
  • All management operations are multi-tenant isolated

Scale Estimation

50M devices with 10% online at any moment = 5M connected devices maintaining MQTT or WebSocket connections. Configuration updates: a mass configuration push to 1M devices in 5 minutes requires delivering 1M / 300 s ≈ 3,333 configuration messages/second. At an average of 5 KB per config message (JSON config document), that is about 16.7 MB/second of outbound configuration traffic, which is trivial. OTA updates are the large bandwidth events: assuming a 500 MB firmware image, 500 MB × 1M devices = 500 TB per campaign, delivered via CDN. Certificate management: 50M devices × 1 renewal/year ÷ 31.5M seconds/year ≈ 1.6 certificate renewals/second, which is minimal (and roughly half that if each certificate is renewed only once per 2-year validity period). Device health telemetry: 5M online devices × 1 heartbeat per 30 seconds ≈ 167k heartbeats/second.
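The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch; every constant is an assumption taken from the text (including the 500 MB firmware size and a 365-day year), and the variable names are illustrative only.

```python
# Back-of-envelope scale estimates; all inputs are assumptions from the text.
TOTAL_DEVICES = 50_000_000
ONLINE_PERCENT = 10
PUSH_TARGETS = 1_000_000          # devices in one mass configuration push
PUSH_WINDOW_S = 5 * 60            # 5-minute delivery target
CONFIG_SIZE_KB = 5                # average JSON config document
FIRMWARE_SIZE_MB = 500            # assumed firmware image size
HEARTBEAT_INTERVAL_S = 30
SECONDS_PER_YEAR = 365 * 24 * 3600

online_devices = TOTAL_DEVICES * ONLINE_PERCENT // 100
config_msgs_per_s = PUSH_TARGETS / PUSH_WINDOW_S
config_mb_per_s = config_msgs_per_s * CONFIG_SIZE_KB / 1000
ota_tb_per_campaign = FIRMWARE_SIZE_MB * PUSH_TARGETS / 1_000_000
cert_renewals_per_s = TOTAL_DEVICES / SECONDS_PER_YEAR   # one renewal per device per year
heartbeats_per_s = online_devices / HEARTBEAT_INTERVAL_S

print(f"online devices:        {online_devices:,}")
print(f"config messages/s:     {config_msgs_per_s:,.0f}")
print(f"config traffic MB/s:   {config_mb_per_s:.1f}")
print(f"OTA TB per campaign:   {ota_tb_per_campaign:,.0f}")
print(f"cert renewals/s:       {cert_renewals_per_s:.2f}")
print(f"heartbeats/s:          {heartbeats_per_s:,.0f}")
```

Changing a single input (for example a 10-second heartbeat interval) shows immediately which component's load moves the most.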

High-Level Architecture

The device management system is a multi-tenant SaaS platform. Each enterprise customer (tenant) has an isolated namespace for their devices. The platform provides: a REST API for operators, a device-facing MQTT/HTTP interface for devices, a background job system for bulk operations, and a monitoring dashboard.

Device connectivity layer: devices connect via MQTT using X.509 certificates for mutual TLS authentication. A distributed MQTT broker cluster routes device messages to the appropriate tenant's topic namespace. A heartbeat monitor service tracks connection/disconnection events from MQTT and updates device connectivity status in Redis, using a 30-second TTL per device that each heartbeat refreshes; if heartbeats stop, the device is considered offline once the TTL expires. Because a TTL equal to the heartbeat interval leaves no margin for network jitter, the heartbeat interval should be somewhat shorter than the TTL. Status changes are published to Kafka for consumption by the health monitoring and alerting services.
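One way to picture the tenant topic namespace is a fixed topic prefix derived from the identity in the device certificate. The sketch below is illustrative only: the `tenants/{tenant_id}/devices/{device_id}/...` layout and both function names are assumptions, not something the design above prescribes.

```python
def device_topic(tenant_id: str, device_id: str, channel: str) -> str:
    """Build a device-scoped topic under a hypothetical tenant namespace."""
    for part in (tenant_id, device_id, channel):
        # Reject empty parts and MQTT-reserved characters (level separator, wildcards).
        if not part or any(c in part for c in "/+#"):
            raise ValueError(f"invalid topic component: {part!r}")
    return f"tenants/{tenant_id}/devices/{device_id}/{channel}"


def is_authorized(cert_tenant_id: str, cert_device_id: str, topic: str) -> bool:
    """Allow a device to use only concrete topics under its own prefix.

    cert_tenant_id and cert_device_id are assumed to come from the verified
    client certificate presented during the mutual TLS handshake.
    """
    prefix = f"tenants/{cert_tenant_id}/devices/{cert_device_id}/"
    has_wildcard = "+" in topic or "#" in topic
    return topic.startswith(prefix) and not has_wildcard


print(device_topic("acme", "dev-1", "config"))
print(is_authorized("acme", "dev-1", "tenants/acme/devices/dev-1/config"))
print(is_authorized("acme", "dev-1", "tenants/other/devices/dev-1/config"))
```

Keeping the tenant ID as the first topic level means a single prefix check is enough to stop one tenant's device from publishing into another tenant's namespace.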

Configuration management: configuration documents are versioned in PostgreSQL. When an operator updates a device group's configuration, a diff is computed between the current and new configuration, and a configuration job is created. The job dispatcher publishes configuration push messages to each device's MQTT topic. For large group pushes (1M+ devices), the dispatcher uses a fan-out queue (SQS FIFO) to rate-limit delivery and track completion per device. Note that SQS FIFO queues have lower default throughput limits than standard queues, so sustaining about 3,333 messages/second requires batching or high-throughput mode. Each device acknowledges receipt of the configuration by publishing a config_ack message with the new configuration version.
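The diff-and-acknowledge flow can be sketched as follows. This assumes flat (non-nested) configuration documents, and the message fields `type`, `version`, `set`, and `unset` are hypothetical; only the `config_ack` message name comes from the text.

```python
def diff_config(current: dict, new: dict) -> dict:
    """Return the keys to set and the keys to remove to get from current to new."""
    changed = {k: v for k, v in new.items() if k not in current or current[k] != v}
    removed = sorted(k for k in current if k not in new)
    return {"set": changed, "unset": removed}


def apply_diff(current: dict, diff: dict) -> dict:
    """Device-side counterpart: apply a diff to the current configuration."""
    result = {k: v for k, v in current.items() if k not in diff["unset"]}
    result.update(diff["set"])
    return result


def build_push_message(version: int, current: dict, new: dict) -> dict:
    """Assemble a hypothetical config push payload carrying only the diff."""
    return {"type": "config_push", "version": version, **diff_config(current, new)}


def is_acknowledged(ack: dict, expected_version: int) -> bool:
    """A push counts as complete only when the ack carries the pushed version."""
    return ack.get("type") == "config_ack" and ack.get("version") == expected_version


current = {"interval_s": 30, "mode": "eco", "debug": True}
new = {"interval_s": 10, "mode": "eco", "region": "eu"}
message = build_push_message(7, current, new)
print(message)
print(apply_diff(current, message) == new)
```

Sending only the diff keeps messages small, but the device must already be on the expected base version; a full document is the safer fallback when that is uncertain.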

Core Components

Device Registry

The device registry is the authoritative source of truth for all device metadata. PostgreSQL schema:

  • devices (device_id, tenant_id, serial_number, model_id, group_ids[], firmware_version, config_version, status, registration_status, first_seen_at, last_seen_at)
  • device_models (model_id, manufacturer, hw_revision, supported_protocols[], default_config_json)
  • device_groups (group_id, tenant_id, name, config_template_json, firmware_target_version)

The registry is queried by all other services. To avoid making it a bottleneck, hot device metadata (connectivity status, firmware version, config version) is cached in Redis per device with a 5-minute TTL, updated on each status change event.
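The caching behavior described above is a cache-aside pattern. The sketch below uses an in-memory dictionary as a stand-in for Redis so it runs on its own; `TtlCache`, `get_device_meta`, and `load_from_registry` are hypothetical names.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """In-memory stand-in for a Redis key with a TTL (illustration only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[dict, float]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: behave as if the key is gone
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)


def get_device_meta(device_id: str, cache: TtlCache,
                    load_from_registry: Callable[[str], dict]) -> dict:
    """Cache-aside read: try the cache first, fall back to the registry."""
    cached = cache.get(device_id)
    if cached is not None:
        return cached
    meta = load_from_registry(device_id)  # would be a PostgreSQL query in practice
    cache.set(device_id, meta)
    return meta


# Usage with a fake clock so that expiry can be shown without sleeping.
now = [0.0]
registry_reads = []

def fake_registry(device_id: str) -> dict:
    registry_reads.append(device_id)
    return {"firmware_version": "1.2", "config_version": 7}

cache = TtlCache(ttl_seconds=300, clock=lambda: now[0])
get_device_meta("dev-1", cache, fake_registry)
get_device_meta("dev-1", cache, fake_registry)   # served from cache
now[0] = 301.0
get_device_meta("dev-1", cache, fake_registry)   # TTL elapsed, registry is read again
print(len(registry_reads))
```

Status change events would call `cache.set` directly, so the TTL only bounds staleness for devices whose events were missed.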

Certificate Management Service

The service operates a private PKI for each tenant. On device registration, the service issues a device X.509 certificate signed by the tenant's intermediate CA (which in turn is signed by the platform root CA). Certificates are issued with a 2-year validity period and a 90-day renewal window. Renewal process: 30 days before expiry, the service sends a renewal command to the device's MQTT command topic; the device generates a new key pair, sends a CSR (Certificate Signing Request) to the service's HTTPS endpoint, and receives the new certificate. Revocation: on device decommission, the service adds the certificate serial to the tenant's CRL (Certificate Revocation List) and publishes the updated CRL to S3, where MQTT brokers fetch and cache it so that revocation can be checked on each TLS handshake. A certificate-transparency-style log of issued and revoked certificates is maintained for audit.
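The renewal decision can be expressed as a small pure function. This is a sketch under the rules stated in this document (renewal triggered 30 days before expiry, auto-renewal only for online devices); the function name and the returned action strings are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RENEW_BEFORE = timedelta(days=30)  # trigger point taken from the text


def renewal_action(expires_at: datetime, revoked_at: Optional[datetime],
                   online: bool, now: datetime) -> str:
    """Decide what the certificate service should do for one certificate."""
    if revoked_at is not None:
        return "none"                      # revoked certificates are never renewed
    if now >= expires_at:
        return "expired"                   # too late for in-band renewal
    if expires_at - now > RENEW_BEFORE:
        return "none"                      # not yet inside the 30-day trigger
    # Inside the trigger window: online devices renew automatically,
    # offline devices are surfaced to the operator as an expiry notification.
    return "send_renewal_command" if online else "notify_operator"


now = datetime(2025, 1, 15, tzinfo=timezone.utc)
print(renewal_action(now + timedelta(days=20), None, True, now))
print(renewal_action(now + timedelta(days=20), None, False, now))
print(renewal_action(now + timedelta(days=200), None, True, now))
```

A periodic job could run this over certificates ordered by `expires_at`, which keeps the scan limited to rows near expiry.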

Bulk Operation Dispatcher

For operations targeting large device groups (mass config push, fleet-wide OTA), the dispatcher manages throttled, resumable bulk operations. A bulk operation is created with a target query (e.g., all devices in group G running firmware v1.2) and resolves to a list of target device IDs at job creation time. The dispatcher uses a cursor-based batch processor: process 1,000 device IDs per batch, publish messages to device MQTT topics, wait for ACKs (with timeout), mark completed devices, and advance the cursor. Job state (cursor position, per-device status: pending/delivered/acknowledged/failed) is stored in Redis (for fast updates) and flushed to PostgreSQL hourly. Operators see real-time progress (e.g., "750,000 / 1,000,000 devices updated") on the dashboard.
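A minimal sketch of the cursor-based batch processor follows. Job state lives in a plain dataclass here rather than Redis, ACK timeouts are left out, and `publish` is a hypothetical callback that returns whether the MQTT publish succeeded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class BulkJob:
    device_ids: List[str]                 # resolved once, at job creation time
    cursor: int = 0                       # index of the next device to process
    status: Dict[str, str] = field(default_factory=dict)


def process_next_batch(job: BulkJob, publish: Callable[[str], bool],
                       batch_size: int = 1000) -> bool:
    """Publish to one batch and advance the cursor. Returns True if work remains."""
    batch = job.device_ids[job.cursor:job.cursor + batch_size]
    for device_id in batch:
        job.status[device_id] = "delivered" if publish(device_id) else "failed"
    job.cursor += len(batch)              # persisting this value makes the job resumable
    return job.cursor < len(job.device_ids)


def record_ack(job: BulkJob, device_id: str) -> None:
    """Mark a device as acknowledged once its ACK message arrives."""
    if job.status.get(device_id) == "delivered":
        job.status[device_id] = "acknowledged"


def progress(job: BulkJob) -> str:
    done = sum(1 for s in job.status.values() if s == "acknowledged")
    return f"{done:,} / {len(job.device_ids):,} devices updated"


job = BulkJob([f"dev-{i}" for i in range(2500)])
publish_ok = lambda device_id: device_id != "dev-7"   # simulate one failed publish
while process_next_batch(job, publish_ok):
    pass
record_ack(job, "dev-8")
print(job.cursor, progress(job))
```

Because the cursor only moves after a whole batch is handled, a restarted dispatcher can re-run the current batch; publishes therefore need to be safe to repeat.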

Database Design

PostgreSQL (per-tenant schema for isolation):

  • devices, device_models, device_groups
  • configurations (config_id, device_id or group_id, config_json, version, created_by, created_at)
  • certificates (cert_id, device_id, serial, issued_at, expires_at, revoked_at)
  • bulk_jobs (job_id, type [ota/config/command], target_query_json, total_count, completed_count, failed_count, status, created_at)
  • audit_log (log_id, tenant_id, device_id, action, actor_id, details_json, occurred_at), append-only

Redis Cluster:

  • device:{device_id}:status (online/offline, TTL 30s)
  • device:{device_id}:meta (firmware_version, config_version, TTL 5m)
  • job:{job_id}:progress (hash of device_id → status for active bulk jobs)

S3:

  • firmware binaries (tenant-scoped)
  • CRL files (public, per tenant CA)
  • audit log exports

API Design

  • POST /devices/register: body {serial_number, model_id, group_ids[], csr_pem}; validates the CSR, issues a certificate, creates the device record, and returns {device_id, certificate_pem, mqtt_endpoint}
  • GET /devices?group_id={g}&status={online/offline}&firmware_version={v} — filtered device list with pagination; served from PostgreSQL with Redis status overlay
  • POST /devices/{device_id}/config — body: {config_json}, creates versioned config, queues delivery to device, returns config_id
  • POST /jobs/ota — body: {target_group_id, firmware_version, rollout_pct, schedule}, creates OTA bulk job with staged rollout
  • GET /jobs/{job_id}/progress — returns bulk job progress from Redis (real-time) or PostgreSQL (historical)
  • POST /devices/{device_id}/decommission — revokes certificate, archives device record, sends disconnect command
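The `rollout_pct` field of the OTA job suggests a staged rollout. One common way to pick the devices for a stage, shown here as an assumption rather than the platform's actual method, is to hash each device into a stable bucket from 0 to 99 and include it when the bucket falls below the percentage.

```python
import hashlib


def in_rollout(device_id: str, job_id: str, rollout_pct: int) -> bool:
    """Deterministically decide whether a device belongs to the current stage.

    Hashing job_id together with device_id gives each job its own ordering,
    so the same devices are not always the first to receive new firmware.
    """
    digest = hashlib.sha256(f"{job_id}:{device_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in [0, 99]
    return bucket < rollout_pct


devices = [f"dev-{i}" for i in range(10_000)]
stage_10 = {d for d in devices if in_rollout(d, "job-42", 10)}
stage_50 = {d for d in devices if in_rollout(d, "job-42", 50)}
print(len(stage_10), len(stage_50), stage_10 <= stage_50)
```

Since a device's bucket never changes for a given job, raising `rollout_pct` from 10 to 50 only adds devices, and every device already updated stays in the rollout.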

Scaling & Bottlenecks

Configuration push fan-out to 1M devices in 5 minutes: the dispatcher must publish 3,333 MQTT messages/second to unique device topics. Assuming a single MQTT broker node sustains on the order of 100k publishes/second, broker publish throughput is not the limit, and the dispatcher can spread load across multiple broker nodes in parallel. Job state updates in Redis are not a bottleneck either: marking 3,333 devices/second as "delivered" is 3,333 Redis writes/second, which is trivial. The harder part is routing: the broker cluster must fan out to 5M connected devices' individual topic subscriptions, so use wildcard subscriptions sparingly, as they have O(N) matching cost.
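Holding the dispatcher to a steady 3,333 messages/second is a rate-limiting problem. A token bucket is one standard technique for it; the sketch below is an illustration with an injectable clock, not the dispatcher's documented implementation.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket rate limiter: tokens refill at rate_per_s up to capacity."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate_per_s
        self._capacity = capacity
        self._tokens = capacity           # start full, allowing one initial burst
        self._clock = clock
        self._last = clock()

    def try_acquire(self, n: int = 1) -> bool:
        """Take n tokens if available; otherwise leave the bucket unchanged."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False


# Usage with a fake clock: 3,333 publishes/second, bursts capped at one second's worth.
now = [0.0]
bucket = TokenBucket(rate_per_s=3333, capacity=3333, clock=lambda: now[0])
print(bucket.try_acquire(3333))   # the initial burst is allowed
print(bucket.try_acquire(1))      # bucket is empty, caller must wait
now[0] = 0.5
print(bucket.try_acquire(1666))   # half a second refills about 1,666 tokens
```

The capacity bounds the largest burst the brokers will ever see, which is the property that keeps broker load predictable.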

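Heartbeat processing at 167k events/second: each heartbeat refreshes the TTL on the device's online status key in Redis. A Redis Cluster can handle 1M+ SET/EXPIRE commands/second, so 167k/second is well within capacity. The challenge is offline detection: Redis does not publish key-expiry events by default, and its optional keyspace notifications are best-effort and fire only when the key is actually removed, which can lag the TTL. Instead, use a sorted set with expiry scores (device_id → expected_next_heartbeat_timestamp) and a scanner job that queries for expired entries every 10 seconds, detecting offline devices within 10 + TTL = 40 seconds in the worst case. That exceeds the 30-second target in the requirements, so meeting it requires a shorter scan interval and TTL, or relying on broker disconnect events for the common case and using the scanner only as a fallback.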
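The sorted-set scanner can be sketched in memory with a heap standing in for the Redis sorted set. In Redis the two operations would map to ZADD on each heartbeat and a ZRANGEBYSCORE query in the scanner; the class and method names below are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


class HeartbeatTracker:
    """Tracks each device's next expected heartbeat and reports the overdue ones."""

    def __init__(self, timeout_s: float):
        self._timeout = timeout_s
        self._deadline: Dict[str, float] = {}        # device_id -> latest deadline
        self._heap: List[Tuple[float, str]] = []     # may hold stale (older) entries

    def heartbeat(self, device_id: str, now: float) -> None:
        """Record a heartbeat and push the device's deadline forward."""
        deadline = now + self._timeout
        self._deadline[device_id] = deadline
        heapq.heappush(self._heap, (deadline, device_id))

    def scan_offline(self, now: float) -> List[str]:
        """Return devices whose deadline has passed; run this periodically."""
        offline = []
        while self._heap and self._heap[0][0] <= now:
            deadline, device_id = heapq.heappop(self._heap)
            # Skip entries superseded by a newer heartbeat from the same device.
            if self._deadline.get(device_id) == deadline:
                del self._deadline[device_id]
                offline.append(device_id)
        return offline


tracker = HeartbeatTracker(timeout_s=30)
tracker.heartbeat("dev-a", now=0)
tracker.heartbeat("dev-b", now=0)
tracker.heartbeat("dev-a", now=20)      # dev-a keeps reporting, dev-b goes silent
print(tracker.scan_offline(now=30))     # only dev-b is overdue
print(tracker.scan_offline(now=50))     # dev-a is overdue 30 s after its last heartbeat
```

Each scan touches only overdue entries, so its cost scales with the number of devices going offline rather than with the 5M that are connected.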

Key Trade-offs

  • Schema-per-tenant vs. row-level security: Schema per tenant provides the strongest isolation for regulated industries (healthcare IoT, industrial) but complicates cross-tenant operations (e.g., platform-wide firmware compatibility queries); RLS with a tenant_id column is simpler operationally but requires rigorous policy enforcement.
  • Certificate auto-renewal vs. manual: Auto-renewal eliminates the operational burden of tracking certificate expiry for 50M devices but requires devices to be online at renewal time; manual renewal with 30-day notice works for fleet devices with scheduled maintenance windows.
  • Bulk job fan-out rate: Pushing to all 1M devices simultaneously maximizes speed but creates an MQTT broker burst that can overload the cluster; throttled dispatch (3,333/second) completes in 5 minutes while keeping broker load predictable.
  • MQTT vs. LwM2M: MQTT is a general-purpose messaging protocol; Lightweight M2M (LwM2M) is an IoT-specific device management protocol with built-in primitives for firmware updates and device configuration. LwM2M reduces custom protocol development at the cost of maturity and ecosystem breadth compared to MQTT.
