System Design: Smart Home Platform
Design a scalable smart home platform that connects, controls, and automates diverse home devices — lights, thermostats, locks, cameras — through a unified API with local and cloud control, automation rules, and voice assistant integration.
Requirements
Functional Requirements:
- Register, authenticate, and control smart home devices: lights, thermostats, door locks, security cameras, and sensors
- Local control: commands execute via local hub even without internet connectivity
- Automation rules: trigger device actions based on time schedules, sensor readings, or device state changes
- Voice assistant integration: expose device controls via Google Home and Amazon Alexa
- Mobile app: real-time device state display with sub-1-second command response
- Camera live streaming and motion-triggered recording
Non-Functional Requirements:
- Local command latency under 50ms (hub-to-device, without cloud round-trip)
- Cloud command latency under 500ms for 99th percentile
- Support 10 million homes with up to 200 devices each (roughly 2 billion device endpoints)
- 99.99% availability for local control; cloud is best-effort
- Privacy: camera footage stored locally by default; cloud backup is opt-in
Scale Estimation
10M homes × 200 devices = 2B device endpoints. At any moment, ~5% of devices are actively reporting state (sensors, thermostats) = 100M active device connections. Device state updates: 100M devices × 1 update/minute average = 1.67M messages/second. Camera motion events: 10M homes × 0.1 motion events/minute = 1M clips/minute, or roughly 16,700 clip uploads/second. Automation rules evaluation: each state change event may trigger evaluation of all rules for that home — average 10 rules per home. At 1.67M state changes/second and 10 rules each = 16.7M rule evaluations/second — must be done in-process or at the edge hub, not in the cloud.
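The arithmetic above can be checked with a short back-of-envelope script (all inputs are the assumptions stated in the text):

```python
# Back-of-envelope scale estimation, using the assumptions from the text.
HOMES = 10_000_000
DEVICES_PER_HOME = 200          # per-home device count assumed for sizing
ACTIVE_FRACTION = 0.05          # share of devices actively reporting
RULES_PER_HOME = 10

endpoints = HOMES * DEVICES_PER_HOME                # 2,000,000,000
active = int(endpoints * ACTIVE_FRACTION)           # 100,000,000 connections
state_updates_per_sec = active / 60                 # 1 update/minute each
rule_evals_per_sec = state_updates_per_sec * RULES_PER_HOME
clip_uploads_per_sec = HOMES * 0.1 / 60             # 0.1 motion events/min/home

print(f"endpoints:        {endpoints:,}")
print(f"state updates/s:  {state_updates_per_sec:,.0f}")
print(f"rule evals/s:     {rule_evals_per_sec:,.0f}")
print(f"clip uploads/s:   {clip_uploads_per_sec:,.0f}")
```

The rule-evaluation rate is the number that rules out a centralized design: tens of millions of evaluations per second are cheap when sharded across 10M hubs (a few per second each) but prohibitive as a single cloud workload.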
High-Level Architecture
The platform uses a hub-and-cloud architecture. Each home has a local hub (a dedicated device or the homeowner's router/NAS running the hub software) that connects to all home devices via local protocols (Zigbee, Z-Wave, Matter, Wi-Fi). The hub runs a local MQTT broker and an automation rule engine. Device state changes and commands are processed locally first — cloud is informed but not in the critical path for local operations.
The cloud platform provides: remote access (when the user is away from home), cloud-based automation rules (requiring cloud data like weather APIs or calendar), voice assistant integration, firmware updates, and multi-home management. The hub maintains a persistent WebSocket connection to the cloud (using a keep-alive proxy that survives NAT and firewall changes). All hub-to-cloud communication uses this WebSocket tunnel — the hub pushes device state changes to the cloud and receives remote commands from the cloud over the same connection.
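A persistent tunnel is only as reliable as its reconnect behavior: after a cloud outage, 10M hubs reconnecting at once is a thundering herd. A common mitigation, sketched below, is exponential backoff with full jitter; the base and cap values are illustrative assumptions, not from the text:

```python
import random

def reconnect_delays(attempts: int, base: float = 1.0, cap: float = 60.0):
    """Exponential backoff with full jitter for hub-to-cloud reconnects.

    Each failed attempt doubles the backoff window (capped), and the actual
    sleep is drawn uniformly from [0, window] so that hubs that lost their
    connection at the same moment do not all retry at the same moment.
    """
    delays = []
    for attempt in range(attempts):
        window = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, window))
    return delays
```

In the real client this schedule would wrap the WebSocket dial loop; the hub keeps serving local control throughout, since the cloud is not in the local critical path.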
Device state is synchronized between the hub's local state store (SQLite) and the cloud's device shadow service (DynamoDB + Redis). The device shadow pattern: the cloud maintains a "reported" state (last known state from the hub) and a "desired" state (what the user wants). When a command is sent from the mobile app, it updates the desired state; the cloud pushes the desired state delta to the hub, which applies it and reports back. This decouples command submission from command execution — the app doesn't wait for device confirmation.
Core Components
Local Hub
The hub runs: a Zigbee/Z-Wave/Matter gateway (protocol-specific radio bridges), a local MQTT broker (Mosquitto), a device driver registry (maps device models to protocol handlers), an automation rule engine (evaluates rules in real time against local state changes), and a WebSocket client (cloud sync). The hub's local SQLite database stores device registry, automation rules, and state history. The hub is the single point of authority for local device state — the cloud is a mirror, not the source of truth for real-time control. Automation rules with latency-sensitive triggers ("if motion detected, turn on light within 50ms") run entirely on the hub without any cloud round-trip.
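The hub's rule engine boils down to: on each local state-change event, find rules whose trigger matches, check their conditions against the local state store, and emit actions. A minimal in-process sketch, using an illustrative rule shape (the actual trigger/condition/action JSON schema is not specified in the text):

```python
def matches(trigger, event):
    """Trigger fires when the named attribute of the named device changed."""
    return (trigger["device_id"] == event["device_id"]
            and trigger["attribute"] in event["changes"])

def evaluate(rules, event, local_state):
    """Return the actions to fire for one local state-change event.

    Conditions are checked against the hub's local state store, so no
    cloud round-trip is in the path of latency-sensitive automations.
    """
    actions = []
    for rule in rules:
        if not rule["enabled"] or not matches(rule["trigger"], event):
            continue
        conditions_hold = all(
            local_state.get(c["device_id"], {}).get(c["attribute"]) == c["equals"]
            for c in rule["conditions"]
        )
        if conditions_hold:
            actions.extend(rule["actions"])
    return actions
```

A "motion → light" rule with a "room is dark" condition is one dictionary in `rules`; at ~10 rules per home, a linear scan per event is more than fast enough on hub hardware.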
Device Shadow Service
The cloud device shadow service maintains a digital twin for each device. Each shadow document: {device_id, reported: {brightness: 50, color_temp: 3000}, desired: {brightness: 80}, delta: {brightness: 80}, last_updated: timestamp}. The delta field (desired - reported) is what the hub needs to apply to bring the device to the desired state. When the mobile app sends a command (SET brightness=80), the service updates the desired state, computes the delta, and pushes the delta to the hub over the WebSocket tunnel. The hub applies the command and publishes the new reported state. The shadow service uses DynamoDB for persistent storage (single-digit millisecond reads) and Redis for the active delta cache (used for high-frequency polling from the mobile app).
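The core of the shadow service is the delta computation (desired minus reported) and the merge when the hub reports back. A minimal sketch of that logic, using the shadow document shape from the text:

```python
def compute_delta(reported: dict, desired: dict) -> dict:
    """Delta = the desired attributes whose values differ from reported.

    This is exactly what the cloud pushes to the hub over the tunnel."""
    return {k: v for k, v in desired.items() if reported.get(k) != v}

def apply_report(shadow: dict, new_reported: dict) -> dict:
    """Hub reported fresh state: merge it and recompute the delta.

    When the delta becomes empty, the device has converged to the
    desired state and there is nothing left to push."""
    shadow["reported"].update(new_reported)
    shadow["delta"] = compute_delta(shadow["reported"], shadow["desired"])
    return shadow
```

With the document's example (reported brightness 50, desired 80), the delta is `{brightness: 80}`; once the hub reports brightness 80, the delta empties and the command is complete.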
Voice Assistant Integration
Google Home and Amazon Alexa integrate via the Smart Home API (OAuth 2.0 + device trait fulfillment). The platform exposes a fulfillment webhook endpoint that receives trait queries (QUERY intent — "what's the brightness of the living room light?") and commands (EXECUTE intent — "set living room light to 50%"). The webhook handler maps the assistant's device/trait model to the platform's device shadow service — reading reported state for queries and updating desired state for commands. OAuth token management: users authenticate via the platform's OAuth server, generating tokens scoped to their home devices. Token refresh is handled by the assistant platform (Google/Amazon) automatically.
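The fulfillment webhook is a thin adapter over the shadow service: QUERY reads reported state, EXECUTE writes desired state. A hedged sketch of the dispatch (the QUERY/EXECUTE intent names follow the smart-home intent model described above; the in-memory `shadows` dict stands in for the real shadow service calls, which are not specified):

```python
def handle_fulfillment(intent: str, payload: dict, shadows: dict):
    """Map assistant intents onto the device shadow service."""
    if intent == "QUERY":
        # Queries read last *reported* state -- never desired state,
        # which may not have been applied by the hub yet.
        return {d: shadows[d]["reported"] for d in payload["devices"]}
    if intent == "EXECUTE":
        # Commands only update *desired* state; the hub applies the
        # delta asynchronously, so the assistant gets a pending ack.
        for d in payload["devices"]:
            shadows[d]["desired"].update(payload["params"])
        return {"status": "PENDING"}
    raise ValueError(f"unsupported intent: {intent}")
```

Returning a pending acknowledgment mirrors the shadow pattern's decoupling: the voice assistant, like the mobile app, does not wait for device confirmation.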
Database Design
- PostgreSQL: homes (home_id, owner_id, hub_serial, subscription_tier, location_lat, location_lon), devices (device_id, home_id, model, protocol, room, display_name, capabilities_json, last_seen_at), users (user_id, email, homes[], notification_prefs), automation_rules (rule_id, home_id, trigger_json, condition_json, action_json, enabled), firmware_releases (firmware_id, model, version, s3_url, release_notes, rollout_pct)
- DynamoDB: device_shadows (device_id → shadow document)
- Redis Cluster: shadow:delta:{device_id} (current delta, TTL 1h), hub:{hub_id}:connection (WebSocket server node assignment, TTL 65 seconds)
- TimescaleDB: device_state_history (device_id, state_json, recorded_at) — last 30 days of state history for analytics and trend display
- S3: camera recordings (motion-triggered clips, organized by home_id/device_id/date, AES-256 encrypted)
API Design
- GET /homes/{home_id}/devices — returns all devices with current reported state from the device shadow; cached in Redis
- POST /devices/{device_id}/commands — body: {trait, value} (e.g., {brightness: 80}); updates desired state in the shadow and pushes the delta to the hub; returns {command_id, status: "pending"}
- GET /devices/{device_id}/history?from={ts}&to={ts} — returns state history from TimescaleDB; useful for energy consumption charts
- WebSocket /hubs/{hub_id}/tunnel — authenticated hub connection; server pushes desired state deltas, hub pushes reported state updates and telemetry
- POST /automations — body: {trigger, conditions, actions}; creates an automation rule stored in PostgreSQL and synced to the hub on its next connection
Scaling & Bottlenecks
The WebSocket hub connections (10M hubs maintaining persistent connections) require a horizontally scaled WebSocket gateway (10M connections × 1 KB connection state = 10 GB of connection state). A 100-node WebSocket gateway cluster (each handling 100k connections) works well. Hub connection affinity (each hub routes to the same gateway node via consistent hashing on hub_id) enables in-process routing without inter-node message passing for the common case. Redis stores the hub-to-node assignment as a fallback for cross-node messages (e.g., when a command comes from the mobile app via REST while the hub is connected to a different gateway node).
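Consistent hashing on hub_id keeps each hub pinned to the same gateway node even as nodes join or leave, so only the hubs on a failed node remap. A minimal hash-ring sketch (node names and the virtual-node count are illustrative):

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Consistent-hash ring mapping hub_id -> gateway node.

    Each node is placed at many virtual points on the ring so that load
    spreads evenly and a node's removal remaps only its own hubs."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (self._h(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [k for k, _ in self._ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, hub_id: str) -> str:
        if not self._ring:
            raise ValueError("empty ring")
        # First virtual point clockwise from the hub's hash position.
        i = bisect_right(self._keys, self._h(hub_id)) % len(self._ring)
        return self._ring[i][1]
```

Both the connecting hub and the REST tier can compute `node_for(hub_id)` independently, which is what makes in-process routing the common case; the Redis assignment entry covers the window where the ring membership has just changed.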
Device shadow DynamoDB throughput: 2B devices × 1 read/5 minutes average = 6.7M reads/second. This requires DynamoDB with aggressive caching. Only active devices (those with pending deltas or recently updated) are in the Redis shadow delta cache — roughly 1% of 2B devices = 20M active device shadows in Redis at any time. Redis memory: 20M × 500 bytes = 10 GB — manageable.
Key Trade-offs
- Local-first vs. cloud-first architecture: Local-first execution provides sub-50ms control latency and works offline, but adds hub hardware cost and complexity; cloud-first simplifies the architecture but makes the home uncontrollable during internet outages — unacceptable for safety-critical uses (door locks, smoke alarms).
- Matter vs. proprietary protocols: Matter is an open standard enabling cross-vendor device compatibility, but current implementations have higher latency than proprietary protocols (Zigbee, Z-Wave) for time-sensitive automation; offering Matter alongside proprietary support is the pragmatic transition strategy.
- Camera storage cloud vs. local: Local-only storage protects privacy but makes footage inaccessible when the hub is stolen (the most common scenario where footage is needed); opt-in cloud backup with end-to-end encryption addresses both concerns.
- Hub per home vs. hub-less architecture: A dedicated hub enables local control and protocol bridging for non-Wi-Fi devices; a hub-less architecture (all devices connect directly to Wi-Fi/cloud) simplifies setup but requires internet for all control and limits protocol diversity.