System Design: Chaos Engineering Platform
Design a chaos engineering platform like Chaos Monkey or LitmusChaos that safely injects failures into production and staging systems to validate resilience, discover weaknesses, and build confidence in distributed system reliability.
Requirements
Functional Requirements:
- Inject various failure types: pod/VM termination, network latency, packet loss, CPU/memory pressure, disk failures, clock skew
- Target specific blast radius: single pod, percentage of pods, entire service, or specific availability zone
- Provide experiment scheduling: run chaos on a schedule or manually trigger
- Implement automatic abort: if a defined SLO is breached (error rate > 1%, latency > 500ms), stop the experiment immediately
- Maintain a hypothesis-driven workflow: define expected system behavior before running the experiment
- Provide detailed audit trail of all experiments (what failed, when, what impact, who approved)
Non-Functional Requirements:
- Chaos injection latency under 5 seconds from experiment trigger to fault active
- Automatic rollback (fault removal) within 30 seconds of abort trigger
- Role-based access control: only authorized engineers can trigger production chaos
- Zero unsafe experiment executions: safeguards prevent chaos from running during active incidents or peak traffic
Scale Estimation
A large deployment: 10,000 pods across 50 services, 5,000 nodes. The chaos platform targets up to 10% of pods per experiment = 1,000 pods simultaneously affected. Steady-state safety checks: poll system health metrics every 5 seconds = 12 checks/minute. With 50 services × 5 SLO metrics each = 250 metrics polled per safety check cycle. Experiment execution rate: 10 experiments/day in production, 100/day in staging. Audit log: 100 experiments/day × 100 events/experiment = 10,000 log entries/day — trivially small.
High-Level Architecture
The chaos platform has three layers: the experiment controller (defines and orchestrates experiments), the fault injector (applies faults to targets), and the safety guardian (monitors for SLO breaches and triggers aborts).
The fault injector runs as a privileged DaemonSet on each Kubernetes node (for pod-level faults) or as an agent on VMs. It receives fault injection commands from the experiment controller via a gRPC API and applies them using OS-level mechanisms: tc netem (Linux Traffic Control, a kernel subsystem that adds configurable delay, packet loss, and corruption at the network interface level) for network faults, stress-ng for CPU/memory pressure, kill signals for process termination, and iptables for network partition simulation.
The safety guardian runs as an independent process that monitors Prometheus metrics and SLO dashboards. It evaluates a list of guardrails: if error_rate > threshold OR latency_p99 > threshold, it sends an ABORT signal to the experiment controller. The guardian is architecturally independent from the experiment controller: even if the controller hangs, the guardian can independently command the fault injectors to roll back all active faults.
Core Components
Experiment Controller
The controller manages the lifecycle of chaos experiments. Experiment definition: {name, hypothesis, faults: [{type: network-latency, target: service=payments, latency: 200ms, percentage: 50}], duration: 300s, abort_conditions: [{metric: error_rate, threshold: 0.01}], approvers: ["team-lead"]}. Execution: (1) check pre-conditions (no active incidents, traffic is normal, required approvals obtained); (2) take a system state snapshot (current error rates, latency baselines); (3) inject faults via fault injectors; (4) monitor metrics vs. baseline; (5) on experiment completion or abort, remove faults and record results.
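As a concrete illustration, here is a minimal Go sketch of the experiment definition above (type and field names are assumptions; only the JSON keys come from the inline definition):

```go
package chaos

// Fault describes one fault to inject, matching the inline definition above,
// e.g. 200ms latency on 50% of the payments service's pods.
type Fault struct {
	Type       string `json:"type"`       // "network-latency", "pod-kill", "cpu-pressure", ...
	Target     string `json:"target"`     // label selector, e.g. "service=payments"
	LatencyMs  int    `json:"latency_ms"` // used only by network-latency faults
	Percentage int    `json:"percentage"` // fraction of matching pods to affect
}

// AbortCondition tells the safety guardian when to stop the run.
type AbortCondition struct {
	Metric    string  `json:"metric"`    // e.g. "error_rate"
	Threshold float64 `json:"threshold"` // e.g. 0.01 for a 1% error rate
}

// Experiment is the declarative definition the controller persists and executes.
type Experiment struct {
	Name            string           `json:"name"`
	Hypothesis      string           `json:"hypothesis"`
	Faults          []Fault          `json:"faults"`
	DurationSecs    int              `json:"duration_secs"`
	AbortConditions []AbortCondition `json:"abort_conditions"`
	Approvers       []string         `json:"approvers"`
}
```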
Fault Injector Agent
The agent runs as a privileged process on each node. For pod termination: it sends SIGKILL to the target container's main process (simulating the OOM killer or a hardware failure). For network latency: tc qdisc add dev eth0 root netem delay 200ms 50ms distribution normal, which adds 200ms ±50ms normally-distributed delay to outgoing packets in the target pod's network namespace. For CPU pressure: it runs stress-ng --cpu N in the target container's cgroup. Node-level faults (disk failure, kernel panic) require VM-level access via cloud provider APIs (e.g., stopping the instance on AWS or GCP). All faults are reversible: the agent stores an undo command for each applied fault and executes it on rollback.
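A sketch of the reversibility mechanism, assuming the agent shells out to tc and keeps an in-memory undo table (the Injector and FaultHandle types are illustrative, not part of the original design):

```go
package agent

import (
	"fmt"
	"os/exec"
	"sync"
)

// FaultHandle pairs an applied fault with the command that reverses it.
type FaultHandle struct {
	ID   string
	Undo []string // e.g. {"tc", "qdisc", "del", "dev", "eth0", "root", "netem"}
}

// Injector tracks every fault currently applied on this node.
type Injector struct {
	mu     sync.Mutex
	active map[string]FaultHandle
}

func NewInjector() *Injector {
	return &Injector{active: make(map[string]FaultHandle)}
}

// InjectLatency applies a netem delay and records its undo command, so
// rollback never depends on re-deriving state from the kernel.
func (in *Injector) InjectLatency(id, dev string, delayMs, jitterMs int) error {
	apply := exec.Command("tc", "qdisc", "add", "dev", dev, "root", "netem",
		"delay", fmt.Sprintf("%dms", delayMs), fmt.Sprintf("%dms", jitterMs),
		"distribution", "normal")
	if err := apply.Run(); err != nil {
		return err
	}
	in.mu.Lock()
	defer in.mu.Unlock()
	in.active[id] = FaultHandle{
		ID:   id,
		Undo: []string{"tc", "qdisc", "del", "dev", dev, "root", "netem"},
	}
	return nil
}

// RollbackAll reverses every active fault; called on experiment end or ABORT.
func (in *Injector) RollbackAll() {
	in.mu.Lock()
	defer in.mu.Unlock()
	for id, h := range in.active {
		_ = exec.Command(h.Undo[0], h.Undo[1:]...).Run() // best-effort removal
		delete(in.active, id)
	}
}
```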
Safety Guardian
The guardian polls Prometheus every 5 seconds for SLO indicators: error rate (rate(http_requests_total{status=~"5.."}[1m]) / rate(http_requests_total[1m])), p99 latency (histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))), and service availability (up metric). If any metric breaches its abort threshold, the guardian publishes an ABORT event to a dedicated abort channel (Redis pub/sub). The experiment controller and all fault injectors subscribe to this channel — on receiving ABORT, they immediately begin fault rollback. The guardian's abort signal is idempotent and fire-and-forget — it doesn't need to confirm the controller received the signal (the rollback timeout mechanism handles non-responsive controllers).
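A sketch of the guardian's poll-and-abort loop, assuming the Prometheus HTTP query API and the go-redis client; the channel name, endpoint wiring, and 1% threshold are illustrative assumptions:

```go
package guardian

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
	"time"

	"github.com/redis/go-redis/v9"
)

const abortChannel = "chaos:abort" // controller and fault injectors subscribe here

// queryScalar runs an instant PromQL query and returns the first sample's value.
func queryScalar(promURL, query string) (float64, error) {
	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var body struct {
		Data struct {
			Result []struct {
				Value [2]any `json:"value"` // [unix timestamp, "value as string"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return 0, err
	}
	if len(body.Data.Result) == 0 {
		return 0, fmt.Errorf("query returned no samples")
	}
	s, _ := body.Data.Result[0].Value[1].(string)
	return strconv.ParseFloat(s, 64)
}

// Watch polls every 5 seconds and publishes a fire-and-forget ABORT on breach.
func Watch(ctx context.Context, rdb *redis.Client, promURL string) {
	errRate := `sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))`
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if v, err := queryScalar(promURL, errRate); err == nil && v > 0.01 {
				// Idempotent: duplicate ABORTs are harmless, and no ack is awaited.
				rdb.Publish(ctx, abortChannel, "error_rate breach")
			}
		}
	}
}
```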
Database Design
PostgreSQL stores experiment definitions, runs, and results: experiments (id, name, hypothesis, fault_config JSONB, duration_secs, abort_conditions JSONB, status), experiment_runs (id, experiment_id, started_at, ended_at, trigger: manual/scheduled, triggered_by, outcome: completed/aborted/error, baseline_metrics JSONB, result_metrics JSONB), fault_events (run_id, fault_type, target, applied_at, rolled_back_at, status), approvals (run_id, approver, approved_at). The result_metrics JSONB captures the post-experiment state for comparison with baseline.
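A minimal DDL sketch for these tables, embedded as a Go migration string (column types and constraints are assumptions; the design above specifies only the column names):

```go
package store

// Schema sketches the tables described above; types are assumed, not prescribed.
const Schema = `
CREATE TABLE experiments (
    id               BIGSERIAL PRIMARY KEY,
    name             TEXT NOT NULL,
    hypothesis       TEXT NOT NULL,
    fault_config     JSONB NOT NULL,
    duration_secs    INT NOT NULL,
    abort_conditions JSONB NOT NULL,
    status           TEXT NOT NULL
);

CREATE TABLE experiment_runs (
    id               BIGSERIAL PRIMARY KEY,
    experiment_id    BIGINT NOT NULL REFERENCES experiments(id),
    started_at       TIMESTAMPTZ NOT NULL,
    ended_at         TIMESTAMPTZ,
    trigger          TEXT NOT NULL, -- manual | scheduled
    triggered_by     TEXT NOT NULL,
    outcome          TEXT,          -- completed | aborted | error
    baseline_metrics JSONB,
    result_metrics   JSONB
);

CREATE TABLE fault_events (
    run_id         BIGINT NOT NULL REFERENCES experiment_runs(id),
    fault_type     TEXT NOT NULL,
    target         TEXT NOT NULL,
    applied_at     TIMESTAMPTZ NOT NULL,
    rolled_back_at TIMESTAMPTZ,
    status         TEXT NOT NULL
);

CREATE TABLE approvals (
    run_id      BIGINT NOT NULL REFERENCES experiment_runs(id),
    approver    TEXT NOT NULL,
    approved_at TIMESTAMPTZ NOT NULL
);
`
```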
Experiment audit logs are also written to an immutable audit store (S3 with object lock or a dedicated audit log service) — these serve compliance requirements (evidence that chaos was authorized, controlled, and rolled back safely).
API Design
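The controller exposes a small REST API for defining experiments, triggering and aborting runs, and recording approvals; production-triggering endpoints sit behind the RBAC checks described in the requirements. One plausible surface, sketched with Go 1.22 net/http routing (all paths and handler names are assumptions):

```go
package api

import "net/http"

// Routes wires a plausible controller API (Go 1.22+ method/pattern routing).
// Every path and handler name here is illustrative, not a fixed contract.
func Routes(mux *http.ServeMux) {
	mux.HandleFunc("POST /experiments", createExperiment)     // define experiment + hypothesis
	mux.HandleFunc("GET /experiments/{id}", getExperiment)    // fetch definition and status
	mux.HandleFunc("POST /experiments/{id}/runs", triggerRun) // manual trigger (RBAC-gated)
	mux.HandleFunc("POST /runs/{id}/abort", abortRun)         // operator-initiated abort
	mux.HandleFunc("GET /runs/{id}", getRun)                  // baseline vs. result metrics
	mux.HandleFunc("POST /runs/{id}/approvals", approveRun)   // record approver sign-off
}

// Handler bodies elided; each would enforce RBAC and write audit events.
func createExperiment(w http.ResponseWriter, r *http.Request) {}
func getExperiment(w http.ResponseWriter, r *http.Request)    {}
func triggerRun(w http.ResponseWriter, r *http.Request)       {}
func abortRun(w http.ResponseWriter, r *http.Request)         {}
func getRun(w http.ResponseWriter, r *http.Request)           {}
func approveRun(w http.ResponseWriter, r *http.Request)       {}
```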
Scaling & Bottlenecks
The fault injector layer must target many pods at once. Rather than issuing one RPC per pod, the controller fans out in parallel (goroutines/async I/O) with one batched gRPC command per node agent, each command listing that node's target pods, as sketched below. At the estimated scale of 10,000 pods on 5,000 nodes (roughly 2 pods per node), a 1,000-pod experiment spans about 500 nodes, so about 500 agents each receive one small command, well within a gRPC server's request handling capacity.
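A sketch of the per-node fan-out, assuming one gRPC call per agent (the InjectCmd type and sendFn signature are placeholders for the real gRPC client):

```go
package controller

import (
	"context"
	"sync"
)

// InjectCmd is the batched per-node command: all target pods on one node.
type InjectCmd struct {
	NodeAgent string   // agent address, e.g. "10.0.3.17:9090"
	PodUIDs   []string // pods on this node selected for the fault
}

// FanOut sends one command per node agent in parallel; sendFn wraps the gRPC
// call (its exact signature here is an assumption).
func FanOut(ctx context.Context, cmds []InjectCmd,
	sendFn func(context.Context, InjectCmd) error) []error {
	errs := make([]error, len(cmds))
	var wg sync.WaitGroup
	for i, cmd := range cmds {
		wg.Add(1)
		go func(i int, cmd InjectCmd) { // one goroutine per node agent
			defer wg.Done()
			errs[i] = sendFn(ctx, cmd)
		}(i, cmd)
	}
	wg.Wait()
	return errs
}
```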
The safety guardian must react within 5-10 seconds of an SLO breach. Prometheus scrape intervals (15 seconds by default) can delay detection. Mitigation: use Prometheus alerting rules with a for: duration of 0s for chaos-specific abort conditions (they fire immediately rather than requiring a sustained breach) and reduce the scrape interval to 5 seconds for SLO metrics during active experiments.
Key Trade-offs
- Production vs. staging-only chaos: Running in production validates real resilience but risks customer impact; staging-only is safer but may not reflect production's actual failure modes (traffic patterns, data volumes, third-party dependencies differ)
- Scheduled vs. random chaos: Scheduled experiments (Netflix's original Chaos Monkey ran during business hours) ensure engineers are available to respond; random chaos (GameDay unpredictability) tests whether on-call processes work but can catch teams off-guard
- Fine-grained vs. coarse-grained faults: Fine-grained faults (latency on specific service-to-service call) test targeted hypotheses; coarse-grained faults (terminate an AZ) test broader resilience but are harder to diagnose
- Opt-in vs. opt-out for services: Opt-in (services must register to participate in chaos) is safer but means new services are never tested; opt-out (all services are eligible unless exempted) provides better coverage but requires strong safety guardrails