System Design: Circuit Breaker Pattern
Design a circuit breaker system that prevents cascading failures in distributed systems by tracking downstream error rates and automatically opening the circuit to fail fast during outages.
Requirements
Functional Requirements:
- Wrap outbound calls to downstream services and track success/failure rates
- Transition circuit state: CLOSED (normal) → OPEN (failing fast) → HALF-OPEN (testing recovery)
- In OPEN state, fail immediately without attempting the downstream call
- In HALF-OPEN state, allow a limited number of probe requests to test recovery
- Configurable thresholds: error rate percentage, minimum request volume, sleep window, probe count
- Provide fallback logic: return cached data or a default response when circuit is open
Non-Functional Requirements:
- Circuit state evaluation adds under 0.5ms overhead per request
- State transitions propagate across all instances within 5 seconds
- Handle 1,000,000 requests/sec across the circuit breaker layer
- Support per-service, per-endpoint, and per-consumer-group circuit breaker instances
Scale Estimation
A microservices deployment: 1,000 services with 100 instances each = 100,000 processes. At an average of 10 downstream calls/sec per instance, that is 1M downstream call attempts/sec across the fleet. Each circuit breaker instance tracks a rolling window of the last 10 seconds of calls — even a hot instance pushing 10,000 calls/sec through a single circuit holds only 100,000 data points. Storing each as a bit (success/failure) = 12.5 KB per circuit per instance. With 50 circuits per instance, at most ~625 KB per process — trivial.
High-Level Architecture
The circuit breaker is an in-process library (not a network proxy), minimizing latency. It wraps each downstream client with a state machine. The state machine uses a sliding window (count-based or time-based) to track the recent success/failure ratio. In CLOSED state, all requests are forwarded normally. When the failure rate exceeds the threshold (e.g., 50% of the last 100 requests failed), the circuit transitions to OPEN. In OPEN state, requests fail immediately with a CircuitOpenException — no network call is made. After a configurable sleep window (e.g., 5 seconds), the circuit enters HALF-OPEN and allows N probe requests through. If the probes succeed, the circuit closes; if any probe fails, it re-opens.
For distributed state sharing (so all instances of a service respond consistently), circuit state can be stored in Redis — instances write their local state and read the aggregate across all instances. This enables a globally coordinated OPEN state where all instances stop sending to a failing downstream simultaneously. However, local-only circuit breakers (no shared state) are simpler and still provide the primary benefit (preventing a single instance from hammering a failing downstream).
Core Components
Sliding Window
Two window types: count-based (last N requests) and time-based (last N seconds). Count-based: implemented as a circular buffer of N boolean values (success/failure). On each call result, overwrite the oldest entry and recompute the failure rate. Time-based: implemented as a bucket array where each bucket represents 1 second; calls are tallied in the current bucket; on each request, evict buckets older than the window size and sum the remaining buckets. Time-based windows handle variable traffic rates better — during a traffic dip, the last N calls in a count-based window may stretch far into the past, so the computed failure rate reflects stale conditions rather than the downstream's current health.
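The count-based variant can be sketched as a small circular buffer that keeps a running failure count, so recording a result and reading the failure rate are both O(1). This is a minimal illustration — type and method names are not from any particular library:

```go
package main

import "fmt"

// CountWindow is a count-based sliding window: a circular buffer of the
// last N call outcomes (true = success, false = failure).
type CountWindow struct {
	outcomes []bool
	pos      int // next slot to overwrite
	filled   int // entries recorded so far, up to len(outcomes)
	failures int // running failure count, avoids rescanning the buffer
}

func NewCountWindow(n int) *CountWindow {
	return &CountWindow{outcomes: make([]bool, n)}
}

// Record overwrites the oldest entry and updates the failure count in O(1).
func (w *CountWindow) Record(success bool) {
	if w.filled == len(w.outcomes) {
		// Buffer is full: evict the oldest outcome before overwriting it.
		if !w.outcomes[w.pos] {
			w.failures--
		}
	} else {
		w.filled++
	}
	if !success {
		w.failures++
	}
	w.outcomes[w.pos] = success
	w.pos = (w.pos + 1) % len(w.outcomes)
}

// FailureRate returns the fraction of failures among recorded calls.
func (w *CountWindow) FailureRate() float64 {
	if w.filled == 0 {
		return 0
	}
	return float64(w.failures) / float64(w.filled)
}

func main() {
	w := NewCountWindow(4)
	w.Record(true)
	w.Record(false)
	w.Record(false)
	w.Record(true)
	fmt.Println(w.FailureRate()) // 0.5
	w.Record(false)              // evicts the oldest entry (a success)
	fmt.Println(w.FailureRate()) // 0.75
}
```

A time-based window replaces the boolean buffer with per-second success/failure counters and the same running totals.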
State Machine
The state machine transitions are: CLOSED → OPEN when (failure_rate >= threshold AND total_calls >= minimum_volume); OPEN → HALF-OPEN after sleep_window_ms elapses; HALF-OPEN → CLOSED if probe_success_count >= required_probes; HALF-OPEN → OPEN if any probe fails. Transitions are atomic (CAS operations or synchronized blocks) to prevent race conditions where multiple goroutines simultaneously trigger a state change. The HALF-OPEN state uses a semaphore to allow exactly N concurrent probes — additional requests during HALF-OPEN fail fast (don't want to allow unlimited probes that could overwhelm a recovering service).
Fallback Mechanism
Fallbacks execute when the circuit is open or when a request fails. Fallback types: (1) return a cached previous response (stale cache — serve the last known good value); (2) return a default response (empty list, zero count, degraded feature disabled message); (3) call an alternative service (secondary/backup endpoint); (4) throw a defined exception (for callers that must handle the absence explicitly). Fallbacks are registered alongside the circuit breaker: circuitBreaker.execute(primaryCall, fallback). Fallbacks must complete quickly — a slow fallback in the critical path defeats the purpose of circuit breaking.
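The execute-with-fallback wiring might look like the following Go sketch. The `circuitOpen` predicate is a hypothetical stand-in for the breaker's state check, so the example stays self-contained:

```go
package main

import (
	"errors"
	"fmt"
)

var ErrCircuitOpen = errors.New("circuit open")

// Execute runs primary unless the circuit is open, routing both an open
// circuit and a per-request failure into the registered fallback.
func Execute[T any](circuitOpen func() bool, primary func() (T, error), fallback func(error) (T, error)) (T, error) {
	if circuitOpen() {
		return fallback(ErrCircuitOpen) // fail fast: no downstream call made
	}
	v, err := primary()
	if err != nil {
		return fallback(err) // individual failure also falls back
	}
	return v, nil
}

func main() {
	// Fallback type (1): serve a stale cached value while the circuit is open.
	cached := []string{"item-1", "item-2"}
	items, _ := Execute(
		func() bool { return true }, // circuit currently open
		func() ([]string, error) { return nil, errors.New("unreachable") },
		func(err error) ([]string, error) { return cached, nil },
	)
	fmt.Println(items) // [item-1 item-2]
}
```

The fallback itself runs on the request's critical path, which is why the text insists it completes quickly.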
Database Design
Circuit breaker state is primarily in-process (no database). For distributed coordination, Redis stores: cb:{service}:{instance_id} → {state, failure_count, success_count, last_state_change, next_probe_time}. Each instance writes its local state every 5 seconds. A leader or each instance independently reads the aggregate: if >50% of instances report OPEN, treat the service as globally failing. Redis key TTL = 30 seconds — stale data from crashed instances automatically expires.
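The aggregation rule can be sketched as a pure function over the per-instance records. Fetching them (e.g. reading the `cb:{service}:{instance_id}` keys from Redis) is omitted here, and the struct fields are illustrative:

```go
package main

import "fmt"

// InstanceState mirrors the value an instance writes to Redis; only the
// state field matters for aggregation.
type InstanceState struct {
	State string // "CLOSED", "OPEN", or "HALF_OPEN"
}

// GloballyFailing applies the >50% rule: treat the downstream as globally
// failing when a majority of reporting instances have opened their circuit.
// Crashed instances drop out automatically via the 30s key TTL.
func GloballyFailing(states []InstanceState) bool {
	if len(states) == 0 {
		return false
	}
	open := 0
	for _, s := range states {
		if s.State == "OPEN" {
			open++
		}
	}
	return open*2 > len(states)
}

func main() {
	states := []InstanceState{{"OPEN"}, {"OPEN"}, {"CLOSED"}}
	fmt.Println(GloballyFailing(states)) // true
}
```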
For observability, circuit state transitions are emitted as metrics (Prometheus counters/gauges) and events (structured log entries with reason, from_state, to_state). A dashboard shows circuit health across all service-to-service calls — essential for quickly diagnosing cascading failure events in production.
API Design
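Since the circuit breaker is an in-process library, its "API" is the surface callers program against: a config struct carrying the thresholds from the requirements, and a registry that hands out one circuit per key (service, endpoint, or consumer group). The sketch below is illustrative — the names are not from any specific library, and the breaker's state machine is stubbed out:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrCircuitOpen = errors.New("circuit open")

// Config names the tunables from the functional requirements.
type Config struct {
	FailureRateThreshold float64       // e.g. 0.5 = open at 50% failures
	MinimumVolume        int           // skip evaluation below this call count
	SleepWindow          time.Duration // OPEN -> HALF-OPEN delay
	ProbeCount           int           // probes admitted while HALF-OPEN
}

// Registry hands out one breaker per key, supporting per-service,
// per-endpoint, and per-consumer-group circuits.
type Registry struct {
	cfg      Config
	breakers map[string]*breaker
}

type breaker struct{ open bool } // placeholder; real state machine elided

func NewRegistry(cfg Config) *Registry {
	return &Registry{cfg: cfg, breakers: map[string]*breaker{}}
}

// Execute runs fn through the named circuit, failing fast when it is open.
func (r *Registry) Execute(key string, fn func() error) error {
	b, ok := r.breakers[key]
	if !ok {
		b = &breaker{}
		r.breakers[key] = b
	}
	if b.open {
		return ErrCircuitOpen
	}
	return fn() // a full implementation would record the outcome here
}

func main() {
	r := NewRegistry(Config{FailureRateThreshold: 0.5, MinimumVolume: 20,
		SleepWindow: 5 * time.Second, ProbeCount: 3})
	err := r.Execute("payments:/charge", func() error {
		fmt.Println("calling downstream")
		return nil
	})
	fmt.Println(err == nil) // true
}
```

A fallback-accepting variant (`Execute(key, primary, fallback)`) would complete the surface described in the fallback section.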
Scaling & Bottlenecks
The circuit breaker adds overhead on every request: a sliding window update (O(1) amortized) and a state check (O(1)). At 1M requests/sec per process, this overhead is negligible in CPU terms. The primary scaling concern is shared state coordination: synchronizing circuit state across 100+ instances via Redis adds network latency to every state-read operation. Mitigation: use an eventually consistent model — read state from Redis every 5 seconds (cached locally between reads) rather than on every request.
The HALF-OPEN probe traffic must not cause a recovery-in-progress downstream to re-fail. The semaphore limiting concurrent probes is critical: without it, all 100 instances might simultaneously send probe traffic when transitioning to HALF-OPEN, creating a burst that overwhelms the recovering service.
Key Trade-offs
- In-process vs. proxy-based circuit breaker: In-process (library like Resilience4j) has zero network overhead but requires per-language implementation; proxy-based (Envoy circuit breaker) is language-agnostic but adds a network hop
- Count-based vs. time-based windows: Count-based windows react faster during high traffic (N recent calls) but lag during low traffic; time-based windows provide consistent time-horizon evaluation regardless of traffic rate
- Fail fast vs. timeout: Circuit breaking (fail fast when open) is more responsive than simply setting a short timeout — a 100ms timeout still incurs 100ms per request during an outage; circuit breaking reduces this to <1ms
- Global vs. local circuit state: Global state (Redis-shared) ensures all instances react consistently but adds coordination overhead; local state is faster and simpler but means some instances may keep hammering a failing service while others have opened