
System Design: Circuit Breaker Pattern

Design a circuit breaker system that prevents cascading failures in distributed systems by tracking downstream error rates and automatically opening the circuit to fail fast during outages.

14 min read · Updated Jan 15, 2025
system-design · circuit-breaker · resilience · hystrix · infrastructure

Requirements

Functional Requirements:

  • Wrap outbound calls to downstream services and track success/failure rates
  • Transition circuit state: CLOSED (normal) → OPEN (failing fast) → HALF-OPEN (testing recovery)
  • In OPEN state, fail immediately without attempting the downstream call
  • In HALF-OPEN state, allow a limited number of probe requests to test recovery
  • Configurable thresholds: error rate percentage, minimum request volume, sleep window, probe count
  • Provide fallback logic: return cached data or a default response when circuit is open

Non-Functional Requirements:

  • Circuit state evaluation adds under 0.5ms overhead per request
  • State transitions propagate across all instances within 5 seconds
  • Handle 1,000,000 requests/sec across the circuit breaker layer
  • Support per-service, per-endpoint, and per-consumer-group circuit breaker instances

Scale Estimation

A microservices deployment: 1,000 services, each (at one instance) making 10 downstream calls/sec on average = 10,000 downstream call attempts/sec in aggregate. With 100 instances per service: 1M downstream calls/sec across the fleet. Each circuit breaker tracks a rolling window of the last 10 seconds of calls — a busy instance handling 10,000 calls/sec holds 100,000 data points in its window. Storing each as a bit (success/failure) = 12.5 KB per circuit per instance. With 50 circuits per instance, 625 KB per process — trivial.
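The window-memory arithmetic above can be sanity-checked in a few lines (using decimal kilobytes, matching the 12.5 KB figure):

```python
# Back-of-envelope check of the sliding-window memory estimate.
calls_per_sec = 10_000                 # a busy instance
window_sec = 10                        # rolling window length
samples = calls_per_sec * window_sec   # outcomes held in the window
bytes_per_circuit = samples // 8       # one bit per success/failure flag
kb_per_process = bytes_per_circuit * 50 / 1000  # 50 circuits per instance
```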

High-Level Architecture

The circuit breaker is an in-process library (not a network proxy), minimizing latency. It wraps each downstream client with a state machine. The state machine uses a sliding window (count-based or time-based) to track the recent success/failure ratio. In CLOSED state, all requests are forwarded normally. When the failure rate exceeds the threshold (e.g., 50% of the last 100 requests failed), the circuit transitions to OPEN. In OPEN state, requests fail immediately with a CircuitOpenException — no network call is made. After a configurable sleep window (e.g., 5 seconds), the circuit enters HALF-OPEN and allows N probe requests through. If the probes succeed, the circuit closes; if any probe fails, it re-opens.

For distributed state sharing (so all instances of a service respond consistently), circuit state can be stored in Redis — instances write their local state and read the aggregate across all instances. This enables a globally coordinated OPEN state where all instances stop sending to a failing downstream simultaneously. However, local-only circuit breakers (no shared state) are simpler and still provide the primary benefit (preventing a single instance from hammering a failing downstream).

Core Components

Sliding Window

Two window types: count-based (last N requests) and time-based (last N seconds). Count-based: implemented as a circular buffer of N boolean values (success/failure). On each call result, overwrite the oldest entry and recompute the failure rate. Time-based: implemented as a bucket array where each bucket represents 1 second; calls are tallied in the current bucket; on each request, evict buckets older than the window size and sum the remaining buckets. Time-based windows handle variable traffic rates better — a count-based window during a traffic dip may not have enough recent samples to make a reliable failure rate determination.
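Both window types can be sketched compactly. A minimal Python version (the count-based window leans on `collections.deque` for eviction; the time-based window injects a clock so it can be tested deterministically; names are illustrative):

```python
import time
from collections import deque

class CountWindow:
    """Count-based window: failure rate over the last N call outcomes."""
    def __init__(self, size):
        self.outcomes = deque(maxlen=size)   # True = success, False = failure

    def record(self, success):
        self.outcomes.append(success)        # deque drops the oldest entry itself

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

class TimeWindow:
    """Time-based window: one bucket per second, evicted once older than the window."""
    def __init__(self, window_sec, clock=time.time):
        self.window_sec = window_sec
        self.clock = clock                   # injectable for testing
        self.buckets = {}                    # second -> [successes, failures]

    def record(self, success):
        sec = int(self.clock())
        bucket = self.buckets.setdefault(sec, [0, 0])
        bucket[0 if success else 1] += 1
        self._evict(sec)

    def failure_rate(self):
        self._evict(int(self.clock()))
        ok = sum(b[0] for b in self.buckets.values())
        fail = sum(b[1] for b in self.buckets.values())
        total = ok + fail
        return fail / total if total else 0.0

    def _evict(self, now_sec):
        cutoff = now_sec - self.window_sec   # keep only the last window_sec buckets
        for sec in [s for s in self.buckets if s <= cutoff]:
            del self.buckets[sec]
```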

State Machine

The state machine transitions are: CLOSED → OPEN when (failure_rate >= threshold AND total_calls >= minimum_volume); OPEN → HALF-OPEN after sleep_window_ms elapses; HALF-OPEN → CLOSED if probe_success_count >= required_probes; HALF-OPEN → OPEN if any probe fails. Transitions must be atomic (CAS operations or synchronized blocks) to prevent races where multiple concurrent callers trigger the same state change. The HALF-OPEN state uses a semaphore to allow exactly N concurrent probes — additional requests during HALF-OPEN fail fast, since unlimited probes could overwhelm a recovering service.
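A minimal sketch of these transitions, with a lock standing in for the CAS/synchronized transition and `probes_in_flight` acting as the probe semaphore. For brevity it uses cumulative counters rather than the sliding window described above; all names are illustrative:

```python
import threading
import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, minimum_volume=10,
                 sleep_window=5.0, required_probes=3, clock=time.time):
        self.failure_threshold = failure_threshold
        self.minimum_volume = minimum_volume
        self.sleep_window = sleep_window
        self.required_probes = required_probes
        self.clock = clock                      # injectable for testing
        self.lock = threading.Lock()            # transitions are atomic under this lock
        self.state = CLOSED
        self.successes = self.failures = 0
        self.opened_at = 0.0
        self.probe_successes = 0
        self.probes_in_flight = 0

    def call(self, fn):
        with self.lock:
            if self.state == OPEN:
                if self.clock() - self.opened_at >= self.sleep_window:
                    self.state = HALF_OPEN      # sleep window elapsed: start probing
                    self.probe_successes = 0
                    self.probes_in_flight = 0
                else:
                    raise CircuitOpenError("failing fast")
            if self.state == HALF_OPEN:
                if self.probes_in_flight >= self.required_probes:
                    raise CircuitOpenError("probe quota full")  # semaphore-style limit
                self.probes_in_flight += 1
        try:
            result = fn()
        except Exception:
            self._record(False)
            raise
        self._record(True)
        return result

    def _record(self, success):
        with self.lock:
            if self.state == HALF_OPEN:
                self.probes_in_flight -= 1
                if not success:
                    self._open()                # any failed probe re-opens
                else:
                    self.probe_successes += 1
                    if self.probe_successes >= self.required_probes:
                        self.state = CLOSED     # probes succeeded: close the circuit
                        self.successes = self.failures = 0
                return
            self.successes += success
            self.failures += not success
            total = self.successes + self.failures
            if (total >= self.minimum_volume and
                    self.failures / total >= self.failure_threshold):
                self._open()

    def _open(self):
        self.state = OPEN
        self.opened_at = self.clock()
```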

Fallback Mechanism

Fallbacks execute when the circuit is open or when a request fails. Fallback types: (1) return a cached previous response (stale cache — serve the last known good value); (2) return a default response (empty list, zero count, or a "feature temporarily unavailable" message); (3) call an alternative service (secondary/backup endpoint); (4) throw a defined exception (for callers that must handle the absence explicitly). Fallbacks are registered alongside the circuit breaker: circuitBreaker.execute(primaryCall, fallback). Fallbacks must complete quickly — a slow fallback in the critical path defeats the purpose of circuit breaking.
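The execute-with-fallback shape reduces to a small wrapper. A sketch with the breaker check abstracted as an `is_open` predicate (names are illustrative), shown here with a stale-cache fallback:

```python
def execute(call, fallback, is_open):
    """Run `call` unless the circuit is open; degrade via `fallback`
    on an open circuit or a downstream failure."""
    if is_open():
        return fallback()          # fail fast: no downstream attempt made
    try:
        return call()
    except Exception:
        return fallback()          # downstream error: serve degraded response

# Stale-cache fallback: serve the last known good value.
last_good = ["item-1", "item-2"]

def fetch_recommendations():
    raise TimeoutError("downstream timed out")

result = execute(fetch_recommendations,
                 fallback=lambda: last_good,
                 is_open=lambda: False)
```

Note that when the circuit is open, `call` is never invoked at all — the whole point of failing fast.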

Database Design

Circuit breaker state is primarily in-process (no database). For distributed coordination, Redis stores: cb:{service}:{instance_id} → {state, failure_count, success_count, last_state_change, next_probe_time}. Each instance writes its local state every 5 seconds. A leader or each instance independently reads the aggregate: if >50% of instances report OPEN, treat the service as globally failing. Redis key TTL = 30 seconds — stale data from crashed instances automatically expires.
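The aggregate-read rule (">50% of instances report OPEN") is a pure function once the per-instance states have been fetched from the `cb:{service}:{instance_id}` keys. A sketch:

```python
def globally_open(instance_states, quorum=0.5):
    """Aggregate per-instance circuit states (as read from the cb:* keys):
    the downstream is treated as globally failing when more than `quorum`
    of the reporting instances have opened their local circuit."""
    if not instance_states:
        return False               # no reports (e.g., all keys expired): assume healthy
    open_count = sum(1 for s in instance_states.values() if s == "OPEN")
    return open_count / len(instance_states) > quorum
```

Because crashed instances' keys expire via TTL, they simply drop out of the denominator rather than skewing the vote.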

For observability, circuit state transitions are emitted as metrics (Prometheus counters/gauges) and events (structured log entries with reason, from_state, to_state). A dashboard shows circuit health across all service-to-service calls — essential for quickly diagnosing cascading failure events in production.

API Design
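Because the breaker is an in-process library, the "API" here is a library surface rather than HTTP endpoints. A hypothetical configuration-and-registry sketch supporting the per-service/per-endpoint instances from the requirements (loosely modeled on Resilience4j-style registries; all names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CircuitBreakerConfig:
    failure_rate_threshold: float = 0.5    # open at >= 50% windowed failures
    minimum_request_volume: int = 20       # skip evaluation below this sample size
    sleep_window_ms: int = 5000            # OPEN -> HALF-OPEN delay
    probe_count: int = 3                   # concurrent probes allowed in HALF-OPEN

class BreakerHandle:
    """Stands in for a full breaker; execute(primary, fallback) would live here."""
    def __init__(self, name, config):
        self.name = name
        self.config = config

class CircuitBreakerRegistry:
    """One breaker per (service, endpoint) key, created lazily on first use."""
    def __init__(self, default_config=None):
        self.default_config = default_config or CircuitBreakerConfig()
        self._breakers = {}

    def get(self, service, endpoint="*", config=None):
        key = (service, endpoint)
        if key not in self._breakers:
            self._breakers[key] = BreakerHandle(
                name=f"{service}:{endpoint}",
                config=config or self.default_config)
        return self._breakers[key]
```

Callers fetch a named breaker once (`registry.get("payments", "/charge")`) and wrap every downstream call through it; the registry guarantees all call sites for the same key share one state machine.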

Scaling & Bottlenecks

The circuit breaker adds overhead on every request: a sliding window update (O(1) amortized) and a state check (O(1)). Even at 1M requests/sec across the circuit breaker layer, this per-request overhead is negligible in CPU terms. The primary scaling concern is shared state coordination: synchronizing circuit state across 100+ instances via Redis adds network latency to every state-read operation. Mitigation: use an eventually consistent model — read state from Redis every 5 seconds (cached locally between reads) rather than on every request.
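The locally-cached read pattern is a small wrapper that refreshes from the shared store at most once per interval (illustrative names; `fetch` stands in for the Redis read):

```python
import time

class CachedStateReader:
    """Serve circuit state from a local cache, hitting the shared store
    (e.g., Redis) at most once per `refresh_sec` per process."""
    def __init__(self, fetch, refresh_sec=5.0, clock=time.monotonic):
        self.fetch = fetch                 # the actual shared-store read
        self.refresh_sec = refresh_sec
        self.clock = clock                 # injectable for testing
        self._cached = None
        self._fetched_at = float("-inf")

    def get(self):
        now = self.clock()
        if now - self._fetched_at >= self.refresh_sec:
            self._cached = self.fetch()    # one shared-store read per interval
            self._fetched_at = now
        return self._cached
```

The trade-off is staleness bounded by the refresh interval, which matches the 5-second propagation target in the requirements.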

The HALF-OPEN probe traffic must not cause a recovery-in-progress downstream to re-fail. The semaphore limiting concurrent probes is critical: without it, all 100 instances might simultaneously send probe traffic when transitioning to HALF-OPEN, creating a burst that overwhelms the recovering service.
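The probe-limiting semaphore can be sketched with a non-blocking acquire, so excess requests fail fast instead of queuing against the recovering service (illustrative names):

```python
import threading

class ProbeGate:
    """Admit at most `limit` concurrent probe requests during HALF-OPEN;
    anything over the quota fails fast rather than piling onto the
    recovering service."""
    def __init__(self, limit):
        self._sem = threading.Semaphore(limit)

    def try_probe(self, fn):
        if not self._sem.acquire(blocking=False):
            return (False, None)       # quota full: caller should fail fast
        try:
            return (True, fn())
        finally:
            self._sem.release()        # free the permit once the probe completes
```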

Key Trade-offs

  • In-process vs. proxy-based circuit breaker: In-process (library like Resilience4j) has zero network overhead but requires per-language implementation; proxy-based (Envoy circuit breaker) is language-agnostic but adds a network hop
  • Count-based vs. time-based windows: Count-based windows react faster during high traffic (N recent calls) but lag during low traffic; time-based windows provide consistent time-horizon evaluation regardless of traffic rate
  • Fail fast vs. timeout: Circuit breaking (fail fast when open) is more responsive than simply setting a short timeout — a 100ms timeout still incurs 100ms per request during an outage; circuit breaking reduces this to <1ms
  • Global vs. local circuit state: Global state (Redis-shared) ensures all instances react consistently but adds coordination overhead; local state is faster and simpler but means some instances may keep hammering a failing service while others have opened
