System Design: Distributed Job Scheduler
Design a distributed job scheduler like Airflow or Quartz that executes cron jobs and DAG-based workflows reliably at scale, with exactly-once guarantees, retry logic, and failure recovery.
Requirements
Functional Requirements:
- Schedule jobs to run at specific times (cron expressions) or on a trigger (event-based)
- Execute jobs exactly once even if multiple scheduler nodes are running
- Support DAG-based workflows: job B runs only after job A completes successfully
- Retry failed jobs with configurable backoff (exponential, fixed)
- Provide job history, logs, and status visibility
- Cancel running jobs and skip scheduled executions
Non-Functional Requirements:
- Schedule 100,000 jobs with sub-second trigger accuracy (fire within 1 second of scheduled time)
- Survive scheduler node failure without missing job executions or double-firing
- Job execution latency: time from scheduled trigger to worker start under 2 seconds
- Support 10,000 concurrent job executions across worker nodes
Scale Estimation
100,000 scheduled jobs: most are low-frequency (daily or hourly). Peak: assume 10,000 jobs scheduled within the same minute (e.g., jobs firing at :00 of every hour), so the scheduler must dispatch 10,000 jobs in 60 seconds, about 167 jobs/sec. Each job dispatch is a message write to a task queue (~200 bytes). Worker nodes: 10,000 concurrent jobs at an average of 1 CPU core each requires 10,000 worker cores; the average 30-second runtime sets steady-state throughput at roughly 333 job completions/sec. Job metadata storage: 100,000 jobs × 5 KB (schedule, config, history) = 500 MB in PostgreSQL.
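The arithmetic, spelled out (the numbers simply restate the estimates above):

```python
# Back-of-envelope math for the estimates above.
peak_burst = 10_000                  # jobs due within the same minute
dispatch_rate = peak_burst / 60      # ~167 dispatches/sec at peak
dispatch_bw = dispatch_rate * 200    # ~33 KB/sec of queue writes: negligible

concurrent = 10_000                  # simultaneously running jobs
worker_cores = concurrent * 1        # 10,000 cores at 1 core/job
completion_rate = concurrent / 30    # ~333 jobs finish/sec at steady state

metadata_bytes = 100_000 * 5 * 1024  # 100k jobs x 5 KB = ~500 MB
```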
High-Level Architecture
The scheduler has three layers: the scheduling engine (determines when jobs should fire), the dispatch layer (sends jobs to workers), and the execution layer (workers run jobs and report results).
The scheduling engine uses a time-ordered priority queue (min-heap) of job triggers sorted by next_run_time. A background loop sleeps until the next job is due, pops it from the heap, dispatches it, and re-inserts the job with the next trigger time. In a distributed deployment, multiple scheduler nodes compete — a leader election (via a distributed lock or database row lock) ensures only one scheduler fires each job. The leader holds a lease (30-second TTL) and is the only node polling the priority queue; if the leader crashes, a follower acquires the lease and resumes.
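A sketch of the lease against PostgreSQL. The scheduler_lease table, its columns, and the single-row convention (id = 1) are illustrative assumptions, not part of the design above:

```python
import uuid

import psycopg2  # pip install psycopg2-binary

NODE_ID = str(uuid.uuid4())  # unique identity for this scheduler node
LEASE_TTL_SECS = 30          # matches the 30-second TTL above

def try_acquire_lease(conn) -> bool:
    """Return True if this node holds the scheduler lease after the call.

    The single-row upsert renews our own lease or steals an expired one;
    a live lease held by another node is left untouched.
    """
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO scheduler_lease (id, holder, expires_at)
            VALUES (1, %s, now() + make_interval(secs => %s))
            ON CONFLICT (id) DO UPDATE
               SET holder = EXCLUDED.holder,
                   expires_at = EXCLUDED.expires_at
             WHERE scheduler_lease.expires_at < now()        -- steal expired lease
                OR scheduler_lease.holder = EXCLUDED.holder  -- renew our own
            RETURNING holder
            """,
            (NODE_ID, LEASE_TTL_SECS),
        )
        row = cur.fetchone()
    conn.commit()
    return row is not None  # a returned row means the upsert applied for us
```

The leader calls this on every loop iteration; followers call it on a timer and take over the moment the row expires.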
The dispatch layer writes job execution requests to a task queue (Kafka or RabbitMQ). Workers consume from the queue, execute the job, and write results back. The task queue decouples the scheduler from worker availability: if all workers are busy, jobs simply queue up. Kafka provides durability (jobs are not lost if workers crash) and replayability (after a bug fix, affected executions can be reprocessed by rewinding the consumer offset).
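A minimal producer-side sketch using kafka-python; the topic name, broker address, and message shape are assumptions for illustration:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
    acks="all",  # a dispatched job must survive a broker failure
)

def dispatch(execution_id: int, job_id: str, scheduled_ts: float) -> None:
    # Keying by job_id keeps all executions of one job in one partition,
    # preserving per-job ordering for consumers.
    producer.send(
        "job-dispatch",
        key=job_id,
        value={"execution_id": execution_id, "job_id": job_id,
               "scheduled": scheduled_ts},
    )
```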
Core Components
Trigger Engine
The trigger engine maintains an in-memory min-heap of (next_run_time, job_id) pairs. On startup, it loads all active jobs from the database and populates the heap. Cron expressions are parsed into a next_run_time using standard cron libraries (which handle timezones and DST transitions). Event-triggered jobs subscribe to an event stream; when the triggering event arrives, the job is enqueued immediately. The engine peeks at the heap head and sleeps until the next trigger time. Clock drift: the engine uses a monotonic clock for sleep timing but the wall clock for cron expression evaluation; NTP synchronization keeps wall clocks within 100 ms across nodes.
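A single-node sketch of that loop, using Python's heapq and the croniter library. The one-second maximum sleep (so newly added jobs get noticed) and the dispatch callback are assumptions:

```python
import heapq
import time
from datetime import datetime, timezone

from croniter import croniter  # pip install croniter

def run_trigger_loop(jobs: dict[str, str], dispatch) -> None:
    """jobs maps job_id -> cron expression; dispatch(job_id, due_at) fires one run."""
    heap: list[tuple[float, str]] = []
    now = datetime.now(timezone.utc)
    for job_id, expr in jobs.items():
        first = croniter(expr, now).get_next(datetime)
        heapq.heappush(heap, (first.timestamp(), job_id))

    while heap:
        due_at, job_id = heap[0]         # peek at the earliest trigger
        delay = due_at - time.time()
        if delay > 0:
            time.sleep(min(delay, 1.0))  # wake at least once/sec for new jobs
            continue
        heapq.heappop(heap)
        dispatch(job_id, due_at)
        # Re-insert with the next trigger time, evaluated from the fired slot.
        base = datetime.fromtimestamp(due_at, timezone.utc)
        nxt = croniter(jobs[job_id], base).get_next(datetime)
        heapq.heappush(heap, (nxt.timestamp(), job_id))
```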
Exactly-Once Execution
The critical invariant is that each scheduled execution fires exactly once. Implementation: when the trigger engine fires a job, it writes a job_execution record to PostgreSQL with status=PENDING, using an idempotency key of (job_id, scheduled_time). The INSERT uses INSERT ... ON CONFLICT DO NOTHING, so if two scheduler nodes race (both briefly believing they are leader), only one insert succeeds. The winner dispatches the job; the loser's insert is a no-op, so it skips dispatch. On the worker side, starting a job is an atomic claim: UPDATE the execution from PENDING to RUNNING, and skip if zero rows are affected because the row is already RUNNING or COMPLETED (this handles duplicate messages from the task queue). This two-layer idempotency (scheduler-side insert plus worker-side claim) ensures each scheduled execution is dispatched and started exactly once, even under network partitions.
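Both layers as SQL issued from Python, a sketch assuming a unique constraint on (job_id, scheduled_time) in job_executions:

```python
import psycopg2  # pip install psycopg2-binary

def try_claim_dispatch(conn, job_id: str, scheduled_time) -> bool:
    """Layer 1 (scheduler): only one node wins the insert for a given
    (job_id, scheduled_time); the loser sees rowcount == 0 and skips dispatch."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO job_executions (job_id, scheduled_time, status)
            VALUES (%s, %s, 'PENDING')
            ON CONFLICT (job_id, scheduled_time) DO NOTHING
            """,
            (job_id, scheduled_time),
        )
        won = cur.rowcount == 1
    conn.commit()
    return won

def try_start_execution(conn, execution_id: int, worker_id: str) -> bool:
    """Layer 2 (worker): atomically flip PENDING -> RUNNING, so a duplicate
    queue message finds no PENDING row and is dropped."""
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE job_executions
               SET status = 'RUNNING', worker_id = %s, started_at = now()
             WHERE id = %s AND status = 'PENDING'
            """,
            (worker_id, execution_id),
        )
        claimed = cur.rowcount == 1
    conn.commit()
    return claimed
```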
DAG Execution Engine
DAG workflows define dependencies between tasks. A DAG is a directed acyclic graph where each node is a task and edges represent dependencies (task B depends on task A). The engine represents a DAG execution as a collection of task_run records. When a task completes, the engine checks whether all upstream dependencies of each downstream task are now complete; if so, it enqueues the downstream task. The topological evaluation runs as a database query, in effect: select task_runs where dag_run_id = X and status = PENDING and every upstream task is COMPLETED. Cycle detection is performed at DAG definition time (not at runtime) using DFS.
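A sketch of the completion hook. It assumes the dependency map has already been loaded from dag_definitions into a dict of task -> upstream list, and that enqueue() produces to the task queue; a production engine would also claim each downstream task atomically (as in the exactly-once section) to avoid double-enqueueing when two upstreams finish simultaneously:

```python
def on_task_complete(conn, dag_run_id: int, finished: str,
                     upstreams: dict[str, list[str]], enqueue) -> None:
    """Enqueue every downstream task whose upstream set is now fully COMPLETED."""
    downstream = [t for t, ups in upstreams.items() if finished in ups]
    with conn.cursor() as cur:
        for task in downstream:
            cur.execute(
                """
                SELECT count(*) FROM task_runs
                 WHERE dag_run_id = %s
                   AND task_name = ANY(%s)
                   AND status = 'COMPLETED'
                """,
                (dag_run_id, upstreams[task]),
            )
            if cur.fetchone()[0] == len(upstreams[task]):
                enqueue(dag_run_id, task)
```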
Database Design
PostgreSQL stores all scheduler state. Key tables:
- jobs (id, name, cron_expression, payload, retry_policy, enabled, next_run_time)
- job_executions (id, job_id, scheduled_time, status, worker_id, started_at, completed_at, result, attempt_count)
- dag_definitions (id, name, tasks JSONB, dependencies JSONB)
- task_runs (id, dag_run_id, task_name, status, started_at, completed_at, logs)
The next_run_time column in the jobs table is indexed; the scheduler runs SELECT * FROM jobs WHERE next_run_time <= now() AND enabled = true ORDER BY next_run_time LIMIT 100 every second to find jobs to fire. This query is the hot path; it must complete in milliseconds. With 100,000 jobs, a B-tree index on next_run_time makes this query O(log n + k), where k is the number of due jobs.
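Index and query, spelled out. The partial index and the FOR UPDATE SKIP LOCKED clause (which would let several pollers share the work without double-claiming rows) are optional refinements, not requirements of the design above:

```python
# Partial index: covers exactly the rows the hot-path query scans.
SETUP_SQL = """
CREATE INDEX IF NOT EXISTS jobs_due_idx
    ON jobs (next_run_time)
 WHERE enabled;
"""

# With SKIP LOCKED, concurrent pollers each grab a disjoint batch of due jobs.
POLL_SQL = """
SELECT id, cron_expression, payload
  FROM jobs
 WHERE next_run_time <= now()
   AND enabled
 ORDER BY next_run_time
 LIMIT 100
   FOR UPDATE SKIP LOCKED;
"""
```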
API Design
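The API follows directly from the functional requirements; the endpoints below are an illustrative sketch, not a fixed contract:
- POST /jobs: create a job (cron expression or event trigger, payload, retry policy)
- GET /jobs/{id}: fetch the job definition, enabled flag, and next_run_time
- PATCH /jobs/{id}: update the schedule or payload; enable or disable
- GET /jobs/{id}/executions: execution history with status and logs
- POST /executions/{id}/cancel: cancel a running execution
- POST /jobs/{id}/skip-next: skip the next scheduled execution
- POST /dags and GET /dags/{id}/runs/{run_id}: register DAG definitions and inspect task-level run status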
Scaling & Bottlenecks
The single-leader scheduler creates a throughput ceiling. At 10,000 jobs/minute, the leader must process 167 jobs/sec, each requiring a database INSERT, a Kafka produce, and a heap update. A single PostgreSQL insert takes ~1 ms, so 167/sec is trivially achievable. The bottleneck appears around 100,000 jobs/min: 1,667 individually committed INSERTs per second begins to saturate a single PostgreSQL node (batching commits buys headroom, but the single leader remains the choke point). Mitigation: partition the job namespace by consistent hashing, so each scheduler node owns a subset of jobs (by hash of job_id) and is the leader for its partition. This provides horizontal scaling with no single-leader bottleneck.
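A sketch of partition ownership with a fixed partition count. This uses simple modulo hashing for brevity; a true consistent-hash ring, or per-partition leases in the style of Kafka's consumer groups, would be needed to rebalance smoothly when nodes join or leave:

```python
import hashlib

NUM_PARTITIONS = 16  # fixed count, chosen arbitrarily for illustration

def partition_of(job_id: str) -> int:
    # Stable across processes; Python's built-in hash() is randomized per run.
    digest = hashlib.sha256(job_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def owned_by(job_id: str, node_partitions: set[int]) -> bool:
    """True if this node's assigned partitions include the job's partition."""
    return partition_of(job_id) in node_partitions
```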
Worker node scaling is straightforward: workers are stateless consumers of the task queue. Add more workers to handle more concurrent jobs. The task queue (Kafka) absorbs bursts — if a batch of 10,000 jobs fires at once, they queue in Kafka and workers drain the queue as capacity allows.
Key Trade-offs
- At-least-once vs. exactly-once execution: At-least-once (retry on failure, accept duplicates) is simpler to implement and more resilient; exactly-once requires idempotent job designs and two-phase commit between scheduler and worker
- Pull vs. push job dispatch: Workers pulling from a task queue (Kafka, SQS) scales better (workers control their own rate) and is more resilient; push (scheduler directly calling workers) is lower latency but requires worker discovery and backpressure handling
- Cron-based vs. DAG-based scheduling: Cron is simple and universal but cannot express dependencies; DAG engines handle complex workflows but add significant operational complexity
- Tight vs. loose trigger accuracy: Sub-second accuracy requires in-memory scheduling (heap + sleep loop); for jobs where ±5 minute accuracy is acceptable, a simple database poll every 60 seconds eliminates the in-memory heap and simplifies distributed coordination