
System Design: Distributed Job Scheduler

Design a distributed job scheduler like Airflow or Quartz that executes cron jobs and DAG-based workflows reliably at scale, with exactly-once guarantees, retry logic, and failure recovery.

17 min read · Updated Jan 15, 2025

Tags: system-design, job-scheduler, cron, airflow, infrastructure

Requirements

Functional Requirements:

  • Schedule jobs to run at specific times (cron expressions) or on a trigger (event-based)
  • Execute jobs exactly once even if multiple scheduler nodes are running
  • Support DAG-based workflows: job B runs only after job A completes successfully
  • Retry failed jobs with configurable backoff (exponential, fixed)
  • Provide job history, logs, and status visibility
  • Cancel running jobs and skip scheduled executions

Non-Functional Requirements:

  • Schedule 100,000 jobs with sub-second trigger accuracy (fire within 1 second of scheduled time)
  • Survive scheduler node failure without missing job executions or double-firing
  • Job execution latency: time from scheduled trigger to worker start under 2 seconds
  • Support 10,000 concurrent job executions across worker nodes

Scale Estimation

100,000 scheduled jobs: most are low-frequency (daily/hourly). Peak: assume 10,000 jobs scheduled within the same minute (e.g., jobs that fire at :00 of every hour). The scheduler must fire 10,000 jobs in 60 seconds ≈ 167 jobs/sec. Each job dispatch is a message write to a task queue (~200 bytes). Worker capacity: 10,000 concurrent jobs × 1 CPU core each = 10,000 worker cores; with an average 30-second runtime, that concurrency sustains roughly 10,000 / 30 ≈ 333 job completions/sec at steady state. Job metadata storage: 100,000 jobs × 5 KB (schedule, config, history) = 500 MB in PostgreSQL.

High-Level Architecture

The scheduler has three layers: the scheduling engine (determines when jobs should fire), the dispatch layer (sends jobs to workers), and the execution layer (workers run jobs and report results).

The scheduling engine uses a time-ordered priority queue (min-heap) of job triggers sorted by next_run_time. A background loop sleeps until the next job is due, pops it from the heap, dispatches it, and re-inserts the job with the next trigger time. In a distributed deployment, multiple scheduler nodes compete — a leader election (via a distributed lock or database row lock) ensures only one scheduler fires each job. The leader holds a lease (30-second TTL) and is the only node polling the priority queue; if the leader crashes, a follower acquires the lease and resumes.
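The leader lease described above can be sketched as a single conditional UPDATE against a one-row lease table: the update succeeds only if the lease is free, expired, or already held by the caller, so acquisition and renewal are both atomic. This is a minimal sketch using SQLite for demonstration; the `scheduler_lease` table name and column layout are illustrative, not a fixed schema.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduler_lease ("
    "  id INTEGER PRIMARY KEY CHECK (id = 1),"  # single-row table
    "  holder TEXT,"
    "  expires_at REAL)"
)
conn.execute("INSERT INTO scheduler_lease VALUES (1, NULL, 0)")

LEASE_TTL = 30.0  # seconds, matching the 30-second TTL above

def try_acquire(node_id: str, now: float) -> bool:
    """Atomically take (or renew) the lease if it is free, expired, or ours."""
    cur = conn.execute(
        "UPDATE scheduler_lease SET holder = ?, expires_at = ? "
        "WHERE id = 1 AND (holder IS NULL OR holder = ? OR expires_at < ?)",
        (node_id, now + LEASE_TTL, node_id, now),
    )
    return cur.rowcount == 1  # exactly one node's UPDATE matches the WHERE clause

now = time.time()
print(try_acquire("node-a", now))                   # node-a becomes leader
print(try_acquire("node-b", now))                   # node-b fails: lease held
print(try_acquire("node-b", now + LEASE_TTL + 1))   # succeeds after expiry
```

A follower polls this same statement on an interval shorter than the TTL, so a crashed leader is replaced within one lease period.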

The dispatch layer writes job execution requests to a task queue (Kafka or RabbitMQ). Workers consume from the queue, execute the job, and write results back. The task queue decouples the scheduler from worker availability — if all workers are busy, jobs queue up. Kafka provides durability (jobs are not lost if workers crash) and replayability (a bug-fix reprocessing jobs can rewind the consumer offset).
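A dispatch message along these lines might look like the following sketch; the field names are illustrative, and the actual produce call (e.g. a Kafka producer's send) is left as a comment since it needs a live broker. The idempotency key travels with the message so workers can detect duplicates.

```python
import json
import time

def build_dispatch_message(job_id: str, scheduled_time: float, payload: dict) -> bytes:
    """Serialize a job execution request for the task queue."""
    msg = {
        # Idempotency key: job_id + scheduled_time, as in the exactly-once design
        "execution_id": f"{job_id}:{int(scheduled_time)}",
        "job_id": job_id,
        "scheduled_time": scheduled_time,
        "payload": payload,
        "dispatched_at": time.time(),
    }
    return json.dumps(msg).encode("utf-8")

msg = build_dispatch_message("daily-report", 1736899200.0, {"cmd": "run_report"})
print(len(msg))  # on the order of the ~200-byte estimate from scale estimation
# A real dispatcher would then produce it, e.g.: producer.send("job-dispatch", msg)
```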

Core Components

Trigger Engine

The trigger engine maintains an in-memory min-heap of (next_run_time, job_id) pairs. On startup, it loads all active jobs from the database and populates the heap. Cron expressions are parsed into a next_run_time using standard cron libraries (handles timezone, DST transitions). Event-triggered jobs subscribe to an event stream — when the triggering event arrives, the job is enqueued immediately. The engine polls the heap head in a tight loop, sleeping until the next trigger time. Clock drift: the engine uses a monotonic clock for sleep timing but wall clock for cron expression evaluation; NTP synchronization keeps wall clocks within 100ms across nodes.
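The heap-and-sleep loop above can be sketched in a few lines. This is a minimal single-threaded version: cron parsing is stubbed out with a fixed 60-second interval (a real engine would compute the next fire time with a cron library such as croniter), and dispatch is injected as a callback.

```python
import heapq

# Min-heap of (next_run_time, job_id) pairs, as described above
heap = []

def schedule(job_id: str, next_run_time: float) -> None:
    heapq.heappush(heap, (next_run_time, job_id))

def run_once(now: float, dispatch) -> float:
    """Fire all due jobs, then return how long to sleep until the next one."""
    while heap and heap[0][0] <= now:
        run_time, job_id = heapq.heappop(heap)
        dispatch(job_id, run_time)
        # Re-insert with the next trigger time. A real engine derives this
        # from the job's cron expression; a fixed interval stands in here.
        schedule(job_id, run_time + 60.0)
    return max(0.0, heap[0][0] - now) if heap else 1.0

fired = []
schedule("job-a", 100.0)
schedule("job-b", 160.0)
sleep_for = run_once(now=100.0, dispatch=lambda job, t: fired.append(job))
print(fired)      # ['job-a']
print(sleep_for)  # 60.0 — job-b is due at 160, job-a re-scheduled for 160
```

The production loop sleeps on a monotonic clock for `sleep_for` seconds (or until a new earlier job is registered), matching the clock-drift note above.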

Exactly-Once Execution

The critical invariant is that each scheduled execution fires exactly once. Implementation: when the trigger engine fires a job, it writes a job_execution record to PostgreSQL with status=PENDING using an idempotency key (job_id + scheduled_time). The INSERT uses INSERT ... ON CONFLICT DO NOTHING — if two scheduler nodes race (briefly both thinking they are leader), only one insert succeeds. The winner dispatches the job; the loser's insert fails, so it skips dispatch. Workers check execution status before starting — if status=RUNNING or status=COMPLETED, they skip (handles duplicate messages from the task queue). This two-layer idempotency (database insert + worker check) ensures exactly-once semantics even under network partitions.
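The keyed-insert race can be demonstrated directly. This sketch uses SQLite (which accepts the same ON CONFLICT DO NOTHING clause as PostgreSQL) purely so it runs self-contained; the table is a cut-down version of the job_executions schema below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_executions ("
    "  job_id TEXT NOT NULL,"
    "  scheduled_time TEXT NOT NULL,"
    "  status TEXT NOT NULL DEFAULT 'PENDING',"
    "  UNIQUE (job_id, scheduled_time))"   # the idempotency key
)

def claim_execution(job_id: str, scheduled_time: str) -> bool:
    """Return True only for the scheduler node that wins the insert race."""
    cur = conn.execute(
        "INSERT INTO job_executions (job_id, scheduled_time) VALUES (?, ?) "
        "ON CONFLICT (job_id, scheduled_time) DO NOTHING",
        (job_id, scheduled_time),
    )
    return cur.rowcount == 1  # 0 means the row already existed: skip dispatch

# Two nodes briefly both believe they are leader and fire the same execution:
print(claim_execution("daily-report", "2025-01-15T00:00:00Z"))  # True  — dispatches
print(claim_execution("daily-report", "2025-01-15T00:00:00Z"))  # False — skips
```

The loser's insert is a no-op, so no second message ever reaches the task queue; the worker-side status check then covers duplicates introduced by the queue itself.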

DAG Execution Engine

DAG workflows define dependencies between tasks. A DAG is a directed acyclic graph where each node is a task and edges represent dependencies (task B depends on task A). The engine represents a DAG execution as a collection of task_run records. When a task completes, the engine checks whether all upstream dependencies of each downstream task are now complete; if so, it enqueues the downstream task. The topological evaluation runs as a database query, conceptually: SELECT * FROM task_runs WHERE dag_run_id = X AND status = 'PENDING' and every upstream task has status = 'COMPLETED'. Cycle detection is performed at DAG definition time (not at runtime) using DFS.
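The ready-task check reduces to a set comparison per downstream task. The sketch below shows the in-memory equivalent of that database query; the example DAG (extract/transform/load/report) and its statuses are illustrative.

```python
# Dependencies: downstream task -> set of upstream tasks it waits on.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform", "load"},
}

def ready_tasks(deps: dict, completed: set, started: set) -> list:
    """Tasks whose upstream dependencies are all complete and that have not
    started yet — the set the engine would enqueue next."""
    return sorted(
        task for task, upstream in deps.items()
        if task not in completed and task not in started
        and upstream <= completed   # all upstream tasks completed
    )

print(ready_tasks(deps, completed=set(), started=set()))         # ['extract']
print(ready_tasks(deps, completed={"extract"}, started=set()))   # ['transform']
print(ready_tasks(deps, completed={"extract", "transform"}, started=set()))
# ['load'] — 'report' still waits on 'load'
```

Running this check on every task-completion event (rather than on a timer) keeps DAG progression latency at roughly one database round trip per hop.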

Database Design

PostgreSQL stores all scheduler state. Key tables: jobs (id, name, cron_expression, payload, retry_policy, enabled, next_run_time), job_executions (id, job_id, scheduled_time, status, worker_id, started_at, completed_at, result, attempt_count), dag_definitions (id, name, tasks JSONB, dependencies JSONB), task_runs (id, dag_run_id, task_name, status, started_at, completed_at, logs).

The next_run_time column in the jobs table is indexed — the scheduler queries SELECT * FROM jobs WHERE next_run_time <= now() AND enabled=true ORDER BY next_run_time LIMIT 100 every second to find jobs to fire. This query is the hot path; it must complete in milliseconds. With 100,000 jobs, a B-tree index on next_run_time makes this query O(log n + k) where k is the number of due jobs.
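The hot-path poll can be demonstrated end to end. SQLite stands in for PostgreSQL here so the sketch is self-contained, and the schema is cut down to the columns the query touches; job names and timestamps are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "  id INTEGER PRIMARY KEY,"
    "  name TEXT,"
    "  enabled INTEGER NOT NULL DEFAULT 1,"
    "  next_run_time REAL NOT NULL)"
)
# The index that keeps the poll at O(log n + k):
conn.execute("CREATE INDEX idx_jobs_next_run ON jobs (next_run_time)")

conn.executemany(
    "INSERT INTO jobs (name, enabled, next_run_time) VALUES (?, ?, ?)",
    [("hourly-sync", 1, 90.0), ("daily-report", 1, 120.0), ("disabled-job", 0, 50.0)],
)

def due_jobs(now: float, limit: int = 100):
    return conn.execute(
        "SELECT name FROM jobs WHERE next_run_time <= ? AND enabled = 1 "
        "ORDER BY next_run_time LIMIT ?",
        (now, limit),
    ).fetchall()

print(due_jobs(now=100.0))  # [('hourly-sync',)] — due and enabled; disabled job excluded
```

In PostgreSQL, a partial index (WHERE enabled = true) would shrink the index further, since disabled jobs never match the poll.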

API Design
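A REST surface implied by the functional requirements (scheduling, cancellation, history, DAG registration) might look like the following sketch; the paths and verbs are illustrative, not a fixed contract.

```text
POST   /jobs                     create a job (cron expression or event trigger, payload, retry policy)
GET    /jobs/{id}                job definition, enabled flag, next_run_time
PATCH  /jobs/{id}                update schedule, retry policy, or enabled flag
DELETE /jobs/{id}                remove the job
GET    /jobs/{id}/executions     execution history with status, attempts, and logs
POST   /executions/{id}/cancel   cancel a running execution
POST   /jobs/{id}/skip-next      skip the next scheduled execution
POST   /dags                     register a DAG definition
GET    /dag-runs/{id}            DAG run status with per-task states
```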

Scaling & Bottlenecks

The single-leader scheduler creates a throughput ceiling. At 10,000 jobs/minute, the leader must process 167 jobs/sec — each requiring a database INSERT, a Kafka produce, and a heap update. A single PostgreSQL insert takes ~1ms, so 167/sec is trivially achievable. The bottleneck appears at 100,000 jobs/min: 1,667/sec saturates single-node PostgreSQL INSERT throughput. Mitigation: partition the job namespace by consistent hashing — each scheduler node owns a subset of jobs (by hash of job_id) and is the leader for its partition. This provides horizontal scaling with no single-leader bottleneck.
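Partition ownership by consistent hashing can be sketched with a small hash ring. This is a minimal illustration, not a production implementation; the scheduler node names and the 64-virtual-node count are arbitrary choices.

```python
import hashlib
from bisect import bisect

class HashRing:
    """Minimal consistent-hash ring; virtual nodes smooth the distribution."""

    def __init__(self, nodes, vnodes: int = 64):
        # Each physical node appears vnodes times on the ring.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def owner(self, job_id: str) -> str:
        """The node whose vnode is the first at or after hash(job_id), wrapping."""
        idx = bisect(self.keys, self._hash(job_id)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["sched-1", "sched-2", "sched-3"])
jobs = [f"job-{i}" for i in range(9)]
assignment = {job: ring.owner(job) for job in jobs}
# Each scheduler node fires only the jobs it owns. Unlike hash(job_id) % N,
# removing a node reassigns only that node's share of the jobs.
```

Each node then runs the leader/heap machinery only over its own partition, so adding scheduler nodes raises fire throughput roughly linearly.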

Worker node scaling is straightforward: workers are stateless consumers of the task queue. Add more workers to handle more concurrent jobs. The task queue (Kafka) absorbs bursts — if a batch of 10,000 jobs fires at once, they queue in Kafka and workers drain the queue as capacity allows.

Key Trade-offs

  • At-least-once vs. exactly-once execution: At-least-once (retry on failure, accept duplicates) is simpler to implement and more resilient; exactly-once requires layered idempotency (keyed execution inserts plus worker-side status checks, as described above), and jobs with external side effects must still be written to tolerate a rare duplicate attempt
  • Pull vs. push job dispatch: Workers pulling from a task queue (Kafka, SQS) scales better (workers control their own rate) and is more resilient; push (scheduler directly calling workers) is lower latency but requires worker discovery and backpressure handling
  • Cron-based vs. DAG-based scheduling: Cron is simple and universal but cannot express dependencies; DAG engines handle complex workflows but add significant operational complexity
  • Tight vs. loose trigger accuracy: Sub-second accuracy requires in-memory scheduling (heap + sleep loop); for jobs where ±5 minute accuracy is acceptable, a simple database poll every 60 seconds eliminates the in-memory heap and simplifies distributed coordination
