
System Design: Task Queue System

Design a distributed task queue like Celery or Sidekiq that reliably dispatches background jobs to workers with priority queues, retries, dead-letter handling, and at-least-once delivery guarantees.

15 min read · Updated Jan 15, 2025
Tags: system-design, task-queue, celery, redis, infrastructure

Requirements

Functional Requirements:

  • Enqueue background tasks (function name + arguments) from web servers to worker processes
  • Workers pull tasks, execute them, and acknowledge on success
  • Retry failed tasks with configurable exponential backoff (max attempts, retry intervals)
  • Support priority queues: high-priority tasks execute before low-priority tasks
  • Dead-letter queue: tasks that exhaust retries move to DLQ for inspection
  • Delayed tasks: execute at a specific time in the future (scheduled execution)
  • Task result storage: retrieve the return value of a completed task by task ID

Non-Functional Requirements:

  • Enqueue latency under 5ms p99
  • At-least-once delivery: a task must not be lost even if a worker crashes mid-execution
  • Support 100,000 task enqueues/sec and 50,000 task executions/sec
  • Horizontal worker scaling: add workers without downtime

Scale Estimation

100,000 enqueues/sec × 1 KB average task payload = 100 MB/sec into the queue. For Redis-backed queues (Sidekiq model): each enqueue is an LPUSH and each dequeue a BRPOPLPUSH, both sub-millisecond operations even with blocking pops. At 100,000/sec, Redis handles this on a single node (Redis processes on the order of 1M simple commands/sec). In-flight tasks, by Little's law: 50,000 executions/sec × 2-second average task duration = 100,000 tasks in flight at any time. Result storage (if enabled): completions are bounded by the execution rate, so 50,000 completions/sec × 1 KB result × 24h TTL ≈ 4.3 TB of live result data — which is why results are typically disabled or given short TTLs.
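The estimates above are plain arithmetic; a quick sanity check in Python, using the numbers from the requirements (note that sustained completions cannot exceed the 50,000/sec execution rate):

```python
# Back-of-envelope check of the scale estimates above.
enqueues_per_sec = 100_000
payload_bytes = 1_000            # ~1 KB average task payload

ingress_mb_per_sec = enqueues_per_sec * payload_bytes / 1_000_000
print(ingress_mb_per_sec)        # 100 MB/sec into the queue

# Little's law: in-flight = arrival rate x average time in system.
executions_per_sec = 50_000
avg_task_seconds = 2
in_flight = executions_per_sec * avg_task_seconds
print(in_flight)                 # 100,000 tasks in flight

# Result storage with a 24h TTL, bounded by the execution rate.
result_bytes = 1_000
ttl_seconds = 24 * 3600
result_tb = executions_per_sec * result_bytes * ttl_seconds / 1e12
print(round(result_tb, 1))       # ~4.3 TB of live result data
```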

High-Level Architecture

The task queue system has three components: producers (web servers, API handlers that enqueue tasks), the broker (the queue itself — Redis, RabbitMQ, or Kafka), and workers (processes that consume and execute tasks).

For reliability, the broker must not lose tasks. Redis lists are fast but volatile — a Redis crash without persistence loses all queued tasks. Mitigations: enable Redis AOF persistence (with appendfsync always, every write is fsynced to disk); use Redis Cluster for HA; or use Kafka as the broker for true durability (tasks are replicated across Kafka brokers before acknowledgment). Kafka adds ~5ms latency vs. Redis's ~0.5ms — the choice depends on durability requirements.

At-least-once delivery requires visibility-timeout semantics: when a worker dequeues a task, the task moves to an in-progress list (BRPOPLPUSH in Redis atomically moves it from the ready queue to a per-worker processing list). If the worker crashes, a reaper process periodically scans the processing lists for tasks that have been in progress longer than their visibility timeout and re-enqueues them. Workers themselves never poll — the blocking pop wakes them only when a task arrives — and the reaper guarantees that tasks held by crashed workers are eventually retried.
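A minimal in-memory sketch of these semantics, with plain Python lists and dicts standing in for the Redis structures (all names illustrative; `now` is injected so the timeout logic is explicit):

```python
import time

class Broker:
    """Toy model of BRPOPLPUSH-style dequeue plus a reaper:
    `ready` holds pending tasks, `processing` maps each in-progress
    task to its visibility deadline."""

    def __init__(self, visibility_timeout=30.0):
        self.ready = []           # stands in for the ready list
        self.processing = {}      # task -> visibility deadline
        self.timeout = visibility_timeout

    def enqueue(self, task):
        self.ready.append(task)

    def dequeue(self, now=None):
        # Atomically move from ready to processing (BRPOPLPUSH analogue).
        if not self.ready:
            return None
        now = time.time() if now is None else now
        task = self.ready.pop(0)
        self.processing[task] = now + self.timeout
        return task

    def ack(self, task):
        # Worker finished successfully; drop the in-progress entry.
        self.processing.pop(task, None)

    def reap(self, now=None):
        # Re-enqueue tasks whose visibility timeout has expired
        # (worker presumed crashed). Run periodically.
        now = time.time() if now is None else now
        expired = [t for t, deadline in self.processing.items()
                   if deadline <= now]
        for t in expired:
            del self.processing[t]
            self.ready.append(t)
        return expired
```

A worker crash shows up as an unacked task whose deadline passes; `reap()` returns it to the ready queue, which is exactly the at-least-once guarantee (the task may run twice, never zero times).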

Core Components

Priority Queue Implementation

Priority queues use separate Redis lists per priority level (high, medium, low) or a single Redis sorted set with priority encoded in the score. Multiple-list approach: workers try the high-priority list first (a non-blocking RPOPLPUSH, which returns nil when the list is empty); if empty, they try medium, then low. This is simple, but workers must poll each list in turn. Sorted-set approach: enqueue with score = (priority × large_constant) + timestamp, where a lower priority number means more urgent and the timestamp preserves FIFO order within a priority level. Workers call ZPOPMIN to dequeue the lowest-scoring — i.e. highest-priority — task. Sorted sets have O(log N) enqueue/dequeue vs. O(1) for lists; the overhead is acceptable at 100,000 tasks/sec.
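The sorted-set scoring scheme fits in a few lines; `PRIORITY_STRIDE` is an illustrative constant chosen larger than any millisecond epoch timestamp, so priority always dominates the tiebreak:

```python
# Score for the sorted-set approach: priority dominates; the enqueue
# timestamp breaks ties so tasks within one priority stay FIFO.
PRIORITY_STRIDE = 10**13  # larger than any ms-epoch timestamp today

def score(priority: int, enqueued_at_ms: int) -> int:
    """priority 0 = most urgent; ZPOPMIN pops the lowest score first."""
    return priority * PRIORITY_STRIDE + enqueued_at_ms

# A high-priority task enqueued later still beats an earlier low-priority one:
assert score(0, 2_000) < score(1, 1_000)
# Within one priority level, earlier enqueue wins (FIFO):
assert score(1, 1_000) < score(1, 2_000)
```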

Retry Logic with Exponential Backoff

On task failure, the worker does not immediately re-enqueue. Instead, it calculates the next retry time: retry_at = now + base_delay × 2^(attempt − 1). With a 30s base delay: attempt 1 = 30s, attempt 2 = 60s, attempt 3 = 120s, and so on up to max_delay (e.g., 1 hour); production systems typically also add jitter to avoid retry storms. The task is written to a delayed set (a Redis sorted set scored by retry_at timestamp). A scheduler process runs every second, scanning the delayed set for tasks with retry_at <= now and moving them back to the ready queue (ZRANGEBYSCORE + LPUSH + ZREM in a Lua script for atomicity). After max_attempts failures, the task moves to the dead-letter queue — a separate list where operators can inspect failure reasons and manually replay or discard.
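One common parameterization of this schedule (1-based attempt counter, 30 s base, 1 h cap — the values used in the example above) can be sketched as:

```python
def retry_delay(attempt: int, base_delay: float = 30.0,
                max_delay: float = 3600.0) -> float:
    """Delay in seconds before retry `attempt` (1-based):
    30s, 60s, 120s, ... doubling until capped at max_delay."""
    return min(base_delay * 2 ** (attempt - 1), max_delay)

print([retry_delay(a) for a in (1, 2, 3, 8)])
# -> [30.0, 60.0, 120.0, 3600.0]  (attempt 8 would be 3840s, so it caps)
```

Adding random jitter (e.g., multiplying by a factor drawn from [0.5, 1.5]) is a common refinement so that a burst of simultaneous failures does not retry in lockstep.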

Result Backend

Optional result storage allows callers to retrieve task return values. When a worker completes a task, it writes the result (serialized as JSON or MessagePack) to a result store keyed by task_id. Redis is used for short TTL results (minutes to hours) — SET task_result:{task_id} {result} EX {ttl}. For long-lived results, a database (PostgreSQL) or object storage (S3) is used. Callers poll for results: GET task_result:{task_id}. Async result patterns (webhook callbacks, promise chaining) are preferred over polling — the worker calls a callback URL on completion, eliminating the polling overhead.
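The TTL semantics of `SET ... EX` plus `GET` can be modeled in a few lines; this is a toy in-memory stand-in (with an injected clock, purely for illustration), not the Redis client API:

```python
class ResultStore:
    """Toy stand-in for `SET task_result:{id} {result} EX {ttl}` + GET."""

    def __init__(self):
        self._data = {}  # task_id -> (serialized result, expires_at)

    def set(self, task_id, result, ttl, now):
        self._data[task_id] = (result, now + ttl)

    def get(self, task_id, now):
        entry = self._data.get(task_id)
        if entry is None:
            return None           # unknown id, or never stored
        result, expires_at = entry
        if now >= expires_at:
            del self._data[task_id]  # lazily expire, as Redis may
            return None
        return result
```

A caller polling before completion or after expiry sees the same thing — `None` — which is why pairing results with an explicit task status record (pending / succeeded / failed) is useful in practice.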

Database Design

Redis stores all queue state: ready queues (lists), in-progress lists, delayed tasks (sorted set), and short-term results (strings with TTL). PostgreSQL stores task definitions (task type registry, allowed parameters), execution history (task_id, status, attempts, worker_id, started_at, completed_at, error), and dead-letter queue contents (for operator inspection). Execution history is written asynchronously — workers batch writes every 5 seconds to reduce database load.
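The batched-history write can be sketched as a small buffer that trades up to one flush interval of history lag for far fewer database round trips (names illustrative; `flush` would wrap a bulk INSERT, and the clock is injected for clarity):

```python
class HistoryBuffer:
    """Buffers execution-history rows and flushes them in batches."""

    def __init__(self, flush, interval=5.0):
        self.flush = flush        # callable taking a list of rows
        self.interval = interval  # seconds between flushes
        self.rows = []
        self.last_flush = 0.0

    def record(self, row, now):
        self.rows.append(row)
        if now - self.last_flush >= self.interval:
            self.flush(self.rows)     # one bulk write instead of N
            self.rows = []
            self.last_flush = now
```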

For high-throughput environments, execution history is written to ClickHouse instead of PostgreSQL, enabling analytics queries (failure rates by task type, average execution time, worker utilization) that would be too slow on OLTP storage.

API Design
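The functional requirements reduce to two client-facing operations — enqueue a task (with priority, schedule, and retry policy) and fetch its result by task ID. A minimal client sketch, with all names and parameters illustrative rather than any real library's API:

```python
import uuid

class TaskQueueClient:
    """Illustrative client API: enqueue a task, look up its result."""

    def __init__(self, broker, results):
        self.broker = broker      # e.g. the Redis-backed ready queue
        self.results = results    # e.g. the result store (dict-like)

    def enqueue(self, task_type, args, priority="medium",
                run_at=None, max_attempts=5):
        task_id = str(uuid.uuid4())
        self.broker.push({
            "task_id": task_id, "type": task_type, "args": args,
            "priority": priority, "run_at": run_at,   # None = run now
            "max_attempts": max_attempts, "attempt": 0,
        })
        return task_id            # caller uses this id to poll results

    def result(self, task_id):
        return self.results.get(task_id)   # None until completed
```

Returning the task ID immediately (rather than blocking on execution) is what keeps enqueue latency in the single-digit milliseconds.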

Scaling & Bottlenecks

At 100,000 enqueues/sec, a single Redis node with AOF persistence may bottleneck on fsync latency. Mitigation: use AOF with everysec (fsync once per second rather than every write) to increase write throughput at the cost of up to 1 second of data loss on crash; or partition queues across multiple Redis instances (hash task type to Redis shard). Worker scaling is linear — add more worker processes/containers. Workers are stateless consumers; the broker absorbs elasticity.
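The queue-partitioning mitigation amounts to a stable hash from task type to Redis instance; a sketch (function name illustrative) that keeps all tasks of one type on one shard, preserving per-type FIFO order:

```python
import hashlib

def shard_for(task_type: str, num_shards: int) -> int:
    """Stable hash of the task type -> Redis shard index."""
    digest = hashlib.md5(task_type.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

One caveat: hashing by task type means a single very hot task type still lands on one shard, so skewed workloads may need a finer-grained key.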

The in-progress monitoring (reaper) becomes expensive with many in-flight tasks. At 100,000 in-flight tasks, scanning the processing list every second is O(n). Mitigation: use a sorted set for in-progress tasks keyed by visibility_timeout_at, so the reaper only scans expired entries — ZRANGEBYSCORE in-progress_set 0 now() returns only expired tasks in O(log N + k) time.
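The deadline-ordered structure can be sketched with a sorted Python list standing in for the Redis sorted set — expired entries always sit at the front, so the reaper touches only the k tasks that actually timed out:

```python
import bisect

class ProcessingSet:
    """Sorted-set stand-in: in-progress tasks keyed by visibility
    deadline, so the reaper scans only expired entries."""

    def __init__(self):
        self._entries = []   # kept sorted as (deadline, task_id)

    def add(self, task_id, deadline):
        bisect.insort(self._entries, (deadline, task_id))

    def pop_expired(self, now):
        # Analogue of ZRANGEBYSCORE in_progress 0 now (+ removal):
        # expired entries are a prefix of the sorted list.
        expired = []
        while self._entries and self._entries[0][0] <= now:
            expired.append(self._entries.pop(0)[1])
        return expired
```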

Key Trade-offs

  • Redis vs. Kafka broker: Redis is fast (<1ms enqueue) and simple but loses data on crash without persistence; Kafka is durable and replayable but adds 5-10ms latency and operational complexity
  • At-least-once vs. exactly-once: At-least-once is standard (tasks may run twice if worker crashes after execution but before ack); exactly-once requires idempotent tasks + transactional ack, which is very hard to guarantee
  • Push vs. pull task dispatch: Workers pulling from the queue (BRPOPLPUSH) are self-regulating (they pull at their own pace); push would require the broker to track worker capacity — pull is universally used in task queue systems
  • Prefetch count: Workers prefetch multiple tasks (fetch N tasks at once) to reduce broker round trips; high prefetch improves throughput but means a worker crash loses N tasks from its prefetch buffer — usually acceptable since at-least-once guarantees eventual re-execution
