System Design: Logistics & Package Tracking
Design a package tracking system for logistics carriers — covering scan event ingestion, status aggregation, customer notifications, and carrier API integration at scale.
Requirements
Functional Requirements:
- Packages are scanned at each facility and the scan event updates tracking status
- Customers can look up current status and full scan history by tracking number
- Proactive notifications (SMS, email, push) sent at key milestones
- Estimated delivery date computed and updated after each scan
- Carriers (FedEx, UPS, USPS) integrated via webhook or polling APIs
- Proof of delivery captured via signature or photo at final delivery
Non-Functional Requirements:
- Scan events reflected in tracking status within 5 seconds of facility scan
- Support 500 million tracked packages with 10 billion scan events/year
- API response for tracking lookup under 200ms at the 99th percentile
- Notification delivery: SMS within 30 seconds of milestone scan
- 99.95% availability — tracking unavailability damages carrier brand trust
Scale Estimation
500 million packages in active transit at peak (holiday season); each scanned ~20 times = 10 billion scan events/year = ~317 events/second average; peak (pre-Christmas) ~3,000 events/second. Each event: ~300 bytes (tracking_number, facility_id, timestamp, status_code, employee_id) = ~95 KB/second average, ~900 KB/second peak ingest bandwidth. Storage: 10 billion × 300 bytes = 3 TB/year for raw scan events. Tracking lookups: 200 million/day (customers checking status) = ~2,300 requests/second, peaking at 10,000/second.
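A quick back-of-the-envelope check of these figures (a minimal sketch; the constants are simply the estimates above):

```python
# Back-of-the-envelope check of the scale estimates above.
SECONDS_PER_YEAR = 365 * 24 * 3600             # ~31.5 million

scan_events_per_year = 10_000_000_000          # 500M packages x ~20 scans each
avg_events_per_sec = scan_events_per_year / SECONDS_PER_YEAR   # ~317/s
peak_events_per_sec = 3_000                    # pre-Christmas estimate

event_size_bytes = 300
avg_ingest_bps = avg_events_per_sec * event_size_bytes          # ~95 KB/s
peak_ingest_bps = peak_events_per_sec * event_size_bytes        # ~900 KB/s

raw_storage_per_year = scan_events_per_year * event_size_bytes  # ~3 TB

lookups_per_day = 200_000_000
avg_lookups_per_sec = lookups_per_day / 86_400                  # ~2,300/s

print(f"{avg_events_per_sec:,.0f} scans/s avg, {avg_ingest_bps / 1e3:.0f} KB/s avg ingest")
print(f"{peak_ingest_bps / 1e3:.0f} KB/s peak ingest, {raw_storage_per_year / 1e12:.1f} TB/yr raw")
print(f"{avg_lookups_per_sec:,.0f} lookups/s avg")
```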
High-Level Architecture
The platform connects two worlds: the physical scan infrastructure (warehouse scanners, handheld RFID readers, driver apps) and the digital customer-facing tracking experience. The core pipeline moves scan events from physical devices into a durable event store, aggregates them into current package status, and triggers downstream notifications and EDD (estimated delivery date) updates.
Scan events arrive from multiple sources: handheld barcode scanners at sorting facilities (batch upload every minute), driver handheld devices (real-time over cellular), and carrier API webhooks (for third-party carrier packages). All sources publish to a Kafka topic (scan_events) which is the system of record for raw scans. A Scan Processor service consumes from Kafka, validates events (checks tracking number format, deduplicates by tracking_number + facility_id + timestamp), and writes to Cassandra (raw scan history) and Redis (current status).
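As a concrete sketch of the event flow, the snippet below models a scan event and publishes it to the scan_events topic with kafka-python. The field names and topic name follow the description above; the broker address, producer settings, and the sample values are illustrative assumptions.

```python
import json
from dataclasses import dataclass, asdict
from kafka import KafkaProducer  # kafka-python

@dataclass
class ScanEvent:
    tracking_number: str
    facility_id: str
    timestamp: str        # ISO-8601 UTC scan time
    status_code: str      # e.g. IN_TRANSIT, OUT_FOR_DELIVERY
    employee_id: str

# Broker address is an illustrative assumption, not part of the design above.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = ScanEvent("1Z999AA10123456784", "MEM-HUB-04",
                  "2024-12-18T14:03:22Z", "IN_TRANSIT", "E48213")

# Key by tracking_number so all scans for a package land on one partition
# and are consumed in order by the Scan Processor.
producer.send("scan_events", key=event.tracking_number.encode(), value=asdict(event))
producer.flush()
```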
Core Components
Scan Ingestion Pipeline
Facility scanners batch-upload scan events to an S3 bucket every 60 seconds (CSV files). An S3 event notification triggers a Lambda that reads new files, parses and validates each scan, and publishes to Kafka. Carrier API webhooks post directly to a REST endpoint that validates the HMAC signature and publishes to the same Kafka topic. Driver app scans use a mobile SDK that buffers scans locally and flushes them to a REST endpoint over cellular. Deduplication uses a Bloom filter in Redis (keyed by the SHA256 of tracking_number + facility_id + timestamp) before the Cassandra write.
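A minimal sketch of the webhook signature check and the Redis-backed dedup guard. The Bloom-filter call assumes the RedisBloom module is loaded; the connection details, shared-secret handling, and filter key name are illustrative assumptions.

```python
import hashlib
import hmac
import redis

r = redis.Redis(host="localhost", port=6379)  # illustrative connection details

def verify_carrier_signature(body: bytes, signature_hex: str, shared_secret: bytes) -> bool:
    """Validate the HMAC signature on a carrier webhook before accepting the payload."""
    expected = hmac.new(shared_secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

def is_duplicate_scan(tracking_number: str, facility_id: str, timestamp: str) -> bool:
    """Dedup key = SHA256(tracking_number + facility_id + timestamp), checked against a
    Redis Bloom filter (requires the RedisBloom module). BF.ADD returns 0 when the item
    was probably already seen, so the caller can skip the Cassandra write."""
    digest = hashlib.sha256(f"{tracking_number}{facility_id}{timestamp}".encode()).hexdigest()
    added = r.bf().add("scan_dedup", digest)
    return added == 0
```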
Status Aggregation Service
For each scan event, the Status Aggregator computes the current package status by applying the scan to a status FSM: LABEL_CREATED → PICKED_UP → IN_TRANSIT (repeating) → OUT_FOR_DELIVERY → DELIVERED | ATTEMPTED_DELIVERY | EXCEPTION. The current status is written to a Redis hash keyed by tracking_number with the full scan history stored in Cassandra. The aggregator also calls the EDD Service to recompute the estimated delivery date based on current location, destination, and historical transit times for this lane.
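The status FSM can be captured as a small transition table. The sketch below is one way to encode it; the handling of out-of-order or illegal scans (ignore and keep the current status) is an assumption not spelled out above.

```python
# Allowed transitions of the package-status FSM described above.
TRANSITIONS = {
    "LABEL_CREATED":      {"PICKED_UP"},
    "PICKED_UP":          {"IN_TRANSIT"},
    "IN_TRANSIT":         {"IN_TRANSIT", "OUT_FOR_DELIVERY", "EXCEPTION"},
    "OUT_FOR_DELIVERY":   {"DELIVERED", "ATTEMPTED_DELIVERY", "EXCEPTION"},
    "ATTEMPTED_DELIVERY": {"OUT_FOR_DELIVERY", "EXCEPTION"},
    "EXCEPTION":          {"IN_TRANSIT", "OUT_FOR_DELIVERY"},
    "DELIVERED":          set(),  # terminal state
}

def apply_scan(current_status: str, scan_status: str) -> str:
    """Return the new package status, ignoring scans that would move the FSM backwards
    (e.g. a late-arriving IN_TRANSIT scan after DELIVERED)."""
    if scan_status in TRANSITIONS.get(current_status, set()):
        return scan_status
    return current_status  # illegal transition: keep current status (real code would log it)
```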
Notification Service
The Notification Service subscribes to a Kafka topic of milestone scan events (PICKED_UP, OUT_FOR_DELIVERY, DELIVERED, EXCEPTION). For each milestone, it looks up notification preferences for the shipper and recipient, then dispatches: email via SendGrid, SMS via Twilio, push via Firebase Cloud Messaging. Notifications are templated per milestone with personalized tracking URL. Delivery receipts are tracked in PostgreSQL; failed deliveries are retried with exponential backoff up to 3 times.
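A sketch of the milestone fan-out with exponential backoff; the retry limit matches the three attempts mentioned above, while the channel send functions (wrappers around SendGrid, Twilio, FCM) and the tracking URL domain are stand-in assumptions.

```python
import time

MILESTONES = {"PICKED_UP", "OUT_FOR_DELIVERY", "DELIVERED", "EXCEPTION"}

def send_with_retry(send_fn, payload, max_attempts: int = 3) -> bool:
    """Retry a channel send up to 3 times with exponential backoff (1s, 2s, 4s)."""
    for attempt in range(max_attempts):
        try:
            send_fn(payload)
            return True
        except Exception:
            time.sleep(2 ** attempt)
    return False  # recorded as a failed delivery receipt in PostgreSQL

def notify(scan_event: dict, prefs: dict, channels: dict) -> None:
    """Fan a milestone scan out to each channel the shipper/recipient opted into.
    `prefs` maps tracking_number -> list of channel names; `channels` maps channel
    name -> provider send function (SendGrid/Twilio/FCM wrappers, not shown here)."""
    if scan_event["status_code"] not in MILESTONES:
        return
    for channel in prefs.get(scan_event["tracking_number"], []):
        payload = {
            "tracking_number": scan_event["tracking_number"],
            "milestone": scan_event["status_code"],
            # Placeholder domain for the personalized tracking URL.
            "tracking_url": f"https://track.example.com/{scan_event['tracking_number']}",
        }
        send_with_retry(channels[channel], payload)
```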
Database Design
Scan history in Cassandra: partition key = tracking_number, clustering key = scan_timestamp DESC. A single-partition read returns all scans for a package ordered newest-first. Current status in Redis: HSET tracking:{tracking_number} → {status, last_location, edd, last_updated}. Package master data (shipper, recipient, service_type, weight) in PostgreSQL sharded by tracking_number prefix. EDD model outputs stored in Redis with 1-hour TTL; recomputed on each scan. Proof of delivery images stored in S3, with the CDN URL stored in the Cassandra scan record.
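The key layout described above might look like the following; the keyspace and table names, contact points, and the cassandra-driver / redis-py wiring are illustrative assumptions.

```python
from cassandra.cluster import Cluster   # DataStax cassandra-driver
import redis

# Scan history: one partition per tracking number, newest scan first.
# Assumes the `tracking` keyspace already exists.
SCAN_HISTORY_DDL = """
CREATE TABLE IF NOT EXISTS tracking.scan_history (
    tracking_number text,
    scan_timestamp  timestamp,
    facility_id     text,
    status_code     text,
    employee_id     text,
    pod_image_url   text,          -- CDN URL for the proof-of-delivery photo/signature
    PRIMARY KEY ((tracking_number), scan_timestamp)
) WITH CLUSTERING ORDER BY (scan_timestamp DESC)
"""

session = Cluster(["127.0.0.1"]).connect()   # contact point is illustrative
session.execute(SCAN_HISTORY_DDL)

# Current status: one Redis hash per tracking number, as described above.
r = redis.Redis()
r.hset("tracking:1Z999AA10123456784", mapping={
    "status": "IN_TRANSIT",
    "last_location": "MEM-HUB-04",
    "edd": "2024-12-21",
    "last_updated": "2024-12-18T14:03:22Z",
})
```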
API Design
- GET /v1/track/{tracking_number} — Returns current status, last location, EDD, and full scan history (example response sketched after this list); 200ms SLA backed by Redis + Cassandra reads
- POST /v1/scans — Internal endpoint for scan ingestion from facility batch uploads and driver apps; accepts array of scan events, returns accepted/rejected counts
- POST /v1/webhooks/carrier/{carrier_id} — Receives scan event webhooks from third-party carriers (FedEx, UPS) with HMAC validation
- POST /v1/notifications/subscribe — Recipients or shippers subscribe to email/SMS notifications for a tracking number with preferred milestone triggers
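To make the tracking lookup concrete, here is a hedged sketch of calling GET /v1/track and the response shape it might return. The host name, exact field names, and sample payload are assumptions consistent with the data model above, not a documented contract.

```python
import requests  # illustrative client; api.example.com is a placeholder host

resp = requests.get("https://api.example.com/v1/track/1Z999AA10123456784", timeout=2)
resp.raise_for_status()
tracking = resp.json()

# Assumed shape, assembled from the Redis current-status hash plus the
# Cassandra scan-history partition:
# {
#   "tracking_number": "1Z999AA10123456784",
#   "status": "OUT_FOR_DELIVERY",
#   "last_location": "Springfield, IL delivery station",
#   "estimated_delivery": "2024-12-21",
#   "scans": [
#     {"timestamp": "2024-12-18T14:03:22Z", "facility_id": "MEM-HUB-04",
#      "status_code": "IN_TRANSIT"},
#     ...
#   ]
# }
print(tracking["status"], tracking.get("estimated_delivery"))
```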
Scaling & Bottlenecks
Peak holiday season creates roughly a 10× traffic spike over normal peaks (3,000 scan events/second, 100,000 tracking lookups/second). The Kafka ingestion layer scales by adding partitions and consumer instances. The Redis current-status layer is the hot read path: 100,000 lookups/second is achievable with a Redis cluster (6 shards × 2 replicas, each handling ~8,000 reads/second). For public tracking pages, a CDN (CloudFront) caches tracking responses for 10 seconds, reducing Redis load by ~70% since most customers repeatedly refresh the same page.
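One way to get CloudFront to hold tracking responses for the 10-second window is to set Cache-Control on the origin response. The Flask wiring and the stubbed status lookup below are illustrative assumptions about how the tracking API is served.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def lookup_current_status(tracking_number: str) -> dict:
    # Stub standing in for the Redis HGETALL of tracking:{tracking_number}.
    return {"tracking_number": tracking_number, "status": "IN_TRANSIT"}

@app.get("/v1/track/<tracking_number>")
def track(tracking_number: str):
    resp = jsonify(lookup_current_status(tracking_number))
    # CloudFront honors origin Cache-Control: shared caches may reuse this response
    # for 10 seconds, absorbing repeated refreshes of the same tracking page.
    resp.headers["Cache-Control"] = "public, s-maxage=10, max-age=10"
    return resp
```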
Cassandra scan history handles 3,000 writes/second easily across a 6-node cluster. Read latency for full scan history is ~5ms (single partition scan). The EDD computation is the latency risk: calling an ML model for each scan event at 3,000/second requires an async decoupled approach — EDD is computed in a background job and cached in Redis, not in the synchronous scan processing path.
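A sketch of that decoupled EDD path: the Scan Processor only enqueues a recompute request, and a background worker writes the estimate to Redis with the 1-hour TTL mentioned earlier. The topic name, key format, broker address, and the placeholder model call are assumptions.

```python
import json
import redis
from kafka import KafkaConsumer  # kafka-python

r = redis.Redis()
consumer = KafkaConsumer(
    "edd_requests",                      # assumed topic fed by the Scan Processor
    bootstrap_servers="kafka:9092",      # illustrative broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def predict_edd(current_facility: str, destination_zip: str) -> str:
    """Placeholder for the ML model that uses historical lane transit times."""
    return "2024-12-21"

for msg in consumer:
    req = msg.value
    edd = predict_edd(req["current_facility"], req["destination_zip"])
    # Cache the estimate out of band so the synchronous scan path never waits on the model.
    r.setex(f"edd:{req['tracking_number']}", 3600, edd)
```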
Key Trade-offs
- Cassandra vs. DynamoDB for scan history — Cassandra's partition-per-tracking-number model perfectly matches the access pattern; DynamoDB would work equally well with on-demand capacity mode for spiky holiday loads
- Polling vs. webhooks for carrier integration — webhooks provide real-time updates but require carrier trust and reliability; polling every 60 seconds is more reliable but adds latency; most carriers now provide webhooks with polling fallback
- Cache TTL for current status — 10-second CDN TTL balances freshness (packages move through facilities quickly during the holiday crunch) with cost; for OUT_FOR_DELIVERY packages, the TTL rises to 30 seconds since no new scan is expected until the delivery attempt
- EDD as commitment vs. estimate — showing EDD prominently increases customer satisfaction but creates support burden when missed; uncertainty bands (e.g., "by end of day Tuesday") are more accurate but less satisfying