System Design: Subscription Commerce Platform
System design of a subscription commerce platform handling recurring billing, flexible subscription management, and predictive inventory for curated subscription boxes.
Requirements
Functional Requirements:
- Users subscribe to recurring product deliveries (weekly, monthly, quarterly)
- Flexible subscription management: pause, skip, swap products, change frequency
- Multiple subscription types: curated boxes, replenishment (same product), access-based (membership)
- Recurring billing with automatic payment retry on failure
- Pre-shipment notifications with option to customize upcoming box
- Referral program and gift subscriptions
Non-Functional Requirements:
- Support 5M active subscriptions with 2M billing events/month
- Billing must complete within a 48-hour window (batch processing)
- Zero double-charges — idempotent billing operations
- 99.9% availability for subscription management; 99.99% for billing
- PCI DSS compliant payment token storage
- Involuntary churn recovery: retry failed payments with smart scheduling
Scale Estimation
Per-second rates below assume a 30-day month.
- Active subscriptions: 5M
- Billing volume: 2M billing events/month (subscriptions bill weekly, monthly, or quarterly)
- Billing throughput: 2M payments over 48 hours = 11.6 payments/sec on average; compressed into an 8-hour processing window = 69 payments/sec
- Subscription management operations: 500K changes/month (pause, skip, swap) = 0.2 ops/sec
- Product catalog for subscription boxes: 10K SKUs
- Order generation: 2M orders/month = 0.77 orders/sec
- Storage: 5M subscriptions × 2KB = 10GB; billing history over 3 years = 72M records × 500 bytes = 36GB
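The arithmetic behind these estimates can be reproduced in a few lines. This is only a back-of-envelope check: the 30-day month and decimal units (1 KB = 1,000 bytes) are assumptions chosen to match the figures quoted above.

```python
# Back-of-envelope check of the scale estimates.
# Assumptions: 30-day month, decimal units (1 KB = 1,000 bytes).
SECONDS_PER_HOUR = 3600
SECONDS_PER_MONTH = 30 * 24 * SECONDS_PER_HOUR

active_subscriptions = 5_000_000
billing_events_per_month = 2_000_000
changes_per_month = 500_000

rate_48h = billing_events_per_month / (48 * SECONDS_PER_HOUR)   # ~11.6 payments/sec
rate_8h = billing_events_per_month / (8 * SECONDS_PER_HOUR)     # ~69.4 payments/sec
management_rate = changes_per_month / SECONDS_PER_MONTH         # ~0.19 ops/sec
order_rate = billing_events_per_month / SECONDS_PER_MONTH       # ~0.77 orders/sec

subscription_storage_gb = active_subscriptions * 2_000 / 1e9    # 2 KB per subscription
billing_records = billing_events_per_month * 36                 # 36 months of history
billing_storage_gb = billing_records * 500 / 1e9                # 500 bytes per record

print(f"48h window: {rate_48h:.1f}/sec, 8h window: {rate_8h:.1f}/sec")
print(f"management: {management_rate:.2f}/sec, orders: {order_rate:.2f}/sec")
print(f"storage: {subscription_storage_gb:.0f} GB + {billing_storage_gb:.0f} GB")
```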
High-Level Architecture
The subscription platform has three core subsystems: Subscription Management, Billing Engine, and Fulfillment Orchestrator. The Subscription Management Service handles the lifecycle: creation, modification, pausing, and cancellation. It stores subscription state in PostgreSQL and exposes APIs for self-service management. A Subscription Scheduler (cron-based) runs daily, identifying subscriptions due for their next cycle and creating billing jobs.
The Billing Engine processes billing jobs in batch: it reads due subscriptions, generates invoices, and charges payment methods via Stripe (using stored payment tokens). The engine implements a retry strategy for failed payments: retry after 3 days, then 7 days, then 14 days with increasing urgency in dunning emails. After 3 failed retries, the subscription is paused (not cancelled) to allow recovery.
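The fixed retry schedule can be sketched as a small function. The text does not say whether each interval is counted from the original failure or from the previous attempt; this sketch assumes the previous attempt. The names next_retry_date and RETRY_DELAYS_DAYS are illustrative, not part of the described system.

```python
from datetime import date, timedelta
from typing import Optional

# Fixed retry delays from the design: 3 days, then 7, then 14.
# Assumption: each delay is measured from the previous attempt.
RETRY_DELAYS_DAYS = [3, 7, 14]


def next_retry_date(last_attempt: date, failed_retries: int) -> Optional[date]:
    """Return the date of the next retry, or None when retries are exhausted.

    failed_retries is the number of retries that have already failed
    (0 immediately after the initial charge fails).
    """
    if failed_retries >= len(RETRY_DELAYS_DAYS):
        return None  # caller pauses the subscription instead of cancelling it
    return last_attempt + timedelta(days=RETRY_DELAYS_DAYS[failed_retries])
```

A caller would treat a None result as the signal to move the subscription to the paused state and send the final dunning email.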
The Fulfillment Orchestrator converts confirmed billing cycles into orders. For curated boxes, it runs a Box Curation algorithm that selects products based on subscriber preferences, past boxes (avoiding repeats), and current inventory levels. Generated orders are sent to the Order Management System for warehouse fulfillment.
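The source does not specify how the Box Curation algorithm weighs its inputs, so the following is only a minimal sketch of one possible rule: drop out-of-stock and previously shipped products, then prefer the subscriber's categories and better-stocked items. The Product fields and the curate_box function are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Product:
    sku: str
    category: str
    stock: int


def curate_box(catalog: Iterable[Product], preferred_categories: Set[str],
               past_skus: Set[str], box_size: int) -> List[str]:
    """Pick SKUs for one box using a simple filter-then-rank rule."""
    # Filter: must be in stock and must not repeat a past box.
    candidates = [p for p in catalog if p.stock > 0 and p.sku not in past_skus]
    # Rank: preferred categories first, then higher stock; SKU breaks ties
    # so the result is deterministic.
    candidates.sort(key=lambda p: (p.category not in preferred_categories,
                                   -p.stock, p.sku))
    return [p.sku for p in candidates[:box_size]]
```

A production version would likely also reserve inventory as boxes are generated so that concurrent curation runs do not oversell a SKU.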
Core Components
Subscription Lifecycle Manager
The lifecycle manager implements a state machine: active → paused → active, active → cancelled, and active → past_due (payment failed). A past_due subscription returns to active when a retry succeeds, or moves to paused once retries are exhausted. State transitions emit events to Kafka for downstream processing. Key operations: skip_next_cycle (sets a skip flag on the next billing date and advances to the following cycle), swap_product (for replenishment subscriptions, changes the product_id effective next cycle), and change_frequency (recalculates the billing schedule). Each modification records the previous state so it can be undone (1-click undo within 24 hours).
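A transition table is one straightforward way to enforce such a state machine. In this sketch the emit callable stands in for a Kafka producer, and the event shape is an assumption. Transitions not mentioned in the text (for example paused → cancelled) are deliberately left out.

```python
from enum import Enum
from typing import Callable, Dict, Set


class Status(Enum):
    ACTIVE = "active"
    PAUSED = "paused"
    PAST_DUE = "past_due"
    CANCELLED = "cancelled"


ALLOWED: Dict[Status, Set[Status]] = {
    Status.ACTIVE: {Status.PAUSED, Status.CANCELLED, Status.PAST_DUE},
    Status.PAUSED: {Status.ACTIVE},
    # past_due exits follow the retry policy: recover on a successful retry,
    # pause when retries run out.
    Status.PAST_DUE: {Status.ACTIVE, Status.PAUSED},
    Status.CANCELLED: set(),
}


class InvalidTransition(Exception):
    pass


def transition(current: Status, target: Status,
               emit: Callable[[dict], None]) -> Status:
    """Validate a state change and emit an event describing it."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    # Recording the previous state in the event is what makes undo possible.
    emit({"from": current.value, "to": target.value})
    return target
```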
Idempotent Billing Engine
The billing engine ensures zero double-charges using idempotency keys. Each billing cycle generates a unique key: billing:{subscription_id}:{cycle_date}. Before processing, the engine checks whether this key already exists in the billing_cycles table, where the idempotency_key column carries a UNIQUE constraint. If it exists with status 'success', the charge is skipped. If it exists with status 'failed', a retry is attempted. The Stripe charge API is called with the same idempotency key, which Stripe also deduplicates on its side. This double-layer idempotency prevents duplicates even in edge cases, such as a network timeout that occurs after a successful charge but before the result is recorded.
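The two layers can be illustrated with an in-memory SQLite table and a fake gateway. Everything here is a stand-in: FakeGateway only mimics key-based deduplication and is not the Stripe client, the table is a trimmed-down version of the billing record, and the failure and retry paths are omitted for brevity.

```python
import sqlite3


class FakeGateway:
    """Stand-in for the payment provider; deduplicates on idempotency key."""

    def __init__(self):
        self.charges = {}

    def charge(self, amount_cents, idempotency_key):
        # A repeated key returns the original charge instead of a new one.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = f"ch_{len(self.charges) + 1}"
        return self.charges[idempotency_key]


def bill_cycle(db, gateway, subscription_id, cycle_date, amount_cents):
    key = f"billing:{subscription_id}:{cycle_date}"
    # Layer 1: the UNIQUE constraint guarantees at most one row per cycle.
    db.execute(
        "INSERT OR IGNORE INTO billing_cycles "
        "(idempotency_key, amount_cents, payment_status) VALUES (?, ?, 'pending')",
        (key, amount_cents),
    )
    status = db.execute(
        "SELECT payment_status FROM billing_cycles WHERE idempotency_key = ?",
        (key,),
    ).fetchone()[0]
    if status == "success":
        return "skipped"
    # Layer 2: the gateway deduplicates on the same key, which covers a crash
    # that happened after charging but before the result was recorded.
    charge_id = gateway.charge(amount_cents, idempotency_key=key)
    db.execute(
        "UPDATE billing_cycles SET payment_status = 'success', charge_id = ? "
        "WHERE idempotency_key = ?",
        (charge_id, key),
    )
    db.commit()
    return "charged"


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE billing_cycles (idempotency_key TEXT UNIQUE, "
    "amount_cents INTEGER, payment_status TEXT, charge_id TEXT)"
)
gateway = FakeGateway()
first = bill_cycle(db, gateway, "sub_123", "2024-06-01", 2999)
second = bill_cycle(db, gateway, "sub_123", "2024-06-01", 2999)
```

Here the second call is skipped because the row is already marked 'success'. With concurrent workers, a real implementation would also need row locking or a conditional update so that two workers cannot both act on the same pending row.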
Smart Dunning System
Payment failures cause involuntary churn — the #1 churn driver for subscription businesses. The dunning system uses ML to optimize retry timing. A gradient-boosted model trained on historical retry outcomes predicts the best retry time based on: day of week (payday patterns), card type (debit cards more likely to fail near month-end), failure reason (insufficient funds vs. expired card — different strategies), and customer tenure. Retries are scheduled at the predicted optimal time rather than fixed intervals. A/B tests show this recovers 15% more failed payments than fixed retry schedules.
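How a trained model might plug into retry scheduling can be sketched as follows. The predict_success callable is a hypothetical wrapper around the model, the feature names are illustrative, and the 14-day search horizon and 3-day fallback are assumptions rather than details from the source.

```python
from datetime import date, timedelta
from typing import Callable, Dict, Optional

Features = Dict[str, object]
FALLBACK_DELAY_DAYS = 3  # assumed first step of a fixed schedule


def choose_retry_date(failed_on: date, failure_reason: str, card_type: str,
                      tenure_months: int,
                      predict_success: Optional[Callable[[Features], float]] = None,
                      horizon_days: int = 14) -> date:
    """Pick the candidate day with the highest predicted success probability."""
    if predict_success is None:
        # No usable model for this segment: use the fixed delay instead.
        return failed_on + timedelta(days=FALLBACK_DELAY_DAYS)

    def score(day: date) -> float:
        return predict_success({
            "day_of_week": day.weekday(),   # payday patterns
            "day_of_month": day.day,        # month-end effects
            "card_type": card_type,
            "failure_reason": failure_reason,
            "tenure_months": tenure_months,
        })

    candidates = [failed_on + timedelta(days=d) for d in range(1, horizon_days + 1)]
    # max() keeps the earliest day when several days tie for the best score.
    return max(candidates, key=score)
```

Passing predict_success=None models the case where too little historical data exists to trust the model's output.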
Database Design
PostgreSQL schema:
- subscriptions table: subscription_id UUID PK, user_id, plan_id FK, status ENUM('active', 'paused', 'past_due', 'cancelled'), frequency ENUM('weekly', 'biweekly', 'monthly', 'quarterly'), next_billing_date DATE, payment_token_id FK, shipping_address_id FK, preferences JSONB, created_at, cancelled_at
- billing_cycles table: cycle_id UUID PK, subscription_id FK, billing_date DATE, amount DECIMAL, currency, payment_status ENUM('pending', 'success', 'failed', 'retrying'), stripe_charge_id, retry_count INT, next_retry_at TIMESTAMP, idempotency_key VARCHAR UNIQUE
- subscription_items table: item_id, subscription_id FK, product_id, quantity, added_at
Payment tokens are stored in a PCI-compliant vault (Stripe's tokenization — the platform stores only the Stripe payment_method_id, never raw card numbers). A separate analytics table in ClickHouse tracks subscription metrics: MRR (monthly recurring revenue), churn rate, LTV (lifetime value), and cohort retention curves.
API Design
- POST /api/v1/subscriptions: Create a subscription; body contains plan_id, payment_method, shipping_address, preferences; returns subscription_id
- PATCH /api/v1/subscriptions/{id}: Modify subscription; body can contain frequency, preferences, shipping_address, or action (pause/resume/skip)
- GET /api/v1/subscriptions/{id}/upcoming: Preview next box contents and billing date; allows customization before cutoff
- POST /api/v1/subscriptions/{id}/cancel: Cancel subscription; body contains reason (used for churn analysis); triggers retention offer flow
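As an illustration of the PATCH endpoint, the snippet below builds (but does not send) a request that skips the next cycle. The host api.example.com, the bearer-token header, and the exact JSON shape of the action field are assumptions; the source only names the fields.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical host


def skip_next_cycle_request(subscription_id: str, token: str) -> Request:
    """Build a PATCH request asking the platform to skip the next cycle."""
    body = json.dumps({"action": "skip"}).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/api/v1/subscriptions/{subscription_id}",
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


req = skip_next_cycle_request("sub_123", "test-token")
print(req.get_method(), req.full_url)
```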
Scaling & Bottlenecks
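The billing batch window is the primary scaling challenge. Processing 2M payments in 8 hours requires about 69 charges/sec, and with each Stripe API call taking 1-2 seconds that rate cannot be reached without parallelism. The system uses 50 billing worker processes consuming from a Kafka topic billing-jobs (50 partitions). A worker that issues charges strictly one at a time completes only 0.5-1 charges/sec, so 50 such workers would deliver 25-50 charges/sec, which is below the requirement. Reaching the targeted ~25 charges/sec per worker (accounting for Stripe API latency and retries) requires each worker to keep roughly 25-50 charge requests in flight concurrently, for example with a thread pool or async I/O. At that rate, 50 workers × 25/sec = 1,250 charges/sec of capacity, well above the 69/sec requirement, which leaves headroom for retry processing, subject to the payment provider's rate limits.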
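Worker sizing can be sanity-checked with Little's law (requests in flight = throughput × latency). The helper names below are illustrative; the 1-2 second latency and the 69/sec target come from the text.

```python
import math


def required_concurrency(target_rate_per_sec: float, latency_sec: float) -> int:
    """Little's law: in-flight requests needed to sustain a target rate."""
    return math.ceil(target_rate_per_sec * latency_sec)


def sequential_throughput(workers: int, latency_sec: float) -> float:
    """Aggregate rate when each worker issues one blocking call at a time."""
    return workers / latency_sec


# Requests that must be in flight across the fleet to sustain 69 charges/sec.
print(required_concurrency(69, 1.0), required_concurrency(69, 2.0))
# What 50 strictly sequential workers deliver at 2 s and 1 s per call.
print(sequential_throughput(50, 2.0), sequential_throughput(50, 1.0))
```

Under these assumptions, 69-138 requests must be in flight at once, while 50 strictly sequential workers deliver only 25-50 charges/sec. Any per-worker rate above 1 charge/sec therefore implies concurrent calls inside each worker.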
Subscription schedule computation (determining which subscriptions are due today) runs as a daily cron job querying WHERE next_billing_date <= CURRENT_DATE AND status = 'active'. With a B-tree index on (status, next_billing_date), this query scans efficiently. The result set (up to 200K subscriptions due on a given day) is published to the billing-jobs Kafka topic in batches of 1,000.
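The batching step might look like the sketch below. The publish callable is a hypothetical stand-in for a Kafka producer call, and the subscription IDs would come from the due-date query described above.

```python
from typing import Callable, Iterable, Iterator, List


def batched(ids: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size IDs."""
    batch: List[str] = []
    for sub_id in ids:
        batch.append(sub_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def publish_due_subscriptions(due_ids: Iterable[str],
                              publish: Callable[[List[str]], None],
                              batch_size: int = 1000) -> int:
    """Send due subscription IDs downstream in batches; return the batch count."""
    count = 0
    for batch in batched(due_ids, batch_size):
        publish(batch)  # stand-in for producing to the billing-jobs topic
        count += 1
    return count
```

Streaming the query results through a generator keeps memory use flat even when 200K subscriptions are due on the same day.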
Key Trade-offs
- Batch billing over real-time: Processing all billing in a scheduled window simplifies retry logic and allows overnight processing when platform traffic is lower, but delays revenue recognition by up to 24 hours
- Double idempotency (app + Stripe): Belt-and-suspenders approach prevents double-charging even in extreme edge cases, at the cost of additional complexity and an extra database lookup per charge
- ML-optimized retry timing over fixed schedule: Recovers 15% more failed payments but requires training data and model maintenance — falls back to fixed schedule for new customer segments with insufficient data
- Pause over cancel for payment failures: Preserves the subscription relationship and allows recovery, but may confuse customers who expect failed subscriptions to auto-cancel