System Design: Podcast Platform
Design a scalable podcast platform supporting audio hosting, RSS feed management, episode streaming, subscription management, and listener analytics for hundreds of millions of podcast consumers.
Requirements
Functional Requirements:
- Podcast creators upload episodes (MP3/AAC, up to 2 hours); platform hosts audio and generates RSS feeds
- Listeners subscribe to podcasts, get new episode notifications, and stream or download episodes
- Playback tracking: resume position sync across devices; per-episode completion status
- Transcription: auto-generate searchable transcripts for all episodes
- Analytics for creators: downloads, listener retention curves, geographic reach, listening platforms
- Dynamic ad insertion: insert audio ads at pre/mid/post positions based on listener targeting
Non-Functional Requirements:
- Support 500 million monthly listeners and 5 million podcast shows
- Audio streaming latency under 2 seconds for episode start
- Episode upload processed and available within 5 minutes
- RSS feed updates reflected within 2 minutes of creator publish
- Download/play count reporting accurate to within 0.5% (IAB compliance)
Scale Estimation
500M monthly listeners × 8 episodes/month average × 45 minutes average = 180B minutes/month of audio consumed. At 128 Kbps average that works out to ~533 Gbps aggregate egress — almost entirely from CDN. Storage: 5M shows × 100 episodes average × 45 minutes × 128 Kbps ≈ 43 MB/episode = ~22 PB per bitrate variant, or roughly 80–100 PB including all three transcoded variants plus the original uploads. Play tracking: at roughly one tracking event per minute of listening, 180B events/month / (30 × 86,400) ≈ 69.4k tracking events/second. RSS feed reads: RSS aggregators and podcast apps poll feeds on a schedule; a popular podcast's RSS feed gets ~500k reads/day = 5.8 reads/second per show — trivial per show, but even at a floor of 5 reads/show/day across 5M shows the aggregate is ~289 reads/second.
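These figures can be sanity-checked with a few lines of arithmetic; the constants are the stated assumptions and everything else is derived:

```python
# Back-of-envelope check of the numbers above.
KBPS = 1_000  # bits per second

minutes_per_month = 500_000_000 * 8 * 45          # 180B minutes listened
bits_per_month = minutes_per_month * 60 * 128 * KBPS
egress_gbps = bits_per_month / (30 * 86_400) / 1e9
# ~533 Gbps average aggregate egress

episode_bytes = 45 * 60 * 128 * KBPS / 8          # ~43 MB at 128 Kbps
storage_pb = 5_000_000 * 100 * episode_bytes / 1e15
# ~22 PB per bitrate variant

events_per_sec = minutes_per_month / (30 * 86_400)
# ~69.4k tracking events/second at one event per minute listened

print(round(egress_gbps), round(storage_pb, 1), round(events_per_sec))
```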
High-Level Architecture
The platform separates the creator-facing management plane from the listener-facing consumption plane. Creators interact with a management API for episode upload, show configuration, and analytics. Listeners interact with a consumption API for discovery, subscriptions, streaming, and progress sync.
Audio pipeline: creators upload audio files to a multipart S3 upload endpoint. A transcoding service processes the upload: generates multiple bitrate versions (32 Kbps for bandwidth-limited listeners, 128 Kbps standard, 320 Kbps premium), creates chapters from embedded ID3 tags, and triggers the transcription service. Transcribed text is indexed in Elasticsearch for search. Processed audio is served from S3 via CloudFront CDN.
RSS feed management: podcast apps and aggregators consume standard RSS 2.0 + iTunes podcast extension feeds. Each show has a canonical RSS feed URL. The RSS service generates feeds dynamically from the episode database; feeds are cached at the CDN edge (5-minute TTL) and invalidated on episode publish via cache API. The RSS generator applies IAB podcast measurement guidelines: redirect audio file URLs through the platform's tracking domain (which logs the request and 302-redirects to the CDN URL) to count accurate plays.
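A minimal sketch of the tracking redirect described above, assuming a handler that receives the already-parsed request. `CDN_BASE` and the log-entry shape are illustrative, and byte counts needed for IAB filtering would come from CDN edge logs rather than this hop:

```python
import hashlib
import time

CDN_BASE = "https://cdn.example-podcasts.com"  # illustrative domain

def handle_tracking_request(episode_id: str, path: str, client_ip: str,
                            user_agent: str, log_sink: list) -> tuple:
    """Log the download request, then 302-redirect to the CDN copy."""
    log_sink.append({
        "episode_id": episode_id,
        # GDPR: store only a hash of the IP, never the raw address
        "ip_hash": hashlib.sha256(client_ip.encode()).hexdigest(),
        "user_agent": user_agent,
        "request_timestamp": int(time.time()),
    })
    return 302, CDN_BASE + path  # status code and Location header

log = []
status, location = handle_tracking_request(
    "ep-123", "/audio/ep-123/128.mp3", "203.0.113.7", "Overcast/3.0", log)
# status == 302; location points at the CDN; log holds one hashed entry
```

In production the log sink would be a Kinesis/Kafka producer, not an in-process list.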
Dynamic ad insertion (DAI): the platform supports server-side ad insertion (SSAI). Instead of serving a single monolithic MP3, the audio is stored as separate segments (intro, content segments, ad marker positions). At stream request time, the ad server selects ads for the listener based on targeting (geography, device, declared interests), stitches the audio segments with the selected ads, and returns a unified audio stream. DAI allows targeting without requiring client-side ad tracking (cookie-free).
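The stitching step might look like the following sketch; the `Ad` shape, the targeting rule, and the playlist encoding are all hypothetical simplifications of a real ad server:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Ad:
    ad_id: str
    geo: str  # target country code; "*" matches any listener

def select_ad(inventory: list, listener_geo: str) -> Optional[Ad]:
    """Naive targeting: first ad whose geography matches the listener."""
    for ad in inventory:
        if ad.geo in ("*", listener_geo):
            return ad
    return None

def build_playlist(content_segments: list, ad_breaks: set,
                   inventory: list, listener_geo: str) -> list:
    """Insert a selected ad before each segment index in ad_breaks;
    index len(content_segments) denotes a post-roll slot."""
    playlist = []
    for i, seg in enumerate(content_segments):
        if i in ad_breaks:
            ad = select_ad(inventory, listener_geo)
            if ad:
                playlist.append("ad:" + ad.ad_id)
        playlist.append(seg)
    if len(content_segments) in ad_breaks:
        ad = select_ad(inventory, listener_geo)
        if ad:
            playlist.append("ad:" + ad.ad_id)
    return playlist

inventory = [Ad("a1", "DE"), Ad("a2", "*")]
playlist = build_playlist(["seg0", "seg1", "seg2"], {0, 2}, inventory, "US")
# pre-roll before seg0 and mid-roll before seg2, both filled with ad a2
```

The stitcher would then concatenate the audio files behind this playlist into a single stream (or an HLS-style manifest).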
Core Components
Episode Processing Pipeline
On upload, the raw audio is written to S3 and a processing job is queued in SQS. The processor: (1) runs FFprobe to extract duration, bitrate, and chapter markers; (2) normalizes loudness to -16 LUFS (podcast standard); (3) transcodes to 3 quality variants; (4) generates a waveform visualization (JSON array of amplitude samples for player display); (5) sends audio to Whisper (self-hosted ASR model) for transcription — a 45-minute episode transcribes in ~3 minutes on GPU; (6) updates episode status to "published" and invalidates the RSS feed cache. The transcript is stored as a WebVTT file (for in-player caption display) and as plain text in Elasticsearch.
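The worker's control flow can be sketched as an ordered list of steps, with stubs standing in for the real FFprobe/FFmpeg/Whisper calls:

```python
def run_pipeline(episode: dict, steps: list) -> dict:
    """Run steps in order; any failure marks the episode 'failed' so the
    SQS message can be retried (or routed to a dead-letter queue)."""
    for step in steps:
        try:
            step(episode)
        except Exception as exc:
            episode["status"] = "failed"
            episode["error"] = f"{step.__name__}: {exc}"
            return episode
    episode["status"] = "published"  # would also invalidate the RSS cache
    return episode

# Stubs standing in for the real tools:
def probe_metadata(ep):     ep["duration_seconds"] = 2700   # ffprobe
def normalize_loudness(ep): ep["lufs"] = -16.0              # ffmpeg loudnorm
def transcode_variants(ep): ep["variants_kbps"] = [32, 128, 320]
def generate_waveform(ep):  ep["waveform_samples"] = 1800
def transcribe(ep):         ep["transcript_format"] = "WebVTT"  # Whisper

episode = {"episode_id": "ep-123", "status": "processing"}
run_pipeline(episode, [probe_metadata, normalize_loudness,
                       transcode_variants, generate_waveform, transcribe])
# episode["status"] == "published"
```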
Play Tracking and IAB Compliance
Accurate play counting is critical for advertiser billing and monetization. IAB Podcast Technical Specification requires: count a "download" only when ≥1 MB or the entire file (if <1 MB) has been transferred; deduplicate requests from the same user-agent and IP within 24 hours (bot/scraper filtering). The tracking service logs every HTTP request for audio files from the redirect domain (not the CDN). Each log entry records: episode_id, IP address (hashed, not stored raw — GDPR), user-agent, bytes_transferred, request_timestamp. A batch deduplication job (running every 15 minutes) applies IAB deduplication rules to produce clean download counts. These counts are written to the creator analytics tables. Raw log data is retained for 60 days for audit; aggregate counts are permanent.
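A sketch of the deduplication rules as stated above; the 1 MB threshold and 24-hour window come from this section, the field names are illustrative, and the full-file exception for episodes under 1 MB is omitted for brevity:

```python
from collections import defaultdict

MIN_BYTES = 1_000_000        # "download" threshold stated above
WINDOW_SECONDS = 24 * 3600   # dedup window

def dedupe_downloads(events: list) -> dict:
    """events: dicts with episode_id, ip_hash, user_agent,
    bytes_transferred, request_timestamp. Returns episode_id -> count."""
    last_counted = {}          # (episode, ip_hash, ua) -> last counted ts
    counts = defaultdict(int)
    for ev in sorted(events, key=lambda e: e["request_timestamp"]):
        if ev["bytes_transferred"] < MIN_BYTES:
            continue           # partial fetch, not a valid download
        key = (ev["episode_id"], ev["ip_hash"], ev["user_agent"])
        ts = ev["request_timestamp"]
        if key in last_counted and ts - last_counted[key] < WINDOW_SECONDS:
            continue           # duplicate within the 24-hour window
        last_counted[key] = ts
        counts[ev["episode_id"]] += 1
    return dict(counts)

events = [
    {"episode_id": "ep-1", "ip_hash": "h1", "user_agent": "Overcast",
     "bytes_transferred": 5_000_000, "request_timestamp": 0},
    {"episode_id": "ep-1", "ip_hash": "h1", "user_agent": "Overcast",
     "bytes_transferred": 5_000_000, "request_timestamp": 3_600},   # dupe
    {"episode_id": "ep-1", "ip_hash": "h2", "user_agent": "Overcast",
     "bytes_transferred": 200_000, "request_timestamp": 10},        # partial
    {"episode_id": "ep-1", "ip_hash": "h1", "user_agent": "Overcast",
     "bytes_transferred": 5_000_000, "request_timestamp": 90_000},  # new window
]
print(dedupe_downloads(events))  # {'ep-1': 2}
```

At scale this logic runs as the distributed Spark batch job, but the per-key rule is the same.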
Creator Analytics Service
Creators access analytics via a dashboard showing: episode downloads over time, listener geographic distribution, listening app breakdown (Apple Podcasts, Spotify, Google Podcasts), audience retention curves (what % of listeners heard each 30-second segment — derived from range requests against the CDN), and demographics (estimated from geographic and device data). The retention curve requires parsing HTTP range request headers from CDN access logs to determine which audio segments each listener actually played. CDN access logs are streamed to S3 every 5 minutes, processed by a Spark job that aggregates range requests into per-second play probability arrays, and stored as pre-computed analytics in ClickHouse.
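Assuming constant bitrate, so byte offsets map linearly onto playback time, the range-request aggregation might be sketched as follows (per-second resolution here; the text describes 30-second segments):

```python
def retention_curve(range_requests: dict, duration_seconds: int,
                    bytes_per_second: int) -> list:
    """range_requests: listener_id -> [(first_byte, last_byte), ...].
    Returns the fraction of listeners who fetched each second of audio."""
    played = [0] * duration_seconds
    for ranges in range_requests.values():
        seen = [False] * duration_seconds
        for first, last in ranges:
            start = first // bytes_per_second
            end = min(last // bytes_per_second, duration_seconds - 1)
            for s in range(start, end + 1):
                seen[s] = True               # this listener fetched second s
        for s, hit in enumerate(seen):
            if hit:
                played[s] += 1
    n = max(len(range_requests), 1)
    return [count / n for count in played]

# Two listeners of a 10-second clip at 128 Kbps (16,000 bytes/second):
# u1 fetches everything, u2 only the first half.
curve = retention_curve({"u1": [(0, 159_999)], "u2": [(0, 79_999)]},
                        duration_seconds=10, bytes_per_second=16_000)
# curve == [1.0]*5 + [0.5]*5
```

Note that fetched bytes are an upper bound on listened audio; pre-fetching players inflate the tail of the curve, which is why IAB treats downloads, not playback, as the billing unit.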
Database Design
PostgreSQL: shows (show_id, creator_id, title, description, rss_url, category, explicit, status), episodes (episode_id, show_id, title, description, audio_url, duration_seconds, published_at, season, episode_number, chapters_json), subscriptions (user_id, show_id, subscribed_at, notification_enabled). Redis: user:{user_id}:progress:{episode_id} (playback position in seconds, TTL 90 days), rss:{show_id}:cache (rendered RSS XML, TTL 5 min), episode:{episode_id}:download_count (real-time counter, synced to DB every 5 min). Cassandra: play_events (episode_id, user_id, played_at, duration_played_seconds, client_type) — append-only, time-windowed retention. Elasticsearch: episode_transcripts (episode_id, show_id, transcript_text, segments_json). ClickHouse: episode_analytics (episode_id, date, downloads, unique_listeners, avg_retention_pct, geographic_distribution_json).
API Design
- POST /episodes/upload-url — returns S3 pre-signed multipart upload URL and episode_id; upload goes directly to S3
- GET /shows/{show_id}/rss — returns RSS 2.0 feed with audio redirect URLs; cached at CDN for 5 minutes
- GET /episodes/{episode_id}/stream?quality={128|320} — returns CDN URL for audio; for SSAI, triggers ad selection and returns stitched stream URL
- PUT /users/{user_id}/progress/{episode_id} — body: {position_seconds}; upserts playback position in Redis; called every 30 seconds during playback
- GET /analytics/episodes/{episode_id} — returns analytics from ClickHouse: downloads, retention curve, geographic data
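The progress-sync endpoint reduces to a SETEX/GET pair against the key schema from the Database Design section. The in-memory stand-in below exists only to keep the sketch self-contained; production would use redis-py against a real cluster:

```python
PROGRESS_TTL = 90 * 24 * 3600  # 90 days, matching the Redis key schema

def save_progress(r, user_id, episode_id, position_seconds):
    r.setex(f"user:{user_id}:progress:{episode_id}", PROGRESS_TTL,
            str(position_seconds))

def load_progress(r, user_id, episode_id):
    val = r.get(f"user:{user_id}:progress:{episode_id}")
    return int(val) if val is not None else 0  # unseen episode: start at 0

class FakeRedis:
    """In-memory stand-in so the sketch is runnable (TTL ignored)."""
    def __init__(self):
        self.store = {}
    def setex(self, key, ttl, value):
        self.store[key] = value
    def get(self, key):
        return self.store.get(key)

r = FakeRedis()
save_progress(r, "u1", "ep-123", 842)    # player pings every 30 seconds
print(load_progress(r, "u1", "ep-123"))  # 842
```

Last-write-wins is acceptable here: with 30-second pings, a lost update costs at most 30 seconds of position.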
Scaling & Bottlenecks
RSS feed generation is a scalability concern when a major podcast releases an episode and thousands of apps poll simultaneously. The 5-minute CDN cache absorbs this: with request collapsing at the edge, a single cache miss triggers one RSS generation, and the ~10k simultaneous requests in the 5-minute window are served from cache. Warming the cache on episode publish (one CDN invalidation API call plus one warm-up request) costs negligible compute.
Play tracking deduplication: 69.4k tracking events/second through the redirect domain means 69.4k HTTP log writes/second. Streaming log ingestion (Kinesis/Kafka) rather than direct database writes keeps this off the critical path. The deduplication batch job (15-minute batches) processes ~62M events per batch (69.4k × 900 seconds) and runs as a distributed Spark job in ~2 minutes — well within the 15-minute window.
Key Trade-offs
- Platform-hosted RSS vs. redirect-only: Hosting and generating RSS feeds on the platform (rather than just redirecting to creator-hosted RSS) gives the platform more control (ad insertion, analytics), but creators lose portability; offering both modes (full hosting and redirect-only tracking) accommodates all creator types.
- SSAI vs. client-side ad insertion: SSAI (server-side) enables ad insertion on any device without client SDK requirements and prevents ad blocking; client-side insertion allows richer ad formats (interactive) and simpler server infrastructure. SSAI is the industry direction for programmatic podcast advertising.
- Whisper self-hosted vs. managed ASR: Self-hosted Whisper (GPU cluster) is roughly 10× cheaper than managed ASR APIs at scale but requires ML infrastructure management; for a startup, managed ASR is operationally simpler. For a platform transcribing 5M episodes/year × 45 minutes (~225M minutes), the price gap ($0.001/minute vs. $0.01/minute) is worth roughly $2M/year — enough to justify the self-hosting investment.
- Waveform generation cost vs. UX value: Pre-generating waveform amplitude arrays (for the player scrubber visualization) adds ~5 seconds to the processing pipeline and 10 KB of storage per episode; the UX improvement for scrubbing is significant for educational/informational podcasts where listeners seek specific sections.