System Design: E-Book Platform

Requirements

Functional Requirements:

Publishers upload e-books in EPUB/PDF format; the platform converts and stores them in a DRM-protected format
Readers purchase and download books to multiple devices (e-reader, tablet, phone, desktop)
Reading position, bookmarks, highlights, and notes sync across all devices in near real time
Full-text search within a book and across the user's library
Recommendation engine suggesting books based on reading history, genre preferences, and purchase patterns
Sample reading (first 10% of any book) without purchase

Non-Functional Requirements:

50 million active readers, 12 million books in catalog
Sync latency under 5 seconds for reading position across devices
99.99% availability for book serving; purchase flow requires strong consistency
DRM must prevent unauthorized copying while remaining transparent to legitimate users
Support book sizes up to 500MB (illustrated textbooks with embedded media)

Scale Estimation

50M active readers, 20M DAU.

Position syncs: the average user reads 30 minutes/day and syncs position on every page turn (roughly once every 60 seconds), so 20M users × 30 syncs = 600M sync events/day, or about 6,944/sec on average.
Book downloads: 5M downloads/day × 5MB average per book = 25TB egress/day.
Catalog storage: 12M books × 20MB average (after conversion and DRM packaging) = 240TB.
Annotations: 50M users × 200 highlights/notes average = 10B annotations.
Purchase transactions: 500K/day, or about 6/sec.
Search index: 12M books × 100K words average = 1.2 trillion tokens indexed.
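The arithmetic above can be checked with a short script. This is only a sketch: the variable names are illustrative, and decimal units (1TB = 1,000,000MB) are assumed.

```python
# Back-of-envelope check of the scale estimates.
# Assumes decimal units (1 TB = 1,000,000 MB); all names are illustrative.
SECONDS_PER_DAY = 24 * 60 * 60

dau = 20_000_000
syncs_per_user_per_day = 30  # 30 minutes of reading, one sync per ~60 s
sync_events_per_day = dau * syncs_per_user_per_day
sync_events_per_sec = sync_events_per_day / SECONDS_PER_DAY

downloads_per_day = 5_000_000
avg_download_mb = 5
egress_tb_per_day = downloads_per_day * avg_download_mb / 1_000_000

catalog_books = 12_000_000
avg_stored_mb = 20
catalog_storage_tb = catalog_books * avg_stored_mb / 1_000_000

annotations_total = 50_000_000 * 200
purchases_per_sec = 500_000 / SECONDS_PER_DAY
indexed_tokens = catalog_books * 100_000

print(f"sync events/day:  {sync_events_per_day:,}")
print(f"sync events/sec:  {sync_events_per_sec:,.0f}")
print(f"egress TB/day:    {egress_tb_per_day:,.0f}")
print(f"catalog TB:       {catalog_storage_tb:,.0f}")
print(f"annotations:      {annotations_total:,}")
print(f"purchases/sec:    {purchases_per_sec:.1f}")
print(f"indexed tokens:   {indexed_tokens:,}")
```

These are daily averages; peak-hour rates would need a separate multiplier that the estimate does not state.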

High-Level Architecture

The platform consists of four major subsystems. Three are described here; the fourth, the Recommendation Engine, is covered under Core Components. The Publishing Pipeline ingests books from publishers via an upload portal or bulk API. EPUB files are validated against the EPUB 3.2 spec, converted to the platform's internal format (a segmented container optimized for streaming page delivery), and packaged with DRM encryption. Each book is split into segments (chapters or fixed-size chunks) and encrypted with AES-256-CTR using a per-book content key. At download time, the content key is itself encrypted with the requesting device's public key, so each license is bound to one device. This enables multi-device access while preventing key sharing.

The Reading Service handles book delivery and rendering. When a reader opens a book, the client requests the book manifest (table of contents, segment metadata) from the Book API. Segments are fetched on-demand as the reader navigates — only the current chapter and one adjacent chapter are downloaded, minimizing initial load time. On e-ink devices with limited connectivity, the entire book is pre-downloaded. The DRM client library (embedded in the reader app) decrypts segments using the device-bound key, renders the content, and manages page layout (reflowable EPUB or fixed-layout PDF).
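The on-demand fetch rule (current chapter plus one adjacent chapter) might look like the sketch below. The function name is hypothetical, and the choice of neighbor is an assumption: the text does not say whether "adjacent" means the next or the previous chapter, so this version prefers the next one.

```python
from __future__ import annotations


def chapters_to_fetch(current: int, total: int) -> list[int]:
    """Return the chapter indices to download: the current one plus one neighbor.

    Assumption: the neighbor is the next chapter, falling back to the
    previous chapter when the reader is already on the last one.
    """
    if not 0 <= current < total:
        raise ValueError("current chapter out of range")
    if current + 1 < total:
        return [current, current + 1]
    if current > 0:
        return [current, current - 1]
    return [current]  # single-chapter book


print(chapters_to_fetch(0, 10))
print(chapters_to_fetch(9, 10))
```

A client on an e-ink device would bypass this rule and request every segment listed in the manifest.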

The Sync Service maintains reading state across devices. Each reading event (page turn, bookmark, highlight, note) is sent to a sync endpoint and stored in DynamoDB with a vector clock for conflict resolution. When a user opens a book on a different device, the client fetches the latest state and applies it. Conflicts (e.g., two different bookmarks on the same passage from two devices) are resolved using last-writer-wins for position and merge for annotations.
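The two resolution rules can be illustrated as follows. This is a minimal sketch under stated assumptions: the Position fields, the server-assigned millisecond timestamp, and the device_id tie-break are all hypothetical, and annotations are modeled as plain dictionaries keyed by annotation_id.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    cfi: str
    updated_at_ms: int  # assumed to be assigned by the server on receipt
    device_id: str


def resolve_position(a: Position, b: Position) -> Position:
    """Last-writer-wins: the later timestamp wins, device_id breaks ties."""
    return max(a, b, key=lambda p: (p.updated_at_ms, p.device_id))


def merge_annotations(local: dict[str, dict], remote: dict[str, dict]) -> dict[str, dict]:
    """Union by annotation_id so additions from both devices survive.

    When the same id exists on both sides this sketch keeps the remote copy;
    a real merge would compare the vector clocks of the two versions.
    """
    merged = dict(remote)
    for annotation_id, annotation in local.items():
        merged.setdefault(annotation_id, annotation)
    return merged


phone = Position("epubcfi(/6/4!/4/2/1:0)", 1_000, "phone")
tablet = Position("epubcfi(/6/8!/4/2/1:0)", 2_000, "tablet")
print(resolve_position(phone, tablet).device_id)
print(sorted(merge_annotations({"a1": {"type": "bookmark"}}, {"a2": {"type": "highlight"}})))
```

Position is safe to overwrite because only the latest value matters, whereas each annotation is independent user content and should never be dropped by a concurrent write.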

Core Components

DRM & Content Protection

The DRM system uses a license server model. When a user purchases a book, a license record is created in the License Service (PostgreSQL) linking user_id, book_id, and entitlement details (download limit, expiry for rentals). When the reader app requests a download, the License Service generates a device-specific license containing the content decryption key encrypted with the device's public key (each device generates an RSA-2048 keypair during registration). The content itself is encrypted with AES-256-CTR; the initialization vector is derived from the segment index, enabling random-access decryption without decrypting preceding segments. Device limits (e.g., max 6 devices per account) are enforced at the license issuance layer.
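One way to derive the per-segment initialization vector is sketched below. The counter-block layout is an assumption, not something the design specifies: an 8-byte per-book nonce, a 4-byte segment index, and a 4-byte block counter that starts at zero. The function name and the nonce are hypothetical.

```python
import struct

NONCE_LEN = 8  # assumed length of a random per-book nonce stored with the manifest


def segment_iv(book_nonce: bytes, segment_index: int) -> bytes:
    """Build the 16-byte initial AES-CTR counter block for one segment.

    Assumed layout: 8-byte book nonce | 4-byte segment index (big-endian) |
    4-byte block counter starting at zero.
    """
    if len(book_nonce) != NONCE_LEN:
        raise ValueError("book nonce must be 8 bytes")
    if not 0 <= segment_index < 2**32:
        raise ValueError("segment index out of range")
    return book_nonce + struct.pack(">II", segment_index, 0)


nonce = bytes(range(8))  # placeholder value for illustration only
print(segment_iv(nonce, 0).hex())
print(segment_iv(nonce, 7).hex())
```

Because every segment is encrypted under the same per-book key, counter ranges must never overlap between segments. With this layout that holds as long as a segment stays under 2^32 AES blocks (64GiB), which is far above the 500MB book limit. Any segment can then be decrypted independently by recomputing its IV from the index.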

Annotation & Sync Engine

Annotations (highlights, bookmarks, notes) are stored in DynamoDB with the schema: PK=user_id#book_id, SK=annotation_id. Each annotation includes type, position (CFI — Canonical Fragment Identifier per the EPUB spec), content (for notes), color (for highlights), and a vector clock for conflict resolution. Sync uses a pull-based model: the client sends its local vector clock, the server returns all annotations with newer vector clock entries, and the client merges them. For offline reading, annotations are queued locally and synced on reconnection. The sync protocol handles deletions via tombstones with a 30-day retention window.
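The pull step and tombstone cleanup might be sketched as below. The record shape (a dictionary with "clock", "deleted", and "deleted_at_ms" fields) and all function names are assumptions made for illustration; a vector clock is modeled as a mapping from device_id to a per-device counter.

```python
from __future__ import annotations

MS_PER_DAY = 86_400_000


def is_newer(annotation_clock: dict[str, int], client_clock: dict[str, int]) -> bool:
    """True if the annotation carries any entry the client has not yet seen."""
    return any(
        count > client_clock.get(device, 0)
        for device, count in annotation_clock.items()
    )


def annotations_since(stored: list[dict], client_clock: dict[str, int]) -> list[dict]:
    """Server side of the pull: return annotations newer than the client's clock."""
    return [a for a in stored if is_newer(a["clock"], client_clock)]


def merge_clocks(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Client side: advance the local clock to the entry-wise maximum."""
    return {d: max(a.get(d, 0), b.get(d, 0)) for d in a.keys() | b.keys()}


def purge_tombstones(stored: list[dict], now_ms: int, retention_days: int = 30) -> list[dict]:
    """Drop deletion markers older than the retention window."""
    cutoff = now_ms - retention_days * MS_PER_DAY
    return [
        a for a in stored
        if not (a.get("deleted") and a["deleted_at_ms"] < cutoff)
    ]


stored = [
    {"id": "a1", "clock": {"phone": 1}},
    {"id": "a2", "clock": {"phone": 1, "tablet": 2}},
]
print([a["id"] for a in annotations_since(stored, {"phone": 1})])
```

A device that stays offline longer than the retention window can miss a tombstone, so a client in that state would likely need a full resync rather than a delta.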

Recommendation Engine

The recommendation system uses collaborative filtering (item-based) combined with content-based features. The item-based model computes book similarity using co-purchase matrices (users who bought X also bought Y) processed nightly on a Spark cluster. Content-based features include genre, author, page count, publication date, and extracted topic vectors (TF-IDF on book descriptions). A blending layer combines both signals using a learned weighted model (logistic regression). Cold-start books (new releases) are boosted using publisher metadata and editorial curation. Recommendations are pre-computed for active users and cached in Redis; personalized book rows are served via a Feed API.
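A toy version of the co-purchase similarity and the blending step is sketched below. Cosine similarity over buyer sets is one common choice for item-based filtering and is an assumption here, as are the blend weights, which stand in for coefficients that would be learned offline.

```python
from __future__ import annotations

import math
from collections import defaultdict
from itertools import combinations


def co_purchase_similarity(purchases: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Cosine similarity between books, computed from who bought what.

    purchases maps user_id -> set of book_ids. Pairs with no shared
    buyers are omitted from the result.
    """
    buyers: dict[str, set[str]] = defaultdict(set)
    for user, books in purchases.items():
        for book in books:
            buyers[book].add(user)
    sims = {}
    for x, y in combinations(sorted(buyers), 2):
        shared = len(buyers[x] & buyers[y])
        if shared:
            sims[(x, y)] = shared / math.sqrt(len(buyers[x]) * len(buyers[y]))
    return sims


def blend(cf_score: float, content_score: float,
          weights: tuple[float, float, float] = (-2.0, 3.0, 1.5)) -> float:
    """Logistic blend of both signals; the weights are placeholders."""
    bias, w_cf, w_content = weights
    return 1 / (1 + math.exp(-(bias + w_cf * cf_score + w_content * content_score)))


purchases = {"u1": {"A", "B"}, "u2": {"A", "B"}, "u3": {"A"}}
print(co_purchase_similarity(purchases))
print(blend(cf_score=0.8, content_score=0.5))
```

The all-pairs loop is only workable for a toy catalog. At 12M books, the nightly Spark job would presumably enumerate only pairs that actually co-occur in some user's purchase history.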

Database Design

The book catalog is stored in PostgreSQL: books (book_id UUID PK, title, author_ids ARRAY, publisher_id, isbn, language, page_count, file_size, format, drm_content_key_encrypted, s3_manifest_path, published_date, created_at). A full-text search index in Elasticsearch indexes title, author, description, and extracted keywords. User library data is stored in DynamoDB for fast per-user library lookups: user_books (user_id, book_id, purchased_at, last_read_at, reading_progress_pct, status), where status is one of purchased, sample, rental, or returned.

Purchase transactions use a separate PostgreSQL instance with ACID guarantees: transactions (txn_id, user_id, book_id, amount, currency, payment_method_token, status, created_at). This database is the source of truth for entitlements and is replicated synchronously to a standby for zero data loss. Reading analytics (pages read, time spent, completion rates) flow through Kafka to a data warehouse (Redshift) for publisher reporting and recommendation model training.

API Design

POST /api/v1/books/{book_id}/purchase — Purchase a book; body contains payment_token; returns license_id and download_url
GET /api/v1/books/{book_id}/manifest — Fetch book manifest (TOC, segment URLs, metadata) for an entitled user
PUT /api/v1/sync/{book_id}/position — Update reading position; body contains cfi_position, page_number, progress_pct, device_id
GET /api/v1/sync/{book_id}/annotations?since={vector_clock} — Fetch annotations newer than the client's vector clock
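As an illustration of the position endpoint, the sketch below builds the PUT request without sending it. The host name, the bearer-token header, and the function name are assumptions; only the path and the body fields come from the API list above.

```python
import json
from urllib import parse, request

BASE_URL = "https://api.example.com"  # hypothetical host


def build_position_update(book_id: str, cfi_position: str, page_number: int,
                          progress_pct: float, device_id: str,
                          token: str) -> request.Request:
    """Build (but do not send) the request that updates reading position."""
    body = json.dumps({
        "cfi_position": cfi_position,
        "page_number": page_number,
        "progress_pct": progress_pct,
        "device_id": device_id,
    }).encode("utf-8")
    return request.Request(
        url=f"{BASE_URL}/api/v1/sync/{parse.quote(book_id, safe='')}/position",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # auth scheme is an assumption
        },
    )


req = build_position_update("b-123", "epubcfi(/6/4!/4/2/1:0)", 42, 17.5, "phone-1", "TOKEN")
print(req.get_method(), req.full_url)
```

Sending the request would be a call to urllib.request.urlopen(req); the response format is not specified in the API list, so it is left out here.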

Scaling & Bottlenecks

The sync service is the primary bottleneck, averaging 6,944 position updates/sec; peak-hour load will be higher than this daily average. DynamoDB handles this write-heavy workload with auto-scaling provisioned capacity. The write pattern is highly partitioned (each user's data is independent), enabling linear scaling. However, popular books with millions of concurrent readers create hot partitions in analytics pipelines; a Kafka intermediate layer buffers these events. On the read path, book segments are served from a CDN: encrypted segments are identical for all users (per-user decryption happens client-side), which achieves cache hit rates above 90%.

The publishing pipeline processes 10K new books/day. EPUB validation and format conversion take 30 seconds per book on average; DRM packaging adds 10 seconds. A fleet of 50 workers handles the steady-state load. Large illustrated books (up to 500MB) can take 5 minutes; these are processed on high-memory instances. The Elasticsearch catalog index holds 12M documents in 6 shards across 3 nodes; index updates from new publications are near-real-time (1-second refresh interval).
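A quick capacity check for the publishing pipeline, using only the figures above, is sketched below. It assumes each worker processes one book at a time and that arrivals are spread evenly over the day, neither of which the design states.

```python
import math

# Figures from the text; one book per worker at a time is an assumption.
books_per_day = 10_000
seconds_per_book = 30 + 10  # validation/conversion + DRM packaging
fleet_size = 50

worker_seconds_per_day = books_per_day * seconds_per_book
busy_workers = worker_seconds_per_day / 86_400  # average workers kept busy
min_workers = math.ceil(busy_workers)
headroom = fleet_size / busy_workers

print(f"average busy workers: {busy_workers:.2f}")
print(f"minimum workers at 100% utilization: {min_workers}")
print(f"fleet headroom: {headroom:.1f}x")
```

On these numbers the 50-worker fleet is roughly ten times the average load, which leaves room for bursty bulk uploads and for the 5-minute large-book jobs.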

Key Trade-offs

Segment-based delivery vs full-book download: Streaming segments reduces initial load time to under 2 seconds but requires connectivity during reading. Offline mode pre-downloads the entire book, so the platform offers both experiences.
Vector clocks vs last-writer-wins for sync: Vector clocks preserve concurrent annotations from multiple devices (merge instead of overwrite) but add complexity to the sync protocol. The complexity is justified because losing a user's highlight is unacceptable; reading position, where only the latest value matters, still uses last-writer-wins.
Per-book encryption keys vs per-user keys: Per-book keys mean CDN-cached encrypted content works for all users and only the key delivery is personalized, which makes CDN caching highly effective. Per-user encryption would require a unique encrypted copy of each book per user.
EPUB-based rendering vs proprietary format: EPUB is an open standard with broad tooling support, but its rendering inconsistencies across devices require extensive testing. A proprietary format would provide pixel-perfect consistency at the cost of ecosystem lock-in.