
System Design: Video Search Engine

System design of a video search engine covering multi-modal indexing (text, speech, visual), inverted index architecture, ranking models, and query understanding for searching billions of videos.

17 min read · Updated Jan 15, 2025
system-design · video-search · elasticsearch · nlp · information-retrieval

Requirements

Functional Requirements:

  • Full-text search over video titles, descriptions, and tags
  • Search over auto-generated transcripts (speech-to-text)
  • Visual search: find videos containing specific objects, scenes, or text (OCR)
  • Autocomplete and query suggestions as the user types
  • Filters by duration, upload date, resolution, category, and creator
  • Search within a specific channel or playlist

Non-Functional Requirements:

  • Index 500 million videos with real-time indexing of new uploads within 60 seconds
  • Search latency under 200ms (p99) for text queries
  • Support 50,000 search queries per second at peak
  • Relevance: top 3 results satisfy the query for 80%+ of searches (measured by click-through)
  • Handle typos, synonyms, and multi-language queries

Scale Estimation

500M videos indexed. Average metadata per video: 200 bytes title + 1KB description + 500 bytes tags + 10KB transcript (roughly 1,700 words of speech) ≈ 12KB of text per video. Total index data: 500M × 12KB = 6TB of raw text. An inverted index with positional data runs ~3x the raw text, so ~18TB of index. At 50K queries/sec with an average of 3 index lookups per query, that is 150K index reads/sec. Autocomplete: ~200K keystrokes/sec, since each keystroke triggers a prefix query. New videos indexed: 500K/day ≈ 6/sec, each requiring transcript processing and an index update.
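
A quick back-of-envelope script makes these numbers easy to re-check; the per-video sizes are the assumptions stated above.

```python
# Back-of-envelope sizing using the assumptions from the text above.
VIDEOS = 500_000_000
TEXT_PER_VIDEO = 200 + 1_000 + 500 + 10_000      # title + description + tags + transcript (bytes)

raw_text = VIDEOS * TEXT_PER_VIDEO               # ~6 TB of raw text (text rounds 11.7KB up to 12KB/video)
inverted_index = raw_text * 3                    # positional inverted index at ~3x raw text -> ~18 TB

QPS = 50_000
index_reads = QPS * 3                            # ~150K index lookups/sec

new_videos_per_day = 500_000
print(f"raw text: {raw_text / 1e12:.1f} TB, index: {inverted_index / 1e12:.1f} TB")
print(f"index reads/sec: {index_reads:,}, ingest rate: {new_videos_per_day / 86_400:.1f} videos/sec")
```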

High-Level Architecture

The search system has three layers: Ingestion, Index, and Query. The Ingestion Layer processes new and updated videos. When a video is uploaded, a Kafka event triggers the Search Ingestion Pipeline: (1) Metadata Indexer extracts title, description, tags from the Video Metadata Service; (2) Transcript Generator runs a speech-to-text model (Whisper) on the audio track, producing timestamped text; (3) Visual Analyzer runs object detection (YOLO) and OCR (Tesseract/PaddleOCR) on sampled frames, producing visual labels and on-screen text; (4) Embedding Generator produces a dense vector embedding of the video using a multi-modal model (CLIP). All extracted data is written to the Search Index.
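
A minimal sketch of how the first ingestion stage could be wired as a Kafka consumer. The topic name, message fields, consumer group, and index name are illustrative assumptions, not the actual services.

```python
import json
from kafka import KafkaConsumer          # pip install kafka-python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical topic/message shape: {"video_id": ..., "title": ..., "description": ..., "tags": [...]}
consumer = KafkaConsumer(
    "video-uploaded",
    bootstrap_servers=["kafka:9092"],
    group_id="search-metadata-indexer",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Index the metadata immediately; the transcript, visual labels, and CLIP
    # embedding are written by later pipeline stages as partial updates.
    es.index(index="videos", id=event["video_id"], document={
        "video_id": event["video_id"],
        "title": event["title"],
        "description": event["description"],
        "tags": event.get("tags", []),
        "upload_date": event.get("upload_date"),
    })
```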

The Index Layer uses a hybrid architecture: a traditional inverted index (Elasticsearch cluster) for text search and a vector index (FAISS or Milvus) for semantic search. The inverted index stores documents with fields: title (boost 3x), description (boost 1x), tags (boost 2x), transcript (boost 0.5x), visual_labels (boost 1x). The vector index stores 768-dimensional CLIP embeddings for semantic similarity retrieval. Both indexes are sharded by video_id hash across 100+ nodes.
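
One way a single hybrid request could be expressed with the Elasticsearch Python client (ES 8.x kNN search); the index name, embedding field name, and the source of the query vector are assumptions.

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

def hybrid_search(query_text, query_embedding, size=100):
    """BM25 over the boosted text fields plus kNN over the CLIP embedding field."""
    return es.search(
        index="videos",                       # assumed index name
        query={
            "multi_match": {
                "query": query_text,
                # Field boosts mirror the weights described above.
                "fields": ["title^3", "tags^2", "description", "visual_labels", "transcript^0.5"],
            }
        },
        knn={
            "field": "clip_embedding",        # assumed dense_vector field
            "query_vector": query_embedding,  # 768-dim vector from the query encoder
            "k": size,
            "num_candidates": 10 * size,
        },
        size=size,
    )
```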

The Query Layer handles search requests. A Query Understanding module parses the query: spell correction (using a noisy channel model with a video-domain vocabulary), synonym expansion, language detection, and intent classification (navigational — looking for a specific video, vs informational — exploring a topic). The processed query is dispatched to both the inverted index (BM25 scoring) and the vector index (cosine similarity). Results are merged using Reciprocal Rank Fusion and re-ranked by a learning-to-rank model (LambdaMART) incorporating engagement signals.

Core Components

Multi-Modal Ingestion Pipeline

The ingestion pipeline runs as a set of Kafka consumers, each handling a stage. The Transcript Generator is the most compute-intensive: running OpenAI's Whisper large model on a GPU processes audio at 10x real-time speed (a 5-minute video is transcribed in 30 seconds). For 500K new videos/day, that works out to roughly 4,200 GPU-hours/day of transcription compute (500K × ~30 GPU-seconds). Visual analysis (YOLO + OCR) runs on separate GPU workers, sampling 1 frame per 10 seconds of video. Each frame is processed in ~100ms, producing object labels (e.g., "car", "person", "text: SALE 50% OFF"). Results are aggregated per video and written to the index.
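
A minimal sketch of the transcription stage using the open-source whisper package; the audio path and the downstream use of the text are illustrative.

```python
import whisper  # pip install openai-whisper

# "large" for the full model; the pipeline can fall back to "small" for low-priority videos.
model = whisper.load_model("large")

def transcribe(audio_path):
    """Return timestamped transcript segments for one video's audio track."""
    result = model.transcribe(audio_path)
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]

segments = transcribe("/tmp/video_123_audio.wav")       # illustrative path
full_text = " ".join(seg["text"] for seg in segments)   # goes into the `transcript` field
```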

Hybrid Search Engine

The search engine combines sparse retrieval (BM25 on the inverted index) and dense retrieval (ANN on vector embeddings). For a query like "how to change a tire", BM25 matches videos with exact keyword matches in titles and descriptions, while dense retrieval finds semantically related videos (e.g., a video titled "DIY car maintenance" that does not contain the exact phrase). The two result sets are merged using Reciprocal Rank Fusion (RRF): each result's score is 1/(k + rank_in_list), summed across lists. The merged top 100 candidates are re-ranked by the LTR model.
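
The RRF merge itself is only a few lines. This sketch assumes each input is an ordered list of video IDs and uses the conventional constant k=60.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists: score(doc) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, video_id in enumerate(results, start=1):
            scores[video_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["v42", "v17", "v99"]     # from the inverted index
dense_hits = ["v17", "v88", "v42"]    # from the vector index
top_candidates = reciprocal_rank_fusion([bm25_hits, dense_hits])[:100]  # fed to the LTR re-ranker
```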

Autocomplete Service

Autocomplete serves suggestions as the user types, with sub-50ms latency. The system uses a prefix trie built from the top 100M historical queries, weighted by frequency and recency. The trie is stored in memory across a cluster of autocomplete servers, partitioned by the first two characters of the query. When a user types "how to ch", the server traverses the trie to find the top 10 completions by weighted frequency. Personalization is layered on top: the user's recent searches boost matching completions. The trie is rebuilt hourly from the query log.
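
A stripped-down version of the weighted prefix trie; real servers layer recency decay and personalization on top of the raw frequency weights used here.

```python
import heapq

class TrieNode:
    __slots__ = ("children", "completions")
    def __init__(self):
        self.children = {}
        self.completions = []   # min-heap of (weight, query), capped at top-N per node

class AutocompleteTrie:
    def __init__(self, top_n=10):
        self.root = TrieNode()
        self.top_n = top_n

    def insert(self, query, weight):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Keep only the N best completions at every prefix node.
            heapq.heappush(node.completions, (weight, query))
            if len(node.completions) > self.top_n:
                heapq.heappop(node.completions)

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for _, q in sorted(node.completions, reverse=True)]

trie = AutocompleteTrie()
trie.insert("how to change a tire", 9500)
trie.insert("how to charge a car battery", 4200)
print(trie.suggest("how to ch"))  # highest-weighted completions first
```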

Database Design

The primary search index is an Elasticsearch cluster with 100+ data nodes, 3 master nodes, and 20+ coordinating nodes. Index settings: 400 primary shards (each ~45GB, keeping shards within the commonly recommended size range), 1 replica per shard. The document schema includes fields: video_id, title (text with a custom analyzer for stemming and synonyms), description (text), tags (keyword array), transcript (text with positional data for timestamp linking), visual_labels (keyword array), upload_date, duration, view_count, creator_id, category, language. A separate dense vector field stores the CLIP embedding for kNN search.
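
An abridged sketch of what that mapping might look like with the Python client; the analyzer chain, synonym list, and field names beyond those described above are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="videos",
    settings={
        "number_of_shards": 400,
        "number_of_replicas": 1,
        "analysis": {
            "filter": {
                # Illustrative domain synonyms; the real list would be much larger.
                "video_synonyms": {"type": "synonym", "synonyms": ["car, automobile", "fix, repair"]},
            },
            "analyzer": {
                "video_text": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "video_synonyms", "porter_stem"],
                }
            },
        },
    },
    mappings={
        "properties": {
            "video_id":       {"type": "keyword"},
            "title":          {"type": "text", "analyzer": "video_text"},
            "description":    {"type": "text", "analyzer": "video_text"},
            "tags":           {"type": "keyword"},
            "transcript":     {"type": "text", "term_vector": "with_positions_offsets"},
            "visual_labels":  {"type": "keyword"},
            "upload_date":    {"type": "date"},
            "duration":       {"type": "integer"},
            "view_count":     {"type": "long"},
            "creator_id":     {"type": "keyword"},
            "category":       {"type": "keyword"},
            "language":       {"type": "keyword"},
            "clip_embedding": {"type": "dense_vector", "dims": 768, "index": True, "similarity": "cosine"},
        }
    },
)
```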

Query logs are stored in Kafka (7-day retention) and S3 (permanent). A Spark job processes query logs daily to update the autocomplete trie, extract query-click pairs for LTR training, and compute query-level metrics (CTR, zero-result rate). The LTR model training pipeline uses XGBoost on 200+ features (BM25 score, embedding similarity, video popularity, freshness, creator authority, query-title exact match ratio).
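
A sketch of the LTR training step using XGBoost's ranking objective (a LambdaMART-style learner). The feature matrix, labels, per-query group sizes, and file paths are placeholders standing in for the output of the query-log pipeline described above.

```python
import numpy as np
import xgboost as xgb  # pip install xgboost

# X: one row per (query, video) candidate with ~200 features (BM25 score, embedding
# similarity, popularity, freshness, ...). y: click/engagement label per row.
# `group` gives the number of candidate rows per query, in row order.
X = np.load("features.npy")          # illustrative paths
y = np.load("labels.npy")
group = np.load("group_sizes.npy")

ranker = xgb.XGBRanker(
    objective="rank:ndcg",           # pairwise/listwise ranking objective
    n_estimators=500,
    learning_rate=0.1,
    max_depth=8,
)
ranker.fit(X, y, group=group)
ranker.save_model("ltr_model.json")  # served by the re-ranking stage of the Query Layer
```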

API Design

  • GET /api/v1/search?q={query}&filters={json}&sort={relevance|date|views}&offset=0&limit=20 — Full-text search with filters and pagination
  • GET /api/v1/search/autocomplete?prefix={text}&limit=10 — Autocomplete suggestions
  • GET /api/v1/search/visual?image_url={url}&limit=20 — Visual similarity search using an uploaded image or frame
  • GET /api/v1/search/transcript?video_id={id}&q={query} — Search within a video's transcript, returns matching timestamps
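
For example, a filtered search call from a client might look like this; the host and the keys inside the filters payload are assumptions about the API, not its documented contract.

```python
import requests

resp = requests.get(
    "https://api.example.com/api/v1/search",   # assumed host
    params={
        "q": "how to change a tire",
        "filters": '{"duration_max": 600, "category": "autos", "upload_date_after": "2024-01-01"}',
        "sort": "relevance",
        "offset": 0,
        "limit": 20,
    },
    timeout=2,
)
results = resp.json()["results"]   # assumed response shape
```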

Scaling & Bottlenecks

The Elasticsearch cluster is the primary scaling challenge. A full-text search fans out to one copy of every shard, so each shard copy sees a large fraction of the 50K queries/sec; adding replicas, not just more primary shards, is the main lever for read throughput. Elasticsearch performance also degrades with very large shards (>50GB), so shard sizing must balance query latency against cluster management overhead. A hot-warm-cold architecture is used: new and popular videos live on hot nodes (NVMe SSDs), older content migrates to warm nodes (HDDs), and very old content goes to cold storage (snapshots in S3, restored on demand). ILM (Index Lifecycle Management) automates this tiering.
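
A sketch of an ILM policy expressing this tiering, assuming warm node attributes and an S3 snapshot repository already exist; the ages and sizes are illustrative.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.ilm.put_lifecycle(
    name="videos-tiering",
    policy={
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_primary_shard_size": "45gb"}},
            },
            "warm": {
                "min_age": "90d",
                "actions": {"allocate": {"require": {"data": "warm"}}},
            },
            "cold": {
                "min_age": "365d",
                "actions": {"searchable_snapshot": {"snapshot_repository": "s3_repo"}},
            },
        }
    },
)
```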

The transcript processing pipeline is the ingestion bottleneck: running Whisper on 500K videos/day requires a significant GPU fleet. Optimization: use Whisper small for low-engagement videos (about 5x faster, with slightly lower accuracy) and Whisper large only for videos that gain traction (deferred re-transcription). This cuts GPU usage by roughly 80% while maintaining quality for the videos that matter. A priority queue processes high-view-count videos first.
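
A sketch of that prioritization logic; the view-count threshold and the queue plumbing are assumptions for illustration.

```python
import heapq

transcription_queue = []  # min-heap; lower priority number is processed first

def enqueue_for_transcription(video_id, view_count):
    # High-engagement videos jump the queue and get the large model.
    priority = 0 if view_count > 10_000 else 1
    model_size = "large" if view_count > 10_000 else "small"
    heapq.heappush(transcription_queue, (priority, -view_count, video_id, model_size))

def next_job():
    _, _, video_id, model_size = heapq.heappop(transcription_queue)
    return video_id, model_size

enqueue_for_transcription("v42", view_count=250_000)
enqueue_for_transcription("v99", view_count=12)
print(next_job())  # ('v42', 'large') -- popular video processed first with the large model
```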

Key Trade-offs

  • Hybrid search (BM25 + vectors) vs BM25 only: Adding semantic search catches queries where exact keyword matching fails (15% of queries), but doubles index storage and query latency — the relevance improvement justifies the cost
  • Multi-modal indexing (text + speech + visual) vs text-only: Transcript and visual labels unlock searching for content not described in the title/description, covering 30% more of the video's content — compute cost of Whisper + YOLO is significant but worthwhile
  • Real-time indexing vs batch: Indexing new videos within 60 seconds ensures fresh content is discoverable immediately; batch indexing (hourly) would be simpler but delays discoverability
  • Prefix trie vs Elasticsearch suggest: A custom trie gives sub-10ms autocomplete latency vs ~50ms for Elasticsearch's completion suggester — the dedicated service is worth the operational overhead for the most latency-sensitive user interaction
