
System Design: Vector Search Engine

Design a vector search engine that performs high-throughput approximate nearest neighbor search over billion-scale embedding collections. Covers HNSW indexing, product quantization, distributed sharding, and hybrid search combining vector and keyword retrieval.

14 min read · Updated Jan 15, 2025
Tags: system-design, vector-search, ann, embeddings, semantic-search

Requirements

Functional Requirements:

  • Index billions of high-dimensional vectors (128–1536 dimensions) with associated metadata
  • Return top-K approximate nearest neighbors for a query vector in under 50ms
  • Support filtered search: find top-K nearest neighbors where metadata satisfies a predicate (e.g., language=en AND category=sports)
  • Support hybrid search: combine vector similarity score with BM25 keyword relevance score
  • Allow real-time index updates: insert, update, and delete vectors without taking the index offline
  • Namespaced indices: each tenant or use case gets an isolated namespace

Non-Functional Requirements:

  • Query throughput of 10,000 QPS per index with sub-50ms p99 latency
  • 99% recall@10 (99% of true top-10 nearest neighbors are in the returned results)
  • Support indices up to 1 billion vectors of 1536 dimensions
  • Linear horizontal scalability: adding nodes must proportionally increase throughput
  • Data durability: index contents must survive node failures without data loss

Scale Estimation

1 billion vectors * 1536 dimensions * 4 bytes (FP32) = ~6.1 TB of raw vector data. Product quantization (PQ) compresses each vector to 96 bytes (see the PQ section below), a 64x reduction: ~96 GB of compressed codes. An HNSW graph with M=16 stores roughly 2 * M links of 4 bytes each per vector at the base layer, about 128 bytes per vector, or ~128 GB of graph structure. Total in-memory footprint: ~225 GB — a cluster of 5 servers with 64 GB of RAM each holds this with headroom for the OS and metadata. At 10,000 QPS with a 50 ms latency budget, roughly 10,000 * 0.05 = 500 queries are in flight across the cluster at any instant (Little's law).
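
A quick sanity check of these figures in code; the constants mirror the assumptions above and nothing here comes from a real deployment:

    # Back-of-the-envelope memory and concurrency estimate.
    N = 1_000_000_000        # vectors
    DIM = 1536               # dimensions
    FP32 = 4                 # bytes per float
    M = 16                   # HNSW connections per node

    raw_bytes = N * DIM * FP32          # ~6.1 TB of raw FP32 vectors
    pq_bytes = N * 96                   # 96-byte PQ code per vector (64x compression)
    graph_bytes = N * 2 * M * FP32      # ~128 bytes of base-layer links per vector

    resident_gb = (pq_bytes + graph_bytes) / 1e9
    in_flight = 10_000 * 0.050          # Little's law: QPS x latency budget

    print(f"raw: {raw_bytes / 1e12:.1f} TB")        # ~6.1 TB
    print(f"resident: {resident_gb:.0f} GB")        # ~224 GB
    print(f"queries in flight: {in_flight:.0f}")    # ~500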

High-Level Architecture

The vector search engine is organized as a distributed system with two planes: the Index Plane (storing and searching vectors) and the Coordination Plane (routing, metadata, and consistency). Vectors are sharded across multiple index nodes using a consistent hash ring on the vector ID. Each shard holds a subset of the total vector collection and its own HNSW graph. Queries are fanned out from a query router to all relevant shards in parallel; each shard returns its local top-K candidates; the router merges and re-ranks the candidates globally to produce the final top-K.
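
A minimal sketch of shard assignment on a consistent hash ring, assuming MD5 hashing and 64 virtual nodes per shard (both illustrative choices, not values the design mandates):

    import hashlib
    from bisect import bisect_right

    class ConsistentHashRing:
        """Maps vector IDs to shards; adding or removing a shard only remaps ~1/N of the IDs."""

        def __init__(self, shards, vnodes=64):
            self.ring = sorted(
                (self._hash(f"{shard}#{v}"), shard)
                for shard in shards
                for v in range(vnodes)          # virtual nodes smooth the key distribution
            )
            self._keys = [h for h, _ in self.ring]

        @staticmethod
        def _hash(key: str) -> int:
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def shard_for(self, vector_id: str) -> str:
            idx = bisect_right(self._keys, self._hash(vector_id)) % len(self.ring)
            return self.ring[idx][1]

    ring = ConsistentHashRing([f"shard-{i}" for i in range(8)])
    print(ring.shard_for("vec_000042"))         # e.g. "shard-5"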

HNSW (Hierarchical Navigable Small World) is the primary index structure: a multi-layer proximity graph in which each node links to up to M approximate nearest neighbors per layer. Query traversal starts at the top layer (fewest nodes, coarsest graph), greedily navigating toward the query vector, then descends layer by layer until reaching the base layer, where the final candidates are collected. HNSW achieves 98%+ recall with roughly logarithmic query time.
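
To make the per-layer descent concrete, here is a simplified greedy search within a single layer; production implementations track a candidate heap of size ef_search rather than a single best node, and the adjacency-list representation here is assumed for illustration:

    import numpy as np

    def greedy_search_layer(query, entry_id, vectors, neighbors):
        """Greedy descent within one HNSW layer: hop to whichever neighbor is
        closer to the query until no neighbor improves the distance."""
        current = entry_id
        best = np.linalg.norm(vectors[current] - query)
        improved = True
        while improved:
            improved = False
            for nbr in neighbors[current]:      # adjacency list for this layer
                d = np.linalg.norm(vectors[nbr] - query)
                if d < best:
                    best, current, improved = d, nbr, True
        return current   # used as the entry point for the next (finer) layer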

For filtered search, a post-filtering approach runs the ANN query first (oversampling: retrieving up to ef_search candidates rather than just K) and then applies metadata filters to the candidate set. Pre-filtering (restricting the vector space by metadata before searching) is used when filters are highly selective (>95% of vectors excluded), by building separate sub-indices per filter combination. Hybrid search combines the ANN score (cosine similarity) and the BM25 score using Reciprocal Rank Fusion (RRF) to produce a unified ranking, sketched below.
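
A minimal sketch of Reciprocal Rank Fusion over the two ranked lists; the constant k=60 is the commonly used default, not a value this design specifies:

    def reciprocal_rank_fusion(vector_hits, keyword_hits, k=60, top_k=10):
        """Fuse two ranked ID lists: score(id) = sum over lists of 1 / (k + rank)."""
        scores = {}
        for hits in (vector_hits, keyword_hits):
            for rank, doc_id in enumerate(hits, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    # vector_hits / keyword_hits are IDs ordered best-first by ANN and BM25 respectively.
    fused = reciprocal_rank_fusion(["v7", "v3", "v9"], ["v3", "v1", "v7"])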

Core Components

HNSW Index Engine

The HNSW index is built using hnswlib or nmslib with parameters: M=16 (connections per node), ef_construction=200 (search depth during build), ef_search=100 (search depth during query). These parameters trade off index build time, memory, and recall. The index is built in-memory and persisted to disk as a binary file for durability. Incremental insertions add new vectors to the HNSW graph in O(log N) time without rebuilding the entire index. Deletions mark vectors as deleted (soft delete) and are filtered from results; periodic compaction rebuilds the index to reclaim space from deleted entries.
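
A small hnswlib sketch using the parameters above; the toy collection size and random data are placeholders, and mark_deleted provides the soft delete described here:

    import hnswlib
    import numpy as np

    dim, num_elements = 1536, 10_000            # toy scale for illustration
    data = np.random.rand(num_elements, dim).astype(np.float32)
    ids = np.arange(num_elements)

    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=num_elements, M=16, ef_construction=200)
    index.add_items(data, ids)                  # incremental inserts also go through add_items
    index.set_ef(100)                           # ef_search: higher -> better recall, slower queries

    labels, distances = index.knn_query(data[:1], k=10)

    index.mark_deleted(int(labels[0][0]))       # soft delete: excluded from future results
    index.save_index("shard_0.bin")             # persisted snapshot for durability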

Product Quantization (PQ) Compression

Product quantization divides each 1536-dimensional vector into 96 sub-vectors of 16 dimensions each. Each sub-vector is quantized to one of 256 centroids (8-bit code). The compressed representation is 96 bytes per vector (vs. 6,144 bytes uncompressed) — a 64x compression ratio. Distance computations use lookup tables of pre-computed sub-vector distances, enabling distance computation over compressed vectors in 1 microsecond vs. 10 microseconds for raw vectors. Recall degrades by 2–5% due to quantization error, acceptable for most applications.
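
A compact sketch of PQ encoding and asymmetric distance computation (ADC), with codebooks trained via scikit-learn's KMeans on toy data purely to illustrate the 96-sub-vector / 256-centroid layout:

    import numpy as np
    from sklearn.cluster import KMeans

    DIM, M_SUB, K = 1536, 96, 256               # 96 sub-vectors, 256 centroids per sub-space
    SUB = DIM // M_SUB                          # 16 dimensions per sub-vector

    train = np.random.rand(2_000, DIM).astype(np.float32)   # toy training set

    # One codebook per sub-space: shape (M_SUB, K, SUB).
    codebooks = np.stack([
        KMeans(n_clusters=K, n_init=1).fit(train[:, m * SUB:(m + 1) * SUB]).cluster_centers_
        for m in range(M_SUB)
    ])

    def encode(x):
        """Quantize one vector to 96 one-byte codes."""
        codes = np.empty(M_SUB, dtype=np.uint8)
        for m in range(M_SUB):
            sub = x[m * SUB:(m + 1) * SUB]
            codes[m] = np.argmin(((codebooks[m] - sub) ** 2).sum(axis=1))
        return codes

    def adc_distance(query, codes):
        """Asymmetric distance: the query stays uncompressed, the database vector is PQ codes.
        Build a (M_SUB, K) lookup table once per query, then sum 96 table entries per vector."""
        table = np.stack([
            ((codebooks[m] - query[m * SUB:(m + 1) * SUB]) ** 2).sum(axis=1)
            for m in range(M_SUB)
        ])
        return table[np.arange(M_SUB), codes].sum()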

Distributed Query Router

The query router maintains a shard map: each shard owns a partition of the vector ID space, assigned by the consistent hash ring. For a query, the router dispatches the query vector to all shards in parallel (or only to the relevant shards for namespace-partitioned queries). Each shard returns its top-K candidates with cosine similarity scores. The router collects all candidates, removes duplicates, re-scores them using exact cosine similarity (since PQ introduces approximation error), and returns the final top-K. Shard failure handling: if a shard does not respond within 40 ms, the router returns results from the available shards with a partial-result flag.
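
A sketch of the scatter-gather step with the 40 ms per-shard budget; search_shard and rescore are caller-supplied stubs standing in for the shard RPC and the exact cosine re-scoring:

    import asyncio

    SHARD_TIMEOUT = 0.040                       # 40 ms per-shard budget

    async def query_all_shards(query_vec, shards, k, search_shard, rescore):
        """Fan out to every shard, tolerate slow or failed shards, merge globally."""
        tasks = [asyncio.create_task(search_shard(s, query_vec, k)) for s in shards]
        done, pending = await asyncio.wait(tasks, timeout=SHARD_TIMEOUT)
        for t in pending:                       # shards that missed the budget
            t.cancel()

        candidates = {}                         # vector_id -> shard-reported score (dedup)
        for t in done:
            if t.exception() is None:
                for vec_id, score in t.result():
                    candidates.setdefault(vec_id, score)

        # Re-score with exact cosine similarity, since PQ scores are approximate.
        ranked = sorted(candidates, key=lambda vid: rescore(vid, query_vec), reverse=True)
        partial = bool(pending) or any(t.exception() for t in done)
        return ranked[:k], partial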

Database Design

Vector metadata (non-vector fields used for filtering) is stored in a separate metadata store (Elasticsearch or PostgreSQL) keyed by vector ID. The metadata store handles complex predicate filters (range queries, term filters, regex). For filtered ANN search, the metadata store returns a list of candidate IDs matching the filter, and the vector engine restricts its HNSW search to this candidate list using a filtered graph traversal. A separate Redis hash maps vector IDs to their shard locations for the router.
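
A sketch of the filtered-search flow under these assumptions: the metadata store has already returned the matching IDs, a Redis hash holds the id-to-shard map, and the shard-side hnswlib index accepts a filter callable during traversal (available in recent hnswlib versions):

    import redis

    r = redis.Redis()                           # shard map: vector_id -> shard name

    def filtered_search(index, query_vec, allowed_ids, k=10):
        """Restrict HNSW traversal to IDs that passed the metadata filter.
        `allowed_ids` comes from the metadata store (Elasticsearch / PostgreSQL)."""
        allowed = set(allowed_ids)
        labels, distances = index.knn_query(
            query_vec, k=k,
            filter=lambda label: label in allowed,   # evaluated per candidate during traversal
        )
        return labels[0], distances[0]

    def shard_for(vector_id: str) -> str:
        """Router-side lookup of which shard owns a vector."""
        shard = r.hget("vector_shard_map", vector_id)
        return shard.decode() if shard else None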

API Design

  • POST /indices/{index_name}/vectors — Insert or update vectors (batch up to 10,000 vectors per call) with metadata.
  • POST /indices/{index_name}/query — Query top-K nearest neighbors for a vector, with optional metadata filter and hybrid search config.
  • DELETE /indices/{index_name}/vectors/{vector_id} — Soft-delete a vector from the index.
  • GET /indices/{index_name}/stats — Return index size, shard distribution, memory usage, and query latency percentiles.
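
An illustrative call to the query endpoint; the field names and host are assumptions about the payload shape, not a documented schema:

    import requests

    payload = {
        "vector": [0.12, -0.03, 0.88],          # query embedding (truncated for brevity)
        "top_k": 10,
        "filter": {"language": "en", "category": "sports"},
        "hybrid": {"keyword_query": "world cup final", "fusion": "rrf"},
        "namespace": "news-articles",
    }
    resp = requests.post("https://search.example.com/indices/news/query", json=payload)
    print(resp.json())                          # expected: ranked vector IDs with scores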

Scaling & Bottlenecks

Memory is the primary bottleneck: HNSW requires keeping the entire graph in RAM for fast traversal. Disk-based ANN indices (DiskANN) keep the graph on SSD with a small in-memory cache for hot nodes, enabling indices larger than available RAM at the cost of roughly 3–5x higher query latency, since graph hops that miss the cache pay an NVMe read (~100 µs) rather than a RAM access (~100 ns). For cost optimization in lower-QPS use cases, DiskANN with NVMe SSDs can serve 1 billion vectors on commodity hardware.

Index update throughput is limited by HNSW graph construction: inserting 1 million new vectors per hour is ~280 inserts/second at ~1 ms each, well within single-node capacity. For bulk loads (e.g., 100 million new vectors), building segment indices in parallel on multiple CPU cores can reach on the order of 10 million inserts/hour. Merging two HNSW graphs is not supported natively, so the segments are either searched side by side with their results merged, or the index is rebuilt from scratch during off-peak hours and swapped in atomically.
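
A sketch of the segment workaround: independently built hnswlib segments are queried side by side and their results merged, since the graphs themselves cannot be merged natively (segment count and sizes are illustrative):

    import heapq
    import hnswlib
    import numpy as np

    def build_segment(data, ids, dim):
        """Build one independent HNSW segment (could run in a separate process)."""
        seg = hnswlib.Index(space="cosine", dim=dim)
        seg.init_index(max_elements=len(data), M=16, ef_construction=200)
        seg.add_items(data, ids)
        seg.set_ef(100)
        return seg

    def search_segments(segments, query, k=10):
        """Query every segment and keep the globally best k by distance."""
        merged = []
        for seg in segments:
            labels, dists = seg.knn_query(query, k=k)
            merged.extend(zip(dists[0], labels[0]))
        return heapq.nsmallest(k, merged)       # cosine distance: smaller is better

    dim = 128
    data = np.random.rand(20_000, dim).astype(np.float32)
    segments = [build_segment(data[i::2], np.arange(len(data))[i::2], dim) for i in range(2)]
    top = search_segments(segments, data[:1])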

Key Trade-offs

  • HNSW vs. IVF (Inverted File Index): HNSW provides higher recall at lower query latency; IVF uses less memory and builds faster but requires scanning multiple inverted lists for high recall, making it slower at equivalent recall.
  • Exact vs. approximate search: Exact KNN guarantees 100% recall but has O(N) complexity; HNSW approximates with O(log N) complexity and 98%+ recall — the 2% recall loss is acceptable in most semantic search applications.
  • Pre-filtering vs. post-filtering: Pre-filtering is exact for highly selective filters but requires building sub-indices; post-filtering is approximate (may miss results filtered out of candidates) but requires no special index structures.
  • Single index vs. sharded index: A single index node is simpler and avoids cross-shard merge overhead; sharding is required when the index exceeds single-node memory capacity and for horizontal throughput scaling.
