System Design: Embeddings Generation Pipeline

Design a scalable embeddings generation pipeline that produces dense vector representations for text, images, and multimodal content at scale. Covers model selection, batch processing, incremental updates, deduplication, and integration with downstream vector stores.

Requirements

Functional Requirements:

  • Generate embeddings for text (up to 8,192 tokens), images (up to 10 MB), and text-image pairs
  • Support multiple embedding models simultaneously (OpenAI text-embedding-3-large, BGE-M3, CLIP)
  • Process both real-time requests (single document, <200ms) and batch jobs (millions of documents)
  • Detect and skip duplicate content: same content should not be re-embedded unnecessarily
  • Store generated embeddings with provenance: which model version, which input hash, when generated
  • Trigger downstream index updates when embeddings are refreshed due to model version changes

Non-Functional Requirements:

  • Batch throughput: 10 million documents per day
  • Real-time latency: embedding generation under 100ms for text <512 tokens
  • Embedding storage: 1 billion embeddings of 1536 dimensions must be queryable
  • Model update cycles: when a new model is deployed, re-embed the full corpus within 7 days
  • 99.9% availability for the real-time embedding API

Scale Estimation

10 million documents/day ≈ 116 docs/second. A self-hosted BGE-M3-class encoder processes a batch of 64 documents (up to 2,048 tokens each) in roughly 50ms on an A10 GPU, i.e. about 1,280 docs/second per GPU. One A10 GPU suffices for the batch workload with ample headroom; 2 GPUs provide redundancy. Full corpus re-embedding (1 billion documents) at 1,280 docs/second takes about 9 days on 1 GPU, or roughly 1 day with 8 GPUs. Storage: 1 billion * 1536 dimensions * 4 bytes ≈ 6 TB for FP32; 3 TB for FP16.
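
The arithmetic behind these estimates can be sanity-checked directly. A minimal back-of-envelope script, restating the assumed figures from this paragraph (planning assumptions, not measured benchmarks):

```python
# Back-of-envelope check of the scale estimates above (assumed figures, not benchmarks).
SECONDS_PER_DAY = 86_400

daily_docs = 10_000_000
print(daily_docs / SECONDS_PER_DAY)                          # ~116 docs/second steady state

batch_size, batch_latency_s = 64, 0.050                      # assumed A10 batch latency
per_gpu_throughput = batch_size / batch_latency_s            # 1,280 docs/second per GPU

corpus = 1_000_000_000
print(corpus / per_gpu_throughput / SECONDS_PER_DAY)         # ~9 days on 1 GPU
print(corpus / (8 * per_gpu_throughput) / SECONDS_PER_DAY)   # ~1.1 days on 8 GPUs

dim, fp32_bytes = 1536, 4
print(corpus * dim * fp32_bytes / 1e12)                      # ~6.1 TB FP32 (FP16 halves this)
```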

High-Level Architecture

The embeddings pipeline operates in two modes: streaming (real-time single-document embedding) and batch (bulk embedding generation). The streaming path is a synchronous microservice: receive document → hash content → check cache → run model inference → store embedding → return. The batch path is an asynchronous distributed job: read documents from S3/database → partition into shards → distribute to GPU workers → write embeddings to object store → update vector database index.
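
A minimal sketch of the streaming path under these assumptions, using Redis for the cache; run_inference and vector_store are hypothetical stand-ins for the model-serving call and the vector database client:

```python
import hashlib
import json

import redis

cache = redis.Redis()
CACHE_TTL_S = 7 * 24 * 3600  # matches the 7-day cache policy described below


def embed_document(text: str, model_id: str) -> list[float]:
    """Streaming path: hash content -> check cache -> run inference -> store -> return."""
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache_key = f"embed:{model_id}:{content_hash}"

    cached = cache.get(cache_key)
    if cached is not None:
        return json.loads(cached)                      # cache hit: skip model inference

    vector = run_inference(model_id, text)             # hypothetical model-serving call
    cache.setex(cache_key, CACHE_TTL_S, json.dumps(vector))
    vector_store.upsert(content_hash, vector)          # hypothetical vector DB client
    return vector
```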

An embedding cache (Redis, keyed by embed:{model_id}:{content_hash}) stores recently generated embeddings for up to 7 days. Before invoking the model, the pipeline checks the cache. Cache hit rate for real-time traffic is typically 30–50% (repeated queries, shared documents). For batch re-embedding with a new model version, the cache key includes the model identifier, so entries written by the previous model never match and the full corpus is re-embedded.

A model registry integration tracks which embedding model was used for each stored embedding. When a new model is deployed, the pipeline creates a migration job that re-embeds all documents whose stored embedding was generated with an older model version. The migration runs in the background over 7 days using low-priority GPU resources, and the vector database atomically swaps the old embeddings for new ones in batches as the migration progresses.
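
A sketch of how the migration job might select stale documents from the provenance registry (described later) and enqueue them in fixed-size batches; the psycopg connection and the enqueue_batch job queue are assumptions:

```python
import psycopg

STALE_DOCS_SQL = """
    SELECT doc_id FROM embeddings_registry
    WHERE model_id = %s AND model_version <> %s
    ORDER BY doc_id
    LIMIT %s OFFSET %s
"""


def schedule_reembedding(conn: psycopg.Connection, model_id: str,
                         new_version: str, batch_size: int = 10_000) -> None:
    """Enqueue low-priority re-embedding batches for documents with stale embeddings."""
    offset = 0
    while True:
        rows = conn.execute(STALE_DOCS_SQL,
                            (model_id, new_version, batch_size, offset)).fetchall()
        if not rows:
            break
        enqueue_batch([doc_id for (doc_id,) in rows],      # hypothetical job queue
                      model_id=model_id, model_version=new_version, priority="low")
        offset += batch_size
```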

Core Components

Embedding Model Serving

Models run on NVIDIA Triton or vLLM (for large transformer-based embedding models). Text models (BGE-M3, E5-large) use mean pooling over the last hidden layer to produce a single embedding vector per document. Image models (CLIP ViT-L/14) run the vision encoder branch on preprocessed image patches. For multimodal embeddings (CLIP text + image), both branches run and the resulting vectors are projected into a shared embedding space. Model input normalization (mean/std, tokenization truncation at max_seq_len) is handled by a preprocessing worker on CPU.
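
A sketch of masked mean pooling over the last hidden layer for a text encoder, as described above, using Hugging Face transformers (model choice and normalization details are illustrative; BGE-M3 is shown because it appears in the requirements):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3").eval()


@torch.no_grad()
def embed_texts(texts: list[str], max_seq_len: int = 8192) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_seq_len, return_tensors="pt")
    hidden = model(**batch).last_hidden_state              # (batch, seq_len, hidden_dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # zero out padding positions
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return torch.nn.functional.normalize(pooled, dim=-1)   # unit-length embeddings
```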

Content Deduplication

A MinHash LSH (Locality-Sensitive Hashing) layer detects near-duplicate documents before embedding generation. Documents are tokenized into character 3-grams and hashed with 128 hash functions; the minimum value under each hash function forms a 128-value MinHash signature. LSH banding groups documents whose estimated Jaccard similarity exceeds 0.85 into the same bucket with high probability. Documents in the same bucket as an already-embedded document skip re-embedding and reuse the existing embedding. This deduplication saves 20–40% of compute on web-crawled corpora with heavy duplication.
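
A sketch of this dedup check with the datasketch library, using the same parameters (character 3-grams, 128 permutations, 0.85 threshold); existing_embedding_for is a hypothetical registry lookup:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128
lsh = MinHashLSH(threshold=0.85, num_perm=NUM_PERM)   # banded LSH over MinHash signatures


def minhash_signature(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for i in range(len(text) - 2):
        m.update(text[i:i + 3].encode("utf-8"))        # character 3-gram shingles
    return m


def dedup_or_register(doc_id: str, text: str):
    sig = minhash_signature(text)
    near_dupes = lsh.query(sig)                        # keys with estimated Jaccard > 0.85
    if near_dupes:
        return existing_embedding_for(near_dupes[0])   # hypothetical: reuse stored embedding
    lsh.insert(doc_id, sig)                            # first copy seen: embed as normal
    return None                                        # caller proceeds to embedding
```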

Provenance & Versioning Store

Every stored embedding carries a metadata record: (doc_id, content_hash, model_id, model_version, embedding_dimension, generated_at, embedding_s3_path). This enables: (1) checking if a document already has an up-to-date embedding before re-embedding, (2) identifying which documents need re-embedding after a model update, and (3) debugging quality issues by tracing back to the exact model version that produced an embedding. Embeddings are stored in Parquet files on S3 (grouped by batch job ID) and separately indexed in the vector database.
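
A sketch of the up-to-date check (points 1 and 2 above) against the registry table defined in the next section, assuming a psycopg 3 connection (conn.execute on the connection is psycopg 3 style):

```python
def needs_embedding(conn, doc_id: str, content_hash: str,
                    model_id: str, model_version: str) -> bool:
    """True if no stored embedding exists for this document under the current
    model version and content hash, i.e. the pipeline should (re-)embed it."""
    row = conn.execute(
        """SELECT 1 FROM embeddings_registry
           WHERE doc_id = %s AND model_id = %s
             AND model_version = %s AND content_hash = %s
           LIMIT 1""",
        (doc_id, model_id, model_version, content_hash),
    ).fetchone()
    return row is None
```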

Database Design

  • PostgreSQL for provenance: embeddings_registry (doc_id VARCHAR, content_hash VARCHAR(64), model_id VARCHAR, model_version VARCHAR, dim INT, vector_db_id VARCHAR, generated_at TIMESTAMP). Index on (doc_id, model_id) for existence checks.
  • Redis for embedding cache: key embed:{model_id}:{content_hash} → serialized float32 array, TTL 7 days.
  • S3 for bulk storage: embeddings stored as s3://embeddings/{model_id}/{YYYY-MM}/{batch_id}.parquet with schema (doc_id, embedding ARRAY).
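
A sketch of the bulk Parquet write with pyarrow, following the schema and S3 path layout above (bucket, month, and batch ID are placeholders; writing directly to s3:// assumes pyarrow's S3 filesystem support is available):

```python
import pyarrow as pa
import pyarrow.parquet as pq

EMBEDDING_SCHEMA = pa.schema([
    ("doc_id", pa.string()),
    ("embedding", pa.list_(pa.float32())),   # e.g. 1536 float32 values per row
])


def write_embedding_batch(doc_ids: list[str], embeddings: list[list[float]],
                          model_id: str, month: str, batch_id: str) -> str:
    table = pa.table({"doc_id": doc_ids, "embedding": embeddings},
                     schema=EMBEDDING_SCHEMA)
    path = f"s3://embeddings/{model_id}/{month}/{batch_id}.parquet"
    pq.write_table(table, path)              # s3:// URIs resolve via pyarrow's S3 filesystem
    return path
```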

API Design

  • POST /embed — Generate an embedding for a single document (text or image URL); returns the embedding vector and metadata.
  • POST /embed/batch — Submit a batch embedding job for up to 1 million documents specified by S3 manifest.
  • GET /embed/batch/{job_id} — Return batch job status and output S3 path.
  • POST /embed/reindex — Trigger a full corpus re-embedding for a new model version; returns a migration job ID with estimated completion time.
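
Illustrative client calls for these endpoints; the host, field names, and response shapes are assumptions, not a published contract:

```python
import requests

BASE = "https://embeddings.internal.example.com"   # placeholder host

# Real-time path: embed a single text document.
resp = requests.post(f"{BASE}/embed", json={
    "model_id": "bge-m3",
    "input_type": "text",
    "content": "What is the refund policy for annual plans?",
})
vector = resp.json()["embedding"]                  # assumed response field

# Batch path: submit an S3 manifest, then poll job status.
job = requests.post(f"{BASE}/embed/batch", json={
    "model_id": "bge-m3",
    "manifest_s3_path": "s3://corpus/manifests/2025-01-15.jsonl",
}).json()
status = requests.get(f"{BASE}/embed/batch/{job['job_id']}").json()
print(status["state"], status.get("output_s3_path"))
```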

Scaling & Bottlenecks

GPU throughput scales linearly with GPU count for batch jobs, so the bottleneck shifts to data loading: at roughly 100 MB/s per S3 prefix, 8 GPU workers need 8 parallel prefix streams. Pushing document ID filtering down with S3 Select and reading pre-tokenized (Hugging Face tokenizer output) binary files instead of raw text reduces data transfer and CPU tokenization overhead by roughly 5x.
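
A sketch of one way to produce those parallel read streams: hash each manifest key to a stable shard so every GPU worker reads its own slice of the corpus (the hashing scheme is an assumption):

```python
import hashlib


def shard_of(doc_key: str, num_workers: int = 8) -> int:
    """Stable assignment of a document key to one of N GPU workers."""
    digest = hashlib.md5(doc_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


def partition_manifest(doc_keys: list[str], num_workers: int = 8) -> list[list[str]]:
    shards: list[list[str]] = [[] for _ in range(num_workers)]
    for key in doc_keys:
        shards[shard_of(key, num_workers)].append(key)
    return shards   # each shard becomes one worker's independent S3 read stream
```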

For real-time traffic, the embedding API can be a bottleneck for applications that embed every user query before search. A request coalescing layer groups multiple simultaneous embed requests within a 10ms window into a single batched inference call, improving GPU utilization from 20% (single-request mode) to 80% (batched mode) with minimal latency increase.
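
A sketch of the coalescing layer with asyncio: requests arriving within the 10ms window are collected and sent to the model in one batched call; run_batched_inference is a hypothetical async model-serving call:

```python
import asyncio

WINDOW_S = 0.010   # 10ms coalescing window
MAX_BATCH = 64

request_queue: asyncio.Queue = asyncio.Queue()


async def embed(text: str) -> list[float]:
    """Per-request entry point; resolves once the batch containing it has run."""
    future = asyncio.get_running_loop().create_future()
    await request_queue.put((text, future))
    return await future


async def coalescing_worker() -> None:
    loop = asyncio.get_running_loop()
    while True:
        text, future = await request_queue.get()       # block until the first request
        texts, futures = [text], [future]
        deadline = loop.time() + WINDOW_S
        while len(texts) < MAX_BATCH and (remaining := deadline - loop.time()) > 0:
            try:
                text, future = await asyncio.wait_for(request_queue.get(), remaining)
                texts.append(text)
                futures.append(future)
            except asyncio.TimeoutError:
                break
        vectors = await run_batched_inference(texts)   # hypothetical batched model call
        for fut, vec in zip(futures, vectors):
            fut.set_result(vec)
```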

Key Trade-offs

  • In-house model vs. API-based embedding: Self-hosted models have higher infrastructure cost but lower per-query cost at scale, full control over model updates, and no data privacy concerns for sensitive content; API-based (OpenAI) is operationally simpler with zero infrastructure but cost grows linearly with volume.
  • FP32 vs. FP16 vs. INT8 embeddings: FP32 preserves maximum precision; FP16 halves storage and bandwidth with negligible quality loss for most retrieval tasks; INT8 quantization further halves storage but degrades recall by 1–3% (see the quantization sketch after this list).
  • Eager vs. lazy re-embedding: Re-embedding all documents immediately after a model update ensures maximum freshness but requires large GPU resources for 7 days; lazy re-embedding (only re-embed documents when they are accessed) reduces compute but results in mixed-model embeddings in the index, degrading search quality.
  • Single embedding space vs. multiple specialized models: One universal embedding model simplifies infrastructure; specialized models (code, science, multilingual) provide 10–20% recall improvement in their domains at the cost of multi-index complexity.
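
A sketch of symmetric per-vector INT8 quantization for the precision trade-off above: store one float scale per vector alongside int8 codes. The exact scheme is an assumption; production systems often use per-dimension or product quantization instead.

```python
import numpy as np


def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-vector INT8 quantization: int8 codes plus one float32 scale."""
    scale = float(np.max(np.abs(vec))) / 127.0
    if scale == 0.0:
        scale = 1.0
    codes = np.clip(np.round(vec / scale), -127, 127).astype(np.int8)
    return codes, scale


def dequantize_int8(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale


vec = np.random.randn(1536).astype(np.float32)
codes, scale = quantize_int8(vec)
print(codes.nbytes, vec.nbytes)   # 1536 vs 6144 bytes: 4x below FP32, 2x below FP16
```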
