System Design: RAG (Retrieval-Augmented Generation) System
Design a production RAG system that combines dense vector retrieval with large language models to answer questions grounded in a private knowledge base. Covers chunking strategies, hybrid retrieval, re-ranking, context assembly, and hallucination mitigation.
Requirements
Functional Requirements:
- Index a private knowledge base (documents, PDFs, wikis, code) and answer natural language questions grounded in it
- Retrieve relevant document chunks using both dense (vector) and sparse (BM25) retrieval
- Re-rank retrieved chunks using a cross-encoder for precision before passing to the LLM
- Generate answers that cite source documents with page/section references
- Support multi-turn conversations that maintain retrieval context across turns
- Provide a feedback mechanism: users can mark answers as helpful or incorrect
Non-Functional Requirements:
- End-to-end answer latency under 3 seconds for typical queries
- Retrieval recall@10 of 95% (95% of relevant chunks in top-10 results)
- Support knowledge base of 10 million document chunks
- 99.9% availability; degrade gracefully to retrieval-only (no LLM generation) if LLM is unavailable
- Incremental indexing: new documents indexed within 5 minutes of upload
Scale Estimation
10 million chunks × 1536 dimensions × 4 bytes = ~61 GB for embeddings. HNSW link storage adds roughly 128 bytes per vector at M=16 (2M layer-0 links × 4 bytes), or ~1.3 GB at this scale. Total vector index: ~63 GB, which fits on a single high-memory server (128 GB RAM). BM25 index (Elasticsearch): 10 million chunks × 500 bytes average = 5 GB raw text, ~15 GB Elasticsearch index with inverted lists. Query rate: 1,000 QPS peak. Each query: embed query (20 ms) + hybrid retrieval (30 ms) + re-rank 20 candidates (80 ms) + LLM generation (1–2 s) ≈ 1.1–2.1 s end to end, within the 3-second budget.
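A quick back-of-envelope check of these figures (the HNSW connectivity M=16 is an assumed parameter, in line with Qdrant's default):

```python
# Capacity and latency math behind the estimates above.
CHUNKS = 10_000_000
DIM = 1536
FLOAT_BYTES = 4
HNSW_M = 16  # assumed HNSW connectivity; link overhead scales with M

embeddings_gb = CHUNKS * DIM * FLOAT_BYTES / 1e9   # ~61.4 GB of raw vectors
hnsw_gb = CHUNKS * 2 * HNSW_M * 4 / 1e9            # layer-0 link lists, ~1.3 GB
print(f"vector index ≈ {embeddings_gb + hnsw_gb:.1f} GB")

fixed_ms = 20 + 30 + 80                            # embed + retrieve + re-rank
print(f"end-to-end ≈ {(fixed_ms + 1000) / 1000:.2f}–{(fixed_ms + 2000) / 1000:.2f} s")
```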
High-Level Architecture
The RAG system has four stages: Document Processing, Indexing, Retrieval, and Generation. Document Processing converts raw files (PDF, DOCX, HTML, Markdown, code) into clean text chunks of 512 tokens with 128-token overlap. Indexing embeds each chunk and writes to both the vector store (Qdrant) and the BM25 index (Elasticsearch). Retrieval executes a hybrid query in parallel against both stores, merges results using Reciprocal Rank Fusion, and re-ranks using a cross-encoder. Generation assembles a context window from the top-5 re-ranked chunks and calls the LLM to produce a grounded answer.
The Document Processing pipeline handles diverse formats: PyMuPDF for PDFs, python-docx for Word, BeautifulSoup for HTML, and tree-sitter for code. Chunking strategy depends on content type: recursive character splitting for prose, function-level splitting for code, table-aware chunking for spreadsheets. Each chunk stores metadata: source_document_id, page_number, section_heading, chunk_index, and content_hash. Metadata enables citation generation and supports metadata-filtered retrieval (e.g., retrieve only from documents tagged product=X).
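A minimal sketch of the per-chunk record produced by this pipeline; field names mirror the metadata listed above, and the dataclass shape is illustrative:

```python
from dataclasses import dataclass
import hashlib

@dataclass
class Chunk:
    source_document_id: str
    page_number: int | None       # None for formats without pages (e.g., Markdown)
    section_heading: str | None
    chunk_index: int              # position within the source document
    text: str
    content_hash: str = ""        # lets re-indexing skip unchanged chunks

    def __post_init__(self):
        if not self.content_hash:
            self.content_hash = hashlib.sha256(self.text.encode()).hexdigest()
```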
Multi-turn conversation support uses a context-aware query rewriter: given the conversation history, an LLM rewrites the user's latest question into a self-contained query that includes implicit references resolved from prior turns. This rewritten query is used for retrieval, preventing context loss when users ask follow-up questions like "what about the pricing?" without specifying what "it" refers to.
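A sketch of the rewriter, assuming an OpenAI-style chat-completions client; the prompt wording and model name are illustrative, and any instruction-following LLM works:

```python
from openai import OpenAI  # assumes the official openai client

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the user's last question as a self-contained search query. "
    "Resolve pronouns and implicit references using the conversation history. "
    "Return only the rewritten query."
)

def rewrite_query(history: list[dict], question: str) -> str:
    """Turn 'what about the pricing?' into a fully specified retrieval query."""
    messages = [{"role": "system", "content": REWRITE_PROMPT}]
    messages += history  # prior turns: [{"role": "user"/"assistant", "content": ...}]
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content.strip()
```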
Core Components
Document Chunking Service
A chunking service processes documents into overlapping chunks. Chunk size of 512 tokens with 128-token overlap balances precision (smaller chunks are topically focused, so retrieval matches are sharper) against context (larger chunks preserve the surrounding text the LLM needs for coherent answers). For long structured documents (technical manuals), a hierarchical chunking approach stores both paragraph-level chunks (for precise retrieval) and section-level summaries (for broader context). Parent-child chunk retrieval: retrieve the small child chunk, but pass its parent (section) as context to the LLM for richer background.
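A minimal sliding-window chunker illustrating the size/overlap mechanics (a real pipeline would tokenize with the embedding model's tokenizer rather than splitting on whitespace):

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 128) -> list[list[str]]:
    """Sliding-window chunking: each window starts (size - overlap) tokens
    after the previous one, so adjacent chunks share `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Example: a 1,000-token document yields windows starting at tokens 0, 384, 768.
# Parent-child variant: index these small chunks, but store a pointer to the
# enclosing section so the LLM can be handed the parent section as context.
chunks = chunk_tokens("lorem ipsum dolor sit amet".split())
```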
Hybrid Retrieval Engine
Dense retrieval: embed the query using the same model as the corpus (text-embedding-3-large), then query the Qdrant HNSW index for the top-20 nearest neighbors by cosine similarity. Sparse retrieval: execute a BM25 query in Elasticsearch with the same query text, retrieving the top-20 by BM25 score. Merge both result lists using Reciprocal Rank Fusion: score(d) = Σ 1/(k + rank_i(d)) over each retriever i, with k = 60. Because RRF uses only ranks, it naturally handles the incompatible score scales of dense and sparse retrieval. The merged top-20 list is passed to the cross-encoder re-ranker.
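RRF is only a few lines of code; a direct implementation over ranked lists of chunk IDs:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists. Only ranks are used, so cosine similarities and
    BM25 scores never need to be calibrated against each other."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# merged_top20 = reciprocal_rank_fusion([dense_ids, sparse_ids])[:20]
```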
Cross-Encoder Re-Ranker
A cross-encoder (ms-marco-MiniLM-L-6-v2 or a fine-tuned variant) takes each (query, chunk) pair and produces a relevance score by processing them jointly through a transformer encoder. Unlike the bi-encoder used for embedding generation, the cross-encoder has full attention between query and document tokens, providing much higher precision. Running 20 cross-encoder inferences at 4ms each = 80ms total. The top-5 re-ranked chunks are selected as the LLM context.
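Using the sentence-transformers CrossEncoder API with the model named above, re-ranking looks roughly like this (top_k=5 matches the context budget; predict() scores all pairs in one batched forward pass):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, chunk) pair jointly, return the top_k chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```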
Database Design
PostgreSQL for document and chunk metadata: documents (doc_id, title, source_url, file_type, uploaded_at, indexed_at, owner_id), chunks (chunk_id, doc_id, chunk_index, content TEXT, token_count INT, embedding_id VARCHAR, page_number INT, section_heading VARCHAR, content_hash VARCHAR). Qdrant collection: vectors with payload fields {chunk_id, doc_id, section_heading, page_number} for metadata filtering. Conversation history: Redis sorted set per session_id with turn timestamps as scores, TTL 24 hours.
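A sketch of metadata-filtered retrieval against that Qdrant collection, assuming the qdrant-client Python API; the collection name and filter value are illustrative:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(host="localhost", port=6333)

query_embedding = [0.0] * 1536  # stand-in for the real query embedding

# Dense search restricted to one document's chunks via payload filtering.
hits = client.search(
    collection_name="chunks",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="doc_id", match=MatchValue(value="doc-123"))]
    ),
    limit=20,
)
```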
API Design
POST /query — Submit a natural language query; returns generated answer, source citations, and confidence score.
POST /documents — Upload a document for processing and indexing; returns doc_id and estimated indexing completion time.
GET /documents/{doc_id}/chunks — Return all chunks for a document with their embedding status.
POST /feedback — Submit user feedback (helpful/unhelpful) for a query response, linked to the query_id.
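An illustrative request/response for POST /query; the field names are hypothetical, not a published contract:

```python
import requests

resp = requests.post("https://rag.example.com/query", json={
    "session_id": "abc-123",
    "query": "What is the refund policy for enterprise plans?",
    "filters": {"product": "X"},   # optional metadata filter
})
answer = resp.json()
# Expected shape (hypothetical):
# {
#   "answer": "...",
#   "citations": [{"doc_id": "...", "page_number": 12, "section_heading": "Refunds"}],
#   "confidence": 0.82,
#   "query_id": "q-789"
# }
```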
Scaling & Bottlenecks
LLM generation latency (1–2 seconds) dominates end-to-end latency. Mitigation: use streaming generation (return tokens as they are produced) to improve perceived responsiveness; the user sees the first token within 500ms even if the full answer takes 2 seconds. For high-QPS deployments, an LLM response cache (keyed by query_hash + retrieved_chunk_hashes, TTL 1 hour) serves repeated identical queries without LLM invocation, reducing cost and latency by 40% for FAQ-style workloads.
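A sketch of the cache lookup, assuming redis-py; keying on the retrieved chunk hashes means a re-indexed corpus naturally invalidates stale answers:

```python
import hashlib
import json

import redis

r = redis.Redis()

def cache_key(query: str, chunk_hashes: list[str]) -> str:
    # Key on both the query and the retrieved evidence: if the index changes
    # and different chunks come back, the cache misses and we regenerate.
    payload = json.dumps({"q": query, "chunks": sorted(chunk_hashes)})
    return "rag:answer:" + hashlib.sha256(payload.encode()).hexdigest()

def get_or_generate(query: str, chunk_hashes: list[str], generate) -> str:
    key = cache_key(query, chunk_hashes)
    if (cached := r.get(key)) is not None:
        return cached.decode()
    answer = generate()             # the expensive LLM call
    r.setex(key, 3600, answer)      # TTL 1 hour, as above
    return answer
```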
Re-ranking 20 candidates with a cross-encoder at 80 ms is the second-largest latency contributor. Distilled cross-encoders (6-layer vs. 12-layer) run in 40 ms with <2% quality degradation. Batching all 20 (query, chunk) pairs into a single inference call maximizes GPU utilization. If re-ranking latency remains a bottleneck, a lighter-weight late-interaction re-ranker (ColBERT) achieves similar quality at around 10 ms.
Key Trade-offs
- Chunk size: Smaller chunks improve retrieval precision (the retrieved text is more focused on the relevant topic) but lose surrounding context needed for coherent answers; larger chunks preserve context but dilute relevance signals.
- Bi-encoder vs. cross-encoder: Bi-encoder (dual embedding) scales to millions of documents with sub-millisecond lookup but has lower precision; cross-encoder is highly precise but cannot scale to millions of candidates — the two-stage pipeline uses both optimally.
- RAG vs. fine-tuning: RAG provides up-to-date knowledge from a dynamic corpus without retraining; fine-tuning bakes knowledge into model weights for faster inference but requires retraining when knowledge changes.
- Answer generation vs. extraction: Generative answers are more natural and synthesize across multiple chunks but can hallucinate; extractive answers (returning the most relevant chunk verbatim) are fully grounded but may not answer complex questions that require synthesis.