System Design: RAG (Retrieval-Augmented Generation) System
Design a production RAG system that combines dense vector retrieval with large language models to answer questions grounded in a private knowledge base. Covers chunking strategies, hybrid retrieval, re-ranking, context assembly, and hallucination mitigation.
Requirements
Functional Requirements:
- Index a private knowledge base (documents, PDFs, wikis, code) and answer natural language questions grounded in it
- Retrieve relevant document chunks using both dense (vector) and sparse (BM25) retrieval
- Re-rank retrieved chunks using a cross-encoder for precision before passing to the LLM
- Generate answers that cite source documents with page/section references
- Support multi-turn conversations that maintain retrieval context across turns
- Provide a feedback mechanism: users can mark answers as helpful or incorrect
Non-Functional Requirements:
- End-to-end answer latency under 3 seconds for typical queries
- Retrieval recall@10 of 95% (95% of relevant chunks in top-10 results)
- Support knowledge base of 10 million document chunks
- 99.9% availability; degrade gracefully to retrieval-only (no LLM generation) if LLM is unavailable
- Incremental indexing: new documents indexed within 5 minutes of upload
Scale Estimation
10 million chunks × 1536 dimensions × 4 bytes = ~61 GB for embeddings. HNSW link storage adds roughly 128 bytes per vector at M=16 (2M layer-0 links × 4 bytes), or ~1.3 GB at this scale. Total vector index: ~63 GB, which fits on a single high-memory server (128 GB RAM). BM25 index (Elasticsearch): 10 million chunks × 500 bytes average = 5 GB raw text, ~15 GB Elasticsearch index with inverted lists. Query rate: 1,000 QPS peak. Each query: embed query (20 ms) + hybrid retrieval (30 ms) + re-rank 20 candidates (80 ms) + LLM generation (1–2 s) ≈ 1.1–2.1 s end to end, within the 3-second budget.
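A quick back-of-envelope check of these figures (the HNSW connectivity M=16 is an assumed parameter, in line with Qdrant's default):

```python
# Capacity and latency math behind the estimates above.
CHUNKS = 10_000_000
DIM = 1536
FLOAT_BYTES = 4
HNSW_M = 16  # assumed HNSW connectivity; link overhead scales with M

embeddings_gb = CHUNKS * DIM * FLOAT_BYTES / 1e9   # ~61.4 GB of raw vectors
hnsw_gb = CHUNKS * 2 * HNSW_M * 4 / 1e9            # layer-0 link lists, ~1.3 GB
print(f"vector index ≈ {embeddings_gb + hnsw_gb:.1f} GB")

fixed_ms = 20 + 30 + 80                            # embed + retrieve + re-rank
print(f"end-to-end ≈ {(fixed_ms + 1000) / 1000:.2f}–{(fixed_ms + 2000) / 1000:.2f} s")
```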
High-Level Architecture
The RAG system has four stages: Document Processing, Indexing, Retrieval, and Generation. Document Processing converts raw files (PDF, DOCX, HTML, Markdown, code) into clean text chunks of 512 tokens with 128-token overlap. Indexing embeds each chunk and writes to both the vector store (Qdrant) and the BM25 index (Elasticsearch). Retrieval executes a hybrid query in parallel against both stores, merges results using Reciprocal Rank Fusion, and re-ranks using a cross-encoder. Generation assembles a context window from the top-5 re-ranked chunks and calls the LLM to produce a grounded answer.
The Document Processing pipeline handles diverse formats: PyMuPDF for PDFs, python-docx for Word, BeautifulSoup for HTML, and tree-sitter for code. Chunking strategy depends on content type: recursive character splitting for prose, function-level splitting for code, table-aware chunking for spreadsheets. Each chunk stores metadata: source_document_id, page_number, section_heading, chunk_index, and content_hash. Metadata enables citation generation and supports metadata-filtered retrieval (e.g., retrieve only from documents tagged product=X).
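A minimal sketch of the per-chunk record produced by this pipeline; field names mirror the metadata listed above, and the dataclass shape is illustrative:

```python
from dataclasses import dataclass
import hashlib

@dataclass
class Chunk:
    source_document_id: str
    page_number: int | None       # None for formats without pages (e.g., Markdown)
    section_heading: str | None
    chunk_index: int              # position within the source document
    text: str
    content_hash: str = ""        # lets re-indexing skip unchanged chunks

    def __post_init__(self):
        if not self.content_hash:
            self.content_hash = hashlib.sha256(self.text.encode()).hexdigest()
```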
Multi-turn conversation support uses a context-aware query rewriter: given the conversation history, an LLM rewrites the user's latest question into a self-contained query that includes implicit references resolved from prior turns. This rewritten query is used for retrieval, preventing context loss when users ask follow-up questions like "what about the pricing?" without specifying what "it" refers to.
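A sketch of the rewriter, assuming an OpenAI-style chat-completions client; the prompt wording and model name are illustrative, and any instruction-following LLM works:

```python
from openai import OpenAI  # assumes the official openai client

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the user's last question as a self-contained search query. "
    "Resolve pronouns and implicit references using the conversation history. "
    "Return only the rewritten query."
)

def rewrite_query(history: list[dict], question: str) -> str:
    """Turn 'what about the pricing?' into a fully specified retrieval query."""
    messages = [{"role": "system", "content": REWRITE_PROMPT}]
    messages += history  # prior turns: [{"role": "user"/"assistant", "content": ...}]
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content.strip()
```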
Core Components
Document Chunking Service
A chunking service processes documents into overlapping chunks. Chunk size of 512 tokens with 128-token overlap balances precision (smaller chunks are topically focused, so retrieval matches are sharper) against context (larger chunks preserve the surrounding text the LLM needs for coherent answers). For long structured documents (technical manuals), a hierarchical chunking approach stores both paragraph-level chunks (for precise retrieval) and section-level summaries (for broader context). Parent-child chunk retrieval: retrieve the small child chunk, but pass its parent (section) as context to the LLM for richer background.
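A minimal sliding-window chunker illustrating the size/overlap mechanics (a real pipeline would tokenize with the embedding model's tokenizer rather than splitting on whitespace):

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 128) -> list[list[str]]:
    """Sliding-window chunking: each window starts (size - overlap) tokens
    after the previous one, so adjacent chunks share `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Example: a 1,000-token document yields windows starting at tokens 0, 384, 768.
# Parent-child variant: index these small chunks, but store a pointer to the
# enclosing section so the LLM can be handed the parent section as context.
chunks = chunk_tokens("lorem ipsum dolor sit amet".split())
```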
Hybrid Retrieval Engine
Dense retrieval: embed the query using the same model as the corpus (text-embedding-3-large), then query the Qdrant HNSW index for the top-20 nearest neighbors by cosine similarity. Sparse retrieval: execute a BM25 query in Elasticsearch with the same query text, retrieving the top-20 by BM25 score. Merge both result lists using Reciprocal Rank Fusion: score(d) = Σ 1/(k + rank_i(d)) over each retriever i, with k = 60. Because RRF uses only ranks, it naturally handles the incompatible score scales of dense and sparse retrieval. The merged top-20 list is passed to the cross-encoder re-ranker.
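RRF is only a few lines of code; a direct implementation over ranked lists of chunk IDs:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists. Only ranks are used, so cosine similarities and
    BM25 scores never need to be calibrated against each other."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# merged_top20 = reciprocal_rank_fusion([dense_ids, sparse_ids])[:20]
```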
Cross-Encoder Re-Ranker
A cross-encoder (ms-marco-MiniLM-L-6-v2 or a fine-tuned variant) takes each (query, chunk) pair and produces a relevance score by processing them jointly through a transformer encoder. Unlike the bi-encoder used for embedding generation, the cross-encoder has full attention between query and document tokens, providing much higher precision. Running 20 cross-encoder inferences at 4ms each = 80ms total. The top-5 re-ranked chunks are selected as the LLM context.
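Using the sentence-transformers CrossEncoder API with the model named above, re-ranking looks roughly like this (top_k=5 matches the context budget; predict() scores all pairs in one batched forward pass):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, chunk) pair jointly, return the top_k chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```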
Database Design
PostgreSQL for document and chunk metadata: documents (doc_id, title, source_url, file_type, uploaded_at, indexed_at, owner_id), chunks (chunk_id, doc_id, chunk_index, content TEXT, token_count INT, embedding_id VARCHAR, page_number INT, section_heading VARCHAR, content_hash VARCHAR). Qdrant collection: vectors with payload fields {chunk_id, doc_id, section_heading, page_number} for metadata filtering. Conversation history: Redis sorted set per session_id with turn timestamps as scores, TTL 24 hours.
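A sketch of metadata-filtered retrieval against that Qdrant collection, assuming the qdrant-client Python API; the collection name and filter value are illustrative:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(host="localhost", port=6333)

query_embedding = [0.0] * 1536  # stand-in for the real query embedding

# Dense search restricted to one document's chunks via payload filtering.
hits = client.search(
    collection_name="chunks",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="doc_id", match=MatchValue(value="doc-123"))]
    ),
    limit=20,
)
```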
API Design
POST /query — Submit a natural language query; returns generated answer, source citations, and confidence score.
POST /documents — Upload a document for processing and indexing; returns doc_id and estimated indexing completion time.
GET /documents/{doc_id}/chunks — Return all chunks for a document with their embedding status.
POST /feedback — Submit user feedback (helpful/unhelpful) for a query response, linked to the query_id.
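An illustrative request/response for POST /query; the field names are hypothetical, not a published contract:

```python
import requests

resp = requests.post("https://rag.example.com/query", json={
    "session_id": "abc-123",
    "query": "What is the refund policy for enterprise plans?",
    "filters": {"product": "X"},   # optional metadata filter
})
answer = resp.json()
# Expected shape (hypothetical):
# {
#   "answer": "...",
#   "citations": [{"doc_id": "...", "page_number": 12, "section_heading": "Refunds"}],
#   "confidence": 0.82,
#   "query_id": "q-789"
# }
```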
Scaling & Bottlenecks
LLM generation latency (1–2 seconds) dominates end-to-end latency. Mitigation: use streaming generation (return tokens as they are produced) to improve perceived responsiveness; the user sees the first token within 500ms even if the full answer takes 2 seconds. For high-QPS deployments, an LLM response cache (keyed by query_hash + retrieved_chunk_hashes, TTL 1 hour) serves repeated identical queries without LLM invocation, reducing cost and latency by 40% for FAQ-style workloads.
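A sketch of the cache lookup, assuming redis-py; keying on the retrieved chunk hashes means a re-indexed corpus naturally invalidates stale answers:

```python
import hashlib
import json

import redis

r = redis.Redis()

def cache_key(query: str, chunk_hashes: list[str]) -> str:
    # Key on both the query and the retrieved evidence: if the index changes
    # and different chunks come back, the cache misses and we regenerate.
    payload = json.dumps({"q": query, "chunks": sorted(chunk_hashes)})
    return "rag:answer:" + hashlib.sha256(payload.encode()).hexdigest()

def get_or_generate(query: str, chunk_hashes: list[str], generate) -> str:
    key = cache_key(query, chunk_hashes)
    if (cached := r.get(key)) is not None:
        return cached.decode()
    answer = generate()             # the expensive LLM call
    r.setex(key, 3600, answer)      # TTL 1 hour, as above
    return answer
```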
Re-ranking 20 candidates with a cross-encoder at 80 ms is the second-largest latency contributor. Distilled cross-encoders (6-layer vs. 12-layer) run in 40 ms with <2% quality degradation. Batching all 20 (query, chunk) pairs into a single inference call maximizes GPU utilization. If re-ranking latency remains a bottleneck, a lighter-weight late-interaction re-ranker (ColBERT) achieves similar quality at around 10 ms.
Key Trade-offs
- Chunk size: Smaller chunks improve retrieval precision (the retrieved text is more focused on the relevant topic) but lose surrounding context needed for coherent answers; larger chunks preserve context but dilute relevance signals.
- Bi-encoder vs. cross-encoder: Bi-encoder (dual embedding) scales to millions of documents with sub-millisecond lookup but has lower precision; cross-encoder is highly precise but cannot scale to millions of candidates — the two-stage pipeline uses both optimally.
- RAG vs. fine-tuning: RAG provides up-to-date knowledge from a dynamic corpus without retraining; fine-tuning bakes knowledge into model weights for faster inference but requires retraining when knowledge changes.
- Answer generation vs. extraction: Generative answers are more natural and synthesize across multiple chunks but can hallucinate; extractive answers (returning the most relevant chunk verbatim) are fully grounded but may not answer complex questions that require synthesis.