
System Design: Document Search Platform

Design an enterprise document search platform that indexes PDFs, Word documents, spreadsheets, and presentations, enabling full-text and semantic search across millions of files. Covers OCR, document parsing, and access control.

13 min read · Updated Jan 15, 2025

Tags: system-design, document-search, enterprise-search, ocr, elasticsearch, access-control

Requirements

Functional Requirements:

  • Index PDF, DOCX, XLSX, PPTX, HTML, and plain text files
  • Full-text search with phrase queries, boolean operators, and field search
  • Semantic search for concept-level matching beyond keywords
  • Enforce per-document access control (RBAC/ABAC) — users only see documents they have permission to view
  • Highlight matching passages in document previews
  • Support versioning — index and search multiple versions of a document

Non-Functional Requirements:

  • Sub-500ms search latency
  • Index 100 million documents up to 50 MB each
  • Support 10,000 concurrent users
  • Access control enforcement with zero tolerance for leakage
  • New documents indexed within 5 minutes of upload

Scale Estimation

With 100 million documents averaging 200 KB of extractable text, the raw text corpus is 20 TB. After tokenization and compression in the inverted index, the index is ~4 TB. Assuming average document length of 50 pages (50,000 tokens), each document generates ~200 text chunks for semantic search. Total chunk count: 20 billion — too large for a single vector index. A hierarchical approach (coarse document-level + fine passage-level) reduces ANN index size. Access control metadata (ACL lists per document) adds ~50 bytes per document: 5 GB total.
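The arithmetic above can be checked directly; the constants below come straight from the estimates in this section.

```python
# Back-of-envelope scale estimation, using the numbers from the text.
DOCS = 100_000_000            # 100 million documents
TEXT_PER_DOC_KB = 200         # average extractable text per document
CHUNKS_PER_DOC = 200          # passage chunks per document for semantic search
ACL_BYTES_PER_DOC = 50        # ACL metadata per document

raw_text_tb = DOCS * TEXT_PER_DOC_KB / 1e9   # KB -> TB (decimal units)
total_chunks = DOCS * CHUNKS_PER_DOC         # passage-level vector count
acl_gb = DOCS * ACL_BYTES_PER_DOC / 1e9      # ACL metadata footprint

# raw_text_tb == 20.0, total_chunks == 20 billion, acl_gb == 5.0
```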

High-Level Architecture

The system has four pipelines: ingestion, parsing, indexing, and query serving. Ingestion receives files from storage systems (SharePoint, Google Drive, S3, email) via connectors or direct upload API. Parsing extracts text content from binary formats. Indexing writes extracted content to the search engine and computes embeddings. Query serving handles user searches, enforces access control, and assembles result previews.

Document parsing is format-specific: Apache Tika handles 1,000+ formats and extracts raw text, metadata (author, creation date, last modified), and structure (headings, paragraphs, tables). PDF parsing uses a combination of text layer extraction (for digital PDFs) and OCR (Tesseract) for scanned images. A document classifier detects scanned vs. digital PDFs to route to the appropriate parser. Extracted content is written to an intermediate JSON format with sections, paragraphs, and metadata fields.
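The scanned-vs-digital routing decision can be sketched as follows. This is a minimal illustration, assuming per-page character counts from a text-layer extraction pass; the function name and thresholds are hypothetical, not from the article.

```python
def route_pdf(chars_per_page: list[int], min_chars: int = 50) -> str:
    """Route a PDF to the text-layer parser or the OCR pipeline.

    A digital PDF yields substantial text on most pages from its
    embedded text layer; a scanned PDF yields little or none.
    Threshold values here are illustrative, not tuned.
    """
    if not chars_per_page:
        return "ocr"
    pages_with_text = sum(1 for c in chars_per_page if c >= min_chars)
    # If fewer than half the pages have a usable text layer, treat as scanned.
    return "text_layer" if pages_with_text / len(chars_per_page) >= 0.5 else "ocr"
```

A production classifier would also consider image-to-text area ratios per page, but the half-the-pages heuristic captures the routing idea.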

Access control is enforced at query time using a security filter. Each document in the index stores its ACL as a set of allowed user IDs and group IDs. At query time, the user's identity (from an auth token) is used to fetch their group memberships from an LDAP/Active Directory service. These memberships are added to the search query as a mandatory filter: only documents whose acl field contains the user's ID or one of their groups are returned. This is implemented as an Elasticsearch terms filter with caching to avoid per-query LDAP lookups.
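The query body described above can be sketched as a plain dict to pass to an Elasticsearch client; the field names match the index schema introduced later, and the function name is an illustrative assumption.

```python
def build_search_body(query_text: str, user_id: str, groups: list[str]) -> dict:
    """Full-text query combined with a mandatory ACL terms filter.

    Only documents whose `acl` array contains the user's ID or one of
    their group IDs can match, regardless of relevance score.
    """
    return {
        "query": {
            "bool": {
                "must": {"match": {"content": query_text}},
                "filter": {"terms": {"acl": [user_id, *groups]}},
            }
        },
        "_source": {"excludes": ["acl", "embedding"]},  # never expose ACLs
        "highlight": {"fields": {"content": {}}},
    }
```

Putting the ACL clause under `filter` rather than `must` keeps it out of scoring and makes it eligible for Elasticsearch's filter cache.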

Core Components

Document Parser & OCR Service

The parser service is a pool of worker nodes that receive parsing jobs from a Kafka queue. Each job references a document in object storage (S3). Workers pull the document, detect its MIME type, and route to the appropriate parser: Tika for most formats, pdf2image + Tesseract for scanned PDFs, python-docx for DOCX, openpyxl for XLSX. OCR is parallelized per page (Tesseract's LSTM engine, --oem 1). Extracted text is segmented into sections using heuristics (heading detection, paragraph breaks). The output JSON includes sections with char offsets for later highlighting.
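The section-segmentation step with character offsets can be sketched as below; this minimal version splits only on blank-line paragraph breaks, and the function name is an assumption.

```python
def segment_sections(text: str) -> list[dict]:
    """Split extracted text into sections on blank-line paragraph breaks,
    recording character offsets so search matches can be mapped back to
    exact positions in the original document for highlighting."""
    sections, offset = [], 0
    for para in text.split("\n\n"):
        if para.strip():
            start = text.index(para, offset)  # locate para after previous one
            sections.append({"text": para, "start": start, "end": start + len(para)})
            offset = start + len(para)
    return sections
```

The real pipeline would add heading detection on top, but the offset bookkeeping is the part the highlighter depends on.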

Elasticsearch Index with Access Control

Documents are indexed in Elasticsearch with fields: doc_id, filename, title (analyzed), content (analyzed, with term vectors for highlighting), author, created_at, modified_at, file_type, acl (keyword array of user/group IDs), embedding (dense_vector for semantic search). The ACL field is never exposed in search results (source filtering). Highlighting uses Elasticsearch's Fast Vector Highlighter on the content field. Version management stores each document version as a separate document with a version field, and a latest_version flag filters results to show only the most recent by default.
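The field list above translates to a mapping roughly like the following. This is an illustrative sketch: the embedding dimension (768) and the exact term-vector setting are assumptions not stated in the article.

```python
# Illustrative Elasticsearch mapping for the fields listed above.
DOC_MAPPING = {
    "properties": {
        "doc_id":         {"type": "keyword"},
        "filename":       {"type": "keyword"},
        "title":          {"type": "text"},
        "content":        {"type": "text",
                           # positions + offsets enable the Fast Vector Highlighter
                           "term_vector": "with_positions_offsets"},
        "author":         {"type": "keyword"},
        "created_at":     {"type": "date"},
        "modified_at":    {"type": "date"},
        "file_type":      {"type": "keyword"},
        "acl":            {"type": "keyword"},   # user and group IDs
        "version":        {"type": "integer"},
        "latest_version": {"type": "boolean"},
        "embedding":      {"type": "dense_vector", "dims": 768},  # dims assumed
    }
}
```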

Access Control Enforcement

User group memberships are cached in Redis (TTL: 5 minutes) to avoid LDAP round-trips on every search request. The security filter is pre-computed: given a user's groups [g1, g2, g3], the query becomes: filter: { terms: { acl: [userId, g1, g2, g3] } }. This filter is cached in Elasticsearch's filter cache (LRU, 10% of JVM heap) since the same group set is reused across many queries by the same user. For highly sensitive documents, a post-query ACL check validates results against the authoritative ACL store before returning, providing defense-in-depth against index inconsistencies.
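The cached filter construction can be sketched as below. An in-process dict stands in for Redis, and `fetch_groups` stands in for the LDAP/AD lookup; both names are hypothetical.

```python
import time

GROUP_TTL_S = 300  # 5-minute cache, as described above
_cache: dict[str, tuple[float, list[str]]] = {}

def acl_filter(user_id: str, fetch_groups, now=time.time) -> dict:
    """Build the ACL terms filter, caching group memberships so most
    requests avoid an LDAP round-trip. A real deployment would back
    this with Redis rather than process-local memory."""
    hit = _cache.get(user_id)
    if hit is None or now() - hit[0] > GROUP_TTL_S:
        hit = (now(), fetch_groups(user_id))  # cache miss or expired: hit LDAP
        _cache[user_id] = hit
    return {"terms": {"acl": [user_id, *hit[1]]}}
```

Because the same user produces an identical filter until the TTL expires, Elasticsearch's filter cache gets a high hit rate on repeat queries.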

Database Design

Document metadata is stored in PostgreSQL (doc_id, file_path, file_type, owner_id, acl_version, indexed_at, last_modified, version_count). ACL data is stored in a dedicated ACL service backed by PostgreSQL, with Redis caching. The search index (Elasticsearch) stores processed text and embeddings. Raw files are stored in S3, with presigned URLs used for document previews. A processing status table (DynamoDB or Redis) tracks each document's indexing state (uploaded, parsing, indexed, failed) for the ingestion pipeline's job management.
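The processing-status states named above imply a small state machine for the ingestion pipeline's job management. A minimal sketch, assuming (not stated in the article) that failed jobs may be retried:

```python
# Allowed indexing-state transitions for the processing status table.
TRANSITIONS = {
    "uploaded": {"parsing"},
    "parsing":  {"indexed", "failed"},
    "failed":   {"parsing"},   # assumption: failed jobs can be re-queued
    "indexed":  set(),         # terminal until a new version is uploaded
}

def advance(current: str, target: str) -> str:
    """Validate and apply a state transition for a document's indexing job."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```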

API Design

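The article does not enumerate endpoints; a plausible surface for this design, in which every route, parameter, and response field is an assumption, might look like:

```python
# Hypothetical REST surface (all routes and fields are illustrative):
#
#   POST /v1/documents                 upload a file; returns doc_id, status=uploaded
#   GET  /v1/documents/{id}/status     indexing state: uploaded|parsing|indexed|failed
#   GET  /v1/search?q=...&mode=keyword|semantic|hybrid&page=1
#   GET  /v1/documents/{id}/preview    presigned S3 URL plus highlighted snippets
#
# Example search response shape:
EXAMPLE_SEARCH_RESPONSE = {
    "total": 1,
    "results": [{
        "doc_id": "d42",
        "title": "Q3 Budget",
        "snippet": "…the <em>budget</em> forecast…",  # highlighter output
        "version": 3,
        "score": 12.7,
    }],
}
```

Note that the response never includes the acl field; source filtering strips it server-side, as described earlier.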
Scaling & Bottlenecks

OCR is the throughput bottleneck for scanned document indexing. Tesseract processes ~1 page/second per CPU core, so serially OCRing 100 million scanned documents averaging 20 pages each (2 billion pages) would take 2 billion seconds. A distributed OCR farm of 1,000 CPU cores processes 1,000 pages/sec and clears the 2-billion-page backlog in ~23 days. For ongoing ingestion, size the fleet for the expected upload rate: 100,000 new pages/day requires only ~2 OCR worker cores running continuously. GPU-accelerated OCR engines can offer a further 10–20x speedup over CPU Tesseract.
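The fleet-sizing arithmetic above, checked directly:

```python
# OCR fleet sizing from the numbers in the text.
PAGES = 100_000_000 * 20      # 2 billion pages in the scanned backlog
RATE_PER_CORE = 1             # pages/sec per CPU core (Tesseract)
CORES = 1_000

backlog_days = PAGES / (RATE_PER_CORE * CORES) / 86_400   # ~23.1 days

daily_pages = 100_000         # steady-state ingestion rate
steady_cores = daily_pages / 86_400 / RATE_PER_CORE       # ~1.16 -> provision 2
```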

Access control filtering at scale creates a challenge: an Elasticsearch terms filter with 100,000 values (from very large group memberships) is slow. Strategies to handle large ACL sets: (1) document field fingerprinting — hash the ACL set to a compact integer ID; (2) role-based simplification — map fine-grained ACLs to broader role buckets; (3) pre-sharding by department — route queries for a given user to a shard containing only documents accessible to their department, reducing per-query ACL filter size.
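Strategy (1) can be sketched in a few lines. This is a minimal illustration; the digest length, and the assumption that a truncated SHA-256 gives an acceptable collision risk, are design choices not specified above.

```python
import hashlib

def acl_fingerprint(principals: set[str]) -> str:
    """Hash an ACL set to a compact ID. Documents store one fingerprint
    instead of a long principal list; queries filter on the fingerprints
    of ACL sets the user belongs to. Sorting first makes the result
    order-independent."""
    digest = hashlib.sha256("\x00".join(sorted(principals)).encode()).hexdigest()
    return digest[:16]  # 64-bit fingerprint; collision risk assumed acceptable
```

The trade-off: any ACL change produces a new fingerprint, so the mapping from users to reachable fingerprints must be kept in sync.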

Key Trade-offs

  • OCR quality vs. speed: LSTM-based OCR (high accuracy) takes 3–5x longer than legacy Tesseract; scanned document quality varies widely; consider quality-tiered routing
  • ACL index freshness vs. query performance: Caching group memberships reduces LDAP load but risks serving results for documents whose ACL has changed in the last 5 minutes
  • Full re-index vs. delta updates: When document content changes, full reindexing is simple but expensive; partial updates (new version as new document) avoid reindex cost at the price of version management complexity
  • Per-document ACL vs. per-folder ACL: Folder-level ACLs reduce metadata overhead but cannot support document-level exceptions without reverting to per-document ACLs
