
System Design: Public Records Management System

Design a government public records management system supporting document ingestion, indexed search, access request workflows, and long-term archival. Covers FOIA compliance, auditability, retention policies, and accessibility.

12 min read · Updated Jan 15, 2025
system-design · government · records-management · foia · archival · document-search

Requirements

Functional Requirements:

  • Ingest documents from multiple agency systems (email, scanned PDFs, electronic forms)
  • Full-text search and metadata search across all indexed records
  • FOIA request workflow: citizen submits request → agency reviews → redacts sensitive content → publishes approved records
  • Role-based access: public users see approved records; agency staff see all; specific roles can perform redaction
  • Retention schedule enforcement: automatically flag or archive records per statutory retention periods
  • Bulk export for data requests and cross-agency discovery

Non-Functional Requirements:

  • Retain records for 25-100 years depending on record type; archival integrity over decades
  • 99th-percentile search latency under 2 seconds across 1B+ indexed documents
  • All access events logged with user, record, timestamp, and action
  • System must comply with Section 508 accessibility standards
  • Disaster recovery RPO of 1 hour, RTO of 4 hours

Scale Estimation

For a federal agency: 50 agencies × 10M documents each = 500M documents. Annual ingestion: 20M new documents/year. Average document size: 500KB → 250TB total. Search traffic: 1M queries/day = ~12/second. FOIA requests: ~100k/year with an average of 500 documents per request to review. Retention enforcement: nightly batch over 500M records.
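The back-of-envelope figures above can be verified with a few lines of arithmetic (using decimal units, i.e. 1 TB = 10^12 bytes):

```python
# Sanity-check of the scale estimates above.
agencies = 50
docs_per_agency = 10_000_000
total_docs = agencies * docs_per_agency       # 500M documents

avg_doc_bytes = 500_000                       # 500 KB average document
total_storage_tb = total_docs * avg_doc_bytes / 1e12  # 250 TB

queries_per_day = 1_000_000
qps = queries_per_day / 86_400                # ~12 queries/second
```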

High-Level Architecture

The system uses an ingestion pipeline, a document store, a search index, and a workflow engine for FOIA processing. Documents enter through an Ingestion Service that accepts uploads from agency systems via SFTP, API, or email integration. OCR processing converts scanned documents to searchable text. Metadata (agency, date, document type, classification) is extracted by an ML enrichment service and confirmed through human-reviewed tags.

The Document Store is a tiered storage architecture: recently ingested documents in S3 Standard; documents older than 2 years in S3 Intelligent-Tiering; documents older than 10 years in S3 Glacier with retrieval SLAs acceptable for archival access patterns. A metadata database in PostgreSQL stores searchable fields, access control lists, and retention metadata, allowing fast queries without retrieving document content.
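The tiering policy described above maps naturally onto an S3 lifecycle configuration. A minimal sketch of the rule set (the rule ID and prefix are illustrative; with boto3 it would be applied via `put_bucket_lifecycle_configuration`):

```python
# S3 lifecycle rules implementing the tiered storage policy described
# above. New objects start in S3 Standard by default; transitions move
# them to cheaper tiers as they age.
lifecycle_rules = {
    "Rules": [
        {
            "ID": "tiered-archival",          # illustrative rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},          # apply to all documents
            "Transitions": [
                # ~2 years: move to Intelligent-Tiering
                {"Days": 730, "StorageClass": "INTELLIGENT_TIERING"},
                # ~10 years: move to Glacier for deep archival
                {"Days": 3650, "StorageClass": "GLACIER"},
            ],
        }
    ]
}
```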

The FOIA Workflow Engine is a state-machine-driven process with stages: Received → Agency Review → Redaction → Legal Review → Approved/Denied → Published. Each stage has assignable queues for agency staff, configurable SLA timers (most FOIA responses are legally required within 20 business days), and automated reminder notifications. Redaction is performed in a sandboxed web viewer that never exposes original files to the redactor's local machine.
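The stage progression can be sketched as an explicit transition table; the stage names follow the pipeline above, while the exact set of legal transitions is an assumption:

```python
# FOIA workflow state machine: each stage maps to its allowed successors.
TRANSITIONS = {
    "Received":      {"Agency Review"},
    "Agency Review": {"Redaction", "Denied"},
    "Redaction":     {"Legal Review"},
    "Legal Review":  {"Approved", "Denied"},
    "Approved":      {"Published"},
    "Denied":        set(),   # terminal
    "Published":     set(),   # terminal
}

def advance(current: str, target: str) -> str:
    """Move a request to the next stage, rejecting illegal jumps."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current!r} -> {target!r}")
    return target
```

Encoding the transitions as data rather than scattered if/else branches makes the SLA timers and reminder hooks easy to attach per stage.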

Core Components

Ingestion & OCR Pipeline

A queue-driven pipeline where each document ingestion job goes through: virus scanning → format normalization (to PDF/A for long-term archival) → OCR via Tesseract or AWS Textract for scanned images → metadata extraction via NLP model (entity recognition for agency names, dates, subjects) → checksum computation → storage and indexing. The pipeline is idempotent: re-ingesting an already-indexed document with the same checksum is a no-op.

Search Service

An Elasticsearch cluster with per-agency indexes, supporting full-text search with relevance ranking, faceted filtering (agency, date range, document type, classification level), and Boolean query syntax. An access-control layer in the search service filters results to only documents the requesting role is permitted to view, applied at query time via document-level security. Suggest/autocomplete is powered by edge n-gram analyzers on title and subject fields.
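The query-time access-control filter can be sketched as a function that builds the Elasticsearch query body from the caller's role; the field names (`classification_level`, `status`, `content`) and the role-to-visibility mapping are assumptions for illustration:

```python
# Query-time document-level security: the caller's role determines
# which classification levels appear in the results filter.
ROLE_VISIBILITY = {
    "public":       ["public"],
    "agency_staff": ["public", "internal", "restricted"],
}

def build_search_query(text: str, role: str) -> dict:
    """Build an ES bool query with role-based filters attached."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],
                "filter": [
                    {"terms": {"classification_level": ROLE_VISIBILITY[role]}},
                    # public users only ever see approved records
                    {"term": {"status": "approved"}} if role == "public"
                    else {"match_all": {}},
                ],
            }
        }
    }
```

Applying the filter server-side, inside the search service, means a client can never widen its own visibility by editing the query.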

Retention & Lifecycle Engine

A nightly batch job evaluates every document against its retention schedule (stored as metadata). Documents reaching their retention trigger date are flagged for review rather than automatically deleted — a human approval workflow ensures legal holds are respected. Approved-for-deletion records are first exported to a cryptographically sealed archive package stored in cold storage, then the primary copy is deleted. The archive package itself has its own immutable retention record.
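The core of the nightly sweep is a simple predicate: past the trigger date, and not under legal hold. A sketch, with the document records as plain dicts (field names follow the schema described below; `legal_hold` is an assumed flag):

```python
from datetime import date

# Nightly retention sweep: flag documents for human review; nothing is
# deleted automatically.
def flag_for_review(documents: list[dict], today: date) -> list[str]:
    """Return IDs of documents whose retention trigger date has passed
    and which are not under a declared legal hold."""
    return [
        d["document_id"]
        for d in documents
        if d["retention_trigger_date"] <= today and not d.get("legal_hold")
    ]
```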

Database Design

PostgreSQL stores document metadata: document_id UUID, agency_id, document_type, title, created_date, classification_level ENUM, status ENUM, s3_key, retention_trigger_date, access_group[]. A document_access_log table captures every view, download, or search result click: log_id, document_id, user_id, action, timestamp, ip_address. This table is append-only with row-level delete privileges restricted to a compliance role.
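The append-only property can be enforced in the database itself. A simplified sketch using SQLite for portability (production would use PostgreSQL row-level privileges as described above; here a trigger blocks deletes outright):

```python
import sqlite3

# Simplified append-only access log: a BEFORE DELETE trigger aborts any
# attempt to remove rows, mirroring the restricted delete privilege.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document_access_log (
    log_id      INTEGER PRIMARY KEY,
    document_id TEXT NOT NULL,
    user_id     TEXT NOT NULL,
    action      TEXT NOT NULL,
    ts          TEXT NOT NULL,
    ip_address  TEXT
);
CREATE TRIGGER no_delete BEFORE DELETE ON document_access_log
BEGIN
    SELECT RAISE(ABORT, 'access log is append-only');
END;
""")
conn.execute(
    "INSERT INTO document_access_log VALUES "
    "(1, 'doc-1', 'u-1', 'view', '2025-01-15T00:00:00Z', NULL)"
)
```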

A separate foia_requests table manages the FOIA workflow: request_id, requester_id, description, status, assigned_to, due_date, responsive_documents[]. Related foia_notes and foia_redaction_records tables capture reviewer notes and the mapping of original to redacted document versions. Redacted documents are stored as a separate S3 object; the original is never modified.

API Design

POST /api/v1/documents — internal agency endpoint to ingest a new document; returns {document_id, ingest_job_id}.

GET /api/v1/documents/search?q={query}&agency={id}&from={date}&to={date} — public search with access-control filtering.

POST /api/v1/foia/requests — citizen submits a FOIA request with description and contact information.

PUT /api/v1/foia/requests/{requestId}/respond — agency endpoint to upload approved/redacted documents and set status to Published.
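A client-side sketch of the public search call, showing how the query parameters above are assembled (the host name is an assumption):

```python
from urllib.parse import urlencode

# Illustrative base URL; the real deployment host would differ.
BASE = "https://records.example.gov/api/v1"

def search_url(q: str, agency: str, date_from: str, date_to: str) -> str:
    """Build the GET /documents/search URL with properly encoded params."""
    params = urlencode({"q": q, "agency": agency,
                        "from": date_from, "to": date_to})
    return f"{BASE}/documents/search?{params}"
```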

Scaling & Bottlenecks

Elasticsearch scaling for 500M+ documents requires careful index sharding. Documents are sharded by agency and year, with alias-based routing directing queries to the relevant shard groups. Hot shards (current year, frequently searched agencies) get more replicas than cold shards (archive years). A tiered query routing strategy sends simple metadata queries to a lightweight PostgreSQL full-text index and complex content searches to Elasticsearch, reducing load on the cluster.
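The agency-and-year sharding scheme can be sketched as a routing helper that resolves a date-bounded query to its concrete indices; the `records-{agency}-{year}` naming convention is an assumption:

```python
# Alias-based routing: a query bounded by agency and year range only
# touches the shard groups that can contain matching documents.
def index_name(agency_id: str, year: int) -> str:
    return f"records-{agency_id}-{year}"

def indices_for_query(agency_id: str, year_from: int, year_to: int) -> list[str]:
    """Resolve a date-bounded query to the concrete per-year indices."""
    return [index_name(agency_id, y) for y in range(year_from, year_to + 1)]
```

Because each query names only the relevant indices, cold archive-year shards stay untouched by the bulk of the traffic, which is what lets them run with fewer replicas.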

The OCR pipeline is CPU-intensive and bursty during mass document ingestion events. An auto-scaling worker fleet backed by a durable SQS queue handles bursts — documents may take minutes to hours to become searchable after ingestion, which is acceptable for archival records. Priority queues ensure FOIA-responsive documents are indexed before batch-ingested archival materials.
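The priority scheme can be sketched with a simple two-level priority queue, where FOIA-responsive documents always drain first (SQS would model this as two queues; a heap illustrates the ordering):

```python
import heapq

# Priority 0 = FOIA-responsive, priority 1 = batch archival ingest.
# The counter breaks ties so FIFO order holds within a priority level.
queue: list[tuple[int, int, str]] = []
counter = 0

def enqueue(doc_id: str, foia_responsive: bool) -> None:
    global counter
    heapq.heappush(queue, (0 if foia_responsive else 1, counter, doc_id))
    counter += 1

def dequeue() -> str:
    return heapq.heappop(queue)[2]
```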

Key Trade-offs

  • Immediate vs. eventual search availability: Synchronous indexing would make documents searchable instantly but slows ingestion throughput; async indexing via pipeline is faster for bulk loads but introduces a search availability delay.
  • Centralized vs. federated records management: A single system simplifies search but requires all agencies to adopt it; a federated model with cross-agency search federation is more politically tractable but harder to maintain consistently.
  • Automatic deletion vs. human approval: Fully automated retention enforcement is efficient but risks deleting records under undeclared legal hold; human-in-the-loop deletion review adds friction but reduces legal risk.
  • PDF/A archival format: Converting all documents to PDF/A ensures long-term readability but loses metadata embedded in native formats (Word, Excel); archival standardization comes at the cost of format fidelity.
