System Design: NLP Processing Pipeline
Design an NLP processing pipeline that handles text classification, named entity recognition, sentiment analysis, and semantic search at scale. Covers tokenization, model serving, batch vs. streaming processing, and multi-language support.
Requirements
Functional Requirements:
- Process text documents through a configurable pipeline of NLP tasks: language detection, tokenization, POS tagging, NER, classification, and sentiment analysis
- Support batch processing (millions of documents) and real-time processing (individual documents)
- Enable semantic search: embed documents and queries using sentence transformers for dense retrieval
- Support 50 languages, degrading model quality gracefully for low-resource languages
- Allow pipeline composition: different tasks can be chained or run in parallel per use case
- Output results in a structured schema and write to multiple sinks (Elasticsearch, database, event stream)
Non-Functional Requirements:
- Real-time pipeline latency under 200ms for a 512-token document
- Batch pipeline throughput of 1 million documents per hour
- 99.9% availability for the real-time pipeline
- Model updates deployable with zero downtime using blue-green model loading
- Horizontally scalable: adding GPUs must linearly increase batch throughput
Scale Estimation
1 million documents/hour batch = 278 docs/second. A 512-token document through a BERT-base model on GPU takes 20ms, so a single T4 GPU processes 50 docs/second; 278 docs/second therefore needs a minimum of 6 T4 GPUs, and provisioning 7 adds roughly 20% headroom. For real-time: by Little's law, 1,000 requests/second at 200ms latency = 200 concurrent in-flight requests across the GPU pool. Embedding generation (for semantic search): a sentence transformer produces 768-dimensional vectors; 1 million docs * 768 dims * 4 bytes = ~3 GB of embeddings per million documents.
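The headline figures follow from a few lines of arithmetic; a quick sanity check in Python, using only the numbers stated above:

```python
import math

# Batch throughput target
docs_per_sec = 1_000_000 / 3600            # ~278 docs/s
docs_per_gpu = 1 / 0.020                   # 50 docs/s per T4 at 20ms/doc
min_gpus = math.ceil(docs_per_sec / docs_per_gpu)          # 6 GPUs, no margin
gpus_20pct = math.ceil(1.2 * docs_per_sec / docs_per_gpu)  # 7 GPUs, 20% headroom

# Real-time concurrency via Little's law: L = arrival rate x latency
in_flight = 1000 * 0.200                   # 200 concurrent requests

# Embedding storage: 768 float32 dimensions per document
bytes_per_million_docs = 1_000_000 * 768 * 4   # ~3 GB
```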
High-Level Architecture
The NLP pipeline is organized as a directed acyclic graph of processing nodes. Each node is a stateless worker that receives a document, applies a transformation (tokenization, model inference, post-processing), and emits enriched output. The pipeline coordinator (a lightweight router service) determines which nodes to invoke for a given request type (defined by the use case config) and orchestrates parallel vs. sequential execution.
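As a sketch of what a use-case config and coordinator loop might look like, assuming a simple stage-list encoding of the DAG (task and use-case names are illustrative, not from the source):

```python
# Each use case declares its pipeline as a list of stages; nodes within a
# stage are independent of one another and can be dispatched in parallel.
PIPELINE_CONFIGS = {
    "support_tickets": [
        ["language_detection"],
        ["tokenization"],
        ["ner", "classification", "sentiment"],  # fan-out stage
        ["postprocess"],
    ],
}

def run_pipeline(use_case: str, doc: dict, nodes: dict) -> dict:
    """Coordinator sketch: nodes maps task names to callables that take the
    document dict and return a dict of enrichments to merge back in."""
    for stage in PIPELINE_CONFIGS[use_case]:
        # A real coordinator would dispatch stage members concurrently.
        for task in stage:
            doc = {**doc, **nodes[task](doc)}
    return doc
```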
For real-time processing, the pipeline runs synchronously within a 200ms budget. Language detection (FastText, 1ms) runs first to select the appropriate language-specific models. Tokenization (the HuggingFace tokenizers library, implemented in Rust for speed) runs in 2ms. GPU inference runs batched across concurrent requests using NVIDIA Triton Inference Server with dynamic batching. Post-processing (converting logits to labels, filtering entities by confidence) runs CPU-side in parallel with the next request's preprocessing.
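For the inference hop specifically, a minimal call through NVIDIA's tritonclient package might look as follows; the model name, tensor names, and shapes are assumptions, and preprocessing is presumed to have padded inputs to 512 tokens:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def classify(token_ids: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Single synchronous inference call; Triton's dynamic batcher groups
    concurrent calls arriving within the batching window into one GPU pass."""
    inputs = [
        httpclient.InferInput("input_ids", list(token_ids.shape), "INT64"),
        httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
    ]
    inputs[0].set_data_from_numpy(token_ids)
    inputs[1].set_data_from_numpy(attention_mask)
    result = client.infer(model_name="bert_classifier", inputs=inputs)
    return result.as_numpy("logits")
```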
For batch processing, documents are partitioned into shards of 10,000 documents, published to Kafka, and consumed by GPU worker pods. Each pod loads the model into GPU memory once and processes the entire shard, amortizing model load overhead. Results are written to S3 in Parquet format and, for search use cases, bulk-indexed into Elasticsearch.
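A batch GPU worker might look roughly like this, assuming confluent-kafka and shard messages that carry S3 paths; load_model_onto_gpu, read_parquet_from_s3, write_parquet_to_s3, and the model's predict_batch are hypothetical helpers standing in for model- and storage-specific code:

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "nlp-batch-gpu-workers",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["batch-shards"])  # topic name is illustrative

model = load_model_onto_gpu()  # hypothetical: load weights once, reuse per shard

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    shard = json.loads(msg.value())                # e.g. {"s3_path": ..., "count": 10000}
    docs = read_parquet_from_s3(shard["s3_path"])   # hypothetical helper
    results = model.predict_batch(docs)             # one loaded model amortized over 10k docs
    write_parquet_to_s3(results, shard["s3_path"] + ".out")  # hypothetical helper
    consumer.commit(message=msg)   # commit only after results are durable
```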
Core Components
Model Serving (Triton Inference Server)
NVIDIA Triton serves all NLP models with dynamic batching: requests arriving within a 5ms window are grouped into a batch for a single GPU forward pass. Models are stored in TensorRT format (BERT-base converted to TensorRT achieves 3x speedup vs. PyTorch on T4). Concurrent model instances allow multiple models (NER, classification, sentiment) to run on the same GPU in parallel using GPU time-slicing. Model versions are managed through Triton's model repository; new versions are loaded while old versions continue serving, enabling zero-downtime updates.
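Dynamic batching and concurrent instances are declared per model in a config.pbtxt inside Triton's model repository; a minimal sketch mirroring the 5ms window described above (model name and batch sizes are illustrative):

```
name: "bert_classifier"
platform: "tensorrt_plan"
max_batch_size: 32

dynamic_batching {
  max_queue_delay_microseconds: 5000   # hold requests up to 5ms to form a batch
  preferred_batch_size: [8, 16, 32]
}

instance_group [
  { count: 2, kind: KIND_GPU }         # two concurrent instances on the same GPU
]
```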
Text Preprocessing Service
A high-performance tokenization service (the Rust-implemented HuggingFace tokenizers library with Python bindings) handles Unicode normalization, lowercasing, byte-pair encoding, and padding to fixed sequence lengths. For multilingual support, the XLM-RoBERTa tokenizer covers 100 languages with a shared subword vocabulary. A language detection service (FastText's lid.176.bin model, 1ms inference) runs before tokenization to route documents to the appropriate language-specific fine-tuned model when available.
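A sketch of this preprocessing path using the fasttext and transformers packages (the model files are the ones named above; the 512-token padding length is carried over from the latency budget):

```python
import fasttext
from transformers import AutoTokenizer

lang_model = fasttext.load_model("lid.176.bin")        # FastText language ID, ~1ms
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # shared subword vocab

def preprocess(text: str):
    # fasttext rejects multi-line input and returns labels like "__label__en".
    labels, _scores = lang_model.predict(text.replace("\n", " "))
    language = labels[0].removeprefix("__label__")
    encoded = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors="np",
    )
    return language, encoded["input_ids"], encoded["attention_mask"]
```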
Semantic Embedding Store
Sentence transformer models (all-mpnet-base-v2 or multilingual-e5-base) produce 768-dimensional embeddings per document. Embeddings are stored in a vector database (Qdrant, Weaviate, or Pinecone) for semantic search. The indexing pipeline embeds new documents as they arrive and upserts them into the vector store. At query time, the query is embedded using the same model, and approximate nearest neighbor search (HNSW) retrieves the 100 most similar documents in under 50ms across a corpus of 10 million embeddings.
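Indexing and querying against Qdrant, as a sketch; the collection name and payload are illustrative, and the embedding model matches the 768-dimensional choice above:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")      # 768-dim embeddings
client = QdrantClient(url="http://localhost:6333")

def index_document(doc_id: int, text: str) -> None:
    client.upsert(
        collection_name="documents",
        points=[PointStruct(id=doc_id, vector=model.encode(text).tolist(),
                            payload={"text": text})],
    )

def semantic_search(query: str, top_k: int = 100):
    # HNSW approximate nearest-neighbor search in the same embedding space.
    return client.search(
        collection_name="documents",
        query_vector=model.encode(query).tolist(),
        limit=top_k,
    )
```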
Database Design
Processing results are stored in PostgreSQL with JSONB for flexible NLP output: processed_documents (doc_id, source_id, language, processing_pipeline_version, classification_results JSONB, entities_json JSONB, sentiment_score FLOAT, embedding_vector VECTOR(768), processed_at). The pgvector extension enables cosine similarity search directly in PostgreSQL for smaller corpora (<1 million documents), eliminating the need for a separate vector store. For larger corpora, embeddings are offloaded to Qdrant.
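A pgvector query sketch for the small-corpus path; `<=>` is pgvector's cosine-distance operator, so similarity is one minus it (the connection string is illustrative):

```python
import psycopg2

conn = psycopg2.connect("dbname=nlp")  # illustrative connection string

def pg_semantic_search(query_vec: list[float], top_k: int = 100):
    # pgvector expects a literal like '[0.1,0.2,...]' cast to the vector type.
    literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT doc_id, 1 - (embedding_vector <=> %s::vector) AS cosine_similarity
            FROM processed_documents
            ORDER BY embedding_vector <=> %s::vector
            LIMIT %s
            """,
            (literal, literal, top_k),
        )
        return cur.fetchall()
```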
API Design
POST /process — Submit a document for real-time NLP processing; returns classification labels, entities, sentiment, and an optional embedding (example call after this list).
POST /batch/jobs — Submit a batch processing job with S3 input path, pipeline config, and output destination.
POST /search/semantic — Query the embedding store with a natural language query; returns ranked document matches with similarity scores.
GET /models — List available NLP models with supported tasks, languages, and performance benchmarks.
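An illustrative call to the real-time endpoint; the host and all field names are assumptions, since only the routes are defined above:

```python
import requests

resp = requests.post(
    "https://nlp-api.example.com/process",  # illustrative host
    json={
        "text": "Acme Corp. shares rose 4% after the earnings call.",
        "pipeline": ["ner", "classification", "sentiment"],
        "return_embedding": False,
    },
    timeout=0.5,  # client-side budget wrapped around the 200ms server target
)
print(resp.json())
# Hypothetical response shape:
# {"language": "en",
#  "entities": [{"text": "Acme Corp.", "type": "ORG", "confidence": 0.97}],
#  "classification": {"label": "finance", "score": 0.91},
#  "sentiment": 0.63}
```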
Scaling & Bottlenecks
GPU memory is the primary scaling constraint. BERT-base (110M parameters) uses 450 MB GPU memory in FP16; a 16 GB GPU can hold 35 model instances. Sharing GPU memory between multiple model types (NER + classification + sentiment) requires careful memory management. NVIDIA MPS (Multi-Process Service) allows concurrent kernel execution from multiple processes on the same GPU, improving GPU utilization from 40% to 80% for short-duration NLP inference kernels.
Batch throughput for very long documents (>2,048 tokens) requires chunking with overlap (sliding window of 512 tokens with 50-token overlap) and result aggregation across chunks. This multiplies compute by 4x for 2,048-token documents; routing long documents to a separate high-memory queue prevents them from blocking shorter documents in the real-time pool.
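A minimal sliding-window chunker for that long-document path, using the window and overlap stated above:

```python
def chunk_token_ids(token_ids: list[int], window: int = 512, overlap: int = 50):
    """Split a long token sequence into overlapping fixed-size windows."""
    stride = window - overlap  # 462 tokens of fresh content per window
    return [
        token_ids[start:start + window]
        for start in range(0, max(len(token_ids) - overlap, 1), stride)
    ]

# A 2,048-token document yields windows starting at 0, 462, 924, 1386, 1848,
# i.e. roughly 4x the compute of a single 512-token pass; per-window results
# are aggregated downstream.
```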
Key Trade-offs
- Single large multilingual model vs. per-language models: A single XLM-RoBERTa model simplifies deployment and covers all languages but underperforms language-specific fine-tuned models by 5–10% on high-resource languages; a routing strategy uses language-specific models for the top-5 languages and falls back to the multilingual model for the rest (sketched after this list).
- GPU vs. CPU inference: GPU inference is 5–10x faster but expensive; distilled models (DistilBERT: 66M params, ~60% faster than BERT-base while retaining ~95% of its accuracy) make CPU inference viable for cost-sensitive workloads.
- Synchronous vs. asynchronous pipeline: Synchronous processing within 200ms is required for interactive use cases; async processing (Kafka-based) achieves 10x higher throughput for batch use cases.
- Pre-trained vs. fine-tuned models: Off-the-shelf pre-trained models (no task-specific training) are fast to deploy but underperform fine-tuned models by 15–30% on domain-specific text; the 10,000+ labeled examples needed for fine-tuning represent roughly a week of human labeling effort, which typically pays for itself on domain-specific workloads.
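The routing strategy from the first trade-off, as a sketch; every model identifier here is illustrative:

```python
# Dedicated fine-tuned models for the top-5 languages; multilingual fallback.
LANGUAGE_MODELS = {
    "en": "ner-bert-base-en-finetuned",
    "es": "ner-bert-base-es-finetuned",
    "de": "ner-bert-base-de-finetuned",
    "fr": "ner-bert-base-fr-finetuned",
    "zh": "ner-bert-base-zh-finetuned",
}
FALLBACK_MODEL = "ner-xlm-roberta-multilingual"

def select_model(language_code: str) -> str:
    """Use a language-specific model when one exists, else the multilingual one."""
    return LANGUAGE_MODELS.get(language_code, FALLBACK_MODEL)
```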