System Design: Resume Parsing & Matching System
System design of a resume parsing and matching system covering NLP-based information extraction, skill taxonomy mapping, candidate-job scoring, and batch processing of millions of resumes.
Requirements
Functional Requirements:
- Parse resumes in PDF, DOCX, and plain text formats, extracting structured fields (name, email, phone, education, work experience, skills, certifications)
- Normalize extracted skills against a standardized skill taxonomy (e.g., "JS" → "JavaScript", "ML" → "Machine Learning")
- Score candidate-job fit by matching extracted resume data against job requirements
- Batch processing mode for bulk resume ingestion from email inboxes and ATS integrations
- Human-in-the-loop correction interface where recruiters can fix parsing errors, feeding back into model improvement
- Support resumes in 10+ languages with automatic language detection
Non-Functional Requirements:
- Parse 1 million resumes/day with P95 latency under 10 seconds per resume
- Extraction accuracy of 90%+ for key fields (name, email, experience dates)
- 99.9% availability for the parsing API
- Model retraining weekly incorporating correction feedback
- Handle diverse resume formats including creative layouts, multi-column designs, and scanned images
Scale Estimation
1M resumes/day ≈ 11.6 resumes/sec. Average resume: 2 pages, ~300KB PDF.
- Processing pipeline per resume: PDF-to-text extraction (1 sec) + NLP parsing (3 sec) + skill normalization (0.5 sec) + embedding generation (1 sec) = 5.5 sec total
- Required parallel workers: 11.6 × 5.5 ≈ 64 at steady state; ~150 provisioned for 2x peak
- Storage: 1M × 300KB = 300GB/day raw resumes; parsed structured data: 1M × 5KB JSON = 5GB/day
- Candidate-job matching: 1M candidates × 20M jobs = 20T pairs; brute-force scoring is infeasible, so matching relies on embedding-based ANN search
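The steady-state arithmetic above is simple enough to script as a sanity check (pure arithmetic, no external dependencies):

```python
# Back-of-envelope check of the scale numbers above.
resumes_per_day = 1_000_000
rps = resumes_per_day / 86_400                  # ≈ 11.6 resumes/sec
per_resume_sec = 1 + 3 + 0.5 + 1                # extract + NLP + normalize + embed = 5.5s
steady_workers = rps * per_resume_sec           # ≈ 64 workers at steady state
raw_gb_per_day = resumes_per_day * 300 / 1e6    # 300KB per resume → 300 GB/day
parsed_gb_per_day = resumes_per_day * 5 / 1e6   # 5KB JSON per resume → 5 GB/day
print(f"{rps:.1f} rps, {steady_workers:.0f} workers, "
      f"{raw_gb_per_day:.0f} GB raw, {parsed_gb_per_day:.0f} GB parsed per day")
```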
High-Level Architecture
The system has three pipelines: Parse, Index, and Match. The Parse Pipeline receives resumes via API upload or batch ingestion (polling email inboxes, S3 bucket triggers, ATS webhooks). Each resume enters a processing queue (SQS) and is picked up by a Parser Worker. The worker first extracts raw text: for PDFs, Apache Tika (backed by PDFBox) extracts text with layout preservation; for scanned/image-based PDFs, an OCR stage (Tesseract or AWS Textract) converts images to text. The raw text is then processed by an NLP extraction model — a fine-tuned BERT-based Named Entity Recognition (NER) model that identifies entities: PERSON_NAME, EMAIL, PHONE, EDUCATION (institution, degree, dates), EXPERIENCE (company, title, dates, description), SKILL, CERTIFICATION.
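To make the extraction step concrete, here is a minimal sketch of the NER stage using the Hugging Face pipeline API. The model path `models/resume-ner` is a hypothetical fine-tuned checkpoint; the label names follow the entity set above:

```python
# Minimal NER extraction sketch. "models/resume-ner" is a hypothetical
# fine-tuned BERT checkpoint emitting labels like PERSON_NAME, EMAIL, SKILL.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="models/resume-ner",
    aggregation_strategy="simple",  # merge BIO sub-token tags into whole entities
)

def extract_entities(resume_text: str) -> dict[str, list[dict]]:
    """Group detected entities by type, keeping span text and confidence."""
    grouped: dict[str, list[dict]] = {}
    for ent in ner(resume_text):
        grouped.setdefault(ent["entity_group"], []).append(
            {"text": ent["word"], "score": float(ent["score"])}
        )
    return grouped
```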
The Index Pipeline takes parsed structured data and builds a searchable candidate index. Skills are mapped to a hierarchical taxonomy (e.g., "React" → "Frontend Frameworks" → "Web Development" → "Software Engineering") using a combination of exact matching, synonym lookup, and a skill embedding model for fuzzy matching (e.g., recognizing "TensorFlow" as related to "Deep Learning"). Each candidate profile is encoded into a 512-dimensional embedding vector using a Sentence-BERT model trained on resume-job description pairs. Embeddings are stored in a FAISS index for fast similarity search.
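A sketch of the embedding-and-index step, assuming a Sentence-BERT model fine-tuned on resume-job pairs (the `models/resume-sbert` path is hypothetical); a flat inner-product index stands in for whatever FAISS index type production would use:

```python
# Encode candidate profiles and build a FAISS index for similarity search.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("models/resume-sbert")  # hypothetical fine-tuned model

def build_candidate_index(profile_texts: list[str]) -> faiss.Index:
    # Normalized vectors + inner product = cosine similarity.
    vecs = encoder.encode(profile_texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])  # 512-dim in this design
    index.add(np.asarray(vecs, dtype="float32"))
    return index
```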
The Match Pipeline scores candidate-job pairs. When a recruiter requests matches for a job posting, the job description is encoded into the same embedding space and ANN search retrieves the top 500 candidates. A re-ranking model (LightGBM) then scores each candidate using fine-grained features: skill overlap percentage, experience years match, education level match, location proximity, and industry alignment. The ranked list is returned to the recruiter with explainable match scores.
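The retrieve-then-rerank flow can be sketched under the same assumptions: a FAISS index built as above, a trained LightGBM model at a hypothetical path, and a caller-supplied `feature_fn` standing in for the fine-grained feature computation (skill overlap, experience fit, location, industry):

```python
# Two-stage matching: ANN retrieval of the top-k candidates, then GBM re-ranking.
from typing import Callable
import faiss
import lightgbm as lgb
import numpy as np

reranker = lgb.Booster(model_file="models/reranker.txt")  # hypothetical trained model

def match(index: faiss.Index, job_vec: np.ndarray,
          feature_fn: Callable[[int], np.ndarray],
          k: int = 500) -> list[tuple[int, float]]:
    """job_vec: the job description encoded with the same Sentence-BERT model."""
    _, ids = index.search(job_vec.reshape(1, -1).astype("float32"), k)
    feats = np.stack([feature_fn(int(i)) for i in ids[0]])   # per-candidate features
    scores = reranker.predict(feats)                          # fine-grained re-ranking
    return sorted(zip(ids[0].tolist(), scores.tolist()), key=lambda p: -p[1])
```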
Core Components
Resume Parser (NER Model)
The NER model is a fine-tuned BERT-base (110M parameters) trained on 500K labeled resumes spanning diverse industries and formats. The model uses BIO tagging to identify entity boundaries. Input preprocessing includes layout analysis (detecting multi-column formats and reordering text into reading order) and section detection (identifying "Experience", "Education", "Skills" headers using a separate classifier). Post-processing rules clean up extracted entities: date normalization ("Jan 2020 - Present" → {start: "2020-01", end: null}), email/phone regex validation, and name capitalization. For scanned resumes, OCR confidence scores below 0.8 trigger a fallback to a vision-based extraction model (LayoutLMv3) that works directly on document images.
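The date-normalization rule is representative of this post-processing layer; a minimal sketch follows, handling only "Mon YYYY - Mon YYYY/Present" shapes (real resumes need more patterns):

```python
# Normalize "Jan 2020 - Present" -> {"start": "2020-01", "end": None}.
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

def normalize_date_range(raw: str) -> dict:
    def parse(token: str):
        token = token.strip().lower()
        if token in ("present", "current", "now"):
            return None                              # open-ended employment
        m = re.match(r"([a-z]{3})[a-z]*\.?\s+(\d{4})", token)
        return f"{m.group(2)}-{MONTHS[m.group(1)]:02d}" if m else None
    start, _, end = raw.partition("-")               # en dashes need extra handling
    return {"start": parse(start), "end": parse(end)}
```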
Skill Taxonomy & Normalization
The skill taxonomy contains 50,000 skills organized in a 4-level hierarchy. Normalization uses a multi-strategy approach: (1) exact match against a synonym dictionary (10,000 entries maintained by domain experts), (2) fuzzy matching using Levenshtein distance for typos ("Pythn" → "Python"), (3) embedding similarity using a skill2vec model for semantic matching ("Data Wrangling" → "Data Preprocessing"). New skills not in the taxonomy are flagged for human review and added weekly. The taxonomy also maps skills to categories (programming languages, frameworks, soft skills) and proficiency indicators ("expert in Python" → {skill: "Python", proficiency: "expert"}).
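A compressed sketch of the first two strategies (exact synonym lookup, then fuzzy matching); `difflib` approximates the Levenshtein-based step here, and the tiny dictionaries stand in for the expert-maintained ones:

```python
# Skill normalization cascade: exact synonym match, then fuzzy match for typos.
import difflib

SYNONYMS = {"js": "JavaScript", "ml": "Machine Learning"}   # stand-in for 10K entries
CANONICAL = ["JavaScript", "Machine Learning", "Python", "React"]

def normalize_skill(raw: str) -> str | None:
    key = raw.strip().lower()
    if key in SYNONYMS:                                     # strategy 1: exact synonym
        return SYNONYMS[key]
    close = difflib.get_close_matches(raw.strip(), CANONICAL, n=1, cutoff=0.8)
    if close:                                               # strategy 2: typo tolerance
        return close[0]                                     # e.g. "Pythn" -> "Python"
    return None    # strategy 3 (embedding similarity) or flag for human review
```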
Candidate-Job Scoring Engine
The scoring engine produces an interpretable match score (0-100) with per-dimension breakdowns. Dimensions include: skills_match (weighted Jaccard similarity of required vs candidate skills, with weights from the job's priority ranking), experience_fit (Gaussian scoring around the target years), education_match (hierarchical scoring: exact degree match > related field > any degree), location_score (distance-based decay function), and industry_relevance (cosine similarity of industry embeddings). The overall score is a learned weighted combination trained on recruiter accept/reject decisions. The engine processes 500 candidates per job in under 2 seconds using vectorized computation (NumPy/pandas).
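Two of these dimensions translate to a few lines each; the sketch below shows the weighted-Jaccard skills match and the Gaussian experience fit (sigma is an illustrative value, since the real combination weights are learned):

```python
# Illustrative scoring dimensions; parameters are placeholders, not learned values.
import numpy as np

def skills_match(required: dict[str, float], candidate: set[str]) -> float:
    """Weighted Jaccard: job-priority weights on required skills, weight 1 otherwise."""
    inter = sum(w for s, w in required.items() if s in candidate)
    union = sum(required.values()) + sum(1 for s in candidate if s not in required)
    return inter / union if union else 0.0

def experience_fit(candidate_years: float, target_years: float,
                   sigma: float = 2.0) -> float:
    """Gaussian decay around the job's target years of experience."""
    return float(np.exp(-((candidate_years - target_years) ** 2) / (2 * sigma ** 2)))
```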
Database Design
Parsed resumes are stored in PostgreSQL: candidates (candidate_id UUID PK, raw_resume_s3_path, parsed_data JSONB, skills ARRAY, experience_years FLOAT, education_level ENUM, location, language, parsing_confidence FLOAT, parsed_at, corrected_by nullable). The parsed_data JSONB contains the full structured extraction (experiences, education entries, certifications). A GIN index on the skills array enables efficient filtering. Candidate embeddings are stored in a pgvector column for small-scale matching and replicated to FAISS for large-scale ANN.
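For example, a skills filter can be expressed as an array-containment query that the GIN index accelerates (table and column names follow the schema above; the connection string is a placeholder):

```python
# Filter candidates by required skills using the GIN-indexed skills array.
import psycopg2

conn = psycopg2.connect("dbname=resumes")  # placeholder connection string

def candidates_with_skills(skills: list[str], min_years: float) -> list[str]:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT candidate_id
            FROM candidates
            WHERE skills @> %s          -- array containment, served by the GIN index
              AND experience_years >= %s
            """,
            (skills, min_years),
        )
        return [row[0] for row in cur.fetchall()]
```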
The skill taxonomy is stored in a separate PostgreSQL table: skills (skill_id, canonical_name, category, parent_skill_id, synonyms ARRAY, embedding VECTOR(128)). A corrections table tracks human feedback: corrections (correction_id, candidate_id, field_name, original_value, corrected_value, corrector_id, created_at) — this table feeds the weekly model retraining pipeline. Job-candidate match results are cached in Redis (job_id → sorted list of {candidate_id, score}) with a 24-hour TTL.
API Design
- POST /api/v1/resumes/parse — Upload and parse a resume; body is a multipart form with the file; returns candidate_id and parsed structured data
- GET /api/v1/candidates/{candidate_id}/profile — Fetch the parsed candidate profile with skills, experience, and education
- POST /api/v1/match — Find matching candidates for a job; body contains job_description, required_skills, location, experience_range; returns a ranked candidate list with scores
- PUT /api/v1/candidates/{candidate_id}/corrections — Submit human corrections to parsed data; body contains field corrections
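A hypothetical client call against the match endpoint (the host, payload values, and response shape are illustrative; the field names follow the API description above):

```python
import requests

resp = requests.post(
    "https://api.example.com/api/v1/match",        # placeholder host
    json={
        "job_description": "Senior backend engineer building payment APIs",
        "required_skills": ["Python", "PostgreSQL", "AWS"],
        "location": "Berlin, DE",
        "experience_range": [5, 10],
    },
    timeout=30,
)
resp.raise_for_status()
for cand in resp.json()["candidates"]:             # assumed response shape
    print(cand["candidate_id"], cand["score"])
```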
Scaling & Bottlenecks
The NER model inference is the primary bottleneck. BERT inference on CPU takes 3 seconds per resume; on GPU (T4), it drops to 200ms. A fleet of 10 GPU instances handles the steady-state 11.6 resumes/sec with headroom. For batch ingestion spikes (e.g., an enterprise customer uploading 500K resumes from a legacy ATS), auto-scaling adds GPU instances with a 5-minute warm-up time (model loading). During the warm-up period, requests queue in SQS with a 30-minute visibility timeout.
The FAISS index for 200M candidate embeddings requires ~400GB of memory (512 dimensions × 4 bytes × 200M). This is distributed across 8 machines using FAISS's IndexShards, with each machine holding 25M vectors. Index rebuilds run nightly; real-time updates use a small auxiliary index (IndexIDMap) that is merged into the main index during the nightly rebuild. Query latency for top-500 ANN search is under 50ms per shard, with results merged and re-ranked in 200ms total.
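The sharded layout maps naturally onto FAISS's `IndexShards`; the sketch below runs all shards in one process with toy sizes purely to show the fan-out/merge mechanics (production shards hold 25M vectors each on separate machines):

```python
# Fan-out ANN search across shards; sizes shrunk for illustration.
import faiss
import numpy as np

d, n_shards, per_shard = 512, 8, 10_000     # 25M per shard in production

shards = faiss.IndexShards(d)
for _ in range(n_shards):
    sub = faiss.IndexFlatIP(d)              # production would use a trained IVF/HNSW index
    xb = np.random.rand(per_shard, d).astype("float32")
    faiss.normalize_L2(xb)
    sub.add(xb)
    shards.add_shard(sub)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = shards.search(query, 500)     # each shard searched, results merged
```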
Key Trade-offs
- BERT-based NER vs rule-based parsing: BERT achieves 92% accuracy vs 75% for regex-based rules on diverse resume formats, but requires GPU infrastructure and labeled training data — the accuracy improvement directly impacts recruiter productivity
- Pre-computed embeddings vs on-the-fly encoding: Pre-computing candidate embeddings enables sub-second matching but means embeddings are stale if the resume is updated — nightly recomputation provides a reasonable freshness trade-off
- Hierarchical skill taxonomy vs flat skill list: The hierarchy enables semantic matching ("React" is related to "Frontend") but requires ongoing curation — automated taxonomy expansion using skill2vec reduces maintenance burden
- OCR fallback for scanned resumes vs reject-and-request-resubmission: OCR handles 15% of resumes that are scanned images, but OCR errors (especially for non-English text) can degrade parsing quality — the trade-off favors inclusion over rejection