System Design: Applicant Tracking System (ATS)
System design of an ATS covering application pipeline management, interview scheduling, collaborative evaluation, and compliance tracking for enterprise-scale hiring workflows.
Requirements
Functional Requirements:
- Recruiters create job requisitions with customizable hiring pipelines (stages: applied, phone screen, onsite, offer, hired/rejected)
- Candidates apply via career page, job board integrations, or recruiter-added referrals
- Interview scheduling with calendar integration (Google Calendar, Outlook) and automated candidate communication
- Collaborative evaluation: interviewers submit structured scorecards; hiring committees review aggregated feedback
- Pipeline analytics: time-to-hire, source effectiveness, diversity metrics, and funnel conversion rates
- Compliance features: EEOC/OFCCP reporting, data retention policies, GDPR right-to-erasure
Non-Functional Requirements:
- Support 10,000 enterprise customers with 50M total candidates in the system
- Candidate stage transitions processed within 2 seconds
- 99.95% availability; scheduling and offer workflows are business-critical
- Multi-tenant architecture with strict data isolation between customers
- Audit log for every action on candidate records (immutable, retained 7 years)
Scale Estimation
- 10K enterprises × 500 open requisitions each = 5M active requisitions
- New applications: 2M/day ≈ 23/sec
- Stage transitions: 500K/day ≈ 5.8/sec
- Interview scheduling events: 200K/day ≈ 2.3/sec
- Calendar API calls (check availability, create events): 1M/day
- Scorecard submissions: 100K/day
- Emails/notifications sent: 3M/day
- Candidate records: 50M, averaging 20 documents each (resumes, cover letters, scorecards) = 1B documents
- Storage: 50M candidates × 2MB average attachments = 100TB
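As a sanity check, the per-second figures follow directly from the daily volumes (assuming load spread evenly over an 86,400-second day):

```python
# Back-of-envelope rates from the daily volumes above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily = {
    "applications": 2_000_000,
    "stage transitions": 500_000,
    "scheduling events": 200_000,
}
for name, count in daily.items():
    print(f"{name}: {count / SECONDS_PER_DAY:.1f}/sec")
# applications: 23.1/sec, stage transitions: 5.8/sec, scheduling events: 2.3/sec

# Storage: 50M candidates x 2 MB average attachments = 100 TB
print(f"attachments: {50_000_000 * 2 / 1_000_000:.0f} TB")
```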
High-Level Architecture
The ATS uses a multi-tenant SaaS architecture deployed on Kubernetes. Tenant isolation is enforced at the database level using PostgreSQL Row-Level Security (RLS) with a tenant_id column on every table. Application servers are stateless and shared across tenants; tenant context is established from the JWT on every request. The architecture has five core services communicating via an event bus (Kafka).
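A minimal sketch of how RLS isolation could be wired, assuming tenant context is passed through a PostgreSQL session variable (the app.tenant_id name and uuid typing are illustrative, not from the source):

```python
# Sketch: tenant isolation with PostgreSQL Row-Level Security.
# Session variable name and uuid typing are illustrative assumptions.
import psycopg2

RLS_DDL = """
ALTER TABLE applications ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON applications
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
"""

def query_as_tenant(conn, tenant_id, sql, params=()):
    """Run a query scoped to one tenant; the RLS policy filters every row."""
    with conn.cursor() as cur:
        # Transaction-local setting: cleared automatically at commit/rollback.
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
        cur.execute(sql, params)
        return cur.fetchall()
```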
The Requisition Service manages job openings and their associated hiring pipelines. Each requisition has a configurable pipeline defined as an ordered list of stages with transition rules (e.g., "requires 2 scorecard submissions before advancing to onsite"). The Application Service handles candidate intake from multiple sources (career page, job boards via ATS-XML feeds, referrals, agency submissions) and deduplicates candidates using email matching and fuzzy name matching. The Scheduling Service integrates with calendar providers to find available interview slots, sends invitations, and handles rescheduling.
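The deduplication step might look like the following sketch; the email normalization rules and the 0.85 similarity threshold are assumptions for illustration:

```python
# Sketch: candidate deduplication via exact email match with a
# fuzzy-name fallback. The threshold is illustrative.
from difflib import SequenceMatcher

def normalize_email(email: str) -> str:
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]          # drop +tag aliases
    return f"{local}@{domain}"

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicate(new, existing, threshold=0.85):
    """Return an existing candidate that likely matches `new`, or None."""
    new_email = normalize_email(new["email"])
    for cand in existing:
        if normalize_email(cand["email"]) == new_email:
            return cand                      # exact email match
        if name_similarity(new["name"], cand["name"]) >= threshold:
            return cand                      # probable same person
    return None
```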
The Evaluation Service manages scorecards and hiring decisions. Interviewers receive scorecard templates (customizable per requisition) with structured rubrics (technical skills, communication, culture fit) and free-text notes. Scorecards are submitted asynchronously and aggregated into a hiring committee view. The Analytics Service consumes events from Kafka to build real-time dashboards: pipeline velocity, bottleneck identification (stages where candidates stall), diversity funnel analysis, and source ROI.
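A sketch of how scorecards could be aggregated for the committee view, assuming numeric per-dimension scores (the field names are illustrative):

```python
# Sketch: aggregate structured scorecards across interviewers.
from collections import defaultdict
from statistics import mean

def aggregate_scorecards(scorecards):
    """Average each rubric dimension across all submitted scorecards."""
    by_dimension = defaultdict(list)
    for card in scorecards:
        for dimension, score in card["scores"].items():
            by_dimension[dimension].append(score)
    return {dim: round(mean(vals), 2) for dim, vals in by_dimension.items()}

cards = [
    {"interviewer": "a", "scores": {"technical": 4, "communication": 3}},
    {"interviewer": "b", "scores": {"technical": 5, "communication": 4}},
]
print(aggregate_scorecards(cards))  # {'technical': 4.5, 'communication': 3.5}
```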
Core Components
Pipeline Engine
The pipeline engine is a state machine that governs candidate progression through hiring stages. Each requisition defines a pipeline as a directed graph of stages with transition predicates. A transition predicate might require: minimum number of completed interviews, scorecard average above a threshold, approval from a hiring manager, or completion of a background check. The engine evaluates predicates on every relevant event (scorecard submitted, interview completed) and automatically advances candidates when all predicates are satisfied. Manual overrides are supported with audit logging. The engine uses the saga pattern for multi-step transitions (e.g., advancing to offer stage requires generating an offer letter, getting approval, and sending to the candidate).
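A minimal sketch of the predicate-driven state machine; stage names, thresholds, and the application record shape are assumptions:

```python
# Sketch: pipeline engine as a state machine with transition predicates.
from dataclasses import dataclass
from typing import Callable

Predicate = Callable[[dict], bool]   # evaluated against application state

@dataclass
class Transition:
    from_stage: str
    to_stage: str
    predicates: list[Predicate]      # all must hold to auto-advance

def avg_score(app: dict) -> float:
    scores = [s["overall"] for s in app["scorecards"]]
    return sum(scores) / len(scores) if scores else 0.0

PIPELINE = [
    Transition("phone_screen", "onsite",
               [lambda a: len(a["scorecards"]) >= 2,
                lambda a: avg_score(a) >= 3.0]),
    Transition("onsite", "offer",
               [lambda a: a.get("hiring_manager_approved", False)]),
]

def on_event(app: dict) -> dict:
    """Re-evaluate transitions after every relevant event."""
    for t in PIPELINE:
        if app["stage"] == t.from_stage and all(p(app) for p in t.predicates):
            app["stage"] = t.to_stage   # manual overrides and audit log omitted
    return app
```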
Interview Scheduling
The Scheduling Service uses a constraint satisfaction approach to find optimal interview slots. Inputs include: interviewer availability (fetched via Google Calendar/Microsoft Graph API with 15-minute polling), candidate availability (collected via a scheduling link), interview duration and type (phone, video, onsite), panel requirements (e.g., "2 engineers + 1 hiring manager"), and timezone coordination. The solver (a greedy algorithm with backtracking) finds the earliest slot satisfying all constraints. Calendar events are created via API with the ATS as the organizer; updates and cancellations are propagated bidirectionally. For onsite interviews, room booking integration reserves conference rooms. Automated reminder emails are sent 24 hours and 1 hour before the interview.
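A sketch of the greedy earliest-slot scan (the backtracking over panel composition mentioned above is omitted; helper names and data shapes are assumptions):

```python
# Sketch: scan candidate-proposed start times in chronological order
# and pick the first slot where the entire panel is free.
from datetime import timedelta

def is_free(busy_intervals, start, end):
    """True if [start, end) overlaps none of the busy intervals."""
    return all(end <= b_start or start >= b_end
               for b_start, b_end in busy_intervals)

def find_slot(candidate_slots, panel_busy, duration_minutes):
    """panel_busy: {interviewer_id: [(busy_start, busy_end), ...]}"""
    duration = timedelta(minutes=duration_minutes)
    for start in sorted(candidate_slots):
        end = start + duration
        if all(is_free(busy, start, end) for busy in panel_busy.values()):
            return start, end          # earliest slot satisfying everyone
    return None                        # no feasible slot; widen the search
```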
Compliance & Audit System
Every mutation to candidate data is logged in an append-only audit table: audit_logs (log_id, tenant_id, actor_id, action ENUM, entity_type, entity_id, old_value JSONB, new_value JSONB, ip_address, timestamp). This table uses TimescaleDB (PostgreSQL extension for time-series) for efficient time-range queries and automatic partitioning. GDPR right-to-erasure requests trigger a pseudonymization workflow: candidate PII (name, email, phone, address) is replaced with anonymized tokens, while non-PII data (stage progression, aggregate scores) is retained for analytics. EEOC data (voluntarily disclosed race, gender, veteran status) is stored in a separate, access-controlled table with aggregate-only query permissions to prevent individual identification.
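The pseudonymization step might look like this sketch; the token format and PII field list are illustrative:

```python
# Sketch: GDPR right-to-erasure via pseudonymization. PII fields are
# replaced with opaque tokens; analytics-relevant data is retained.
import uuid

PII_FIELDS = ("name", "email", "phone", "address")  # illustrative list

def pseudonymize(candidate: dict) -> dict:
    token = uuid.uuid4().hex[:12]
    redacted = dict(candidate)
    for field in PII_FIELDS:
        if field in redacted:
            redacted[field] = f"erased-{token}"
    # Non-PII (stage history, aggregate scores) stays intact for analytics.
    return redacted
```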
Database Design
The primary database is PostgreSQL with RLS for multi-tenancy. Core tables: tenants (tenant_id, company_name, plan, settings JSONB), requisitions (req_id, tenant_id, title, department, pipeline_config JSONB, status, created_by, created_at), candidates (candidate_id, tenant_id, name, email_encrypted, phone_encrypted, source, created_at), applications (application_id, tenant_id, candidate_id, req_id, current_stage, applied_at, updated_at), scorecards (scorecard_id, application_id, interviewer_id, scores JSONB, notes_encrypted, submitted_at). PII fields are encrypted at rest using envelope encryption (AWS KMS) with per-tenant data keys.
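A sketch of field-level envelope encryption, assuming AWS KMS via boto3 and AES-GCM from the cryptography package (key caching, rotation, and error handling are omitted):

```python
# Sketch: envelope encryption for PII fields with a per-tenant data key.
import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")

def encrypt_field(tenant_kms_key_id: str, plaintext: str) -> dict:
    # KMS returns the data key both in plaintext (for local encryption)
    # and encrypted under the tenant's master key (for storage).
    dk = kms.generate_data_key(KeyId=tenant_kms_key_id, KeySpec="AES_256")
    nonce = os.urandom(12)
    ciphertext = AESGCM(dk["Plaintext"]).encrypt(nonce, plaintext.encode(), None)
    return {
        "ciphertext": ciphertext,
        "nonce": nonce,
        "encrypted_data_key": dk["CiphertextBlob"],  # later: kms.decrypt
    }
```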
Indexes: (tenant_id, req_id, current_stage) for pipeline views, (tenant_id, candidate_id) for candidate profiles, (tenant_id, created_at DESC) for recent activity. A read replica per region serves analytics queries without impacting transactional workloads. Candidate attachments (resumes, offer letters) are stored in S3 with server-side encryption, organized by tenant_id/candidate_id prefix for efficient lifecycle management.
API Design
- POST /api/v1/requisitions: Create a job requisition with pipeline configuration; returns req_id
- POST /api/v1/applications: Submit a candidate application; body contains candidate info, resume file, and source; returns application_id
- PUT /api/v1/applications/{app_id}/stage: Advance or move a candidate to a new stage; body contains target_stage and notes
- POST /api/v1/scorecards: Submit an interview scorecard; body contains application_id, scores, and notes; returns scorecard_id
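For example, submitting a scorecard might look like this (the host, auth header, and payload field values are placeholders):

```python
# Sketch: calling the scorecard endpoint; tenant context comes from the JWT.
import requests

resp = requests.post(
    "https://ats.example.com/api/v1/scorecards",
    headers={"Authorization": "Bearer <jwt>"},
    json={
        "application_id": "app_123",
        "scores": {"technical": 4, "communication": 5},
        "notes": "Strong system design discussion.",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["scorecard_id"])
```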
Scaling & Bottlenecks
Calendar integration is the primary external bottleneck. Polling 200K interviewer calendars for availability hits rate limits on Google Calendar API (1M requests/day per project) and Microsoft Graph API. The system uses webhook subscriptions (push notifications on calendar changes) instead of polling where supported, reducing API calls by 90%. A local availability cache (Redis) stores each interviewer's free/busy data with a 15-minute refresh cycle; scheduling requests use the cache first and fall back to live API calls only for the final confirmation.
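The cache-first lookup could be sketched as follows (redis-py; the key scheme and serialization are illustrative, with the TTL mirroring the 15-minute refresh cycle):

```python
# Sketch: cache-first free/busy lookup with live-API fallback.
import json
import redis

r = redis.Redis()
TTL_SECONDS = 15 * 60   # matches the 15-minute refresh cycle

def get_free_busy(interviewer_id: str, fetch_live):
    """fetch_live: callable hitting Google Calendar / Microsoft Graph."""
    key = f"freebusy:{interviewer_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: no API call
    live = fetch_live(interviewer_id)        # cache miss: one live call
    r.setex(key, TTL_SECONDS, json.dumps(live))
    return live
```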
Multi-tenant database performance requires careful index management. With 10K tenants sharing tables, queries must always include tenant_id to leverage RLS and indexes. Large tenants (enterprises with 1M+ candidates) can cause partition-level hot spots; for these customers, the system offers dedicated database instances (tenant-level sharding) at an enterprise pricing tier. The audit log grows by 5M rows/day and is partitioned monthly with automatic archival to S3 (Parquet format) after 1 year.
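The archival step might look like this sketch, assuming pandas/pyarrow for the Parquet export and boto3 for the upload (bucket name, path layout, and partition naming are illustrative):

```python
# Sketch: export an aged monthly audit partition to S3 as Parquet.
import boto3
import pandas as pd

def archive_partition(conn, month: str, bucket: str = "ats-audit-archive"):
    # Partition name comes from a trusted scheduler, not user input.
    df = pd.read_sql(f"SELECT * FROM audit_logs_{month}", conn)
    local_path = f"/tmp/audit_logs_{month}.parquet"
    df.to_parquet(local_path)  # requires pyarrow
    boto3.client("s3").upload_file(local_path, bucket, f"audit/{month}.parquet")
```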
Key Trade-offs
- Row-Level Security vs schema-per-tenant for multi-tenancy: RLS provides simpler operational management (single schema, shared connection pools) but risks noisy-neighbor issues — dedicated instances for large tenants provide a safety valve
- Push (webhooks) vs pull (polling) for calendar sync: Webhooks reduce API calls 90% but require handling webhook delivery failures and expiration renewal — the hybrid approach (webhooks + periodic polling reconciliation) provides reliability
- Configurable pipeline engine vs hardcoded stages: A flexible state machine supports diverse hiring processes (startups vs enterprises) but increases complexity in the UI and API — progressive disclosure (simple defaults with advanced customization) balances usability
- PII encryption at the field level vs disk-level encryption: Field-level encryption with per-tenant keys enables true data isolation and simplifies GDPR erasure, but adds 2-3ms latency per encrypted field access — acceptable for the security guarantee