System Design: Applicant Tracking System (ATS)
System design of an ATS covering application pipeline management, interview scheduling, collaborative evaluation, and compliance tracking for enterprise-scale hiring workflows.
Requirements
Functional Requirements:
- Recruiters create job requisitions with customizable hiring pipelines (stages: applied, phone screen, onsite, offer, hired/rejected)
- Candidates apply via career page, job board integrations, or recruiter-added referrals
- Interview scheduling with calendar integration (Google Calendar, Outlook) and automated candidate communication
- Collaborative evaluation: interviewers submit structured scorecards; hiring committees review aggregated feedback
- Pipeline analytics: time-to-hire, source effectiveness, diversity metrics, and funnel conversion rates
- Compliance features: EEOC/OFCCP reporting, data retention policies, GDPR right-to-erasure
Non-Functional Requirements:
- Support 10,000 enterprise customers with 50M total candidates in the system
- Candidate stage transitions processed within 2 seconds
- 99.95% availability; scheduling and offer workflows are business-critical
- Multi-tenant architecture with strict data isolation between customers
- Audit log for every action on candidate records (immutable, retained 7 years)
Scale Estimation
- 10K enterprises × 500 open requisitions each = 5M active requisitions
- New applications: 2M/day ≈ 23/sec
- Stage transitions: 500K/day ≈ 5.8/sec
- Interview scheduling events: 200K/day ≈ 2.3/sec
- Calendar API calls (check availability, create events): 1M/day
- Scorecard submissions: 100K/day
- Emails/notifications sent: 3M/day
- Candidate records: 50M, averaging 20 documents each (resumes, cover letters, scorecards) = 1B documents
- Storage: 50M candidates × 2MB average attachments = 100TB
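As a sanity check, the per-second figures follow directly from the daily volumes (assuming load spread evenly over an 86,400-second day):

```python
# Back-of-envelope rates from the daily volumes above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily = {
    "applications": 2_000_000,
    "stage transitions": 500_000,
    "scheduling events": 200_000,
}
for name, count in daily.items():
    print(f"{name}: {count / SECONDS_PER_DAY:.1f}/sec")
# applications: 23.1/sec, stage transitions: 5.8/sec, scheduling events: 2.3/sec

# Storage: 50M candidates x 2 MB average attachments = 100 TB
print(f"attachments: {50_000_000 * 2 / 1_000_000:.0f} TB")
```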
High-Level Architecture
The ATS uses a multi-tenant SaaS architecture deployed on Kubernetes. Tenant isolation is enforced at the database level using PostgreSQL Row-Level Security (RLS) with a tenant_id column on every table. Application servers are stateless and shared across tenants; tenant context is established from the JWT on every request. The architecture has five core services communicating via an event bus (Kafka).
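A minimal sketch of how RLS isolation could be wired, assuming tenant context is passed through a PostgreSQL session variable (the app.tenant_id name and uuid typing are illustrative, not from the source):

```python
# Sketch: tenant isolation with PostgreSQL Row-Level Security.
# Session variable name and uuid typing are illustrative assumptions.
import psycopg2

RLS_DDL = """
ALTER TABLE applications ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON applications
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
"""

def query_as_tenant(conn, tenant_id, sql, params=()):
    """Run a query scoped to one tenant; the RLS policy filters every row."""
    with conn.cursor() as cur:
        # Transaction-local setting: cleared automatically at commit/rollback.
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
        cur.execute(sql, params)
        return cur.fetchall()
```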
The Requisition Service manages job openings and their associated hiring pipelines. Each requisition has a configurable pipeline defined as an ordered list of stages with transition rules (e.g., "requires 2 scorecard submissions before advancing to onsite"). The Application Service handles candidate intake from multiple sources (career page, job boards via ATS-XML feeds, referrals, agency submissions) and deduplicates candidates using email matching and fuzzy name matching. The Scheduling Service integrates with calendar providers to find available interview slots, sends invitations, and handles rescheduling.
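The deduplication step might look like the following sketch; the email normalization rules and the 0.85 similarity threshold are assumptions for illustration:

```python
# Sketch: candidate deduplication via exact email match with a
# fuzzy-name fallback. The threshold is illustrative.
from difflib import SequenceMatcher

def normalize_email(email: str) -> str:
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]          # drop +tag aliases
    return f"{local}@{domain}"

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicate(new, existing, threshold=0.85):
    """Return an existing candidate that likely matches `new`, or None."""
    new_email = normalize_email(new["email"])
    for cand in existing:
        if normalize_email(cand["email"]) == new_email:
            return cand                      # exact email match
        if name_similarity(new["name"], cand["name"]) >= threshold:
            return cand                      # probable same person
    return None
```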
The Evaluation Service manages scorecards and hiring decisions. Interviewers receive scorecard templates (customizable per requisition) with structured rubrics (technical skills, communication, culture fit) and free-text notes. Scorecards are submitted asynchronously and aggregated into a hiring committee view. The Analytics Service consumes events from Kafka to build real-time dashboards: pipeline velocity, bottleneck identification (stages where candidates stall), diversity funnel analysis, and source ROI.
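A sketch of how scorecards could be aggregated for the committee view, assuming numeric per-dimension scores (the field names are illustrative):

```python
# Sketch: aggregate structured scorecards across interviewers.
from collections import defaultdict
from statistics import mean

def aggregate_scorecards(scorecards):
    """Average each rubric dimension across all submitted scorecards."""
    by_dimension = defaultdict(list)
    for card in scorecards:
        for dimension, score in card["scores"].items():
            by_dimension[dimension].append(score)
    return {dim: round(mean(vals), 2) for dim, vals in by_dimension.items()}

cards = [
    {"interviewer": "a", "scores": {"technical": 4, "communication": 3}},
    {"interviewer": "b", "scores": {"technical": 5, "communication": 4}},
]
print(aggregate_scorecards(cards))  # {'technical': 4.5, 'communication': 3.5}
```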
Core Components
Pipeline Engine
The pipeline engine is a state machine that governs candidate progression through hiring stages. Each requisition defines a pipeline as a directed graph of stages with transition predicates. A transition predicate might require: minimum number of completed interviews, scorecard average above a threshold, approval from a hiring manager, or completion of a background check. The engine evaluates predicates on every relevant event (scorecard submitted, interview completed) and automatically advances candidates when all predicates are satisfied. Manual overrides are supported with audit logging. The engine uses the saga pattern for multi-step transitions (e.g., advancing to offer stage requires generating an offer letter, getting approval, and sending to the candidate).
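A minimal sketch of the predicate-driven state machine; stage names, thresholds, and the application record shape are assumptions:

```python
# Sketch: pipeline engine as a state machine with transition predicates.
from dataclasses import dataclass
from typing import Callable

Predicate = Callable[[dict], bool]   # evaluated against application state

@dataclass
class Transition:
    from_stage: str
    to_stage: str
    predicates: list[Predicate]      # all must hold to auto-advance

def avg_score(app: dict) -> float:
    scores = [s["overall"] for s in app["scorecards"]]
    return sum(scores) / len(scores) if scores else 0.0

PIPELINE = [
    Transition("phone_screen", "onsite",
               [lambda a: len(a["scorecards"]) >= 2,
                lambda a: avg_score(a) >= 3.0]),
    Transition("onsite", "offer",
               [lambda a: a.get("hiring_manager_approved", False)]),
]

def on_event(app: dict) -> dict:
    """Re-evaluate transitions after every relevant event."""
    for t in PIPELINE:
        if app["stage"] == t.from_stage and all(p(app) for p in t.predicates):
            app["stage"] = t.to_stage   # manual overrides and audit log omitted
    return app
```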
Interview Scheduling
The Scheduling Service uses a constraint satisfaction approach to find optimal interview slots. Inputs include: interviewer availability (fetched via Google Calendar/Microsoft Graph API with 15-minute polling), candidate availability (collected via a scheduling link), interview duration and type (phone, video, onsite), panel requirements (e.g., "2 engineers + 1 hiring manager"), and timezone coordination. The solver (a greedy algorithm with backtracking) finds the earliest slot satisfying all constraints. Calendar events are created via API with the ATS as the organizer; updates and cancellations are propagated bidirectionally. For onsite interviews, room booking integration reserves conference rooms. Automated reminder emails are sent 24 hours and 1 hour before the interview.
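A sketch of the greedy earliest-slot scan (the backtracking over panel composition mentioned above is omitted; helper names and data shapes are assumptions):

```python
# Sketch: scan candidate-proposed start times in chronological order
# and pick the first slot where the entire panel is free.
from datetime import timedelta

def is_free(busy_intervals, start, end):
    """True if [start, end) overlaps none of the busy intervals."""
    return all(end <= b_start or start >= b_end
               for b_start, b_end in busy_intervals)

def find_slot(candidate_slots, panel_busy, duration_minutes):
    """panel_busy: {interviewer_id: [(busy_start, busy_end), ...]}"""
    duration = timedelta(minutes=duration_minutes)
    for start in sorted(candidate_slots):
        end = start + duration
        if all(is_free(busy, start, end) for busy in panel_busy.values()):
            return start, end          # earliest slot satisfying everyone
    return None                        # no feasible slot; widen the search
```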
Compliance & Audit System
Every mutation to candidate data is logged in an append-only audit table: audit_logs (log_id, tenant_id, actor_id, action ENUM, entity_type, entity_id, old_value JSONB, new_value JSONB, ip_address, timestamp). This table uses TimescaleDB (PostgreSQL extension for time-series) for efficient time-range queries and automatic partitioning. GDPR right-to-erasure requests trigger a pseudonymization workflow: candidate PII (name, email, phone, address) is replaced with anonymized tokens, while non-PII data (stage progression, aggregate scores) is retained for analytics. EEOC data (voluntarily disclosed race, gender, veteran status) is stored in a separate, access-controlled table with aggregate-only query permissions to prevent individual identification.
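The pseudonymization step might look like this sketch; the token format and PII field list are illustrative:

```python
# Sketch: GDPR right-to-erasure via pseudonymization. PII fields are
# replaced with opaque tokens; analytics-relevant data is retained.
import uuid

PII_FIELDS = ("name", "email", "phone", "address")  # illustrative list

def pseudonymize(candidate: dict) -> dict:
    token = uuid.uuid4().hex[:12]
    redacted = dict(candidate)
    for field in PII_FIELDS:
        if field in redacted:
            redacted[field] = f"erased-{token}"
    # Non-PII (stage history, aggregate scores) stays intact for analytics.
    return redacted
```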
Database Design
The primary database is PostgreSQL with RLS for multi-tenancy. Core tables: tenants (tenant_id, company_name, plan, settings JSONB), requisitions (req_id, tenant_id, title, department, pipeline_config JSONB, status, created_by, created_at), candidates (candidate_id, tenant_id, name, email_encrypted, phone_encrypted, source, created_at), applications (application_id, tenant_id, candidate_id, req_id, current_stage, applied_at, updated_at), scorecards (scorecard_id, application_id, interviewer_id, scores JSONB, notes_encrypted, submitted_at). PII fields are encrypted at rest using envelope encryption (AWS KMS) with per-tenant data keys.
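A sketch of field-level envelope encryption, assuming AWS KMS via boto3 and AES-GCM from the cryptography package (key caching, rotation, and error handling are omitted):

```python
# Sketch: envelope encryption for PII fields with a per-tenant data key.
import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")

def encrypt_field(tenant_kms_key_id: str, plaintext: str) -> dict:
    # KMS returns the data key both in plaintext (for local encryption)
    # and encrypted under the tenant's master key (for storage).
    dk = kms.generate_data_key(KeyId=tenant_kms_key_id, KeySpec="AES_256")
    nonce = os.urandom(12)
    ciphertext = AESGCM(dk["Plaintext"]).encrypt(nonce, plaintext.encode(), None)
    return {
        "ciphertext": ciphertext,
        "nonce": nonce,
        "encrypted_data_key": dk["CiphertextBlob"],  # later: kms.decrypt
    }
```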
Indexes: (tenant_id, req_id, current_stage) for pipeline views, (tenant_id, candidate_id) for candidate profiles, (tenant_id, created_at DESC) for recent activity. A read replica per region serves analytics queries without impacting transactional workloads. Candidate attachments (resumes, offer letters) are stored in S3 with server-side encryption, organized by tenant_id/candidate_id prefix for efficient lifecycle management.
API Design
- POST /api/v1/requisitions: Create a job requisition with pipeline configuration; returns req_id
- POST /api/v1/applications: Submit a candidate application; body contains candidate info, resume file, and source; returns application_id
- PUT /api/v1/applications/{app_id}/stage: Advance or move a candidate to a new stage; body contains target_stage and notes
- POST /api/v1/scorecards: Submit an interview scorecard; body contains application_id, scores, and notes; returns scorecard_id
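For example, submitting a scorecard might look like this (the host, auth header, and payload field values are placeholders):

```python
# Sketch: calling the scorecard endpoint; tenant context comes from the JWT.
import requests

resp = requests.post(
    "https://ats.example.com/api/v1/scorecards",
    headers={"Authorization": "Bearer <jwt>"},
    json={
        "application_id": "app_123",
        "scores": {"technical": 4, "communication": 5},
        "notes": "Strong system design discussion.",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["scorecard_id"])
```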
Scaling & Bottlenecks
Calendar integration is the primary external bottleneck. Polling 200K interviewer calendars for availability hits rate limits on Google Calendar API (1M requests/day per project) and Microsoft Graph API. The system uses webhook subscriptions (push notifications on calendar changes) instead of polling where supported, reducing API calls by 90%. A local availability cache (Redis) stores each interviewer's free/busy data with a 15-minute refresh cycle; scheduling requests use the cache first and fall back to live API calls only for the final confirmation.
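The cache-first lookup could be sketched as follows (redis-py; the key scheme and serialization are illustrative, with the TTL mirroring the 15-minute refresh cycle):

```python
# Sketch: cache-first free/busy lookup with live-API fallback.
import json
import redis

r = redis.Redis()
TTL_SECONDS = 15 * 60   # matches the 15-minute refresh cycle

def get_free_busy(interviewer_id: str, fetch_live):
    """fetch_live: callable hitting Google Calendar / Microsoft Graph."""
    key = f"freebusy:{interviewer_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: no API call
    live = fetch_live(interviewer_id)        # cache miss: one live call
    r.setex(key, TTL_SECONDS, json.dumps(live))
    return live
```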
Multi-tenant database performance requires careful index management. With 10K tenants sharing tables, queries must always include tenant_id to leverage RLS and indexes. Large tenants (enterprises with 1M+ candidates) can cause partition-level hot spots; for these customers, the system offers dedicated database instances (tenant-level sharding) at an enterprise pricing tier. The audit log grows by 5M rows/day and is partitioned monthly with automatic archival to S3 (Parquet format) after 1 year.
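The archival step might look like this sketch, assuming pandas/pyarrow for the Parquet export and boto3 for the upload (bucket name, path layout, and partition naming are illustrative):

```python
# Sketch: export an aged monthly audit partition to S3 as Parquet.
import boto3
import pandas as pd

def archive_partition(conn, month: str, bucket: str = "ats-audit-archive"):
    # Partition name comes from a trusted scheduler, not user input.
    df = pd.read_sql(f"SELECT * FROM audit_logs_{month}", conn)
    local_path = f"/tmp/audit_logs_{month}.parquet"
    df.to_parquet(local_path)  # requires pyarrow
    boto3.client("s3").upload_file(local_path, bucket, f"audit/{month}.parquet")
```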
Key Trade-offs
- Row-Level Security vs schema-per-tenant for multi-tenancy: RLS provides simpler operational management (single schema, shared connection pools) but risks noisy-neighbor issues — dedicated instances for large tenants provide a safety valve
- Push (webhooks) vs pull (polling) for calendar sync: Webhooks reduce API calls 90% but require handling webhook delivery failures and expiration renewal — the hybrid approach (webhooks + periodic polling reconciliation) provides reliability
- Configurable pipeline engine vs hardcoded stages: A flexible state machine supports diverse hiring processes (startups vs enterprises) but increases complexity in the UI and API — progressive disclosure (simple defaults with advanced customization) balances usability
- PII encryption at the field level vs disk-level encryption: Field-level encryption with per-tenant keys enables true data isolation and simplifies GDPR erasure, but adds 2-3ms latency per encrypted field access — acceptable for the security guarantee