System Design: Digital Identity System
Design a government-grade digital identity platform supporting citizen onboarding, credential issuance, and cross-agency identity verification. The design emphasizes security, auditability, privacy, and resilience for national-scale deployments.
Requirements
Functional Requirements:
- Citizen registration with identity proofing (document scan, biometric liveness check)
- Issue verifiable digital credentials (national ID, driver's license, health card)
- Support cross-agency authentication — citizen logs in once and accesses multiple government portals
- Credential revocation and re-issuance on document expiry or fraud detection
- Privacy-preserving selective disclosure: citizens share only required attributes
- Audit trail of every identity verification event for legal compliance
Non-Functional Requirements:
- 99.99% availability — identity services are critical infrastructure
- p99 authentication latency under 500ms, even under peak load (tax season, elections)
- All PII encrypted at rest and in transit; HSM-backed key management
- Full audit log immutability — records must be tamper-evident
- WCAG 2.1 AA accessibility for all citizen-facing interfaces
Scale Estimation
For a mid-size nation of 50M citizens:
- Authentication: 10% of citizens authenticate per day = 5M authentications/day, or ~58/second on average; 10x spikes give ~580/second at peak.
- Onboarding: identity proofing is much lower volume, about 100k new registrations/month.
- Document storage: 50M citizens × 3 documents × 2MB average = 300TB.
- Audit logs: 5M events/day × 1KB each = 5GB/day, or ~1.8TB/year.
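The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1TB = 1,000,000MB); the variable names are illustrative and not part of the design.

```python
SECONDS_PER_DAY = 24 * 60 * 60

citizens = 50_000_000
daily_auth_percent = 10      # share of citizens authenticating per day
peak_multiplier = 10         # assumed spike factor over the daily average

auths_per_day = citizens * daily_auth_percent // 100
avg_auth_per_sec = auths_per_day / SECONDS_PER_DAY
peak_auth_per_sec = avg_auth_per_sec * peak_multiplier

# Storage, in decimal units: 1 TB = 1,000,000 MB and 1 GB = 1,000,000 KB.
docs_storage_tb = citizens * 3 * 2 / 1_000_000       # 3 documents x 2 MB each
audit_gb_per_day = auths_per_day * 1 / 1_000_000     # 1 KB per audit event
audit_tb_per_year = audit_gb_per_day * 365 / 1_000

print(f"{auths_per_day:,} authentications/day")
print(f"{avg_auth_per_sec:.1f}/s average, {peak_auth_per_sec:.0f}/s peak")
print(f"{docs_storage_tb:.0f} TB documents, {audit_tb_per_year:.3f} TB audit logs/year")
```

The unrounded peak comes out slightly below 580/second because the quoted figure multiplies the already rounded average of 58/second by 10.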
High-Level Architecture
The system is built around three tiers: an Identity Proofing Service, a Credential Issuance Service, and a Federated Authentication Gateway. All tiers communicate over mutually authenticated TLS (mTLS), enforced by a service mesh (e.g., Istio). No tier exposes a public internet endpoint directly; all external traffic enters through a WAF and API gateway layer with DDoS protection.
Identity proofing is an asynchronous process: citizens submit document images and biometric video via a mobile app; a queue-backed pipeline runs document verification (OCR plus validation against government registries), liveness detection (ML model), and sanctions screening. On approval, the Identity Proofing Service generates a cryptographic identity anchor linked to a unique citizen identifier. This anchor never leaves the identity tier; downstream services receive only derived credentials.
The Federated Authentication Gateway implements OpenID Connect and SAML 2.0 to integrate with agency portals. Citizens authenticate once and receive a short-lived JWT and a refresh token. The gateway maintains a session store backed by Redis Cluster with geographic distribution, enabling session continuity across regions without re-authentication.
Core Components
Identity Proofing Service
Orchestrates the multi-step onboarding pipeline. Uses a state machine per application (submitted → document_verified → biometric_verified → approved/rejected). Each step is processed by a dedicated worker pool consuming from a durable queue (SQS with DLQ). Third-party biometric and document verification APIs are wrapped with circuit breakers and fallback manual review queues. All submitted documents and biometric data are encrypted with citizen-specific keys before storage.
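The per-application state machine can be sketched as a table of allowed transitions. This is a minimal illustration, not the service's actual code: the names `Stage`, `ALLOWED`, and `advance` are hypothetical, and it assumes that an application can be rejected at any non-terminal stage.

```python
from enum import Enum


class Stage(Enum):
    SUBMITTED = "submitted"
    DOCUMENT_VERIFIED = "document_verified"
    BIOMETRIC_VERIFIED = "biometric_verified"
    APPROVED = "approved"
    REJECTED = "rejected"


# Assumption: rejection is possible from every non-terminal stage.
ALLOWED = {
    Stage.SUBMITTED: {Stage.DOCUMENT_VERIFIED, Stage.REJECTED},
    Stage.DOCUMENT_VERIFIED: {Stage.BIOMETRIC_VERIFIED, Stage.REJECTED},
    Stage.BIOMETRIC_VERIFIED: {Stage.APPROVED, Stage.REJECTED},
    Stage.APPROVED: set(),   # terminal
    Stage.REJECTED: set(),   # terminal
}


class InvalidTransition(Exception):
    """Raised when a worker tries to skip or reverse a pipeline step."""


def advance(current: Stage, target: Stage) -> Stage:
    """Return the new stage, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target


stage = Stage.SUBMITTED
for nxt in (Stage.DOCUMENT_VERIFIED, Stage.BIOMETRIC_VERIFIED, Stage.APPROVED):
    stage = advance(stage, nxt)
print(stage.value)
```

Keeping the transitions in one table means a worker that receives a duplicate or out-of-order queue message fails loudly instead of silently moving an application backwards.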
Credential Issuance & PKI Service
Issues W3C Verifiable Credentials signed with an issuing key that chains to the government's root Certificate Authority. Signing keys are held in a Hardware Security Module (HSM); the root key itself should stay offline and be used only to certify issuing keys, because a compromised root would undermine every credential. Each credential contains the citizen identifier, issued claims, validity period, and a credential status URL for revocation checks. Credential templates are versioned; schema upgrades are backward compatible. A revocation registry (using CRL or OCSP) is published to a globally replicated CDN so verifiers can check status without calling back to origin.
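To make the credential contents concrete, the sketch below assembles an unsigned credential using field names from the W3C Verifiable Credentials Data Model 1.1. The issuer identifier, host name, credential type, and status type are placeholders invented for illustration, and the signing step, which would happen inside the HSM, is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def build_unsigned_credential(citizen_id, claims, credential_id,
                              issued_at, valid_for_days):
    """Assemble the credential body; the proof is added later by the signer."""
    expires_at = issued_at + timedelta(days=valid_for_days)
    return {
        "@context": ["https://www.w3.org/2018/credentials/v1"],
        "id": f"urn:uuid:{credential_id}",
        # "NationalIdCredential" is a hypothetical credential type.
        "type": ["VerifiableCredential", "NationalIdCredential"],
        "issuer": "did:example:government-issuer",        # placeholder issuer
        "issuanceDate": issued_at.isoformat(),
        "expirationDate": expires_at.isoformat(),
        "credentialSubject": {"id": citizen_id, **claims},
        "credentialStatus": {
            # Placeholder host; the path mirrors the status endpoint pattern.
            "id": f"https://id.example.gov/api/v1/credentials/{credential_id}/status",
            "type": "ExampleStatusCheck",                 # hypothetical type
        },
    }


credential = build_unsigned_credential(
    citizen_id="urn:example:citizen:12345",
    claims={"givenName": "Ada", "birthDate": "1990-01-01"},
    credential_id="3f1c2d9e-0000-4000-8000-000000000001",
    issued_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
    valid_for_days=365,
)
print(credential["expirationDate"])
```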
Federated Authentication Gateway
Implements OIDC Authorization Code Flow with PKCE. Integrates with agency portals via registered client configurations. On successful authentication, issues ID tokens with only the claims requested by the relying party (selective disclosure). Token signing keys are rotated on a 90-day schedule with a 7-day overlap for zero-downtime rotation. The gateway logs every authentication event (citizen ID, agency, timestamp, IP, success/failure) to an append-only audit store.
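Two of the gateway behaviours described here are small enough to sketch with the standard library: the PKCE S256 check from RFC 7636 and the filtering of claims down to what a relying party requested. The function names and the `allowed_for_client` parameter are assumptions made for this example; a production gateway would normally rely on a vetted OIDC library rather than hand-written code.

```python
import base64
import hashlib
import hmac
import secrets


def make_code_verifier() -> str:
    """Client side: 32 random bytes give a 43-character URL-safe verifier."""
    return secrets.token_urlsafe(32)


def s256_challenge(verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(code_verifier)), without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(verifier: str, stored_challenge: str) -> bool:
    """Gateway side: compare in constant time at the token endpoint."""
    return hmac.compare_digest(s256_challenge(verifier), stored_challenge)


def filter_claims(all_claims: dict, requested, allowed_for_client) -> dict:
    """Release only claims that were both requested and permitted."""
    permitted = set(requested) & set(allowed_for_client)
    return {name: value for name, value in all_claims.items() if name in permitted}


verifier = make_code_verifier()
challenge = s256_challenge(verifier)   # sent with the authorization request
print(verify_pkce(verifier, challenge))

profile = {"sub": "c-123", "name": "Ada", "birthdate": "1990-01-01", "address": "..."}
print(filter_claims(profile, requested=["sub", "birthdate", "address"],
                    allowed_for_client=["sub", "birthdate"]))
```

Intersecting the requested claims with a per-client allow list means a relying party cannot widen its access simply by asking for more attributes.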
Database Design
Citizen identity records live in PostgreSQL with column-level encryption (pgcrypto) on PII fields. The schema separates identity anchors (non-PII: citizen_id, status, created_at) from identity profiles (PII: name, DOB, address), which are stored in a separate schema with stricter access controls. Joins across the two schemas require elevated role permissions, and each use of those permissions is logged by the audit system.
Audit logs are written to an append-only table in PostgreSQL and simultaneously streamed to an immutable S3-based log archive (using Object Lock) for tamper evidence. A separate read replica powers audit reporting queries without impacting the transactional system. Credential status data uses a Redis cluster for sub-millisecond revocation lookups, backed by a Postgres source of truth.
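The design gets tamper evidence from append-only storage and S3 Object Lock. A complementary technique, not stated above and shown here only as an illustration, is to hash-chain the entries so that any later modification can be detected by recomputing the chain. All names in this sketch are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> dict:
    """Append an event that commits to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash,
             "hash": _entry_hash(prev_hash, event)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_event(audit_log, {"citizen_id": "c-123", "agency": "tax", "success": True})
append_event(audit_log, {"citizen_id": "c-456", "agency": "health", "success": False})
print(verify_chain(audit_log))

audit_log[0]["event"]["success"] = False   # simulate tampering
print(verify_chain(audit_log))
```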
API Design
POST /api/v1/identity/applications — initiates identity proofing; returns application_id for status polling.
GET /api/v1/identity/applications/{appId}/status — returns current pipeline stage and any required remediation steps.
POST /api/v1/credentials/issue — (internal, agency-to-gateway) issues a new verifiable credential for an approved citizen.
GET /api/v1/credentials/{credentialId}/status — public OCSP-like endpoint for verifiers to check credential validity.
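As an illustration of what the credential status endpoint might return, the sketch below maps a stored credential record to a small response body. The record fields, the status values, and the response shape are assumptions for this example; the source does not define them.

```python
from datetime import datetime, timezone


def credential_status(record, now):
    """Build a status response for a credential record (or None if not found)."""
    if record is None:
        return {"status": "unknown", "checkedAt": now.isoformat()}
    if record["revoked"]:
        status = "revoked"          # revocation takes precedence over expiry
    elif now >= record["expires_at"]:
        status = "expired"
    else:
        status = "valid"
    return {
        "credentialId": record["credential_id"],
        "status": status,
        "checkedAt": now.isoformat(),
    }


record = {
    "credential_id": "cred-001",
    "revoked": False,
    "expires_at": datetime(2026, 1, 1, tzinfo=timezone.utc),
}
print(credential_status(record, datetime(2025, 6, 1, tzinfo=timezone.utc)))
```

The response carries no citizen attributes, which keeps a public endpoint from leaking PII to anyone who can guess a credential identifier.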
Scaling & Bottlenecks
Peak authentication load during nationwide elections or tax filing deadlines is the primary bottleneck. The authentication gateway itself is stateless (session state lives in Redis) and scales horizontally behind a load balancer. Geographic distribution across multiple regions with active-active Redis replication lets citizens in any region authenticate against a nearby node. A circuit breaker prevents cascading failure if a backend identity store becomes slow: calls to the slow store fail fast, while tokens that were already issued remain valid until they expire, so signed-in citizens are unaffected by transient outages.
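The circuit breaker mentioned above can be sketched in a few lines. This is a simplified, single-threaded illustration with hypothetical names and thresholds; a real deployment would more likely use the service mesh's or a library's breaker.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a backend that is currently considered unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("backend circuit is open")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
print(breaker.call(lambda: "profile loaded"))
```

Failing fast while the circuit is open keeps gateway threads from piling up behind a slow identity store, which is what turns a local slowdown into a cascading failure.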
The identity proofing pipeline is the second bottleneck at onboarding surges (e.g., new ID program rollout). Worker pool auto-scaling triggered by SQS queue depth handles burst capacity. Third-party verification API rate limits are managed with per-vendor token buckets and a manual review fallback queue, ensuring no citizen application is permanently blocked by a vendor outage.
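A per-vendor token bucket with a manual-review fallback might look like the sketch below. The class and function names are hypothetical, and sending a rate-limited application straight to manual review is a simplification for this example; a real pipeline might retry with a delay first.

```python
import time


class TokenBucket:
    """Allows `rate_per_sec` calls on average, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self, cost=1.0) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        now = self._clock()
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def route_application(application_id, bucket, vendor_queue, manual_review_queue):
    """Send to the vendor if its rate limit allows, else to manual review."""
    if bucket.try_acquire():
        vendor_queue.append(application_id)
        return "vendor"
    manual_review_queue.append(application_id)
    return "manual_review"


vendor_queue, manual_queue = [], []
bucket = TokenBucket(rate_per_sec=5, capacity=2)
for app_id in ("app-1", "app-2", "app-3"):
    print(app_id, route_application(app_id, bucket, vendor_queue, manual_queue))
```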
Key Trade-offs
- Centralized vs. decentralized identity: Centralized simplifies revocation and fraud control but creates a single target; a federated model with verifiable credentials shifts risk but complicates revocation propagation.
- Online vs. offline credential verification: OCSP-based online checks give real-time revocation status but require connectivity; CRL-based offline checks allow air-gapped verifiers but introduce revocation lag.
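- Biometric storage: Storing biometric templates enables re-proofing without re-enrollment but creates a high-value target; storing only a one-way hash reduces breach impact, but because two captures of the same biometric never match exactly, a plain hash cannot support later matching and rules out certain verification use cases.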
- Session length vs. security: Longer-lived sessions reduce authentication friction but extend the window of token abuse; short-lived tokens with silent refresh balance usability and security.