System Design: TLS Certificate Management
Design an automated TLS certificate management system that provisions, renews, and distributes certificates across a large fleet of servers using the ACME protocol. The design covers the certificate lifecycle, private key security, multi-cloud distribution, and expiry monitoring.
Requirements
Functional Requirements:
- Automatically provision TLS certificates via ACME (Let's Encrypt) for all managed domains
- Distribute certificates and private keys to all consuming servers within 5 minutes of issuance
- Automate renewal: certificates renewed 30 days before expiry without human intervention
- Support wildcard certificates, SAN certificates, and internal CA-signed certificates
- Store private keys in HSMs or cloud KMS; never write unencrypted private keys to disk
- Monitor all certificates: alert 30 days before expiry for any certificate not managed by the system
Non-Functional Requirements:
- Certificate provisioning latency under 60 seconds for new domains (ACME DNS-01 challenge)
- Support 100,000 managed domains with certificates
- Certificate delivery to all servers must complete within 5 minutes of issuance
- Private key material must be stored with FIPS 140-2 Level 3 compliance
- 99.999% availability for the certificate serving layer (TLS must never fail due to expired certs)
Scale Estimation
100,000 domains with one certificate each. With 90-day Let's Encrypt certificates renewed at day 60 (30 days before expiry), the renewal rate is 100,000 / 60 days ≈ 1,667 renewals/day ≈ 70/hour ≈ 1.2/minute. Each ACME transaction takes 10–30 seconds (DNS propagation plus CA validation), so one worker completes 2–6 renewals/minute and 10 concurrent ACME workers can process 20–60 renewals/minute, which is more than sufficient. Certificate distribution: 100,000 domains * an average of 10 servers each = 1 million certificate-server pairs; at under 10 KB per delivery, that is at most about 10 GB of distribution traffic per renewal cycle.
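The arithmetic above can be checked with a short script. This is only a sketch: the constants are the figures quoted in the estimate, and the function names are illustrative.

```python
# Back-of-the-envelope check of the scale estimate; all inputs come from the text.
DOMAINS = 100_000
RENEWAL_INTERVAL_DAYS = 60        # 90-day certificates renewed at day 60
ACME_SECONDS = (10, 30)           # fastest and slowest ACME transaction
WORKERS = 10
SERVERS_PER_DOMAIN = 10
MAX_PAYLOAD_KB = 10               # upper bound per certificate delivery


def renewal_rates(domains: int, interval_days: int) -> dict[str, float]:
    """Return the steady-state renewal rate per day, hour, and minute."""
    per_day = domains / interval_days
    return {"day": per_day, "hour": per_day / 24, "minute": per_day / 1440}


def worker_capacity_per_minute(workers: int, seconds_per_txn: float) -> float:
    """Renewals per minute if each worker handles one transaction at a time."""
    return workers * 60 / seconds_per_txn


rates = renewal_rates(DOMAINS, RENEWAL_INTERVAL_DAYS)
slow = worker_capacity_per_minute(WORKERS, ACME_SECONDS[1])
fast = worker_capacity_per_minute(WORKERS, ACME_SECONDS[0])
pairs = DOMAINS * SERVERS_PER_DOMAIN
total_gb = pairs * MAX_PAYLOAD_KB / 1_000_000   # decimal units: 1 GB = 10^6 KB

print(f"renewals: {rates['day']:.0f}/day, {rates['hour']:.1f}/hour, {rates['minute']:.2f}/minute")
print(f"worker capacity: {slow:.0f}-{fast:.0f} renewals/minute")
print(f"distribution: {pairs:,} pairs, up to {total_gb:.0f} GB per cycle")
```

Even the slowest-case worker capacity is well above the steady-state renewal rate, which is why ten workers leave ample headroom.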
High-Level Architecture
The system has four components: Certificate Lifecycle Manager (CLM), ACME Client, Secure Key Store, and Certificate Distribution Layer. The CLM maintains an inventory of all managed domains, tracks certificate expiry dates, and triggers renewal workflows. The ACME Client communicates with Let's Encrypt (or other CAs) to complete domain validation and obtain signed certificates. The Secure Key Store (backed by AWS KMS or HashiCorp Vault with HSM) manages private key generation, storage, and signing operations. The Distribution Layer pushes certificates to consuming servers and CDN configurations.
The ACME DNS-01 challenge is required for wildcard certificates (HTTP-01 cannot validate them) and preferred in environments where HTTP is not exposed. The ACME Client creates a _acme-challenge.{domain} TXT DNS record via the DNS provider's API (Route53, Cloudflare); the record value is a digest of the key authorization, which is derived from the challenge token and the ACME account key. After DNS propagation (30–60 seconds), the CA validates the record and issues the certificate. The CLM then stores the certificate in the key store and triggers distribution.
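As a rough illustration of what goes into the TXT record, the sketch below computes the record name and value as RFC 8555 section 8.4 describes them: the value is the unpadded base64url encoding of the SHA-256 digest of the key authorization (token, a dot, then the account key thumbprint). The function names and the sample token and thumbprint are placeholders, not real values; a real client gets the token from the CA and computes the thumbprint from its account key.

```python
import base64
import hashlib


def dns01_record_name(domain: str) -> str:
    """Return the TXT record name; a wildcard is validated at its base domain."""
    base = domain[2:] if domain.startswith("*.") else domain
    return f"_acme-challenge.{base}"


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """Return base64url(SHA-256(key authorization)) with padding stripped."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Placeholder inputs for illustration only.
name = dns01_record_name("*.example.com")
value = dns01_txt_value("example-token", "example-thumbprint")
print(name, value)
```

Creating the record through Route53 or Cloudflare is left out on purpose, since it depends on the provider's SDK.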
Certificate distribution uses a push model: when a new certificate is available, the CLM writes it to an encrypted S3 bucket and publishes a cert_updated event to Kafka. Certificate agents running on each server subscribe to Kafka events for their managed domains. On receiving an event, an agent downloads the new certificate from S3, writes the certificate chain to disk, and triggers a graceful TLS reload (for example, SIGHUP to Nginx, or the equivalent hot-reload mechanism for Envoy) without dropping existing connections. In production-grade setups the private key is not written to disk in cleartext: it is decrypted via KMS at this point and held in memory by the TLS server.
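The agent-side steps might look roughly like the sketch below. It makes several assumptions: the event is a JSON object with hypothetical "type" and "domain" fields, the chain bytes have already been downloaded from S3, and the reload is passed in as a callback so that the example does not send real signals. The Kafka consumer and S3 client are deliberately left out.

```python
import json
import os
import tempfile
from typing import Callable, Iterable


def handles_event(event: dict, managed_domains: Iterable[str]) -> bool:
    """True if this agent should act on the event (hypothetical event schema)."""
    return event.get("type") == "cert_updated" and event.get("domain") in set(managed_domains)


def install_chain(chain_pem: bytes, dest_path: str, reload_tls: Callable[[], None]) -> None:
    """Atomically replace the chain file, then ask the TLS server to reload."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".chain-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(chain_pem)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace is atomic when source and destination share a filesystem,
        # so the TLS server never sees a half-written chain.
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    reload_tls()


# Demo with a temporary directory and a stub reload hook instead of a real SIGHUP.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "example.com.chain.pem")
    event = json.loads('{"type": "cert_updated", "domain": "example.com"}')
    if handles_event(event, ["example.com"]):
        install_chain(b"placeholder chain bytes", target, lambda: print("reload requested"))
```

Writing to a temporary file in the same directory and renaming it means a crash mid-write leaves the previous chain in place. Note that mkstemp creates the file readable only by its owner, so a real agent would set ownership and mode to suit the TLS server.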
Core Components
Certificate Lifecycle Manager
The CLM runs a daily scheduled job that queries the certificate inventory for all certificates expiring within 30 days. For each, it creates a renewal workflow: (1) generate new RSA-2048 or ECDSA P-256 key pair in KMS (the private key never leaves KMS in plaintext), (2) generate a CSR using the KMS key, (3) submit the ACME order, (4) complete DNS-01 challenge, (5) download and validate the issued certificate, (6) store in inventory. Pre-expiry monitoring sends alerts at 30-day, 7-day, and 1-day thresholds; automated pages fire at 1-day if renewal has failed.
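The selection and alerting logic of the daily job can be sketched as follows. The CertRecord type and function names are illustrative stand-ins for rows in the certificate inventory; the 30-day renewal window and the 30/7/1-day alert thresholds come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

RENEWAL_WINDOW = timedelta(days=30)
ALERT_THRESHOLDS_DAYS = (30, 7, 1)


@dataclass
class CertRecord:
    """Minimal stand-in for an inventory row (hypothetical shape)."""
    domain: str
    not_after: datetime
    status: str = "ACTIVE"


def due_for_renewal(certs: List[CertRecord], now: datetime) -> List[CertRecord]:
    """Return active certificates that expire within the renewal window."""
    return [c for c in certs if c.status == "ACTIVE" and c.not_after - now <= RENEWAL_WINDOW]


def alert_threshold(cert: CertRecord, now: datetime) -> Optional[int]:
    """Return the tightest threshold (in days) the certificate has crossed, or None."""
    remaining = cert.not_after - now
    crossed = [d for d in ALERT_THRESHOLDS_DAYS if remaining <= timedelta(days=d)]
    return min(crossed) if crossed else None


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
certs = [
    CertRecord("a.example.com", now + timedelta(days=45)),
    CertRecord("b.example.com", now + timedelta(days=20)),
    CertRecord("c.example.com", now + timedelta(hours=12)),
]
for cert in due_for_renewal(certs, now):
    print(cert.domain, "alert threshold (days):", alert_threshold(cert, now))
```

A certificate that reaches the 1-day threshold while still awaiting renewal is the case that would page someone.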
Secure Key Store (KMS + Vault)
Private keys are generated using AWS KMS asymmetric key operations (RSA_2048 or ECC_NIST_P256). KMS performs the signing operation for CSR generation; the private key material never leaves the HSM. Vault stores the certificate chain, metadata, and a KMS key reference. Access to private key operations is controlled by IAM policies (for KMS) and Vault policies: only the ACME client and certificate agent service accounts can perform signing operations. All key operations are logged in AWS CloudTrail and Vault audit logs.
Certificate Discovery & External Monitoring
A certificate scanner (built on ZLint and masscan) continuously probes all public-facing endpoints for their TLS certificates. It checks: expiry date, certificate chain completeness, domain name matches, key strength (minimum RSA-2048 or ECDSA P-256), CT log presence, OCSP stapling availability, and HSTS presence. Certificates not found in the managed inventory (shadow certificates issued without the system's knowledge) trigger a high-severity alert for investigation.
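Two of these checks, expiry and shadow-certificate detection, can be sketched with the Python standard library. The example assumes the certificate has already been fetched and parsed into the dictionary shape returned by `ssl.SSLSocket.getpeercert()`. The sample dictionary and the serial-number inventory are made up for illustration, and the sketch does not show how ZLint or masscan are used.

```python
import ssl
from datetime import datetime, timezone
from typing import List, Set


def days_until_expiry(peer_cert: dict, now: datetime) -> float:
    """Days between `now` and the certificate's notAfter timestamp."""
    expires = ssl.cert_time_to_seconds(peer_cert["notAfter"])
    return (expires - now.timestamp()) / 86_400


def san_dns_names(peer_cert: dict) -> List[str]:
    """DNS names listed in the subjectAltName extension."""
    return [value for kind, value in peer_cert.get("subjectAltName", ()) if kind == "DNS"]


def is_shadow_certificate(peer_cert: dict, managed_serials: Set[str]) -> bool:
    """True if the serial number is absent from the managed inventory."""
    known = {serial.upper() for serial in managed_serials}
    return peer_cert["serialNumber"].upper() not in known


# Synthetic certificate data for illustration only.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
sample = {
    "notAfter": "Jan 31 00:00:00 2025 GMT",
    "serialNumber": "0A1B2C",
    "subjectAltName": (("DNS", "example.com"), ("DNS", "www.example.com")),
}
print(days_until_expiry(sample, now), san_dns_names(sample))
print("shadow:", is_shadow_certificate(sample, {"FFEE01"}))
```

Matching on serial number alone is a simplification; a real inventory lookup would likely key on issuer plus serial, or on the certificate fingerprint.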
Database Design
PostgreSQL tables:
- certificates (cert_id UUID, domain VARCHAR, san_domains TEXT[], issuer VARCHAR, serial_number VARCHAR, not_before TIMESTAMP, not_after TIMESTAMP, key_algorithm VARCHAR, kms_key_id VARCHAR, cert_chain_s3_path VARCHAR, status ENUM(ACTIVE, RENEWING, EXPIRED, REVOKED), last_renewed_at, created_at)
- distribution_targets (target_id, cert_id, server_id, delivery_method ENUM, last_delivered_at, delivery_status)
- renewal_events (event_id, cert_id, event_type, details_json, occurred_at)
API Design
POST /domains — Register a new domain for automated certificate management.
GET /domains/{domain}/certificate — Return the current certificate metadata and download URL for the certificate chain.
POST /domains/{domain}/renew — Trigger an immediate out-of-band certificate renewal (for emergency re-key).
GET /certificates/expiring?days={n} — Return all certificates expiring within N days with renewal status.
Scaling & Bottlenecks
Let's Encrypt rate limits: 50 new certificates per registered domain per week and 300 new orders per account per 3 hours. At 300 orders per 3 hours, one account can place up to 2,400 orders/day (300 / 3 hours * 24 hours), so the 1,667 renewals/day for 100,000 domains fit within a single ACME account at roughly 70% of the limit, provided renewals are spread across the day rather than submitted as one daily batch. Multiple ACME accounts with domain load balancing provide redundancy and headroom. DNS propagation time (30–120 seconds) is the dominant latency for ACME DNS-01 challenges; using a DNS provider whose API can create TXT records with a low TTL (60 seconds) minimizes this.
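One way to keep ACME workers under the per-account order limit is a client-side limiter. The sketch below uses a simple sliding window over the last three hours; this is an approximation chosen for illustration, and the CA's actual enforcement algorithm may differ. The class and constant names are hypothetical.

```python
from collections import deque

WINDOW_SECONDS = 3 * 3600
MAX_ORDERS_PER_WINDOW = 300


class OrderRateLimiter:
    """Allow at most `max_orders` new orders in any sliding window."""

    def __init__(self, max_orders: int = MAX_ORDERS_PER_WINDOW,
                 window_seconds: float = WINDOW_SECONDS) -> None:
        self.max_orders = max_orders
        self.window_seconds = window_seconds
        self._sent: deque = deque()   # timestamps of accepted orders, oldest first

    def try_acquire(self, now: float) -> bool:
        """Record an order at time `now` if the window has room; else refuse."""
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_orders:
            return False
        self._sent.append(now)
        return True


# Daily capacity of one account versus the steady-state renewal load.
daily_capacity = MAX_ORDERS_PER_WINDOW * 24 * 3600 // WINDOW_SECONDS
utilization = (100_000 / 60) / daily_capacity
print(f"capacity: {daily_capacity}/day, utilization: {utilization:.0%}")
```

A worker would call `try_acquire(time.time())` before submitting an order and requeue the renewal when it returns False, which also smooths a daily batch out over the day.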
Delivering to all 1 million server-domain pairs within 5 minutes, the worst case in which every certificate is reissued at once, requires about 3,333 distribution events/second (1,000,000 / 300 seconds). Kafka-based push distribution with parallel certificate agents on each server handles this; each server receives only events for its domains (filtered by domain subscription), keeping the per-server event rate low.
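For comparison, the short calculation below sets the 3,333 events/second figure, which corresponds to every pair being updated in a single 5-minute window, against the steady-state rate implied by the renewal estimate earlier in the document. The variable names are illustrative; the inputs are the document's own numbers.

```python
DOMAINS = 100_000
SERVERS_PER_DOMAIN = 10
DELIVERY_WINDOW_SECONDS = 5 * 60
RENEWAL_INTERVAL_DAYS = 60

pairs = DOMAINS * SERVERS_PER_DOMAIN

# Every certificate reissued and delivered within one 5-minute window.
worst_case_per_second = pairs / DELIVERY_WINDOW_SECONDS

# Normal operation: renewals are spread evenly over the 60-day interval.
steady_state_per_day = DOMAINS / RENEWAL_INTERVAL_DAYS * SERVERS_PER_DOMAIN
steady_state_per_second = steady_state_per_day / 86_400

print(f"worst case: {worst_case_per_second:,.0f} deliveries/second")
print(f"steady state: {steady_state_per_day:,.0f} deliveries/day "
      f"({steady_state_per_second:.2f}/second)")
```

The gap between the two figures suggests sizing the distribution layer for the mass-reissuance case, such as an emergency re-key, rather than for the daily average.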
Key Trade-offs
- ACME DNS-01 vs. HTTP-01 challenge: DNS-01 supports wildcards and works for non-HTTP services but requires DNS API access (risk of DNS provider credential compromise); HTTP-01 is simpler but doesn't support wildcards.
- Short-lived vs. long-lived certificates: 90-day Let's Encrypt certificates shorten the exposure window when a key is compromised or a certificate must be revoked, but depend on reliable automation; 1-year certificates reduce rotation overhead but extend the window of exposure for compromised private keys.
- KMS-resident vs. disk-resident private keys: KMS-resident keys are never exposed outside the HSM and have full audit trails but add API call latency (5–10ms) to CSR generation; disk-resident keys (encrypted with a KMS data key) are faster but require careful file permission management.
- Push vs. pull distribution: Push (the CLM notifies servers as soon as a certificate is issued) delivers certificates faster and lets the CLM track delivery status per server, but depends on the event pipeline and agents being healthy; pull (servers check for new certificates on a schedule) is simpler but adds up to one poll interval of latency before deployment.