
System Design: Public Changelog & Status Page

System design of a public changelog and status page covering incident management, component-level health monitoring, subscriber notifications, and high-availability status serving during outages.

15 min read · Updated Jan 15, 2025
system-design · changelog · status-page · incident-management

Requirements

Functional Requirements:

  • Companies publish changelog entries (new features, improvements, bug fixes) with rich text, images, and categorization
  • Status page displays real-time health of system components (API, Dashboard, Database, CDN) with status indicators (operational, degraded, outage)
  • Incident management: create incidents, post updates, resolve incidents, and compute uptime SLAs
  • Subscriber notifications: users subscribe to status updates via email, SMS, Slack webhook, or RSS
  • Scheduled maintenance announcements with calendar integration
  • Historical uptime data with monthly/annual SLA reports (99.9%, 99.95%, 99.99%)

Non-Functional Requirements:

  • Status page must remain available during the customer's own outage; a status page that goes down during the very incident it should be reporting defeats its purpose
  • Page load under 500ms globally; status API response under 100ms
  • 99.999% availability for the status page (higher than the monitored services themselves)
  • Handle notification fan-out to 1M subscribers within 5 minutes of an incident update
  • Support 10K companies, each with up to 50 components and 100K subscribers

Scale Estimation

  • Monitored components: 10K companies × 50 components = 500K
  • Status checks: if automated, 500K components × 1 check/minute ≈ 8,333 checks/sec
  • Changelog posts: 10K companies × 2 posts/week = 20K posts/week ≈ 119/hour
  • Incidents: 10K companies × 2 incidents/month = 20K incidents/month
  • Status page views: during an incident, a single company's status page may receive 100K views/hour (customers refreshing); normal traffic is 10K companies × 100 views/day = 1M views/day ≈ 11.6/sec
  • Subscriber notifications: the largest companies have 100K subscribers, and an incident update must reach all of them via email/SMS within 5 minutes
  • Total subscribers: 10K companies × 10K average = 100M
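A quick script to sanity-check the arithmetic above (the constants mirror the assumptions just listed):

```python
# Back-of-envelope check of the scale estimates.
COMPANIES = 10_000
COMPONENTS_PER_COMPANY = 50
POSTS_PER_COMPANY_PER_WEEK = 2
VIEWS_PER_COMPANY_PER_DAY = 100

components = COMPANIES * COMPONENTS_PER_COMPANY                      # 500,000
checks_per_sec = components / 60                                     # ~8,333 at 1 check/min each
posts_per_hour = COMPANIES * POSTS_PER_COMPANY_PER_WEEK / (7 * 24)   # ~119
views_per_sec = COMPANIES * VIEWS_PER_COMPANY_PER_DAY / 86_400       # ~11.6

print(f"{components=:,} {checks_per_sec=:,.0f} {posts_per_hour=:.0f} {views_per_sec=:.1f}")
```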

High-Level Architecture

The architecture prioritizes availability of the status page above all else. The Status Serving Layer takes a static-first approach: status pages are pre-rendered as static HTML/JSON and served from a multi-region CDN (CloudFront with origins in three AWS regions: us-east-1, eu-west-1, ap-southeast-1). When a component's status changes, the system regenerates the static page and invalidates the CDN cache. This means the status page can be served even if the backend is completely down; the CDN simply serves the last-known state. The status API (api.statuspage.com/v1/status/{company}) is also CDN-cached with a 30-second TTL.

The Incident Management Plane handles the incident lifecycle. When an operator creates an incident (via dashboard or API), the system: (1) updates component statuses in PostgreSQL, (2) regenerates the static status page and pushes it to the CDN, (3) fans out notifications to subscribers via a Notification Pipeline. Incident updates (investigating → identified → monitoring → resolved) follow the same flow. The operator interface (dashboard) runs on infrastructure separate from the status serving layer, so internal infrastructure issues cannot take the status page down with them.
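A minimal sketch of this three-step flow; the helper functions are hypothetical stand-ins for the PostgreSQL write, the static-page rebuild, and the SQS publish:

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    company_id: str
    title: str
    impact: str                                  # none | minor | major | critical
    component_ids: list = field(default_factory=list)
    status: str = "investigating"

def update_component_statuses(incident: Incident) -> None:
    # Stub: UPDATE components SET status = ... WHERE component_id = ANY(%s)
    print(f"marked {incident.component_ids} degraded/outage in PostgreSQL")

def render_status_page(company_id: str) -> None:
    # Stub: regenerate the static HTML/JSON and push it to the CDN origins
    print(f"re-rendered status page for {company_id}")

def enqueue_notification_job(incident: Incident) -> None:
    # Stub: publish a fan-out job to SQS for the notification workers
    print(f"queued notification job for incident: {incident.title}")

def create_incident(incident: Incident) -> None:
    """The three steps from the text, in order: DB write, page rebuild, fan-out."""
    update_component_statuses(incident)      # (1) statuses in PostgreSQL
    render_status_page(incident.company_id)  # (2) regenerate + push to CDN
    enqueue_notification_job(incident)       # (3) notify subscribers

create_incident(Incident("acme", "Elevated API error rates", "major", ["api"]))
```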

The Changelog Plane is a CMS-like system where product teams publish changelog entries. Each entry has a title, body (Markdown rendered to HTML), category (new feature, improvement, fix, announcement), and associated product components. Changelog pages are similarly pre-rendered and CDN-served. An RSS feed and a weekly email digest of changelog entries let subscribers follow along.
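Rendering the Markdown body to HTML once at publish time (rather than on every page view) keeps the serving path static. A sketch using the `markdown` package; the library choice and function names are assumptions:

```python
import markdown  # pip install markdown; library choice is an assumption

def publish_entry(title: str, body_markdown: str, category: str) -> dict:
    """Render once at publish time and store both forms, matching the schema below."""
    return {
        "title": title,
        "body_markdown": body_markdown,
        "body_html": markdown.markdown(body_markdown),
        "category": category,  # new_feature | improvement | fix | announcement
    }

entry = publish_entry("Dark mode", "We shipped **dark mode** across the dashboard.",
                      "new_feature")
print(entry["body_html"])  # <p>We shipped <strong>dark mode</strong> ...</p>
```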

Core Components

Static-First Status Serving

The status page is generated as a static HTML file by a rendering service whenever a status change occurs. The HTML includes inline CSS and minimal JavaScript (no external dependencies) to ensure it loads even when third-party services are down. The page is pushed to 3 CDN origin buckets (S3 in each region) simultaneously. CloudFront is configured with origin failover: if the primary origin is unreachable, it falls back to the secondary and tertiary. DNS uses Route 53 with health checks on all three origins; if an entire AWS region fails, DNS routes to the surviving regions. The status API serves a lightweight JSON payload: {"status": "degraded", "components": [{"name": "API", "status": "operational"}, ...], "active_incidents": [...]}. This JSON is cached at the CDN edge with a 30-second TTL, ensuring near-real-time freshness with extreme availability.
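A sketch of the push step using boto3; the bucket names, distribution ID, and per-company file layout are assumptions:

```python
import time
import boto3  # bucket names and distribution ID below are assumptions

ORIGIN_BUCKETS = ["status-us-east-1", "status-eu-west-1", "status-ap-southeast-1"]
DISTRIBUTION_ID = "E1EXAMPLE"

def publish_status_page(company: str, html: str, status_json: str) -> None:
    s3 = boto3.client("s3")
    for bucket in ORIGIN_BUCKETS:
        # Push the pre-rendered page to every regional origin...
        s3.put_object(Bucket=bucket, Key=f"{company}/index.html",
                      Body=html.encode(), ContentType="text/html")
        # ...and the lightweight JSON payload the status API serves.
        s3.put_object(Bucket=bucket, Key=f"{company}/status.json",
                      Body=status_json.encode(), ContentType="application/json",
                      CacheControl="max-age=30")  # matches the 30-second edge TTL
    # Invalidate the cached page so edges pick up the new state immediately.
    cloudfront = boto3.client("cloudfront")
    cloudfront.create_invalidation(
        DistributionId=DISTRIBUTION_ID,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": [f"/{company}/*"]},
            "CallerReference": str(time.time()),
        },
    )
```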

Notification Fan-Out

The notification system must deliver incident updates to up to 100K subscribers per company within 5 minutes. The fan-out architecture: when an incident update is posted, a Notification Job is created and published to an SQS queue. Notification Workers consume jobs and fan out to individual subscribers. For email: messages are batched and sent via SES (Simple Email Service) at 1,000 emails/sec per account (multiple SES accounts provide higher throughput). For SMS: messages are sent via SNS or Twilio at 100 messages/sec. For Slack webhooks: HTTP POST to each subscriber's webhook URL with up to 500 concurrent requests. For RSS: the feed is regenerated and CDN-cached. A subscriber preference table determines which channels each subscriber wants and deduplicates across channels. Rate limiting prevents notification storms: at most 1 notification per incident per 15-minute window per subscriber (updates within the window are batched).
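A simplified worker for the email path, assuming the SQS/SES setup described above; the queue URL, sender address, and job payload shape are assumptions:

```python
import json
import boto3  # the queue URL and sender address below are assumptions

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/notification-jobs"
SENDER = "status@statuspage.example"

sqs = boto3.client("sqs")
ses = boto3.client("ses")

def run_worker() -> None:
    while True:
        # Long-poll the queue for fan-out jobs posted by the incident flow.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])  # {"subject": ..., "body": ..., "emails": [...]}
            for email in job["emails"]:
                # One call per subscriber for clarity; production batching would
                # pack up to 50 recipients into a single send_email call.
                ses.send_email(
                    Source=SENDER,
                    Destination={"ToAddresses": [email]},
                    Message={"Subject": {"Data": job["subject"]},
                             "Body": {"Text": {"Data": job["body"]}}},
                )
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```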

Uptime Calculation & SLA Reporting

Uptime is calculated per component using heartbeat data. Each component reports status via one of three methods: (1) automated monitoring (the platform pings the component's health endpoint), (2) manual status update (operator sets status via API/dashboard), or (3) integration with monitoring tools (PagerDuty, Datadog, OpsGenie). The uptime calculation: for each component, the system records status transitions (operational → degraded at T1, degraded → operational at T2) in a timeline table. Monthly uptime = (total_minutes - downtime_minutes) / total_minutes × 100. Downtime is defined as any period the component is in "major_outage" status. "Degraded" may or may not count toward SLA downtime (configurable per company). SLA reports are generated monthly and include: uptime percentage, number of incidents, mean time to resolution (MTTR), and comparison against the SLA target.
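The formula is easy to express over the status-transition timeline; a minimal sketch in which only major_outage counts as downtime (whether degraded counts is configurable, per the text):

```python
from datetime import datetime

def monthly_uptime(transitions, month_start, month_end,
                   downtime_statuses=frozenset({"major_outage"})):
    """transitions: ordered (timestamp, new_status) pairs for one component,
    assumed to fall within [month_start, month_end)."""
    total_minutes = (month_end - month_start).total_seconds() / 60
    downtime_minutes = 0.0
    status, since = "operational", month_start
    for ts, new_status in transitions:
        if status in downtime_statuses:
            downtime_minutes += (ts - since).total_seconds() / 60
        status, since = new_status, ts
    if status in downtime_statuses:  # still down at the end of the month
        downtime_minutes += (month_end - since).total_seconds() / 60
    return (total_minutes - downtime_minutes) / total_minutes * 100

start, end = datetime(2025, 1, 1), datetime(2025, 2, 1)
history = [(datetime(2025, 1, 10, 4, 0), "major_outage"),
           (datetime(2025, 1, 10, 4, 43), "operational")]
print(f"{monthly_uptime(history, start, end):.3f}%")  # 43 min down -> 99.904%
```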

Database Design

PostgreSQL stores all operational data:

  • companies (company_id, name, subdomain, custom_domain, plan, created_at)
  • components (component_id, company_id, name, description, status ENUM(operational, degraded_performance, partial_outage, major_outage), position INT, group_id nullable)
  • incidents (incident_id, company_id, title, status ENUM(investigating, identified, monitoring, resolved, postmortem), impact ENUM(none, minor, major, critical), created_at, resolved_at)
  • incident_updates (update_id, incident_id, status, body TEXT, created_at)
  • component_status_history (history_id, component_id, status, started_at, ended_at)
  • subscribers (subscriber_id, company_id, email, phone nullable, slack_webhook_url nullable, channels ARRAY, component_ids ARRAY nullable for component-specific subscriptions, confirmed BOOLEAN, created_at)
  • changelog_entries (entry_id, company_id, title, body_markdown, body_html, category ENUM(new_feature, improvement, fix, announcement), published_at, author_id)

Indexes: (company_id, status) for active incidents, (company_id, published_at DESC) for changelog feeds.
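The (company_id, status) index backs the hottest read path: fetching a company's unresolved incidents for the dashboard and the render pipeline. A minimal sketch with psycopg2; the DSN and the exact "unresolved" predicate are assumptions:

```python
import psycopg2  # table and column names follow the schema above

def active_incidents(company_id: str):
    """Fetch unresolved incidents; served by the (company_id, status) index."""
    conn = psycopg2.connect("dbname=statuspage")  # DSN is an assumption
    try:
        with conn.cursor() as cur:
            cur.execute(
                """SELECT incident_id, title, status, impact, created_at
                     FROM incidents
                    WHERE company_id = %s AND status != 'resolved'
                    ORDER BY created_at DESC""",
                (company_id,),
            )
            return cur.fetchall()
    finally:
        conn.close()
```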

API Design

  • GET /api/v1/status/{company_subdomain} — Fetch current status of all components and active incidents; CDN-cached, 30-second TTL
  • POST /api/v1/incidents — Create an incident; body contains title, impact, affected_component_ids, initial update message; triggers subscriber notification (see the example request after this list)
  • POST /api/v1/incidents/{incident_id}/updates — Post an incident update; body contains status and message; triggers notification fan-out
  • POST /api/v1/changelog — Publish a changelog entry; body contains title, body_markdown, category
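A usage sketch for the incident-creation endpoint; the auth scheme and the exact field name for the initial update message are assumptions beyond what the list specifies:

```python
import requests  # auth scheme and message field name are assumptions

resp = requests.post(
    "https://api.statuspage.com/v1/incidents",
    headers={"Authorization": "Bearer <api-token>"},
    json={
        "title": "Elevated API error rates",
        "impact": "major",
        "affected_component_ids": ["comp_api"],
        "message": "We are investigating elevated 5xx rates on the API.",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # created incident, including its incident_id
```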

Scaling & Bottlenecks

The status page itself has essentially no scaling concerns because it is served entirely from the CDN as static content. The CDN absorbs millions of requests during a major incident (customers anxiously refreshing the status page) without any backend load. The rendering pipeline (regenerate HTML on status change) runs in under 2 seconds and is triggered at most once per minute (debounced to handle rapid status flapping).
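The once-per-minute debounce can be as simple as tracking the last render time per company; a minimal sketch (the timer that later flushes coalesced changes is left out):

```python
import time

DEBOUNCE_SECONDS = 60  # at most one re-render per company per minute
_last_render: dict[str, float] = {}
_dirty: set[str] = set()

def request_render(company_id: str) -> None:
    """Called on every status change; renders at most once per window."""
    now = time.monotonic()
    if now - _last_render.get(company_id, 0.0) >= DEBOUNCE_SECONDS:
        _last_render[company_id] = now
        _dirty.discard(company_id)
        print(f"rendering {company_id}")  # stand-in for the real render + CDN push
    else:
        _dirty.add(company_id)  # coalesce flapping; a periodic timer flushes _dirty
```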

The notification fan-out is the primary bottleneck. Sending 100K emails in 5 minutes requires 333 emails/sec. SES handles this with a single account (burst rate 1,000/sec). SMS is more constrained: 100K SMS at 100/sec = 17 minutes, exceeding the 5-minute target. The system uses multiple Twilio accounts (phone number pool) to achieve 500 SMS/sec, reducing delivery time to 3.3 minutes. For very large subscriber lists (> 100K), notifications are prioritized: email subscribers first (fastest, cheapest), then webhook/Slack (instant delivery), then SMS (most expensive, used for critical incidents only).

Key Trade-offs

  • Static-first CDN serving vs dynamic API: Static pages provide extreme availability (CDN serves even if backend is down) but introduce up to 30 seconds of staleness — the 30-second CDN TTL is short enough for status page freshness requirements
  • Multi-region CDN origins vs single origin: Three origin regions provide resilience against regional AWS outages but triple the storage and invalidation complexity — essential for a five-nines SLA status page
  • Automated monitoring vs manual status updates: Automated monitoring catches issues faster (sub-minute detection) but can produce false positives (leading to unnecessary incident pages) — most companies use automated detection with manual incident creation to avoid false alarms
  • Notification batching (15-minute window) vs instant on every update: Batching prevents notification fatigue during rapidly-evolving incidents (3-4 updates in 15 minutes become a single notification) — but delays individual updates; subscribers can opt in to instant mode for critical components
