Zero-Downtime Deployments: Blue-Green, Canary, and Rolling Strategies

Deploying without downtime isn't just about the deployment strategy — it's about how your application handles the transition between versions. Database migrations, in-flight requests, and connection draining all need consideration. The deployment strategy is the easy part.

Rolling Deployments

Rolling updates are the default strategy for Kubernetes Deployments. Old pods are gradually replaced with new pods, so at any point during the rollout both old and new versions serve traffic.

Key settings:

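maxUnavailable: 0 ensures the number of available pods never drops below the desired replica count (it requires maxSurge of at least 1, since Kubernetes rejects a rollout where both are 0)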
readinessProbe prevents traffic from hitting pods that aren't ready
preStop hook gives in-flight requests time to complete before the pod shuts down
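
To see how maxUnavailable and maxSurge bound capacity, the rollout can be modelled as a loop over pod counts. This is a simplified sketch, not the Deployment controller's actual algorithm: it assumes new pods become ready instantly, and simulate_rolling_update is a hypothetical helper name.

```python
def simulate_rolling_update(replicas, max_surge, max_unavailable):
    """Return a list of (old_pods, new_pods) snapshots for a modelled rollout.

    Assumption: every pod counted here is ready, so old + new is the
    serving capacity at that moment.
    """
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")

    old, new = replicas, 0
    history = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up: total pods may not exceed replicas + max_surge.
        room = replicas + max_surge - (old + new)
        new += min(room, replicas - new)
        history.append((old, new))

        # Scale down: total pods may not fall below replicas - max_unavailable.
        min_available = replicas - max_unavailable
        old -= min(old, (old + new) - min_available)
        history.append((old, new))
    return history


history = simulate_rolling_update(replicas=4, max_surge=1, max_unavailable=0)
lowest = min(old + new for old, new in history)
highest = max(old + new for old, new in history)
```

With 4 replicas, maxSurge: 1, and maxUnavailable: 0, the model never drops below 4 pods and never exceeds 5. Swapping the values (maxSurge: 0, maxUnavailable: 1) trades that extra pod for a temporary dip to 3.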

Risk: Both versions serve traffic simultaneously during the rollout. If v2 has a breaking change, some requests hit v2 while others hit v1, and a single user can bounce between the two. Every change therefore has to be backward compatible with the version it replaces.

Blue-Green Deployments

Run two identical environments. "Blue" is the current production environment. "Green" is the new version. Once green passes health checks, switch all traffic from blue to green.

With Kubernetes Services, each environment is a set of pods carrying a version label (version: blue or version: green). Switch by updating the Service selector from version: blue to version: green. Roll back by switching back.

Advantage: Instant rollback. The old version is still running — just switch the selector back.

Disadvantage: Requires double the infrastructure during deployment. With 4 replicas, you need 8 running during the transition.
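
The selector switch can be modelled in a few lines. This is a sketch of the idea rather than the Kubernetes API: pods and the Service are plain dictionaries, and endpoints, switch_traffic, and make_pods are hypothetical helpers. The one design choice worth noting is that the switch refuses to happen unless every target pod is ready.

```python
def matches(selector, labels):
    """A pod matches when it carries every key/value pair in the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(service, pods):
    """Names of ready pods that the Service currently routes to."""
    return sorted(
        pod["name"]
        for pod in pods
        if pod["ready"] and matches(service["selector"], pod["labels"])
    )


def switch_traffic(service, pods, target):
    """Point the Service at `target`, but only if all of its pods are ready."""
    candidate = {**service["selector"], "version": target}
    targets = [pod for pod in pods if matches(candidate, pod["labels"])]
    if not targets or not all(pod["ready"] for pod in targets):
        raise RuntimeError(f"{target} is not healthy; refusing to switch")
    service["selector"] = candidate


def make_pods(version, count, ready=True):
    return [
        {
            "name": f"{version}-{i}",
            "labels": {"app": "order-service", "version": version},
            "ready": ready,
        }
        for i in range(count)
    ]


# 4 replicas per colour means 8 pods exist during the transition.
pods = make_pods("blue", 4) + make_pods("green", 4)
service = {"selector": {"app": "order-service", "version": "blue"}}
```

Calling switch_traffic(service, pods, "green") moves every endpoint at once, and calling it again with "blue" is the rollback. Nothing is restarted in either direction, which is why the rollback is instant.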

Canary Deployments

Route a small percentage of traffic to the new version. Monitor error rates and latency. Gradually increase traffic if metrics are healthy. Roll back if they degrade.

Argo Rollouts can automate the canary: each step shifts more traffic to the new version, and an analysis of the success rate gates the next step. The canary is promoted automatically if the success rate stays above 99%, and rolled back if it drops below.
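
The control loop behind an automated canary is small. The sketch below is a model of that loop, not Argo Rollouts' API: run_canary and the measure_success_rate callback are hypothetical, the traffic steps are illustrative, and a rate exactly at the threshold is treated as passing.

```python
from typing import Callable, Sequence, Tuple


def run_canary(
    steps: Sequence[int],
    measure_success_rate: Callable[[int], float],
    threshold: float = 0.99,
) -> Tuple[str, int]:
    """Walk through traffic weights, aborting on the first unhealthy reading.

    Returns ("promoted", 100) if every step passes, otherwise
    ("rolled_back", weight) with the weight at which the check failed.
    """
    for weight in steps:
        rate = measure_success_rate(weight)
        if rate < threshold:
            return ("rolled_back", weight)
    return ("promoted", 100)


def healthy(weight: int) -> float:
    return 0.999


def degrades_under_load(weight: int) -> float:
    # Hypothetical failure mode: errors appear once the canary takes real load.
    return 0.95 if weight >= 50 else 0.999


good = run_canary([5, 25, 50, 100], healthy)
bad = run_canary([5, 25, 50, 100], degrades_under_load)
```

In the second run the canary passes at 5% and 25% and is rolled back at 50%, so half of the traffic never saw the regression at full strength. In a real setup the callback would query your metrics system over a time window rather than return a constant.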

Database Migrations Without Downtime

The deployment strategy is the easy part. Database migrations are where zero-downtime deployments actually break.

The problem: During a rolling deployment, both v1 and v2 code run simultaneously. If v2 requires a schema change, v1 code might break against the new schema.

The solution: Expand-Contract pattern.

Phase 1: Expand (backward-compatible)

Add new columns/tables without removing or renaming existing ones. Both v1 and v2 work with the expanded schema.
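
Here is a small sketch of the expand phase using Python's built-in sqlite3 module. The users table and the name and full_name columns are hypothetical, chosen only to illustrate a column rename. Note that v2 writes both columns and falls back to the old one when reading, which is what keeps v1 working alongside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Expand: add the new column as nullable. Nothing is removed or renamed.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def v1_insert(name):
    # v1 does not know about full_name and leaves it NULL.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def v2_insert(name):
    # v2 dual-writes so that v1 readers still see the row.
    conn.execute(
        "INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name)
    )


def v1_names():
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]


def v2_names():
    # v2 prefers the new column but tolerates rows not yet backfilled.
    query = "SELECT COALESCE(full_name, name) FROM users ORDER BY id"
    return [row[0] for row in conn.execute(query)]


v1_insert("Grace Hopper")
v2_insert("Alan Turing")
```

After these writes, v1_names() and v2_names() return the same three names, even though the rows were written by different versions.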

Phase 2: Migrate Data

Backfill the new column with data from existing columns. On large tables, do this in batches so that no single statement holds locks for long.
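
A batched backfill might look like the sketch below. It uses sqlite3 so that it runs anywhere, and it assumes a hypothetical users table where full_name is the new column and name is the old one. backfill_in_batches is an illustrative helper; a production database would need its own batching and locking strategy.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Copy name into full_name for rows that are still NULL.

    Each batch is committed separately, so the work can be interrupted
    and resumed, and each transaction stays short. Returns the number
    of rows updated.
    """
    total = 0
    while True:
        cursor = conn.execute(
            "UPDATE users SET full_name = name "
            "WHERE id IN ("
            "  SELECT id FROM users WHERE full_name IS NULL LIMIT ?"
            ")",
            (batch_size,),
        )
        conn.commit()
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount


# Demo: 25 rows written before the new column existed, backfilled 10 at a time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, full_name TEXT)"
)
conn.executemany(
    "INSERT INTO users (name) VALUES (?)", [(f"user-{i}",) for i in range(25)]
)
updated = backfill_in_batches(conn, batch_size=10)
```

Because the WHERE clause only selects rows that are still NULL, the function is safe to re-run: a second call finds nothing to do.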

Phase 3: Contract (remove old)

After all instances are running v2 and the new column is populated, remove the old column in a future deployment.

Rules:

Never rename a column in a single deployment. Add the new column (expand), deploy, migrate data, then drop the old column (contract) in a later deployment.
Never add a NOT NULL column without a DEFAULT in a single step, because inserts from v1 code that omit the column will fail. Add it as nullable first, backfill, then add the constraint.
Never drop a column that running code still reads.
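
The last rule can be turned into a pre-flight check that runs before the contract migration. The sketch below is one possible shape for it, under stated assumptions: the users table and full_name column are hypothetical, versions are plain integers, and how you obtain the set of running versions (for example, from your orchestrator) is left out.

```python
import sqlite3


def safe_to_contract(conn, running_versions, min_version=2):
    """Return True only when dropping the old column cannot break anything.

    Two conditions, mirroring the rules above:
      1. the backfill is complete (no NULLs left in the new column)
      2. no instance older than min_version is still running
    """
    remaining = conn.execute(
        "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
    ).fetchone()[0]
    old_code_running = any(version < min_version for version in running_versions)
    return remaining == 0 and not old_code_running


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, full_name TEXT)"
)
conn.execute("INSERT INTO users (name, full_name) VALUES ('Ada', 'Ada')")
conn.execute("INSERT INTO users (name) VALUES ('Grace')")  # not yet backfilled

mid_rollout = safe_to_contract(conn, running_versions={1, 2})
```

Here mid_rollout is False for two independent reasons: one row is not backfilled, and a v1 instance is still running. Both have to clear before the contract step is safe.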

Graceful Shutdown

When a pod is terminated, in-flight requests must complete before the process exits.
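
A minimal sketch of that behaviour in Python follows. GracefulShutdown is a hypothetical helper, not part of any framework: it records SIGTERM, refuses new requests from that point on, and lets the main thread wait until in-flight requests have drained. Most web servers ship their own version of this, so treat it as an illustration of the mechanism.

```python
import signal
import threading


class GracefulShutdown:
    def __init__(self):
        self._stopping = threading.Event()
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0

    def install(self):
        # Must be called from the main thread; Kubernetes sends SIGTERM
        # to the container's main process when the pod is terminated.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only flip a flag here; the real work happens outside the handler.
        self._stopping.set()

    def try_begin_request(self):
        """Return False once shutdown has started, so callers can reject work."""
        with self._lock:
            if self._stopping.is_set():
                return False
            self._in_flight += 1
            return True

    def end_request(self):
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def wait_for_drain(self, timeout):
        """Block until no requests are in flight; False if the timeout expires."""
        with self._idle:
            return self._idle.wait_for(
                lambda: self._in_flight == 0, timeout=timeout
            )
```

A request handler would wrap its work in try_begin_request() and end_request(), and the main thread would call wait_for_drain() with a timeout shorter than the pod's termination grace period before exiting.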

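The Kubernetes pod lifecycle for graceful shutdown:

The pod is marked Terminating, and its removal from Service endpoints begins asynchronously
The preStop hook runs (here, sleep 10)
The container receives SIGTERM; the application stops accepting new requests and finishes those in flight
If the process is still running when terminationGracePeriodSeconds expires, counted from the start of termination, it receives SIGKILL

The sleep 10 in the preStop hook is critical. There's a race between the pod being removed from endpoints and the load balancer updating its target list. Without the sleep, the load balancer might still send requests to a pod that's already shutting down. Because the sleep counts against the grace period, terminationGracePeriodSeconds must be longer than the sleep plus the time needed to drain requests.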

Feature Flags for Deployment Safety

Decouple deployment from release. Deploy v2 code to production but keep new features behind flags. Enable features gradually after deployment is verified.

This lets you:

Deploy code changes without user-facing impact
Enable features for specific users (internal team, beta users)
Instantly disable a broken feature without a rollback deployment
Run A/B tests on the same deployment
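
A flag check covering the cases above can be sketched in a few lines. This is an illustration rather than any particular flag service's API: FeatureFlags, the config layout, and the flag name new_checkout are all hypothetical. Hashing the flag name together with the user ID keeps each user's result stable across requests and across pods.

```python
import hashlib


def bucket(flag, user_id):
    """Map (flag, user) to a stable number in the range 0-99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


class FeatureFlags:
    def __init__(self, config):
        self._config = config

    def is_enabled(self, flag, user_id):
        rule = self._config.get(flag)
        # Unknown or disabled flags are off: this is the kill switch.
        if rule is None or not rule.get("enabled", False):
            return False
        # Explicit allow list, e.g. internal team or beta users.
        if user_id in rule.get("allow_users", ()):
            return True
        # Gradual rollout: enable for a stable percentage of users.
        return bucket(flag, user_id) < rule.get("percent", 0)


flags = FeatureFlags(
    {
        "new_checkout": {
            "enabled": True,
            "allow_users": {"internal-1"},
            "percent": 0,
        }
    }
)
```

With percent at 0, only internal-1 sees the feature. Raising percent widens the audience without a deploy, and setting enabled to False turns the feature off for everyone, including the allow list.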

Feature flags combined with canary deployments give you two independent layers of safety. The canary validates infrastructure stability (new code doesn't crash or degrade metrics under production traffic), and feature flags control feature exposure independently.

Rollback Checklist

When a deployment goes wrong:

Automated rollback: Argo Rollouts or Flagger detects metric degradation and rolls back automatically
Manual rollback: kubectl rollout undo deployment/order-service (Kubernetes keeps previous ReplicaSets, up to the Deployment's revisionHistoryLimit, which defaults to 10)
Verify rollback: Check that error rates return to baseline after rollback
Database state: If a migration ran, is the old code compatible with the new schema? (This is why expand-contract matters)
Post-mortem: Why did the canary not catch the issue? Was the analysis template checking the right metrics?

Zero-downtime deployments are a system property, not a deployment strategy choice. The strategy (rolling, blue-green, canary) is one piece. Backward-compatible database migrations, graceful shutdown, readiness probes, and feature flags are equally important. Get all of them right, and deployments become routine. Miss any one, and you're one bad deploy away from an outage.
