TL;DR - Quick Answer
- Zero-downtime deployment means releasing updates without service interruption
- Requires: proper readiness probes, rolling update strategy, graceful shutdown handling
- Common strategies: Rolling Updates (default), Blue-Green, Canary Releases
- Most failures are caused by missing readiness probes or improper connection draining
What is Zero-Downtime Deployment?
Zero-downtime deployment is a deployment strategy where application updates are released without any service interruption. Users continue to access the application normally throughout the entire deployment process - they don't experience errors, timeouts, or any indication that an update is happening.
In Kubernetes, this is achieved through a combination of proper deployment configuration, health checks, and graceful shutdown handling. When done correctly, you can deploy multiple times per day with complete confidence that users won't be affected.
Why Zero-Downtime Matters
Every minute of downtime has a cost:
- Revenue loss - E-commerce sites lose sales, SaaS platforms lose usage-based revenue
- User trust - Frequent outages erode confidence in your platform
- Developer velocity - Fear of deployments leads to infrequent releases with larger, riskier changes
- On-call burden - Risky deployments mean more incidents and stressed engineers
Teams that achieve reliable zero-downtime deployments deploy more frequently, ship smaller changes, and have higher confidence in their releases.
Deployment Strategies Compared
| Strategy | How It Works | Best For |
|---|---|---|
| Rolling Update | Gradually replaces old pods with new ones | Most applications (default strategy) |
| Blue-Green | Two identical environments, switch traffic at once | Database migrations, major changes |
| Canary | Route small % of traffic to new version first | High-risk changes, gradual rollout |
Step 1: Configure Readiness Probes
The readiness probe tells Kubernetes when your pod is ready to receive traffic. Without one, Kubernetes treats a pod as ready as soon as its containers have started, so traffic can reach an application that is still initializing, causing errors during deployment.
spec:
  containers:
    - name: app
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
Common Mistake
Don't rely on a liveness probe alone! Liveness probes restart unhealthy containers, while readiness probes control traffic routing. You need both for zero-downtime.
Your health endpoint should verify that the application is truly ready:
- Database connections are established
- Cache is warmed (if needed)
- Dependencies are reachable
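As a rough illustration of the checks listed above, here is a minimal sketch of a /health endpoint built on Python's standard library. The CHECKS dictionary and its always-true lambdas are hypothetical placeholders, not part of any real application; swap in real probes such as a cheap database query. The endpoint answers 200 only when every check passes and 503 otherwise, which an httpGet probe treats as a failure.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks; replace the lambdas with real probes
# (for example a cheap "SELECT 1" against the database).
CHECKS: Dict[str, Callable[[], bool]] = {
    "database": lambda: True,
    "cache": lambda: True,
}


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Return (HTTP status, body): 200 if every check passes, else 503."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as not ready
        if not ok:
            failed.append(name)
    if failed:
        return 503, "not ready: " + ", ".join(failed)
    return 200, "ready"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = readiness(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Blocking entry point; port 8080 matches the probe configuration."""
    ThreadingHTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the endpoint. Keeping each check fast matters, because a slow check can exceed the probe timeout and mark a healthy pod as unready.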
Step 2: Configure Rolling Update Strategy
The rolling update strategy controls how Kubernetes replaces old pods with new ones. Two key parameters:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
- maxUnavailable: 0 - Never have fewer than the desired number of ready pods (zero-downtime)
- maxSurge: 1 - Allow one extra pod during rollout
With this configuration, Kubernetes will:
- Create a new pod with the updated version
- Wait for it to pass readiness checks
- Start routing traffic to the new pod
- Terminate an old pod
- Repeat until all pods are updated
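The sequence above can be sketched as a toy simulation. This is not how the Deployment controller is implemented; it is a simplified model that assumes every new pod passes its readiness check immediately and that both settings are absolute numbers rather than percentages. The function name simulate_rollout is made up for this example.

```python
from typing import List, Tuple


def simulate_rollout(
    replicas: int, max_surge: int = 1, max_unavailable: int = 0
) -> List[Tuple[int, int]]:
    """Return the (old_pods, new_pods) counts after each rollout step."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: no pod could ever be replaced.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")

    old, new = replicas, 0
    history = [(old, new)]
    while old > 0 or new < replicas:
        # Surge: create new pods without exceeding replicas + max_surge in total.
        to_create = max(0, min(replicas + max_surge - (old + new), replicas - new))
        if to_create:
            new += to_create  # assumption: the pod becomes ready right away
            history.append((old, new))
        # Scale down: keep at least replicas - max_unavailable pods serving.
        to_remove = max(0, min(old, (old + new) - (replicas - max_unavailable)))
        if to_remove:
            old -= to_remove
            history.append((old, new))
    return history
```

With replicas=3, max_surge=1 and max_unavailable=0, the total number of pods in the returned history never drops below 3, which is the capacity guarantee this configuration is meant to provide.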
Step 3: Implement Graceful Shutdown
When Kubernetes terminates a pod, it sends a SIGTERM signal. Your application must handle this signal gracefully:
- Stop accepting new requests
- Finish processing in-flight requests
- Close database connections cleanly
- Exit the process
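# Python example
# `server` and `db` are placeholders for your web server and database client;
# stop_accepting() and wait_for_pending_requests() stand in for whatever
# drain API your framework provides.
import signal
import sys


def graceful_shutdown(signum, frame):
    print("Received SIGTERM, shutting down gracefully...")
    # Stop accepting new requests
    server.stop_accepting()
    # Wait for in-flight requests to complete
    server.wait_for_pending_requests(timeout=30)
    # Close connections
    db.close()
    sys.exit(0)


# Register the handler so SIGTERM triggers the shutdown sequence
signal.signal(signal.SIGTERM, graceful_shutdown)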
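The handler above relies on placeholder server and db objects. For a version that runs as-is, here is a minimal sketch using only Python's standard library. The names Handler, DrainingServer, install_sigterm_handler and main are invented for this example, and a production framework will usually offer its own drain mechanism. Unlike the snippet above, this sketch waits for in-flight requests without a timeout and leaves the upper bound to the pod's termination grace period.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


class DrainingServer(ThreadingHTTPServer):
    # Non-daemon handler threads are tracked by the server, so
    # server_close() waits for in-flight requests to finish.
    daemon_threads = False
    block_on_close = True


def install_sigterm_handler(server: ThreadingHTTPServer) -> None:
    def on_sigterm(signum, frame):
        # shutdown() blocks until serve_forever() returns, so it must not
        # run on the thread that is serving; hand it to a helper thread.
        threading.Thread(target=server.shutdown).start()

    signal.signal(signal.SIGTERM, on_sigterm)


def main(port: int = 8080) -> None:
    server = DrainingServer(("0.0.0.0", port), Handler)
    install_sigterm_handler(server)
    server.serve_forever()  # returns once shutdown() has been called
    server.server_close()   # closes the listening socket, joins handler threads
```

Calling main() starts the server. On SIGTERM it stops accepting new connections, lets running handlers finish, and then returns so the process can exit.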
Step 4: Add PreStop Hook
There's a race condition: Kubernetes removes the pod from the Service endpoints in parallel with sending SIGTERM, not before it. Some traffic may still arrive during this window.
The preStop hook adds a delay before shutdown, giving time for endpoint updates to propagate:
spec:
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]
This 10-second sleep gives load balancers and ingress controllers time to stop sending traffic to this pod before it begins shutdown.
Pro Tip
Set terminationGracePeriodSeconds to at least the preStop sleep plus your application's shutdown time, because the grace period also covers the time the preStop hook takes. The default is 30 seconds.
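The arithmetic behind this tip is simple enough to write down. The sketch below uses the 10-second preStop sleep and the 30-second drain timeout from this article; the helper names and the 5-second safety margin are assumptions made for illustration.

```python
def required_grace_period(
    pre_stop_sleep_s: int, app_shutdown_s: int, margin_s: int = 5
) -> int:
    """Smallest grace period covering the hook, the drain, and a margin.

    margin_s is an arbitrary illustrative buffer, not a Kubernetes setting.
    """
    return pre_stop_sleep_s + app_shutdown_s + margin_s


def is_sufficient(
    grace_period_s: int, pre_stop_sleep_s: int, app_shutdown_s: int
) -> bool:
    """True if the grace period covers the preStop sleep plus shutdown time."""
    return grace_period_s >= pre_stop_sleep_s + app_shutdown_s
```

With a 10-second sleep and a 30-second drain, the default of 30 seconds is too short, while 60 seconds leaves headroom.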
Complete Example
Here's a complete deployment configuration for zero-downtime:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels;
  # "app: my-app" is an example label
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: my-app:v2
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
Troubleshooting Common Issues
1. Errors During Deployment
Symptom: Users see 502/503 errors during deployments.
Cause: Usually missing or misconfigured readiness probes.
Fix: Ensure readiness probe returns healthy only when app is truly ready.
2. Connection Resets
Symptom: In-flight requests fail with connection reset.
Cause: App not handling SIGTERM gracefully.
Fix: Implement graceful shutdown and add preStop hook.
3. Slow Rollouts
Symptom: Deployments take too long.
Cause: Usually slow readiness probes or conservative settings.
Fix: Tune initialDelaySeconds and periodSeconds, increase maxSurge.
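To see why these settings affect rollout time, here is a back-of-the-envelope estimate. It is a deliberately crude model, not a description of the actual controller: it assumes maxUnavailable is 0, that pods are replaced in rounds of maxSurge, that the first probe fires at initialDelaySeconds and then every periodSeconds, and it ignores image pulls and scheduling. Both function names are hypothetical.

```python
import math


def time_to_ready(startup_s: int, initial_delay_s: int, period_s: int) -> int:
    """Seconds until the first probe that can succeed, under the model above."""
    if startup_s <= initial_delay_s:
        return initial_delay_s
    # Number of additional probe periods needed after the initial delay.
    extra_probes = math.ceil((startup_s - initial_delay_s) / period_s)
    return initial_delay_s + extra_probes * period_s


def estimate_rollout_seconds(
    replicas: int, max_surge: int, startup_s: int,
    initial_delay_s: int, period_s: int,
) -> int:
    """Rough rollout duration: one round per max_surge pods replaced."""
    rounds = math.ceil(replicas / max_surge)
    return rounds * time_to_ready(startup_s, initial_delay_s, period_s)
```

For an application that takes 12 seconds to start, with the probe settings used in this article, the model gives 45 seconds for 3 replicas at maxSurge 1 and 15 seconds at maxSurge 3.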
Key Takeaways
- Zero-downtime deployment is achievable with proper configuration
- Readiness probes are critical - they control traffic routing
- Use maxUnavailable: 0 to ensure capacity throughout deployment
- Implement graceful shutdown in your application
- Add preStop hook to handle the endpoint update race condition
- Test your deployment strategy before relying on it in production
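One simple way to test a rollout is to poll the service continuously while a deployment is in progress and count failed requests. The sketch below is a minimal version using only the standard library; http_ok and watch are names made up for this example, and any URL passed in is your own.

```python
import http.client
import time
import urllib.request
from typing import Callable, Dict


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Covers HTTP error statuses, refused connections, resets, timeouts.
        return False


def watch(
    probe: Callable[[], bool], attempts: int, interval_s: float = 0.2
) -> Dict[str, int]:
    """Call probe() repeatedly and tally successes and failures."""
    results = {"ok": 0, "failed": 0}
    for _ in range(attempts):
        results["ok" if probe() else "failed"] += 1
        time.sleep(interval_s)
    return results
```

For example, run watch(lambda: http_ok(url), attempts=300) against your service URL in one terminal while triggering a rollout in another. A non-zero "failed" count suggests that probes, shutdown handling, or the preStop delay need another look.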