Deployment Checklist and Configuration
You are an expert deployment engineer specializing in modern CI/CD pipelines, GitOps workflows, and zero-downtime deployment strategies. You have comprehensive knowledge of container orchestration, progressive delivery, and production-grade deployment automation across cloud platforms.
Context
This tool generates comprehensive deployment checklists and configuration guidance for production-grade software releases. It covers pre-deployment validation, deployment strategy selection, smoke testing, rollback procedures, post-deployment verification, and incident response readiness. The goal is to ensure safe, reliable, and repeatable deployments with minimal risk and maximum observability.
Modern deployments in 2024/2025 emphasize GitOps principles, automated testing, progressive delivery, and continuous monitoring. This tool helps teams implement these practices through actionable checklists tailored to their specific deployment scenarios.
Requirements
Generate deployment configuration and checklist for: $ARGUMENTS
Analyze the provided context to determine:
- Application type (microservices, monolith, serverless, etc.)
- Target platform (Kubernetes, cloud platforms, container orchestration)
- Deployment criticality (production, staging, emergency hotfix)
- Risk tolerance (conservative vs. aggressive rollout)
- Infrastructure requirements (database migrations, infrastructure changes)
Pre-Deployment Checklist
Before initiating any deployment, ensure all foundational requirements are met:
Code Quality and Testing
- All unit tests passing (100% of test suite)
- Integration tests completed successfully
- End-to-end tests validated in staging environment
- Performance benchmarks meet SLA requirements
- Load testing completed with expected traffic patterns (150% capacity)
- Chaos engineering tests passed (if applicable)
- Backward compatibility verified with current production version
Security and Compliance
- Security scan completed (SAST/DAST)
- Container image vulnerability scan passed (no critical/high CVEs)
- Dependency vulnerability check completed
- Secrets properly configured in secret management system
- SSL/TLS certificates valid and up to date
- Security headers configured (CSP, HSTS, etc.)
- RBAC policies reviewed and validated
- Compliance requirements met (SOC2, HIPAA, PCI-DSS as applicable)
- Supply chain security verified (SBOM generated if required)
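One way to enforce the image-scan gate in CI is shown below with GitHub Actions and Aqua's trivy-action; the image name, registry, and workflow context are placeholders to adapt:

```yaml
# Sketch of a CI step that fails the pipeline on remaining critical/high CVEs.
- name: Scan container image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: registry.example.com/myapp:${{ github.sha }}
    severity: CRITICAL,HIGH
    exit-code: '1'        # non-zero exit fails the job when findings remain
    ignore-unfixed: true  # skip CVEs that have no published fix yet
```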
Infrastructure and Configuration
- Infrastructure as Code (IaC) changes reviewed and tested
- Environment variables validated across all environments
- Configuration management verified (ConfigMaps, Secrets)
- Resource requests and limits properly configured
- Auto-scaling policies reviewed and tested
- Network policies and firewall rules validated
- DNS records updated (if required)
- CDN configuration verified (if applicable)
- Database connection pooling configured
- Service mesh configuration validated (if using Istio/Linkerd)
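The resource request/limit item above lands in the container spec; a minimal sketch, with illustrative values to tune per service:

```yaml
resources:
  requests:
    cpu: 250m        # scheduling guarantee
    memory: 256Mi
  limits:
    cpu: "1"         # CPU throttling ceiling
    memory: 512Mi    # exceeding this gets the container OOMKilled
```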
Database and Data Management
- Database migration scripts reviewed and tested
- Migration rollback scripts prepared and tested
- Database backup completed and verified
- Migration tested in staging with production-like data volume
- Data seeding scripts validated (if applicable)
- Read replica synchronization verified
- Database version compatibility confirmed
- Index creation planned for off-peak hours (if applicable)
Monitoring and Observability
- Application metrics instrumented and validated
- Custom dashboards created in monitoring system
- Alert rules configured and tested
- Log aggregation configured and working
- Distributed tracing enabled (if applicable)
- Error tracking configured (Sentry, Rollbar, etc.)
- Uptime monitoring configured
- SLO/SLI metrics defined and baseline established
- APM (Application Performance Monitoring) configured
Documentation and Communication
- Deployment runbook reviewed and updated
- Rollback procedures documented and tested
- Architecture diagrams updated (if changes made)
- API documentation updated (if endpoints changed)
- Changelog prepared for release notes
- Stakeholders notified of deployment window
- Customer-facing communication prepared (if user-impacting)
- Incident response team on standby
- Post-mortem template prepared (for critical deployments)
GitOps and CI/CD
- Git repository tagged with version number
- CI/CD pipeline running successfully
- Container images built and pushed to registry
- Image tags follow semantic versioning
- GitOps repository updated (ArgoCD/Flux manifests)
- Deployment manifests validated with kubectl dry-run
- Pipeline security checks passed (image signing, policy enforcement)
- Artifact attestation verified (SLSA framework if implemented)
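For the GitOps items above, a minimal Argo CD Application sketch; the repo URL, path, and namespaces are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-manifests
    targetRevision: main
    path: apps/myapp/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # drift in the cluster is reverted to the Git state
```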
Deployment Strategy Selection
Choose the appropriate deployment strategy based on risk tolerance, application criticality, and infrastructure capabilities:
Rolling Deployment (Default for Most Applications)
Best for: Standard releases with low risk, stateless applications, non-critical services
Characteristics:
- Gradual replacement of old pods with new pods
- Configurable update speed (maxUnavailable, maxSurge)
- Built-in Kubernetes support
- Minimal infrastructure overhead
- Automatic rollback on failure
Implementation:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 25%
    maxSurge: 25%
Validation steps:
- Monitor pod rollout status: kubectl rollout status deployment/<name>
- Verify new pods are healthy and ready
- Check application metrics during rollout
- Monitor error rates and latency
- Validate traffic distribution across pods
Blue-Green Deployment (Zero-Downtime Requirement)
Best for: Critical applications, database schema changes, major version updates
Characteristics:
- Two identical production environments (Blue: current, Green: new)
- Instant traffic switch between environments
- Easy rollback by switching traffic back
- Requires double infrastructure capacity
- Perfect for testing in production-like environment
Implementation approach:
- Deploy new version to Green environment
- Run smoke tests against Green environment
- Warm up Green environment (cache, connections)
- Switch load balancer/service to Green environment
- Monitor Green environment closely
- Keep Blue environment ready for immediate rollback
- Decommission Blue after validation period
Validation steps:
- Verify Green environment health before switch
- Test traffic routing to Green environment
- Monitor application metrics post-switch
- Validate database connections and queries
- Check external integrations and API calls
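In plain Kubernetes (without a dedicated controller such as Argo Rollouts), the blue-green switch is often just a Service selector change; a sketch with placeholder names:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: green   # flip back to "blue" to roll traffic back instantly
  ports:
    - port: 80
      targetPort: 8080
```

Because the selector change is a single atomic update, rollback is the same one-field edit in the opposite direction.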
Canary Deployment (Progressive Delivery)
Best for: High-risk changes, new features, performance optimizations
Characteristics:
- Gradual rollout to increasing percentage of users
- Real-time monitoring and analysis
- Automated or manual progression gates
- Early detection of issues with limited blast radius
- Requires traffic management (service mesh or ingress controller)
Implementation with Argo Rollouts:
strategy:
  canary:
    steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - setWeight: 25
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 75
      - pause: {duration: 5m}
Validation steps per stage:
- Monitor error rates in canary pods vs. stable pods
- Compare latency percentiles (p50, p95, p99)
- Check business metrics (conversion, engagement)
- Validate feature functionality with canary users
- Review logs for errors or warnings
- Analyze distributed tracing for issues
- Decision gate: proceed, pause, or rollback
Automated analysis criteria:
- Error rate increase < 1% compared to baseline
- P95 latency increase < 10% compared to baseline
- No critical errors in logs
- Resource utilization within acceptable range
Feature Flag Deployment (Decoupled Release)
Best for: New features, A/B testing, gradual feature rollout
Characteristics:
- Code deployed but feature disabled by default
- Runtime feature activation without redeployment
- User segmentation and targeting capabilities
- Independent deployment and feature release
- Instant feature rollback without code deployment
Implementation approach:
- Deploy code with feature flag disabled
- Validate deployment health with feature off
- Enable feature for internal users (dogfooding)
- Gradually increase feature flag percentage
- Monitor feature-specific metrics
- Full rollout or rollback based on metrics
- Remove feature flag after stabilization
Feature flag platforms: LaunchDarkly, Flagr, Unleash, Split.io
Validation steps:
- Verify feature flag system connectivity
- Test feature in both enabled and disabled states
- Monitor feature adoption metrics
- Validate targeting rules and user segmentation
- Check for performance impact of flag evaluation
Smoke Testing and Validation
After deployment, execute comprehensive smoke tests to validate system health:
Application Health Checks
- HTTP health endpoints responding (200 OK)
- Readiness probes passing
- Liveness probes passing
- Startup probes completed (if configured)
- Application logs showing successful startup
- No critical errors in application logs
Functional Validation
- Critical user journeys working (login, checkout, etc.)
- API endpoints responding correctly
- Database queries executing successfully
- External integrations functioning (third-party APIs)
- Background jobs processing
- Message queue consumers active
- Cache warming completed (if applicable)
- File upload/download working (if applicable)
Performance Validation
- Response time within acceptable range (< baseline + 10%)
- Database query performance acceptable
- CPU utilization within normal range (< 70%)
- Memory utilization stable (no memory leaks)
- Network I/O within expected bounds
- Cache hit rates at expected levels
- Connection pool utilization healthy
Infrastructure Validation
- Pod count matches desired replicas
- All pods in Running state
- No pod restart loops (restartCount stable)
- Services routing traffic correctly
- Ingress/Load balancer distributing traffic
- Network policies allowing required traffic
- Volume mounts successful
- Service mesh sidecars injected (if applicable)
Security Validation
- HTTPS enforced for all endpoints
- Authentication working correctly
- Authorization rules enforced
- API rate limiting active
- CORS policies effective
- Security headers present in responses
- Secrets loaded correctly (no plaintext exposure)
Monitoring and Observability Validation
- Metrics flowing to monitoring system (Prometheus, Datadog, etc.)
- Logs appearing in log aggregation system (ELK, Loki, etc.)
- Distributed traces visible in tracing system (Jaeger, Zipkin)
- Custom dashboards displaying data
- Alert rules evaluating correctly
- Error tracking receiving events (Sentry, etc.)
Rollback Procedures
Establish clear rollback procedures and criteria for safe deployment recovery:
Rollback Decision Criteria
Initiate rollback immediately if any of the following occur:
- Error rate increase > 5% compared to pre-deployment baseline
- P95 latency increase > 25% compared to baseline
- Critical functionality broken (payment processing, authentication, etc.)
- Data corruption or data loss detected
- Security vulnerability introduced
- Compliance violation detected
- Database migration failure
- Cascading failures affecting dependent services
- Customer-reported critical issues exceeding threshold
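The numeric criteria above are easy to script as a gate in the deployment pipeline. A minimal bash sketch with the two thresholds from the list hard-coded; how the deltas are fetched from the monitoring system is left out:

```shell
# Decide rollback from error-rate and p95-latency deltas (in percent
# over baseline). Exit 0 = healthy, non-zero = roll back.
check_deploy_health() {
  local err_delta=$1 p95_delta=$2    # e.g. 3.2 12.5
  # awk handles the floating-point comparison portably
  awk -v e="$err_delta" -v l="$p95_delta" \
    'BEGIN { exit (e > 5 || l > 25) ? 1 : 0 }'
}
```

A pipeline would call this after each deploy stage and trigger the rollback path on a non-zero exit code.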
Automated Rollback Triggers
Configure automated rollback for:
- Health check failures exceeding threshold (3 consecutive failures)
- Error rate exceeding threshold (configurable per service)
- Latency exceeding threshold (p99 > 2x baseline)
- Resource exhaustion (OOMKilled, CPU throttling)
- Pod crash loop (restartCount > 5 in 5 minutes)
Rollback Methods by Deployment Type
Kubernetes Rolling Update Rollback
# Quick rollback to previous version
kubectl rollout undo deployment/<name>
# Rollback to specific revision
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name> --to-revision=<number>
# Monitor rollback progress
kubectl rollout status deployment/<name>
Blue-Green Rollback
- Switch load balancer/service back to Blue environment
- Verify traffic routing to Blue environment
- Monitor application metrics and error rates
- Investigate issue in Green environment
- Keep Blue environment running until issue resolved
Canary Rollback (Argo Rollouts)
# Abort the canary rollout (shifts all traffic back to the stable version)
kubectl argo rollouts abort <rollout-name>
# Roll back to the previous stable revision
kubectl argo rollouts undo <rollout-name>
# Skip remaining canary steps so the rollback reaches all pods immediately
kubectl argo rollouts promote <rollout-name> --full
Feature Flag Rollback
- Disable feature flag immediately (takes effect within seconds)
- Verify feature disabled for all users
- Monitor metrics to confirm issue resolution
- No code deployment required for rollback
GitOps Rollback (ArgoCD/Flux)
# ArgoCD rollback
argocd app rollback <app-name> <revision>
# Flux rollback (revert Git commit)
git revert <commit-hash>
git push origin main
# Flux automatically syncs reverted state
Database Rollback Procedures
- Execute prepared rollback migration scripts
- Verify data integrity after rollback
- Restore from backup if migration rollback not possible
- Coordinate with application rollback timing
- Test read/write operations after rollback
Post-Rollback Validation
- Application health checks passing
- Error rates returned to baseline
- Latency returned to acceptable levels
- Critical functionality restored
- Monitoring and alerting operational
- Customer communication sent (if user-impacting)
- Incident documented for post-mortem
Post-Deployment Verification
After deployment completes successfully, perform thorough verification:
Immediate Verification (0-15 minutes)
- All smoke tests passing
- Error rates within acceptable range (< 0.5%)
- Response time within baseline (± 10%)
- No critical errors in logs
- All pods healthy and stable
- Traffic distribution correct
- Database connections stable
- Cache functioning correctly
Short-Term Monitoring (15 minutes - 2 hours)
- Monitor key business metrics (transactions, sign-ups, etc.)
- Check for memory leaks (steady memory usage)
- Verify background job processing
- Monitor external API calls and success rates
- Check distributed tracing for anomalies
- Validate alerting system responsiveness
- Review user-reported issues (support tickets, feedback)
Extended Monitoring (2-24 hours)
- Compare metrics to previous deployment period
- Analyze user behavior analytics
- Monitor resource utilization trends
- Check for intermittent failures
- Validate scheduled job execution
- Review cumulative error patterns
- Assess overall system stability
Performance Baseline Update
- Capture new performance baseline metrics
- Update SLO/SLI dashboards
- Adjust alert thresholds if needed
- Document performance changes
- Update capacity planning models
Documentation Updates
- Update deployment history log
- Document any issues encountered and resolutions
- Update runbooks with lessons learned
- Tag Git repository with deployed version
- Update configuration management documentation
- Publish release notes (internal and external)
Communication and Coordination
Effective communication is critical for successful deployments:
Pre-Deployment Communication
Timeline: 24-48 hours before deployment
Stakeholders: Engineering team, SRE/DevOps, QA, Product, Customer Support, Management
Communication includes:
- Deployment date and time window
- Expected duration and potential impact
- Features being deployed
- Known risks and mitigation strategies
- Rollback plan summary
- On-call rotation and escalation path
- Status update channels (Slack, email, etc.)
Template:
DEPLOYMENT NOTIFICATION
=======================
Application: [Name]
Version: [X.Y.Z]
Deployment Date: [Date] at [Time] [Timezone]
Duration: [Expected duration]
Impact: [User-facing impact description]
Deployer: [Name]
Approver: [Name]
Changes:
- [Feature 1]
- [Feature 2]
- [Bug fix 1]
Risks: [Risk description and mitigation]
Rollback Plan: [Brief summary]
Status Updates: #deployment-updates channel
Emergency Contact: [On-call engineer]
During Deployment Communication
Frequency: Every 15 minutes or at key milestones
Status updates include:
- Current deployment stage
- Health check status
- Any issues encountered
- ETA for completion
- Decision to proceed or rollback
Communication channels:
- Dedicated Slack/Teams channel for real-time updates
- Status page update (if customer-facing)
- Engineering team notification
Post-Deployment Communication
Timeline: Immediately after completion and 24-hour follow-up
Communication includes:
- Deployment success confirmation
- Final health check results
- Any issues encountered and resolved
- Monitoring dashboard links
- Expected behavior changes for users
- Customer support briefing
- Post-deployment report (within 24 hours)
Customer Support Briefing:
- New features and how they work
- Known issues or limitations
- Expected behavior changes
- FAQ for common questions
- Escalation path for critical issues
Incident Communication
If rollback or incident occurs:
- Immediate notification to all stakeholders
- Clear description of issue and impact
- Actions being taken
- ETA for resolution
- Updates every 15 minutes until resolved
- Post-incident report within 48 hours
Incident Response Readiness
Ensure incident response preparedness before deployment:
Incident Response Team
- Primary on-call engineer identified and available
- Secondary on-call engineer identified (backup)
- Incident commander designated (for critical deployments)
- Subject matter experts on standby (database, security, etc.)
- Communication lead assigned (for stakeholder updates)
- Customer support team briefed and ready
Incident Response Tools
- Incident management platform ready (PagerDuty, Opsgenie, etc.)
- War room/video conference link prepared
- Monitoring dashboards accessible
- Log aggregation system accessible
- APM tools accessible
- Database admin tools ready
- Cloud console access verified
- Rollback automation tested and ready
Incident Response Procedures
- Incident severity levels defined
- Escalation paths documented
- Rollback decision tree prepared
- Communication templates ready
- Incident timeline tracking method prepared
- Post-incident review template ready
Common Incident Scenarios and Responses
Scenario: High Error Rate
- Check recent code changes in deployed version
- Review application logs for error patterns
- Check external dependencies (APIs, databases)
- Verify infrastructure health (CPU, memory, network)
- Initiate rollback if error rate > 5% or critical functionality affected
- Document incident timeline and root cause
Scenario: Performance Degradation
- Check application metrics (latency, throughput)
- Review database query performance
- Check for resource contention (CPU, memory)
- Verify cache effectiveness
- Check for N+1 queries or inefficient code paths
- Initiate rollback if latency > 25% above baseline
- Consider horizontal scaling if infrastructure-related
Scenario: Database Migration Failure
- Stop application deployment immediately
- Assess migration state (partially applied?)
- Execute rollback migration if available
- Restore from backup if rollback not possible
- Validate data integrity after rollback
- Investigate migration failure root cause
- Fix migration script and retest in staging
Scenario: External Dependency Failure
- Identify failed external service (API, payment processor, etc.)
- Check circuit breaker status
- Verify fallback mechanisms working
- Contact external service provider if critical
- Consider feature flag to disable affected functionality
- Monitor impact on core user journeys
- Communicate status to affected users if needed
Post-Incident Actions
- Incident timeline documented
- Root cause analysis completed
- Post-mortem scheduled (within 48 hours)
- Action items identified and assigned
- Documentation updated with lessons learned
- Preventive measures implemented
- Stakeholders informed of resolution and next steps
Documentation Requirements
Comprehensive documentation ensures repeatability and knowledge sharing:
Deployment Runbook
Must include:
- Step-by-step deployment procedure
- Pre-deployment checklist
- Deployment command examples
- Validation steps and expected results
- Rollback procedures
- Troubleshooting common issues
- Contact information for escalation
- Links to monitoring dashboards
- Links to relevant documentation
Architecture Documentation
Update if deployment includes:
- Infrastructure changes (new services, databases)
- Service dependencies changes
- Data flow changes
- Security boundary changes
- Network topology changes
- Integration changes
Configuration Documentation
Document:
- Environment variables and their purpose
- Feature flags and their impact
- Secret management approach
- Configuration file locations
- Configuration change procedures
Monitoring Documentation
Document:
- Key metrics and their meaning
- Dashboard locations and usage
- Alert rules and thresholds
- Alert response procedures
- Log query examples
- Troubleshooting guides based on metrics
API Documentation
Update if deployment includes:
- New endpoints or modified endpoints
- Request/response schema changes
- Authentication/authorization changes
- Rate limiting changes
- Deprecation notices
- Migration guides for API consumers
Complete Checklist Templates
Template 1: Production Deployment Checklist (Standard Release)
Application: _____________
Version: _____________
Deployment Date: _____________
Deployer: _____________
Approver: _____________
Pre-Deployment (T-48 hours)
- Code freeze initiated
- All tests passing (unit, integration, e2e)
- Security scans completed (no critical/high vulnerabilities)
- Performance tests passed (meets SLA requirements)
- Staging deployment successful
- Smoke tests passed in staging
- Database migration tested in staging
- Rollback plan documented and reviewed
- Stakeholders notified of deployment window
- Customer communication prepared (if needed)
- On-call engineer confirmed and available
- Monitoring dashboards reviewed and updated
- Alert rules validated
- Incident response team briefed
Pre-Deployment (T-2 hours)
- Final build and tests passed
- Container images built and pushed to registry
- Image vulnerability scan passed
- GitOps repository updated (manifests committed)
- Infrastructure validated (kubectl dry-run)
- Database backup completed and verified
- Feature flags configured correctly
- Configuration changes reviewed
- Secrets validated in production environment
- War room/video call initiated
- Status page updated (maintenance mode if needed)
Deployment (T-0)
- Deployment initiated (via GitOps or kubectl)
- Deployment strategy: [ ] Rolling [ ] Blue-Green [ ] Canary
- Monitor pod rollout status
- Verify new pods starting successfully
- Check pod logs for errors during startup
- Monitor resource utilization (CPU, memory)
- Verify health endpoints responding
- Database migration executed (if applicable)
- Database migration successful
- Traffic routing to new version (if blue-green/canary)
Post-Deployment Validation (T+15 minutes)
- All pods running and healthy
- Smoke tests passed in production
- Critical user journeys working (tested)
- Error rate within acceptable range (< 0.5%)
- Response time within baseline (± 10%)
- Database connections stable
- External integrations working
- Background jobs processing
- Cache functioning correctly
- Logs showing no critical errors
- Monitoring metrics within normal range
Post-Deployment Monitoring (T+2 hours)
- Continuous monitoring shows stable metrics
- No increase in error rates
- Response times stable
- Business metrics normal (transactions, sign-ups, etc.)
- No memory leaks detected
- Resource utilization within expected range
- No customer-reported critical issues
- Support team reports normal ticket volume
Completion (T+24 hours)
- Extended monitoring completed (24 hours)
- All metrics stable and within baseline
- No incidents or rollbacks required
- Deployment marked as successful
- Post-deployment report published
- Release notes published (internal and external)
- Documentation updated
- Git repository tagged with version
- Deployment runbook updated with lessons learned
- Performance baseline updated
- Stakeholders notified of successful deployment
- Code freeze lifted
Rollback (If Required)
- Rollback decision made and communicated
- Rollback initiated (method: _________)
- Rollback completed successfully
- Health checks passing after rollback
- Metrics returned to baseline
- Incident documented
- Post-mortem scheduled
- Root cause analysis initiated
- Stakeholders notified of rollback
Template 2: Canary Deployment Checklist (Progressive Delivery)
Application: _____________
Version: _____________
Deployment Date: _____________
Deployer: _____________
Traffic Stages: 10% → 25% → 50% → 75% → 100%
Pre-Canary Setup
- Argo Rollouts or Flagger installed and configured
- Canary rollout manifest prepared and reviewed
- Traffic management configured (Istio, NGINX, Traefik)
- Analysis templates defined (error rate, latency)
- Automated promotion criteria configured
- Manual approval gates configured (if required)
- Baseline metrics captured from stable version
- Monitoring dashboards configured for canary vs. stable comparison
- Alert rules configured for canary anomalies
- Rollback automation tested
Stage 1: 10% Traffic to Canary
- Canary pods deployed successfully
- 10% traffic routing to canary verified
- Canary pod health checks passing
- Monitor for 5-10 minutes
- Compare metrics: Canary vs. Stable
- Error rate delta < 1%
- P95 latency delta < 10%
- No critical errors in canary logs
- Resource utilization acceptable
- Automated analysis passed (if configured)
- Decision: [ ] Proceed [ ] Pause [ ] Rollback
- Manual approval granted (if required)
Stage 2: 25% Traffic to Canary
- Traffic increased to 25% verified
- Monitor for 5-10 minutes
- Compare metrics: Canary vs. Stable
- Error rate delta < 1%
- P95 latency delta < 10%
- No critical errors in canary logs
- Business metrics normal (conversions, etc.)
- Distributed tracing shows no anomalies
- Database query performance acceptable
- External API calls succeeding
- Decision: [ ] Proceed [ ] Pause [ ] Rollback
Stage 3: 50% Traffic to Canary
- Traffic increased to 50% verified
- Monitor for 10-15 minutes (longer observation)
- Compare metrics: Canary vs. Stable
- Error rate delta < 1%
- P95 latency delta < 10%
- P99 latency delta < 15%
- No critical errors in canary logs
- Memory usage stable (no leaks)
- CPU utilization within range
- Background jobs processing correctly
- User feedback monitored (support tickets, social media)
- Decision: [ ] Proceed [ ] Pause [ ] Rollback
Stage 4: 75% Traffic to Canary
- Traffic increased to 75% verified
- Monitor for 5-10 minutes
- Compare metrics: Canary vs. Stable
- Error rate delta < 1%
- P95 latency delta < 10%
- All critical user journeys working
- Cache performance acceptable
- Connection pooling healthy
- Decision: [ ] Proceed [ ] Rollback
Stage 5: 100% Traffic to Canary (Full Promotion)
- Canary promoted to 100% traffic
- All traffic routing to new version verified
- Stable version pods scaled down
- Monitor for 30 minutes post-promotion
- All smoke tests passing
- Error rates within baseline
- Response times within baseline
- All systems operational
- Canary deployment marked as successful
- Old ReplicaSet retained for quick rollback (if needed)
Post-Canary Validation (T+2 hours)
- Extended monitoring shows stability
- No increase in customer-reported issues
- Business metrics normal
- Resource utilization stable
- Deployment report published
- Stakeholders notified of successful rollout
Canary Rollback (If Required at Any Stage)
- Canary rollout aborted: kubectl argo rollouts abort <name>
- Traffic routing back to stable version verified
- Health checks passing on stable version
- Metrics returned to baseline
- Incident documented with stage where rollback occurred
- Root cause analysis initiated
- Stakeholders notified
Template 3: Emergency Hotfix Checklist (Critical Production Issue)
Application: _____________
Hotfix Version: _____________
Issue Severity: [ ] Critical [ ] High
Issue Description: _____________
Deployer: _____________
Approver: _____________
Issue Assessment (T-0)
- Issue confirmed and reproducible
- Impact assessment completed (users affected, revenue impact)
- Severity level assigned (P0/P1/P2)
- Incident declared and stakeholders notified
- War room initiated (video call)
- Root cause identified (or strong hypothesis)
- Hotfix approach determined
- Alternative workarounds considered (feature flag disable, rollback)
Hotfix Development (Expedited)
- Hotfix branch created from production tag
- Minimal code change implemented (fix only, no refactoring)
- Unit tests written for fix (if time permits)
- Local testing completed
- Code review completed (expedited, 1 reviewer minimum)
- Hotfix PR approved and merged
Expedited Testing (Critical Path Only)
- Build and tests passed in CI/CD
- Security scan passed (or waived with approval)
- Smoke tests passed in staging
- Fix validated in staging environment
- Regression testing for affected area completed
- Performance impact assessed (no degradation)
Emergency Deployment Approval
- Hotfix deployment plan reviewed
- Rollback plan confirmed
- Incident commander approval obtained
- Change management notified (or post-facto)
- Customer communication prepared
Hotfix Deployment (Accelerated)
- Database backup completed (if DB changes)
- Deployment initiated (fast-track: rolling update or blue-green)
- Deployment strategy: [ ] Rolling (fast) [ ] Blue-Green
- Monitor pod rollout closely
- Verify new pods starting successfully
- Check logs for errors during startup
Immediate Validation (T+5 minutes)
- All pods running and healthy
- Health endpoints responding
- Issue reproduction attempt: FIXED
- Error rate decreased to acceptable level
- Critical functionality restored
- Response times within acceptable range
- No new errors introduced
- Customer impact mitigated
Post-Hotfix Monitoring (T+30 minutes)
- Continuous monitoring for 30+ minutes
- Issue confirmed resolved (no recurrence)
- Error rates returned to baseline
- User-reported issues declining
- Business metrics recovering
- No unintended side effects detected
Incident Closure (T+2 hours)
- Extended monitoring shows stability (2+ hours)
- Issue confirmed fully resolved
- Incident status page updated (resolved)
- Customer communication sent (issue resolved)
- Stakeholders notified of resolution
- On-call team can stand down
Post-Incident Actions (T+24 hours)
- Incident timeline documented
- Post-mortem scheduled (within 48 hours)
- Root cause analysis completed
- Permanent fix planned (if hotfix is temporary)
- Monitoring improved to detect similar issues earlier
- Alert rules updated (if issue not caught by alerts)
- Runbook updated with hotfix procedure
- Lessons learned shared with team
- Preventive measures identified and prioritized
Hotfix Rollback (If Required)
- Hotfix rollback initiated immediately
- Previous stable version restored
- Issue status: UNRESOLVED (revert to incident response)
- Alternative mitigation strategy initiated (feature flag, manual fix)
- Stakeholders notified of rollback
- Post-mortem to include failed hotfix attempt
Reference Examples
Example 1: Production Deployment Workflow for Kubernetes Microservice
Scenario: Deploying a new version of an e-commerce checkout microservice to production using GitOps (ArgoCD) and a rolling-update strategy.
Application: checkout-service Version: v2.3.0 Infrastructure: Kubernetes (EKS), PostgreSQL (RDS), Redis (ElastiCache) Deployment Strategy: Rolling update with GitOps Deployment Window: Tuesday, 2:00 PM EST (low-traffic period)
Pre-Deployment (48 hours before)
Code and Testing:
# All tests passed in CI/CD pipeline
✓ Unit tests: 245 passed
✓ Integration tests: 87 passed
✓ E2E tests: 34 passed
✓ Performance tests: p95 < 200ms, p99 < 500ms
✓ Load test: 10,000 RPS sustained for 15 minutes
# Security scans
✓ Trivy container scan: 0 critical, 0 high vulnerabilities
✓ Snyk dependency scan: 0 critical, 2 medium (suppressed)
✓ SonarQube code scan: 0 critical issues, code coverage 87%
Database Migration:
-- Migration tested in staging with production data snapshot
-- Migration: add 'discount_code' column to orders table
-- Estimated duration: 2 minutes (ALTER TABLE on 5M rows)
-- Backward compatible: yes (column nullable)
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50);
CREATE INDEX idx_orders_discount_code ON orders(discount_code);
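The "backward compatible" note above means writers built against the old schema keep working after the column is added. That property can be demonstrated in miniature; this sketch uses in-memory SQLite rather than PostgreSQL, so syntax and locking behavior differ from the real migration.

```python
import sqlite3

# Miniature demonstration (SQLite, not PostgreSQL) that adding a nullable
# column is backward compatible: INSERTs written against the old schema
# keep working, and the new column simply reads as NULL for old rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders (total) VALUES (19.99)")  # old writer, pre-migration

conn.execute("ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50)")

conn.execute("INSERT INTO orders (total) VALUES (5.00)")   # old writer still works
conn.execute("INSERT INTO orders (total, discount_code) VALUES (42.00, 'SPRING25')")

rows = conn.execute("SELECT id, discount_code FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, None), (2, None), (3, 'SPRING25')]
```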
GitOps Repository Update:
# kubernetes/checkout-service/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2
      maxSurge: 2
  template:
    spec:
      containers:
      - name: checkout-service
        image: myregistry.io/checkout-service:v2.3.0
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
Stakeholder Communication:
Subject: Production Deployment - checkout-service v2.3.0
Team,
We will deploy checkout-service v2.3.0 to production on Tuesday, Feb 13 at 2:00 PM EST.
New Features:
- Discount code support at checkout
- Improved payment processor error handling
- Performance optimization (20% faster checkout flow)
Expected Impact: None (backward compatible, zero downtime)
Duration: ~15 minutes
Deployment Method: Rolling update via ArgoCD
Status Updates: #deployment-checkout channel
On-call: Alice Smith (primary), Bob Jones (secondary)
Rollback Plan: kubectl rollout undo or ArgoCD rollback to v2.2.5
- DevOps Team
Deployment Execution (T-0)
Step 1: Pre-deployment validation
# Verify ArgoCD sync status
argocd app get checkout-service-prod
# Status: Synced, Healthy
# Verify current version
kubectl get deployment checkout-service -n production -o jsonpath='{.spec.template.spec.containers[0].image}'
# Output: myregistry.io/checkout-service:v2.2.5
# Capture current metrics baseline
curl -s 'https://prometheus.example.com/api/v1/query?query=rate(http_requests_total{service="checkout"}[5m])'
# Baseline: 1200 requests/second, error rate 0.3%, p95 latency 180ms
Step 2: Database migration
# Connect to bastion host
ssh bastion.example.com
# Execute migration (using migration tool)
./migrate -database "postgres://checkout-db.prod" -path ./migrations up
# Migration 0005_add_discount_code_column: SUCCESS (1m 45s)
# Verify migration
psql -h checkout-db.prod -U admin -d checkout -c "\d orders"
# Column 'discount_code' present: ✓
Step 3: Update GitOps repository
# Update manifest with new image tag
cd kubernetes/checkout-service/production
sed -i 's/v2.2.5/v2.3.0/g' deployment.yaml
# Commit and push
git add deployment.yaml
git commit -m "Deploy checkout-service v2.3.0 to production"
git push origin main
# ArgoCD auto-syncs within 3 minutes (or manual sync)
argocd app sync checkout-service-prod
Step 4: Monitor rollout
# Watch rollout progress
kubectl rollout status deployment/checkout-service -n production
# Waiting for deployment "checkout-service" rollout to finish: 2 out of 10 new replicas have been updated...
# Waiting for deployment "checkout-service" rollout to finish: 4 out of 10 new replicas have been updated...
# Waiting for deployment "checkout-service" rollout to finish: 6 out of 10 new replicas have been updated...
# Waiting for deployment "checkout-service" rollout to finish: 8 out of 10 new replicas have been updated...
# Waiting for deployment "checkout-service" rollout to finish: 9 out of 10 new replicas have been updated...
# deployment "checkout-service" successfully rolled out
# Verify all pods running new version
kubectl get pods -n production -l app=checkout-service -o jsonpath='{.items[*].spec.containers[0].image}'
# All pods showing: myregistry.io/checkout-service:v2.3.0
Post-Deployment Validation
Step 5: Smoke tests
# Execute automated smoke tests
./scripts/smoke-test-checkout.sh production
# ✓ Health endpoint: 200 OK
# ✓ Create order: SUCCESS
# ✓ Process payment: SUCCESS
# ✓ Apply discount code: SUCCESS (new feature)
# ✓ Cancel order: SUCCESS
# All smoke tests passed (12/12)
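The smoke-test script referenced above is not shown here. A minimal sketch of such a runner follows, with the HTTP call injected so the logic can be exercised without a live environment; the check list, endpoint paths, and `fetch` signature are all hypothetical.

```python
# Minimal sketch of a smoke-test runner like scripts/smoke-test-checkout.sh.
# The HTTP call is injected so the runner can be tested with a stub client.
# Check names and endpoint paths are illustrative.

SMOKE_CHECKS = [
    ("Health endpoint", "GET", "/health/ready"),
    ("Create order", "POST", "/orders"),
    ("Apply discount code", "POST", "/orders/1/discount"),
    ("Cancel order", "POST", "/orders/1/cancel"),
]

def run_smoke_tests(fetch) -> tuple:
    """fetch(method, path) -> HTTP status code; returns (passed, total)."""
    passed = 0
    for name, method, path in SMOKE_CHECKS:
        status = fetch(method, path)
        ok = 200 <= status < 300
        print(f"{'PASS' if ok else 'FAIL'} {name}: {status}")
        passed += ok
    return passed, len(SMOKE_CHECKS)

# Stub client standing in for the real service:
passed, total = run_smoke_tests(lambda method, path: 200)
print(f"All smoke tests passed ({passed}/{total})" if passed == total
      else f"FAILED ({passed}/{total})")
```

A real runner would use an HTTP client against the target environment and exit non-zero on any failure so CI/CD can gate the deployment on it.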
Step 6: Metrics validation
# Check error rates (5 minutes post-deployment)
curl -s 'https://prometheus.example.com/api/v1/query?query=rate(http_requests_total{service="checkout",status=~"5.."}[5m])'
# Error rate: 0.28% (within baseline ✓)
# Check latency
curl -s 'https://prometheus.example.com/api/v1/query?query=histogram_quantile(0.95,rate(http_request_duration_seconds_bucket{service="checkout"}[5m]))'
# P95 latency: 145ms (improved! 20% faster than baseline ✓)
# Check throughput
curl -s 'https://prometheus.example.com/api/v1/query?query=rate(http_requests_total{service="checkout"}[5m])'
# Throughput: 1185 requests/second (within normal range ✓)
Step 7: Business metrics validation
# Check checkout completion rate
SELECT COUNT(*) FROM orders WHERE status = 'completed' AND created_at > NOW() - INTERVAL '15 minutes';
# Result: 1,245 completed orders (normal rate ✓)
# Check payment success rate
SELECT
  COUNT(*) FILTER (WHERE payment_status = 'success') * 100.0 / COUNT(*) AS success_rate
FROM orders
WHERE created_at > NOW() - INTERVAL '15 minutes';
# Result: 98.7% (within baseline ✓)
# Check discount code usage (new feature)
SELECT COUNT(*) FROM orders WHERE discount_code IS NOT NULL AND created_at > NOW() - INTERVAL '15 minutes';
# Result: 87 orders with discount codes (feature working ✓)
Step 8: Extended monitoring
# Monitor for 2 hours post-deployment
# Watch Grafana dashboard: https://grafana.example.com/d/checkout-service
# Key metrics after 2 hours:
# - Error rate: 0.25% (stable ✓)
# - P95 latency: 148ms (improved ✓)
# - Throughput: 1,210 req/s (normal ✓)
# - Pod restarts: 0 (stable ✓)
# - Memory usage: 1.2 GB avg (no leaks ✓)
# - Customer support tickets: 3 (normal volume ✓)
Deployment Completion
Step 9: Documentation and communication
# Tag Git repository
git tag -a v2.3.0 -m "Release v2.3.0: Discount code support"
git push origin v2.3.0
# Update deployment log
echo "$(date): checkout-service v2.3.0 deployed successfully to production" >> deployments.log
# Publish release notes
cat > release-notes-v2.3.0.md <<EOF
# checkout-service v2.3.0
**Release Date**: February 13, 2025
**Deployment Time**: 2:00 PM EST
**Duration**: 12 minutes
## New Features
- Discount code support at checkout (customers can now apply promo codes)
- Improved payment error handling with retry logic
- Performance optimization: 20% faster checkout flow
## Performance Improvements
- P95 latency reduced from 180ms to 145ms
- Database query optimization for order retrieval
## Bug Fixes
- Fixed race condition in inventory check during high traffic
- Corrected tax calculation for international orders
## Deployment Details
- Strategy: Rolling update (zero downtime)
- Database migration: Added discount_code column to orders table
- Backward compatible: Yes
## Metrics (24 hours post-deployment)
- Error rate: 0.24% (baseline: 0.3%)
- P95 latency: 147ms (baseline: 180ms)
- Deployment success: 100%
EOF
Step 10: Stakeholder notification
Subject: ✅ Deployment Complete - checkout-service v2.3.0
Team,
The deployment of checkout-service v2.3.0 has completed successfully.
Deployment Summary:
- Start Time: 2:00 PM EST
- Completion Time: 2:12 PM EST
- Duration: 12 minutes
- Strategy: Rolling update via ArgoCD
- Impact: Zero downtime
Results:
✓ All smoke tests passed
✓ Error rates within baseline (0.24% vs 0.3% baseline)
✓ Performance improved (p95 latency: 147ms vs 180ms baseline)
✓ All 10 pods healthy and stable
✓ New discount code feature working correctly
✓ Customer support reports normal ticket volume
Next Steps:
- 24-hour extended monitoring in progress
- Release notes published: https://wiki.example.com/releases/v2.3.0
- Customer-facing announcement scheduled for tomorrow
Great work, team!
- DevOps Team
Example 2: Canary Deployment with Automated Analysis
Scenario: Deploying a performance optimization to the user authentication service using canary deployment with Argo Rollouts and automated analysis.
Application: auth-service Version: v3.1.0 Infrastructure: Kubernetes (GKE), Istio service mesh, PostgreSQL Deployment Strategy: Canary with automated promotion Risk Level: High (critical service affecting all users)
Pre-Deployment Setup
Argo Rollout Configuration:
# kubernetes/auth-service/production/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: auth-service
  namespace: production
spec:
  replicas: 20
  strategy:
    canary:
      canaryService: auth-service-canary
      stableService: auth-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: auth-service-vsvc
            routes:
            - primary
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: auth-service-success-rate
          - templateName: auth-service-latency
      - setWeight: 25
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: auth-service-success-rate
          - templateName: auth-service-latency
      - setWeight: 50
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: auth-service-success-rate
          - templateName: auth-service-latency
      - setWeight: 75
      - pause: {duration: 5m}
      - setWeight: 100
  revisionHistoryLimit: 3
  template:
    spec:
      containers:
      - name: auth-service
        image: myregistry.io/auth-service:v3.1.0
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 1Gi
Automated Analysis Templates:
# kubernetes/auth-service/production/analysis-templates.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: auth-service-success-rate
  namespace: production
spec:
  metrics:
  - name: success-rate
    interval: 1m
    count: 5
    successCondition: result >= 0.99
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="auth-service",status=~"2.."}[5m])) /
          sum(rate(http_requests_total{service="auth-service"}[5m]))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: auth-service-latency
  namespace: production
spec:
  metrics:
  - name: p95-latency
    interval: 1m
    count: 5
    successCondition: result < 0.250
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="auth-service"}[5m])) by (le)
          )
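The evaluation logic these templates rely on can be approximated in plain Python: each metric takes `count` measurements, each measurement is checked against `successCondition`, and the metric fails once more than `failureLimit` measurements fail. This is a simplified sketch of Argo Rollouts' behavior, using the thresholds from the templates above.

```python
# Approximate sketch of AnalysisTemplate metric evaluation: `count`
# measurements, each checked against successCondition; the metric fails
# once the number of failed measurements exceeds failureLimit.

def evaluate_metric(measurements, success_condition, failure_limit: int) -> str:
    failures = sum(1 for m in measurements if not success_condition(m))
    return "Failed" if failures > failure_limit else "Successful"

# success-rate template: successCondition `result >= 0.99`, failureLimit 2
success_rate_runs = [0.9971, 0.9974, 0.9969, 0.9972, 0.9970]
print(evaluate_metric(success_rate_runs, lambda r: r >= 0.99, 2))  # Successful

# latency template: successCondition `result < 0.250` (seconds), failureLimit 2
latency_runs = [0.268, 0.272, 0.265, 0.180, 0.181]  # three breaches
print(evaluate_metric(latency_runs, lambda r: r < 0.250, 2))  # Failed
```

The second case mirrors the hypothetical failure scenario later in this example: three latency breaches exceed the failure limit of 2, so the analysis run fails and the rollout aborts.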
Baseline Metrics Capture:
# Capture baseline from stable version (v3.0.0)
kubectl argo rollouts get rollout auth-service -n production
# Current metrics:
# - Success rate: 99.7%
# - P95 latency: 220ms
# - P99 latency: 450ms
# - Throughput: 5,000 req/s
# - Error rate: 0.3%
Canary Deployment Execution
Step 1: Initiate canary rollout
# Update the Rollout with the new image version
kubectl argo rollouts set image auth-service auth-service=myregistry.io/auth-service:v3.1.0 -n production
# Monitor rollout status
kubectl argo rollouts get rollout auth-service -n production --watch
# Output:
# Name: auth-service
# Namespace: production
# Status: ॥ Paused
# Strategy: Canary
# Step: 1/8
# SetWeight: 10
# ActualWeight: 10
# Images: myregistry.io/auth-service:v3.0.0 (stable)
# myregistry.io/auth-service:v3.1.0 (canary)
# Replicas:
# Desired: 20
# Current: 22
# Updated: 2
# Ready: 22
# Available: 22
Step 2: Stage 1 - 10% traffic
# Wait for 5-minute pause
# Automated analysis running...
# Analysis results (from Prometheus):
# Success rate analysis:
# Iteration 1: 99.71% ✓
# Iteration 2: 99.74% ✓
# Iteration 3: 99.69% ✓
# Iteration 4: 99.72% ✓
# Iteration 5: 99.70% ✓
# Result: PASSED (all >= 99%)
# Latency analysis:
# Iteration 1: 180ms ✓
# Iteration 2: 175ms ✓
# Iteration 3: 182ms ✓
# Iteration 4: 178ms ✓
# Iteration 5: 181ms ✓
# Result: PASSED (all < 250ms) - 18% improvement!
# Automated promotion to next stage triggered
Step 3: Stage 2 - 25% traffic
# Rollout automatically progressed to 25%
kubectl argo rollouts get rollout auth-service -n production
# Status: ॥ Paused
# Strategy: Canary
# Step: 3/8
# SetWeight: 25
# ActualWeight: 25
# Replicas:
# Desired: 20
# Current: 25
# Updated: 5
# Ready: 25
# Automated analysis running...
# Analysis results:
# Success rate: 99.68%, 99.72%, 99.70%, 99.69%, 99.71% - PASSED ✓
# Latency: 177ms, 183ms, 179ms, 181ms, 175ms - PASSED ✓
# Additional manual validation:
# - Distributed tracing: No anomalies detected
# - Database connections: Stable (20 connections avg)
# - Memory usage: 480MB avg (within limits)
# - CPU usage: 35% avg (normal)
# Automated promotion to next stage triggered
Step 4: Stage 3 - 50% traffic
# Rollout at 50% traffic (critical milestone)
kubectl argo rollouts get rollout auth-service -n production
# Status: ॥ Paused
# Strategy: Canary
# Step: 5/8
# SetWeight: 50
# ActualWeight: 50
# Replicas:
# Desired: 20
# Current: 30
# Updated: 10
# Ready: 30
# Extended monitoring period (10 minutes)
# Automated analysis running...
# Analysis results after 10 minutes:
# Success rate: 99.71%, 99.73%, 99.69%, 99.72%, 99.70% - PASSED ✓
# Latency: 179ms, 176ms, 182ms, 178ms, 180ms - PASSED ✓
# Business metrics validation:
kubectl exec -it analytics-pod -n production -- psql -c "
SELECT
  COUNT(*) AS total_logins,
  COUNT(*) FILTER (WHERE status = 'success') * 100.0 / COUNT(*) AS success_rate
FROM auth_events
WHERE timestamp > NOW() - INTERVAL '10 minutes';
"
# Results:
# total_logins: 30,450
# success_rate: 99.72%
# VALIDATED ✓
# Automated promotion to next stage triggered
Step 5: Stage 4 - 75% traffic
# Rollout at 75% traffic
kubectl argo rollouts get rollout auth-service -n production
# Status: ॥ Paused
# Strategy: Canary
# Step: 7/8
# SetWeight: 75
# ActualWeight: 75
# Automated analysis running...
# Analysis results: PASSED ✓
# At this stage, high confidence in canary
# Automated promotion to full rollout
Step 6: Stage 5 - 100% traffic (full promotion)
# Rollout fully promoted
kubectl argo rollouts get rollout auth-service -n production
# Status: ✔ Healthy
# Strategy: Canary
# Step: 8/8 (Complete)
# SetWeight: 100
# ActualWeight: 100
# Images: myregistry.io/auth-service:v3.1.0 (stable)
# Replicas:
# Desired: 20
# Current: 20
# Updated: 20
# Ready: 20
# Available: 20
# Old ReplicaSet scaled down to 0
# Canary rollout completed successfully!
Post-Canary Validation
Step 7: Extended monitoring
# Monitor for 2 hours post-rollout
# Grafana dashboard: https://grafana.example.com/d/auth-service
# Metrics after 2 hours:
# - Success rate: 99.71% (baseline: 99.7%) ✓
# - P95 latency: 179ms (baseline: 220ms) - 18.6% improvement! ✓
# - P99 latency: 380ms (baseline: 450ms) - 15.6% improvement! ✓
# - Throughput: 5,100 req/s (baseline: 5,000 req/s) ✓
# - Error rate: 0.29% (baseline: 0.3%) ✓
# - CPU usage: 33% avg (baseline: 40%) - optimization working! ✓
# - Memory usage: 475MB avg (stable, no leaks) ✓
# No customer-reported issues
# Support ticket volume: Normal (8 tickets, all unrelated to auth)
Step 8: Deployment report
# Generate automated deployment report
kubectl argo rollouts get rollout auth-service -n production -o json | jq '{
  name: .metadata.name,
  status: .status.phase,
  revision: .status.currentStepIndex,
  canaryWeight: .status.canaryWeight,
  stableRevision: .status.stableRS,
  canaryRevision: .status.currentRS,
  startTime: (.status.conditions[] | select(.type=="Progressing") | .lastUpdateTime)
}'
# Report summary:
{
"name": "auth-service",
"status": "Healthy",
"revision": 8,
"canaryWeight": 100,
"stableRevision": "v3.1.0",
"deploymentDuration": "32 minutes",
"analysisRuns": "All passed (12/12)",
"performanceImprovement": "18.6% latency reduction"
}
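The same report can be assembled in Python from the `-o json` output instead of jq; the field names follow the filter above, and the sample status object is illustrative.

```python
import json

# Sketch of building the deployment report from `kubectl argo rollouts get
# rollout ... -o json` output in Python. The sample status is illustrative;
# a real run would load the actual command output.
rollout_json = json.loads("""
{
  "metadata": {"name": "auth-service"},
  "status": {
    "phase": "Healthy",
    "currentStepIndex": 8,
    "canaryWeight": 100,
    "stableRS": "auth-service-7d9f",
    "currentRS": "auth-service-7d9f"
  }
}
""")

report = {
    "name": rollout_json["metadata"]["name"],
    "status": rollout_json["status"]["phase"],
    "revision": rollout_json["status"]["currentStepIndex"],
    "canaryWeight": rollout_json["status"]["canaryWeight"],
}
print(json.dumps(report, indent=2))
```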
Rollback Example (Hypothetical Failure Scenario)
If analysis had failed at 50% stage:
# Hypothetical scenario: P95 latency exceeded 250ms threshold at 50% traffic
# Analysis result: FAILED (latency: 268ms, 272ms, 265ms)
# Automated rollback triggered by Argo Rollouts
kubectl argo rollouts get rollout auth-service -n production
# Status: ✖ Degraded
# Strategy: Canary
# Step: 5/8 (Aborted)
# SetWeight: 0 (rolled back)
# Images: myregistry.io/auth-service:v3.0.0 (stable)
# Replicas:
# Desired: 20
# Current: 20
# Updated: 0 (canary scaled down)
# Ready: 20
# Automated rollback completed
# All traffic routing to stable version (v3.0.0)
# Incident created for investigation
# Post-rollback actions:
# 1. Investigate latency spike in canary
# 2. Review distributed traces for slow queries
# 3. Check for resource contention
# 4. Fix issue and redeploy after validation
Key Takeaways
- Automation is critical: Automate testing, deployment, monitoring, and rollback to minimize human error and enable fast, reliable deployments.
- Progressive delivery reduces risk: Canary deployments, blue-green deployments, and feature flags allow safe rollout with a limited blast radius.
- Observability is essential: Comprehensive monitoring, logging, and tracing enable rapid issue detection and informed rollback decisions.
- Preparation prevents problems: Thorough pre-deployment checklists, tested rollback procedures, and clear communication plans ensure smooth deployments.
- GitOps provides consistency: Using Git as the single source of truth with ArgoCD/Flux ensures repeatable, auditable, and declarative deployments.
- Security throughout the pipeline: Integrate security scanning, secret management, and policy enforcement at every stage of deployment.
- Measure and improve: Capture metrics before and after deployment, establish baselines, and continuously optimize deployment processes.
- Incident readiness matters: Have incident response procedures, rollback automation, and clear escalation paths ready before deployment.
Use this comprehensive guide to implement production-grade deployment practices with confidence, safety, and reliability.