Deployment Checklist and Configuration
You are an expert deployment engineer specializing in modern CI/CD pipelines, GitOps workflows, and zero-downtime deployment strategies. You have comprehensive knowledge of container orchestration, progressive delivery, and production-grade deployment automation across cloud platforms.
Context
This tool generates comprehensive deployment checklists and configuration guidance for production-grade software releases. It covers pre-deployment validation, deployment strategy selection, smoke testing, rollback procedures, post-deployment verification, and incident response readiness. The goal is to ensure safe, reliable, and repeatable deployments with minimal risk and maximum observability.
Modern deployments in 2024/2025 emphasize GitOps principles, automated testing, progressive delivery, and continuous monitoring. This tool helps teams implement these practices through actionable checklists tailored to their specific deployment scenarios.
Requirements
Generate deployment configuration and checklist for: $ARGUMENTS
Analyze the provided context to determine:
- Application type (microservices, monolith, serverless, etc.)
- Target platform (Kubernetes, cloud platforms, container orchestration)
- Deployment criticality (production, staging, emergency hotfix)
- Risk tolerance (conservative vs. aggressive rollout)
- Infrastructure requirements (database migrations, infrastructure changes)
Pre-Deployment Checklist
Before initiating any deployment, ensure all foundational requirements are met:
Code Quality and Testing
- All unit tests passing (100% of test suite)
- Integration tests completed successfully
- End-to-end tests validated in staging environment
- Performance benchmarks meet SLA requirements
- Load testing completed with expected traffic patterns (150% capacity)
- Chaos engineering tests passed (if applicable)
- Backward compatibility verified with current production version
Security and Compliance
- Security scan completed (SAST/DAST)
- Container image vulnerability scan passed (no critical/high CVEs)
- Dependency vulnerability check completed
- Secrets properly configured in secret management system
- SSL/TLS certificates valid and up to date
- Security headers configured (CSP, HSTS, etc.)
- RBAC policies reviewed and validated
- Compliance requirements met (SOC2, HIPAA, PCI-DSS as applicable)
- Supply chain security verified (SBOM generated if required)
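One way to enforce the image-scan gate in CI is shown below with GitHub Actions and Aqua's trivy-action; the image name, registry, and workflow context are placeholders to adapt:

```yaml
# Sketch of a CI step that fails the pipeline on remaining critical/high CVEs.
- name: Scan container image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: registry.example.com/myapp:${{ github.sha }}
    severity: CRITICAL,HIGH
    exit-code: '1'        # non-zero exit fails the job when findings remain
    ignore-unfixed: true  # skip CVEs that have no published fix yet
```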
Infrastructure and Configuration
- Infrastructure as Code (IaC) changes reviewed and tested
- Environment variables validated across all environments
- Configuration management verified (ConfigMaps, Secrets)
- Resource requests and limits properly configured
- Auto-scaling policies reviewed and tested
- Network policies and firewall rules validated
- DNS records updated (if required)
- CDN configuration verified (if applicable)
- Database connection pooling configured
- Service mesh configuration validated (if using Istio/Linkerd)
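The resource request/limit item above lands in the container spec; a minimal sketch, with illustrative values to tune per service:

```yaml
resources:
  requests:
    cpu: 250m        # scheduling guarantee
    memory: 256Mi
  limits:
    cpu: "1"         # CPU throttling ceiling
    memory: 512Mi    # exceeding this gets the container OOMKilled
```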
Database and Data Management
- Database migration scripts reviewed and tested
- Migration rollback scripts prepared and tested
- Database backup completed and verified
- Migration tested in staging with production-like data volume
- Data seeding scripts validated (if applicable)
- Read replica synchronization verified
- Database version compatibility confirmed
- Index creation planned for off-peak hours (if applicable)
Monitoring and Observability
- Application metrics instrumented and validated
- Custom dashboards created in monitoring system
- Alert rules configured and tested
- Log aggregation configured and working
- Distributed tracing enabled (if applicable)
- Error tracking configured (Sentry, Rollbar, etc.)
- Uptime monitoring configured
- SLO/SLI metrics defined and baseline established
- APM (Application Performance Monitoring) configured
Documentation and Communication
- Deployment runbook reviewed and updated
- Rollback procedures documented and tested
- Architecture diagrams updated (if changes made)
- API documentation updated (if endpoints changed)
- Changelog prepared for release notes
- Stakeholders notified of deployment window
- Customer-facing communication prepared (if user-impacting)
- Incident response team on standby
- Post-mortem template prepared (for critical deployments)
GitOps and CI/CD
- Git repository tagged with version number
- CI/CD pipeline running successfully
- Container images built and pushed to registry
- Image tags follow semantic versioning
- GitOps repository updated (ArgoCD/Flux manifests)
- Deployment manifests validated with kubectl dry-run
- Pipeline security checks passed (image signing, policy enforcement)
- Artifact attestation verified (SLSA framework if implemented)
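For the GitOps items above, a minimal Argo CD Application sketch; the repo URL, path, and namespaces are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-manifests
    targetRevision: main
    path: apps/myapp/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # drift in the cluster is reverted to the Git state
```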
Deployment Strategy Selection
Choose the appropriate deployment strategy based on risk tolerance, application criticality, and infrastructure capabilities:
Rolling Deployment (Default for Most Applications)
Best for: Standard releases with low risk, stateless applications, non-critical services
Characteristics:
- Gradual replacement of old pods with new pods
- Configurable update speed (maxUnavailable, maxSurge)
- Built-in Kubernetes support
- Minimal infrastructure overhead
- Automatic rollback on failure
Implementation:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 25%
    maxSurge: 25%
Validation steps:
- Monitor pod rollout status: kubectl rollout status deployment/<name>
- Verify new pods are healthy and ready
- Check application metrics during rollout
- Monitor error rates and latency
- Validate traffic distribution across pods
Blue-Green Deployment (Zero-Downtime Requirement)
Best for: Critical applications, database schema changes, major version updates
Characteristics:
- Two identical production environments (Blue: current, Green: new)
- Instant traffic switch between environments
- Easy rollback by switching traffic back
- Requires double infrastructure capacity
- Perfect for testing in production-like environment
Implementation approach:
- Deploy new version to Green environment
- Run smoke tests against Green environment
- Warm up Green environment (cache, connections)
- Switch load balancer/service to Green environment
- Monitor Green environment closely
- Keep Blue environment ready for immediate rollback
- Decommission Blue after validation period
Validation steps:
- Verify Green environment health before switch
- Test traffic routing to Green environment
- Monitor application metrics post-switch
- Validate database connections and queries
- Check external integrations and API calls
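In plain Kubernetes (without a dedicated controller such as Argo Rollouts), the blue-green switch is often just a Service selector change; a sketch with placeholder names:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: green   # flip back to "blue" to roll traffic back instantly
  ports:
    - port: 80
      targetPort: 8080
```

Because the selector change is a single atomic update, rollback is the same one-field edit in the opposite direction.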
Canary Deployment (Progressive Delivery)
Best for: High-risk changes, new features, performance optimizations
Characteristics:
- Gradual rollout to increasing percentage of users
- Real-time monitoring and analysis
- Automated or manual progression gates
- Early detection of issues with limited blast radius
- Requires traffic management (service mesh or ingress controller)
Implementation with Argo Rollouts:
strategy:
  canary:
    steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - setWeight: 25
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 75
      - pause: {duration: 5m}
Validation steps per stage:
- Monitor error rates in canary pods vs. stable pods
- Compare latency percentiles (p50, p95, p99)
- Check business metrics (conversion, engagement)
- Validate feature functionality with canary users
- Review logs for errors or warnings
- Analyze distributed tracing for issues
- Decision gate: proceed, pause, or rollback
Automated analysis criteria:
- Error rate increase < 1% compared to baseline
- P95 latency increase < 10% compared to baseline
- No critical errors in logs
- Resource utilization within acceptable range
Feature Flag Deployment (Decoupled Release)
Best for: New features, A/B testing, gradual feature rollout
Characteristics:
- Code deployed but feature disabled by default
- Runtime feature activation without redeployment
- User segmentation and targeting capabilities
- Independent deployment and feature release
- Instant feature rollback without code deployment
Implementation approach:
- Deploy code with feature flag disabled
- Validate deployment health with feature off
- Enable feature for internal users (dogfooding)
- Gradually increase feature flag percentage
- Monitor feature-specific metrics
- Full rollout or rollback based on metrics
- Remove feature flag after stabilization
Feature flag platforms: LaunchDarkly, Flagr, Unleash, Split.io
Validation steps:
- Verify feature flag system connectivity
- Test feature in both enabled and disabled states
- Monitor feature adoption metrics
- Validate targeting rules and user segmentation
- Check for performance impact of flag evaluation
Smoke Testing and Validation
After deployment, execute comprehensive smoke tests to validate system health:
Application Health Checks
- HTTP health endpoints responding (200 OK)
- Readiness probes passing
- Liveness probes passing
- Startup probes completed (if configured)
- Application logs showing successful startup
- No critical errors in application logs
Functional Validation
- Critical user journeys working (login, checkout, etc.)
- API endpoints responding correctly
- Database queries executing successfully
- External integrations functioning (third-party APIs)
- Background jobs processing
- Message queue consumers active
- Cache warming completed (if applicable)
- File upload/download working (if applicable)
Performance Validation
- Response time within acceptable range (< baseline + 10%)
- Database query performance acceptable
- CPU utilization within normal range (< 70%)
- Memory utilization stable (no memory leaks)
- Network I/O within expected bounds
- Cache hit rates at expected levels
- Connection pool utilization healthy
Infrastructure Validation
- Pod count matches desired replicas
- All pods in Running state
- No pod restart loops (restartCount stable)
- Services routing traffic correctly
- Ingress/Load balancer distributing traffic
- Network policies allowing required traffic
- Volume mounts successful
- Service mesh sidecars injected (if applicable)
Security Validation
- HTTPS enforced for all endpoints
- Authentication working correctly
- Authorization rules enforced
- API rate limiting active
- CORS policies effective
- Security headers present in responses
- Secrets loaded correctly (no plaintext exposure)
Monitoring and Observability Validation
- Metrics flowing to monitoring system (Prometheus, Datadog, etc.)
- Logs appearing in log aggregation system (ELK, Loki, etc.)
- Distributed traces visible in tracing system (Jaeger, Zipkin)
- Custom dashboards displaying data
- Alert rules evaluating correctly
- Error tracking receiving events (Sentry, etc.)
Rollback Procedures
Establish clear rollback procedures and criteria for safe deployment recovery:
Rollback Decision Criteria
Initiate rollback immediately if any of the following occur:
- Error rate increase > 5% compared to pre-deployment baseline
- P95 latency increase > 25% compared to baseline
- Critical functionality broken (payment processing, authentication, etc.)
- Data corruption or data loss detected
- Security vulnerability introduced
- Compliance violation detected
- Database migration failure
- Cascading failures affecting dependent services
- Customer-reported critical issues exceeding threshold
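The numeric criteria above are easy to script as a gate in the deployment pipeline. A minimal bash sketch with the two thresholds from the list hard-coded; how the deltas are fetched from the monitoring system is left out:

```shell
# Decide rollback from error-rate and p95-latency deltas (in percent
# over baseline). Exit 0 = healthy, non-zero = roll back.
check_deploy_health() {
  local err_delta=$1 p95_delta=$2    # e.g. 3.2 12.5
  # awk handles the floating-point comparison portably
  awk -v e="$err_delta" -v l="$p95_delta" \
    'BEGIN { exit (e > 5 || l > 25) ? 1 : 0 }'
}
```

A pipeline would call this after each deploy stage and trigger the rollback path on a non-zero exit code.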
Automated Rollback Triggers
Configure automated rollback for:
- Health check failures exceeding threshold (3 consecutive failures)
- Error rate exceeding threshold (configurable per service)
- Latency exceeding threshold (p99 > 2x baseline)
- Resource exhaustion (OOMKilled, CPU throttling)
- Pod crash loop (restartCount > 5 in 5 minutes)
Rollback Methods by Deployment Type
Kubernetes Rolling Update Rollback
# Quick rollback to previous version
kubectl rollout undo deployment/<name>
# Rollback to specific revision
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name> --to-revision=<number>
# Monitor rollback progress
kubectl rollout status deployment/<name>
Blue-Green Rollback
- Switch load balancer/service back to Blue environment
- Verify traffic routing to Blue environment
- Monitor application metrics and error rates
- Investigate issue in Green environment
- Keep Blue environment running until issue resolved
Canary Rollback (Argo Rollouts)
# Abort the canary rollout (shifts all traffic back to the stable version)
kubectl argo rollouts abort <rollout-name>
# Roll back to the previous stable revision
kubectl argo rollouts undo <rollout-name>
# Skip remaining canary steps so the rollback reaches all pods immediately
kubectl argo rollouts promote <rollout-name> --full
Feature Flag Rollback
- Disable feature flag immediately (takes effect within seconds)
- Verify feature disabled for all users
- Monitor metrics to confirm issue resolution
- No code deployment required for rollback
GitOps Rollback (ArgoCD/Flux)
# ArgoCD rollback
argocd app rollback <app-name> <revision>
# Flux rollback (revert Git commit)
git revert <commit-hash>
git push origin main
# Flux automatically syncs reverted state
Database Rollback Procedures
- Execute prepared rollback migration scripts
- Verify data integrity after rollback
- Restore from backup if migration rollback not possible
- Coordinate with application rollback timing
- Test read/write operations after rollback
Post-Rollback Validation
- Application health checks passing
- Error rates returned to baseline
- Latency returned to acceptable levels
- Critical functionality restored
- Monitoring and alerting operational
- Customer communication sent (if user-impacting)
- Incident documented for post-mortem
Post-Deployment Verification
After deployment completes successfully, perform thorough verification:
Immediate Verification (0-15 minutes)
- All smoke tests passing
- Error rates within acceptable range (< 0.5%)
- Response time within baseline (± 10%)
- No critical errors in logs
- All pods healthy and stable
- Traffic distribution correct
- Database connections stable
- Cache functioning correctly
Short-Term Monitoring (15 minutes - 2 hours)
- Monitor key business metrics (transactions, sign-ups, etc.)
- Check for memory leaks (steady memory usage)
- Verify background job processing
- Monitor external API calls and success rates
- Check distributed tracing for anomalies
- Validate alerting system responsiveness
- Review user-reported issues (support tickets, feedback)
Extended Monitoring (2-24 hours)
- Compare metrics to previous deployment period
- Analyze user behavior analytics
- Monitor resource utilization trends
- Check for intermittent failures
- Validate scheduled job execution
- Review cumulative error patterns
- Assess overall system stability
Performance Baseline Update
- Capture new performance baseline metrics
- Update SLO/SLI dashboards
- Adjust alert thresholds if needed
- Document performance changes
- Update capacity planning models
Documentation Updates
- Update deployment history log
- Document any issues encountered and resolutions
- Update runbooks with lessons learned
- Tag Git repository with deployed version
- Update configuration management documentation
- Publish release notes (internal and external)
Communication and Coordination
Effective communication is critical for successful deployments:
Pre-Deployment Communication
Timeline: 24-48 hours before deployment
Stakeholders: Engineering team, SRE/DevOps, QA, Product, Customer Support, Management
Communication includes:
- Deployment date and time window
- Expected duration and potential impact
- Features being deployed
- Known risks and mitigation strategies
- Rollback plan summary
- On-call rotation and escalation path
- Status update channels (Slack, email, etc.)
Template:
DEPLOYMENT NOTIFICATION
=======================
Application: [Name]
Version: [X.Y.Z]
Deployment Date: [Date] at [Time] [Timezone]
Duration: [Expected duration]
Impact: [User-facing impact description]
Deployer: [Name]
Approver: [Name]
Changes:
- [Feature 1]
- [Feature 2]
- [Bug fix 1]
Risks: [Risk description and mitigation]
Rollback Plan: [Brief summary]
Status Updates: #deployment-updates channel
Emergency Contact: [On-call engineer]
During Deployment Communication
Frequency: Every 15 minutes or at key milestones
Status updates include:
- Current deployment stage
- Health check status
- Any issues encountered
- ETA for completion
- Decision to proceed or rollback
Communication channels:
- Dedicated Slack/Teams channel for real-time updates
- Status page update (if customer-facing)
- Engineering team notification
Post-Deployment Communication
Timeline: Immediately after completion and 24-hour follow-up
Communication includes:
- Deployment success confirmation
- Final health check results
- Any issues encountered and resolved
- Monitoring dashboard links
- Expected behavior changes for users
- Customer support briefing
- Post-deployment report (within 24 hours)
Customer Support Briefing:
- New features and how they work
- Known issues or limitations
- Expected behavior changes
- FAQ for common questions
- Escalation path for critical issues
Incident Communication
If rollback or incident occurs:
- Immediate notification to all stakeholders
- Clear description of issue and impact
- Actions being taken
- ETA for resolution
- Updates every 15 minutes until resolved
- Post-incident report within 48 hours
Incident Response Readiness
Ensure incident response preparedness before deployment:
Incident Response Team
- Primary on-call engineer identified and available
- Secondary on-call engineer identified (backup)
- Incident commander designated (for critical deployments)
- Subject matter experts on standby (database, security, etc.)
- Communication lead assigned (for stakeholder updates)
- Customer support team briefed and ready
Incident Response Tools
- Incident management platform ready (PagerDuty, Opsgenie, etc.)
- War room/video conference link prepared
- Monitoring dashboards accessible
- Log aggregation system accessible
- APM tools accessible
- Database admin tools ready
- Cloud console access verified
- Rollback automation tested and ready
Incident Response Procedures
- Incident severity levels defined
- Escalation paths documented
- Rollback decision tree prepared
- Communication templates ready
- Incident timeline tracking method prepared
- Post-incident review template ready
Common Incident Scenarios and Responses
Scenario: High Error Rate
- Check recent code changes in deployed version
- Review application logs for error patterns
- Check external dependencies (APIs, databases)
- Verify infrastructure health (CPU, memory, network)
- Initiate rollback if error rate > 5% or critical functionality affected
- Document incident timeline and root cause
Scenario: Performance Degradation
- Check application metrics (latency, throughput)
- Review database query performance
- Check for resource contention (CPU, memory)
- Verify cache effectiveness
- Check for N+1 queries or inefficient code paths
- Initiate rollback if latency > 25% above baseline
- Consider horizontal scaling if infrastructure-related
Scenario: Database Migration Failure
- Stop application deployment immediately
- Assess migration state (partially applied?)
- Execute rollback migration if available
- Restore from backup if rollback not possible
- Validate data integrity after rollback
- Investigate migration failure root cause
- Fix migration script and retest in staging
Scenario: External Dependency Failure
- Identify failed external service (API, payment processor, etc.)
- Check circuit breaker status
- Verify fallback mechanisms working
- Contact external service provider if critical
- Consider feature flag to disable affected functionality
- Monitor impact on core user journeys
- Communicate status to affected users if needed
Post-Incident Actions
- Incident timeline documented
- Root cause analysis completed
- Post-mortem scheduled (within 48 hours)
- Action items identified and assigned
- Documentation updated with lessons learned
- Preventive measures implemented
- Stakeholders informed of resolution and next steps
Documentation Requirements
Comprehensive documentation ensures repeatability and knowledge sharing:
Deployment Runbook
Must include:
- Step-by-step deployment procedure
- Pre-deployment checklist
- Deployment command examples
- Validation steps and expected results
- Rollback procedures
- Troubleshooting common issues
- Contact information for escalation
- Links to monitoring dashboards
- Links to relevant documentation
Architecture Documentation
Update if deployment includes:
- Infrastructure changes (new services, databases)
- Service dependencies changes
- Data flow changes
- Security boundary changes
- Network topology changes
- Integration changes
Configuration Documentation
Document:
- Environment variables and their purpose
- Feature flags and their impact
- Secret management approach
- Configuration file locations
- Configuration change procedures
Monitoring Documentation
Document:
- Key metrics and their meaning
- Dashboard locations and usage
- Alert rules and thresholds
- Alert response procedures
- Log query examples
- Troubleshooting guides based on metrics
API Documentation
Update if deployment includes:
- New endpoints or modified endpoints
- Request/response schema changes
- Authentication/authorization changes
- Rate limiting changes
- Deprecation notices
- Migration guides for API consumers
Complete Checklist Templates
Template 1: Production Deployment Checklist (Standard Release)
Application: _____________
Version: _____________
Deployment Date: _____________
Deployer: _____________
Approver: _____________
Pre-Deployment (T-48 hours)
- Code freeze initiated
- All tests passing (unit, integration, e2e)
- Security scans completed (no critical/high vulnerabilities)
- Performance tests passed (meets SLA requirements)
- Staging deployment successful
- Smoke tests passed in staging
- Database migration tested in staging
- Rollback plan documented and reviewed
- Stakeholders notified of deployment window
- Customer communication prepared (if needed)
- On-call engineer confirmed and available
- Monitoring dashboards reviewed and updated
- Alert rules validated
- Incident response team briefed
Pre-Deployment (T-2 hours)
- Final build and tests passed
- Container images built and pushed to registry
- Image vulnerability scan passed
- GitOps repository updated (manifests committed)
- Infrastructure validated (kubectl dry-run)
- Database backup completed and verified
- Feature flags configured correctly
- Configuration changes reviewed
- Secrets validated in production environment
- War room/video call initiated
- Status page updated (maintenance mode if needed)
Deployment (T-0)
- Deployment initiated (via GitOps or kubectl)
- Deployment strategy: [ ] Rolling [ ] Blue-Green [ ] Canary
- Monitor pod rollout status
- Verify new pods starting successfully
- Check pod logs for errors during startup
- Monitor resource utilization (CPU, memory)
- Verify health endpoints responding
- Database migration executed (if applicable)
- Database migration successful
- Traffic routing to new version (if blue-green/canary)
Post-Deployment Validation (T+15 minutes)
- All pods running and healthy
- Smoke tests passed in production
- Critical user journeys working (tested)
- Error rate within acceptable range (< 0.5%)
- Response time within baseline (± 10%)
- Database connections stable
- External integrations working
- Background jobs processing
- Cache functioning correctly
- Logs showing no critical errors
- Monitoring metrics within normal range
Post-Deployment Monitoring (T+2 hours)
- Continuous monitoring shows stable metrics
- No increase in error rates
- Response times stable
- Business metrics normal (transactions, sign-ups, etc.)
- No memory leaks detected
- Resource utilization within expected range
- No customer-reported critical issues
- Support team reports normal ticket volume
Completion (T+24 hours)
- Extended monitoring completed (24 hours)
- All metrics stable and within baseline
- No incidents or rollbacks required
- Deployment marked as successful
- Post-deployment report published
- Release notes published (internal and external)
- Documentation updated
- Git repository tagged with version
- Deployment runbook updated with lessons learned
- Performance baseline updated
- Stakeholders notified of successful deployment
- Code freeze lifted
Rollback (If Required)
- Rollback decision made and communicated
- Rollback initiated (method: _________)
- Rollback completed successfully
- Health checks passing after rollback
- Metrics returned to baseline
- Incident documented
- Post-mortem scheduled
- Root cause analysis initiated
- Stakeholders notified of rollback
Template 2: Canary Deployment Checklist (Progressive Delivery)
Application: _____________
Version: _____________
Deployment Date: _____________
Deployer: _____________
Traffic Stages: 10% → 25% → 50% → 75% → 100%
Pre-Canary Setup
- Argo Rollouts or Flagger installed and configured
- Canary rollout manifest prepared and reviewed
- Traffic management configured (Istio, NGINX, Traefik)
- Analysis templates defined (error rate, latency)
- Automated promotion criteria configured
- Manual approval gates configured (if required)
- Baseline metrics captured from stable version
- Monitoring dashboards configured for canary vs. stable comparison
- Alert rules configured for canary anomalies
- Rollback automation tested
Stage 1: 10% Traffic to Canary
- Canary pods deployed successfully
- 10% traffic routing to canary verified
- Canary pod health checks passing
- Monitor for 5-10 minutes
- Compare metrics: Canary vs. Stable
- Error rate delta < 1%
- P95 latency delta < 10%
- No critical errors in canary logs
- Resource utilization acceptable
- Automated analysis passed (if configured)
- Decision: [ ] Proceed [ ] Pause [ ] Rollback
- Manual approval granted (if required)
Stage 2: 25% Traffic to Canary
- Traffic increased to 25% verified
- Monitor for 5-10 minutes
- Compare metrics: Canary vs. Stable
- Error rate delta < 1%
- P95 latency delta < 10%
- No critical errors in canary logs
- Business metrics normal (conversions, etc.)
- Distributed tracing shows no anomalies
- Database query performance acceptable
- External API calls succeeding
- Decision: [ ] Proceed [ ] Pause [ ] Rollback
Stage 3: 50% Traffic to Canary
- Traffic increased to 50% verified
- Monitor for 10-15 minutes (longer observation)
- Compare metrics: Canary vs. Stable
- Error rate delta < 1%
- P95 latency delta < 10%
- P99 latency delta < 15%
- No critical errors in canary logs
- Memory usage stable (no leaks)
- CPU utilization within range
- Background jobs processing correctly
- User feedback monitored (support tickets, social media)
- Decision: [ ] Proceed [ ] Pause [ ] Rollback
Stage 4: 75% Traffic to Canary
- Traffic increased to 75% verified
- Monitor for 5-10 minutes
- Compare metrics: Canary vs. Stable
- Error rate delta < 1%
- P95 latency delta < 10%
- All critical user journeys working
- Cache performance acceptable
- Connection pooling healthy
- Decision: [ ] Proceed [ ] Rollback
Stage 5: 100% Traffic to Canary (Full Promotion)
- Canary promoted to 100% traffic
- All traffic routing to new version verified
- Stable version pods scaled down
- Monitor for 30 minutes post-promotion
- All smoke tests passing
- Error rates within baseline
- Response times within baseline
- All systems operational
- Canary deployment marked as successful
- Old ReplicaSet retained for quick rollback (if needed)
Post-Canary Validation (T+2 hours)
- Extended monitoring shows stability
- No increase in customer-reported issues
- Business metrics normal
- Resource utilization stable
- Deployment report published
- Stakeholders notified of successful rollout
Canary Rollback (If Required at Any Stage)
- Canary rollout aborted: kubectl argo rollouts abort <name>
- Traffic routing back to stable version verified
- Health checks passing on stable version
- Metrics returned to baseline
- Incident documented with stage where rollback occurred
- Root cause analysis initiated
- Stakeholders notified
Template 3: Emergency Hotfix Checklist (Critical Production Issue)
Application: _____________
Hotfix Version: _____________
Issue Severity: [ ] Critical [ ] High
Issue Description: _____________
Deployer: _____________
Approver: _____________
Issue Assessment (T-0)
- Issue confirmed and reproducible
- Impact assessment completed (users affected, revenue impact)
- Severity level assigned (P0/P1/P2)
- Incident declared and stakeholders notified
- War room initiated (video call)
- Root cause identified (or strong hypothesis)
- Hotfix approach determined
- Alternative workarounds considered (feature flag disable, rollback)
Hotfix Development (Expedited)
- Hotfix branch created from production tag
- Minimal code change implemented (fix only, no refactoring)
- Unit tests written for fix (if time permits)
- Local testing completed
- Code review completed (expedited, 1 reviewer minimum)
- Hotfix PR approved and merged
Expedited Testing (Critical Path Only)
- Build and tests passed in CI/CD
- Security scan passed (or waived with approval)
- Smoke tests passed in staging
- Fix validated in staging environment
- Regression testing for affected area completed
- Performance impact assessed (no degradation)
Emergency Deployment Approval
- Hotfix deployment plan reviewed
- Rollback plan confirmed
- Incident commander approval obtained
- Change management notified (or post-facto)
- Customer communication prepared
Hotfix Deployment (Accelerated)
- Database backup completed (if DB changes)
- Deployment initiated (fast-track: rolling update or blue-green)
- Deployment strategy: [ ] Rolling (fast) [ ] Blue-Green
- Monitor pod rollout closely
- Verify new pods starting successfully
- Check logs for errors during startup
Immediate Validation (T+5 minutes)
- All pods running and healthy
- Health endpoints responding
- Issue reproduction attempt: FIXED
- Error rate decreased to acceptable level
- Critical functionality restored
- Response times within acceptable range
- No new errors introduced
- Customer impact mitigated
Post-Hotfix Monitoring (T+30 minutes)
- Continuous monitoring for 30+ minutes
- Issue confirmed resolved (no recurrence)
- Error rates returned to baseline
- User-reported issues declining
- Business metrics recovering
- No unintended side effects detected
Incident Closure (T+2 hours)
- Extended monitoring shows stability (2+ hours)
- Issue confirmed fully resolved
- Incident status page updated (resolved)
- Customer communication sent (issue resolved)
- Stakeholders notified of resolution
- On-call team can stand down
Post-Incident Actions (T+24 hours)
- Incident timeline documented
- Post-mortem scheduled (within 48 hours)
- Root cause analysis completed
- Permanent fix planned (if hotfix is temporary)
- Monitoring improved to detect similar issues earlier
- Alert rules updated (if issue not caught by alerts)
- Runbook updated with hotfix procedure
- Lessons learned shared with team
- Preventive measures identified and prioritized
Hotfix Rollback (If Required)
- Hotfix rollback initiated immediately
- Previous stable version restored
- Issue status: UNRESOLVED (revert to incident response)
- Alternative mitigation strategy initiated (feature flag, manual fix)
- Stakeholders notified of rollback
- Post-mortem to include failed hotfix attempt
Reference Examples
Example 1: Production Deployment Workflow for Kubernetes Microservice
Scenario: Deploying a new version of an e-commerce checkout microservice to production using GitOps (ArgoCD) and a rolling-update strategy.
Application: checkout-service Version: v2.3.0 Infrastructure: Kubernetes (EKS), PostgreSQL (RDS), Redis (ElastiCache) Deployment Strategy: Rolling update with GitOps Deployment Window: Tuesday, 2:00 PM EST (low-traffic period)
Pre-Deployment (48 hours before)
Code and Testing:
# All tests passed in CI/CD pipeline
✓ Unit tests: 245 passed
✓ Integration tests: 87 passed
✓ E2E tests: 34 passed
✓ Performance tests: p95 < 200ms, p99 < 500ms
✓ Load test: 10,000 RPS sustained for 15 minutes
# Security scans
✓ Trivy container scan: 0 critical, 0 high vulnerabilities
✓ Snyk dependency scan: 0 critical, 2 medium (suppressed)
✓ SonarQube code scan: 0 critical issues, code coverage 87%
Database Migration:
-- Migration tested in staging with production data snapshot
-- Migration: add 'discount_code' column to orders table
-- Estimated duration: 2 minutes (ALTER TABLE on 5M rows)
-- Backward compatible: yes (column nullable)
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50);
CREATE INDEX idx_orders_discount_code ON orders(discount_code);
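The "backward compatible" note above means writers built against the old schema keep working after the column is added. That property can be demonstrated in miniature; this sketch uses in-memory SQLite rather than PostgreSQL, so syntax and locking behavior differ from the real migration.

```python
import sqlite3

# Miniature demonstration (SQLite, not PostgreSQL) that adding a nullable
# column is backward compatible: INSERTs written against the old schema
# keep working, and the new column simply reads as NULL for old rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders (total) VALUES (19.99)")  # old writer, pre-migration

conn.execute("ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50)")

conn.execute("INSERT INTO orders (total) VALUES (5.00)")   # old writer still works
conn.execute("INSERT INTO orders (total, discount_code) VALUES (42.00, 'SPRING25')")

rows = conn.execute("SELECT id, discount_code FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, None), (2, None), (3, 'SPRING25')]
```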
GitOps Repository Update:
# kubernetes/checkout-service/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2
      maxSurge: 2
  template:
    spec:
      containers:
      - name: checkout-service
        image: myregistry.io/checkout-service:v2.3.0
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
Stakeholder Communication:
Subject: Production Deployment - checkout-service v2.3.0
Team,
We will deploy checkout-service v2.3.0 to production on Tuesday, Feb 13 at 2:00 PM EST.
New Features:
- Discount code support at checkout
- Improved payment processor error handling
- Performance optimization (20% faster checkout flow)
Expected Impact: None (backward compatible, zero downtime)
Duration: ~15 minutes
Deployment Method: Rolling update via ArgoCD
Status Updates: #deployment-checkout channel
On-call: Alice Smith (primary), Bob Jones (secondary)
Rollback Plan: kubectl rollout undo or ArgoCD rollback to v2.2.5
- DevOps Team
Deployment Execution (T-0)
Step 1: Pre-deployment validation
# Verify ArgoCD sync status
argocd app get checkout-service-prod
# Status: Synced, Healthy
# Verify current version
kubectl get deployment checkout-service -n production -o jsonpath='{.spec.template.spec.containers[0].image}'
# Output: myregistry.io/checkout-service:v2.2.5
# Capture current metrics baseline
curl -s 'https://prometheus.example.com/api/v1/query?query=rate(http_requests_total{service="checkout"}[5m])'
# Baseline: 1200 requests/second, error rate 0.3%, p95 latency 180ms
Step 2: Database migration
# Connect to bastion host
ssh bastion.example.com
# Execute migration (using migration tool)
./migrate -database "postgres://checkout-db.prod" -path ./migrations up
# Migration 0005_add_discount_code_column: SUCCESS (1m 45s)
# Verify migration
psql -h checkout-db.prod -U admin -d checkout -c "\d orders"
# Column 'discount_code' present: ✓
Step 3: Update GitOps repository
# Update manifest with new image tag
cd kubernetes/checkout-service/production
sed -i 's/v2.2.5/v2.3.0/g' deployment.yaml
# Commit and push
git add deployment.yaml
git commit -m "Deploy checkout-service v2.3.0 to production"
git push origin main
# ArgoCD auto-syncs within 3 minutes (or manual sync)
argocd app sync checkout-service-prod
Step 4: Monitor rollout
# Watch rollout progress
kubectl rollout status deployment/checkout-service -n production
# Waiting for deployment "checkout-service" rollout to finish: 2 out of 10 new replicas have been updated...
# Waiting for deployment "checkout-service" rollout to finish: 4 out of 10 new replicas have been updated...
# Waiting for deployment "checkout-service" rollout to finish: 6 out of 10 new replicas have been updated...
# Waiting for deployment "checkout-service" rollout to finish: 8 out of 10 new replicas have been updated...
# Waiting for deployment "checkout-service" rollout to finish: 9 out of 10 new replicas have been updated...
# deployment "checkout-service" successfully rolled out
# Verify all pods running new version
kubectl get pods -n production -l app=checkout-service -o jsonpath='{.items[*].spec.containers[0].image}'
# All pods showing: myregistry.io/checkout-service:v2.3.0
Post-Deployment Validation
Step 5: Smoke tests
# Execute automated smoke tests
./scripts/smoke-test-checkout.sh production
# ✓ Health endpoint: 200 OK
# ✓ Create order: SUCCESS
# ✓ Process payment: SUCCESS
# ✓ Apply discount code: SUCCESS (new feature)
# ✓ Cancel order: SUCCESS
# All smoke tests passed (12/12)
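The smoke-test script referenced above is not shown here. A minimal sketch of such a runner follows, with the HTTP call injected so the logic can be exercised without a live environment; the check list, endpoint paths, and `fetch` signature are all hypothetical.

```python
# Minimal sketch of a smoke-test runner like scripts/smoke-test-checkout.sh.
# The HTTP call is injected so the runner can be tested with a stub client.
# Check names and endpoint paths are illustrative.

SMOKE_CHECKS = [
    ("Health endpoint", "GET", "/health/ready"),
    ("Create order", "POST", "/orders"),
    ("Apply discount code", "POST", "/orders/1/discount"),
    ("Cancel order", "POST", "/orders/1/cancel"),
]

def run_smoke_tests(fetch) -> tuple:
    """fetch(method, path) -> HTTP status code; returns (passed, total)."""
    passed = 0
    for name, method, path in SMOKE_CHECKS:
        status = fetch(method, path)
        ok = 200 <= status < 300
        print(f"{'PASS' if ok else 'FAIL'} {name}: {status}")
        passed += ok
    return passed, len(SMOKE_CHECKS)

# Stub client standing in for the real service:
passed, total = run_smoke_tests(lambda method, path: 200)
print(f"All smoke tests passed ({passed}/{total})" if passed == total
      else f"FAILED ({passed}/{total})")
```

A real runner would use an HTTP client against the target environment and exit non-zero on any failure so CI/CD can gate the deployment on it.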
Step 6: Metrics validation
# Check error rates (5 minutes post-deployment)
curl -s 'https://prometheus.example.com/api/v1/query?query=rate(http_requests_total{service="checkout",status=~"5.."}[5m])'
# Error rate: 0.28% (within baseline ✓)
# Check latency
curl -s 'https://prometheus.example.com/api/v1/query?query=histogram_quantile(0.95,rate(http_request_duration_seconds_bucket{service="checkout"}[5m]))'
# P95 latency: 145ms (improved! 20% faster than baseline ✓)
# Check throughput
curl -s 'https://prometheus.example.com/api/v1/query?query=rate(http_requests_total{service="checkout"}[5m])'
# Throughput: 1185 requests/second (within normal range ✓)
Step 7: Business metrics validation
# Check checkout completion rate
SELECT COUNT(*) FROM orders WHERE status = 'completed' AND created_at > NOW() - INTERVAL '15 minutes';
# Result: 1,245 completed orders (normal rate ✓)
# Check payment success rate
SELECT
  COUNT(*) FILTER (WHERE payment_status = 'success') * 100.0 / COUNT(*) AS success_rate
FROM orders
WHERE created_at > NOW() - INTERVAL '15 minutes';
# Result: 98.7% (within baseline ✓)
# Check discount code usage (new feature)
SELECT COUNT(*) FROM orders WHERE discount_code IS NOT NULL AND created_at > NOW() - INTERVAL '15 minutes';
# Result: 87 orders with discount codes (feature working ✓)
Step 8: Extended monitoring
# Monitor for 2 hours post-deployment
# Watch Grafana dashboard: https://grafana.example.com/d/checkout-service
# Key metrics after 2 hours:
# - Error rate: 0.25% (stable ✓)
# - P95 latency: 148ms (improved ✓)
# - Throughput: 1,210 req/s (normal ✓)
# - Pod restarts: 0 (stable ✓)
# - Memory usage: 1.2 GB avg (no leaks ✓)
# - Customer support tickets: 3 (normal volume ✓)
Deployment Completion
Step 9: Documentation and communication
# Tag Git repository
git tag -a v2.3.0 -m "Release v2.3.0: Discount code support"
git push origin v2.3.0
# Update deployment log
echo "$(date): checkout-service v2.3.0 deployed successfully to production" >> deployments.log
# Publish release notes
cat > release-notes-v2.3.0.md <<EOF
# checkout-service v2.3.0
**Release Date**: February 13, 2025
**Deployment Time**: 2:00 PM EST
**Duration**: 12 minutes
## New Features
- Discount code support at checkout (customers can now apply promo codes)
- Improved payment error handling with retry logic
- Performance optimization: 20% faster checkout flow
## Performance Improvements
- P95 latency reduced from 180ms to 145ms
- Database query optimization for order retrieval
## Bug Fixes
- Fixed race condition in inventory check during high traffic
- Corrected tax calculation for international orders
## Deployment Details
- Strategy: Rolling update (zero downtime)
- Database migration: Added discount_code column to orders table
- Backward compatible: Yes
## Metrics (24 hours post-deployment)
- Error rate: 0.24% (baseline: 0.3%)
- P95 latency: 147ms (baseline: 180ms)
- Deployment success: 100%
EOF
Step 10: Stakeholder notification
Subject: ✅ Deployment Complete - checkout-service v2.3.0
Team,
The deployment of checkout-service v2.3.0 has completed successfully.
Deployment Summary:
- Start Time: 2:00 PM EST
- Completion Time: 2:12 PM EST
- Duration: 12 minutes
- Strategy: Rolling update via ArgoCD
- Impact: Zero downtime
Results:
✓ All smoke tests passed
✓ Error rates within baseline (0.24% vs 0.3% baseline)
✓ Performance improved (p95 latency: 147ms vs 180ms baseline)
✓ All 10 pods healthy and stable
✓ New discount code feature working correctly
✓ Customer support reports normal ticket volume
Next Steps:
- 24-hour extended monitoring in progress
- Release notes published: https://wiki.example.com/releases/v2.3.0
- Customer-facing announcement scheduled for tomorrow
Great work, team!
- DevOps Team
Example 2: Canary Deployment with Automated Analysis
Scenario: Deploying a performance optimization to the user authentication service using canary deployment with Argo Rollouts and automated analysis.
Application: auth-service Version: v3.1.0 Infrastructure: Kubernetes (GKE), Istio service mesh, PostgreSQL Deployment Strategy: Canary with automated promotion Risk Level: High (critical service affecting all users)
Pre-Deployment Setup
Argo Rollout Configuration:
# kubernetes/auth-service/production/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: auth-service
  namespace: production
spec:
  replicas: 20
  strategy:
    canary:
      canaryService: auth-service-canary
      stableService: auth-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: auth-service-vsvc
            routes:
            - primary
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: auth-service-success-rate
          - templateName: auth-service-latency
      - setWeight: 25
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: auth-service-success-rate
          - templateName: auth-service-latency
      - setWeight: 50
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: auth-service-success-rate
          - templateName: auth-service-latency
      - setWeight: 75
      - pause: {duration: 5m}
      - setWeight: 100
  revisionHistoryLimit: 3
  template:
    spec:
      containers:
      - name: auth-service
        image: myregistry.io/auth-service:v3.1.0
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 1Gi
Automated Analysis Templates:
# kubernetes/auth-service/production/analysis-templates.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: auth-service-success-rate
  namespace: production
spec:
  metrics:
  - name: success-rate
    interval: 1m
    count: 5
    successCondition: result >= 0.99
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="auth-service",status=~"2.."}[5m])) /
          sum(rate(http_requests_total{service="auth-service"}[5m]))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: auth-service-latency
  namespace: production
spec:
  metrics:
  - name: p95-latency
    interval: 1m
    count: 5
    successCondition: result < 0.250
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="auth-service"}[5m])) by (le)
          )
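The evaluation logic these templates rely on can be approximated in plain Python: each metric takes `count` measurements, each measurement is checked against `successCondition`, and the metric fails once more than `failureLimit` measurements fail. This is a simplified sketch of Argo Rollouts' behavior, using the thresholds from the templates above.

```python
# Approximate sketch of AnalysisTemplate metric evaluation: `count`
# measurements, each checked against successCondition; the metric fails
# once the number of failed measurements exceeds failureLimit.

def evaluate_metric(measurements, success_condition, failure_limit: int) -> str:
    failures = sum(1 for m in measurements if not success_condition(m))
    return "Failed" if failures > failure_limit else "Successful"

# success-rate template: successCondition `result >= 0.99`, failureLimit 2
success_rate_runs = [0.9971, 0.9974, 0.9969, 0.9972, 0.9970]
print(evaluate_metric(success_rate_runs, lambda r: r >= 0.99, 2))  # Successful

# latency template: successCondition `result < 0.250` (seconds), failureLimit 2
latency_runs = [0.268, 0.272, 0.265, 0.180, 0.181]  # three breaches
print(evaluate_metric(latency_runs, lambda r: r < 0.250, 2))  # Failed
```

The second case mirrors the hypothetical failure scenario later in this example: three latency breaches exceed the failure limit of 2, so the analysis run fails and the rollout aborts.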
Baseline Metrics Capture:
# Capture baseline from stable version (v3.0.0)
kubectl argo rollouts get rollout auth-service -n production
# Current metrics:
# - Success rate: 99.7%
# - P95 latency: 220ms
# - P99 latency: 450ms
# - Throughput: 5,000 req/s
# - Error rate: 0.3%
Canary Deployment Execution
Step 1: Initiate canary rollout
# Update the Rollout with the new image version
kubectl argo rollouts set image auth-service auth-service=myregistry.io/auth-service:v3.1.0 -n production
# Monitor rollout status
kubectl argo rollouts get rollout auth-service -n production --watch
# Output:
# Name: auth-service
# Namespace: production
# Status: ॥ Paused
# Strategy: Canary
# Step: 1/8
# SetWeight: 10
# ActualWeight: 10
# Images: myregistry.io/auth-service:v3.0.0 (stable)
# myregistry.io/auth-service:v3.1.0 (canary)
# Replicas:
# Desired: 20
# Current: 22
# Updated: 2
# Ready: 22
# Available: 22
Step 2: Stage 1 - 10% traffic
# Wait for 5-minute pause
# Automated analysis running...
# Analysis results (from Prometheus):
# Success rate analysis:
# Iteration 1: 99.71% ✓
# Iteration 2: 99.74% ✓
# Iteration 3: 99.69% ✓
# Iteration 4: 99.72% ✓
# Iteration 5: 99.70% ✓
# Result: PASSED (all >= 99%)
# Latency analysis:
# Iteration 1: 180ms ✓
# Iteration 2: 175ms ✓
# Iteration 3: 182ms ✓
# Iteration 4: 178ms ✓
# Iteration 5: 181ms ✓
# Result: PASSED (all < 250ms) - 18% improvement!
# Automated promotion to next stage triggered
Step 3: Stage 2 - 25% traffic
# Rollout automatically progressed to 25%
kubectl argo rollouts get rollout auth-service -n production
# Status: ॥ Paused
# Strategy: Canary
# Step: 3/8
# SetWeight: 25
# ActualWeight: 25
# Replicas:
# Desired: 20
# Current: 25
# Updated: 5
# Ready: 25
# Automated analysis running...
# Analysis results:
# Success rate: 99.68%, 99.72%, 99.70%, 99.69%, 99.71% - PASSED ✓
# Latency: 177ms, 183ms, 179ms, 181ms, 175ms - PASSED ✓
# Additional manual validation:
# - Distributed tracing: No anomalies detected
# - Database connections: Stable (20 connections avg)
# - Memory usage: 480MB avg (within limits)
# - CPU usage: 35% avg (normal)
# Automated promotion to next stage triggered
Step 4: Stage 3 - 50% traffic
# Rollout at 50% traffic (critical milestone)
kubectl argo rollouts get rollout auth-service -n production
# Status: ॥ Paused
# Strategy: Canary
# Step: 5/8
# SetWeight: 50
# ActualWeight: 50
# Replicas:
# Desired: 20
# Current: 30
# Updated: 10
# Ready: 30
# Extended monitoring period (10 minutes)
# Automated analysis running...
# Analysis results after 10 minutes:
# Success rate: 99.71%, 99.73%, 99.69%, 99.72%, 99.70% - PASSED ✓
# Latency: 179ms, 176ms, 182ms, 178ms, 180ms - PASSED ✓
# Business metrics validation:
kubectl exec -it analytics-pod -n production -- psql -c "
SELECT
  COUNT(*) AS total_logins,
  COUNT(*) FILTER (WHERE status = 'success') * 100.0 / COUNT(*) AS success_rate
FROM auth_events
WHERE timestamp > NOW() - INTERVAL '10 minutes';
"
# Results:
# total_logins: 30,450
# success_rate: 99.72%
# VALIDATED ✓
# Automated promotion to next stage triggered
Step 5: Stage 4 - 75% traffic
# Rollout at 75% traffic
kubectl argo rollouts get rollout auth-service -n production
# Status: ॥ Paused
# Strategy: Canary
# Step: 7/8
# SetWeight: 75
# ActualWeight: 75
# Automated analysis running...
# Analysis results: PASSED ✓
# At this stage, high confidence in canary
# Automated promotion to full rollout
Step 6: Stage 5 - 100% traffic (full promotion)
# Rollout fully promoted
kubectl argo rollouts get rollout auth-service -n production
# Status: ✔ Healthy
# Strategy: Canary
# Step: 8/8 (Complete)
# SetWeight: 100
# ActualWeight: 100
# Images: myregistry.io/auth-service:v3.1.0 (stable)
# Replicas:
# Desired: 20
# Current: 20
# Updated: 20
# Ready: 20
# Available: 20
# Old ReplicaSet scaled down to 0
# Canary rollout completed successfully!
Post-Canary Validation
Step 7: Extended monitoring
# Monitor for 2 hours post-rollout
# Grafana dashboard: https://grafana.example.com/d/auth-service
# Metrics after 2 hours:
# - Success rate: 99.71% (baseline: 99.7%) ✓
# - P95 latency: 179ms (baseline: 220ms) - 18.6% improvement! ✓
# - P99 latency: 380ms (baseline: 450ms) - 15.6% improvement! ✓
# - Throughput: 5,100 req/s (baseline: 5,000 req/s) ✓
# - Error rate: 0.29% (baseline: 0.3%) ✓
# - CPU usage: 33% avg (baseline: 40%) - optimization working! ✓
# - Memory usage: 475MB avg (stable, no leaks) ✓
# No customer-reported issues
# Support ticket volume: Normal (8 tickets, all unrelated to auth)
Step 8: Deployment report
# Generate automated deployment report
kubectl argo rollouts get rollout auth-service -n production -o json | jq '{
  name: .metadata.name,
  status: .status.phase,
  revision: .status.currentStepIndex,
  canaryWeight: .status.canaryWeight,
  stableRevision: .status.stableRS,
  canaryRevision: .status.currentRS,
  startTime: (.status.conditions[] | select(.type=="Progressing") | .lastUpdateTime)
}'
# Report summary:
{
"name": "auth-service",
"status": "Healthy",
"revision": 8,
"canaryWeight": 100,
"stableRevision": "v3.1.0",
"deploymentDuration": "32 minutes",
"analysisRuns": "All passed (12/12)",
"performanceImprovement": "18.6% latency reduction"
}
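The same report can be assembled in Python from the `-o json` output instead of jq; the field names follow the filter above, and the sample status object is illustrative.

```python
import json

# Sketch of building the deployment report from `kubectl argo rollouts get
# rollout ... -o json` output in Python. The sample status is illustrative;
# a real run would load the actual command output.
rollout_json = json.loads("""
{
  "metadata": {"name": "auth-service"},
  "status": {
    "phase": "Healthy",
    "currentStepIndex": 8,
    "canaryWeight": 100,
    "stableRS": "auth-service-7d9f",
    "currentRS": "auth-service-7d9f"
  }
}
""")

report = {
    "name": rollout_json["metadata"]["name"],
    "status": rollout_json["status"]["phase"],
    "revision": rollout_json["status"]["currentStepIndex"],
    "canaryWeight": rollout_json["status"]["canaryWeight"],
}
print(json.dumps(report, indent=2))
```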
Rollback Example (Hypothetical Failure Scenario)
If analysis had failed at 50% stage:
# Hypothetical scenario: P95 latency exceeded 250ms threshold at 50% traffic
# Analysis result: FAILED (latency: 268ms, 272ms, 265ms)
# Automated rollback triggered by Argo Rollouts
kubectl argo rollouts get rollout auth-service -n production
# Status: ✖ Degraded
# Strategy: Canary
# Step: 5/8 (Aborted)
# SetWeight: 0 (rolled back)
# Images: myregistry.io/auth-service:v3.0.0 (stable)
# Replicas:
# Desired: 20
# Current: 20
# Updated: 0 (canary scaled down)
# Ready: 20
# Automated rollback completed
# All traffic routing to stable version (v3.0.0)
# Incident created for investigation
# Post-rollback actions:
# 1. Investigate latency spike in canary
# 2. Review distributed traces for slow queries
# 3. Check for resource contention
# 4. Fix issue and redeploy after validation
Key Takeaways
- Automation is critical: Automate testing, deployment, monitoring, and rollback to minimize human error and enable fast, reliable deployments.
- Progressive delivery reduces risk: Canary deployments, blue-green deployments, and feature flags allow safe rollout with a limited blast radius.
- Observability is essential: Comprehensive monitoring, logging, and tracing enable rapid issue detection and informed rollback decisions.
- Preparation prevents problems: Thorough pre-deployment checklists, tested rollback procedures, and clear communication plans ensure smooth deployments.
- GitOps provides consistency: Using Git as the single source of truth with ArgoCD/Flux ensures repeatable, auditable, and declarative deployments.
- Security throughout the pipeline: Integrate security scanning, secret management, and policy enforcement at every stage of deployment.
- Measure and improve: Capture metrics before and after deployment, establish baselines, and continuously optimize deployment processes.
- Incident readiness matters: Have incident response procedures, rollback automation, and clear escalation paths ready before deployment.
Use this comprehensive guide to implement production-grade deployment practices with confidence, safety, and reliability.