mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
style: format all files with prettier
This commit is contained in:
@@ -20,12 +20,12 @@ Production-ready templates for incident response runbooks covering detection, tr

### 1. Incident Severity Levels

| Severity | Impact                     | Response Time     | Example                 |
| -------- | -------------------------- | ----------------- | ----------------------- |
| **SEV1** | Complete outage, data loss | 15 min            | Production down         |
| **SEV2** | Major degradation          | 30 min            | Critical feature broken |
| **SEV3** | Minor impact               | 2 hours           | Non-critical bug        |
| **SEV4** | Minimal impact             | Next business day | Cosmetic issue          |
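These targets can also be encoded for paging or reporting tooling; a minimal shell sketch of the table above (the `sla_for_severity` helper name is illustrative, not an existing tool):

```bash
#!/bin/sh
# Map a severity level to its response-time SLA, per the severity table.
# sla_for_severity is an illustrative helper for this runbook, not a standard CLI.
sla_for_severity() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Usage: `sla_for_severity SEV2` prints `30 min`.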

### 2. Runbook Structure
@@ -45,28 +45,33 @@ Production-ready templates for incident response runbooks covering detection, tr

### Template 1: Service Outage Runbook

````markdown
# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
@@ -74,6 +79,7 @@ Production-ready templates for incident response runbooks covering detection, tr

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
@@ -84,24 +90,28 @@ kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history
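The quick checks above can be scripted into a one-shot classifier; a minimal sketch, assuming curl-style status codes and the 2s latency alert threshold from the Detection section (the `classify_health` name and thresholds are illustrative):

```bash
#!/bin/sh
# Classify a health-check result from an HTTP status code and latency in ms.
# "000" is curl's code for a failed connection.
classify_health() {
  status="$1"; latency_ms="$2"
  if [ "$status" = "000" ] || [ "$status" -ge 500 ]; then
    echo down
  elif [ "$latency_ms" -gt 2000 ]; then  # mirrors the payment_latency_p99 > 2s alert
    echo degraded
  else
    echo healthy
  fi
}

# Example feed (endpoint is the one from the checklist, shown for illustration):
# code=$(curl -s -o /dev/null -w '%{http_code}' https://api.company.com/payments/health)
```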

### 3. Initial Classification

| Symptom              | Likely Cause        | Go To Section |
| -------------------- | ------------------- | ------------- |
| All requests failing | Service down        | Section 4.1   |
| High latency         | Database/dependency | Section 4.2   |
| Partial failures     | Code bug            | Section 4.3   |
| Spike in errors      | Traffic surge       | Section 4.4   |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments
@@ -123,6 +133,7 @@ kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
@@ -147,6 +158,7 @@ kubectl set env deployment/payment-service \
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
@@ -167,6 +179,7 @@ psql -h $DB_HOST -c "
```
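Identifying the dominant error pattern is a ranking exercise over the log lines; a minimal sketch that works on any stream of logs (the `ERROR <Name>` log shape is an assumption for illustration):

```bash
#!/bin/sh
# Rank error types by frequency. Reads log lines on stdin, e.g. from:
#   kubectl logs -n payments -l app=payment-service --tail=500 | rank_errors
# Assumes errors appear as "ERROR SomeErrorName" in the log line.
rank_errors() {
  grep -oE 'ERROR [A-Za-z_]+' | sort | uniq -c | sort -rn
}
```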

### 4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments
@@ -200,6 +213,7 @@ EOF
```
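When scaling for a surge, the replica math is a ceiling division of observed request rate by per-pod capacity; a small sketch (the per-pod RPS figure is illustrative, measure your own):

```bash
#!/bin/sh
# Estimate replicas needed for a surge: ceil(current_rps / per_pod_rps).
replicas_needed() {
  rps="$1"; per_pod="$2"
  echo $(( (rps + per_pod - 1) / per_pod ))
}

# e.g. 3000 RPS at ~250 RPS per pod:
# kubectl scale deployment/payment-service -n payments --replicas="$(replicas_needed 3000 250)"
```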

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq
@@ -215,6 +229,7 @@ curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(r
```
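To avoid declaring recovery on a single lucky probe, require several consecutive passing checks; a minimal sketch that reads one check result per line (the `stable_after` helper and "ok"/"fail" protocol are illustrative):

```bash
#!/bin/sh
# Declare recovery only after N consecutive passing checks, to avoid flapping.
# Reads "ok" or "fail" lines on stdin; N is the first argument.
stable_after() {
  need="$1"; streak=0
  while IFS= read -r line; do
    if [ "$line" = "ok" ]; then
      streak=$((streak + 1))
      [ "$streak" -ge "$need" ] && { echo "stable"; return 0; }
    else
      streak=0  # any failure resets the streak
    fi
  done
  echo "not stable"
  return 1
}
```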

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments
@@ -229,16 +244,17 @@ curl -X POST https://api.company.com/internal/feature-flags \

## Escalation Matrix

| Condition                     | Escalate To         | Contact             |
| ----------------------------- | ------------------- | ------------------- |
| > 15 min unresolved SEV1      | Engineering Manager | @manager (Slack)    |
| Data breach suspected         | Security Team       | #security-incidents |
| Financial impact > $10k       | Finance + Legal     | @finance-oncall     |
| Customer communication needed | Support Lead        | @support-lead       |

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

@@ -257,6 +273,7 @@ Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

@@ -276,6 +293,7 @@ ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

@@ -291,7 +309,8 @@ Follow-up:

- Postmortem scheduled for [DATE]
- Bug fix in progress
```
````

### Template 2: Database Incident Runbook

@@ -325,9 +344,10 @@ SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes';
```
````

## Replication Lag

```sql
-- Check lag on replica
SELECT
@@ -343,6 +363,7 @@ SELECT
```

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data
@@ -358,6 +379,7 @@ psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
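The disk decision above can be made mechanical; a minimal sketch that maps a usage percentage to an action (the 85%/95% thresholds and `disk_action` name are illustrative):

```bash
#!/bin/sh
# Decide on an action given disk usage as a bare percentage, e.g. "87".
disk_action() {
  pct="$1"
  if [ "$pct" -ge 95 ]; then
    echo "emergency: expand disk or delete old data"
  elif [ "$pct" -ge 85 ]; then
    echo "cleanup: archive WAL, vacuum large tables"
  else
    echo "ok"
  fi
}

# Feed from df (GNU coreutils), stripping the % sign:
# pct=$(df --output=pcent /var/lib/postgresql/data | tail -n 1 | tr -dc '0-9')
```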

## Best Practices

@@ -381,3 +403,4 @@ psql -c "VACUUM FULL large_table;"

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
@@ -20,13 +20,13 @@ Effective patterns for on-call shift transitions, ensuring continuity, context t

### 1. Handoff Components

| Component                  | Purpose                 |
| -------------------------- | ----------------------- |
| **Active Incidents**       | What's currently broken |
| **Ongoing Investigations** | Issues being debugged   |
| **Recent Changes**         | Deployments, configs    |
| **Known Issues**           | Workarounds in place    |
| **Upcoming Events**        | Maintenance, releases   |

### 2. Handoff Timing

@@ -47,7 +47,7 @@ Incoming:

### Template 1: Shift Handoff Document

````markdown
# On-Call Handoff: Platform Team

**Outgoing**: @alice (2024-01-15 to 2024-01-22)
@@ -59,6 +59,7 @@ Incoming:

## 🔴 Active Incidents

### None currently active

No active incidents at handoff time.

---
@@ -66,40 +67,48 @@ No active incidents at handoff time.

## 🟡 Ongoing Investigations

### 1. Intermittent API Timeouts (ENG-1234)

**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out

**Context**:

- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)

**Next Steps**:

- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed

**Resources**:

- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)

---

### 2. Memory Growth in Auth Service (ENG-1235)

**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)

**Context**:

- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly

**Next Steps**:

- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%

**Resources**:

- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)
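For an investigation like the memory growth above, a quick projection tells the incoming engineer how much runway remains; a minimal sketch assuming roughly linear growth in percentage points per day (a simplification for illustration, using the ~5%/day and 80% figures noted above):

```bash
#!/bin/sh
# Days until memory usage reaches a restart threshold, assuming linear growth.
# days_until <current_pct> <growth_pct_per_day> <threshold_pct>
days_until() {
  current="$1"; rate="$2"; threshold="$3"
  if [ "$current" -ge "$threshold" ]; then
    echo 0
    return
  fi
  # ceiling division: round partial days up
  echo $(( (threshold - current + rate - 1) / rate ))
}
```

Usage: `days_until 60 5 80` prints `4`, i.e. about four days before the 80% restart threshold.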
@@ -108,6 +117,7 @@ No active incidents at handoff time.

## 🟢 Resolved This Shift

### Payment Service Outage (2024-01-19)

- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
@@ -119,17 +129,20 @@ No active incidents at handoff time.

## 📋 Recent Changes

### Deployments

| Service      | Version | Time        | Notes                      |
| ------------ | ------- | ----------- | -------------------------- |
| api-gateway  | v3.2.1  | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0  | 01/20 10:00 | New profile features       |
| auth-service | v4.1.2  | 01/19 16:00 | Security patch             |

### Configuration Changes

- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75

### Infrastructure

- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0

@@ -138,11 +151,13 @@ No active incidents at handoff time.

## ⚠️ Known Issues & Workarounds

### 1. Slow Dashboard Loading

**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)

### 2. Flaky Integration Test

**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)

@@ -151,28 +166,29 @@ No active incidents at handoff time.

## 📅 Upcoming Events

| Date        | Event                | Impact              | Contact       |
| ----------- | -------------------- | ------------------- | ------------- |
| 01/23 02:00 | Database maintenance | 5 min read-only     | @dba-team     |
| 01/24 14:00 | Major release v5.0   | Monitor closely     | @release-team |
| 01/25       | Marketing campaign   | 2x traffic expected | @platform     |

---

## 📞 Escalation Reminders

| Issue Type      | First Escalation     | Second Escalation |
| --------------- | -------------------- | ----------------- |
| Payment issues  | @payments-oncall     | @payments-manager |
| Auth issues     | @auth-oncall         | @security-team    |
| Database issues | @dba-team            | @infra-manager    |
| Unknown/severe  | @engineering-manager | @vp-engineering   |

---

## 🔧 Quick Reference

### Common Commands

```bash
# Check service health
kubectl get pods -A | grep -v Running
@@ -186,8 +202,10 @@ psql -c "SELECT count(*) FROM pg_stat_activity;"

# Clear cache (emergency only)
redis-cli FLUSHDB
```
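`FLUSHDB` wipes the entire database; when only one key family is stale, a targeted delete is safer. A minimal sketch using `redis-cli`'s non-blocking `--scan` mode (the wrapper name and key pattern are illustrative):

```bash
#!/bin/sh
# Delete only keys matching a pattern, instead of flushing the whole DB.
# --scan iterates incrementally and does not block Redis the way KEYS does.
clear_keys_matching() {
  redis-cli --scan --pattern "$1" | while IFS= read -r key; do
    redis-cli DEL "$key"
  done
}

# clear_keys_matching 'session:*'
```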

### Important Links

- [Runbooks](https://wiki/runbooks)
- [Service Catalog](https://wiki/services)
- [Incident Slack](https://slack.com/incidents)
@@ -198,6 +216,7 @@ redis-cli FLUSHDB

## Handoff Checklist

### Outgoing Engineer

- [x] Document active incidents
- [x] Document ongoing investigations
- [x] List recent changes
@@ -206,13 +225,15 @@ redis-cli FLUSHDB

- [x] Sync with incoming engineer

### Incoming Engineer

- [ ] Read this document
- [ ] Join sync call
- [ ] Verify PagerDuty is routing to you
- [ ] Verify Slack notifications working
- [ ] Check VPN/access working
- [ ] Review critical dashboards

````

### Template 2: Quick Handoff (Async)

@@ -238,7 +259,7 @@ redis-cli FLUSHDB

## Questions?

I'll be available on Slack until 17:00 today.
```
````

### Template 3: Incident Handoff (Mid-Incident)

@@ -252,36 +273,43 @@ I'll be available on Slack until 17:00 today.

---

## Current State

- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min

## What We Know

1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow

## What We've Done

- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features

## What Needs to Happen

1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation

## Key People

- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)

## Communication

- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware

## Resources

- Incident channel: #inc-20240122-payment
- Dashboard: [Payment Service](https://grafana/d/payments)
- Runbook: [Payment Degradation](https://wiki/runbooks/payments)
@@ -289,6 +317,7 @@ I'll be available on Slack until 17:00 today.

---

**Incoming on-call (@bob) - Please confirm you have:**

- [ ] Joined #inc-20240122-payment
- [ ] Access to dashboards
- [ ] Understand current state

@@ -331,6 +360,7 @@ I'll be available on Slack until 17:00 today.

## Pre-Shift Checklist

### Access Verification

- [ ] VPN working
- [ ] kubectl access to all clusters
- [ ] Database read access
@@ -338,18 +368,21 @@ I'll be available on Slack until 17:00 today.

- [ ] PagerDuty app installed and logged in

### Alerting Setup

- [ ] PagerDuty schedule shows you as primary
- [ ] Phone notifications enabled
- [ ] Slack notifications for incident channels
- [ ] Test alert received and acknowledged

### Knowledge Refresh

- [ ] Review recent incidents (past 2 weeks)
- [ ] Check service changelog
- [ ] Skim critical runbooks
- [ ] Know escalation contacts

### Environment Ready

- [ ] Laptop charged and accessible
- [ ] Phone charged
- [ ] Quiet space available for calls

@@ -362,18 +395,21 @@ I'll be available on Slack until 17:00 today.

## Daily On-Call Routine

### Morning (start of day)

- [ ] Check overnight alerts
- [ ] Review dashboards for anomalies
- [ ] Check for any P0/P1 tickets created
- [ ] Skim incident channels for context

### Throughout Day

- [ ] Respond to alerts within SLA
- [ ] Document investigation progress
- [ ] Update team on significant issues
- [ ] Triage incoming pages

### End of Day

- [ ] Hand off any active issues
- [ ] Update investigation docs
- [ ] Note anything for next shift

@@ -400,18 +436,21 @@ I'll be available on Slack until 17:00 today.

## Escalation Triggers

### Immediate Escalation

- SEV1 incident declared
- Data breach suspected
- Unable to diagnose within 30 min
- Customer or legal escalation received

### Consider Escalation

- Issue spans multiple teams
- Requires expertise you don't have
- Business impact exceeds threshold
- You're uncertain about next steps

### How to Escalate

1. Page the appropriate escalation path
2. Provide brief context in Slack
3. Stay engaged until escalation acknowledges

@@ -421,6 +460,7 @@ I'll be available on Slack until 17:00 today.

## Best Practices

### Do's

- **Document everything** - Future you will thank you
- **Escalate early** - Better safe than sorry
- **Take breaks** - Alert fatigue is real
@@ -428,6 +468,7 @@ I'll be available on Slack until 17:00 today.

- **Test your setup** - Before incidents, not during

### Don'ts

- **Don't skip handoffs** - Context loss causes incidents
- **Don't hero** - Escalate when needed
- **Don't ignore alerts** - Even if they seem minor
@@ -20,13 +20,13 @@ Comprehensive guide to writing effective, blameless postmortems that drive organ

### 1. Blameless Culture

| Blame-Focused            | Blameless                         |
| ------------------------ | --------------------------------- |
| "Who caused this?"       | "What conditions allowed this?"   |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals       | Improve systems                   |
| Hide information         | Share learnings                   |
| Fear of speaking up      | Psychological safety              |

### 2. Postmortem Triggers

@@ -40,6 +40,7 @@ Comprehensive guide to writing effective, blameless postmortems that drive organ

## Quick Start

### Postmortem Timeline

```
Day 0: Incident occurs
Day 1-2: Draft postmortem document
@@ -67,6 +68,7 @@ Quarterly: Review patterns across incidents

On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.

**Impact**:

- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
@@ -74,18 +76,18 @@ On January 15, 2024, the payment processing service experienced a 47-minute outa

## Timeline (All times UTC)

| Time  | Event                                           |
| ----- | ----------------------------------------------- |
| 14:23 | Deployment v2.3.4 completed to production       |
| 14:31 | First alert: `payment_error_rate > 5%`          |
| 14:33 | On-call engineer @alice acknowledges alert      |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins              |
| 14:45 | Database connection exhaustion identified       |
| 14:52 | Decision to rollback deployment                 |
| 14:58 | Rollback to v2.3.3 initiated                    |
| 15:10 | Rollback complete, error rate dropping          |
| 15:18 | Service fully recovered, incident resolved      |
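Durations quoted in a postmortem should be derivable from the timeline; a small sketch that computes minutes between two same-day `HH:MM` timestamps (the `minutes_between` helper is illustrative):

```bash
#!/bin/sh
# Minutes between two HH:MM timestamps on the same UTC day.
# minutes_between 14:31 15:18  ->  47 (first alert to full recovery above)
minutes_between() {
  awk -v s="$1" -v e="$2" 'BEGIN {
    split(s, a, ":"); split(e, b, ":")
    print (b[1] * 60 + b[2]) - (a[1] * 60 + a[2])
  }'
}
```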

## Root Cause Analysis

@@ -111,13 +113,14 @@ The v2.3.4 deployment included a change to the database query pattern that inadv

- Why was the developer unfamiliar? → No documentation on connection management patterns

### System Diagram

```
[Client] → [Load Balancer] → [Payment Service] → [Database]
                                    ↓
                        Connection Pool (broken)
                                    ↓
                       Direct connections (cause)
```

## Detection

@@ -219,11 +222,13 @@ The deployment completed at 14:23, but the first alert didn't fire until 14:31 (

# 5 Whys Analysis: [Incident]

## Problem Statement

Payment service experienced 47-minute outage due to database connection exhaustion.

## Analysis

### Why #1: Why did the service fail?

**Answer**: Database connections were exhausted, causing all new requests to fail.

**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
|
||||
@@ -231,6 +236,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
|
||||
---
|
||||
|
||||
### Why #2: Why were database connections exhausted?
|
||||
|
||||
**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
|
||||
|
||||
**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
|
||||
@@ -238,6 +244,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
|
||||
---
|
||||
|
||||
### Why #3: Why did the code bypass the connection pool?
|
||||
|
||||
**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
|
||||
|
||||
**Evidence**: PR #1234 shows the change, made while fixing a different bug.
|
||||
@@ -245,6 +252,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
|
||||
---
|
||||
|
||||
### Why #4: Why wasn't this caught in code review?
|
||||
|
||||
**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
|
||||
|
||||
**Evidence**: Review comments only discuss business logic.
|
||||
@@ -252,6 +260,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
|
||||
---
|
||||
|
||||
### Why #5: Why isn't there a safety net for this type of change?
|
||||
|
||||
**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
|
||||
|
||||
**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.

@@ -264,12 +273,12 @@ Payment service experienced 47-minute outage due to database connection exhausti

## Systemic Improvements

| Root Cause    | Improvement                       | Type       |
| ------------- | --------------------------------- | ---------- |
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs  | Document connection patterns      | Prevention |
| Review gaps   | Update review checklist           | Detection  |
| No canary     | Implement canary deployments      | Mitigation |
```

### Template 3: Quick Postmortem (Minor Incidents)

@@ -280,9 +289,11 @@ Payment service experienced 47-minute outage due to database connection exhausti

**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3

## What Happened

API latency spiked to 5s due to cache miss storm after cache flush.

## Timeline

- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
@@ -290,13 +301,16 @@ API latency spiked to 5s due to cache miss storm after cache flush.

- 10:12 - Latency normalized

## Root Cause

Full cache flush for minor config update caused thundering herd.

## Fix

- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)

## Lessons

Don't full-flush cache in production; use targeted invalidation.
```

@@ -308,32 +322,38 @@ Don't full-flush cache in production; use targeted invalidation.

## Meeting Structure (60 minutes)

### 1. Opening (5 min)

- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms

### 2. Timeline Review (15 min)

- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline

### 3. Analysis Discussion (20 min)

- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?

### 4. Action Items (15 min)

- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates

### 5. Closing (5 min)

- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed

## Facilitation Tips

- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
@@ -343,17 +363,18 @@ Don't full-flush cache in production; use targeted invalidation.

## Anti-Patterns to Avoid

| Anti-Pattern            | Problem                    | Better Approach                 |
| ----------------------- | -------------------------- | ------------------------------- |
| **Blame game**          | Shuts down learning        | Focus on systems                |
| **Shallow analysis**    | Doesn't prevent recurrence | Ask "why" 5 times               |
| **No action items**     | Waste of time              | Always have concrete next steps |
| **Unrealistic actions** | Never completed            | Scope to achievable tasks       |
| **No follow-up**        | Actions forgotten          | Track in ticketing system       |

## Best Practices

### Do's

- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
@@ -361,6 +382,7 @@ Don't full-flush cache in production; use targeted invalidation.

- **Share widely** - Organizational learning

### Don'ts

- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning