Mirror of https://github.com/wshobson/agents.git, synced 2026-03-18 17:47:16 +00:00
style: format all files with prettier
@@ -20,13 +20,13 @@ Comprehensive guide to writing effective, blameless postmortems that drive organ
### 1. Blameless Culture

| Blame-Focused            | Blameless                         |
| ------------------------ | --------------------------------- |
| "Who caused this?"       | "What conditions allowed this?"   |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals       | Improve systems                   |
| Hide information         | Share learnings                   |
| Fear of speaking up      | Psychological safety              |

### 2. Postmortem Triggers
@@ -40,6 +40,7 @@ Comprehensive guide to writing effective, blameless postmortems that drive organ
## Quick Start

### Postmortem Timeline

```
Day 0: Incident occurs
Day 1-2: Draft postmortem document
@@ -67,6 +68,7 @@ Quarterly: Review patterns across incidents
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.

**Impact**:

- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
@@ -74,18 +76,18 @@ On January 15, 2024, the payment processing service experienced a 47-minute outa
## Timeline (All times UTC)

| Time  | Event                                           |
| ----- | ----------------------------------------------- |
| 14:23 | Deployment v2.3.4 completed to production       |
| 14:31 | First alert: `payment_error_rate > 5%`          |
| 14:33 | On-call engineer @alice acknowledges alert      |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins              |
| 14:45 | Database connection exhaustion identified       |
| 14:52 | Decision to roll back deployment                |
| 14:58 | Rollback to v2.3.3 initiated                    |
| 15:10 | Rollback complete, error rate dropping          |
| 15:18 | Service fully recovered, incident resolved      |
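
The timeline supports a few standard response metrics. The Python sketch below derives them from four of the timestamps in the table. The metric names and the choice of which rows mark each boundary are assumptions made for illustration; the template itself does not define them.

```python
from datetime import datetime

# Timestamps copied from the timeline table (all UTC, same day).
EVENTS = {
    "deployed": "14:23",
    "first_alert": "14:31",
    "acknowledged": "14:33",
    "resolved": "15:18",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Assumed definitions: detection is deploy -> first alert,
# acknowledgement is first alert -> on-call ack,
# recovery is first alert -> service fully recovered.
time_to_detect = minutes_between(EVENTS["deployed"], EVENTS["first_alert"])
time_to_acknowledge = minutes_between(EVENTS["first_alert"], EVENTS["acknowledged"])
alert_to_recovery = minutes_between(EVENTS["first_alert"], EVENTS["resolved"])

print(f"Time to detect:      {time_to_detect} min")
print(f"Time to acknowledge: {time_to_acknowledge} min")
print(f"Alert to recovery:   {alert_to_recovery} min")
```

Measured from the first alert (14:31) to full recovery (15:18), the result is 47 minutes, which matches the outage duration quoted in the summary. The 8 minutes between deployment and first alert is the detection gap.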

## Root Cause Analysis
@@ -111,13 +113,14 @@ The v2.3.4 deployment included a change to the database query pattern that inadv
- Why was developer unfamiliar? → No documentation on connection management patterns

### System Diagram

```
[Client] → [Load Balancer] → [Payment Service] → [Database]
                                     ↓
                         Connection Pool (broken)
                                     ↓
                        Direct connections (cause)
```

## Detection
@@ -219,11 +222,13 @@ The deployment completed at 14:23, but the first alert didn't fire until 14:31 (
# 5 Whys Analysis: [Incident]

## Problem Statement

Payment service experienced 47-minute outage due to database connection exhaustion.

## Analysis

### Why #1: Why did the service fail?

**Answer**: Database connections were exhausted, causing all new requests to fail.

**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
@@ -231,6 +236,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
---

### Why #2: Why were database connections exhausted?

**Answer**: Each incoming request opened a new database connection instead of using the connection pool.

**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
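
To make the failure mode in Why #2 concrete, here is a small Python simulation. It is only a sketch: `FakeDatabase`, `ConnectionPool`, the pool size of 20, and the 150 in-flight requests are hypothetical stand-ins, not the service's code (the incident's code used JDBC). It shows that opening one connection per request saturates the database once enough requests overlap, while a pool caps how many connections the service can hold.

```python
import queue


class FakeDatabase:
    """Hypothetical database server that accepts at most max_connections."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self) -> object:
        if self.open_connections >= self.max_connections:
            raise ConnectionError("too many connections")
        self.open_connections += 1
        return object()  # opaque connection handle


class ConnectionPool:
    """Opens a fixed number of connections up front and lends them out."""

    def __init__(self, db: FakeDatabase, size: int) -> None:
        self._idle: queue.Queue = queue.Queue()
        for _ in range(size):
            self._idle.put(db.connect())

    def acquire(self) -> object:
        # A real pool would block with a timeout; this sketch fails fast
        # by raising queue.Empty when no connection is idle.
        return self._idle.get_nowait()

    def release(self, conn: object) -> None:
        self._idle.put(conn)


def direct_requests(db: FakeDatabase, in_flight: int) -> int:
    """Each overlapping request opens its own connection (the bypass)."""
    failures = 0
    for _ in range(in_flight):
        try:
            db.connect()  # held for the life of the request
        except ConnectionError:
            failures += 1
    return failures


def pooled_requests(pool: ConnectionPool, in_flight: int) -> int:
    """Overlapping requests borrow from the pool; the rest must wait."""
    held, waiting = [], 0
    for _ in range(in_flight):
        try:
            held.append(pool.acquire())
        except queue.Empty:
            waiting += 1
    return waiting


direct_db = FakeDatabase(max_connections=100)
direct_failures = direct_requests(direct_db, in_flight=150)

pooled_db = FakeDatabase(max_connections=100)
pool = ConnectionPool(pooled_db, size=20)
pooled_waiting = pooled_requests(pool, in_flight=150)

print("direct:", direct_db.open_connections, "open,", direct_failures, "failed")
print("pooled:", pooled_db.open_connections, "open,", pooled_waiting, "waiting")
```

In the direct case the database reaches its connection limit and every further request fails. In the pooled case the excess requests queue inside the service and the database keeps headroom for other clients.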
@@ -238,6 +244,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
---

### Why #3: Why did the code bypass the connection pool?

**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.

**Evidence**: PR #1234 shows the change, made while fixing a different bug.
@@ -245,6 +252,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
---

### Why #4: Why wasn't this caught in code review?

**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.

**Evidence**: Review comments only discuss business logic.
@@ -252,6 +260,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
---

### Why #5: Why isn't there a safety net for this type of change?

**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.

**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
@@ -264,12 +273,12 @@ Payment service experienced 47-minute outage due to database connection exhausti
## Systemic Improvements

| Root Cause    | Improvement                       | Type       |
| ------------- | --------------------------------- | ---------- |
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs  | Document connection patterns      | Prevention |
| Review gaps   | Update review checklist           | Detection  |
| No canary     | Implement canary deployments      | Mitigation |
```
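
The "Add infrastructure behavior tests" improvement in the template above could start with a test like the following Python sketch. `CountingPool` and `PaymentRepository` are hypothetical names invented for this example. The idea is only that a test asserts every connection used by a repository call is borrowed from the pool and returned to it, so a refactor that bypasses the pool fails the suite.

```python
import unittest
from contextlib import contextmanager


class CountingPool:
    """Hypothetical pool wrapper that records how connections are used."""

    def __init__(self) -> None:
        self.acquired = 0
        self.released = 0

    @contextmanager
    def connection(self):
        self.acquired += 1
        try:
            yield object()  # stand-in for a real connection
        finally:
            self.released += 1  # runs even if the caller raises


class PaymentRepository:
    """Hypothetical repository that must obtain connections from the pool."""

    def __init__(self, pool: CountingPool) -> None:
        self._pool = pool

    def charge(self, amount_cents: int) -> bool:
        with self._pool.connection():
            if amount_cents <= 0:
                raise ValueError("amount must be positive")
            return True


class ConnectionHandlingTest(unittest.TestCase):
    def test_each_call_borrows_and_returns_one_connection(self):
        pool = CountingPool()
        repo = PaymentRepository(pool)
        for _ in range(5):
            repo.charge(1000)
        self.assertEqual(pool.acquired, 5)
        self.assertEqual(pool.released, 5)

    def test_connection_is_returned_when_the_call_fails(self):
        pool = CountingPool()
        repo = PaymentRepository(pool)
        with self.assertRaises(ValueError):
            repo.charge(0)
        self.assertEqual(pool.acquired, pool.released)


# Run the tests programmatically so the block works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConnectionHandlingTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A production version would wrap the real pool rather than a fake, but the assertion shape stays the same: acquisitions and releases must balance, including on error paths.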

### Template 3: Quick Postmortem (Minor Incidents)
@@ -280,9 +289,11 @@ Payment service experienced 47-minute outage due to database connection exhausti
**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3

## What Happened

API latency spiked to 5s due to cache miss storm after cache flush.

## Timeline

- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
@@ -290,13 +301,16 @@ API latency spiked to 5s due to cache miss storm after cache flush.
- 10:12 - Latency normalized

## Root Cause

Full cache flush for minor config update caused thundering herd.

## Fix

- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)

## Lessons

Don't full-flush cache in production; use targeted invalidation.
```
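
The lesson in the quick postmortem contrasts a full flush with targeted invalidation. The Python sketch below illustrates the difference with a toy read-through cache. `ReadThroughCache`, its `loader` callback, and the key names are hypothetical and exist only for this example. After a full flush every key misses and hits the backend again, whereas invalidating only the changed key causes a single reload.

```python
from typing import Callable, Dict, Iterable


class ReadThroughCache:
    """Toy read-through cache that counts how often the backend is hit."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader
        self._data: Dict[str, str] = {}
        self.backend_loads = 0

    def get(self, key: str) -> str:
        if key not in self._data:
            # Cache miss: fall through to the (slow) backend.
            self.backend_loads += 1
            self._data[key] = self._loader(key)
        return self._data[key]

    def flush_all(self) -> None:
        """Full flush: every subsequent read becomes a miss."""
        self._data.clear()

    def invalidate(self, key: str) -> None:
        """Targeted invalidation: only this key is reloaded."""
        self._data.pop(key, None)


def warm(cache: ReadThroughCache, keys: Iterable[str]) -> None:
    """Populate the cache, then reset the counter to measure later loads."""
    for key in keys:
        cache.get(key)
    cache.backend_loads = 0


KEYS = [f"config:{i}" for i in range(1000)]

flushed = ReadThroughCache(loader=lambda key: f"value-for-{key}")
warm(flushed, KEYS)
flushed.flush_all()
for key in KEYS:
    flushed.get(key)

targeted = ReadThroughCache(loader=lambda key: f"value-for-{key}")
warm(targeted, KEYS)
targeted.invalidate("config:42")
for key in KEYS:
    targeted.get(key)

print("Backend loads after full flush:", flushed.backend_loads)
print("Backend loads after targeted invalidation:", targeted.backend_loads)
```

The sketch is single-threaded, so it shows the volume of backend reloads rather than the concurrency of a real thundering herd, where many clients miss on the same keys at once.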
@@ -308,32 +322,38 @@ Don't full-flush cache in production; use targeted invalidation.
## Meeting Structure (60 minutes)

### 1. Opening (5 min)

- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms

### 2. Timeline Review (15 min)

- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline

### 3. Analysis Discussion (20 min)

- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?

### 4. Action Items (15 min)

- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates

### 5. Closing (5 min)

- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed

## Facilitation Tips

- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
@@ -343,17 +363,18 @@ Don't full-flush cache in production; use targeted invalidation.
## Anti-Patterns to Avoid

| Anti-Pattern            | Problem                    | Better Approach                 |
| ----------------------- | -------------------------- | ------------------------------- |
| **Blame game**          | Shuts down learning        | Focus on systems                |
| **Shallow analysis**    | Doesn't prevent recurrence | Ask "why" 5 times               |
| **No action items**     | Waste of time              | Always have concrete next steps |
| **Unrealistic actions** | Never completed            | Scope to achievable tasks       |
| **No follow-up**        | Actions forgotten          | Track in ticketing system       |

## Best Practices

### Do's

- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
@@ -361,6 +382,7 @@ Don't full-flush cache in production; use targeted invalidation.
- **Share widely** - Organizational learning

### Don'ts

- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning