mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
feat: add 5 new specialized agents with 20 skills
Add domain expert agents with comprehensive skill sets: - service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability - event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas - vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search - monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces - threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements Update all documentation to reflect correct counts: - 67 plugins, 99 agents, 107 skills, 71 commands
This commit is contained in:
374
plugins/incident-response/skills/postmortem-writing/SKILL.md
Normal file
374
plugins/incident-response/skills/postmortem-writing/SKILL.md
Normal file
@@ -0,0 +1,374 @@
|
||||
---
|
||||
name: postmortem-writing
|
||||
description: Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.
|
||||
---
|
||||
|
||||
# Postmortem Writing
|
||||
|
||||
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
- Conducting post-incident reviews
|
||||
- Writing postmortem documents
|
||||
- Facilitating blameless postmortem meetings
|
||||
- Identifying root causes and contributing factors
|
||||
- Creating actionable follow-up items
|
||||
- Building organizational learning culture
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### 1. Blameless Culture
|
||||
|
||||
| Blame-Focused | Blameless |
|
||||
|---------------|-----------|
|
||||
| "Who caused this?" | "What conditions allowed this?" |
|
||||
| "Someone made a mistake" | "The system allowed this mistake" |
|
||||
| Punish individuals | Improve systems |
|
||||
| Hide information | Share learnings |
|
||||
| Fear of speaking up | Psychological safety |
|
||||
|
||||
### 2. Postmortem Triggers
|
||||
|
||||
- SEV1 or SEV2 incidents
|
||||
- Customer-facing outages > 15 minutes
|
||||
- Data loss or security incidents
|
||||
- Near-misses that could have been severe
|
||||
- Novel failure modes
|
||||
- Incidents requiring unusual intervention
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Postmortem Timeline
|
||||
```
|
||||
Day 0: Incident occurs
|
||||
Day 1-2: Draft postmortem document
|
||||
Day 3-5: Postmortem meeting
|
||||
Day 5-7: Finalize document, create tickets
|
||||
Week 2+: Action item completion
|
||||
Quarterly: Review patterns across incidents
|
||||
```
|
||||
|
||||
## Templates
|
||||
|
||||
### Template 1: Standard Postmortem
|
||||
|
||||
```markdown
|
||||
# Postmortem: [Incident Title]
|
||||
|
||||
**Date**: 2024-01-15
|
||||
**Authors**: @alice, @bob
|
||||
**Status**: Draft | In Review | Final
|
||||
**Incident Severity**: SEV2
|
||||
**Incident Duration**: 47 minutes
|
||||
|
||||
## Executive Summary
|
||||
|
||||
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
|
||||
|
||||
**Impact**:
|
||||
- 12,000 customers unable to complete purchases
|
||||
- Estimated revenue loss: $45,000
|
||||
- 847 support tickets created
|
||||
- No data loss or security implications
|
||||
|
||||
## Timeline (All times UTC)
|
||||
|
||||
| Time | Event |
|
||||
|------|-------|
|
||||
| 14:23 | Deployment v2.3.4 completed to production |
|
||||
| 14:31 | First alert: `payment_error_rate > 5%` |
|
||||
| 14:33 | On-call engineer @alice acknowledges alert |
|
||||
| 14:35 | Initial investigation begins, error rate at 23% |
|
||||
| 14:41 | Incident declared SEV2, @bob joins |
|
||||
| 14:45 | Database connection exhaustion identified |
|
||||
| 14:52 | Decision to rollback deployment |
|
||||
| 14:58 | Rollback to v2.3.3 initiated |
|
||||
| 15:10 | Rollback complete, error rate dropping |
|
||||
| 15:18 | Service fully recovered, incident resolved |
|
||||
|
||||
## Root Cause Analysis
|
||||
|
||||
### What Happened
|
||||
|
||||
The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
|
||||
|
||||
### Why It Happened
|
||||
|
||||
1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.
|
||||
|
||||
2. **Contributing Factors**:
|
||||
- Code review did not catch the connection handling change
|
||||
- No integration tests specifically for connection pool behavior
|
||||
- Staging environment has lower traffic, masking the issue
|
||||
- Database connection metrics alert threshold was too high (90%)
|
||||
|
||||
3. **5 Whys Analysis**:
|
||||
- Why did the service fail? → Database connections exhausted
|
||||
- Why were connections exhausted? → Each request opened new connection
|
||||
- Why did each request open new connection? → Code bypassed connection pool
|
||||
- Why did code bypass connection pool? → Developer unfamiliar with codebase patterns
|
||||
- Why was developer unfamiliar? → No documentation on connection management patterns
|
||||
|
||||
### System Diagram
|
||||
|
||||
```
|
||||
[Client] → [Load Balancer] → [Payment Service] → [Database]
|
||||
↓
|
||||
Connection Pool (broken)
|
||||
↓
|
||||
Direct connections (cause)
|
||||
```
|
||||
|
||||
## Detection
|
||||
|
||||
### What Worked
|
||||
- Error rate alert fired within 8 minutes of deployment
|
||||
- Grafana dashboard clearly showed connection spike
|
||||
- On-call response was swift (2 minute acknowledgment)
|
||||
|
||||
### What Didn't Work
|
||||
- Database connection metric alert threshold too high
|
||||
- No deployment-correlated alerting
|
||||
- Canary deployment would have caught this earlier
|
||||
|
||||
### Detection Gap
|
||||
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
|
||||
|
||||
## Response
|
||||
|
||||
### What Worked
|
||||
- On-call engineer quickly identified database as the issue
|
||||
- Rollback decision was made decisively
|
||||
- Clear communication in incident channel
|
||||
|
||||
### What Could Be Improved
|
||||
- Took 10 minutes to correlate issue with recent deployment
|
||||
- Had to manually check deployment history
|
||||
- Rollback took 12 minutes (could be faster)
|
||||
|
||||
## Impact
|
||||
|
||||
### Customer Impact
|
||||
- 12,000 unique customers affected
|
||||
- Average impact duration: 35 minutes
|
||||
- 847 support tickets (23% of affected users)
|
||||
- Customer satisfaction score dropped 12 points
|
||||
|
||||
### Business Impact
|
||||
- Estimated revenue loss: $45,000
|
||||
- Support cost: ~$2,500 (agent time)
|
||||
- Engineering time: ~8 person-hours
|
||||
|
||||
### Technical Impact
|
||||
- Database primary experienced elevated load
|
||||
- Some replica lag during incident
|
||||
- No permanent damage to systems
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
### What Went Well
|
||||
1. Alerting detected the issue before customer reports
|
||||
2. Team collaborated effectively under pressure
|
||||
3. Rollback procedure worked smoothly
|
||||
4. Communication was clear and timely
|
||||
|
||||
### What Went Wrong
|
||||
1. Code review missed critical change
|
||||
2. Test coverage gap for connection pooling
|
||||
3. Staging environment doesn't reflect production traffic
|
||||
4. Alert thresholds were not tuned properly
|
||||
|
||||
### Where We Got Lucky
|
||||
1. Incident occurred during business hours with full team available
|
||||
2. Database handled the load without failing completely
|
||||
3. No other incidents occurred simultaneously
|
||||
|
||||
## Action Items
|
||||
|
||||
| Priority | Action | Owner | Due Date | Ticket |
|
||||
|----------|--------|-------|----------|--------|
|
||||
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
|
||||
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
|
||||
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
|
||||
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
|
||||
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
|
||||
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
|
||||
|
||||
## Appendix
|
||||
|
||||
### Supporting Data
|
||||
|
||||
#### Error Rate Graph
|
||||
[Link to Grafana dashboard snapshot]
|
||||
|
||||
#### Database Connection Graph
|
||||
[Link to metrics]
|
||||
|
||||
### Related Incidents
|
||||
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
|
||||
|
||||
### References
|
||||
- [Connection Pool Best Practices](internal-wiki/connection-pools)
|
||||
- [Deployment Runbook](internal-wiki/deployment-runbook)
|
||||
```
|
||||
|
||||
### Template 2: 5 Whys Analysis
|
||||
|
||||
```markdown
|
||||
# 5 Whys Analysis: [Incident]
|
||||
|
||||
## Problem Statement
|
||||
Payment service experienced 47-minute outage due to database connection exhaustion.
|
||||
|
||||
## Analysis
|
||||
|
||||
### Why #1: Why did the service fail?
|
||||
**Answer**: Database connections were exhausted, causing all new requests to fail.
|
||||
|
||||
**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
|
||||
|
||||
---
|
||||
|
||||
### Why #2: Why were database connections exhausted?
|
||||
**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
|
||||
|
||||
**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
|
||||
|
||||
---
|
||||
|
||||
### Why #3: Why did the code bypass the connection pool?
|
||||
**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
|
||||
|
||||
**Evidence**: PR #1234 shows the change, made while fixing a different bug.
|
||||
|
||||
---
|
||||
|
||||
### Why #4: Why wasn't this caught in code review?
|
||||
**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
|
||||
|
||||
**Evidence**: Review comments only discuss business logic.
|
||||
|
||||
---
|
||||
|
||||
### Why #5: Why isn't there a safety net for this type of change?
|
||||
**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
|
||||
|
||||
**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
|
||||
|
||||
## Root Causes Identified
|
||||
|
||||
1. **Primary**: Missing automated tests for infrastructure behavior
|
||||
2. **Secondary**: Insufficient documentation of architectural patterns
|
||||
3. **Tertiary**: Code review checklist doesn't include infrastructure considerations
|
||||
|
||||
## Systemic Improvements
|
||||
|
||||
| Root Cause | Improvement | Type |
|
||||
|------------|-------------|------|
|
||||
| Missing tests | Add infrastructure behavior tests | Prevention |
|
||||
| Missing docs | Document connection patterns | Prevention |
|
||||
| Review gaps | Update review checklist | Detection |
|
||||
| No canary | Implement canary deployments | Mitigation |
|
||||
```
|
||||
|
||||
### Template 3: Quick Postmortem (Minor Incidents)
|
||||
|
||||
```markdown
|
||||
# Quick Postmortem: [Brief Title]
|
||||
|
||||
**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3
|
||||
|
||||
## What Happened
|
||||
API latency spiked to 5s due to cache miss storm after cache flush.
|
||||
|
||||
## Timeline
|
||||
- 10:00 - Cache flush initiated for config update
|
||||
- 10:02 - Latency alerts fire
|
||||
- 10:05 - Identified as cache miss storm
|
||||
- 10:08 - Enabled cache warming
|
||||
- 10:12 - Latency normalized
|
||||
|
||||
## Root Cause
|
||||
Full cache flush for minor config update caused thundering herd.
|
||||
|
||||
## Fix
|
||||
- Immediate: Enabled cache warming
|
||||
- Long-term: Implement partial cache invalidation (ENG-999)
|
||||
|
||||
## Lessons
|
||||
Don't full-flush cache in production; use targeted invalidation.
|
||||
```
|
||||
|
||||
## Facilitation Guide
|
||||
|
||||
### Running a Postmortem Meeting
|
||||
|
||||
```markdown
|
||||
## Meeting Structure (60 minutes)
|
||||
|
||||
### 1. Opening (5 min)
|
||||
- Remind everyone of blameless culture
|
||||
- "We're here to learn, not to blame"
|
||||
- Review meeting norms
|
||||
|
||||
### 2. Timeline Review (15 min)
|
||||
- Walk through events chronologically
|
||||
- Ask clarifying questions
|
||||
- Identify gaps in timeline
|
||||
|
||||
### 3. Analysis Discussion (20 min)
|
||||
- What failed?
|
||||
- Why did it fail?
|
||||
- What conditions allowed this?
|
||||
- What would have prevented it?
|
||||
|
||||
### 4. Action Items (15 min)
|
||||
- Brainstorm improvements
|
||||
- Prioritize by impact and effort
|
||||
- Assign owners and due dates
|
||||
|
||||
### 5. Closing (5 min)
|
||||
- Summarize key learnings
|
||||
- Confirm action item owners
|
||||
- Schedule follow-up if needed
|
||||
|
||||
## Facilitation Tips
|
||||
- Keep discussion on track
|
||||
- Redirect blame to systems
|
||||
- Encourage quiet participants
|
||||
- Document dissenting views
|
||||
- Time-box tangents
|
||||
```
|
||||
|
||||
## Anti-Patterns to Avoid
|
||||
|
||||
| Anti-Pattern | Problem | Better Approach |
|
||||
|--------------|---------|-----------------|
|
||||
| **Blame game** | Shuts down learning | Focus on systems |
|
||||
| **Shallow analysis** | Doesn't prevent recurrence | Ask "why" 5 times |
|
||||
| **No action items** | Waste of time | Always have concrete next steps |
|
||||
| **Unrealistic actions** | Never completed | Scope to achievable tasks |
|
||||
| **No follow-up** | Actions forgotten | Track in ticketing system |
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Do's
|
||||
- **Start immediately** - Memory fades fast
|
||||
- **Be specific** - Exact times, exact errors
|
||||
- **Include graphs** - Visual evidence
|
||||
- **Assign owners** - No orphan action items
|
||||
- **Share widely** - Organizational learning
|
||||
|
||||
### Don'ts
|
||||
- **Don't name and shame** - Ever
|
||||
- **Don't skip small incidents** - They reveal patterns
|
||||
- **Don't make it a blame doc** - That kills learning
|
||||
- **Don't create busywork** - Actions should be meaningful
|
||||
- **Don't skip follow-up** - Verify actions completed
|
||||
|
||||
## Resources
|
||||
|
||||
- [Google SRE - Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
|
||||
- [Etsy's Blameless Postmortems](https://codeascraft.com/2012/05/22/blameless-postmortems/)
|
||||
- [PagerDuty Postmortem Guide](https://postmortems.pagerduty.com/)
|
||||
Reference in New Issue
Block a user