feat: add 5 new specialized agents with 20 skills

Add domain expert agents with comprehensive skill sets: - service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability - event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas - vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search - monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces - threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements Update all documentation to reflect correct counts: - 67 plugins, 99 agents, 107 skills, 71 commands
2026-03-18 09:37:15 +00:00 · 2025-12-16 16:00:58 -05:00
parent c7ad381360
commit 01d93fc227
58 changed files with 24830 additions and 50 deletions
--- a/plugins/incident-response/skills/postmortem-writing/SKILL.md
+++ b/plugins/incident-response/skills/postmortem-writing/SKILL.md
@@ -0,0 +1,374 @@
+---
+name: postmortem-writing
+description: Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.
+---
+
+# Postmortem Writing
+
+Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
+
+## When to Use This Skill
+
+- Conducting post-incident reviews
+- Writing postmortem documents
+- Facilitating blameless postmortem meetings
+- Identifying root causes and contributing factors
+- Creating actionable follow-up items
+- Building organizational learning culture
+
+## Core Concepts
+
+### 1. Blameless Culture
+
+| Blame-Focused | Blameless |
+|---------------|-----------|
+| "Who caused this?" | "What conditions allowed this?" |
+| "Someone made a mistake" | "The system allowed this mistake" |
+| Punish individuals | Improve systems |
+| Hide information | Share learnings |
+| Fear of speaking up | Psychological safety |
+
+### 2. Postmortem Triggers
+
+- SEV1 or SEV2 incidents
+- Customer-facing outages > 15 minutes
+- Data loss or security incidents
+- Near-misses that could have been severe
+- Novel failure modes
+- Incidents requiring unusual intervention
+
+## Quick Start
+
+### Postmortem Timeline
+```
+Day 0: Incident occurs
+Day 1-2: Draft postmortem document
+Day 3-5: Postmortem meeting
+Day 5-7: Finalize document, create tickets
+Week 2+: Action item completion
+Quarterly: Review patterns across incidents
+```
+
+## Templates
+
+### Template 1: Standard Postmortem
+
+```markdown
+# Postmortem: [Incident Title]
+
+**Date**: 2024-01-15
+**Authors**: @alice, @bob
+**Status**: Draft | In Review | Final
+**Incident Severity**: SEV2
+**Incident Duration**: 47 minutes
+
+## Executive Summary
+
+On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
+
+**Impact**:
+- 12,000 customers unable to complete purchases
+- Estimated revenue loss: $45,000
+- 847 support tickets created
+- No data loss or security implications
+
+## Timeline (All times UTC)
+
+| Time | Event |
+|------|-------|
+| 14:23 | Deployment v2.3.4 completed to production |
+| 14:31 | First alert: `payment_error_rate > 5%` |
+| 14:33 | On-call engineer @alice acknowledges alert |
+| 14:35 | Initial investigation begins, error rate at 23% |
+| 14:41 | Incident declared SEV2, @bob joins |
+| 14:45 | Database connection exhaustion identified |
+| 14:52 | Decision to rollback deployment |
+| 14:58 | Rollback to v2.3.3 initiated |
+| 15:10 | Rollback complete, error rate dropping |
+| 15:18 | Service fully recovered, incident resolved |
+
+## Root Cause Analysis
+
+### What Happened
+
+The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
+
+### Why It Happened
+
+1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.
+
+2. **Contributing Factors**:
+   - Code review did not catch the connection handling change
+   - No integration tests specifically for connection pool behavior
+   - Staging environment has lower traffic, masking the issue
+   - Database connection metrics alert threshold was too high (90%)
+
+3. **5 Whys Analysis**:
+   - Why did the service fail? → Database connections exhausted
+   - Why were connections exhausted? → Each request opened new connection
+   - Why did each request open new connection? → Code bypassed connection pool
+   - Why did code bypass connection pool? → Developer unfamiliar with codebase patterns
+   - Why was developer unfamiliar? → No documentation on connection management patterns
+
+### System Diagram
+
+```
+[Client] → [Load Balancer] → [Payment Service] → [Database]
+                                    ↓
+                            Connection Pool (broken)
+                                    ↓
+                            Direct connections (cause)
+```
+
+## Detection
+
+### What Worked
+- Error rate alert fired within 8 minutes of deployment
+- Grafana dashboard clearly showed connection spike
+- On-call response was swift (2 minute acknowledgment)
+
+### What Didn't Work
+- Database connection metric alert threshold too high
+- No deployment-correlated alerting
+- Canary deployment would have caught this earlier
+
+### Detection Gap
+The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
+
+## Response
+
+### What Worked
+- On-call engineer quickly identified database as the issue
+- Rollback decision was made decisively
+- Clear communication in incident channel
+
+### What Could Be Improved
+- Took 10 minutes to correlate issue with recent deployment
+- Had to manually check deployment history
+- Rollback took 12 minutes (could be faster)
+
+## Impact
+
+### Customer Impact
+- 12,000 unique customers affected
+- Average impact duration: 35 minutes
+- 847 support tickets (23% of affected users)
+- Customer satisfaction score dropped 12 points
+
+### Business Impact
+- Estimated revenue loss: $45,000
+- Support cost: ~$2,500 (agent time)
+- Engineering time: ~8 person-hours
+
+### Technical Impact
+- Database primary experienced elevated load
+- Some replica lag during incident
+- No permanent damage to systems
+
+## Lessons Learned
+
+### What Went Well
+1. Alerting detected the issue before customer reports
+2. Team collaborated effectively under pressure
+3. Rollback procedure worked smoothly
+4. Communication was clear and timely
+
+### What Went Wrong
+1. Code review missed critical change
+2. Test coverage gap for connection pooling
+3. Staging environment doesn't reflect production traffic
+4. Alert thresholds were not tuned properly
+
+### Where We Got Lucky
+1. Incident occurred during business hours with full team available
+2. Database handled the load without failing completely
+3. No other incidents occurred simultaneously
+
+## Action Items
+
+| Priority | Action | Owner | Due Date | Ticket |
+|----------|--------|-------|----------|--------|
+| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
+| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
+| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
+| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
+| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
+| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
+
+## Appendix
+
+### Supporting Data
+
+#### Error Rate Graph
+[Link to Grafana dashboard snapshot]
+
+#### Database Connection Graph
+[Link to metrics]
+
+### Related Incidents
+- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
+
+### References
+- [Connection Pool Best Practices](internal-wiki/connection-pools)
+- [Deployment Runbook](internal-wiki/deployment-runbook)
+```
+
+### Template 2: 5 Whys Analysis
+
+```markdown
+# 5 Whys Analysis: [Incident]
+
+## Problem Statement
+Payment service experienced 47-minute outage due to database connection exhaustion.
+
+## Analysis
+
+### Why #1: Why did the service fail?
+**Answer**: Database connections were exhausted, causing all new requests to fail.
+
+**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
+
+---
+
+### Why #2: Why were database connections exhausted?
+**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
+
+**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
+
+---
+
+### Why #3: Why did the code bypass the connection pool?
+**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
+
+**Evidence**: PR #1234 shows the change, made while fixing a different bug.
+
+---
+
+### Why #4: Why wasn't this caught in code review?
+**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
+
+**Evidence**: Review comments only discuss business logic.
+
+---
+
+### Why #5: Why isn't there a safety net for this type of change?
+**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
+
+**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
+
+## Root Causes Identified
+
+1. **Primary**: Missing automated tests for infrastructure behavior
+2. **Secondary**: Insufficient documentation of architectural patterns
+3. **Tertiary**: Code review checklist doesn't include infrastructure considerations
+
+## Systemic Improvements
+
+| Root Cause | Improvement | Type |
+|------------|-------------|------|
+| Missing tests | Add infrastructure behavior tests | Prevention |
+| Missing docs | Document connection patterns | Prevention |
+| Review gaps | Update review checklist | Detection |
+| No canary | Implement canary deployments | Mitigation |
+```
+
+### Template 3: Quick Postmortem (Minor Incidents)
+
+```markdown
+# Quick Postmortem: [Brief Title]
+
+**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3
+
+## What Happened
+API latency spiked to 5s due to cache miss storm after cache flush.
+
+## Timeline
+- 10:00 - Cache flush initiated for config update
+- 10:02 - Latency alerts fire
+- 10:05 - Identified as cache miss storm
+- 10:08 - Enabled cache warming
+- 10:12 - Latency normalized
+
+## Root Cause
+Full cache flush for minor config update caused thundering herd.
+
+## Fix
+- Immediate: Enabled cache warming
+- Long-term: Implement partial cache invalidation (ENG-999)
+
+## Lessons
+Don't full-flush cache in production; use targeted invalidation.
+```
+
+## Facilitation Guide
+
+### Running a Postmortem Meeting
+
+```markdown
+## Meeting Structure (60 minutes)
+
+### 1. Opening (5 min)
+- Remind everyone of blameless culture
+- "We're here to learn, not to blame"
+- Review meeting norms
+
+### 2. Timeline Review (15 min)
+- Walk through events chronologically
+- Ask clarifying questions
+- Identify gaps in timeline
+
+### 3. Analysis Discussion (20 min)
+- What failed?
+- Why did it fail?
+- What conditions allowed this?
+- What would have prevented it?
+
+### 4. Action Items (15 min)
+- Brainstorm improvements
+- Prioritize by impact and effort
+- Assign owners and due dates
+
+### 5. Closing (5 min)
+- Summarize key learnings
+- Confirm action item owners
+- Schedule follow-up if needed
+
+## Facilitation Tips
+- Keep discussion on track
+- Redirect blame to systems
+- Encourage quiet participants
+- Document dissenting views
+- Time-box tangents
+```
+
+## Anti-Patterns to Avoid
+
+| Anti-Pattern | Problem | Better Approach |
+|--------------|---------|-----------------|
+| **Blame game** | Shuts down learning | Focus on systems |
+| **Shallow analysis** | Doesn't prevent recurrence | Ask "why" 5 times |
+| **No action items** | Waste of time | Always have concrete next steps |
+| **Unrealistic actions** | Never completed | Scope to achievable tasks |
+| **No follow-up** | Actions forgotten | Track in ticketing system |
+
+## Best Practices
+
+### Do's
+- **Start immediately** - Memory fades fast
+- **Be specific** - Exact times, exact errors
+- **Include graphs** - Visual evidence
+- **Assign owners** - No orphan action items
+- **Share widely** - Organizational learning
+
+### Don'ts
+- **Don't name and shame** - Ever
+- **Don't skip small incidents** - They reveal patterns
+- **Don't make it a blame doc** - That kills learning
+- **Don't create busywork** - Actions should be meaningful
+- **Don't skip follow-up** - Verify actions completed
+
+## Resources
+
+- [Google SRE - Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
+- [Etsy's Blameless Postmortems](https://codeascraft.com/2012/05/22/blameless-postmortems/)
+- [PagerDuty Postmortem Guide](https://postmortems.pagerduty.com/)