feat: add 5 new specialized agents with 20 skills

Add domain expert agents with comprehensive skill sets:
- service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability
- event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas
- vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search
- monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces
- threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements

Update all documentation to reflect correct counts:
- 67 plugins, 99 agents, 107 skills, 71 commands
Author: Seth Hobson
Date: 2025-12-16 16:00:58 -05:00
Parent: c7ad381360
Commit: 01d93fc227

58 changed files with 24,830 additions and 50 deletions


@@ -0,0 +1,383 @@
---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
---
# Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
## When to Use This Skill
- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers
## Core Concepts
### 1. Incident Severity Levels
| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| **SEV1** | Complete outage, data loss | 15 min | Production down |
| **SEV2** | Major degradation | 30 min | Critical feature broken |
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |
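Severity is most useful when it is encoded directly in alerting config so paging urgency follows automatically. A minimal Alertmanager routing sketch, assuming alerts carry a `severity` label (the receiver names are placeholders, not a real setup):
```bash
# Alertmanager routing sketch: SEV1/SEV2 page on-call, lower severities go to Slack.
# Receiver names (pagerduty-oncall, slack-default) are placeholders.
cat <<'EOF' > alertmanager-routes.yml
route:
  receiver: slack-default
  routes:
    - matchers:
        - severity =~ "sev1|sev2"
      receiver: pagerduty-oncall
      repeat_interval: 15m
EOF
```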
### 2. Runbook Structure
```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```
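To keep this structure consistent across services, scaffold new runbooks from it. A small sketch (the output path is an assumption):
```bash
#!/usr/bin/env bash
# Scaffold a runbook skeleton following the nine-section structure above.
# Usage: ./new-runbook.sh payment-service
set -euo pipefail
service="${1:?usage: new-runbook.sh <service-name>}"
out="runbooks/${service}.md"
mkdir -p runbooks
cat > "$out" <<EOF
# ${service} Runbook
## 1. Overview & Impact
## 2. Detection & Alerts
## 3. Initial Triage
## 4. Mitigation Steps
## 5. Root Cause Investigation
## 6. Resolution Procedures
## 7. Verification & Rollback
## 8. Communication Templates
## 9. Escalation Matrix
EOF
echo "Created $out"
```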
## Runbook Templates
### Template 1: Service Outage Runbook
```markdown
# [Service Name] Outage Runbook
## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall
## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?
## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
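### Example Alert Definition
A sketch of the first alert as a Prometheus rule, assuming a `payment_requests_total` counter with a `status` label (metric and label names are illustrative):
```bash
# Hypothetical Prometheus rule for the error-rate alert above.
cat <<'EOF' > payment-alerts.yml
groups:
  - name: payments
    rules:
      - alert: PaymentErrorRateHigh
        expr: |
          sum(rate(payment_requests_total{status=~"5.."}[5m]))
            / sum(rate(payment_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: sev2
        annotations:
          summary: "Payment error rate above 5%"
EOF
```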
## Initial Triage (First 5 Minutes)
### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
# Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history
### 3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
## Mitigation Procedures
### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments
# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100
# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments
# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10
# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```
### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
curl localhost:8080/metrics | grep db_pool
# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"
# Step 3: Kill long-running queries if needed (pg_terminate_backend needs target pids)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 minutes';"
# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health
# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
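If `curl-format.txt` is missing on the host, create it first (all fields are standard curl `--write-out` variables):
```bash
# Create the timing template referenced by `curl -w "@curl-format.txt"` above.
cat <<'EOF' > curl-format.txt
    time_namelookup:  %{time_namelookup}s
       time_connect:  %{time_connect}s
    time_appconnect:  %{time_appconnect}s
 time_starttransfer:  %{time_starttransfer}s
         time_total:  %{time_total}s
EOF
```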
### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
grep -i error | sort | uniq -c | sort -rn | head -20
# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments
# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
-d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'
# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```
### 4.4 Traffic Surge
```bash
# Step 1: Check current request rate
kubectl top pods -n payments
# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20
# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
RATE_LIMIT_ENABLED=true \
RATE_LIMIT_RPS=1000 -n payments
# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 192.168.1.0/24  # Suspicious range
EOF
```
## Verification Steps
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq
# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'
# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq
# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
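`scripts/smoke-test-payments.sh` is service-specific; a minimal sketch of what it should cover (the endpoints and test token are placeholders):
```bash
#!/usr/bin/env bash
# Minimal payments smoke test: health check plus one end-to-end test charge.
set -euo pipefail
base="https://api.company.com/payments"

# 1. Health endpoint must return 200
curl -fsS -o /dev/null "$base/health"

# 2. Create a small test-mode charge and confirm it succeeds
status=$(curl -fsS -X POST "$base/charges" \
  -H 'Content-Type: application/json' \
  -d '{"amount_cents": 100, "currency": "usd", "source": "tok_test"}' |
  jq -r '.status')
[ "$status" = "succeeded" ] || { echo "smoke test failed: status=$status" >&2; exit 1; }
echo "payments smoke test passed"
```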
## Rollback Procedures
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments
# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION
# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
-d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
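`scripts/db-rollback.sh` depends on your migration tooling; with golang-migrate, for example, it can be a thin wrapper like this sketch (`$DATABASE_URL` and the migrations path are assumptions):
```bash
#!/usr/bin/env bash
# Roll the schema back to a known-good version with golang-migrate.
# Usage: ./db-rollback.sh <target-version>
set -euo pipefail
target="${1:?usage: db-rollback.sh <target-version>}"
migrate -path ./migrations -database "$DATABASE_URL" goto "$target"
```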
## Escalation Matrix
| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
## Communication Templates
### Initial Notification (Internal)
```
🚨 INCIDENT: Payment Service Degradation
Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]
Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards
Updates in #payments-incidents
```
### Status Update
```
📊 UPDATE: Payment Service Incident
Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes
Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas
Next Steps:
- Continuing to monitor
- Root cause analysis in progress
ETA to Resolution: ~15 minutes
```
### Resolution Notification
```
✅ RESOLVED: Payment Service Incident
Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4
Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully
Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
```
### Template 2: Database Incident Runbook
```markdown
# Database Incident Runbook
## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;
-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```
## Replication Lag
```sql
-- Check lag on replica
SELECT
CASE
WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
END AS lag_seconds;
-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
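If failover is unavoidable, PostgreSQL 12+ replicas can be promoted in place. A sketch (the host name is a placeholder; confirm the primary is unrecoverable first):
```bash
# Promote the standby to primary (PostgreSQL 12+), then verify.
psql -h replica-1.internal -U "$DB_USER" -c "SELECT pg_promote();"
psql -h replica-1.internal -U "$DB_USER" -c "SELECT pg_is_in_recovery();"  # expect: f
```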
## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data
# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"
# VACUUM FULL to reclaim space (WARNING: takes an ACCESS EXCLUSIVE lock, blocking reads and writes)
psql -c "VACUUM FULL large_table;"
# If emergency, delete old data or expand disk
```
```
## Best Practices
### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress
### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident
## Resources
- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)


@@ -0,0 +1,441 @@
---
name: on-call-handoff-patterns
description: Master on-call shift handoffs with context transfer, escalation procedures, and documentation. Use when transitioning on-call responsibilities, documenting shift summaries, or improving on-call processes.
---
# On-Call Handoff Patterns
Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.
## When to Use This Skill
- Transitioning on-call responsibilities
- Writing shift handoff summaries
- Documenting ongoing investigations
- Establishing on-call rotation procedures
- Improving handoff quality
- Onboarding new on-call engineers
## Core Concepts
### 1. Handoff Components
| Component | Purpose |
|-----------|---------|
| **Active Incidents** | What's currently broken |
| **Ongoing Investigations** | Issues being debugged |
| **Recent Changes** | Deployments, configs |
| **Known Issues** | Workarounds in place |
| **Upcoming Events** | Maintenance, releases |
### 2. Handoff Timing
```
Recommended: 30 min overlap between shifts
Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming
Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
```
## Templates
### Template 1: Shift Handoff Document
```markdown
# On-Call Handoff: Platform Team
**Outgoing**: @alice (2024-01-15 to 2024-01-22)
**Incoming**: @bob (2024-01-22 to 2024-01-29)
**Handoff Time**: 2024-01-22 09:00 UTC
---
## 🔴 Active Incidents
### None currently active
No active incidents at handoff time.
---
## 🟡 Ongoing Investigations
### 1. Intermittent API Timeouts (ENG-1234)
**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out
**Context**:
- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)
**Next Steps**:
- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed
**Resources**:
- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)
---
### 2. Memory Growth in Auth Service (ENG-1235)
**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)
**Context**:
- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly
**Next Steps**:
- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%
**Resources**:
- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)
---
## 🟢 Resolved This Shift
### Payment Service Outage (2024-01-19)
- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
- **Follow-up tickets**: ENG-1230, ENG-1231
---
## 📋 Recent Changes
### Deployments
| Service | Version | Time | Notes |
|---------|---------|------|-------|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |
### Configuration Changes
- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75
### Infrastructure
- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0
---
## ⚠️ Known Issues & Workarounds
### 1. Slow Dashboard Loading
**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)
### 2. Flaky Integration Test
**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)
---
## 📅 Upcoming Events
| Date | Event | Impact | Contact |
|------|-------|--------|---------|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |
---
## 📞 Escalation Reminders
| Issue Type | First Escalation | Second Escalation |
|------------|------------------|-------------------|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |
---
## 🔧 Quick Reference
### Common Commands
```bash
# Check service health
kubectl get pods -A | grep -v Running
# Recent deployments
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"
# Clear cache (emergency only)
redis-cli FLUSHDB
```
### Important Links
- [Runbooks](https://wiki/runbooks)
- [Service Catalog](https://wiki/services)
- [Incident Slack](https://slack.com/incidents)
- [PagerDuty](https://pagerduty.com/schedules)
---
## Handoff Checklist
### Outgoing Engineer
- [x] Document active incidents
- [x] Document ongoing investigations
- [x] List recent changes
- [x] Note known issues
- [x] Add upcoming events
- [x] Sync with incoming engineer
### Incoming Engineer
- [ ] Read this document
- [ ] Join sync call
- [ ] Verify PagerDuty is routing to you
- [ ] Verify Slack notifications working
- [ ] Check VPN/access working
- [ ] Review critical dashboards
```
### Template 2: Quick Handoff (Async)
```markdown
# Quick Handoff: @alice → @bob
## TL;DR
- No active incidents
- 1 investigation ongoing (API timeouts, see ENG-1234)
- Major release tomorrow (01/24) - be ready for issues
## Watch List
1. API latency around 02:00-03:00 UTC (backup window)
2. Auth service memory (restart if > 80%)
## Recent
- Deployed api-gateway v3.2.1 yesterday (stable)
- Increased rate limits to 1500 RPS
## Coming Up
- 01/23 02:00 - DB maintenance (5 min read-only)
- 01/24 14:00 - v5.0 release
## Questions?
I'll be available on Slack until 17:00 today.
```
### Template 3: Incident Handoff (Mid-Incident)
```markdown
# INCIDENT HANDOFF: Payment Service Degradation
**Incident Start**: 2024-01-22 08:15 UTC
**Current Status**: Mitigating
**Severity**: SEV2
---
## Current State
- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min
## What We Know
1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow
## What We've Done
- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features
## What Needs to Happen
1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation
## Key People
- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)
## Communication
- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware
## Resources
- Incident channel: #inc-20240122-payment
- Dashboard: [Payment Service](https://grafana/d/payments)
- Runbook: [Payment Degradation](https://wiki/runbooks/payments)
---
**Incoming on-call (@bob) - Please confirm you have:**
- [ ] Joined #inc-20240122-payment
- [ ] Access to dashboards
- [ ] Understand current state
- [ ] Know escalation path
```
## Handoff Sync Meeting
### Agenda (15 minutes)
```markdown
## Handoff Sync: @alice → @bob
1. **Active Issues** (5 min)
- Walk through any ongoing incidents
- Discuss investigation status
- Transfer context and theories
2. **Recent Changes** (3 min)
- Deployments to watch
- Config changes
- Known regressions
3. **Upcoming Events** (3 min)
- Maintenance windows
- Expected traffic changes
- Releases planned
4. **Questions** (4 min)
- Clarify anything unclear
- Confirm access and alerting
- Exchange contact info
```
## On-Call Best Practices
### Before Your Shift
```markdown
## Pre-Shift Checklist
### Access Verification
- [ ] VPN working
- [ ] kubectl access to all clusters
- [ ] Database read access
- [ ] Log aggregator access (Splunk/Datadog)
- [ ] PagerDuty app installed and logged in
### Alerting Setup
- [ ] PagerDuty schedule shows you as primary
- [ ] Phone notifications enabled
- [ ] Slack notifications for incident channels
- [ ] Test alert received and acknowledged
### Knowledge Refresh
- [ ] Review recent incidents (past 2 weeks)
- [ ] Check service changelog
- [ ] Skim critical runbooks
- [ ] Know escalation contacts
### Environment Ready
- [ ] Laptop charged and accessible
- [ ] Phone charged
- [ ] Quiet space available for calls
- [ ] Secondary contact identified (if traveling)
```
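The "test alert" item can be automated with the PagerDuty Events API v2. A sketch (use a test service's routing key, shown here as a placeholder):
```bash
# Trigger a throwaway incident via PagerDuty Events API v2, then ack it in the app.
curl -fsS -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "YOUR_TEST_SERVICE_ROUTING_KEY",
    "event_action": "trigger",
    "payload": {
      "summary": "On-call setup test - please ack and resolve",
      "source": "pre-shift-checklist",
      "severity": "info"
    }
  }'
```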
### During Your Shift
```markdown
## Daily On-Call Routine
### Morning (start of day)
- [ ] Check overnight alerts
- [ ] Review dashboards for anomalies
- [ ] Check for any P0/P1 tickets created
- [ ] Skim incident channels for context
### Throughout Day
- [ ] Respond to alerts within SLA
- [ ] Document investigation progress
- [ ] Update team on significant issues
- [ ] Triage incoming pages
### End of Day
- [ ] Hand off any active issues
- [ ] Update investigation docs
- [ ] Note anything for next shift
```
### After Your Shift
```markdown
## Post-Shift Checklist
- [ ] Complete handoff document
- [ ] Sync with incoming on-call
- [ ] Verify PagerDuty routing changed
- [ ] Close/update investigation tickets
- [ ] File postmortems for any incidents
- [ ] Take time off if shift was stressful
```
## Escalation Guidelines
### When to Escalate
```markdown
## Escalation Triggers
### Immediate Escalation
- SEV1 incident declared
- Data breach suspected
- Unable to diagnose within 30 min
- Customer or legal escalation received
### Consider Escalation
- Issue spans multiple teams
- Requires expertise you don't have
- Business impact exceeds threshold
- You're uncertain about next steps
### How to Escalate
1. Page the appropriate escalation path
2. Provide brief context in Slack
3. Stay engaged until escalation acknowledges
4. Hand off cleanly, don't just disappear
```
## Best Practices
### Do's
- **Document everything** - Future you will thank you
- **Escalate early** - Better safe than sorry
- **Take breaks** - Alert fatigue is real
- **Keep handoffs synchronous** - Async loses context
- **Test your setup** - Before incidents, not during
### Don'ts
- **Don't skip handoffs** - Context loss causes incidents
- **Don't hero** - Escalate when needed
- **Don't ignore alerts** - Even if they seem minor
- **Don't work sick** - Swap shifts instead
- **Don't disappear** - Stay reachable during shift
## Resources
- [Google SRE - Being On-Call](https://sre.google/sre-book/being-on-call/)
- [PagerDuty On-Call Guide](https://www.pagerduty.com/resources/learn/on-call-management/)
- [Increment On-Call Issue](https://increment.com/on-call/)


@@ -0,0 +1,374 @@
---
name: postmortem-writing
description: Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.
---
# Postmortem Writing
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
## When to Use This Skill
- Conducting post-incident reviews
- Writing postmortem documents
- Facilitating blameless postmortem meetings
- Identifying root causes and contributing factors
- Creating actionable follow-up items
- Building organizational learning culture
## Core Concepts
### 1. Blameless Culture
| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
### 2. Postmortem Triggers
- SEV1 or SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
- Incidents requiring unusual intervention
## Quick Start
### Postmortem Timeline
```
Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents
```
## Templates
### Template 1: Standard Postmortem
```markdown
# Postmortem: [Incident Title]
**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes
## Executive Summary
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
**Impact**:
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications
## Timeline (All times UTC)
| Time | Event |
|------|-------|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to rollback deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
## Root Cause Analysis
### What Happened
The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
### Why It Happened
1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.
2. **Contributing Factors**:
- Code review did not catch the connection handling change
- No integration tests specifically for connection pool behavior
- Staging environment has lower traffic, masking the issue
- Database connection metrics alert threshold was too high (90%)
3. **5 Whys Analysis**:
- Why did the service fail? → Database connections exhausted
- Why were connections exhausted? → Each request opened new connection
- Why did each request open new connection? → Code bypassed connection pool
- Why did code bypass connection pool? → Developer unfamiliar with codebase patterns
- Why was developer unfamiliar? → No documentation on connection management patterns
### System Diagram
```
[Client] → [Load Balancer] → [Payment Service] → [Database]
                                    │
                                    ├─ Connection pool (bypassed)
                                    └─ Direct per-request connections (root cause)
```
## Detection
### What Worked
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2 minute acknowledgment)
### What Didn't Work
- Database connection metric alert threshold too high
- No deployment-correlated alerting
- Canary deployment would have caught this earlier
### Detection Gap
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
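A deployment-correlated alert can be approximated with kube-state-metrics. A sketch of the idea (the error metric is illustrative; `kube_deployment_status_observed_generation` is a standard kube-state-metrics series):
```bash
# Sketch: fire only when errors are elevated AND the deployment changed recently.
curl -fsS http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=
    sum(rate(http_requests_total{status=~"5..",job="payment-service"}[5m])) > 10
    and on()
    max(changes(kube_deployment_status_observed_generation{deployment="payment-service"}[15m])) > 0'
```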
## Response
### What Worked
- On-call engineer quickly identified database as the issue
- Rollback decision was made decisively
- Clear communication in incident channel
### What Could Be Improved
- Took 10 minutes to correlate issue with recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)
## Impact
### Customer Impact
- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (~7% of affected users)
- Customer satisfaction score dropped 12 points
### Business Impact
- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours
### Technical Impact
- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems
## Lessons Learned
### What Went Well
1. Alerting detected the issue before customer reports
2. Team collaborated effectively under pressure
3. Rollback procedure worked smoothly
4. Communication was clear and timely
### What Went Wrong
1. Code review missed critical change
2. Test coverage gap for connection pooling
3. Staging environment doesn't reflect production traffic
4. Alert thresholds were not tuned properly
### Where We Got Lucky
1. Incident occurred during business hours with full team available
2. Database handled the load without failing completely
3. No other incidents occurred simultaneously
## Action Items
| Priority | Action | Owner | Due Date | Ticket |
|----------|--------|-------|----------|--------|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
## Appendix
### Supporting Data
#### Error Rate Graph
[Link to Grafana dashboard snapshot]
#### Database Connection Graph
[Link to metrics]
### Related Incidents
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
### References
- [Connection Pool Best Practices](internal-wiki/connection-pools)
- [Deployment Runbook](internal-wiki/deployment-runbook)
```
### Template 2: 5 Whys Analysis
```markdown
# 5 Whys Analysis: [Incident]
## Problem Statement
Payment service experienced 47-minute outage due to database connection exhaustion.
## Analysis
### Why #1: Why did the service fail?
**Answer**: Database connections were exhausted, causing all new requests to fail.
**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
---
### Why #2: Why were database connections exhausted?
**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
---
### Why #3: Why did the code bypass the connection pool?
**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
**Evidence**: PR #1234 shows the change, made while fixing a different bug.
---
### Why #4: Why wasn't this caught in code review?
**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
**Evidence**: Review comments only discuss business logic.
---
### Why #5: Why isn't there a safety net for this type of change?
**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
## Root Causes Identified
1. **Primary**: Missing automated tests for infrastructure behavior
2. **Secondary**: Insufficient documentation of architectural patterns
3. **Tertiary**: Code review checklist doesn't include infrastructure considerations
## Systemic Improvements
| Root Cause | Improvement | Type |
|------------|-------------|------|
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
```
### Template 3: Quick Postmortem (Minor Incidents)
```markdown
# Quick Postmortem: [Brief Title]
**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3
## What Happened
API latency spiked to 5s due to cache miss storm after cache flush.
## Timeline
- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized
## Root Cause
Full cache flush for minor config update caused thundering herd.
## Fix
- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)
## Lessons
Don't full-flush cache in production; use targeted invalidation.
```
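The long-term fix in the example above (ENG-999) amounts to invalidating only the affected keys. A redis-cli sketch, assuming config entries share a `config:` key prefix (the prefix is hypothetical):
```bash
# Delete only the config keys; --scan iterates with SCAN, so it will not block Redis the way KEYS would.
redis-cli --scan --pattern 'config:*' | xargs -r redis-cli del
```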
## Facilitation Guide
### Running a Postmortem Meeting
```markdown
## Meeting Structure (60 minutes)
### 1. Opening (5 min)
- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms
### 2. Timeline Review (15 min)
- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline
### 3. Analysis Discussion (20 min)
- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?
### 4. Action Items (15 min)
- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates
### 5. Closing (5 min)
- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed
## Facilitation Tips
- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
```
## Anti-Patterns to Avoid
| Anti-Pattern | Problem | Better Approach |
|--------------|---------|-----------------|
| **Blame game** | Shuts down learning | Focus on systems |
| **Shallow analysis** | Doesn't prevent recurrence | Ask "why" 5 times |
| **No action items** | Waste of time | Always have concrete next steps |
| **Unrealistic actions** | Never completed | Scope to achievable tasks |
| **No follow-up** | Actions forgotten | Track in ticketing system |
## Best Practices
### Do's
- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
- **Assign owners** - No orphan action items
- **Share widely** - Organizational learning
### Don'ts
- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning
- **Don't create busywork** - Actions should be meaningful
- **Don't skip follow-up** - Verify actions completed
## Resources
- [Google SRE - Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
- [Etsy's Blameless Postmortems](https://codeascraft.com/2012/05/22/blameless-postmortems/)
- [PagerDuty Postmortem Guide](https://postmortems.pagerduty.com/)