mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
style: format all files with prettier
This commit is contained in:
@@ -20,12 +20,12 @@ Production-ready templates for incident response runbooks covering detection, tr

### 1. Incident Severity Levels

| Severity | Impact                     | Response Time     | Example                 |
| -------- | -------------------------- | ----------------- | ----------------------- |
| **SEV1** | Complete outage, data loss | 15 min            | Production down         |
| **SEV2** | Major degradation          | 30 min            | Critical feature broken |
| **SEV3** | Minor impact               | 2 hours           | Non-critical bug        |
| **SEV4** | Minimal impact             | Next business day | Cosmetic issue          |
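These targets can also be encoded for paging or reporting tooling; a minimal shell sketch of the table above (the `sla_for_severity` helper name is illustrative, not an existing tool):

```bash
#!/bin/sh
# Map a severity level to its response-time SLA, per the severity table.
# sla_for_severity is an illustrative helper for this runbook, not a standard CLI.
sla_for_severity() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Usage: `sla_for_severity SEV2` prints `30 min`.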

### 2. Runbook Structure
@@ -45,28 +45,33 @@ Production-ready templates for incident response runbooks covering detection, tr

### Template 1: Service Outage Runbook

````markdown
# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
@@ -74,6 +79,7 @@ Production-ready templates for incident response runbooks covering detection, tr

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
@@ -84,24 +90,28 @@ kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history
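The quick checks above can be scripted into a one-shot classifier; a minimal sketch, assuming curl-style status codes and the 2s latency alert threshold from the Detection section (the `classify_health` name and thresholds are illustrative):

```bash
#!/bin/sh
# Classify a health-check result from an HTTP status code and latency in ms.
# "000" is curl's code for a failed connection.
classify_health() {
  status="$1"; latency_ms="$2"
  if [ "$status" = "000" ] || [ "$status" -ge 500 ]; then
    echo down
  elif [ "$latency_ms" -gt 2000 ]; then  # mirrors the payment_latency_p99 > 2s alert
    echo degraded
  else
    echo healthy
  fi
}

# Example feed (endpoint is the one from the checklist, shown for illustration):
# code=$(curl -s -o /dev/null -w '%{http_code}' https://api.company.com/payments/health)
```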

### 3. Initial Classification

| Symptom              | Likely Cause        | Go To Section |
| -------------------- | ------------------- | ------------- |
| All requests failing | Service down        | Section 4.1   |
| High latency         | Database/dependency | Section 4.2   |
| Partial failures     | Code bug            | Section 4.3   |
| Spike in errors      | Traffic surge       | Section 4.4   |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments
@@ -123,6 +133,7 @@ kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
@@ -147,6 +158,7 @@ kubectl set env deployment/payment-service \
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
@@ -167,6 +179,7 @@ psql -h $DB_HOST -c "
```
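Identifying the dominant error pattern is a ranking exercise over the log lines; a minimal sketch that works on any stream of logs (the `ERROR <Name>` log shape is an assumption for illustration):

```bash
#!/bin/sh
# Rank error types by frequency. Reads log lines on stdin, e.g. from:
#   kubectl logs -n payments -l app=payment-service --tail=500 | rank_errors
# Assumes errors appear as "ERROR SomeErrorName" in the log line.
rank_errors() {
  grep -oE 'ERROR [A-Za-z_]+' | sort | uniq -c | sort -rn
}
```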

### 4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments
@@ -200,6 +213,7 @@ EOF
```
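When scaling for a surge, the replica math is a ceiling division of observed request rate by per-pod capacity; a small sketch (the per-pod RPS figure is illustrative, measure your own):

```bash
#!/bin/sh
# Estimate replicas needed for a surge: ceil(current_rps / per_pod_rps).
replicas_needed() {
  rps="$1"; per_pod="$2"
  echo $(( (rps + per_pod - 1) / per_pod ))
}

# e.g. 3000 RPS at ~250 RPS per pod:
# kubectl scale deployment/payment-service -n payments --replicas="$(replicas_needed 3000 250)"
```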

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq
@@ -215,6 +229,7 @@ curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(r
```
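To avoid declaring recovery on a single lucky probe, require several consecutive passing checks; a minimal sketch that reads one check result per line (the `stable_after` helper and "ok"/"fail" protocol are illustrative):

```bash
#!/bin/sh
# Declare recovery only after N consecutive passing checks, to avoid flapping.
# Reads "ok" or "fail" lines on stdin; N is the first argument.
stable_after() {
  need="$1"; streak=0
  while IFS= read -r line; do
    if [ "$line" = "ok" ]; then
      streak=$((streak + 1))
      [ "$streak" -ge "$need" ] && { echo "stable"; return 0; }
    else
      streak=0  # any failure resets the streak
    fi
  done
  echo "not stable"
  return 1
}
```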

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments
@@ -229,16 +244,17 @@ curl -X POST https://api.company.com/internal/feature-flags \

## Escalation Matrix

| Condition                     | Escalate To         | Contact             |
| ----------------------------- | ------------------- | ------------------- |
| > 15 min unresolved SEV1      | Engineering Manager | @manager (Slack)    |
| Data breach suspected         | Security Team       | #security-incidents |
| Financial impact > $10k       | Finance + Legal     | @finance-oncall     |
| Customer communication needed | Support Lead        | @support-lead       |

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

@@ -257,6 +273,7 @@ Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

@@ -276,6 +293,7 @@ ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

@@ -291,7 +309,8 @@ Follow-up:

- Postmortem scheduled for [DATE]
- Bug fix in progress
```
````

### Template 2: Database Incident Runbook

@@ -325,9 +344,10 @@ SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes';
```
````

## Replication Lag

```sql
-- Check lag on replica
SELECT
@@ -343,6 +363,7 @@ SELECT
```

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data
@@ -358,6 +379,7 @@ psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
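The disk decision above can be made mechanical; a minimal sketch that maps a usage percentage to an action (the 85%/95% thresholds and `disk_action` name are illustrative):

```bash
#!/bin/sh
# Decide on an action given disk usage as a bare percentage, e.g. "87".
disk_action() {
  pct="$1"
  if [ "$pct" -ge 95 ]; then
    echo "emergency: expand disk or delete old data"
  elif [ "$pct" -ge 85 ]; then
    echo "cleanup: archive WAL, vacuum large tables"
  else
    echo "ok"
  fi
}

# Feed from df (GNU coreutils), stripping the % sign:
# pct=$(df --output=pcent /var/lib/postgresql/data | tail -n 1 | tr -dc '0-9')
```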

## Best Practices

@@ -381,3 +403,4 @@ psql -c "VACUUM FULL large_table;"

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
@@ -20,13 +20,13 @@ Effective patterns for on-call shift transitions, ensuring continuity, context t

### 1. Handoff Components

| Component                  | Purpose                 |
| -------------------------- | ----------------------- |
| **Active Incidents**       | What's currently broken |
| **Ongoing Investigations** | Issues being debugged   |
| **Recent Changes**         | Deployments, configs    |
| **Known Issues**           | Workarounds in place    |
| **Upcoming Events**        | Maintenance, releases   |

### 2. Handoff Timing

@@ -47,7 +47,7 @@ Incoming:

### Template 1: Shift Handoff Document

````markdown
# On-Call Handoff: Platform Team

**Outgoing**: @alice (2024-01-15 to 2024-01-22)
@@ -59,6 +59,7 @@ Incoming:

## 🔴 Active Incidents

### None currently active

No active incidents at handoff time.

---
@@ -66,40 +67,48 @@ No active incidents at handoff time.

## 🟡 Ongoing Investigations

### 1. Intermittent API Timeouts (ENG-1234)

**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out

**Context**:

- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)

**Next Steps**:

- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed

**Resources**:

- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)

---

### 2. Memory Growth in Auth Service (ENG-1235)

**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)

**Context**:

- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly

**Next Steps**:

- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%

**Resources**:

- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)
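For an investigation like the memory growth above, a quick projection tells the incoming engineer how much runway remains; a minimal sketch assuming roughly linear growth in percentage points per day (a simplification for illustration, using the ~5%/day and 80% figures noted above):

```bash
#!/bin/sh
# Days until memory usage reaches a restart threshold, assuming linear growth.
# days_until <current_pct> <growth_pct_per_day> <threshold_pct>
days_until() {
  current="$1"; rate="$2"; threshold="$3"
  if [ "$current" -ge "$threshold" ]; then
    echo 0
    return
  fi
  # ceiling division: round partial days up
  echo $(( (threshold - current + rate - 1) / rate ))
}
```

Usage: `days_until 60 5 80` prints `4`, i.e. about four days before the 80% restart threshold.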
@@ -108,6 +117,7 @@ No active incidents at handoff time.

## 🟢 Resolved This Shift

### Payment Service Outage (2024-01-19)

- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
@@ -119,17 +129,20 @@ No active incidents at handoff time.

## 📋 Recent Changes

### Deployments

| Service      | Version | Time        | Notes                      |
| ------------ | ------- | ----------- | -------------------------- |
| api-gateway  | v3.2.1  | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0  | 01/20 10:00 | New profile features       |
| auth-service | v4.1.2  | 01/19 16:00 | Security patch             |

### Configuration Changes

- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75

### Infrastructure

- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0

@@ -138,11 +151,13 @@ No active incidents at handoff time.

## ⚠️ Known Issues & Workarounds

### 1. Slow Dashboard Loading

**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)

### 2. Flaky Integration Test

**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)

@@ -151,28 +166,29 @@ No active incidents at handoff time.

## 📅 Upcoming Events

| Date        | Event                | Impact              | Contact       |
| ----------- | -------------------- | ------------------- | ------------- |
| 01/23 02:00 | Database maintenance | 5 min read-only     | @dba-team     |
| 01/24 14:00 | Major release v5.0   | Monitor closely     | @release-team |
| 01/25       | Marketing campaign   | 2x traffic expected | @platform     |

---

## 📞 Escalation Reminders

| Issue Type      | First Escalation     | Second Escalation |
| --------------- | -------------------- | ----------------- |
| Payment issues  | @payments-oncall     | @payments-manager |
| Auth issues     | @auth-oncall         | @security-team    |
| Database issues | @dba-team            | @infra-manager    |
| Unknown/severe  | @engineering-manager | @vp-engineering   |

---

## 🔧 Quick Reference

### Common Commands

```bash
# Check service health
kubectl get pods -A | grep -v Running
@@ -186,8 +202,10 @@ psql -c "SELECT count(*) FROM pg_stat_activity;"

# Clear cache (emergency only)
redis-cli FLUSHDB
```
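`FLUSHDB` wipes the entire database; when only one key family is stale, a targeted delete is safer. A minimal sketch using `redis-cli`'s non-blocking `--scan` mode (the wrapper name and key pattern are illustrative):

```bash
#!/bin/sh
# Delete only keys matching a pattern, instead of flushing the whole DB.
# --scan iterates incrementally and does not block Redis the way KEYS does.
clear_keys_matching() {
  redis-cli --scan --pattern "$1" | while IFS= read -r key; do
    redis-cli DEL "$key"
  done
}

# clear_keys_matching 'session:*'
```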

### Important Links

- [Runbooks](https://wiki/runbooks)
- [Service Catalog](https://wiki/services)
- [Incident Slack](https://slack.com/incidents)
@@ -198,6 +216,7 @@ redis-cli FLUSHDB

## Handoff Checklist

### Outgoing Engineer

- [x] Document active incidents
- [x] Document ongoing investigations
- [x] List recent changes
@@ -206,13 +225,15 @@ redis-cli FLUSHDB

- [x] Sync with incoming engineer

### Incoming Engineer

- [ ] Read this document
- [ ] Join sync call
- [ ] Verify PagerDuty is routing to you
- [ ] Verify Slack notifications working
- [ ] Check VPN/access working
- [ ] Review critical dashboards

````

### Template 2: Quick Handoff (Async)

@@ -238,7 +259,7 @@ redis-cli FLUSHDB

## Questions?

I'll be available on Slack until 17:00 today.
```
````

### Template 3: Incident Handoff (Mid-Incident)

@@ -252,36 +273,43 @@ I'll be available on Slack until 17:00 today.

---

## Current State

- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min

## What We Know

1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow

## What We've Done

- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features

## What Needs to Happen

1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation

## Key People

- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)

## Communication

- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware

## Resources

- Incident channel: #inc-20240122-payment
- Dashboard: [Payment Service](https://grafana/d/payments)
- Runbook: [Payment Degradation](https://wiki/runbooks/payments)
@@ -289,6 +317,7 @@ I'll be available on Slack until 17:00 today.

---

**Incoming on-call (@bob) - Please confirm you have:**

- [ ] Joined #inc-20240122-payment
- [ ] Access to dashboards
- [ ] Understand current state

@@ -331,6 +360,7 @@ I'll be available on Slack until 17:00 today.

## Pre-Shift Checklist

### Access Verification

- [ ] VPN working
- [ ] kubectl access to all clusters
- [ ] Database read access
@@ -338,18 +368,21 @@ I'll be available on Slack until 17:00 today.

- [ ] PagerDuty app installed and logged in

### Alerting Setup

- [ ] PagerDuty schedule shows you as primary
- [ ] Phone notifications enabled
- [ ] Slack notifications for incident channels
- [ ] Test alert received and acknowledged

### Knowledge Refresh

- [ ] Review recent incidents (past 2 weeks)
- [ ] Check service changelog
- [ ] Skim critical runbooks
- [ ] Know escalation contacts

### Environment Ready

- [ ] Laptop charged and accessible
- [ ] Phone charged
- [ ] Quiet space available for calls

@@ -362,18 +395,21 @@ I'll be available on Slack until 17:00 today.

## Daily On-Call Routine

### Morning (start of day)

- [ ] Check overnight alerts
- [ ] Review dashboards for anomalies
- [ ] Check for any P0/P1 tickets created
- [ ] Skim incident channels for context

### Throughout Day

- [ ] Respond to alerts within SLA
- [ ] Document investigation progress
- [ ] Update team on significant issues
- [ ] Triage incoming pages

### End of Day

- [ ] Hand off any active issues
- [ ] Update investigation docs
- [ ] Note anything for next shift

@@ -400,18 +436,21 @@ I'll be available on Slack until 17:00 today.

## Escalation Triggers

### Immediate Escalation

- SEV1 incident declared
- Data breach suspected
- Unable to diagnose within 30 min
- Customer or legal escalation received

### Consider Escalation

- Issue spans multiple teams
- Requires expertise you don't have
- Business impact exceeds threshold
- You're uncertain about next steps

### How to Escalate

1. Page the appropriate escalation path
2. Provide brief context in Slack
3. Stay engaged until escalation acknowledges

@@ -421,6 +460,7 @@ I'll be available on Slack until 17:00 today.

## Best Practices

### Do's

- **Document everything** - Future you will thank you
- **Escalate early** - Better safe than sorry
- **Take breaks** - Alert fatigue is real
@@ -428,6 +468,7 @@ I'll be available on Slack until 17:00 today.

- **Test your setup** - Before incidents, not during

### Don'ts

- **Don't skip handoffs** - Context loss causes incidents
- **Don't hero** - Escalate when needed
- **Don't ignore alerts** - Even if they seem minor
@@ -20,13 +20,13 @@ Comprehensive guide to writing effective, blameless postmortems that drive organ

### 1. Blameless Culture

| Blame-Focused            | Blameless                         |
| ------------------------ | --------------------------------- |
| "Who caused this?"       | "What conditions allowed this?"   |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals       | Improve systems                   |
| Hide information         | Share learnings                   |
| Fear of speaking up      | Psychological safety              |

### 2. Postmortem Triggers

@@ -40,6 +40,7 @@ Comprehensive guide to writing effective, blameless postmortems that drive organ

## Quick Start

### Postmortem Timeline

```
Day 0: Incident occurs
Day 1-2: Draft postmortem document
@@ -67,6 +68,7 @@ Quarterly: Review patterns across incidents

On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.

**Impact**:

- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
@@ -74,18 +76,18 @@ On January 15, 2024, the payment processing service experienced a 47-minute outa

## Timeline (All times UTC)

| Time  | Event                                           |
| ----- | ----------------------------------------------- |
| 14:23 | Deployment v2.3.4 completed to production       |
| 14:31 | First alert: `payment_error_rate > 5%`          |
| 14:33 | On-call engineer @alice acknowledges alert      |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins              |
| 14:45 | Database connection exhaustion identified       |
| 14:52 | Decision to rollback deployment                 |
| 14:58 | Rollback to v2.3.3 initiated                    |
| 15:10 | Rollback complete, error rate dropping          |
| 15:18 | Service fully recovered, incident resolved      |
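Durations quoted in a postmortem should be derivable from the timeline; a small sketch that computes minutes between two same-day `HH:MM` timestamps (the `minutes_between` helper is illustrative):

```bash
#!/bin/sh
# Minutes between two HH:MM timestamps on the same UTC day.
# minutes_between 14:31 15:18  ->  47 (first alert to full recovery above)
minutes_between() {
  awk -v s="$1" -v e="$2" 'BEGIN {
    split(s, a, ":"); split(e, b, ":")
    print (b[1] * 60 + b[2]) - (a[1] * 60 + a[2])
  }'
}
```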

## Root Cause Analysis

@@ -111,13 +113,14 @@ The v2.3.4 deployment included a change to the database query pattern that inadv

- Why was the developer unfamiliar? → No documentation on connection management patterns

### System Diagram

```
[Client] → [Load Balancer] → [Payment Service] → [Database]
                                    ↓
                        Connection Pool (broken)
                                    ↓
                       Direct connections (cause)
```

## Detection

@@ -219,11 +222,13 @@ The deployment completed at 14:23, but the first alert didn't fire until 14:31 (

# 5 Whys Analysis: [Incident]

## Problem Statement

Payment service experienced 47-minute outage due to database connection exhaustion.

## Analysis

### Why #1: Why did the service fail?

**Answer**: Database connections were exhausted, causing all new requests to fail.

**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
|
||||
@@ -231,6 +236,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
|
||||
---
|
||||
|
||||
### Why #2: Why were database connections exhausted?
|
||||
|
||||
**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
|
||||
|
||||
**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
|
||||
@@ -238,6 +244,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
|
||||
---
|
||||
|
||||
### Why #3: Why did the code bypass the connection pool?
|
||||
|
||||
**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
|
||||
|
||||
**Evidence**: PR #1234 shows the change, made while fixing a different bug.
|
||||
@@ -245,6 +252,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
|
||||
---
|
||||
|
||||
### Why #4: Why wasn't this caught in code review?
|
||||
|
||||
**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
|
||||
|
||||
**Evidence**: Review comments only discuss business logic.
|
||||
@@ -252,6 +260,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
|
||||
---
|
||||
|
||||
### Why #5: Why isn't there a safety net for this type of change?
|
||||
|
||||
**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
|
||||
|
||||
**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.

@@ -264,12 +273,12 @@ Payment service experienced 47-minute outage due to database connection exhausti

## Systemic Improvements

| Root Cause    | Improvement                       | Type       |
| ------------- | --------------------------------- | ---------- |
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs  | Document connection patterns      | Prevention |
| Review gaps   | Update review checklist           | Detection  |
| No canary     | Implement canary deployments      | Mitigation |
```

### Template 3: Quick Postmortem (Minor Incidents)

@@ -280,9 +289,11 @@ Payment service experienced 47-minute outage due to database connection exhausti

**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3

## What Happened

API latency spiked to 5s due to cache miss storm after cache flush.

## Timeline

- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
@@ -290,13 +301,16 @@ API latency spiked to 5s due to cache miss storm after cache flush.

- 10:12 - Latency normalized

## Root Cause

Full cache flush for minor config update caused thundering herd.

## Fix

- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)

## Lessons

Don't full-flush cache in production; use targeted invalidation.
```

@@ -308,32 +322,38 @@ Don't full-flush cache in production; use targeted invalidation.

## Meeting Structure (60 minutes)

### 1. Opening (5 min)

- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms

### 2. Timeline Review (15 min)

- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline

### 3. Analysis Discussion (20 min)

- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?

### 4. Action Items (15 min)

- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates

### 5. Closing (5 min)

- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed

## Facilitation Tips

- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
@@ -343,17 +363,18 @@ Don't full-flush cache in production; use targeted invalidation.

## Anti-Patterns to Avoid

| Anti-Pattern            | Problem                    | Better Approach                 |
| ----------------------- | -------------------------- | ------------------------------- |
| **Blame game**          | Shuts down learning        | Focus on systems                |
| **Shallow analysis**    | Doesn't prevent recurrence | Ask "why" 5 times               |
| **No action items**     | Waste of time              | Always have concrete next steps |
| **Unrealistic actions** | Never completed            | Scope to achievable tasks       |
| **No follow-up**        | Actions forgotten          | Track in ticketing system       |

## Best Practices

### Do's

- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
@@ -361,6 +382,7 @@ Don't full-flush cache in production; use targeted invalidation.

- **Share widely** - Organizational learning

### Don'ts

- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning