style: format all files with prettier

2026-03-18 09:37:15 +00:00 · 2026-01-19 17:07:03 -05:00
parent 8d37048deb
commit 56848874a2
355 changed files with 15215 additions and 10241 deletions
--- a/plugins/incident-response/skills/postmortem-writing/SKILL.md
+++ b/plugins/incident-response/skills/postmortem-writing/SKILL.md
@@ -20,13 +20,13 @@ Comprehensive guide to writing effective, blameless postmortems that drive organ

 ### 1. Blameless Culture

-| Blame-Focused | Blameless |
-|---------------|-----------|
-| "Who caused this?" | "What conditions allowed this?" |
+| Blame-Focused            | Blameless                         |
+| ------------------------ | --------------------------------- |
+| "Who caused this?"       | "What conditions allowed this?"   |
 | "Someone made a mistake" | "The system allowed this mistake" |
-| Punish individuals | Improve systems |
-| Hide information | Share learnings |
-| Fear of speaking up | Psychological safety |
+| Punish individuals       | Improve systems                   |
+| Hide information         | Share learnings                   |
+| Fear of speaking up      | Psychological safety              |

 ### 2. Postmortem Triggers

@@ -40,6 +40,7 @@ Comprehensive guide to writing effective, blameless postmortems that drive organ
 ## Quick Start

 ### Postmortem Timeline
+
 ```
 Day 0: Incident occurs
 Day 1-2: Draft postmortem document
@@ -67,6 +68,7 @@ Quarterly: Review patterns across incidents
 On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.

 **Impact**:
+
 - 12,000 customers unable to complete purchases
 - Estimated revenue loss: $45,000
 - 847 support tickets created
@@ -74,18 +76,18 @@ On January 15, 2024, the payment processing service experienced a 47-minute outa

 ## Timeline (All times UTC)

-| Time | Event |
-|------|-------|
-| 14:23 | Deployment v2.3.4 completed to production |
-| 14:31 | First alert: `payment_error_rate > 5%` |
-| 14:33 | On-call engineer @alice acknowledges alert |
+| Time  | Event                                           |
+| ----- | ----------------------------------------------- |
+| 14:23 | Deployment v2.3.4 completed to production       |
+| 14:31 | First alert: `payment_error_rate > 5%`          |
+| 14:33 | On-call engineer @alice acknowledges alert      |
 | 14:35 | Initial investigation begins, error rate at 23% |
-| 14:41 | Incident declared SEV2, @bob joins |
-| 14:45 | Database connection exhaustion identified |
-| 14:52 | Decision to roll back deployment |
-| 14:58 | Rollback to v2.3.3 initiated |
-| 15:10 | Rollback complete, error rate dropping |
-| 15:18 | Service fully recovered, incident resolved |
+| 14:41 | Incident declared SEV2, @bob joins              |
+| 14:45 | Database connection exhaustion identified       |
+| 14:52 | Decision to roll back deployment                |
+| 14:58 | Rollback to v2.3.3 initiated                    |
+| 15:10 | Rollback complete, error rate dropping          |
+| 15:18 | Service fully recovered, incident resolved      |

 ## Root Cause Analysis

@@ -111,13 +113,14 @@ The v2.3.4 deployment included a change to the database query pattern that inadv
   - Why was developer unfamiliar? → No documentation on connection management patterns

 ### System Diagram
-
 ```
+
 [Client] → [Load Balancer] → [Payment Service] → [Database]
-                                    ↓
-                            Connection Pool (broken)
-                                    ↓
-                            Direct connections (cause)
+↓
+Connection Pool (broken)
+↓
+Direct connections (cause)
+
 ```

 ## Detection
@@ -219,11 +222,13 @@ The deployment completed at 14:23, but the first alert didn't fire until 14:31 (
 # 5 Whys Analysis: [Incident]

 ## Problem Statement
+
 Payment service experienced 47-minute outage due to database connection exhaustion.

 ## Analysis

 ### Why #1: Why did the service fail?
+
 **Answer**: Database connections were exhausted, causing all new requests to fail.

 **Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
@@ -231,6 +236,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
 ---

 ### Why #2: Why were database connections exhausted?
+
 **Answer**: Each incoming request opened a new database connection instead of using the connection pool.

 **Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
@@ -238,6 +244,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
 ---

 ### Why #3: Why did the code bypass the connection pool?
+
 **Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.

 **Evidence**: PR #1234 shows the change, made while fixing a different bug.
@@ -245,6 +252,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
 ---

 ### Why #4: Why wasn't this caught in code review?
+
 **Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.

 **Evidence**: Review comments only discuss business logic.
@@ -252,6 +260,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
 ---

 ### Why #5: Why isn't there a safety net for this type of change?
+
 **Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.

 **Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
@@ -264,12 +273,12 @@ Payment service experienced 47-minute outage due to database connection exhausti

 ## Systemic Improvements

-| Root Cause | Improvement | Type |
-|------------|-------------|------|
+| Root Cause    | Improvement                       | Type       |
+| ------------- | --------------------------------- | ---------- |
 | Missing tests | Add infrastructure behavior tests | Prevention |
-| Missing docs | Document connection patterns | Prevention |
-| Review gaps | Update review checklist | Detection |
-| No canary | Implement canary deployments | Mitigation |
+| Missing docs  | Document connection patterns      | Prevention |
+| Review gaps   | Update review checklist           | Detection  |
+| No canary     | Implement canary deployments      | Mitigation |
 ```
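The 5 Whys template above traces the outage to requests that opened direct connections instead of borrowing from the pool. The following is a minimal Python sketch of that failure mode, not the service's actual Java code. `FakeDatabase`, `ConnectionPool`, and `simulate` are hypothetical names, the pool size of 10 is arbitrary, and the sketch assumes direct connections are not released promptly, so a sequential loop stands in for concurrent traffic.

```python
import queue
from contextlib import contextmanager


class FakeDatabase:
    """Stand-in for a database server with a hard connection limit (hypothetical)."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self) -> object:
        if self.open_connections >= self.max_connections:
            raise ConnectionError("too many connections")
        self.open_connections += 1
        return object()  # opaque connection handle


class ConnectionPool:
    """Opens a fixed number of connections once and lends them out."""

    def __init__(self, db: FakeDatabase, size: int) -> None:
        self._idle: queue.Queue = queue.Queue()
        for _ in range(size):
            self._idle.put(db.connect())

    @contextmanager
    def connection(self, timeout: float = 1.0):
        conn = self._idle.get(timeout=timeout)  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool


def handle_request_direct(db: FakeDatabase) -> None:
    """Mimics the regression: each request opens its own connection and keeps it."""
    db.connect()


def handle_request_pooled(pool: ConnectionPool) -> None:
    """Mimics the intended pattern: borrow a pooled connection and hand it back."""
    with pool.connection():
        pass  # the query would run here


def simulate(requests: int, limit: int = 100) -> tuple[int, int]:
    """Return (failed_direct, failed_pooled) for a burst of requests."""
    direct_db = FakeDatabase(limit)
    failed_direct = 0
    for _ in range(requests):
        try:
            handle_request_direct(direct_db)
        except ConnectionError:
            failed_direct += 1

    pooled_db = FakeDatabase(limit)
    pool = ConnectionPool(pooled_db, size=10)
    failed_pooled = 0
    for _ in range(requests):
        try:
            handle_request_pooled(pool)
        except (ConnectionError, queue.Empty):
            failed_pooled += 1
    return failed_direct, failed_pooled
```

With a limit of 100, `simulate(600)` fails 500 requests on the direct path and none on the pooled path. A check of this shape is one candidate for the "infrastructure behavior tests" row in the Systemic Improvements table.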

 ### Template 3: Quick Postmortem (Minor Incidents)
@@ -280,9 +289,11 @@ Payment service experienced 47-minute outage due to database connection exhausti
 **Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3

 ## What Happened
+
 API latency spiked to 5s due to cache miss storm after cache flush.

 ## Timeline
+
 - 10:00 - Cache flush initiated for config update
 - 10:02 - Latency alerts fire
 - 10:05 - Identified as cache miss storm
@@ -290,13 +301,16 @@ API latency spiked to 5s due to cache miss storm after cache flush.
 - 10:12 - Latency normalized

 ## Root Cause
+
 Full cache flush for minor config update caused thundering herd.

 ## Fix
+
 - Immediate: Enabled cache warming
 - Long-term: Implement partial cache invalidation (ENG-999)

 ## Lessons
+
 Don't full-flush cache in production; use targeted invalidation.
 ```
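The quick postmortem above contrasts a full cache flush with targeted invalidation. As a rough illustration, the Python sketch below counts backend loads after each approach. `ReadThroughCache` and `loads_after` are hypothetical names, and the single-threaded loop only counts cache misses; it does not model the concurrency of a real thundering herd.

```python
from typing import Callable, Dict, List


class ReadThroughCache:
    """Tiny read-through cache that counts backend loads (illustrative only)."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader
        self._entries: Dict[str, str] = {}
        self.backend_loads = 0

    def get(self, key: str) -> str:
        if key not in self._entries:
            self.backend_loads += 1  # cache miss: fall through to the backend
            self._entries[key] = self._loader(key)
        return self._entries[key]

    def flush_all(self) -> None:
        """Full flush: every subsequent read misses."""
        self._entries.clear()

    def invalidate(self, key: str) -> None:
        """Targeted invalidation: only the changed key misses."""
        self._entries.pop(key, None)


def loads_after(changed_key: str, keys: List[str], full_flush: bool) -> int:
    """Count backend loads caused by one config change followed by a read of every key."""
    cache = ReadThroughCache(loader=lambda k: f"value-for-{k}")
    for key in keys:  # warm the cache first
        cache.get(key)
    cache.backend_loads = 0  # only count loads caused by the change
    if full_flush:
        cache.flush_all()
    else:
        cache.invalidate(changed_key)
    for key in keys:
        cache.get(key)
    return cache.backend_loads
```

For 1,000 warm keys, `loads_after("k7", keys, full_flush=True)` returns 1000 and the targeted variant returns 1. That gap is the extra backend load the "Lessons" line warns about.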

@@ -308,32 +322,38 @@ Don't full-flush cache in production; use targeted invalidation.
 ## Meeting Structure (60 minutes)

 ### 1. Opening (5 min)
+
 - Remind everyone of blameless culture
 - "We're here to learn, not to blame"
 - Review meeting norms

 ### 2. Timeline Review (15 min)
+
 - Walk through events chronologically
 - Ask clarifying questions
 - Identify gaps in timeline

 ### 3. Analysis Discussion (20 min)
+
 - What failed?
 - Why did it fail?
 - What conditions allowed this?
 - What would have prevented it?

 ### 4. Action Items (15 min)
+
 - Brainstorm improvements
 - Prioritize by impact and effort
 - Assign owners and due dates

 ### 5. Closing (5 min)
+
 - Summarize key learnings
 - Confirm action item owners
 - Schedule follow-up if needed

 ## Facilitation Tips
+
 - Keep discussion on track
 - Redirect blame to systems
 - Encourage quiet participants
@@ -343,17 +363,18 @@ Don't full-flush cache in production; use targeted invalidation.

 ## Anti-Patterns to Avoid

-| Anti-Pattern | Problem | Better Approach |
-|--------------|---------|-----------------|
-| **Blame game** | Shuts down learning | Focus on systems |
-| **Shallow analysis** | Doesn't prevent recurrence | Ask "why" 5 times |
-| **No action items** | Waste of time | Always have concrete next steps |
-| **Unrealistic actions** | Never completed | Scope to achievable tasks |
-| **No follow-up** | Actions forgotten | Track in ticketing system |
+| Anti-Pattern            | Problem                    | Better Approach                 |
+| ----------------------- | -------------------------- | ------------------------------- |
+| **Blame game**          | Shuts down learning        | Focus on systems                |
+| **Shallow analysis**    | Doesn't prevent recurrence | Ask "why" 5 times               |
+| **No action items**     | Waste of time              | Always have concrete next steps |
+| **Unrealistic actions** | Never completed            | Scope to achievable tasks       |
+| **No follow-up**        | Actions forgotten          | Track in ticketing system       |

 ## Best Practices

 ### Do's
+
 - **Start immediately** - Memory fades fast
 - **Be specific** - Exact times, exact errors
 - **Include graphs** - Visual evidence
@@ -361,6 +382,7 @@ Don't full-flush cache in production; use targeted invalidation.
 - **Share widely** - Organizational learning

 ### Don'ts
+
 - **Don't name and shame** - Ever
 - **Don't skip small incidents** - They reveal patterns
 - **Don't make it a blame doc** - That kills learning