style: format all files with prettier

Seth Hobson
2026-01-19 17:07:03 -05:00
parent 8d37048deb
commit 56848874a2
355 changed files with 15215 additions and 10241 deletions

@@ -20,12 +20,12 @@ Production-ready templates for incident response runbooks covering detection, tr
### 1. Incident Severity Levels
-| Severity | Impact | Response Time | Example |
-|----------|--------|---------------|---------|
-| **SEV1** | Complete outage, data loss | 15 min | Production down |
-| **SEV2** | Major degradation | 30 min | Critical feature broken |
-| **SEV3** | Minor impact | 2 hours | Non-critical bug |
-| **SEV4** | Minimal impact | Next business day | Cosmetic issue |
+| Severity | Impact | Response Time | Example |
+| -------- | -------------------------- | ----------------- | ----------------------- |
+| **SEV1** | Complete outage, data loss | 15 min | Production down |
+| **SEV2** | Major degradation | 30 min | Critical feature broken |
+| **SEV3** | Minor impact | 2 hours | Non-critical bug |
+| **SEV4** | Minimal impact | Next business day | Cosmetic issue |
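
As a minimal sketch, the severity table could be encoded as a small shell lookup so that on-call scripts quote the same response targets as the runbook. The function name `sev_response_time` is hypothetical and not part of the original file; the values are copied from the table above.

```bash
# Hypothetical helper (not in the original runbook): print the response
# time target for a severity label, using the values from the table.
sev_response_time() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *)
      # Unknown labels are reported on stderr and signalled via exit status.
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}
```

For example, `sev_response_time SEV2` prints `30 min`, and an unrecognized label returns a non-zero status so a calling script can stop early.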
### 2. Runbook Structure
@@ -45,28 +45,33 @@ Production-ready templates for incident response runbooks covering detection, tr
### Template 1: Service Outage Runbook
```markdown
````markdown
# [Service Name] Outage Runbook
## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall
## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?
## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
@@ -74,6 +79,7 @@ Production-ready templates for incident response runbooks covering detection, tr
## Initial Triage (First 5 Minutes)
### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
@@ -84,24 +90,28 @@ kubectl rollout history deployment/payment-service -n payments
# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
````
### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history
### 3. Initial Classification
-| Symptom | Likely Cause | Go To Section |
-|---------|--------------|---------------|
-| All requests failing | Service down | Section 4.1 |
-| High latency | Database/dependency | Section 4.2 |
-| Partial failures | Code bug | Section 4.3 |
-| Spike in errors | Traffic surge | Section 4.4 |
+| Symptom | Likely Cause | Go To Section |
+| -------------------- | ------------------- | ------------- |
+| All requests failing | Service down | Section 4.1 |
+| High latency | Database/dependency | Section 4.2 |
+| Partial failures | Code bug | Section 4.3 |
+| Spike in errors | Traffic surge | Section 4.4 |
## Mitigation Procedures
### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments
@@ -123,6 +133,7 @@ kubectl rollout status deployment/payment-service -n payments
```
### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
@@ -147,6 +158,7 @@ kubectl set env deployment/payment-service \
```
### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
@@ -167,6 +179,7 @@ psql -h $DB_HOST -c "
```
### 4.4 Traffic Surge
```bash
# Step 1: Check current request rate
kubectl top pods -n payments
@@ -200,6 +213,7 @@ EOF
```
## Verification Steps
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq
@@ -215,6 +229,7 @@ curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(r
```
## Rollback Procedures
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments
@@ -229,16 +244,17 @@ curl -X POST https://api.company.com/internal/feature-flags \
## Escalation Matrix
-| Condition | Escalate To | Contact |
-|-----------|-------------|---------|
-| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
-| Data breach suspected | Security Team | #security-incidents |
-| Financial impact > $10k | Finance + Legal | @finance-oncall |
-| Customer communication needed | Support Lead | @support-lead |
+| Condition | Escalate To | Contact |
+| ----------------------------- | ------------------- | ------------------- |
+| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
+| Data breach suspected | Security Team | #security-incidents |
+| Financial impact > $10k | Finance + Legal | @finance-oncall |
+| Customer communication needed | Support Lead | @support-lead |
## Communication Templates
### Initial Notification (Internal)
```
🚨 INCIDENT: Payment Service Degradation
@@ -257,6 +273,7 @@ Updates in #payments-incidents
```
### Status Update
```
📊 UPDATE: Payment Service Incident
@@ -276,6 +293,7 @@ ETA to Resolution: ~15 minutes
```
### Resolution Notification
```
✅ RESOLVED: Payment Service Incident
@@ -291,7 +309,8 @@ Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
```
````
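
The three alert thresholds listed under Detection in Template 1 can be sketched as a single check, which may help when replaying metric values during triage. This is an illustration under stated assumptions: the function name `check_alerts` and its argument order are hypothetical, and the metric values are passed in by hand rather than fetched from Prometheus.

```bash
# Hypothetical helper (not in the original runbook).
# Usage: check_alerts ERROR_RATE_PCT P99_LATENCY_SECONDS SUCCESS_RATE_PCT
# Prints one line per alert that would fire; exits 1 if any fired, 0 otherwise.
check_alerts() {
  awk -v err="$1" -v p99="$2" -v ok="$3" 'BEGIN {
    fired = 0
    # Thresholds and channels are taken from the Detection section of Template 1.
    if (err + 0 > 5)  { print "payment_error_rate > 5% (PagerDuty)";    fired = 1 }
    if (p99 + 0 > 2)  { print "payment_latency_p99 > 2s (Slack)";       fired = 1 }
    if (ok + 0 < 95)  { print "payment_success_rate < 95% (PagerDuty)"; fired = 1 }
    exit fired
  }'
}
```

Calling `check_alerts 7.5 0.8 99.1` prints only the error-rate line and returns a non-zero status. `awk` is used for the comparisons because POSIX shell arithmetic does not handle decimals.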
### Template 2: Database Incident Runbook
@@ -325,9 +344,10 @@ SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```
````
## Replication Lag
```sql
-- Check lag on replica
SELECT
@@ -343,6 +363,7 @@ SELECT
```
## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data
@@ -358,6 +379,7 @@ psql -c "VACUUM FULL large_table;"
# If emergency, delete old data or expand disk
```
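
The disk usage check above could be wrapped in a small guard so the result can be used in a script rather than read by eye. This is only a sketch: the function names `disk_used_pct` and `disk_is_critical` and the default threshold of 90 percent are assumptions, not values from the original runbook.

```bash
# Hypothetical helpers (not in the original runbook).

# Print the used-capacity percentage (digits only) for the filesystem holding PATH.
disk_used_pct() {
  # df -P forces the POSIX output format: one line per filesystem,
  # with column 5 holding the capacity, e.g. "42%".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is at or above the threshold.
# $1 = path, $2 = threshold percent (assumed default: 90).
disk_is_critical() {
  pct=$(disk_used_pct "$1")
  # An empty result means df could not report on the path.
  [ -n "$pct" ] || return 2
  [ "$pct" -ge "${2:-90}" ]
}
```

A possible use is `disk_is_critical /var/lib/postgresql/data && echo "disk critical"`, with the threshold adjusted to whatever the team treats as critical.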
```
## Best Practices
@@ -381,3 +403,4 @@ psql -c "VACUUM FULL large_table;"
- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
```