Mirror of https://github.com/wshobson/agents.git (synced 2026-03-18 09:37:15 +00:00)
style: format all files with prettier
@@ -20,12 +20,12 @@ Production-ready templates for incident response runbooks covering detection, tr
 
 ### 1. Incident Severity Levels
 
-| Severity | Impact | Response Time | Example |
-|----------|--------|---------------|---------|
-| **SEV1** | Complete outage, data loss | 15 min | Production down |
-| **SEV2** | Major degradation | 30 min | Critical feature broken |
-| **SEV3** | Minor impact | 2 hours | Non-critical bug |
-| **SEV4** | Minimal impact | Next business day | Cosmetic issue |
+| Severity | Impact                     | Response Time     | Example                 |
+| -------- | -------------------------- | ----------------- | ----------------------- |
+| **SEV1** | Complete outage, data loss | 15 min            | Production down         |
+| **SEV2** | Major degradation          | 30 min            | Critical feature broken |
+| **SEV3** | Minor impact               | 2 hours           | Non-critical bug        |
+| **SEV4** | Minimal impact             | Next business day | Cosmetic issue          |
 
 ### 2. Runbook Structure
 
@@ -45,28 +45,33 @@ Production-ready templates for incident response runbooks covering detection, tr
 
 ### Template 1: Service Outage Runbook
 
-```markdown
+````markdown
 # [Service Name] Outage Runbook
 
 ## Overview
+
 **Service**: Payment Processing Service
 **Owner**: Platform Team
 **Slack**: #payments-incidents
 **PagerDuty**: payments-oncall
 
 ## Impact Assessment
+
 - [ ] Which customers are affected?
 - [ ] What percentage of traffic is impacted?
 - [ ] Are there financial implications?
 - [ ] What's the blast radius?
 
 ## Detection
+
 ### Alerts
+
 - `payment_error_rate > 5%` (PagerDuty)
 - `payment_latency_p99 > 2s` (Slack)
 - `payment_success_rate < 95%` (PagerDuty)
 
 ### Dashboards
+
 - [Payment Service Dashboard](https://grafana/d/payments)
 - [Error Tracking](https://sentry.io/payments)
 - [Dependency Status](https://status.stripe.com)
@@ -74,6 +79,7 @@ Production-ready templates for incident response runbooks covering detection, tr
 ## Initial Triage (First 5 Minutes)
 
 ### 1. Assess Scope
+
 ```bash
 # Check service health
 kubectl get pods -n payments -l app=payment-service
@@ -84,24 +90,28 @@ kubectl rollout history deployment/payment-service -n payments
 # Check error rates
 curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
 ```
+````
 
 ### 2. Quick Health Checks
+
 - [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
 - [ ] Database connectivity? Check connection pool metrics
 - [ ] External dependencies? Check Stripe, bank API status
 - [ ] Recent changes? Check deploy history
 
 ### 3. Initial Classification
-| Symptom | Likely Cause | Go To Section |
-|---------|--------------|---------------|
-| All requests failing | Service down | Section 4.1 |
-| High latency | Database/dependency | Section 4.2 |
-| Partial failures | Code bug | Section 4.3 |
-| Spike in errors | Traffic surge | Section 4.4 |
+
+| Symptom              | Likely Cause        | Go To Section |
+| -------------------- | ------------------- | ------------- |
+| All requests failing | Service down        | Section 4.1   |
+| High latency         | Database/dependency | Section 4.2   |
+| Partial failures     | Code bug            | Section 4.3   |
+| Spike in errors      | Traffic surge       | Section 4.4   |
 
 ## Mitigation Procedures
 
 ### 4.1 Service Completely Down
+
 ```bash
 # Step 1: Check pod status
 kubectl get pods -n payments
@@ -123,6 +133,7 @@ kubectl rollout status deployment/payment-service -n payments
 ```
 
 ### 4.2 High Latency
+
 ```bash
 # Step 1: Check database connections
 kubectl exec -n payments deploy/payment-service -- \
@@ -147,6 +158,7 @@ kubectl set env deployment/payment-service \
 ```
 
 ### 4.3 Partial Failures (Specific Errors)
+
 ```bash
 # Step 1: Identify error pattern
 kubectl logs -n payments -l app=payment-service --tail=500 | \
@@ -167,6 +179,7 @@ psql -h $DB_HOST -c "
 ```
 
 ### 4.4 Traffic Surge
+
 ```bash
 # Step 1: Check current request rate
 kubectl top pods -n payments
@@ -200,6 +213,7 @@ EOF
 ```
 
 ## Verification Steps
+
 ```bash
 # Verify service is healthy
 curl -s https://api.company.com/payments/health | jq
@@ -215,6 +229,7 @@ curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(r
 ```
 
 ## Rollback Procedures
+
 ```bash
 # Rollback Kubernetes deployment
 kubectl rollout undo deployment/payment-service -n payments
@@ -229,16 +244,17 @@ curl -X POST https://api.company.com/internal/feature-flags \
 
 ## Escalation Matrix
 
-| Condition | Escalate To | Contact |
-|-----------|-------------|---------|
-| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
-| Data breach suspected | Security Team | #security-incidents |
-| Financial impact > $10k | Finance + Legal | @finance-oncall |
-| Customer communication needed | Support Lead | @support-lead |
+| Condition                     | Escalate To         | Contact             |
+| ----------------------------- | ------------------- | ------------------- |
+| > 15 min unresolved SEV1      | Engineering Manager | @manager (Slack)    |
+| Data breach suspected         | Security Team       | #security-incidents |
+| Financial impact > $10k       | Finance + Legal     | @finance-oncall     |
+| Customer communication needed | Support Lead        | @support-lead       |
 
 ## Communication Templates
 
 ### Initial Notification (Internal)
+
 ```
 🚨 INCIDENT: Payment Service Degradation
 
@@ -257,6 +273,7 @@ Updates in #payments-incidents
 ```
 
 ### Status Update
+
 ```
 📊 UPDATE: Payment Service Incident
 
@@ -276,6 +293,7 @@ ETA to Resolution: ~15 minutes
 ```
 
 ### Resolution Notification
+
 ```
 ✅ RESOLVED: Payment Service Incident
 
@@ -291,7 +309,8 @@ Follow-up:
 - Postmortem scheduled for [DATE]
 - Bug fix in progress
 ```
-```
+
+````
 
 ### Template 2: Database Incident Runbook
 
@@ -325,9 +344,10 @@ SELECT pg_terminate_backend(pid)
 FROM pg_stat_activity
 WHERE state = 'idle'
 AND query_start < now() - interval '10 minutes';
-```
+````
 
 ## Replication Lag
+
 ```sql
 -- Check lag on replica
 SELECT
@@ -343,6 +363,7 @@ SELECT
 ```
 
 ## Disk Space Critical
+
 ```bash
 # Check disk usage
 df -h /var/lib/postgresql/data
@@ -358,6 +379,7 @@ psql -c "VACUUM FULL large_table;"
 
 # If emergency, delete old data or expand disk
 ```
+
 ```
 
 ## Best Practices
@@ -381,3 +403,4 @@ psql -c "VACUUM FULL large_table;"
 - [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
 - [PagerDuty Incident Response](https://response.pagerduty.com/)
 - [Atlassian Incident Management](https://www.atlassian.com/incident-management)
+```
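
Review note: this commit only changes formatting, but the tables it touches carry the runbook's numeric thresholds, so it may help to see them as data. The following is a minimal Python sketch, not part of the commit, that encodes the response-time targets from the severity table and the three alert rules from the Detection section. The function names `response_overdue` and `fired_alerts` are hypothetical, and treating the rates as fractions between 0 and 1 is an assumption; the runbook itself only states percentages.

```python
from datetime import timedelta
from typing import Dict, List, Optional, Tuple

# Response-time targets copied from the severity table in the diff.
# SEV4 is "next business day", which has no fixed duration, so it maps to None.
RESPONSE_TARGETS: Dict[str, Optional[timedelta]] = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=2),
    "SEV4": None,
}


def response_overdue(severity: str, elapsed: timedelta) -> bool:
    """Return True when `elapsed` exceeds the response target for `severity`."""
    target = RESPONSE_TARGETS[severity]
    return target is not None and elapsed > target


def fired_alerts(
    error_rate: float, latency_p99_s: float, success_rate: float
) -> List[Tuple[str, str]]:
    """Evaluate the three alert rules listed under Detection.

    Rates are fractions in [0, 1] (an assumption made for this sketch).
    Returns (rule, channel) pairs for every rule that fires.
    """
    rules = [
        ("payment_error_rate > 5%", "PagerDuty", error_rate > 0.05),
        ("payment_latency_p99 > 2s", "Slack", latency_p99_s > 2.0),
        ("payment_success_rate < 95%", "PagerDuty", success_rate < 0.95),
    ]
    return [(rule, channel) for rule, channel, fired in rules if fired]
```

For example, `response_overdue("SEV1", timedelta(minutes=20))` returns `True`, and `fired_alerts(0.08, 1.2, 0.92)` returns the two PagerDuty rules while leaving the Slack latency rule out.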