mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 17:47:16 +00:00
Remove references to non-existent resource files (references/, assets/, scripts/, examples/) from 115 skill SKILL.md files. These sections pointed to directories and files that were never created, causing confusion when users install skills. Also fix broken Code of Conduct links in issue templates to use absolute GitHub URLs instead of relative paths that 404.
400 lines
10 KiB
Markdown
400 lines
10 KiB
Markdown
---
|
|
name: incident-runbook-templates
|
|
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
|
|
---
|
|
|
|
# Incident Runbook Templates
|
|
|
|
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
|
|
|
|
## When to Use This Skill
|
|
|
|
- Creating incident response procedures
|
|
- Building service-specific runbooks
|
|
- Establishing escalation paths
|
|
- Documenting recovery procedures
|
|
- Responding to active incidents
|
|
- Onboarding on-call engineers
|
|
|
|
## Core Concepts
|
|
|
|
### 1. Incident Severity Levels
|
|
|
|
| Severity | Impact | Response Time | Example |
|
|
| -------- | -------------------------- | ----------------- | ----------------------- |
|
|
| **SEV1** | Complete outage, data loss | 15 min | Production down |
|
|
| **SEV2** | Major degradation | 30 min | Critical feature broken |
|
|
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
|
|
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |
|
|
|
|
### 2. Runbook Structure
|
|
|
|
```
|
|
1. Overview & Impact
|
|
2. Detection & Alerts
|
|
3. Initial Triage
|
|
4. Mitigation Steps
|
|
5. Root Cause Investigation
|
|
6. Resolution Procedures
|
|
7. Verification & Rollback
|
|
8. Communication Templates
|
|
9. Escalation Matrix
|
|
```
|
|
|
|
## Runbook Templates
|
|
|
|
### Template 1: Service Outage Runbook
|
|
|
|
````markdown
|
|
# [Service Name] Outage Runbook
|
|
|
|
## Overview
|
|
|
|
**Service**: Payment Processing Service
|
|
**Owner**: Platform Team
|
|
**Slack**: #payments-incidents
|
|
**PagerDuty**: payments-oncall
|
|
|
|
## Impact Assessment
|
|
|
|
- [ ] Which customers are affected?
|
|
- [ ] What percentage of traffic is impacted?
|
|
- [ ] Are there financial implications?
|
|
- [ ] What's the blast radius?
|
|
|
|
## Detection
|
|
|
|
### Alerts
|
|
|
|
- `payment_error_rate > 5%` (PagerDuty)
|
|
- `payment_latency_p99 > 2s` (Slack)
|
|
- `payment_success_rate < 95%` (PagerDuty)
|
|
|
|
### Dashboards
|
|
|
|
- [Payment Service Dashboard](https://grafana/d/payments)
|
|
- [Error Tracking](https://sentry.io/payments)
|
|
- [Dependency Status](https://status.stripe.com)
|
|
|
|
## Initial Triage (First 5 Minutes)
|
|
|
|
### 1. Assess Scope
|
|
|
|
```bash
|
|
# Check service health
|
|
kubectl get pods -n payments -l app=payment-service
|
|
|
|
# Check recent deployments
|
|
kubectl rollout history deployment/payment-service -n payments
|
|
|
|
# Check error rates
|
|
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
|
|
```
|
|
````
|
|
|
|
### 2. Quick Health Checks
|
|
|
|
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
|
|
- [ ] Database connectivity? Check connection pool metrics
|
|
- [ ] External dependencies? Check Stripe, bank API status
|
|
- [ ] Recent changes? Check deploy history
|
|
|
|
### 3. Initial Classification
|
|
|
|
| Symptom | Likely Cause | Go To Section |
|
|
| -------------------- | ------------------- | ------------- |
|
|
| All requests failing | Service down | Section 4.1 |
|
|
| High latency | Database/dependency | Section 4.2 |
|
|
| Partial failures | Code bug | Section 4.3 |
|
|
| Spike in errors | Traffic surge | Section 4.4 |
|
|
|
|
## Mitigation Procedures
|
|
|
|
### 4.1 Service Completely Down
|
|
|
|
```bash
|
|
# Step 1: Check pod status
|
|
kubectl get pods -n payments
|
|
|
|
# Step 2: If pods are crash-looping, check logs
|
|
kubectl logs -n payments -l app=payment-service --tail=100
|
|
|
|
# Step 3: Check recent deployments
|
|
kubectl rollout history deployment/payment-service -n payments
|
|
|
|
# Step 4: ROLLBACK if recent deploy is suspect
|
|
kubectl rollout undo deployment/payment-service -n payments
|
|
|
|
# Step 5: Scale up if resource constrained
|
|
kubectl scale deployment/payment-service -n payments --replicas=10
|
|
|
|
# Step 6: Verify recovery
|
|
kubectl rollout status deployment/payment-service -n payments
|
|
```
|
|
|
|
### 4.2 High Latency
|
|
|
|
```bash
|
|
# Step 1: Check database connections
|
|
kubectl exec -n payments deploy/payment-service -- \
|
|
curl localhost:8080/metrics | grep db_pool
|
|
|
|
# Step 2: Check slow queries (if DB issue)
|
|
psql -h $DB_HOST -U $DB_USER -c "
|
|
SELECT pid, now() - query_start AS duration, query
|
|
FROM pg_stat_activity
|
|
WHERE state = 'active' AND duration > interval '5 seconds'
|
|
ORDER BY duration DESC;"
|
|
|
|
# Step 3: Kill long-running queries if needed
|
|
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(pid);"
|
|
|
|
# Step 4: Check external dependency latency
|
|
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health
|
|
|
|
# Step 5: Enable circuit breaker if dependency is slow
|
|
kubectl set env deployment/payment-service \
|
|
STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
|
|
```
|
|
|
|
### 4.3 Partial Failures (Specific Errors)
|
|
|
|
```bash
|
|
# Step 1: Identify error pattern
|
|
kubectl logs -n payments -l app=payment-service --tail=500 | \
|
|
grep -i error | sort | uniq -c | sort -rn | head -20
|
|
|
|
# Step 2: Check error tracking
|
|
# Go to Sentry: https://sentry.io/payments
|
|
|
|
# Step 3: If specific endpoint, enable feature flag to disable
|
|
curl -X POST https://api.company.com/internal/feature-flags \
|
|
-d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'
|
|
|
|
# Step 4: If data issue, check recent data changes
|
|
psql -h $DB_HOST -c "
|
|
SELECT * FROM audit_log
|
|
WHERE table_name = 'payment_methods'
|
|
AND created_at > now() - interval '1 hour';"
|
|
```
|
|
|
|
### 4.4 Traffic Surge
|
|
|
|
```bash
|
|
# Step 1: Check current request rate
|
|
kubectl top pods -n payments
|
|
|
|
# Step 2: Scale horizontally
|
|
kubectl scale deployment/payment-service -n payments --replicas=20
|
|
|
|
# Step 3: Enable rate limiting
|
|
kubectl set env deployment/payment-service \
|
|
RATE_LIMIT_ENABLED=true \
|
|
RATE_LIMIT_RPS=1000 -n payments
|
|
|
|
# Step 4: If attack, block suspicious IPs
|
|
kubectl apply -f - <<EOF
|
|
apiVersion: networking.k8s.io/v1
|
|
kind: NetworkPolicy
|
|
metadata:
|
|
name: block-suspicious
|
|
namespace: payments
|
|
spec:
|
|
podSelector:
|
|
matchLabels:
|
|
app: payment-service
|
|
ingress:
|
|
- from:
|
|
- ipBlock:
|
|
cidr: 0.0.0.0/0
|
|
except:
|
|
- 192.168.1.0/24 # Suspicious range
|
|
EOF
|
|
```
|
|
|
|
## Verification Steps
|
|
|
|
```bash
|
|
# Verify service is healthy
|
|
curl -s https://api.company.com/payments/health | jq
|
|
|
|
# Verify error rate is back to normal
|
|
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'
|
|
|
|
# Verify latency is acceptable
|
|
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq
|
|
|
|
# Smoke test critical flows
|
|
./scripts/smoke-test-payments.sh
|
|
```
|
|
|
|
## Rollback Procedures
|
|
|
|
```bash
|
|
# Rollback Kubernetes deployment
|
|
kubectl rollout undo deployment/payment-service -n payments
|
|
|
|
# Rollback database migration (if applicable)
|
|
./scripts/db-rollback.sh $MIGRATION_VERSION
|
|
|
|
# Rollback feature flag
|
|
curl -X POST https://api.company.com/internal/feature-flags \
|
|
-d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
|
|
```
|
|
|
|
## Escalation Matrix
|
|
|
|
| Condition | Escalate To | Contact |
|
|
| ----------------------------- | ------------------- | ------------------- |
|
|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
|
|
| Data breach suspected | Security Team | #security-incidents |
|
|
| Financial impact > $10k | Finance + Legal | @finance-oncall |
|
|
| Customer communication needed | Support Lead | @support-lead |
|
|
|
|
## Communication Templates
|
|
|
|
### Initial Notification (Internal)
|
|
|
|
```
|
|
🚨 INCIDENT: Payment Service Degradation
|
|
|
|
Severity: SEV2
|
|
Status: Investigating
|
|
Impact: ~20% of payment requests failing
|
|
Start Time: [TIME]
|
|
Incident Commander: [NAME]
|
|
|
|
Current Actions:
|
|
- Investigating root cause
|
|
- Scaling up service
|
|
- Monitoring dashboards
|
|
|
|
Updates in #payments-incidents
|
|
```
|
|
|
|
### Status Update
|
|
|
|
```
|
|
📊 UPDATE: Payment Service Incident
|
|
|
|
Status: Mitigating
|
|
Impact: Reduced to ~5% failure rate
|
|
Duration: 25 minutes
|
|
|
|
Actions Taken:
|
|
- Rolled back deployment v2.3.4 → v2.3.3
|
|
- Scaled service from 5 → 10 replicas
|
|
|
|
Next Steps:
|
|
- Continuing to monitor
|
|
- Root cause analysis in progress
|
|
|
|
ETA to Resolution: ~15 minutes
|
|
```
|
|
|
|
### Resolution Notification
|
|
|
|
```
|
|
✅ RESOLVED: Payment Service Incident
|
|
|
|
Duration: 45 minutes
|
|
Impact: ~5,000 affected transactions
|
|
Root Cause: Memory leak in v2.3.4
|
|
|
|
Resolution:
|
|
- Rolled back to v2.3.3
|
|
- Transactions auto-retried successfully
|
|
|
|
Follow-up:
|
|
- Postmortem scheduled for [DATE]
|
|
- Bug fix in progress
|
|
```
|
|
|
|
````
|
|
|
|
### Template 2: Database Incident Runbook
|
|
|
|
```markdown
|
|
# Database Incident Runbook
|
|
|
|
## Quick Reference
|
|
| Issue | Command |
|
|
|-------|---------|
|
|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
|
|
| Kill query | `SELECT pg_terminate_backend(pid);` |
|
|
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
|
|
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
|
|
|
|
## Connection Pool Exhaustion
|
|
```sql
|
|
-- Check current connections
|
|
SELECT datname, usename, state, count(*)
|
|
FROM pg_stat_activity
|
|
GROUP BY datname, usename, state
|
|
ORDER BY count(*) DESC;
|
|
|
|
-- Identify long-running connections
|
|
SELECT pid, usename, datname, state, query_start, query
|
|
FROM pg_stat_activity
|
|
WHERE state != 'idle'
|
|
ORDER BY query_start;
|
|
|
|
-- Terminate idle connections
|
|
SELECT pg_terminate_backend(pid)
|
|
FROM pg_stat_activity
|
|
WHERE state = 'idle'
|
|
AND query_start < now() - interval '10 minutes';
|
|
````
|
|
|
|
## Replication Lag
|
|
|
|
```sql
|
|
-- Check lag on replica
|
|
SELECT
|
|
CASE
|
|
WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
|
|
ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
|
|
END AS lag_seconds;
|
|
|
|
-- If lag > 60s, consider:
|
|
-- 1. Check network between primary/replica
|
|
-- 2. Check replica disk I/O
|
|
-- 3. Consider failover if unrecoverable
|
|
```
|
|
|
|
## Disk Space Critical
|
|
|
|
```bash
|
|
# Check disk usage
|
|
df -h /var/lib/postgresql/data
|
|
|
|
# Find large tables
|
|
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
|
|
FROM pg_catalog.pg_statio_user_tables
|
|
ORDER BY pg_total_relation_size(relid) DESC
|
|
LIMIT 10;"
|
|
|
|
# VACUUM to reclaim space
|
|
psql -c "VACUUM FULL large_table;"
|
|
|
|
# If emergency, delete old data or expand disk
|
|
```
|
|
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
### Do's
|
|
- **Keep runbooks updated** - Review after every incident
|
|
- **Test runbooks regularly** - Game days, chaos engineering
|
|
- **Include rollback steps** - Always have an escape hatch
|
|
- **Document assumptions** - What must be true for steps to work
|
|
- **Link to dashboards** - Quick access during stress
|
|
|
|
### Don'ts
|
|
- **Don't assume knowledge** - Write for 3 AM brain
|
|
- **Don't skip verification** - Confirm each step worked
|
|
- **Don't forget communication** - Keep stakeholders informed
|
|
- **Don't work alone** - Escalate early
|
|
- **Don't skip postmortems** - Learn from every incident
|