mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
feat: add 5 new specialized agents with 20 skills
Add domain expert agents with comprehensive skill sets:
- service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability
- event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas
- vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search
- monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces
- threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements

Update all documentation to reflect correct counts:
- 67 plugins, 99 agents, 107 skills, 71 commands
@@ -0,0 +1,383 @@
---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
---

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| **SEV1** | Complete outage, data loss | 15 min | Production down |
| **SEV2** | Major degradation | 30 min | Critical feature broken |
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |
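
The severity table can be encoded once so that paging and reporting scripts share a single definition. The following is a minimal sketch, not part of the original skill: the function name `sev_response_time` is hypothetical, and the values are copied from the table above.

```bash
# Hypothetical helper (an assumption for illustration): maps a severity
# label from the table above to its target response time.
sev_response_time() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Calling `sev_response_time SEV2` prints the SEV2 response time; an unrecognized label returns a non-zero status so callers can fail loudly instead of paging with a wrong target.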

### 2. Runbook Structure

```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```
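
To start a new runbook from this structure, the nine sections can be emitted as a Markdown skeleton. This is a sketch under stated assumptions: `runbook_skeleton` is a hypothetical function name, and the title format is illustrative rather than prescribed by the skill.

```bash
# Hypothetical generator (an assumption for illustration): prints a
# Markdown skeleton containing the nine sections listed above.
runbook_skeleton() {
  local service="$1"
  local section
  printf '# %s Runbook\n' "$service"
  for section in \
    "Overview & Impact" \
    "Detection & Alerts" \
    "Initial Triage" \
    "Mitigation Steps" \
    "Root Cause Investigation" \
    "Resolution Procedures" \
    "Verification & Rollback" \
    "Communication Templates" \
    "Escalation Matrix"; do
    printf '\n## %s\n' "$section"
  done
}
```

For example, `runbook_skeleton "Payment Service" > payment-service-runbook.md` writes an empty runbook that can then be filled in section by section.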

## Runbook Templates

### Template 1: Service Outage Runbook

```markdown
# [Service Name] Outage Runbook

## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
# (-G with --data-urlencode keeps curl from globbing the {} and [] in the query)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100
# Logs from the previous (crashed) container instance
kubectl logs -n payments -l app=payment-service --tail=100 --previous

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# (a column alias cannot be used in WHERE, so the expression is repeated)
psql -h "$DB_HOST" -U "$DB_USER" -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# (replace <pid> with a pid returned by Step 2)
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency
# (assumes a curl-format.txt timing template in the current directory)
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If a specific endpoint is failing, disable it with a feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -U "$DB_USER" -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge
```bash
# Step 1: Check current load
# (kubectl top reports CPU and memory usage, not request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq .

# Verify error rate is back to normal
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))" \
  | jq .

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh "$MIGRATION_VERSION"

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

## Communication Templates

### Initial Notification (Internal)
```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update
```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification
```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
```
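
The first rule in the template's escalation matrix (a SEV1 unresolved for more than 15 minutes goes to the Engineering Manager) is easy to miss under pressure, so it can be checked mechanically. A minimal sketch, assuming a hypothetical helper `needs_manager_escalation` that receives the severity plus the incident start time and the current time as epoch seconds:

```bash
# Hypothetical helper (an assumption for illustration): succeeds when the
# "> 15 min unresolved SEV1" rule from the escalation matrix applies.
# Arguments: severity label, incident start (epoch s), current time (epoch s).
needs_manager_escalation() {
  local severity="$1" started="$2" now="$3"
  local elapsed=$(( now - started ))
  # 900 seconds = 15 minutes; the rule is strictly "more than" 15 minutes
  [ "$severity" = "SEV1" ] && [ "$elapsed" -gt 900 ]
}

# Usage (INCIDENT_START is assumed to hold the start time in epoch seconds):
#   needs_manager_escalation SEV1 "$INCIDENT_START" "$(date +%s)" \
#     && echo "Escalate to Engineering Manager"
```

Comparing in seconds rather than whole minutes avoids rounding an incident that is 15 minutes and 59 seconds old down to "not yet over 15 minutes".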

### Template 2: Database Incident Runbook

```markdown
# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when the connection became idle; skip our own session)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND pid <> pg_backend_pid()
  AND state_change < now() - interval '10 minutes';
```

## Replication Lag
```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# Caution: VACUUM FULL rewrites the table, so it holds an ACCESS EXCLUSIVE
# lock and temporarily needs free space for a full copy of the table.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
```
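
The disk usage check in the database runbook can be turned into a threshold test for use in scripts or alerts. The sketch below is an assumption-laden illustration: `disk_over_threshold` is a hypothetical name, the 85% value in the usage line is arbitrary, and the parser expects POSIX `df -P` output with mount points that contain no spaces.

```bash
# Hypothetical helper (an assumption for illustration): reads `df -P`
# output on stdin and prints every mount point whose usage is at or
# above the given percentage, together with that percentage.
disk_over_threshold() {
  local threshold="$1"
  awk -v t="$threshold" 'NR > 1 {
    use = $5                # Capacity column, e.g. "92%"
    sub(/%/, "", use)       # strip the percent sign
    if (use + 0 >= t + 0) print $6, use "%"
  }'
}

# Usage (85 is an illustrative threshold, not a value from the runbook):
#   df -P /var/lib/postgresql/data | disk_over_threshold 85
```

Reading from stdin keeps the function testable with canned `df` output and lets the same check run against a remote host's output piped over SSH.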

## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident
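
One way to honor "Don't skip verification" when runbook steps are scripted is to wrap each command so that a failure is reported and stops the sequence. This is a minimal sketch; `run_step` is a hypothetical wrapper, not something the skill defines.

```bash
# Hypothetical wrapper (an assumption for illustration): announces a
# step, runs it, and returns the command's exit status on failure so
# that chained steps stop instead of assuming success.
run_step() {
  local description="$1"
  shift
  echo "==> ${description}"
  if "$@"; then
    echo "    OK"
  else
    local status=$?
    echo "    FAILED (exit ${status}): $*" >&2
    return "$status"
  fi
}

# Usage, chaining two commands from the service outage template:
#   run_step "Roll back deployment" \
#     kubectl rollout undo deployment/payment-service -n payments &&
#   run_step "Wait for rollout" \
#     kubectl rollout status deployment/payment-service -n payments
```

Chaining with `&&` means the second step runs only if the first one was confirmed to succeed.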

## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)