feat: add 5 new specialized agents with 20 skills

Add domain expert agents with comprehensive skill sets:
- service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability
- event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas
- vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search
- monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces
- threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements

Update all documentation to reflect correct counts:
- 67 plugins, 99 agents, 107 skills, 71 commands
2026-03-18 09:37:15 +00:00 · 2025-12-16 16:00:58 -05:00
parent c7ad381360
commit 01d93fc227
58 changed files with 24830 additions and 50 deletions
--- a/plugins/incident-response/skills/incident-runbook-templates/SKILL.md
+++ b/plugins/incident-response/skills/incident-runbook-templates/SKILL.md
@@ -0,0 +1,383 @@
+---
+name: incident-runbook-templates
+description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
+---
+
+# Incident Runbook Templates
+
+Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
+
+## When to Use This Skill
+
+- Creating incident response procedures
+- Building service-specific runbooks
+- Establishing escalation paths
+- Documenting recovery procedures
+- Responding to active incidents
+- Onboarding on-call engineers
+
+## Core Concepts
+
+### 1. Incident Severity Levels
+
+| Severity | Impact | Response Time | Example |
+|----------|--------|---------------|---------|
+| **SEV1** | Complete outage, data loss | 15 min | Production down |
+| **SEV2** | Major degradation | 30 min | Critical feature broken |
+| **SEV3** | Minor impact | 2 hours | Non-critical bug |
+| **SEV4** | Minimal impact | Next business day | Cosmetic issue |
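As a rough sketch of how the severity table could feed on-call tooling, the shell function below maps a severity label to its response-time target. The function name `response_target` is hypothetical and not part of this skill; the values are copied from the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): look up the response-time
# target for a severity label, using the values from the severity table.
response_target() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "next business day" ;;
    *)
      # Unknown labels are reported on stderr and signalled via exit status.
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}
```

Calling `response_target SEV2` prints the target for a SEV2 incident; an unrecognized label returns a non-zero status so a calling script can stop early.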
+
+### 2. Runbook Structure
+
+```
+1. Overview & Impact
+2. Detection & Alerts
+3. Initial Triage
+4. Mitigation Steps
+5. Root Cause Investigation
+6. Resolution Procedures
+7. Verification & Rollback
+8. Communication Templates
+9. Escalation Matrix
+```
+
+## Runbook Templates
+
+### Template 1: Service Outage Runbook
+
+```markdown
+# [Service Name] Outage Runbook
+
+## Overview
+**Service**: Payment Processing Service
+**Owner**: Platform Team
+**Slack**: #payments-incidents
+**PagerDuty**: payments-oncall
+
+## Impact Assessment
+- [ ] Which customers are affected?
+- [ ] What percentage of traffic is impacted?
+- [ ] Are there financial implications?
+- [ ] What's the blast radius?
+
+## Detection
+### Alerts
+- `payment_error_rate > 5%` (PagerDuty)
+- `payment_latency_p99 > 2s` (Slack)
+- `payment_success_rate < 95%` (PagerDuty)
+
+### Dashboards
+- [Payment Service Dashboard](https://grafana/d/payments)
+- [Error Tracking](https://sentry.io/payments)
+- [Dependency Status](https://status.stripe.com)
+
+## Initial Triage (First 5 Minutes)
+
+### 1. Assess Scope
+```bash
+# Check service health
+kubectl get pods -n payments -l app=payment-service
+
+# Check recent deployments
+kubectl rollout history deployment/payment-service -n payments
+
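+# Check error rates (-G with --data-urlencode stops curl from globbing the {} and [] in the PromQL)
+curl -sG "http://prometheus:9090/api/v1/query" \
+  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"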
+```
+
+### 2. Quick Health Checks
+- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
+- [ ] Database connectivity? Check connection pool metrics
+- [ ] External dependencies? Check Stripe, bank API status
+- [ ] Recent changes? Check deploy history
+
+### 3. Initial Classification
+| Symptom | Likely Cause | Go To Section |
+|---------|--------------|---------------|
+| All requests failing | Service down | Section 4.1 |
+| High latency | Database/dependency | Section 4.2 |
+| Partial failures | Code bug | Section 4.3 |
+| Spike in errors | Traffic surge | Section 4.4 |
+
+## Mitigation Procedures
+
+### 4.1 Service Completely Down
+```bash
+# Step 1: Check pod status
+kubectl get pods -n payments
+
+# Step 2: If pods are crash-looping, check logs
+kubectl logs -n payments -l app=payment-service --tail=100
+
+# Step 3: Check recent deployments
+kubectl rollout history deployment/payment-service -n payments
+
+# Step 4: ROLLBACK if recent deploy is suspect
+kubectl rollout undo deployment/payment-service -n payments
+
+# Step 5: Scale up if resource constrained
+kubectl scale deployment/payment-service -n payments --replicas=10
+
+# Step 6: Verify recovery
+kubectl rollout status deployment/payment-service -n payments
+```
+
+### 4.2 High Latency
+```bash
+# Step 1: Check database connections
+kubectl exec -n payments deploy/payment-service -- \
+  curl localhost:8080/metrics | grep db_pool
+
+# Step 2: Check slow queries (if DB issue)
+# Note: the "duration" alias cannot be referenced in WHERE, so the expression is repeated
+psql -h $DB_HOST -U $DB_USER -c "
+  SELECT pid, now() - query_start AS duration, query
+  FROM pg_stat_activity
+  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
+  ORDER BY duration DESC;"
+
+# Step 3: Kill long-running queries if needed (replace <PID> with a pid from Step 2)
+psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<PID>);"
+
+# Step 4: Check external dependency latency (inline -w format, no separate format file needed)
+curl -w 'total: %{time_total}s\n' -o /dev/null -s https://api.stripe.com/v1/health
+
+# Step 5: Enable circuit breaker if dependency is slow
+kubectl set env deployment/payment-service \
+  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
+```
+
+### 4.3 Partial Failures (Specific Errors)
+```bash
+# Step 1: Identify error pattern
+kubectl logs -n payments -l app=payment-service --tail=500 | \
+  grep -i error | sort | uniq -c | sort -rn | head -20
+
+# Step 2: Check error tracking
+# Go to Sentry: https://sentry.io/payments
+
+# Step 3: If a specific endpoint is failing, disable it via feature flag
+curl -X POST https://api.company.com/internal/feature-flags \
+  -H 'Content-Type: application/json' \
+  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'
+
+# Step 4: If data issue, check recent data changes
+psql -h $DB_HOST -c "
+  SELECT * FROM audit_log
+  WHERE table_name = 'payment_methods'
+  AND created_at > now() - interval '1 hour';"
+```
+
+### 4.4 Traffic Surge
+```bash
+# Step 1: Check current load (kubectl top reports CPU/memory usage, not request rate)
+kubectl top pods -n payments
+
+# Step 2: Scale horizontally
+kubectl scale deployment/payment-service -n payments --replicas=20
+
+# Step 3: Enable rate limiting
+kubectl set env deployment/payment-service \
+  RATE_LIMIT_ENABLED=true \
+  RATE_LIMIT_RPS=1000 -n payments
+
+# Step 4: If attack, block suspicious IPs
+kubectl apply -f - <<EOF
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: block-suspicious
+  namespace: payments
+spec:
+  podSelector:
+    matchLabels:
+      app: payment-service
+  ingress:
+  - from:
+    - ipBlock:
+        cidr: 0.0.0.0/0
+        except:
+        - 192.168.1.0/24  # Suspicious range
+EOF
+```
+
+## Verification Steps
+```bash
+# Verify service is healthy
+curl -s https://api.company.com/payments/health | jq
+
+# Verify error rate is back to normal
+curl -sG "http://prometheus:9090/api/v1/query" \
+  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'
+
+# Verify latency is acceptable
+curl -sG "http://prometheus:9090/api/v1/query" \
+  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq
+
+# Smoke test critical flows
+./scripts/smoke-test-payments.sh
+```
+
+## Rollback Procedures
+```bash
+# Rollback Kubernetes deployment
+kubectl rollout undo deployment/payment-service -n payments
+
+# Rollback database migration (if applicable)
+./scripts/db-rollback.sh $MIGRATION_VERSION
+
+# Rollback feature flag
+curl -X POST https://api.company.com/internal/feature-flags \
+  -H 'Content-Type: application/json' \
+  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
+```
+
+## Escalation Matrix
+
+| Condition | Escalate To | Contact |
+|-----------|-------------|---------|
+| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
+| Data breach suspected | Security Team | #security-incidents |
+| Financial impact > $10k | Finance + Legal | @finance-oncall |
+| Customer communication needed | Support Lead | @support-lead |
+
+## Communication Templates
+
+### Initial Notification (Internal)
+```
+🚨 INCIDENT: Payment Service Degradation
+
+Severity: SEV2
+Status: Investigating
+Impact: ~20% of payment requests failing
+Start Time: [TIME]
+Incident Commander: [NAME]
+
+Current Actions:
+- Investigating root cause
+- Scaling up service
+- Monitoring dashboards
+
+Updates in #payments-incidents
+```
+
+### Status Update
+```
+📊 UPDATE: Payment Service Incident
+
+Status: Mitigating
+Impact: Reduced to ~5% failure rate
+Duration: 25 minutes
+
+Actions Taken:
+- Rolled back deployment v2.3.4 → v2.3.3
+- Scaled service from 5 → 10 replicas
+
+Next Steps:
+- Continuing to monitor
+- Root cause analysis in progress
+
+ETA to Resolution: ~15 minutes
+```
+
+### Resolution Notification
+```
+✅ RESOLVED: Payment Service Incident
+
+Duration: 45 minutes
+Impact: ~5,000 affected transactions
+Root Cause: Memory leak in v2.3.4
+
+Resolution:
+- Rolled back to v2.3.3
+- Transactions auto-retried successfully
+
+Follow-up:
+- Postmortem scheduled for [DATE]
+- Bug fix in progress
+```
+```
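The time-based and cost-based rows of the Escalation Matrix in Template 1 can be expressed as a small check. This is a minimal sketch under stated assumptions: the function name `escalation_targets` and its three positional arguments are hypothetical, inputs are whole numbers, and the two judgment-based rows (suspected data breach, customer communication) are left to the incident commander.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill).
# Usage: escalation_targets SEVERITY MINUTES_UNRESOLVED FINANCIAL_IMPACT_USD
# Prints one escalation target per line, following the matrix thresholds.
escalation_targets() {
  local severity="$1" minutes="$2" impact_usd="$3"

  # Row 1: a SEV1 that has been unresolved for more than 15 minutes.
  if [ "$severity" = "SEV1" ] && [ "$minutes" -gt 15 ]; then
    echo "Engineering Manager"
  fi

  # Row 3: financial impact above $10k.
  if [ "$impact_usd" -gt 10000 ]; then
    echo "Finance + Legal"
  fi
}
```

For example, `escalation_targets SEV1 20 0` prints the first target only, while a SEV2 with low financial impact prints nothing.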
+
+### Template 2: Database Incident Runbook
+
+```markdown
+# Database Incident Runbook
+
+## Quick Reference
+| Issue | Command |
+|-------|---------|
+| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
+| Kill query | `SELECT pg_terminate_backend(<pid>);` |
+| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
+| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
+
+## Connection Pool Exhaustion
+```sql
+-- Check current connections
+SELECT datname, usename, state, count(*)
+FROM pg_stat_activity
+GROUP BY datname, usename, state
+ORDER BY count(*) DESC;
+
+-- Identify long-running connections
+SELECT pid, usename, datname, state, query_start, query
+FROM pg_stat_activity
+WHERE state != 'idle'
+ORDER BY query_start;
+
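+-- Terminate connections that have been idle for more than 10 minutes
+-- (state_change records when the session went idle; skip our own backend)
+SELECT pg_terminate_backend(pid)
+FROM pg_stat_activity
+WHERE state = 'idle'
+AND state_change < now() - interval '10 minutes'
+AND pid <> pg_backend_pid();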
+```
+
+## Replication Lag
+```sql
+-- Check lag on replica
+SELECT
+  CASE
+    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
+    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
+  END AS lag_seconds;
+
+-- If lag > 60s, consider:
+-- 1. Check network between primary/replica
+-- 2. Check replica disk I/O
+-- 3. Consider failover if unrecoverable
+```
+
+## Disk Space Critical
+```bash
+# Check disk usage
+df -h /var/lib/postgresql/data
+
+# Find large tables
+psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
+FROM pg_catalog.pg_statio_user_tables
+ORDER BY pg_total_relation_size(relid) DESC
+LIMIT 10;"
+
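+# VACUUM FULL rewrites the table to reclaim space; it takes an exclusive lock
+# and needs free disk for the new copy, so confirm headroom first
+psql -c "VACUUM FULL large_table;"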
+
+# If emergency, delete old data or expand disk
+```
+```
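The 60-second threshold noted in the Replication Lag section of Template 2 could be wired into a script along these lines. This is a sketch only: the function name `lag_status` is hypothetical, and it assumes the lag value arrives as a plain number of seconds (possibly fractional) or as an empty string when the query returns NULL.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill).
# Usage: lag_status LAG_SECONDS
# Prints "ok", "investigate" (lag above 60s), or "unknown" (no value).
lag_status() {
  local lag="$1"

  # An empty value means the query returned NULL, e.g. when run on a primary.
  if [ -z "$lag" ]; then
    echo "unknown"
    return 0
  fi

  # awk handles the fractional seconds that extract(epoch ...) can return.
  awk -v lag="$lag" 'BEGIN { if (lag + 0 > 60) print "investigate"; else print "ok" }'
}
```

One way to feed it is to capture the output of the lag query with `psql -At` (unaligned, tuples-only output) and pass that string as the argument.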
+
+## Best Practices
+
+### Do's
+- **Keep runbooks updated** - Review after every incident
+- **Test runbooks regularly** - Game days, chaos engineering
+- **Include rollback steps** - Always have an escape hatch
+- **Document assumptions** - What must be true for steps to work
+- **Link to dashboards** - Quick access during stress
+
+### Don'ts
+- **Don't assume knowledge** - Write for 3 AM brain
+- **Don't skip verification** - Confirm each step worked
+- **Don't forget communication** - Keep stakeholders informed
+- **Don't work alone** - Escalate early
+- **Don't skip postmortems** - Learn from every incident
+
+## Resources
+
+- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
+- [PagerDuty Incident Response](https://response.pagerduty.com/)
+- [Atlassian Incident Management](https://www.atlassian.com/incident-management)