feat: add 5 new specialized agents with 20 skills

Add domain expert agents with comprehensive skill sets: - service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability - event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas - vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search - monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces - threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements Update all documentation to reflect correct counts: - 67 plugins, 99 agents, 107 skills, 71 commands
2026-03-18 09:37:15 +00:00 · 2025-12-16 16:00:58 -05:00
parent c7ad381360
commit 01d93fc227
58 changed files with 24830 additions and 50 deletions
--- a/plugins/incident-response/skills/on-call-handoff-patterns/SKILL.md
+++ b/plugins/incident-response/skills/on-call-handoff-patterns/SKILL.md
@@ -0,0 +1,441 @@
+---
+name: on-call-handoff-patterns
+description: Master on-call shift handoffs with context transfer, escalation procedures, and documentation. Use when transitioning on-call responsibilities, documenting shift summaries, or improving on-call processes.
+---
+
+# On-Call Handoff Patterns
+
+Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.
+
+## When to Use This Skill
+
+- Transitioning on-call responsibilities
+- Writing shift handoff summaries
+- Documenting ongoing investigations
+- Establishing on-call rotation procedures
+- Improving handoff quality
+- Onboarding new on-call engineers
+
+## Core Concepts
+
+### 1. Handoff Components
+
+| Component | Purpose |
+|-----------|---------|
+| **Active Incidents** | What's currently broken |
+| **Ongoing Investigations** | Issues being debugged |
+| **Recent Changes** | Deployments, configs |
+| **Known Issues** | Workarounds in place |
+| **Upcoming Events** | Maintenance, releases |
+
+### 2. Handoff Timing
+
+```
+Recommended: 30 min overlap between shifts
+
+Outgoing:
+├── 15 min: Write handoff document
+└── 15 min: Sync call with incoming
+
+Incoming:
+├── 15 min: Review handoff document
+├── 15 min: Sync call with outgoing
+└── 5 min: Verify alerting setup
+```
+
+## Templates
+
+### Template 1: Shift Handoff Document
+
+```markdown
+# On-Call Handoff: Platform Team
+
+**Outgoing**: @alice (2024-01-15 to 2024-01-22)
+**Incoming**: @bob (2024-01-22 to 2024-01-29)
+**Handoff Time**: 2024-01-22 09:00 UTC
+
+---
+
+## 🔴 Active Incidents
+
+### None currently active
+No active incidents at handoff time.
+
+---
+
+## 🟡 Ongoing Investigations
+
+### 1. Intermittent API Timeouts (ENG-1234)
+**Status**: Investigating
+**Started**: 2024-01-20
+**Impact**: ~0.1% of requests timing out
+
+**Context**:
+- Timeouts correlate with database backup window (02:00-03:00 UTC)
+- Suspect backup process causing lock contention
+- Added extra logging in PR #567 (deployed 01/21)
+
+**Next Steps**:
+- [ ] Review new logs after tonight's backup
+- [ ] Consider moving backup window if confirmed
+
+**Resources**:
+- Dashboard: [API Latency](https://grafana/d/api-latency)
+- Thread: #platform-eng (01/20, 14:32)
+
+---
+
+### 2. Memory Growth in Auth Service (ENG-1235)
+**Status**: Monitoring
+**Started**: 2024-01-18
+**Impact**: None yet (proactive)
+
+**Context**:
+- Memory usage growing ~5% per day
+- No memory leak found in profiling
+- Suspect connection pool not releasing properly
+
+**Next Steps**:
+- [ ] Review heap dump from 01/21
+- [ ] Consider restart if usage > 80%
+
+**Resources**:
+- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
+- Analysis doc: [Memory Investigation](https://docs/eng-1235)
+
+---
+
+## 🟢 Resolved This Shift
+
+### Payment Service Outage (2024-01-19)
+- **Duration**: 23 minutes
+- **Root Cause**: Database connection exhaustion
+- **Resolution**: Rolled back v2.3.4, increased pool size
+- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
+- **Follow-up tickets**: ENG-1230, ENG-1231
+
+---
+
+## 📋 Recent Changes
+
+### Deployments
+| Service | Version | Time | Notes |
+|---------|---------|------|-------|
+| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
+| user-service | v2.8.0 | 01/20 10:00 | New profile features |
+| auth-service | v4.1.2 | 01/19 16:00 | Security patch |
+
+### Configuration Changes
+- 01/21: Increased API rate limit from 1000 to 1500 RPS
+- 01/20: Updated database connection pool max from 50 to 75
+
+### Infrastructure
+- 01/20: Added 2 nodes to Kubernetes cluster
+- 01/19: Upgraded Redis from 6.2 to 7.0
+
+---
+
+## ⚠️ Known Issues & Workarounds
+
+### 1. Slow Dashboard Loading
+**Issue**: Grafana dashboards slow on Monday mornings
+**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
+**Ticket**: OPS-456 (P3)
+
+### 2. Flaky Integration Test
+**Issue**: `test_payment_flow` fails intermittently in CI
+**Workaround**: Re-run failed job (usually passes on retry)
+**Ticket**: ENG-1200 (P2)
+
+---
+
+## 📅 Upcoming Events
+
+| Date | Event | Impact | Contact |
+|------|-------|--------|---------|
+| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
+| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
+| 01/25 | Marketing campaign | 2x traffic expected | @platform |
+
+---
+
+## 📞 Escalation Reminders
+
+| Issue Type | First Escalation | Second Escalation |
+|------------|------------------|-------------------|
+| Payment issues | @payments-oncall | @payments-manager |
+| Auth issues | @auth-oncall | @security-team |
+| Database issues | @dba-team | @infra-manager |
+| Unknown/severe | @engineering-manager | @vp-engineering |
+
+---
+
+## 🔧 Quick Reference
+
+### Common Commands
+```bash
+# Check service health
+kubectl get pods -A | grep -v Running
+
+# Recent deployments
+kubectl get events --sort-by='.lastTimestamp' | tail -20
+
+# Database connections
+psql -c "SELECT count(*) FROM pg_stat_activity;"
+
+# Clear cache (emergency only)
+redis-cli FLUSHDB
+```
+
+### Important Links
+- [Runbooks](https://wiki/runbooks)
+- [Service Catalog](https://wiki/services)
+- [Incident Slack](https://slack.com/incidents)
+- [PagerDuty](https://pagerduty.com/schedules)
+
+---
+
+## Handoff Checklist
+
+### Outgoing Engineer
+- [x] Document active incidents
+- [x] Document ongoing investigations
+- [x] List recent changes
+- [x] Note known issues
+- [x] Add upcoming events
+- [x] Sync with incoming engineer
+
+### Incoming Engineer
+- [ ] Read this document
+- [ ] Join sync call
+- [ ] Verify PagerDuty is routing to you
+- [ ] Verify Slack notifications working
+- [ ] Check VPN/access working
+- [ ] Review critical dashboards
+```
+
+### Template 2: Quick Handoff (Async)
+
+```markdown
+# Quick Handoff: @alice → @bob
+
+## TL;DR
+- No active incidents
+- 1 investigation ongoing (API timeouts, see ENG-1234)
+- Major release tomorrow (01/24) - be ready for issues
+
+## Watch List
+1. API latency around 02:00-03:00 UTC (backup window)
+2. Auth service memory (restart if > 80%)
+
+## Recent
+- Deployed api-gateway v3.2.1 yesterday (stable)
+- Increased rate limits to 1500 RPS
+
+## Coming Up
+- 01/23 02:00 - DB maintenance (5 min read-only)
+- 01/24 14:00 - v5.0 release
+
+## Questions?
+I'll be available on Slack until 17:00 today.
+```
+
+### Template 3: Incident Handoff (Mid-Incident)
+
+```markdown
+# INCIDENT HANDOFF: Payment Service Degradation
+
+**Incident Start**: 2024-01-22 08:15 UTC
+**Current Status**: Mitigating
+**Severity**: SEV2
+
+---
+
+## Current State
+- Error rate: 15% (down from 40%)
+- Mitigation in progress: scaling up pods
+- ETA to resolution: ~30 min
+
+## What We Know
+1. Root cause: Memory pressure on payment-service pods
+2. Triggered by: Unusual traffic spike (3x normal)
+3. Contributing: Inefficient query in checkout flow
+
+## What We've Done
+- Scaled payment-service from 5 → 15 pods
+- Enabled rate limiting on checkout endpoint
+- Disabled non-critical features
+
+## What Needs to Happen
+1. Monitor error rate - should reach <1% in ~15 min
+2. If not improving, escalate to @payments-manager
+3. Once stable, begin root cause investigation
+
+## Key People
+- Incident Commander: @alice (handing off)
+- Comms Lead: @charlie
+- Technical Lead: @bob (incoming)
+
+## Communication
+- Status page: Updated at 08:45
+- Customer support: Notified
+- Exec team: Aware
+
+## Resources
+- Incident channel: #inc-20240122-payment
+- Dashboard: [Payment Service](https://grafana/d/payments)
+- Runbook: [Payment Degradation](https://wiki/runbooks/payments)
+
+---
+
+**Incoming on-call (@bob) - Please confirm you have:**
+- [ ] Joined #inc-20240122-payment
+- [ ] Access to dashboards
+- [ ] Understand current state
+- [ ] Know escalation path
+```
+
+## Handoff Sync Meeting
+
+### Agenda (15 minutes)
+
+```markdown
+## Handoff Sync: @alice → @bob
+
+1. **Active Issues** (5 min)
+   - Walk through any ongoing incidents
+   - Discuss investigation status
+   - Transfer context and theories
+
+2. **Recent Changes** (3 min)
+   - Deployments to watch
+   - Config changes
+   - Known regressions
+
+3. **Upcoming Events** (3 min)
+   - Maintenance windows
+   - Expected traffic changes
+   - Releases planned
+
+4. **Questions** (4 min)
+   - Clarify anything unclear
+   - Confirm access and alerting
+   - Exchange contact info
+```
+
+## On-Call Best Practices
+
+### Before Your Shift
+
+```markdown
+## Pre-Shift Checklist
+
+### Access Verification
+- [ ] VPN working
+- [ ] kubectl access to all clusters
+- [ ] Database read access
+- [ ] Log aggregator access (Splunk/Datadog)
+- [ ] PagerDuty app installed and logged in
+
+### Alerting Setup
+- [ ] PagerDuty schedule shows you as primary
+- [ ] Phone notifications enabled
+- [ ] Slack notifications for incident channels
+- [ ] Test alert received and acknowledged
+
+### Knowledge Refresh
+- [ ] Review recent incidents (past 2 weeks)
+- [ ] Check service changelog
+- [ ] Skim critical runbooks
+- [ ] Know escalation contacts
+
+### Environment Ready
+- [ ] Laptop charged and accessible
+- [ ] Phone charged
+- [ ] Quiet space available for calls
+- [ ] Secondary contact identified (if traveling)
+```
+
+### During Your Shift
+
+```markdown
+## Daily On-Call Routine
+
+### Morning (start of day)
+- [ ] Check overnight alerts
+- [ ] Review dashboards for anomalies
+- [ ] Check for any P0/P1 tickets created
+- [ ] Skim incident channels for context
+
+### Throughout Day
+- [ ] Respond to alerts within SLA
+- [ ] Document investigation progress
+- [ ] Update team on significant issues
+- [ ] Triage incoming pages
+
+### End of Day
+- [ ] Hand off any active issues
+- [ ] Update investigation docs
+- [ ] Note anything for next shift
+```
+
+### After Your Shift
+
+```markdown
+## Post-Shift Checklist
+
+- [ ] Complete handoff document
+- [ ] Sync with incoming on-call
+- [ ] Verify PagerDuty routing changed
+- [ ] Close/update investigation tickets
+- [ ] File postmortems for any incidents
+- [ ] Take time off if shift was stressful
+```
+
+## Escalation Guidelines
+
+### When to Escalate
+
+```markdown
+## Escalation Triggers
+
+### Immediate Escalation
+- SEV1 incident declared
+- Data breach suspected
+- Unable to diagnose within 30 min
+- Customer or legal escalation received
+
+### Consider Escalation
+- Issue spans multiple teams
+- Requires expertise you don't have
+- Business impact exceeds threshold
+- You're uncertain about next steps
+
+### How to Escalate
+1. Page the appropriate escalation path
+2. Provide brief context in Slack
+3. Stay engaged until escalation acknowledges
+4. Hand off cleanly, don't just disappear
+```
+
+## Best Practices
+
+### Do's
+- **Document everything** - Future you will thank you
+- **Escalate early** - Better safe than sorry
+- **Take breaks** - Alert fatigue is real
+- **Keep handoffs synchronous** - Async loses context
+- **Test your setup** - Before incidents, not during
+
+### Don'ts
+- **Don't skip handoffs** - Context loss causes incidents
+- **Don't hero** - Escalate when needed
+- **Don't ignore alerts** - Even if they seem minor
+- **Don't work sick** - Swap shifts instead
+- **Don't disappear** - Stay reachable during shift
+
+## Resources
+
+- [Google SRE - Being On-Call](https://sre.google/sre-book/being-on-call/)
+- [PagerDuty On-Call Guide](https://www.pagerduty.com/resources/learn/on-call-management/)
+- [Increment On-Call Issue](https://increment.com/on-call/)