mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
Rewrites 14 commands across 11 plugins to remove all cross-plugin subagent_type references (e.g., "unit-testing::test-automator"), which break when plugins are installed standalone. Each command now uses only local bundled agents or general-purpose with role context in the prompt. All rewritten commands follow conductor-style patterns: - CRITICAL BEHAVIORAL RULES with strong directives - State files for session tracking and resume support - Phase checkpoints requiring explicit user approval - File-based context passing between steps Also fixes 4 plugin.json files missing version/license fields and adds plugin.json for dotnet-contribution. Closes #433
602 lines
18 KiB
Markdown
602 lines
18 KiB
Markdown
---
|
|
description: "Orchestrate multi-agent incident response with modern SRE practices for rapid resolution and learning"
|
|
argument-hint: "<incident description> [--severity P0|P1|P2|P3]"
|
|
---
|
|
|
|
# Incident Response Orchestrator
|
|
|
|
## CRITICAL BEHAVIORAL RULES
|
|
|
|
You MUST follow these rules exactly. Violating any of them is a failure.
|
|
|
|
1. **Execute steps in order.** Do NOT skip ahead, reorder, or merge steps.
|
|
2. **Write output files.** Each step MUST produce its output file in `.incident-response/` before the next step begins. Read from prior step files — do NOT rely on context window memory.
|
|
3. **Stop at checkpoints.** When you reach a `PHASE CHECKPOINT`, you MUST stop and wait for explicit user approval before continuing. Use the AskUserQuestion tool with clear options.
|
|
4. **Halt on failure.** If any step fails (agent error, test failure, missing dependency), STOP immediately. Present the error and ask the user how to proceed. Do NOT silently continue.
|
|
5. **Use only local agents.** All `subagent_type` references use agents bundled with this plugin or `general-purpose`. No cross-plugin dependencies.
|
|
6. **Never enter plan mode autonomously.** Do NOT use EnterPlanMode. This command IS the plan — execute it.
|
|
|
|
## Pre-flight Checks
|
|
|
|
Before starting, perform these checks:
|
|
|
|
### 1. Check for existing session
|
|
|
|
Check if `.incident-response/state.json` exists:
|
|
|
|
- If it exists and `status` is `"in_progress"`: Read it, display the current step, and ask the user:
|
|
|
|
```
|
|
Found an in-progress incident response session:
|
|
Incident: [incident from state]
|
|
Severity: [severity from state]
|
|
Current step: [step from state]
|
|
|
|
1. Resume from where we left off
|
|
2. Start fresh (archives existing session)
|
|
```
|
|
|
|
- If it exists and `status` is `"complete"`: Ask whether to archive and start fresh.
|
|
|
|
### 2. Initialize state
|
|
|
|
Create `.incident-response/` directory and `state.json`:
|
|
|
|
```json
|
|
{
|
|
"incident": "$ARGUMENTS",
|
|
"status": "in_progress",
|
|
"severity": "P1",
|
|
"current_step": 1,
|
|
"current_phase": 1,
|
|
"completed_steps": [],
|
|
"files_created": [],
|
|
"started_at": "ISO_TIMESTAMP",
|
|
"last_updated": "ISO_TIMESTAMP"
|
|
}
|
|
```
|
|
|
|
Parse `$ARGUMENTS` for `--severity` flag. Default to P1 if not specified.
|
|
|
|
### 3. Parse incident description
|
|
|
|
Extract the incident description from `$ARGUMENTS` (everything before the flags). This is referenced as `$INCIDENT` in prompts below.
|
|
|
|
---
|
|
|
|
## Phase 1: Detection & Triage (Steps 1-3)
|
|
|
|
### Step 1: Incident Detection and Classification
|
|
|
|
Use the Task tool to launch the incident responder agent:
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "incident-responder"
|
|
description: "URGENT: Classify incident: $INCIDENT"
|
|
prompt: |
|
|
URGENT: Detect and classify incident: $INCIDENT
|
|
|
|
Determine:
|
|
1. Incident severity (P0-P3) based on impact assessment
|
|
2. Affected services and their dependencies
|
|
3. User impact and business risk
|
|
4. Initial incident command structure needed
|
|
5. SLO violation status and error budget impact
|
|
|
|
Check: error budgets, recent deployments, configuration changes, and monitoring alerts.
|
|
|
|
Provide structured output with: SEVERITY, AFFECTED_SERVICES, USER_IMPACT,
|
|
BUSINESS_RISK, INCIDENT_COMMAND, SLO_STATUS.
|
|
```
|
|
|
|
Save output to `.incident-response/01-classification.md`.
|
|
|
|
Update `state.json`: set `current_step` to 2, update severity from classification, add step 1 to `completed_steps`.
|
|
|
|
### Step 2: Observability Analysis
|
|
|
|
Read `.incident-response/01-classification.md`.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "general-purpose"
|
|
description: "Observability sweep for incident: $INCIDENT"
|
|
prompt: |
|
|
You are an observability engineer. Perform rapid observability sweep for this incident.
|
|
|
|
Context: [Insert contents of .incident-response/01-classification.md]
|
|
|
|
Query and analyze:
|
|
1. Distributed tracing (OpenTelemetry/Jaeger) for request flow
|
|
2. Metrics correlation (Prometheus/Grafana/DataDog) for anomalies
|
|
3. Log aggregation (ELK/Splunk) for error patterns
|
|
4. APM data for performance degradation points
|
|
5. Real User Monitoring for user experience impact
|
|
|
|
Identify anomalies, error patterns, and service degradation points.
|
|
|
|
Provide structured output with: TRACE_ANALYSIS, METRICS_ANOMALIES, LOG_PATTERNS,
|
|
APM_FINDINGS, RUM_IMPACT, SERVICE_HEALTH_MATRIX.
|
|
```
|
|
|
|
Save output to `.incident-response/02-observability.md`.
|
|
|
|
Update `state.json`: set `current_step` to 3, add step 2 to `completed_steps`.
|
|
|
|
### Step 3: Initial Mitigation
|
|
|
|
Read `.incident-response/01-classification.md` and `.incident-response/02-observability.md`.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "incident-responder"
|
|
description: "Immediate mitigation for: $INCIDENT"
|
|
prompt: |
|
|
Implement immediate mitigation for this incident.
|
|
|
|
Classification: [Insert contents of .incident-response/01-classification.md]
|
|
Observability: [Insert contents of .incident-response/02-observability.md]
|
|
|
|
Actions to evaluate and implement:
|
|
1. Traffic throttling/rerouting if needed
|
|
2. Feature flag disabling for affected features
|
|
3. Circuit breaker activation
|
|
4. Rollback assessment for recent deployments
|
|
5. Scale resources if capacity-related
|
|
|
|
Prioritize user experience restoration.
|
|
|
|
Provide structured output with: MITIGATION_ACTIONS, TEMPORARY_FIXES,
|
|
ROLLBACK_DECISIONS, SERVICE_STATUS_AFTER, USER_IMPACT_REDUCTION.
|
|
```
|
|
|
|
Save output to `.incident-response/03-mitigation.md`.
|
|
|
|
Update `state.json`: set `current_step` to "checkpoint-1", add step 3 to `completed_steps`.
|
|
|
|
---
|
|
|
|
## PHASE CHECKPOINT 1 — User Approval Required
|
|
|
|
You MUST stop here and present the triage results.
|
|
|
|
Display a summary from `.incident-response/01-classification.md` and `.incident-response/03-mitigation.md` and ask:
|
|
|
|
```
|
|
Triage and initial mitigation complete.
|
|
|
|
Severity: [from classification]
|
|
Affected services: [from classification]
|
|
Mitigation status: [from mitigation]
|
|
User impact reduction: [from mitigation]
|
|
|
|
1. Approve — proceed to investigation and root cause analysis
|
|
2. Request changes — adjust mitigation or severity
|
|
3. Pause — save progress and stop here (mitigation in place)
|
|
```
|
|
|
|
Do NOT proceed to Phase 2 until the user approves.
|
|
|
|
---
|
|
|
|
## Phase 2: Investigation & Root Cause (Steps 4-6)
|
|
|
|
### Step 4: Deep System Debugging
|
|
|
|
Read `.incident-response/02-observability.md` and `.incident-response/03-mitigation.md`.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "debugger"
|
|
description: "Deep debugging for: $INCIDENT"
|
|
prompt: |
|
|
Conduct deep debugging for this incident using observability data.
|
|
|
|
Observability: [Insert contents of .incident-response/02-observability.md]
|
|
Mitigation: [Insert contents of .incident-response/03-mitigation.md]
|
|
|
|
Investigate:
|
|
1. Stack traces and error logs
|
|
2. Database query performance and locks
|
|
3. Network latency and timeouts
|
|
4. Memory leaks and CPU spikes
|
|
5. Dependency failures and cascading errors
|
|
|
|
Apply Five Whys analysis to identify root cause.
|
|
|
|
Provide structured output with: ROOT_CAUSE, CONTRIBUTING_FACTORS,
|
|
DEPENDENCY_IMPACT_MAP, FIVE_WHYS_ANALYSIS.
|
|
```
|
|
|
|
Save output to `.incident-response/04-debugging.md`.
|
|
|
|
Update `state.json`: set `current_step` to 5, add step 4 to `completed_steps`.
|
|
|
|
### Step 5: Security Assessment
|
|
|
|
Read `.incident-response/04-debugging.md`.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "general-purpose"
|
|
description: "Security assessment for: $INCIDENT"
|
|
prompt: |
|
|
You are a security auditor. Assess security implications of this incident.
|
|
|
|
Debug findings: [Insert contents of .incident-response/04-debugging.md]
|
|
|
|
Check:
|
|
1. DDoS attack indicators
|
|
2. Authentication/authorization failures
|
|
3. Data exposure risks
|
|
4. Certificate issues
|
|
5. Suspicious access patterns
|
|
|
|
Review WAF logs, security groups, and audit trails.
|
|
|
|
Provide structured output with: SECURITY_ASSESSMENT, BREACH_ANALYSIS,
|
|
VULNERABILITY_IDENTIFICATION, DATA_EXPOSURE_RISK, REMEDIATION_STEPS.
|
|
```
|
|
|
|
### Step 6: Performance Analysis
|
|
|
|
Read `.incident-response/04-debugging.md`.
|
|
|
|
Launch in parallel with Step 5:
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "general-purpose"
|
|
description: "Performance analysis for: $INCIDENT"
|
|
prompt: |
|
|
You are a performance engineer. Analyze performance aspects of this incident.
|
|
|
|
Debug findings: [Insert contents of .incident-response/04-debugging.md]
|
|
|
|
Examine:
|
|
1. Resource utilization patterns
|
|
2. Query optimization opportunities
|
|
3. Caching effectiveness
|
|
4. Load balancer health
|
|
5. CDN performance
|
|
6. Autoscaling triggers
|
|
|
|
Identify bottlenecks and capacity issues.
|
|
|
|
Provide structured output with: PERFORMANCE_BOTTLENECKS, RESOURCE_RECOMMENDATIONS,
|
|
OPTIMIZATION_OPPORTUNITIES, CAPACITY_ISSUES.
|
|
```
|
|
|
|
After both complete, consolidate into `.incident-response/05-investigation.md`:
|
|
|
|
```markdown
|
|
# Investigation: $INCIDENT
|
|
|
|
## Root Cause (from debugging)
|
|
|
|
[From Step 4]
|
|
|
|
## Security Assessment
|
|
|
|
[From Step 5]
|
|
|
|
## Performance Analysis
|
|
|
|
[From Step 6]
|
|
|
|
## Combined Findings
|
|
|
|
[Synthesis of all investigation results]
|
|
```
|
|
|
|
Update `state.json`: set `current_step` to "checkpoint-2", add steps 4-6 to `completed_steps`.
|
|
|
|
---
|
|
|
|
## PHASE CHECKPOINT 2 — User Approval Required
|
|
|
|
Display investigation results from `.incident-response/05-investigation.md` and ask:
|
|
|
|
```
|
|
Investigation complete. Please review .incident-response/05-investigation.md
|
|
|
|
Root cause: [brief summary]
|
|
Security concerns: [summary]
|
|
Performance issues: [summary]
|
|
|
|
1. Approve — proceed to fix implementation and deployment
|
|
2. Request changes — investigate further
|
|
3. Pause — save progress and stop here
|
|
```
|
|
|
|
Do NOT proceed to Phase 3 until the user approves.
|
|
|
|
---
|
|
|
|
## Phase 3: Resolution & Recovery (Steps 7-8)
|
|
|
|
### Step 7: Fix Implementation
|
|
|
|
Read `.incident-response/05-investigation.md`.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "general-purpose"
|
|
description: "Implement production fix for: $INCIDENT"
|
|
prompt: |
|
|
You are a senior backend architect. Design and implement a production fix for this incident.
|
|
|
|
Investigation: [Insert contents of .incident-response/05-investigation.md]
|
|
|
|
Requirements:
|
|
1. Minimal viable fix for rapid deployment
|
|
2. Risk assessment and rollback capability
|
|
3. Staged rollout plan with monitoring
|
|
4. Validation criteria and health checks
|
|
5. Consider both immediate fix and long-term solution
|
|
|
|
Provide structured output with: FIX_IMPLEMENTATION, DEPLOYMENT_STRATEGY,
|
|
VALIDATION_PLAN, ROLLBACK_PROCEDURES, LONG_TERM_SOLUTION.
|
|
```
|
|
|
|
Save output to `.incident-response/06-fix.md`.
|
|
|
|
Update `state.json`: set `current_step` to 8, add step 7 to `completed_steps`.
|
|
|
|
### Step 8: Deployment and Validation
|
|
|
|
Read `.incident-response/06-fix.md`.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "devops-troubleshooter"
|
|
description: "Deploy and validate fix for: $INCIDENT"
|
|
prompt: |
|
|
Execute emergency deployment for incident fix.
|
|
|
|
Fix details: [Insert contents of .incident-response/06-fix.md]
|
|
|
|
Process:
|
|
1. Blue-green or canary deployment strategy
|
|
2. Progressive rollout with monitoring
|
|
3. Health check validation at each stage
|
|
4. Rollback triggers configured
|
|
5. Real-time monitoring during deployment
|
|
|
|
Provide structured output with: DEPLOYMENT_STATUS, VALIDATION_RESULTS,
|
|
MONITORING_DASHBOARD, ROLLBACK_READINESS, SERVICE_HEALTH_POST_DEPLOY.
|
|
```
|
|
|
|
Save output to `.incident-response/07-deployment.md`.
|
|
|
|
Update `state.json`: set `current_step` to "checkpoint-3", add step 8 to `completed_steps`.
|
|
|
|
---
|
|
|
|
## PHASE CHECKPOINT 3 — User Approval Required
|
|
|
|
Display deployment results from `.incident-response/07-deployment.md` and ask:
|
|
|
|
```
|
|
Fix deployed and validated.
|
|
|
|
Deployment status: [from deployment]
|
|
Service health: [from deployment]
|
|
Rollback ready: [yes/no]
|
|
|
|
1. Approve — proceed to communication and postmortem
|
|
2. Rollback — revert the deployment
|
|
3. Pause — save progress and monitor
|
|
```
|
|
|
|
Do NOT proceed to Phase 4 until the user approves.
|
|
|
|
---
|
|
|
|
## Phase 4: Communication & Coordination (Steps 9-10)
|
|
|
|
### Step 9: Stakeholder Communication
|
|
|
|
Read `.incident-response/01-classification.md`, `.incident-response/05-investigation.md`, and `.incident-response/07-deployment.md`.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "general-purpose"
|
|
description: "Manage incident communication for: $INCIDENT"
|
|
prompt: |
|
|
You are a communications specialist. Manage incident communication for this incident.
|
|
|
|
Classification: [Insert contents of .incident-response/01-classification.md]
|
|
Investigation: [Insert contents of .incident-response/05-investigation.md]
|
|
Deployment: [Insert contents of .incident-response/07-deployment.md]
|
|
|
|
Create:
|
|
1. Status page updates (public-facing)
|
|
2. Internal engineering updates (technical details)
|
|
3. Executive summary (business impact/ETA)
|
|
4. Customer support briefing (talking points)
|
|
5. Timeline documentation with key decisions
|
|
|
|
Provide structured output with: STATUS_PAGE_UPDATE, ENGINEERING_UPDATE,
|
|
EXECUTIVE_SUMMARY, SUPPORT_BRIEFING, INCIDENT_TIMELINE.
|
|
```
|
|
|
|
Save output to `.incident-response/08-communication.md`.
|
|
|
|
Update `state.json`: set `current_step` to 10, add step 9 to `completed_steps`.
|
|
|
|
### Step 10: Customer Impact Assessment
|
|
|
|
Read `.incident-response/01-classification.md` and `.incident-response/07-deployment.md`.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "incident-responder"
|
|
description: "Assess customer impact for: $INCIDENT"
|
|
prompt: |
|
|
Assess and document customer impact for this incident.
|
|
|
|
Classification: [Insert contents of .incident-response/01-classification.md]
|
|
Resolution: [Insert contents of .incident-response/07-deployment.md]
|
|
|
|
Analyze:
|
|
1. Affected user segments and geography
|
|
2. Failed transactions or data loss
|
|
3. SLA violations and contractual implications
|
|
4. Customer support ticket volume
|
|
5. Revenue impact estimation
|
|
6. Proactive customer outreach recommendations
|
|
|
|
Provide structured output with: CUSTOMER_IMPACT_REPORT, SLA_ANALYSIS,
|
|
REVENUE_IMPACT, OUTREACH_RECOMMENDATIONS.
|
|
```
|
|
|
|
Save output to `.incident-response/09-customer-impact.md`.
|
|
|
|
Update `state.json`: set `current_step` to 11, add step 10 to `completed_steps`.
|
|
|
|
---
|
|
|
|
## Phase 5: Postmortem & Prevention (Steps 11-13)
|
|
|
|
### Step 11: Blameless Postmortem
|
|
|
|
Read all `.incident-response/*.md` files.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "general-purpose"
|
|
description: "Blameless postmortem for: $INCIDENT"
|
|
prompt: |
|
|
You are an SRE documentation specialist. Conduct a blameless postmortem for this incident.
|
|
|
|
Context: [Insert contents of all .incident-response/*.md files]
|
|
|
|
Document:
|
|
1. Complete incident timeline with decisions
|
|
2. Root cause and contributing factors (systems focus, not people)
|
|
3. What went well in response
|
|
4. What could improve
|
|
5. Action items with owners and deadlines
|
|
6. Lessons learned for team education
|
|
|
|
Follow SRE postmortem best practices. Focus on systems, not blame.
|
|
|
|
Provide structured output with: INCIDENT_TIMELINE, ROOT_CAUSE_SUMMARY,
|
|
WHAT_WENT_WELL, IMPROVEMENTS, ACTION_ITEMS, LESSONS_LEARNED.
|
|
```
|
|
|
|
Save output to `.incident-response/10-postmortem.md`.
|
|
|
|
Update `state.json`: set `current_step` to 12, add step 11 to `completed_steps`.
|
|
|
|
### Step 12: Monitoring Enhancement
|
|
|
|
Read `.incident-response/05-investigation.md` and `.incident-response/10-postmortem.md`.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "general-purpose"
|
|
description: "Enhance monitoring for: $INCIDENT prevention"
|
|
prompt: |
|
|
You are an observability engineer. Enhance monitoring to prevent recurrence of this incident.
|
|
|
|
Investigation: [Insert contents of .incident-response/05-investigation.md]
|
|
Postmortem: [Insert contents of .incident-response/10-postmortem.md]
|
|
|
|
Implement:
|
|
1. New alerts for early detection
|
|
2. SLI/SLO adjustments if needed
|
|
3. Dashboard improvements for visibility
|
|
4. Runbook automation opportunities
|
|
5. Chaos engineering scenarios for testing
|
|
|
|
Ensure alerts are actionable and reduce noise.
|
|
|
|
Provide structured output with: NEW_ALERTS, SLO_ADJUSTMENTS, DASHBOARD_UPDATES,
|
|
RUNBOOK_AUTOMATION, CHAOS_SCENARIOS.
|
|
```
|
|
|
|
Save output to `.incident-response/11-monitoring.md`.
|
|
|
|
Update `state.json`: set `current_step` to 13, add step 12 to `completed_steps`.
|
|
|
|
### Step 13: System Hardening
|
|
|
|
Read `.incident-response/05-investigation.md` and `.incident-response/10-postmortem.md`.
|
|
|
|
```
|
|
Task:
|
|
subagent_type: "general-purpose"
|
|
description: "System hardening for: $INCIDENT prevention"
|
|
prompt: |
|
|
You are a senior backend architect. Design system improvements to prevent recurrence.
|
|
|
|
Investigation: [Insert contents of .incident-response/05-investigation.md]
|
|
Postmortem: [Insert contents of .incident-response/10-postmortem.md]
|
|
|
|
Propose:
|
|
1. Architecture changes for resilience (circuit breakers, bulkheads)
|
|
2. Graceful degradation strategies
|
|
3. Capacity planning adjustments
|
|
4. Technical debt prioritization
|
|
5. Dependency reduction opportunities
|
|
6. Implementation roadmap
|
|
|
|
Provide structured output with: ARCHITECTURE_IMPROVEMENTS, RESILIENCE_PATTERNS,
|
|
CAPACITY_PLAN, TECH_DEBT_ITEMS, IMPLEMENTATION_ROADMAP.
|
|
```
|
|
|
|
Save output to `.incident-response/12-hardening.md`.
|
|
|
|
Update `state.json`: set `current_step` to "complete", add step 13 to `completed_steps`.
|
|
|
|
---
|
|
|
|
## Completion
|
|
|
|
Update `state.json`:
|
|
|
|
- Set `status` to `"complete"`
|
|
- Set `last_updated` to current timestamp
|
|
|
|
Present the final summary:
|
|
|
|
```
|
|
Incident response complete: $INCIDENT
|
|
|
|
## Files Created
|
|
[List all .incident-response/ output files]
|
|
|
|
## Response Summary
|
|
- Classification: .incident-response/01-classification.md
|
|
- Observability: .incident-response/02-observability.md
|
|
- Mitigation: .incident-response/03-mitigation.md
|
|
- Debugging: .incident-response/04-debugging.md
|
|
- Investigation: .incident-response/05-investigation.md
|
|
- Fix: .incident-response/06-fix.md
|
|
- Deployment: .incident-response/07-deployment.md
|
|
- Communication: .incident-response/08-communication.md
|
|
- Customer Impact: .incident-response/09-customer-impact.md
|
|
- Postmortem: .incident-response/10-postmortem.md
|
|
- Monitoring: .incident-response/11-monitoring.md
|
|
- Hardening: .incident-response/12-hardening.md
|
|
|
|
## Immediate Follow-ups
|
|
1. Verify service stability over the next 24 hours
|
|
2. Complete all postmortem action items
|
|
3. Deploy monitoring enhancements within 1 week
|
|
4. Schedule system hardening work
|
|
5. Conduct team learning session on lessons learned
|
|
|
|
## Success Criteria
|
|
- Service restored within SLA targets
|
|
- Postmortem completed within 48 hours
|
|
- All action items assigned with deadlines
|
|
- Monitoring improvements deployed within 1 week
|
|
- No recurrence of the same root cause
|
|
```
|
|
|
|
Production incident requiring immediate response: $ARGUMENTS
|