agents/plugins/incident-response/commands/incident-response.md
Seth Hobson 4d504ed8fa fix: eliminate cross-plugin dependencies and modernize plugin.json across marketplace
Rewrites 14 commands across 11 plugins to remove all cross-plugin
subagent_type references (e.g., "unit-testing::test-automator"), which
break when plugins are installed standalone. Each command now uses only
local bundled agents or general-purpose with role context in the prompt.

All rewritten commands follow conductor-style patterns:
- CRITICAL BEHAVIORAL RULES with strong directives
- State files for session tracking and resume support
- Phase checkpoints requiring explicit user approval
- File-based context passing between steps

Also fixes 4 plugin.json files missing version/license fields and adds
plugin.json for dotnet-contribution.

Closes #433
2026-02-06 19:34:26 -05:00

---
description: "Orchestrate multi-agent incident response with modern SRE practices for rapid resolution and learning"
argument-hint: "<incident description> [--severity P0|P1|P2|P3]"
---
# Incident Response Orchestrator
## CRITICAL BEHAVIORAL RULES
You MUST follow these rules exactly. Violating any of them is a failure.
1. **Execute steps in order.** Do NOT skip ahead, reorder, or merge steps.
2. **Write output files.** Each step MUST produce its output file in `.incident-response/` before the next step begins. Read from prior step files — do NOT rely on context window memory.
3. **Stop at checkpoints.** When you reach a `PHASE CHECKPOINT`, you MUST stop and wait for explicit user approval before continuing. Use the AskUserQuestion tool with clear options.
4. **Halt on failure.** If any step fails (agent error, test failure, missing dependency), STOP immediately. Present the error and ask the user how to proceed. Do NOT silently continue.
5. **Use only local agents.** All `subagent_type` references use agents bundled with this plugin or `general-purpose`. No cross-plugin dependencies.
6. **Never enter plan mode autonomously.** Do NOT use EnterPlanMode. This command IS the plan — execute it.
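Rule 2 describes file-based context passing. As a rough illustration only, the `[Insert contents of ...]` markers used in the prompts below could be expanded as in this sketch; the function name `fill_prompt` and the `root` parameter are hypothetical and not part of the plugin.

```python
import re
from pathlib import Path

# Matches markers such as "[Insert contents of .incident-response/01-classification.md]".
PLACEHOLDER = re.compile(r"\[Insert contents of (\.incident-response/[\w.-]+\.md)\]")


def fill_prompt(template: str, root: Path = Path(".")) -> str:
    """Replace each marker with the text of the file it names.

    A missing file raises FileNotFoundError, which fits the
    "halt on failure" rule better than silently continuing.
    """
    def substitute(match: re.Match) -> str:
        return (root / match.group(1)).read_text(encoding="utf-8").strip()

    return PLACEHOLDER.sub(substitute, template)
```

Reading from disk on every step means a resumed session sees the same context as the original run.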
## Pre-flight Checks
Before starting, perform these checks:
### 1. Check for existing session
Check if `.incident-response/state.json` exists:
- If it exists and `status` is `"in_progress"`: Read it, display the current step, and ask the user:
```
Found an in-progress incident response session:
Incident: [incident from state]
Severity: [severity from state]
Current step: [step from state]
1. Resume from where we left off
2. Start fresh (archives existing session)
```
- If it exists and `status` is `"complete"`: Ask whether to archive and start fresh.
- If it does not exist: continue to initialization.
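The session check above can be sketched as a small decision function. This is a minimal sketch: the return labels are hypothetical, and treating a missing file or an unrecognized `status` as "no session" is an assumption, since the text does not specify those cases.

```python
import json
from pathlib import Path


def session_action(state_dir: Path = Path(".incident-response")) -> str:
    """Return "init", "ask_resume", or "ask_archive" based on state.json."""
    state_file = state_dir / "state.json"
    if not state_file.exists():
        return "init"  # assumption: no state file means a new session
    state = json.loads(state_file.read_text(encoding="utf-8"))
    status = state.get("status")
    if status == "in_progress":
        return "ask_resume"  # show incident, severity, current step, then ask
    if status == "complete":
        return "ask_archive"  # ask whether to archive and start fresh
    return "init"  # assumption: unknown status is treated as no session
```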
### 2. Initialize state
Create `.incident-response/` directory and `state.json`:
```json
{
  "incident": "$ARGUMENTS",
  "status": "in_progress",
  "severity": "P1",
  "current_step": 1,
  "current_phase": 1,
  "completed_steps": [],
  "files_created": [],
  "started_at": "ISO_TIMESTAMP",
  "last_updated": "ISO_TIMESTAMP"
}
```
Parse `$ARGUMENTS` for the `--severity` flag and store its value in `severity`. Default to P1 if the flag is not specified.
### 3. Parse incident description
Extract the incident description from `$ARGUMENTS` (everything before the flags). This is referenced as `$INCIDENT` in prompts below.
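A minimal sketch of the argument handling and state initialization described above. The helper names `parse_arguments` and `initial_state` are hypothetical, and the sketch assumes `--severity` is the only flag, in the form shown in the argument hint.

```python
import re
from datetime import datetime, timezone

# Matches the flag form from the argument hint: --severity P0|P1|P2|P3
SEVERITY_FLAG = re.compile(r"--severity\s+(P[0-3])\b")


def parse_arguments(arguments: str) -> tuple[str, str]:
    """Return (incident description, severity) from the raw argument string."""
    match = SEVERITY_FLAG.search(arguments)
    if match is None:
        return arguments.strip(), "P1"  # default severity when no flag is given
    # The description is everything before the flag.
    return arguments[: match.start()].strip(), match.group(1)


def initial_state(arguments: str) -> dict:
    """Build the state.json payload for a new session."""
    _, severity = parse_arguments(arguments)
    now = datetime.now(timezone.utc).isoformat()
    return {
        "incident": arguments,
        "status": "in_progress",
        "severity": severity,
        "current_step": 1,
        "current_phase": 1,
        "completed_steps": [],
        "files_created": [],
        "started_at": now,
        "last_updated": now,
    }
```

For example, `parse_arguments("Checkout API returning 500s --severity P0")` yields the description `"Checkout API returning 500s"` and the severity `"P0"`.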
---
## Phase 1: Detection & Triage (Steps 1-3)
### Step 1: Incident Detection and Classification
Use the Task tool to launch the incident responder agent:
```
Task:
  subagent_type: "incident-responder"
  description: "URGENT: Classify incident: $INCIDENT"
  prompt: |
    URGENT: Detect and classify incident: $INCIDENT
    Determine:
    1. Incident severity (P0-P3) based on impact assessment
    2. Affected services and their dependencies
    3. User impact and business risk
    4. Initial incident command structure needed
    5. SLO violation status and error budget impact
    Check: error budgets, recent deployments, configuration changes, and monitoring alerts.
    Provide structured output with: SEVERITY, AFFECTED_SERVICES, USER_IMPACT,
    BUSINESS_RISK, INCIDENT_COMMAND, SLO_STATUS.
```
Save output to `.incident-response/01-classification.md`.
Update `state.json`: set `current_step` to 2, update severity from classification, add step 1 to `completed_steps`.
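The same state update recurs after every step. One way to picture it is the sketch below; `complete_step` is a hypothetical helper, and recording the output file in `files_created` at this point is an assumption, since the text only shows that the field exists.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def complete_step(state_file: Path, step: int, next_step, output_file: str, **updates) -> dict:
    """Mark a step as done, advance current_step, and persist state.json.

    next_step may be an int or a marker string such as "checkpoint-1".
    Extra keyword arguments (for example severity="P0") overwrite state fields.
    """
    state = json.loads(state_file.read_text(encoding="utf-8"))
    if step not in state["completed_steps"]:
        state["completed_steps"].append(step)
    if output_file not in state["files_created"]:  # assumption: tracked per step
        state["files_created"].append(output_file)
    state["current_step"] = next_step
    state.update(updates)
    state["last_updated"] = datetime.now(timezone.utc).isoformat()
    state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state
```

After Step 1 this would be called as `complete_step(path, 1, 2, "01-classification.md", severity="P0")` if the classification raised the severity to P0.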
### Step 2: Observability Analysis
Read `.incident-response/01-classification.md`.
```
Task:
  subagent_type: "general-purpose"
  description: "Observability sweep for incident: $INCIDENT"
  prompt: |
    You are an observability engineer. Perform a rapid observability sweep for this incident.
    Context: [Insert contents of .incident-response/01-classification.md]
    Query and analyze:
    1. Distributed tracing (OpenTelemetry/Jaeger) for request flow
    2. Metrics correlation (Prometheus/Grafana/DataDog) for anomalies
    3. Log aggregation (ELK/Splunk) for error patterns
    4. APM data for performance degradation points
    5. Real User Monitoring for user experience impact
    Identify anomalies, error patterns, and service degradation points.
    Provide structured output with: TRACE_ANALYSIS, METRICS_ANOMALIES, LOG_PATTERNS,
    APM_FINDINGS, RUM_IMPACT, SERVICE_HEALTH_MATRIX.
```
Save output to `.incident-response/02-observability.md`.
Update `state.json`: set `current_step` to 3, add step 2 to `completed_steps`.
### Step 3: Initial Mitigation
Read `.incident-response/01-classification.md` and `.incident-response/02-observability.md`.
```
Task:
  subagent_type: "incident-responder"
  description: "Immediate mitigation for: $INCIDENT"
  prompt: |
    Implement immediate mitigation for this incident.
    Classification: [Insert contents of .incident-response/01-classification.md]
    Observability: [Insert contents of .incident-response/02-observability.md]
    Actions to evaluate and implement:
    1. Traffic throttling/rerouting if needed
    2. Feature flag disabling for affected features
    3. Circuit breaker activation
    4. Rollback assessment for recent deployments
    5. Scale resources if capacity-related
    Prioritize user experience restoration.
    Provide structured output with: MITIGATION_ACTIONS, TEMPORARY_FIXES,
    ROLLBACK_DECISIONS, SERVICE_STATUS_AFTER, USER_IMPACT_REDUCTION.
```
Save output to `.incident-response/03-mitigation.md`.
Update `state.json`: set `current_step` to "checkpoint-1", add step 3 to `completed_steps`.
---
## PHASE CHECKPOINT 1 — User Approval Required
You MUST stop here and present the triage results.
Display a summary from `.incident-response/01-classification.md` and `.incident-response/03-mitigation.md` and ask:
```
Triage and initial mitigation complete.
Severity: [from classification]
Affected services: [from classification]
Mitigation status: [from mitigation]
User impact reduction: [from mitigation]
1. Approve — proceed to investigation and root cause analysis
2. Request changes — adjust mitigation or severity
3. Pause — save progress and stop here (mitigation in place)
```
Do NOT proceed to Phase 2 until the user approves.
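Because `current_step` holds either a step number or a marker string, it can help to see the whole progression in one place. The sketch below derives the value to record after each step from the phase layout of this command; `next_marker` is a hypothetical helper, and Steps 5 and 6 are treated as finishing together because they run in parallel.

```python
# Steps after which a checkpoint occurs, taken from the phase layout of this command.
CHECKPOINT_AFTER = {3: "checkpoint-1", 6: "checkpoint-2", 8: "checkpoint-3"}
LAST_STEP = 13


def next_marker(finished_step: int):
    """Return the current_step value to record once finished_step is done."""
    if not 1 <= finished_step <= LAST_STEP:
        raise ValueError(f"unknown step: {finished_step}")
    if finished_step == LAST_STEP:
        return "complete"
    # Either a checkpoint marker or simply the next step number.
    return CHECKPOINT_AFTER.get(finished_step, finished_step + 1)
```

A resumed session that finds `"checkpoint-1"` in `current_step` should re-present the checkpoint question, not start Phase 2.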
---
## Phase 2: Investigation & Root Cause (Steps 4-6)
### Step 4: Deep System Debugging
Read `.incident-response/02-observability.md` and `.incident-response/03-mitigation.md`.
```
Task:
  subagent_type: "debugger"
  description: "Deep debugging for: $INCIDENT"
  prompt: |
    Conduct deep debugging for this incident using observability data.
    Observability: [Insert contents of .incident-response/02-observability.md]
    Mitigation: [Insert contents of .incident-response/03-mitigation.md]
    Investigate:
    1. Stack traces and error logs
    2. Database query performance and locks
    3. Network latency and timeouts
    4. Memory leaks and CPU spikes
    5. Dependency failures and cascading errors
    Apply Five Whys analysis to identify root cause.
    Provide structured output with: ROOT_CAUSE, CONTRIBUTING_FACTORS,
    DEPENDENCY_IMPACT_MAP, FIVE_WHYS_ANALYSIS.
```
Save output to `.incident-response/04-debugging.md`.
Update `state.json`: set `current_step` to 5, add step 4 to `completed_steps`.
### Step 5: Security Assessment
Read `.incident-response/04-debugging.md`.
Launch in parallel with Step 6:
```
Task:
  subagent_type: "general-purpose"
  description: "Security assessment for: $INCIDENT"
  prompt: |
    You are a security auditor. Assess security implications of this incident.
    Debug findings: [Insert contents of .incident-response/04-debugging.md]
    Check:
    1. DDoS attack indicators
    2. Authentication/authorization failures
    3. Data exposure risks
    4. Certificate issues
    5. Suspicious access patterns
    Review WAF logs, security groups, and audit trails.
    Provide structured output with: SECURITY_ASSESSMENT, BREACH_ANALYSIS,
    VULNERABILITY_IDENTIFICATION, DATA_EXPOSURE_RISK, REMEDIATION_STEPS.
```
### Step 6: Performance Analysis
Read `.incident-response/04-debugging.md`.
Launch in parallel with Step 5:
```
Task:
  subagent_type: "general-purpose"
  description: "Performance analysis for: $INCIDENT"
  prompt: |
    You are a performance engineer. Analyze performance aspects of this incident.
    Debug findings: [Insert contents of .incident-response/04-debugging.md]
    Examine:
    1. Resource utilization patterns
    2. Query optimization opportunities
    3. Caching effectiveness
    4. Load balancer health
    5. CDN performance
    6. Autoscaling triggers
    Identify bottlenecks and capacity issues.
    Provide structured output with: PERFORMANCE_BOTTLENECKS, RESOURCE_RECOMMENDATIONS,
    OPTIMIZATION_OPPORTUNITIES, CAPACITY_ISSUES.
```
After Steps 5 and 6 both complete, consolidate their outputs, together with the Step 4 findings, into `.incident-response/05-investigation.md`:
```markdown
# Investigation: $INCIDENT
## Root Cause (from debugging)
[From Step 4]
## Security Assessment
[From Step 5]
## Performance Analysis
[From Step 6]
## Combined Findings
[Synthesis of all investigation results]
```
Update `state.json`: set `current_step` to "checkpoint-2", add steps 5 and 6 to `completed_steps` (step 4 was already recorded).
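The fan-out and consolidation for Steps 5 and 6 can be pictured as follows. This is an illustration of the control flow only: `run_agent` is a hypothetical stand-in for a Task tool call, and the sketch does not model how the Task tool itself schedules parallel agents.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_in_parallel(run_agent: Callable[[str], str], prompts: dict[str, str]) -> dict[str, str]:
    """Run one agent call per prompt concurrently and return outputs by name.

    future.result() re-raises any agent error, so a failure in either
    branch stops the caller instead of being silently ignored.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        futures = {name: pool.submit(run_agent, prompt) for name, prompt in prompts.items()}
        return {name: future.result() for name, future in futures.items()}


def consolidate(incident: str, root_cause: str, security: str, performance: str, combined: str) -> str:
    """Render 05-investigation.md using the section layout shown above."""
    return "\n".join([
        f"# Investigation: {incident}",
        "## Root Cause (from debugging)",
        root_cause.strip(),
        "## Security Assessment",
        security.strip(),
        "## Performance Analysis",
        performance.strip(),
        "## Combined Findings",
        combined.strip(),
    ]) + "\n"
```

Writing the consolidated file only after both results are available keeps `05-investigation.md` from ever containing a half-finished investigation.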
---
## PHASE CHECKPOINT 2 — User Approval Required
Display investigation results from `.incident-response/05-investigation.md` and ask:
```
Investigation complete. Please review .incident-response/05-investigation.md
Root cause: [brief summary]
Security concerns: [summary]
Performance issues: [summary]
1. Approve — proceed to fix implementation and deployment
2. Request changes — investigate further
3. Pause — save progress and stop here
```
Do NOT proceed to Phase 3 until the user approves.
---
## Phase 3: Resolution & Recovery (Steps 7-8)
### Step 7: Fix Implementation
Read `.incident-response/05-investigation.md`.
```
Task:
  subagent_type: "general-purpose"
  description: "Implement production fix for: $INCIDENT"
  prompt: |
    You are a senior backend architect. Design and implement a production fix for this incident.
    Investigation: [Insert contents of .incident-response/05-investigation.md]
    Requirements:
    1. Minimal viable fix for rapid deployment
    2. Risk assessment and rollback capability
    3. Staged rollout plan with monitoring
    4. Validation criteria and health checks
    5. Consider both immediate fix and long-term solution
    Provide structured output with: FIX_IMPLEMENTATION, DEPLOYMENT_STRATEGY,
    VALIDATION_PLAN, ROLLBACK_PROCEDURES, LONG_TERM_SOLUTION.
```
Save output to `.incident-response/06-fix.md`.
Update `state.json`: set `current_step` to 8, add step 7 to `completed_steps`.
### Step 8: Deployment and Validation
Read `.incident-response/06-fix.md`.
```
Task:
  subagent_type: "devops-troubleshooter"
  description: "Deploy and validate fix for: $INCIDENT"
  prompt: |
    Execute an emergency deployment of the incident fix.
    Fix details: [Insert contents of .incident-response/06-fix.md]
    Process:
    1. Blue-green or canary deployment strategy
    2. Progressive rollout with monitoring
    3. Health check validation at each stage
    4. Rollback triggers configured
    5. Real-time monitoring during deployment
    Provide structured output with: DEPLOYMENT_STATUS, VALIDATION_RESULTS,
    MONITORING_DASHBOARD, ROLLBACK_READINESS, SERVICE_HEALTH_POST_DEPLOY.
```
Save output to `.incident-response/07-deployment.md`.
Update `state.json`: set `current_step` to "checkpoint-3", add step 8 to `completed_steps`.
---
## PHASE CHECKPOINT 3 — User Approval Required
Display deployment results from `.incident-response/07-deployment.md` and ask:
```
Fix deployed and validated.
Deployment status: [from deployment]
Service health: [from deployment]
Rollback ready: [yes/no]
1. Approve — proceed to communication and postmortem
2. Rollback — revert the deployment
3. Pause — save progress and monitor
```
Do NOT proceed to Phase 4 until the user approves.
---
## Phase 4: Communication & Coordination (Steps 9-10)
### Step 9: Stakeholder Communication
Read `.incident-response/01-classification.md`, `.incident-response/05-investigation.md`, and `.incident-response/07-deployment.md`.
```
Task:
  subagent_type: "general-purpose"
  description: "Manage incident communication for: $INCIDENT"
  prompt: |
    You are a communications specialist. Manage communication for this incident.
    Classification: [Insert contents of .incident-response/01-classification.md]
    Investigation: [Insert contents of .incident-response/05-investigation.md]
    Deployment: [Insert contents of .incident-response/07-deployment.md]
    Create:
    1. Status page updates (public-facing)
    2. Internal engineering updates (technical details)
    3. Executive summary (business impact/ETA)
    4. Customer support briefing (talking points)
    5. Timeline documentation with key decisions
    Provide structured output with: STATUS_PAGE_UPDATE, ENGINEERING_UPDATE,
    EXECUTIVE_SUMMARY, SUPPORT_BRIEFING, INCIDENT_TIMELINE.
```
Save output to `.incident-response/08-communication.md`.
Update `state.json`: set `current_step` to 10, add step 9 to `completed_steps`.
### Step 10: Customer Impact Assessment
Read `.incident-response/01-classification.md` and `.incident-response/07-deployment.md`.
```
Task:
  subagent_type: "incident-responder"
  description: "Assess customer impact for: $INCIDENT"
  prompt: |
    Assess and document customer impact for this incident.
    Classification: [Insert contents of .incident-response/01-classification.md]
    Resolution: [Insert contents of .incident-response/07-deployment.md]
    Analyze:
    1. Affected user segments and geography
    2. Failed transactions or data loss
    3. SLA violations and contractual implications
    4. Customer support ticket volume
    5. Revenue impact estimation
    6. Proactive customer outreach recommendations
    Provide structured output with: CUSTOMER_IMPACT_REPORT, SLA_ANALYSIS,
    REVENUE_IMPACT, OUTREACH_RECOMMENDATIONS.
```
Save output to `.incident-response/09-customer-impact.md`.
Update `state.json`: set `current_step` to 11, add step 10 to `completed_steps`.
---
## Phase 5: Postmortem & Prevention (Steps 11-13)
### Step 11: Blameless Postmortem
Read all `.incident-response/*.md` files.
```
Task:
  subagent_type: "general-purpose"
  description: "Blameless postmortem for: $INCIDENT"
  prompt: |
    You are an SRE documentation specialist. Conduct a blameless postmortem for this incident.
    Context: [Insert contents of all .incident-response/*.md files]
    Document:
    1. Complete incident timeline with decisions
    2. Root cause and contributing factors (systems focus, not people)
    3. What went well in response
    4. What could improve
    5. Action items with owners and deadlines
    6. Lessons learned for team education
    Follow SRE postmortem best practices. Focus on systems, not blame.
    Provide structured output with: INCIDENT_TIMELINE, ROOT_CAUSE_SUMMARY,
    WHAT_WENT_WELL, IMPROVEMENTS, ACTION_ITEMS, LESSONS_LEARNED.
```
Save output to `.incident-response/10-postmortem.md`.
Update `state.json`: set `current_step` to 12, add step 11 to `completed_steps`.
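Step 11 is the only step that reads every prior output at once. A minimal sketch of gathering that context is shown below; `collect_context` is a hypothetical helper, and it assumes the two-digit filename prefixes are what keep the files in step order.

```python
from pathlib import Path


def collect_context(state_dir: Path = Path(".incident-response")) -> str:
    """Concatenate all step output files in filename order.

    Sorting by name works because every output file starts with a
    two-digit prefix (01-, 02-, ...). state.json is skipped because
    only *.md files are matched.
    """
    parts = []
    for path in sorted(state_dir.glob("*.md")):
        body = path.read_text(encoding="utf-8").strip()
        parts.append(f"## {path.name}\n{body}")
    return "\n\n".join(parts)
```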
### Step 12: Monitoring Enhancement
Read `.incident-response/05-investigation.md` and `.incident-response/10-postmortem.md`.
```
Task:
  subagent_type: "general-purpose"
  description: "Enhance monitoring for: $INCIDENT prevention"
  prompt: |
    You are an observability engineer. Enhance monitoring to prevent recurrence of this incident.
    Investigation: [Insert contents of .incident-response/05-investigation.md]
    Postmortem: [Insert contents of .incident-response/10-postmortem.md]
    Implement:
    1. New alerts for early detection
    2. SLI/SLO adjustments if needed
    3. Dashboard improvements for visibility
    4. Runbook automation opportunities
    5. Chaos engineering scenarios for testing
    Ensure alerts are actionable and reduce noise.
    Provide structured output with: NEW_ALERTS, SLO_ADJUSTMENTS, DASHBOARD_UPDATES,
    RUNBOOK_AUTOMATION, CHAOS_SCENARIOS.
```
Save output to `.incident-response/11-monitoring.md`.
Update `state.json`: set `current_step` to 13, add step 12 to `completed_steps`.
### Step 13: System Hardening
Read `.incident-response/05-investigation.md` and `.incident-response/10-postmortem.md`.
```
Task:
  subagent_type: "general-purpose"
  description: "System hardening for: $INCIDENT prevention"
  prompt: |
    You are a senior backend architect. Design system improvements to prevent recurrence.
    Investigation: [Insert contents of .incident-response/05-investigation.md]
    Postmortem: [Insert contents of .incident-response/10-postmortem.md]
    Propose:
    1. Architecture changes for resilience (circuit breakers, bulkheads)
    2. Graceful degradation strategies
    3. Capacity planning adjustments
    4. Technical debt prioritization
    5. Dependency reduction opportunities
    6. Implementation roadmap
    Provide structured output with: ARCHITECTURE_IMPROVEMENTS, RESILIENCE_PATTERNS,
    CAPACITY_PLAN, TECH_DEBT_ITEMS, IMPLEMENTATION_ROADMAP.
```
Save output to `.incident-response/12-hardening.md`.
Update `state.json`: set `current_step` to "complete", add step 13 to `completed_steps`.
---
## Completion
Update `state.json`:
- Set `status` to `"complete"`
- Set `last_updated` to current timestamp
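Before marking the session complete, it may be worth confirming that every expected output file exists. The check below is a sketch and not part of the command as written; `missing_outputs` is a hypothetical helper, and the file list mirrors the outputs named in the steps above.

```python
from pathlib import Path

# Output files produced by Steps 1-13 (Steps 4-6 share 04 and 05).
EXPECTED_OUTPUTS = [
    "01-classification.md",
    "02-observability.md",
    "03-mitigation.md",
    "04-debugging.md",
    "05-investigation.md",
    "06-fix.md",
    "07-deployment.md",
    "08-communication.md",
    "09-customer-impact.md",
    "10-postmortem.md",
    "11-monitoring.md",
    "12-hardening.md",
]


def missing_outputs(state_dir: Path = Path(".incident-response")) -> list[str]:
    """Return the expected output files that are not present in state_dir."""
    return [name for name in EXPECTED_OUTPUTS if not (state_dir / name).is_file()]
```

An empty list means all twelve files are in place; anything else points at a step that did not write its output.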
Present the final summary:
```
Incident response complete: $INCIDENT
## Files Created
[List all .incident-response/ output files]
## Response Summary
- Classification: .incident-response/01-classification.md
- Observability: .incident-response/02-observability.md
- Mitigation: .incident-response/03-mitigation.md
- Debugging: .incident-response/04-debugging.md
- Investigation: .incident-response/05-investigation.md
- Fix: .incident-response/06-fix.md
- Deployment: .incident-response/07-deployment.md
- Communication: .incident-response/08-communication.md
- Customer Impact: .incident-response/09-customer-impact.md
- Postmortem: .incident-response/10-postmortem.md
- Monitoring: .incident-response/11-monitoring.md
- Hardening: .incident-response/12-hardening.md
## Immediate Follow-ups
1. Verify service stability over the next 24 hours
2. Complete all postmortem action items
3. Deploy monitoring enhancements within 1 week
4. Schedule system hardening work
5. Conduct team learning session on lessons learned
## Success Criteria
- Service restored within SLA targets
- Postmortem completed within 48 hours
- All action items assigned with deadlines
- Monitoring improvements deployed within 1 week
- No recurrence of the same root cause
```
Production incident requiring immediate response: $ARGUMENTS