Rewrites 14 commands across 11 plugins to remove all cross-plugin subagent_type references (e.g., "unit-testing::test-automator"), which break when plugins are installed standalone. Each command now uses only local bundled agents or general-purpose with role context in the prompt. All rewritten commands follow conductor-style patterns: - CRITICAL BEHAVIORAL RULES with strong directives - State files for session tracking and resume support - Phase checkpoints requiring explicit user approval - File-based context passing between steps Also fixes 4 plugin.json files missing version/license fields and adds plugin.json for dotnet-contribution. Closes #433
18 KiB
description, argument-hint
| description | argument-hint |
|---|---|
| Orchestrate multi-agent incident response with modern SRE practices for rapid resolution and learning | <incident description> [--severity P0|P1|P2|P3] |
Incident Response Orchestrator
CRITICAL BEHAVIORAL RULES
You MUST follow these rules exactly. Violating any of them is a failure.
- Execute steps in order. Do NOT skip ahead, reorder, or merge steps.
- Write output files. Each step MUST produce its output file in
.incident-response/before the next step begins. Read from prior step files — do NOT rely on context window memory. - Stop at checkpoints. When you reach a
PHASE CHECKPOINT, you MUST stop and wait for explicit user approval before continuing. Use the AskUserQuestion tool with clear options. - Halt on failure. If any step fails (agent error, test failure, missing dependency), STOP immediately. Present the error and ask the user how to proceed. Do NOT silently continue.
- Use only local agents. All
subagent_typereferences use agents bundled with this plugin orgeneral-purpose. No cross-plugin dependencies. - Never enter plan mode autonomously. Do NOT use EnterPlanMode. This command IS the plan — execute it.
Pre-flight Checks
Before starting, perform these checks:
1. Check for existing session
Check if .incident-response/state.json exists:
-
If it exists and
statusis"in_progress": Read it, display the current step, and ask the user:Found an in-progress incident response session: Incident: [incident from state] Severity: [severity from state] Current step: [step from state] 1. Resume from where we left off 2. Start fresh (archives existing session) -
If it exists and
statusis"complete": Ask whether to archive and start fresh.
2. Initialize state
Create .incident-response/ directory and state.json:
{
"incident": "$ARGUMENTS",
"status": "in_progress",
"severity": "P1",
"current_step": 1,
"current_phase": 1,
"completed_steps": [],
"files_created": [],
"started_at": "ISO_TIMESTAMP",
"last_updated": "ISO_TIMESTAMP"
}
Parse $ARGUMENTS for --severity flag. Default to P1 if not specified.
3. Parse incident description
Extract the incident description from $ARGUMENTS (everything before the flags). This is referenced as $INCIDENT in prompts below.
Phase 1: Detection & Triage (Steps 1-3)
Step 1: Incident Detection and Classification
Use the Task tool to launch the incident responder agent:
Task:
subagent_type: "incident-responder"
description: "URGENT: Classify incident: $INCIDENT"
prompt: |
URGENT: Detect and classify incident: $INCIDENT
Determine:
1. Incident severity (P0-P3) based on impact assessment
2. Affected services and their dependencies
3. User impact and business risk
4. Initial incident command structure needed
5. SLO violation status and error budget impact
Check: error budgets, recent deployments, configuration changes, and monitoring alerts.
Provide structured output with: SEVERITY, AFFECTED_SERVICES, USER_IMPACT,
BUSINESS_RISK, INCIDENT_COMMAND, SLO_STATUS.
Save output to .incident-response/01-classification.md.
Update state.json: set current_step to 2, update severity from classification, add step 1 to completed_steps.
Step 2: Observability Analysis
Read .incident-response/01-classification.md.
Task:
subagent_type: "general-purpose"
description: "Observability sweep for incident: $INCIDENT"
prompt: |
You are an observability engineer. Perform rapid observability sweep for this incident.
Context: [Insert contents of .incident-response/01-classification.md]
Query and analyze:
1. Distributed tracing (OpenTelemetry/Jaeger) for request flow
2. Metrics correlation (Prometheus/Grafana/DataDog) for anomalies
3. Log aggregation (ELK/Splunk) for error patterns
4. APM data for performance degradation points
5. Real User Monitoring for user experience impact
Identify anomalies, error patterns, and service degradation points.
Provide structured output with: TRACE_ANALYSIS, METRICS_ANOMALIES, LOG_PATTERNS,
APM_FINDINGS, RUM_IMPACT, SERVICE_HEALTH_MATRIX.
Save output to .incident-response/02-observability.md.
Update state.json: set current_step to 3, add step 2 to completed_steps.
Step 3: Initial Mitigation
Read .incident-response/01-classification.md and .incident-response/02-observability.md.
Task:
subagent_type: "incident-responder"
description: "Immediate mitigation for: $INCIDENT"
prompt: |
Implement immediate mitigation for this incident.
Classification: [Insert contents of .incident-response/01-classification.md]
Observability: [Insert contents of .incident-response/02-observability.md]
Actions to evaluate and implement:
1. Traffic throttling/rerouting if needed
2. Feature flag disabling for affected features
3. Circuit breaker activation
4. Rollback assessment for recent deployments
5. Scale resources if capacity-related
Prioritize user experience restoration.
Provide structured output with: MITIGATION_ACTIONS, TEMPORARY_FIXES,
ROLLBACK_DECISIONS, SERVICE_STATUS_AFTER, USER_IMPACT_REDUCTION.
Save output to .incident-response/03-mitigation.md.
Update state.json: set current_step to "checkpoint-1", add step 3 to completed_steps.
PHASE CHECKPOINT 1 — User Approval Required
You MUST stop here and present the triage results.
Display a summary from .incident-response/01-classification.md and .incident-response/03-mitigation.md and ask:
Triage and initial mitigation complete.
Severity: [from classification]
Affected services: [from classification]
Mitigation status: [from mitigation]
User impact reduction: [from mitigation]
1. Approve — proceed to investigation and root cause analysis
2. Request changes — adjust mitigation or severity
3. Pause — save progress and stop here (mitigation in place)
Do NOT proceed to Phase 2 until the user approves.
Phase 2: Investigation & Root Cause (Steps 4-6)
Step 4: Deep System Debugging
Read .incident-response/02-observability.md and .incident-response/03-mitigation.md.
Task:
subagent_type: "debugger"
description: "Deep debugging for: $INCIDENT"
prompt: |
Conduct deep debugging for this incident using observability data.
Observability: [Insert contents of .incident-response/02-observability.md]
Mitigation: [Insert contents of .incident-response/03-mitigation.md]
Investigate:
1. Stack traces and error logs
2. Database query performance and locks
3. Network latency and timeouts
4. Memory leaks and CPU spikes
5. Dependency failures and cascading errors
Apply Five Whys analysis to identify root cause.
Provide structured output with: ROOT_CAUSE, CONTRIBUTING_FACTORS,
DEPENDENCY_IMPACT_MAP, FIVE_WHYS_ANALYSIS.
Save output to .incident-response/04-debugging.md.
Update state.json: set current_step to 5, add step 4 to completed_steps.
Step 5: Security Assessment
Read .incident-response/04-debugging.md.
Task:
subagent_type: "general-purpose"
description: "Security assessment for: $INCIDENT"
prompt: |
You are a security auditor. Assess security implications of this incident.
Debug findings: [Insert contents of .incident-response/04-debugging.md]
Check:
1. DDoS attack indicators
2. Authentication/authorization failures
3. Data exposure risks
4. Certificate issues
5. Suspicious access patterns
Review WAF logs, security groups, and audit trails.
Provide structured output with: SECURITY_ASSESSMENT, BREACH_ANALYSIS,
VULNERABILITY_IDENTIFICATION, DATA_EXPOSURE_RISK, REMEDIATION_STEPS.
Step 6: Performance Analysis
Read .incident-response/04-debugging.md.
Launch in parallel with Step 5:
Task:
subagent_type: "general-purpose"
description: "Performance analysis for: $INCIDENT"
prompt: |
You are a performance engineer. Analyze performance aspects of this incident.
Debug findings: [Insert contents of .incident-response/04-debugging.md]
Examine:
1. Resource utilization patterns
2. Query optimization opportunities
3. Caching effectiveness
4. Load balancer health
5. CDN performance
6. Autoscaling triggers
Identify bottlenecks and capacity issues.
Provide structured output with: PERFORMANCE_BOTTLENECKS, RESOURCE_RECOMMENDATIONS,
OPTIMIZATION_OPPORTUNITIES, CAPACITY_ISSUES.
After both complete, consolidate into .incident-response/05-investigation.md:
# Investigation: $INCIDENT
## Root Cause (from debugging)
[From Step 4]
## Security Assessment
[From Step 5]
## Performance Analysis
[From Step 6]
## Combined Findings
[Synthesis of all investigation results]
Update state.json: set current_step to "checkpoint-2", add steps 4-6 to completed_steps.
PHASE CHECKPOINT 2 — User Approval Required
Display investigation results from .incident-response/05-investigation.md and ask:
Investigation complete. Please review .incident-response/05-investigation.md
Root cause: [brief summary]
Security concerns: [summary]
Performance issues: [summary]
1. Approve — proceed to fix implementation and deployment
2. Request changes — investigate further
3. Pause — save progress and stop here
Do NOT proceed to Phase 3 until the user approves.
Phase 3: Resolution & Recovery (Steps 7-8)
Step 7: Fix Implementation
Read .incident-response/05-investigation.md.
Task:
subagent_type: "general-purpose"
description: "Implement production fix for: $INCIDENT"
prompt: |
You are a senior backend architect. Design and implement a production fix for this incident.
Investigation: [Insert contents of .incident-response/05-investigation.md]
Requirements:
1. Minimal viable fix for rapid deployment
2. Risk assessment and rollback capability
3. Staged rollout plan with monitoring
4. Validation criteria and health checks
5. Consider both immediate fix and long-term solution
Provide structured output with: FIX_IMPLEMENTATION, DEPLOYMENT_STRATEGY,
VALIDATION_PLAN, ROLLBACK_PROCEDURES, LONG_TERM_SOLUTION.
Save output to .incident-response/06-fix.md.
Update state.json: set current_step to 8, add step 7 to completed_steps.
Step 8: Deployment and Validation
Read .incident-response/06-fix.md.
Task:
subagent_type: "devops-troubleshooter"
description: "Deploy and validate fix for: $INCIDENT"
prompt: |
Execute emergency deployment for incident fix.
Fix details: [Insert contents of .incident-response/06-fix.md]
Process:
1. Blue-green or canary deployment strategy
2. Progressive rollout with monitoring
3. Health check validation at each stage
4. Rollback triggers configured
5. Real-time monitoring during deployment
Provide structured output with: DEPLOYMENT_STATUS, VALIDATION_RESULTS,
MONITORING_DASHBOARD, ROLLBACK_READINESS, SERVICE_HEALTH_POST_DEPLOY.
Save output to .incident-response/07-deployment.md.
Update state.json: set current_step to "checkpoint-3", add step 8 to completed_steps.
PHASE CHECKPOINT 3 — User Approval Required
Display deployment results from .incident-response/07-deployment.md and ask:
Fix deployed and validated.
Deployment status: [from deployment]
Service health: [from deployment]
Rollback ready: [yes/no]
1. Approve — proceed to communication and postmortem
2. Rollback — revert the deployment
3. Pause — save progress and monitor
Do NOT proceed to Phase 4 until the user approves.
Phase 4: Communication & Coordination (Steps 9-10)
Step 9: Stakeholder Communication
Read .incident-response/01-classification.md, .incident-response/05-investigation.md, and .incident-response/07-deployment.md.
Task:
subagent_type: "general-purpose"
description: "Manage incident communication for: $INCIDENT"
prompt: |
You are a communications specialist. Manage incident communication for this incident.
Classification: [Insert contents of .incident-response/01-classification.md]
Investigation: [Insert contents of .incident-response/05-investigation.md]
Deployment: [Insert contents of .incident-response/07-deployment.md]
Create:
1. Status page updates (public-facing)
2. Internal engineering updates (technical details)
3. Executive summary (business impact/ETA)
4. Customer support briefing (talking points)
5. Timeline documentation with key decisions
Provide structured output with: STATUS_PAGE_UPDATE, ENGINEERING_UPDATE,
EXECUTIVE_SUMMARY, SUPPORT_BRIEFING, INCIDENT_TIMELINE.
Save output to .incident-response/08-communication.md.
Update state.json: set current_step to 10, add step 9 to completed_steps.
Step 10: Customer Impact Assessment
Read .incident-response/01-classification.md and .incident-response/07-deployment.md.
Task:
subagent_type: "incident-responder"
description: "Assess customer impact for: $INCIDENT"
prompt: |
Assess and document customer impact for this incident.
Classification: [Insert contents of .incident-response/01-classification.md]
Resolution: [Insert contents of .incident-response/07-deployment.md]
Analyze:
1. Affected user segments and geography
2. Failed transactions or data loss
3. SLA violations and contractual implications
4. Customer support ticket volume
5. Revenue impact estimation
6. Proactive customer outreach recommendations
Provide structured output with: CUSTOMER_IMPACT_REPORT, SLA_ANALYSIS,
REVENUE_IMPACT, OUTREACH_RECOMMENDATIONS.
Save output to .incident-response/09-customer-impact.md.
Update state.json: set current_step to 11, add step 10 to completed_steps.
Phase 5: Postmortem & Prevention (Steps 11-13)
Step 11: Blameless Postmortem
Read all .incident-response/*.md files.
Task:
subagent_type: "general-purpose"
description: "Blameless postmortem for: $INCIDENT"
prompt: |
You are an SRE documentation specialist. Conduct a blameless postmortem for this incident.
Context: [Insert contents of all .incident-response/*.md files]
Document:
1. Complete incident timeline with decisions
2. Root cause and contributing factors (systems focus, not people)
3. What went well in response
4. What could improve
5. Action items with owners and deadlines
6. Lessons learned for team education
Follow SRE postmortem best practices. Focus on systems, not blame.
Provide structured output with: INCIDENT_TIMELINE, ROOT_CAUSE_SUMMARY,
WHAT_WENT_WELL, IMPROVEMENTS, ACTION_ITEMS, LESSONS_LEARNED.
Save output to .incident-response/10-postmortem.md.
Update state.json: set current_step to 12, add step 11 to completed_steps.
Step 12: Monitoring Enhancement
Read .incident-response/05-investigation.md and .incident-response/10-postmortem.md.
Task:
subagent_type: "general-purpose"
description: "Enhance monitoring for: $INCIDENT prevention"
prompt: |
You are an observability engineer. Enhance monitoring to prevent recurrence of this incident.
Investigation: [Insert contents of .incident-response/05-investigation.md]
Postmortem: [Insert contents of .incident-response/10-postmortem.md]
Implement:
1. New alerts for early detection
2. SLI/SLO adjustments if needed
3. Dashboard improvements for visibility
4. Runbook automation opportunities
5. Chaos engineering scenarios for testing
Ensure alerts are actionable and reduce noise.
Provide structured output with: NEW_ALERTS, SLO_ADJUSTMENTS, DASHBOARD_UPDATES,
RUNBOOK_AUTOMATION, CHAOS_SCENARIOS.
Save output to .incident-response/11-monitoring.md.
Update state.json: set current_step to 13, add step 12 to completed_steps.
Step 13: System Hardening
Read .incident-response/05-investigation.md and .incident-response/10-postmortem.md.
Task:
subagent_type: "general-purpose"
description: "System hardening for: $INCIDENT prevention"
prompt: |
You are a senior backend architect. Design system improvements to prevent recurrence.
Investigation: [Insert contents of .incident-response/05-investigation.md]
Postmortem: [Insert contents of .incident-response/10-postmortem.md]
Propose:
1. Architecture changes for resilience (circuit breakers, bulkheads)
2. Graceful degradation strategies
3. Capacity planning adjustments
4. Technical debt prioritization
5. Dependency reduction opportunities
6. Implementation roadmap
Provide structured output with: ARCHITECTURE_IMPROVEMENTS, RESILIENCE_PATTERNS,
CAPACITY_PLAN, TECH_DEBT_ITEMS, IMPLEMENTATION_ROADMAP.
Save output to .incident-response/12-hardening.md.
Update state.json: set current_step to "complete", add step 13 to completed_steps.
Completion
Update state.json:
- Set
statusto"complete" - Set
last_updatedto current timestamp
Present the final summary:
Incident response complete: $INCIDENT
## Files Created
[List all .incident-response/ output files]
## Response Summary
- Classification: .incident-response/01-classification.md
- Observability: .incident-response/02-observability.md
- Mitigation: .incident-response/03-mitigation.md
- Debugging: .incident-response/04-debugging.md
- Investigation: .incident-response/05-investigation.md
- Fix: .incident-response/06-fix.md
- Deployment: .incident-response/07-deployment.md
- Communication: .incident-response/08-communication.md
- Customer Impact: .incident-response/09-customer-impact.md
- Postmortem: .incident-response/10-postmortem.md
- Monitoring: .incident-response/11-monitoring.md
- Hardening: .incident-response/12-hardening.md
## Immediate Follow-ups
1. Verify service stability over the next 24 hours
2. Complete all postmortem action items
3. Deploy monitoring enhancements within 1 week
4. Schedule system hardening work
5. Conduct team learning session on lessons learned
## Success Criteria
- Service restored within SLA targets
- Postmortem completed within 48 hours
- All action items assigned with deadlines
- Monitoring improvements deployed within 1 week
- No recurrence of the same root cause
Production incident requiring immediate response: $ARGUMENTS