agents/plugins/incident-response/commands/incident-response.md
Seth Hobson 4d504ed8fa fix: eliminate cross-plugin dependencies and modernize plugin.json across marketplace
Rewrites 14 commands across 11 plugins to remove all cross-plugin
subagent_type references (e.g., "unit-testing::test-automator"), which
break when plugins are installed standalone. Each command now uses only
local bundled agents or general-purpose with role context in the prompt.

All rewritten commands follow conductor-style patterns:
- CRITICAL BEHAVIORAL RULES with strong directives
- State files for session tracking and resume support
- Phase checkpoints requiring explicit user approval
- File-based context passing between steps

Also fixes 4 plugin.json files missing version/license fields and adds
plugin.json for dotnet-contribution.

Closes #433
2026-02-06 19:34:26 -05:00


description: Orchestrate multi-agent incident response with modern SRE practices for rapid resolution and learning
argument-hint: <incident description> [--severity P0|P1|P2|P3]

Incident Response Orchestrator

CRITICAL BEHAVIORAL RULES

You MUST follow these rules exactly. Violating any of them is a failure.

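  1. Execute steps in order. Do NOT skip ahead, reorder, or merge steps. The only exception is a step that explicitly says to launch in parallel with another (Steps 5 and 6).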
  2. Write output files. Each step MUST produce its output file in .incident-response/ before the next step begins. Read from prior step files — do NOT rely on context window memory.
  3. Stop at checkpoints. When you reach a PHASE CHECKPOINT, you MUST stop and wait for explicit user approval before continuing. Use the AskUserQuestion tool with clear options.
  4. Halt on failure. If any step fails (agent error, test failure, missing dependency), STOP immediately. Present the error and ask the user how to proceed. Do NOT silently continue.
  5. Use only local agents. All subagent_type references use agents bundled with this plugin or general-purpose. No cross-plugin dependencies.
  6. Never enter plan mode autonomously. Do NOT use EnterPlanMode. This command IS the plan — execute it.

Pre-flight Checks

Before starting, perform these checks:

1. Check for existing session

Check if .incident-response/state.json exists:

  • If it exists and status is "in_progress": Read it, display the current step, and ask the user:

    Found an in-progress incident response session:
    Incident: [incident from state]
    Severity: [severity from state]
    Current step: [step from state]
    
    1. Resume from where we left off
    2. Start fresh (archives existing session)
    
  • If it exists and status is "complete": Ask whether to archive and start fresh.
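The session check above can be sketched as a small function. This is a minimal illustration, not part of the command itself: the function name `check_existing_session` and its return labels are hypothetical, and treating an unrecognized status like a completed session (ask before overwriting) is an assumption the text does not state.

```python
import json
from pathlib import Path

# Location taken from the text above.
STATE_FILE = Path(".incident-response") / "state.json"


def check_existing_session(state_file: Path = STATE_FILE) -> str:
    """Return which pre-flight branch applies.

    'none'           -> no state file, start a new session
    'resume_prompt'  -> offer "resume" or "start fresh (archive)"
    'archive_prompt' -> ask whether to archive and start fresh
    """
    if not state_file.exists():
        return "none"
    state = json.loads(state_file.read_text(encoding="utf-8"))
    if state.get("status") == "in_progress":
        return "resume_prompt"
    # Assumption: any other status (including "complete") is handled by
    # asking before anything is overwritten.
    return "archive_prompt"
```

Whatever the branch, the prompt itself is still put to the user; the sketch only decides which question to ask.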

2. Initialize state

Create the .incident-response/ directory and write state.json inside it:

{
  "incident": "$ARGUMENTS",
  "status": "in_progress",
  "severity": "P1",
  "current_step": 1,
  "current_phase": 1,
  "completed_steps": [],
  "files_created": [],
  "started_at": "ISO_TIMESTAMP",
  "last_updated": "ISO_TIMESTAMP"
}

Parse $ARGUMENTS for the --severity flag and record its value in severity. Default to P1 if the flag is absent.
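As a rough sketch of the initialization described above, the following writes the same fields as the JSON template. The helper name `initialize_state` is hypothetical, and using UTC ISO 8601 timestamps for ISO_TIMESTAMP is an assumption.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def initialize_state(incident: str, severity: str = "P1",
                     state_dir: Path = Path(".incident-response")) -> dict:
    """Create the state directory and write the initial state.json."""
    state_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: timestamps are UTC in ISO 8601 format.
    now = datetime.now(timezone.utc).isoformat()
    state = {
        "incident": incident,
        "status": "in_progress",
        "severity": severity,
        "current_step": 1,
        "current_phase": 1,
        "completed_steps": [],
        "files_created": [],
        "started_at": now,
        "last_updated": now,
    }
    (state_dir / "state.json").write_text(
        json.dumps(state, indent=2), encoding="utf-8")
    return state
```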

3. Parse incident description

Extract the incident description from $ARGUMENTS (everything before the flags). This is referenced as $INCIDENT in prompts below.
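One possible way to split $ARGUMENTS into the description and the severity is sketched below. The function name `parse_arguments` is hypothetical. Rejecting values outside P0-P3 with an error and accepting lowercase input are assumptions; the text only specifies the P1 default.

```python
import re

# Matches "--severity <value>" as shown in the argument hint.
SEVERITY_FLAG = re.compile(r"(?:^|\s)--severity\s+(\S+)")
VALID_SEVERITIES = {"P0", "P1", "P2", "P3"}


def parse_arguments(arguments: str) -> tuple:
    """Return (incident description, severity) from the raw argument string."""
    match = SEVERITY_FLAG.search(arguments)
    if match is None:
        # No flag given: the whole string is the description, severity is P1.
        return arguments.strip(), "P1"
    severity = match.group(1).upper()
    if severity not in VALID_SEVERITIES:
        # Assumption: an unknown severity is an error rather than a silent default.
        raise ValueError(f"Unknown severity: {match.group(1)}")
    # Everything before the flag is the incident description.
    return arguments[:match.start()].strip(), severity
```

For example, `parse_arguments("API gateway returning 502s --severity P0")` yields the description `"API gateway returning 502s"` with severity `"P0"`.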


Phase 1: Detection & Triage (Steps 1-3)

Step 1: Incident Detection and Classification

Use the Task tool to launch the incident responder agent:

Task:
  subagent_type: "incident-responder"
  description: "URGENT: Classify incident: $INCIDENT"
  prompt: |
    URGENT: Detect and classify incident: $INCIDENT

    Determine:
    1. Incident severity (P0-P3) based on impact assessment
    2. Affected services and their dependencies
    3. User impact and business risk
    4. Initial incident command structure needed
    5. SLO violation status and error budget impact

    Check: error budgets, recent deployments, configuration changes, and monitoring alerts.

    Provide structured output with: SEVERITY, AFFECTED_SERVICES, USER_IMPACT,
    BUSINESS_RISK, INCIDENT_COMMAND, SLO_STATUS.

Save output to .incident-response/01-classification.md.

Update state.json: set current_step to 2, update severity from classification, add step 1 to completed_steps.
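The state update that follows each step could look roughly like this. The helper `record_step` is hypothetical. Appending the output path to files_created and refreshing last_updated on every step are assumptions; the text only lists current_step, severity, and completed_steps here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, Union


def record_step(state_file: Path, step: int, next_step: Union[int, str],
                output_file: Optional[str] = None,
                severity: Optional[str] = None) -> dict:
    """Mark a step as done and point current_step at what comes next."""
    state = json.loads(state_file.read_text(encoding="utf-8"))
    if step not in state["completed_steps"]:
        state["completed_steps"].append(step)
    # Assumption: files_created tracks each step's output path.
    if output_file and output_file not in state["files_created"]:
        state["files_created"].append(output_file)
    if severity:
        state["severity"] = severity  # e.g. updated from the classification
    state["current_step"] = next_step  # an int, "checkpoint-N", or "complete"
    state["last_updated"] = datetime.now(timezone.utc).isoformat()
    state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state
```

The same call covers later steps, for instance `record_step(path, step=3, next_step="checkpoint-1")`.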

Step 2: Observability Analysis

Read .incident-response/01-classification.md.

Task:
  subagent_type: "general-purpose"
  description: "Observability sweep for incident: $INCIDENT"
  prompt: |
    You are an observability engineer. Perform rapid observability sweep for this incident.

    Context: [Insert contents of .incident-response/01-classification.md]

    Query and analyze:
    1. Distributed tracing (OpenTelemetry/Jaeger) for request flow
    2. Metrics correlation (Prometheus/Grafana/DataDog) for anomalies
    3. Log aggregation (ELK/Splunk) for error patterns
    4. APM data for performance degradation points
    5. Real User Monitoring for user experience impact

    Identify anomalies, error patterns, and service degradation points.

    Provide structured output with: TRACE_ANALYSIS, METRICS_ANOMALIES, LOG_PATTERNS,
    APM_FINDINGS, RUM_IMPACT, SERVICE_HEALTH_MATRIX.

Save output to .incident-response/02-observability.md.

Update state.json: set current_step to 3, add step 2 to completed_steps.

Step 3: Initial Mitigation

Read .incident-response/01-classification.md and .incident-response/02-observability.md.

Task:
  subagent_type: "incident-responder"
  description: "Immediate mitigation for: $INCIDENT"
  prompt: |
    Implement immediate mitigation for this incident.

    Classification: [Insert contents of .incident-response/01-classification.md]
    Observability: [Insert contents of .incident-response/02-observability.md]

    Actions to evaluate and implement:
    1. Traffic throttling/rerouting if needed
    2. Feature flag disabling for affected features
    3. Circuit breaker activation
    4. Rollback assessment for recent deployments
    5. Scale resources if capacity-related

    Prioritize user experience restoration.

    Provide structured output with: MITIGATION_ACTIONS, TEMPORARY_FIXES,
    ROLLBACK_DECISIONS, SERVICE_STATUS_AFTER, USER_IMPACT_REDUCTION.

Save output to .incident-response/03-mitigation.md.

Update state.json: set current_step to "checkpoint-1", add step 3 to completed_steps.


PHASE CHECKPOINT 1 — User Approval Required

You MUST stop here and present the triage results.

Display a summary from .incident-response/01-classification.md and .incident-response/03-mitigation.md and ask:

Triage and initial mitigation complete.

Severity: [from classification]
Affected services: [from classification]
Mitigation status: [from mitigation]
User impact reduction: [from mitigation]

1. Approve — proceed to investigation and root cause analysis
2. Request changes — adjust mitigation or severity
3. Pause — save progress and stop here (mitigation in place)

Do NOT proceed to Phase 2 until the user approves.
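Because current_step holds either a step number, a "checkpoint-N" marker, or "complete", a resumed session has to tell these apart before deciding what to do. The sketch below shows one way to do that; the function name `resume_point` and its return shape are hypothetical.

```python
import re
from typing import Optional, Tuple, Union


def resume_point(current_step: Union[int, str]) -> Tuple[str, Optional[int]]:
    """Classify a current_step value from state.json.

    Returns ("step", n), ("checkpoint", n), or ("complete", None).
    """
    if isinstance(current_step, int):
        return ("step", current_step)
    if current_step == "complete":
        return ("complete", None)
    match = re.fullmatch(r"checkpoint-(\d+)", str(current_step))
    if match:
        # Stopped at a checkpoint: user approval is still required.
        return ("checkpoint", int(match.group(1)))
    raise ValueError(f"Unrecognized current_step: {current_step!r}")
```

A result of `("checkpoint", 1)` means the session must present the triage summary again and wait for approval rather than start Phase 2.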


Phase 2: Investigation & Root Cause (Steps 4-6)

Step 4: Deep System Debugging

Read .incident-response/02-observability.md and .incident-response/03-mitigation.md.

Task:
  subagent_type: "debugger"
  description: "Deep debugging for: $INCIDENT"
  prompt: |
    Conduct deep debugging for this incident using observability data.

    Observability: [Insert contents of .incident-response/02-observability.md]
    Mitigation: [Insert contents of .incident-response/03-mitigation.md]

    Investigate:
    1. Stack traces and error logs
    2. Database query performance and locks
    3. Network latency and timeouts
    4. Memory leaks and CPU spikes
    5. Dependency failures and cascading errors

    Apply Five Whys analysis to identify root cause.

    Provide structured output with: ROOT_CAUSE, CONTRIBUTING_FACTORS,
    DEPENDENCY_IMPACT_MAP, FIVE_WHYS_ANALYSIS.

Save output to .incident-response/04-debugging.md.

Update state.json: set current_step to 5, add step 4 to completed_steps.

Step 5: Security Assessment

Read .incident-response/04-debugging.md. Steps 5 and 6 both depend only on this file, so launch their Tasks in parallel.

Task:
  subagent_type: "general-purpose"
  description: "Security assessment for: $INCIDENT"
  prompt: |
    You are a security auditor. Assess security implications of this incident.

    Debug findings: [Insert contents of .incident-response/04-debugging.md]

    Check:
    1. DDoS attack indicators
    2. Authentication/authorization failures
    3. Data exposure risks
    4. Certificate issues
    5. Suspicious access patterns

    Review WAF logs, security groups, and audit trails.

    Provide structured output with: SECURITY_ASSESSMENT, BREACH_ANALYSIS,
    VULNERABILITY_IDENTIFICATION, DATA_EXPOSURE_RISK, REMEDIATION_STEPS.

Step 6: Performance Analysis

Read .incident-response/04-debugging.md.

Launch in parallel with Step 5:

Task:
  subagent_type: "general-purpose"
  description: "Performance analysis for: $INCIDENT"
  prompt: |
    You are a performance engineer. Analyze performance aspects of this incident.

    Debug findings: [Insert contents of .incident-response/04-debugging.md]

    Examine:
    1. Resource utilization patterns
    2. Query optimization opportunities
    3. Caching effectiveness
    4. Load balancer health
    5. CDN performance
    6. Autoscaling triggers

    Identify bottlenecks and capacity issues.

    Provide structured output with: PERFORMANCE_BOTTLENECKS, RESOURCE_RECOMMENDATIONS,
    OPTIMIZATION_OPPORTUNITIES, CAPACITY_ISSUES.

After both complete, consolidate the Step 4 findings (from .incident-response/04-debugging.md) and the outputs of Steps 5 and 6 into .incident-response/05-investigation.md:

# Investigation: $INCIDENT

## Root Cause (from debugging)

[From Step 4]

## Security Assessment

[From Step 5]

## Performance Analysis

[From Step 6]

## Combined Findings

[Synthesis of all investigation results]

Update state.json: set current_step to "checkpoint-2", add steps 5 and 6 to completed_steps.
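The consolidation into 05-investigation.md can be sketched as a simple template fill. The function name `consolidate_investigation` and its parameters are hypothetical; the section headings follow the template above, and the synthesis text is assumed to be written separately and passed in.

```python
from pathlib import Path

# Section layout copied from the consolidation template.
TEMPLATE = """# Investigation: {incident}

## Root Cause (from debugging)

{debugging}

## Security Assessment

{security}

## Performance Analysis

{performance}

## Combined Findings

{synthesis}
"""


def consolidate_investigation(incident: str, debugging: str, security: str,
                              performance: str, synthesis: str,
                              state_dir: Path = Path(".incident-response")) -> Path:
    """Write the combined investigation file and return its path."""
    state_dir.mkdir(parents=True, exist_ok=True)
    out = state_dir / "05-investigation.md"
    out.write_text(TEMPLATE.format(
        incident=incident,
        debugging=debugging.strip(),
        security=security.strip(),
        performance=performance.strip(),
        synthesis=synthesis.strip(),
    ), encoding="utf-8")
    return out
```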


PHASE CHECKPOINT 2 — User Approval Required

Display investigation results from .incident-response/05-investigation.md and ask:

Investigation complete. Please review .incident-response/05-investigation.md

Root cause: [brief summary]
Security concerns: [summary]
Performance issues: [summary]

1. Approve — proceed to fix implementation and deployment
2. Request changes — investigate further
3. Pause — save progress and stop here

Do NOT proceed to Phase 3 until the user approves.


Phase 3: Resolution & Recovery (Steps 7-8)

Step 7: Fix Implementation

Read .incident-response/05-investigation.md.

Task:
  subagent_type: "general-purpose"
  description: "Implement production fix for: $INCIDENT"
  prompt: |
    You are a senior backend architect. Design and implement a production fix for this incident.

    Investigation: [Insert contents of .incident-response/05-investigation.md]

    Requirements:
    1. Minimal viable fix for rapid deployment
    2. Risk assessment and rollback capability
    3. Staged rollout plan with monitoring
    4. Validation criteria and health checks
    5. Consider both immediate fix and long-term solution

    Provide structured output with: FIX_IMPLEMENTATION, DEPLOYMENT_STRATEGY,
    VALIDATION_PLAN, ROLLBACK_PROCEDURES, LONG_TERM_SOLUTION.

Save output to .incident-response/06-fix.md.

Update state.json: set current_step to 8, add step 7 to completed_steps.

Step 8: Deployment and Validation

Read .incident-response/06-fix.md.

Task:
  subagent_type: "devops-troubleshooter"
  description: "Deploy and validate fix for: $INCIDENT"
  prompt: |
    Execute emergency deployment for incident fix.

    Fix details: [Insert contents of .incident-response/06-fix.md]

    Process:
    1. Blue-green or canary deployment strategy
    2. Progressive rollout with monitoring
    3. Health check validation at each stage
    4. Rollback triggers configured
    5. Real-time monitoring during deployment

    Provide structured output with: DEPLOYMENT_STATUS, VALIDATION_RESULTS,
    MONITORING_DASHBOARD, ROLLBACK_READINESS, SERVICE_HEALTH_POST_DEPLOY.

Save output to .incident-response/07-deployment.md.

Update state.json: set current_step to "checkpoint-3", add step 8 to completed_steps.


PHASE CHECKPOINT 3 — User Approval Required

Display deployment results from .incident-response/07-deployment.md and ask:

Fix deployed and validated.

Deployment status: [from deployment]
Service health: [from deployment]
Rollback ready: [yes/no]

1. Approve — proceed to communication and postmortem
2. Rollback — revert the deployment
3. Pause — save progress and monitor

Do NOT proceed to Phase 4 until the user approves.


Phase 4: Communication & Coordination (Steps 9-10)

Step 9: Stakeholder Communication

Read .incident-response/01-classification.md, .incident-response/05-investigation.md, and .incident-response/07-deployment.md.

Task:
  subagent_type: "general-purpose"
  description: "Manage incident communication for: $INCIDENT"
  prompt: |
    You are a communications specialist. Manage incident communication for this incident.

    Classification: [Insert contents of .incident-response/01-classification.md]
    Investigation: [Insert contents of .incident-response/05-investigation.md]
    Deployment: [Insert contents of .incident-response/07-deployment.md]

    Create:
    1. Status page updates (public-facing)
    2. Internal engineering updates (technical details)
    3. Executive summary (business impact/ETA)
    4. Customer support briefing (talking points)
    5. Timeline documentation with key decisions

    Provide structured output with: STATUS_PAGE_UPDATE, ENGINEERING_UPDATE,
    EXECUTIVE_SUMMARY, SUPPORT_BRIEFING, INCIDENT_TIMELINE.

Save output to .incident-response/08-communication.md.

Update state.json: set current_step to 10, add step 9 to completed_steps.

Step 10: Customer Impact Assessment

Read .incident-response/01-classification.md and .incident-response/07-deployment.md.

Task:
  subagent_type: "incident-responder"
  description: "Assess customer impact for: $INCIDENT"
  prompt: |
    Assess and document customer impact for this incident.

    Classification: [Insert contents of .incident-response/01-classification.md]
    Resolution: [Insert contents of .incident-response/07-deployment.md]

    Analyze:
    1. Affected user segments and geography
    2. Failed transactions or data loss
    3. SLA violations and contractual implications
    4. Customer support ticket volume
    5. Revenue impact estimation
    6. Proactive customer outreach recommendations

    Provide structured output with: CUSTOMER_IMPACT_REPORT, SLA_ANALYSIS,
    REVENUE_IMPACT, OUTREACH_RECOMMENDATIONS.

Save output to .incident-response/09-customer-impact.md.

Update state.json: set current_step to 11, add step 10 to completed_steps.


Phase 5: Postmortem & Prevention (Steps 11-13)

Step 11: Blameless Postmortem

Read all .incident-response/*.md files.

Task:
  subagent_type: "general-purpose"
  description: "Blameless postmortem for: $INCIDENT"
  prompt: |
    You are an SRE documentation specialist. Conduct a blameless postmortem for this incident.

    Context: [Insert contents of all .incident-response/*.md files]

    Document:
    1. Complete incident timeline with decisions
    2. Root cause and contributing factors (systems focus, not people)
    3. What went well in response
    4. What could improve
    5. Action items with owners and deadlines
    6. Lessons learned for team education

    Follow SRE postmortem best practices. Focus on systems, not blame.

    Provide structured output with: INCIDENT_TIMELINE, ROOT_CAUSE_SUMMARY,
    WHAT_WENT_WELL, IMPROVEMENTS, ACTION_ITEMS, LESSONS_LEARNED.

Save output to .incident-response/10-postmortem.md.

Update state.json: set current_step to 12, add step 11 to completed_steps.
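Step 11 passes the contents of every .incident-response/*.md file to the postmortem prompt. A minimal sketch of gathering that context follows. The function name `gather_context` and the `=== filename ===` separator are hypothetical choices, and sorting by filename relies on the numeric prefixes (01-, 02-, ...) used above.

```python
from pathlib import Path


def gather_context(state_dir: Path = Path(".incident-response")) -> str:
    """Concatenate all step output files in filename order."""
    sections = []
    # The numeric prefixes make lexical order match step order.
    for path in sorted(state_dir.glob("*.md")):
        body = path.read_text(encoding="utf-8").strip()
        sections.append(f"=== {path.name} ===\n{body}")
    return "\n\n".join(sections)
```

Only .md files are included, so state.json is left out of the postmortem context.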

Step 12: Monitoring Enhancement

Read .incident-response/05-investigation.md and .incident-response/10-postmortem.md.

Task:
  subagent_type: "general-purpose"
  description: "Enhance monitoring for: $INCIDENT prevention"
  prompt: |
    You are an observability engineer. Enhance monitoring to prevent recurrence of this incident.

    Investigation: [Insert contents of .incident-response/05-investigation.md]
    Postmortem: [Insert contents of .incident-response/10-postmortem.md]

    Implement:
    1. New alerts for early detection
    2. SLI/SLO adjustments if needed
    3. Dashboard improvements for visibility
    4. Runbook automation opportunities
    5. Chaos engineering scenarios for testing

    Ensure alerts are actionable and reduce noise.

    Provide structured output with: NEW_ALERTS, SLO_ADJUSTMENTS, DASHBOARD_UPDATES,
    RUNBOOK_AUTOMATION, CHAOS_SCENARIOS.

Save output to .incident-response/11-monitoring.md.

Update state.json: set current_step to 13, add step 12 to completed_steps.

Step 13: System Hardening

Read .incident-response/05-investigation.md and .incident-response/10-postmortem.md.

Task:
  subagent_type: "general-purpose"
  description: "System hardening for: $INCIDENT prevention"
  prompt: |
    You are a senior backend architect. Design system improvements to prevent recurrence.

    Investigation: [Insert contents of .incident-response/05-investigation.md]
    Postmortem: [Insert contents of .incident-response/10-postmortem.md]

    Propose:
    1. Architecture changes for resilience (circuit breakers, bulkheads)
    2. Graceful degradation strategies
    3. Capacity planning adjustments
    4. Technical debt prioritization
    5. Dependency reduction opportunities
    6. Implementation roadmap

    Provide structured output with: ARCHITECTURE_IMPROVEMENTS, RESILIENCE_PATTERNS,
    CAPACITY_PLAN, TECH_DEBT_ITEMS, IMPLEMENTATION_ROADMAP.

Save output to .incident-response/12-hardening.md.

Update state.json: set current_step to "complete", add step 13 to completed_steps.


Completion

Update state.json:

  • Set status to "complete"
  • Set last_updated to current timestamp
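Before presenting the summary, it may help to confirm that every expected output file exists. The sketch below is one way to do so. The helper `missing_outputs` is hypothetical, and the filenames are the twelve outputs named in the steps above.

```python
from pathlib import Path

# The twelve output files written by Steps 1-13 (Steps 5 and 6 share one file).
EXPECTED_OUTPUTS = [
    "01-classification.md",
    "02-observability.md",
    "03-mitigation.md",
    "04-debugging.md",
    "05-investigation.md",
    "06-fix.md",
    "07-deployment.md",
    "08-communication.md",
    "09-customer-impact.md",
    "10-postmortem.md",
    "11-monitoring.md",
    "12-hardening.md",
]


def missing_outputs(state_dir: Path = Path(".incident-response")) -> list:
    """Return the expected output files that are not present."""
    return [name for name in EXPECTED_OUTPUTS
            if not (state_dir / name).is_file()]
```

A non-empty result suggests a step did not write its file, which under the halt-on-failure rule should be raised with the user rather than ignored.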

Present the final summary:

Incident response complete: $INCIDENT

## Files Created
[List all .incident-response/ output files]

## Response Summary
- Classification: .incident-response/01-classification.md
- Observability: .incident-response/02-observability.md
- Mitigation: .incident-response/03-mitigation.md
- Debugging: .incident-response/04-debugging.md
- Investigation: .incident-response/05-investigation.md
- Fix: .incident-response/06-fix.md
- Deployment: .incident-response/07-deployment.md
- Communication: .incident-response/08-communication.md
- Customer Impact: .incident-response/09-customer-impact.md
- Postmortem: .incident-response/10-postmortem.md
- Monitoring: .incident-response/11-monitoring.md
- Hardening: .incident-response/12-hardening.md

## Immediate Follow-ups
1. Verify service stability over the next 24 hours
2. Complete all postmortem action items
3. Deploy monitoring enhancements within 1 week
4. Schedule system hardening work
5. Conduct team learning session on lessons learned

## Success Criteria
- Service restored within SLA targets
- Postmortem completed within 48 hours
- All action items assigned with deadlines
- Monitoring improvements deployed within 1 week
- No recurrence of the same root cause

Production incident requiring immediate response: $ARGUMENTS