agents/workflows/incident-response.md at 18f7f6a0b9914d5ff267425f6f4fae6e702d0751

mirror of https://github.com/wshobson/agents.git synced 2026-03-18 09:37:15 +00:00

Seth Hobson 3802bca865 Refine plugin marketplace for launch readiness

Plugin Scope Improvements:
- Remove language-specialists plugin (not task-focused)
- Split specialized-domains into 5 focused plugins:
  * blockchain-web3 - Smart contract development only
  * quantitative-trading - Financial modeling and trading only
  * payment-processing - Payment gateway integration only
  * game-development - Unity and Minecraft only
  * accessibility-compliance - WCAG auditing only
- Split business-operations into 3 focused plugins:
  * business-analytics - Metrics and reporting only
  * hr-legal-compliance - HR and legal docs only
  * customer-sales-automation - Support and sales workflows only
- Fix infrastructure-devops scope:
  * Remove database concerns (db-migrate, database-admin)
  * Remove observability concerns (observability-engineer)
  * Move slo-implement to incident-response
  * Focus purely on container orchestration (K8s, Docker, Terraform)
- Fix customer-sales-automation scope:
  * Remove content-marketer (unrelated to customer/sales workflows)

Marketplace Statistics:
- Total plugins: 27 (was 22)
- Tool coverage: 100% (42/42 tools referenced)
- Fat plugins removed: 3 (language-specialists, specialized-domains, business-operations)
- All plugins now have clear, focused tasks

Model Migration:
- Migrate all 42 tools from claude-sonnet-4-0/opus-4-1 to model: sonnet
- Migrate all 15 workflows from claude-opus-4-1 to model: sonnet
- Use short model syntax consistent with agent files

Documentation Updates:
- Update README.md with refined plugin structure
- Update plugin descriptions to be task-focused
- Remove anthropomorphic and marketing language
- Improve category organization (now 16 distinct categories)

Ready for October 9, 2025 @ 9am PST launch

2025-10-08 20:54:29 -04:00

4.3 KiB

model: sonnet

Respond to production incidents with coordinated agent expertise for rapid resolution:

[Extended thinking: This workflow handles production incidents with urgency and precision. Multiple specialized agents work together to identify root causes, implement fixes, and prevent recurrence.]

Phase 1: Immediate Response

1. Incident Assessment

Use Task tool with subagent_type="incident-responder"
Prompt: "URGENT: Assess production incident: $ARGUMENTS. Determine severity, impact, and immediate mitigation steps. Time is critical."
Output: Incident severity, impact assessment, immediate actions
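
Each prompt in this workflow carries a `$ARGUMENTS` placeholder for the incident description. This file does not show how the placeholder is filled, so the following is only a minimal sketch that assumes plain string templating. `render_prompt` is a hypothetical helper, not part of the workflow.

```python
from string import Template

# Prompt text copied from step 1; $ARGUMENTS is the workflow's placeholder.
ASSESSMENT_PROMPT = Template(
    "URGENT: Assess production incident: $ARGUMENTS. "
    "Determine severity, impact, and immediate mitigation steps. "
    "Time is critical."
)


def render_prompt(template: Template, incident: str) -> str:
    """Hypothetical helper: fill $ARGUMENTS with the incident description."""
    # substitute() raises KeyError if the template names a placeholder
    # that was not supplied, which surfaces typos early.
    return template.substitute(ARGUMENTS=incident.strip())
```

Calling `render_prompt(ASSESSMENT_PROMPT, "checkout API returning 502s")` yields the step 1 prompt with the incident text in place of the placeholder.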

2. Initial Troubleshooting

Use Task tool with subagent_type="devops-troubleshooter"
Prompt: "Investigate production issue: $ARGUMENTS. Check logs, metrics, recent deployments, and system health. Identify potential root causes."
Output: Initial findings, suspicious patterns, potential causes

Phase 2: Root Cause Analysis

3. Deep Debugging

Use Task tool with subagent_type="debugger"
Prompt: "Debug production issue: $ARGUMENTS using findings from initial investigation. Analyze stack traces, reproduce issue if possible, identify exact root cause."
Output: Root cause identification, reproduction steps, debug analysis

4. Performance Analysis (if applicable)

Use Task tool with subagent_type="performance-engineer"
Prompt: "Analyze performance aspects of incident: $ARGUMENTS. Check for resource exhaustion, bottlenecks, or performance degradation."
Output: Performance metrics, resource analysis, bottleneck identification

5. Database Investigation (if applicable)

Use Task tool with subagent_type="database-optimizer"
Prompt: "Investigate database-related aspects of incident: $ARGUMENTS. Check for locks, slow queries, connection issues, or data corruption."
Output: Database health report, query analysis, data integrity check
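
Steps 4 and 5 are marked "if applicable", but the workflow does not say how applicability is decided. One possible approach, sketched here as an assumption, is to scan the initial findings for hints. The keyword lists and the `applicable_agents` function are hypothetical; in practice the judgment would more likely come from the agents' own findings.

```python
# Hypothetical keyword hints per optional agent (illustrative only).
OPTIONAL_STEPS = {
    "performance-engineer": ("latency", "timeout", "cpu", "memory", "slow"),
    "database-optimizer": ("database", "query", "lock", "connection pool"),
}


def applicable_agents(findings: str) -> list[str]:
    """Return the optional subagent types whose hints appear in the findings."""
    text = findings.lower()
    return [
        agent
        for agent, hints in OPTIONAL_STEPS.items()
        if any(hint in text for hint in hints)
    ]
```

A substring match like this is deliberately crude: it can over-select (for example, "slow query" triggers both agents), which is usually the safer failure mode during an incident.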

Phase 3: Resolution Implementation

6. Fix Development

Use Task tool with subagent_type="backend-architect"
Prompt: "Design and implement fix for incident: $ARGUMENTS based on root cause analysis. Ensure fix is safe for immediate production deployment."
Output: Fix implementation, safety analysis, rollout strategy

7. Emergency Deployment

Use Task tool with subagent_type="deployment-engineer"
Prompt: "Deploy emergency fix for incident: $ARGUMENTS. Implement with minimal risk, include rollback plan, and monitor deployment closely."
Output: Deployment execution, rollback procedures, monitoring setup
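
Step 7 asks for a rollback plan and close monitoring of the deployment. A minimal sketch of that control flow is shown below, assuming the deploy, health-check, and rollback actions are supplied as callables. `deploy_with_rollback` and its parameters are hypothetical names, and a real deployment would also wait between checks.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    checks: int = 3,
) -> bool:
    """Apply a fix, then roll back as soon as any post-deploy check fails."""
    deploy()
    for _ in range(checks):
        if not is_healthy():
            rollback()
            return False  # fix was reverted
    return True  # all checks passed; fix stays in place
```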

Phase 4: Stabilization and Prevention

8. System Stabilization

Use Task tool with subagent_type="devops-troubleshooter"
Prompt: "Stabilize system after incident fix: $ARGUMENTS. Monitor system health, clear any backlogs, and ensure full recovery."
Output: System health report, recovery metrics, stability confirmation

9. Security Review (if applicable)

Use Task tool with subagent_type="security-auditor"
Prompt: "Review security implications of incident: $ARGUMENTS. Check for any security breaches, data exposure, or vulnerabilities exploited."
Output: Security assessment, breach analysis, hardening recommendations

Phase 5: Post-Incident Activities

10. Monitoring Enhancement

Use Task tool with subagent_type="devops-troubleshooter"
Prompt: "Enhance monitoring to prevent recurrence of: $ARGUMENTS. Add alerts, improve observability, and set up early warning systems."
Output: New monitoring rules, alert configurations, observability improvements
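
As an illustration of the kind of early-warning rule step 10 might produce, the sketch below fires when the error rate over a sliding window of recent requests exceeds a threshold. The class name, window size, and threshold are assumptions chosen for the example, not values from this workflow.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the last `window` requests is too high."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self.samples: deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def record(self, is_error: bool) -> bool:
        """Record one request outcome; return True when the alert should fire."""
        self.samples.append(is_error)
        # Wait for a full window so a single early error cannot trigger the alert.
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold
```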

11. Test Coverage

Use Task tool with subagent_type="test-automator"
Prompt: "Create tests to prevent regression of incident: $ARGUMENTS. Include unit tests, integration tests, and chaos engineering scenarios."
Output: Test implementations, regression prevention, chaos tests

12. Documentation

Use Task tool with subagent_type="incident-responder"
Prompt: "Document incident postmortem for: $ARGUMENTS. Include timeline, root cause, impact, resolution, and lessons learned. No blame, focus on improvement."
Output: Postmortem document, action items, process improvements
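
The postmortem fields named in step 12 (timeline, root cause, impact, resolution, lessons learned) can be captured in a small record type. The sketch below is one hypothetical layout; the `Postmortem` class and its plain-text rendering are assumptions, not a format the workflow prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical container for the fields listed in step 12."""

    incident: str
    timeline: list[str] = field(default_factory=list)
    root_cause: str = "TBD"
    impact: str = "TBD"
    resolution: str = "TBD"
    lessons_learned: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text, marking empty sections as TBD."""
        lines = [f"Postmortem: {self.incident}", "", "Timeline:"]
        lines += [f"- {entry}" for entry in self.timeline] or ["- TBD"]
        lines += [
            "",
            f"Root cause: {self.root_cause}",
            f"Impact: {self.impact}",
            f"Resolution: {self.resolution}",
            "",
            "Lessons learned:",
        ]
        lines += [f"- {item}" for item in self.lessons_learned] or ["- TBD"]
        return "\n".join(lines)
```

Keeping the fields structured makes it easy to see which sections are still unfilled before the document is circulated.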

Coordination Notes

Speed is critical in the early phases: run agents in parallel where their work is independent
Communication between agents must be clear and rapid
All changes must be safe and reversible
Document everything for postmortem analysis
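
The note on parallel execution could look like the following sketch for Phase 1. It assumes that steps 1 and 2 do not depend on each other's output, which the workflow does not state explicitly. `run_agent` is a hypothetical stand-in for a Task tool call and returns a canned string here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_agent(subagent_type: str, prompt: str) -> str:
    """Hypothetical placeholder for a Task tool call."""
    return f"[{subagent_type}] handled: {prompt}"


def run_phase_one(incident: str) -> dict[str, str]:
    """Run the two Phase 1 agents concurrently and collect their outputs."""
    steps = {
        "incident-responder": f"URGENT: Assess production incident: {incident}.",
        "devops-troubleshooter": f"Investigate production issue: {incident}.",
    }
    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        futures = {
            agent: pool.submit(run_agent, agent, prompt)
            for agent, prompt in steps.items()
        }
        # result() blocks until each agent finishes and re-raises its errors.
        return {agent: future.result() for agent, future in futures.items()}
```

Later phases consume earlier findings (step 3 uses the initial investigation, step 6 uses the root cause), so they are better run in sequence.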

Production incident: $ARGUMENTS