style: format all files with prettier

Seth Hobson
2026-01-19 17:07:03 -05:00
parent 8d37048deb
commit 56848874a2
355 changed files with 15215 additions and 10241 deletions


@@ -7,11 +7,13 @@ model: sonnet
You are a DevOps troubleshooter specializing in rapid incident response, advanced debugging, and modern observability practices.
## Purpose
Expert DevOps troubleshooter with comprehensive knowledge of modern observability tools, debugging methodologies, and incident response practices. Masters log analysis, distributed tracing, performance debugging, and system reliability engineering. Specializes in rapid problem resolution, root cause analysis, and building resilient systems.
## Capabilities
### Modern Observability & Monitoring
- **Logging platforms**: ELK Stack (Elasticsearch, Logstash, Kibana), Loki/Grafana, Fluentd/Fluent Bit
- **APM solutions**: Datadog, New Relic, Dynatrace, AppDynamics, Instana, Honeycomb
- **Metrics & monitoring**: Prometheus, Grafana, InfluxDB, VictoriaMetrics, Thanos
@@ -20,6 +22,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Synthetic monitoring**: Pingdom, Datadog Synthetics, custom health checks
### Container & Kubernetes Debugging
- **kubectl mastery**: Advanced debugging commands, resource inspection, troubleshooting workflows
- **Container runtime debugging**: Docker, containerd, CRI-O, runtime-specific issues
- **Pod troubleshooting**: Init containers, sidecar issues, resource constraints, networking
@@ -28,6 +31,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Storage debugging**: Persistent volume issues, storage class problems, data corruption
### Network & DNS Troubleshooting
- **Network analysis**: tcpdump, Wireshark, eBPF-based tools, network latency analysis
- **DNS debugging**: dig, nslookup, DNS propagation, service discovery issues
- **Load balancer issues**: AWS ALB/NLB, Azure Load Balancer, GCP Load Balancer debugging
@@ -36,6 +40,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Cloud networking**: VPC connectivity, peering issues, NAT gateway problems
### Performance & Resource Analysis
- **System performance**: CPU, memory, disk I/O, network utilization analysis
- **Application profiling**: Memory leaks, CPU hotspots, garbage collection issues
- **Database performance**: Query optimization, connection pool issues, deadlock analysis
@@ -44,6 +49,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Scaling issues**: Auto-scaling problems, resource bottlenecks, capacity planning
### Application & Service Debugging
- **Microservices debugging**: Service-to-service communication, dependency issues
- **API troubleshooting**: REST API debugging, GraphQL issues, authentication problems
- **Message queue issues**: Kafka, RabbitMQ, SQS, dead letter queues, consumer lag
@@ -52,6 +58,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Configuration management**: Environment variables, secrets, config drift
### CI/CD Pipeline Debugging
- **Build failures**: Compilation errors, dependency issues, test failures
- **Deployment troubleshooting**: GitOps issues, ArgoCD/Flux problems, rollback procedures
- **Pipeline performance**: Build optimization, parallel execution, resource constraints
@@ -60,6 +67,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Environment-specific issues**: Configuration mismatches, infrastructure problems
### Cloud Platform Troubleshooting
- **AWS debugging**: CloudWatch analysis, AWS CLI troubleshooting, service-specific issues
- **Azure troubleshooting**: Azure Monitor, PowerShell debugging, resource group issues
- **GCP debugging**: Cloud Logging, gcloud CLI, service account problems
@@ -67,6 +75,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Serverless debugging**: Lambda functions, Azure Functions, Cloud Functions issues
### Security & Compliance Issues
- **Authentication debugging**: OAuth, SAML, JWT token issues, identity provider problems
- **Authorization issues**: RBAC problems, policy misconfigurations, permission debugging
- **Certificate management**: TLS certificate issues, renewal problems, chain validation
@@ -74,6 +83,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Audit trail analysis**: Log analysis for security events, compliance reporting
### Database Troubleshooting
- **SQL debugging**: Query performance, index usage, execution plan analysis
- **NoSQL issues**: MongoDB, Redis, DynamoDB performance and consistency problems
- **Connection issues**: Connection pool exhaustion, timeout problems, network connectivity
@@ -81,6 +91,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Backup & recovery**: Backup failures, point-in-time recovery, disaster recovery testing
### Infrastructure & Platform Issues
- **Infrastructure as Code**: Terraform state issues, provider problems, resource drift
- **Configuration management**: Ansible playbook failures, Chef cookbook issues, Puppet manifest problems
- **Container registry**: Image pull failures, registry connectivity, vulnerability scanning issues
@@ -88,6 +99,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Disaster recovery**: Backup failures, recovery testing, business continuity issues
### Advanced Debugging Techniques
- **Distributed system debugging**: CAP theorem implications, eventual consistency issues
- **Chaos engineering**: Fault injection analysis, resilience testing, failure pattern identification
- **Performance profiling**: Application profilers, system profiling, bottleneck analysis
@@ -95,6 +107,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- **Capacity analysis**: Resource utilization trends, scaling bottlenecks, cost optimization
## Behavioral Traits
- Gathers comprehensive facts first through logs, metrics, and traces before forming hypotheses
- Forms systematic hypotheses and tests them methodically with minimal system impact
- Documents all findings thoroughly for postmortem analysis and knowledge sharing
@@ -107,6 +120,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- Emphasizes automation and runbook development for common issues
## Knowledge Base
- Modern observability platforms and debugging tools
- Distributed system troubleshooting methodologies
- Container orchestration and cloud-native debugging techniques
@@ -117,6 +131,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
- Database performance and reliability issues
## Response Approach
1. **Assess the situation** with urgency appropriate to impact and scope
2. **Gather comprehensive data** from logs, metrics, traces, and system state
3. **Form and test hypotheses** systematically with minimal system disruption
@@ -128,6 +143,7 @@ Expert DevOps troubleshooter with comprehensive knowledge of modern observabilit
9. **Conduct blameless postmortems** to identify systemic improvements
## Example Interactions
- "Debug high memory usage in Kubernetes pods causing frequent OOMKills and restarts"
- "Analyze distributed tracing data to identify performance bottleneck in microservices architecture"
- "Troubleshoot intermittent 504 gateway timeout errors in production load balancer"


@@ -7,23 +7,27 @@ model: sonnet
You are an incident response specialist with comprehensive Site Reliability Engineering (SRE) expertise. When activated, you must act with urgency while maintaining precision and following modern incident management best practices.
## Purpose
Expert incident responder with deep knowledge of SRE principles, modern observability, and incident management frameworks. Masters rapid problem resolution, effective communication, and comprehensive post-incident analysis. Specializes in building resilient systems and improving organizational incident response capabilities.
## Immediate Actions (First 5 minutes)
### 1. Assess Severity & Impact
- **User impact**: Affected user count, geographic distribution, user journey disruption
- **Business impact**: Revenue loss, SLA violations, customer experience degradation
- **System scope**: Services affected, dependencies, blast radius assessment
- **External factors**: Peak usage times, scheduled events, regulatory implications
### 2. Establish Incident Command
- **Incident Commander**: Single decision-maker, coordinates response
- **Communication Lead**: Manages stakeholder updates and external communication
- **Technical Lead**: Coordinates technical investigation and resolution
- **War room setup**: Communication channels, video calls, shared documents
### 3. Immediate Stabilization
- **Quick wins**: Traffic throttling, feature flags, circuit breakers
- **Rollback assessment**: Recent deployments, configuration changes, infrastructure changes
- **Resource scaling**: Auto-scaling triggers, manual scaling, load redistribution
@@ -32,6 +36,7 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
## Modern Investigation Protocol
### Observability-Driven Investigation
- **Distributed tracing**: OpenTelemetry, Jaeger, Zipkin for request flow analysis
- **Metrics correlation**: Prometheus, Grafana, Datadog for pattern identification
- **Log aggregation**: ELK, Splunk, Loki for error pattern analysis
@@ -39,6 +44,7 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
- **Real User Monitoring**: User experience impact assessment
### SRE Investigation Techniques
- **Error budgets**: SLI/SLO violation analysis, burn rate assessment
- **Change correlation**: Deployment timeline, configuration changes, infrastructure modifications
- **Dependency mapping**: Service mesh analysis, upstream/downstream impact assessment
@@ -46,6 +52,7 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
- **Capacity analysis**: Resource utilization, scaling limits, quota exhaustion
### Advanced Troubleshooting
- **Chaos engineering insights**: Previous resilience testing results
- **A/B test correlation**: Feature flag impacts, canary deployment issues
- **Database analysis**: Query performance, connection pools, replication lag
@@ -55,18 +62,21 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
## Communication Strategy
### Internal Communication
- **Status updates**: Every 15 minutes during active incident
- **Technical details**: For engineering teams, detailed technical analysis
- **Executive updates**: Business impact, ETA, resource requirements
- **Cross-team coordination**: Dependencies, resource sharing, expertise needed
### External Communication
- **Status page updates**: Customer-facing incident status
- **Support team briefing**: Customer service talking points
- **Customer communication**: Proactive outreach for major customers
- **Regulatory notification**: If required by compliance frameworks
### Documentation Standards
- **Incident timeline**: Detailed chronology with timestamps
- **Decision rationale**: Why specific actions were taken
- **Impact metrics**: User impact, business metrics, SLA violations
@@ -75,6 +85,7 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
## Resolution & Recovery
### Fix Implementation
1. **Minimal viable fix**: Fastest path to service restoration
2. **Risk assessment**: Potential side effects, rollback capability
3. **Staged rollout**: Gradual fix deployment with monitoring
@@ -82,6 +93,7 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
5. **Monitoring**: Enhanced monitoring during recovery phase
### Recovery Validation
- **Service health**: All SLIs back to normal thresholds
- **User experience**: Real user monitoring validation
- **Performance metrics**: Response times, throughput, error rates
@@ -91,12 +103,14 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
## Post-Incident Process
### Immediate Post-Incident (24 hours)
- **Service stability**: Continued monitoring, alerting adjustments
- **Communication**: Resolution announcement, customer updates
- **Data collection**: Metrics export, log retention, timeline documentation
- **Team debrief**: Initial lessons learned, emotional support
### Blameless Post-Mortem
- **Timeline analysis**: Detailed incident timeline with contributing factors
- **Root cause analysis**: Five whys, fishbone diagrams, systems thinking
- **Contributing factors**: Human factors, process gaps, technical debt
@@ -104,6 +118,7 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
- **Follow-up tracking**: Action item completion, effectiveness measurement
### System Improvements
- **Monitoring enhancements**: New alerts, dashboard improvements, SLI adjustments
- **Automation opportunities**: Runbook automation, self-healing systems
- **Architecture improvements**: Resilience patterns, redundancy, graceful degradation
@@ -113,24 +128,28 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
## Modern Severity Classification
### P0 - Critical (SEV-1)
- **Impact**: Complete service outage or security breach
- **Response**: Immediate, 24/7 escalation
- **SLA**: < 15 minutes acknowledgment, < 1 hour resolution
- **Communication**: Every 15 minutes, executive notification
### P1 - High (SEV-2)
- **Impact**: Major functionality degraded, significant user impact
- **Response**: < 1 hour acknowledgment
- **SLA**: < 4 hours resolution
- **Communication**: Hourly updates, status page update
### P2 - Medium (SEV-3)
- **Impact**: Minor functionality affected, limited user impact
- **Response**: < 4 hours acknowledgment
- **SLA**: < 24 hours resolution
- **Communication**: As needed, internal updates
### P3 - Low (SEV-4)
- **Impact**: Cosmetic issues, no user impact
- **Response**: Next business day
- **SLA**: < 72 hours resolution
@@ -139,17 +158,20 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
## SRE Best Practices
### Error Budget Management
- **Burn rate analysis**: Current error budget consumption
- **Policy enforcement**: Feature freeze triggers, reliability focus
- **Trade-off decisions**: Reliability vs. velocity, resource allocation
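Burn rate analysis can be sketched as a small calculation. This is a minimal illustration, assuming the common definition of burn rate as the observed error ratio divided by the error budget (1 minus the SLO target); the function names and the 30-day window in the usage note are hypothetical and do not come from the agent definition.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than sustainable the error budget is being spent.

    A value of 1.0 means the budget would last exactly one SLO window.
    Assumption: burn rate = observed error ratio / (1 - SLO target).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0.0 <= error_ratio <= 1.0:
        raise ValueError("error_ratio must be between 0 and 1")
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def hours_until_exhausted(rate: float, window_hours: float,
                          budget_remaining: float = 1.0) -> float:
    """Estimate hours until the remaining budget fraction is spent at this burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return budget_remaining * window_hours / rate
```

For example, with a hypothetical 30-day (720-hour) window, a sustained burn rate of 2.0 would spend a full budget in about 360 hours, which is the kind of signal a feature-freeze policy could key on.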
### Reliability Patterns
- **Circuit breakers**: Automatic failure detection and isolation
- **Bulkhead pattern**: Resource isolation to prevent cascading failures
- **Graceful degradation**: Core functionality preservation during failures
- **Retry policies**: Exponential backoff, jitter, circuit breaking
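The retry policy bullet above can be illustrated with a short sketch. It assumes the "full jitter" variant of exponential backoff (each delay drawn uniformly between zero and a capped exponential bound); `retry`, `backoff_delays`, and their default values are hypothetical names chosen for illustration, and circuit breaking is deliberately left out.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield one full-jitter delay per retry: rng() * min(cap, base * 2**attempt)."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * 2 ** attempt)


def retry(operation: Callable[[], T], max_attempts: int = 5, base: float = 0.5,
          cap: float = 30.0, sleep: Callable[[float], None] = time.sleep,
          rng: Callable[[], float] = random.random) -> T:
    """Call operation until it succeeds or max_attempts is reached.

    The last failure is re-raised so callers still see the original error.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, base, cap, rng)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(next(delays))  # wait before the next attempt
    raise AssertionError("unreachable")
```

Passing `sleep` and `rng` as parameters keeps the sketch testable without real waiting; in practice the caught exception type would usually be narrowed to errors that are known to be transient.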
### Continuous Improvement
- **Incident metrics**: MTTR, MTTD, incident frequency, user impact
- **Learning culture**: Blameless culture, psychological safety
- **Investment prioritization**: Reliability work, technical debt, tooling
@@ -158,18 +180,21 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
## Modern Tools & Integration
### Incident Management Platforms
- **PagerDuty**: Alerting, escalation, response coordination
- **Opsgenie**: Incident management, on-call scheduling
- **ServiceNow**: ITSM integration, change management correlation
- **Slack/Teams**: Communication, chatops, automated updates
### Observability Integration
- **Unified dashboards**: Single pane of glass during incidents
- **Alert correlation**: Intelligent alerting, noise reduction
- **Automated diagnostics**: Runbook automation, self-service debugging
- **Incident replay**: Time-travel debugging, historical analysis
## Behavioral Traits
- Acts with urgency while maintaining precision and systematic approach
- Prioritizes service restoration over root cause analysis during active incidents
- Communicates clearly and frequently with appropriate technical depth for audience
@@ -181,6 +206,7 @@ Expert incident responder with deep knowledge of SRE principles, modern observab
- Learns from every incident to improve system reliability and response processes
## Response Principles
- **Speed matters, but accuracy matters more**: A wrong fix can exponentially worsen the situation
- **Communication is critical**: Stakeholders need regular updates with appropriate detail
- **Fix first, understand later**: Focus on service restoration before root cause analysis


@@ -5,12 +5,14 @@ Orchestrate multi-agent incident response with modern SRE practices for rapid re
## Configuration
### Severity Levels
- **P0/SEV-1**: Complete outage, security breach, data loss - immediate all-hands response
- **P1/SEV-2**: Major degradation, significant user impact - rapid response required
- **P2/SEV-3**: Minor degradation, limited impact - standard response
- **P3/SEV-4**: Cosmetic issues, no user impact - scheduled resolution
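The severity levels above can be read as a simple decision order, sketched below. This is only an illustration: the `IncidentSignals` fields and the `classify` function are hypothetical names, and the mapping assumes the most severe matching condition wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentSignals:
    """Hypothetical triage inputs; field names are illustrative only."""
    complete_outage: bool = False
    security_breach: bool = False
    data_loss: bool = False
    major_degradation: bool = False
    user_impact: str = "none"  # one of "none", "limited", "significant"


def classify(signals: IncidentSignals) -> str:
    """Map triage signals to a severity label, checking the most severe cases first."""
    if signals.complete_outage or signals.security_breach or signals.data_loss:
        return "P0/SEV-1"
    if signals.major_degradation or signals.user_impact == "significant":
        return "P1/SEV-2"
    if signals.user_impact == "limited":
        return "P2/SEV-3"
    return "P3/SEV-4"
```

A call such as `classify(IncidentSignals(user_impact="limited"))` yields `"P2/SEV-3"` under these assumptions; real triage would normally weigh more context than a handful of flags.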
### Incident Types
- Performance degradation
- Service outage
- Security incident
@@ -21,18 +23,21 @@ Orchestrate multi-agent incident response with modern SRE practices for rapid re
## Phase 1: Detection & Triage
### 1. Incident Detection and Classification
- Use Task tool with subagent_type="incident-responder"
- Prompt: "URGENT: Detect and classify incident: $ARGUMENTS. Analyze alerts from PagerDuty/Opsgenie/monitoring. Determine: 1) Incident severity (P0-P3), 2) Affected services and dependencies, 3) User impact and business risk, 4) Initial incident command structure needed. Check error budgets and SLO violations."
- Output: Severity classification, impact assessment, incident command assignments, SLO status
- Context: Initial alerts, monitoring dashboards, recent changes
### 2. Observability Analysis
- Use Task tool with subagent_type="observability-monitoring::observability-engineer"
- Prompt: "Perform rapid observability sweep for incident: $ARGUMENTS. Query: 1) Distributed tracing (OpenTelemetry/Jaeger), 2) Metrics correlation (Prometheus/Grafana/Datadog), 3) Log aggregation (ELK/Splunk), 4) APM data, 5) Real User Monitoring. Identify anomalies, error patterns, and service degradation points."
- Output: Observability findings, anomaly detection, service health matrix, trace analysis
- Context: Severity level from step 1, affected services
### 3. Initial Mitigation
- Use Task tool with subagent_type="incident-responder"
- Prompt: "Implement immediate mitigation for P$SEVERITY incident: $ARGUMENTS. Actions: 1) Traffic throttling/rerouting if needed, 2) Feature flag disabling for affected features, 3) Circuit breaker activation, 4) Rollback assessment for recent deployments, 5) Scale resources if capacity-related. Prioritize user experience restoration."
- Output: Mitigation actions taken, temporary fixes applied, rollback decisions
@@ -41,18 +46,21 @@ Orchestrate multi-agent incident response with modern SRE practices for rapid re
## Phase 2: Investigation & Root Cause Analysis
### 4. Deep System Debugging
- Use Task tool with subagent_type="error-debugging::debugger"
- Prompt: "Conduct deep debugging for incident: $ARGUMENTS using observability data. Investigate: 1) Stack traces and error logs, 2) Database query performance and locks, 3) Network latency and timeouts, 4) Memory leaks and CPU spikes, 5) Dependency failures and cascading errors. Apply Five Whys analysis."
- Output: Root cause identification, contributing factors, dependency impact map
- Context: Observability analysis, mitigation status
### 5. Security Assessment
- Use Task tool with subagent_type="security-scanning::security-auditor"
- Prompt: "Assess security implications of incident: $ARGUMENTS. Check: 1) DDoS attack indicators, 2) Authentication/authorization failures, 3) Data exposure risks, 4) Certificate issues, 5) Suspicious access patterns. Review WAF logs, security groups, and audit trails."
- Output: Security assessment, breach analysis, vulnerability identification
- Context: Root cause findings, system logs
### 6. Performance Engineering Analysis
- Use Task tool with subagent_type="application-performance::performance-engineer"
- Prompt: "Analyze performance aspects of incident: $ARGUMENTS. Examine: 1) Resource utilization patterns, 2) Query optimization opportunities, 3) Caching effectiveness, 4) Load balancer health, 5) CDN performance, 6) Autoscaling triggers. Identify bottlenecks and capacity issues."
- Output: Performance bottlenecks, resource recommendations, optimization opportunities
@@ -61,12 +69,14 @@ Orchestrate multi-agent incident response with modern SRE practices for rapid re
## Phase 3: Resolution & Recovery
### 7. Fix Implementation
- Use Task tool with subagent_type="backend-development::backend-architect"
- Prompt: "Design and implement production fix for incident: $ARGUMENTS based on root cause. Requirements: 1) Minimal viable fix for rapid deployment, 2) Risk assessment and rollback capability, 3) Staged rollout plan with monitoring, 4) Validation criteria and health checks. Consider both immediate fix and long-term solution."
- Output: Fix implementation, deployment strategy, validation plan, rollback procedures
- Context: Root cause analysis, performance findings, security assessment
### 8. Deployment and Validation
- Use Task tool with subagent_type="deployment-strategies::deployment-engineer"
- Prompt: "Execute emergency deployment for incident fix: $ARGUMENTS. Process: 1) Blue-green or canary deployment, 2) Progressive rollout with monitoring, 3) Health check validation at each stage, 4) Rollback triggers configured, 5) Real-time monitoring during deployment. Coordinate with incident command."
- Output: Deployment status, validation results, monitoring dashboard, rollback readiness
@@ -75,12 +85,14 @@ Orchestrate multi-agent incident response with modern SRE practices for rapid re
## Phase 4: Communication & Coordination
### 9. Stakeholder Communication
- Use Task tool with subagent_type="content-marketing::content-marketer"
- Prompt: "Manage incident communication for: $ARGUMENTS. Create: 1) Status page updates (public-facing), 2) Internal engineering updates (technical details), 3) Executive summary (business impact/ETA), 4) Customer support briefing (talking points), 5) Timeline documentation with key decisions. Update every 15-30 minutes based on severity."
- Output: Communication artifacts, status updates, stakeholder briefings, timeline log
- Context: All previous phases, current resolution status
### 10. Customer Impact Assessment
- Use Task tool with subagent_type="incident-responder"
- Prompt: "Assess and document customer impact for incident: $ARGUMENTS. Analyze: 1) Affected user segments and geography, 2) Failed transactions or data loss, 3) SLA violations and contractual implications, 4) Customer support ticket volume, 5) Revenue impact estimation. Prepare proactive customer outreach list."
- Output: Customer impact report, SLA analysis, outreach recommendations
@@ -89,18 +101,21 @@ Orchestrate multi-agent incident response with modern SRE practices for rapid re
## Phase 5: Postmortem & Prevention
### 11. Blameless Postmortem
- Use Task tool with subagent_type="documentation-generation::docs-architect"
- Prompt: "Conduct blameless postmortem for incident: $ARGUMENTS. Document: 1) Complete incident timeline with decisions, 2) Root cause and contributing factors (systems focus), 3) What went well in response, 4) What could improve, 5) Action items with owners and deadlines, 6) Lessons learned for team education. Follow SRE postmortem best practices."
- Output: Postmortem document, action items list, process improvements, training needs
- Context: Complete incident history, all agent outputs
### 12. Monitoring and Alert Enhancement
- Use Task tool with subagent_type="observability-monitoring::observability-engineer"
- Prompt: "Enhance monitoring to prevent recurrence of: $ARGUMENTS. Implement: 1) New alerts for early detection, 2) SLI/SLO adjustments if needed, 3) Dashboard improvements for visibility, 4) Runbook automation opportunities, 5) Chaos engineering scenarios for testing. Ensure alerts are actionable and reduce noise."
- Output: New monitoring configuration, alert rules, dashboard updates, runbook automation
- Context: Postmortem findings, root cause analysis
### 13. System Hardening
- Use Task tool with subagent_type="backend-development::backend-architect"
- Prompt: "Design system improvements to prevent incident: $ARGUMENTS. Propose: 1) Architecture changes for resilience (circuit breakers, bulkheads), 2) Graceful degradation strategies, 3) Capacity planning adjustments, 4) Technical debt prioritization, 5) Dependency reduction opportunities. Create implementation roadmap."
- Output: Architecture improvements, resilience patterns, technical debt items, roadmap
@@ -109,6 +124,7 @@ Orchestrate multi-agent incident response with modern SRE practices for rapid re
## Success Criteria
### Immediate Success (During Incident)
- Service restoration within SLA targets
- Accurate severity classification within 5 minutes
- Stakeholder communication every 15-30 minutes
@@ -116,6 +132,7 @@ Orchestrate multi-agent incident response with modern SRE practices for rapid re
- Clear incident command structure maintained
### Long-term Success (Post-Incident)
- Comprehensive postmortem within 48 hours
- All action items assigned with deadlines
- Monitoring improvements deployed within 1 week
@@ -126,21 +143,24 @@ Orchestrate multi-agent incident response with modern SRE practices for rapid re
## Coordination Protocols
### Incident Command Structure
- **Incident Commander**: Decision authority, coordination
- **Technical Lead**: Technical investigation and resolution
- **Communications Lead**: Stakeholder updates
- **Subject Matter Experts**: Specific system expertise
### Communication Channels
- War room (Slack/Teams channel or Zoom)
- Status page updates (StatusPage, Statusly)
- PagerDuty/Opsgenie for alerting
- Confluence/Notion for documentation
### Handoff Requirements
- Each phase provides clear context to the next
- All findings documented in shared incident doc
- Decision rationale recorded for postmortem
- Timestamp all significant events
Production incident requiring immediate response: $ARGUMENTS


@@ -9,6 +9,7 @@ Use Task tool with subagent_type="error-debugging::error-detective" followed by
**First: Error-Detective Analysis**
**Prompt:**
```
Analyze error traces, logs, and observability data for: $ARGUMENTS
@@ -33,6 +34,7 @@ Modern debugging techniques to employ:
```
**Expected output:**
```
ERROR_SIGNATURE: {exception type + key message pattern}
FREQUENCY: {count, rate, trend}
@@ -48,6 +50,7 @@ RELATED_ISSUES: [similar errors, cascading failures]
**Second: Debugger Root Cause Identification**
**Prompt:**
```
Perform root cause investigation using error-detective output:
@@ -75,6 +78,7 @@ Context needed for next phase:
```
**Expected output:**
```
ROOT_CAUSE: {technical explanation with evidence}
INTRODUCING_COMMIT: {git SHA + summary if found via bisect}
@@ -94,7 +98,8 @@ Use Task tool with subagent_type="error-debugging::debugger" and subagent_type="
**First: Debugger Code Analysis**
**Prompt:**
```
````
Perform deep code analysis and bisect investigation:
Context from Phase 1:
@@ -111,22 +116,26 @@ Deliverables:
```bash
git bisect start HEAD v1.2.3
git bisect run ./test_reproduction.sh
```
````
5. Dependency compatibility matrix: version combinations that work/fail
6. Configuration analysis: environment variables, feature flags, deployment configs
7. Timing and race condition analysis: async operations, event ordering, locks
8. Memory and resource analysis: leaks, exhaustion, contention
Modern investigation techniques:
- AI-assisted code explanation (Claude/Copilot to understand complex logic)
- Automated git bisect with reproduction test
- Dependency graph analysis (npm ls, go mod graph, pip show)
- Configuration drift detection (compare staging vs production)
- Time-travel debugging using production traces
```
**Expected output:**
```
CODE_PATH: {entry → ... → failure location with key variables}
STATE_AT_FAILURE: {variable values, object states, database state}
BISECT_RESULT: {exact commit that introduced bug + diff}
@@ -134,20 +143,24 @@ DEPENDENCY_ISSUES: [version conflicts, breaking changes, CVEs]
CONFIGURATION_DRIFT: {differences between environments}
RACE_CONDITIONS: {async issues, event ordering problems}
ISOLATION_VERIFICATION: {confirmed single root cause vs multiple issues}
```
**Second: Code-Reviewer Deep Dive**
**Prompt:**
```
Review code logic and identify design issues:
Context from Debugger:
- Code path: {CODE_PATH}
- State at failure: {STATE_AT_FAILURE}
- Bisect result: {BISECT_RESULT}
Deliverables:
1. Logic flaw analysis: incorrect assumptions, missing edge cases, wrong algorithms
2. Type safety gaps: where stronger types could prevent the issue
3. Error handling review: missing try-catch, unhandled promises, panic scenarios
@@ -157,16 +170,19 @@ Deliverables:
7. Fix design: minimal change vs refactoring vs architectural improvement
Review checklist:
- Are null/undefined values handled correctly?
- Are async operations properly awaited/chained?
- Are error cases explicitly handled?
- Are type assertions safe?
- Are API contracts respected?
- Are side effects isolated?
```
**Expected output:**
```
LOGIC_FLAWS: [specific incorrect assumptions or algorithms]
TYPE_SAFETY_GAPS: [where types could prevent issues]
ERROR_HANDLING_GAPS: [unhandled error paths]
@@ -174,6 +190,7 @@ SIMILAR_VULNERABILITIES: [other code with same pattern]
FIX_DESIGN: {minimal change approach}
REFACTORING_OPPORTUNITIES: {if larger improvements warranted}
ARCHITECTURAL_CONCERNS: {if systemic issues exist}
```
## Phase 3: Fix Implementation - Domain-Specific Agent Execution
@@ -191,9 +208,11 @@ Based on Phase 2 output, route to appropriate domain agent using Task tool:
**Prompt Template (adapt for language):**
```
Implement production-safe fix with comprehensive test coverage:
Context from Phase 2:
- Root cause: {ROOT_CAUSE}
- Logic flaws: {LOGIC_FLAWS}
- Fix design: {FIX_DESIGN}
@@ -201,6 +220,7 @@ Context from Phase 2:
- Similar vulnerabilities: {SIMILAR_VULNERABILITIES}
Deliverables:
1. Minimal fix implementation addressing root cause (not symptoms)
2. Unit tests:
- Specific failure case reproduction
@@ -223,6 +243,7 @@ Deliverables:
- Structured logging for debugging
Modern implementation techniques (2024/2025):
- AI pair programming (GitHub Copilot, Claude Code) for test generation
- Type-driven development (leverage TypeScript, mypy, clippy)
- Contract-first APIs (OpenAPI, gRPC schemas)
@@ -230,6 +251,7 @@ Modern implementation techniques (2024/2025):
- Defensive programming (explicit error handling, validation)
Implementation requirements:
- Follow existing code patterns and conventions
- Add strategic debug logging (JSON structured logs)
- Include comprehensive type annotations
@@ -237,30 +259,33 @@ Implementation requirements:
- Maintain backward compatibility (version APIs if breaking)
- Add OpenTelemetry spans for distributed tracing
- Include metric counters for monitoring (success/failure rates)
```
**Expected output:**
```
FIX_SUMMARY: {what changed and why - root cause vs symptom}
CHANGED_FILES: [
{path: "...", changes: "...", reasoning: "..."}
]
NEW_FILES: [{path: "...", purpose: "..."}]
TEST_COVERAGE: {
unit: "X scenarios",
integration: "Y scenarios",
edge_cases: "Z scenarios",
regression: "W scenarios"
}
TEST_RESULTS: {all_passed: true/false, details: "..."}
BREAKING_CHANGES: {none | API changes with migration path}
OBSERVABILITY_ADDITIONS: [
{type: "log", location: "...", purpose: "..."},
{type: "metric", name: "...", purpose: "..."},
{type: "trace", span: "...", purpose: "..."}
{type: "log", location: "...", purpose: "..."},
{type: "metric", name: "...", purpose: "..."},
{type: "trace", span: "...", purpose: "..."}
]
FEATURE_FLAGS: [{flag: "...", rollout_strategy: "..."}]
BACKWARD_COMPATIBILITY: {maintained | breaking with mitigation}
```
## Phase 4: Verification - Automated Testing and Performance Validation
@@ -271,15 +296,18 @@ Use Task tool with subagent_type="unit-testing::test-automator" and subagent_typ
**Prompt:**
```
Run comprehensive regression testing and verify fix quality:
Context from Phase 3:
- Fix summary: {FIX_SUMMARY}
- Changed files: {CHANGED_FILES}
- Test coverage: {TEST_COVERAGE}
- Test results: {TEST_RESULTS}
Deliverables:
1. Full test suite execution:
- Unit tests (all existing + new)
- Integration tests
@@ -308,51 +336,58 @@ Deliverables:
- Fuzzing for input validation
Modern testing practices (2024/2025):
- AI-generated test cases (GitHub Copilot, Claude Code)
- Snapshot testing for UI/API contracts
- Visual regression testing for frontend
- Chaos engineering for resilience testing
- Production traffic replay for load testing
```
**Expected output:**
```
TEST_RESULTS: {
total: N,
passed: X,
failed: Y,
skipped: Z,
new_failures: [list if any],
flaky_tests: [list if any]
total: N,
passed: X,
failed: Y,
skipped: Z,
new_failures: [list if any],
flaky_tests: [list if any]
}
CODE_COVERAGE: {
line: "X%",
branch: "Y%",
function: "Z%",
delta: "+/-W%"
line: "X%",
branch: "Y%",
function: "Z%",
delta: "+/-W%"
}
REGRESSION_DETECTED: {yes/no + details if yes}
CROSS_ENV_RESULTS: {staging: "...", qa: "..."}
SECURITY_SCAN: {
vulnerabilities: [list or "none"],
static_analysis: "...",
dependency_audit: "..."
vulnerabilities: [list or "none"],
static_analysis: "...",
dependency_audit: "..."
}
TEST_QUALITY: {deterministic: true/false, coverage_adequate: true/false}
```
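The `new_failures` field above can be derived by comparing the failing tests from the pre-fix run with those from the post-fix run. A minimal Python sketch, assuming each run is available as a plain mapping of test name to status (the function name and input format are hypothetical, not any test runner's API):

```python
from typing import Dict, List


def summarize_results(before: Dict[str, str], after: Dict[str, str]) -> dict:
    """Build a TEST_RESULTS-style summary from two test runs.

    `before` and `after` map test names to "passed", "failed", or "skipped".
    This input format is an assumption made for illustration.
    """
    failed_before = {name for name, status in before.items() if status == "failed"}
    failed_after = {name for name, status in after.items() if status == "failed"}
    statuses: List[str] = list(after.values())
    return {
        "total": len(after),
        "passed": statuses.count("passed"),
        "failed": statuses.count("failed"),
        "skipped": statuses.count("skipped"),
        # Tests that fail now but did not fail before the fix are regressions.
        "new_failures": sorted(failed_after - failed_before),
    }
```

A non-empty `new_failures` list would then translate to `REGRESSION_DETECTED: yes` in the expected output.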
**Second: Performance-Engineer Validation**
**Prompt:**
```
Measure performance impact and validate no regressions:
Context from Test-Automator:
- Test results: {TEST_RESULTS}
- Code coverage: {CODE_COVERAGE}
- Fix summary: {FIX_SUMMARY}
Deliverables:
1. Performance benchmarks:
- Response time (p50, p95, p99)
- Throughput (requests/second)
@@ -380,53 +415,60 @@ Deliverables:
- Cost implications (cloud resources)
Modern performance practices:
- OpenTelemetry instrumentation
- Continuous profiling (Pyroscope, pprof)
- Real User Monitoring (RUM)
- Synthetic monitoring
```
**Expected output:**
```
PERFORMANCE_BASELINE: {
response_time_p95: "Xms",
throughput: "Y req/s",
cpu_usage: "Z%",
memory_usage: "W MB"
response_time_p95: "Xms",
throughput: "Y req/s",
cpu_usage: "Z%",
memory_usage: "W MB"
}
PERFORMANCE_AFTER_FIX: {
response_time_p95: "Xms (delta)",
throughput: "Y req/s (delta)",
cpu_usage: "Z% (delta)",
memory_usage: "W MB (delta)"
response_time_p95: "Xms (delta)",
throughput: "Y req/s (delta)",
cpu_usage: "Z% (delta)",
memory_usage: "W MB (delta)"
}
PERFORMANCE_IMPACT: {
verdict: "improved|neutral|degraded",
acceptable: true/false,
reasoning: "..."
verdict: "improved|neutral|degraded",
acceptable: true/false,
reasoning: "..."
}
LOAD_TEST_RESULTS: {
max_throughput: "...",
breaking_point: "...",
memory_leaks: "none|detected"
max_throughput: "...",
breaking_point: "...",
memory_leaks: "none|detected"
}
APM_INSIGHTS: [slow queries, N+1 patterns, bottlenecks]
PRODUCTION_READY: {yes/no + blockers if no}
```
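The `verdict` in PERFORMANCE_IMPACT can be derived mechanically from the baseline and after-fix numbers. The following Python sketch shows one possible rule; the 5% tolerance, the metric names, and the "any degraded metric wins" policy are assumptions for illustration, not something the workflow prescribes:

```python
from typing import Dict

# Metrics where a higher value is better; all others are "lower is better".
HIGHER_IS_BETTER = {"throughput"}


def performance_verdict(baseline: Dict[str, float], after: Dict[str, float],
                        tolerance: float = 0.05) -> str:
    """Return "improved", "neutral", or "degraded".

    A metric counts as changed only if it moves by more than `tolerance`
    relative to its baseline. Any degraded metric makes the verdict "degraded".
    """
    improved = degraded = False
    for name, base in baseline.items():
        if base == 0:
            continue  # relative change is undefined for a zero baseline
        change = (after[name] - base) / base
        if name in HIGHER_IS_BETTER:
            change = -change  # normalize so that positive always means "worse"
        if change > tolerance:
            degraded = True
        elif change < -tolerance:
            improved = True
    if degraded:
        return "degraded"
    return "improved" if improved else "neutral"
```

Treating small movements as "neutral" keeps benchmark noise from blocking a fix; the right tolerance depends on how stable the benchmark environment is.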
**Third: Code-Reviewer Final Approval**
**Prompt:**
```
Perform final code review and approve for deployment:
Context from Testing:
- Test results: {TEST_RESULTS}
- Regression detected: {REGRESSION_DETECTED}
- Performance impact: {PERFORMANCE_IMPACT}
- Security scan: {SECURITY_SCAN}
Deliverables:
1. Code quality review:
- Follows project conventions
- No code smells or anti-patterns
@@ -454,6 +496,7 @@ Deliverables:
- Success metrics defined
Review checklist:
- All tests pass
- No performance regressions
- Security vulnerabilities addressed
@@ -461,34 +504,37 @@ Review checklist:
- Backward compatibility maintained
- Observability adequate
- Deployment plan clear
```
**Expected output:**
```
REVIEW_STATUS: {APPROVED|NEEDS_REVISION|BLOCKED}
CODE_QUALITY: {score/assessment}
ARCHITECTURE_CONCERNS: [list or "none"]
SECURITY_CONCERNS: [list or "none"]
DEPLOYMENT_RISK: {low|medium|high}
ROLLBACK_PLAN: {
steps: ["..."],
estimated_time: "X minutes",
data_recovery: "..."
steps: ["..."],
estimated_time: "X minutes",
data_recovery: "..."
}
ROLLOUT_STRATEGY: {
approach: "canary|blue-green|rolling|big-bang",
phases: ["..."],
success_metrics: ["..."],
abort_criteria: ["..."]
approach: "canary|blue-green|rolling|big-bang",
phases: ["..."],
success_metrics: ["..."],
abort_criteria: ["..."]
}
MONITORING_REQUIREMENTS: [
{metric: "...", threshold: "...", action: "..."}
{metric: "...", threshold: "...", action: "..."}
]
FINAL_VERDICT: {
approved: true/false,
blockers: [list if not approved],
recommendations: ["..."]
approved: true/false,
blockers: [list if not approved],
recommendations: ["..."]
}
```
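The review checklist maps naturally onto the `FINAL_VERDICT` structure: every unmet checklist item becomes a blocker. A small Python sketch of that mapping, assuming the checklist is represented as a name-to-boolean dictionary (a hypothetical representation chosen for illustration):

```python
from typing import Dict


def final_verdict(checks: Dict[str, bool]) -> dict:
    """Approve only when every checklist item passed; list the rest as blockers."""
    blockers = sorted(name for name, ok in checks.items() if not ok)
    return {"approved": not blockers, "blockers": blockers}
```

For example, `final_verdict({"all_tests_pass": True, "no_performance_regressions": False})` reports `approved: False` with `no_performance_regressions` as the only blocker.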
## Phase 5: Documentation and Prevention - Long-term Resilience
@@ -497,9 +543,11 @@ Use Task tool with subagent_type="comprehensive-review::code-reviewer" for preve
**Prompt:**
```
Document fix and implement prevention strategies to avoid recurrence:
Context from Phase 4:
- Final verdict: {FINAL_VERDICT}
- Review status: {REVIEW_STATUS}
- Root cause: {ROOT_CAUSE}
@@ -507,6 +555,7 @@ Context from Phase 4:
- Monitoring requirements: {MONITORING_REQUIREMENTS}
Deliverables:
1. Code documentation:
- Inline comments for non-obvious logic (minimal)
- Function/class documentation updates
@@ -543,62 +592,66 @@ Deliverables:
- Document testing strategy gaps
Modern prevention practices (2024/2025):
- AI-assisted code review rules (GitHub Copilot, Claude Code)
- Continuous security scanning (Snyk, Dependabot)
- Infrastructure as Code validation (Terraform validate, CloudFormation Linter)
- Contract testing for APIs (Pact, OpenAPI validation)
- Observability-driven development (instrument before deploying)
```
**Expected output:**
```
DOCUMENTATION_UPDATES: [
{file: "CHANGELOG.md", summary: "..."},
{file: "docs/runbook.md", summary: "..."},
{file: "docs/architecture.md", summary: "..."}
{file: "CHANGELOG.md", summary: "..."},
{file: "docs/runbook.md", summary: "..."},
{file: "docs/architecture.md", summary: "..."}
]
PREVENTION_MEASURES: {
static_analysis: [
{tool: "eslint", rule: "...", reason: "..."},
{tool: "ruff", rule: "...", reason: "..."}
],
type_system: [
{enhancement: "...", location: "...", benefit: "..."}
],
pre_commit_hooks: [
{hook: "...", purpose: "..."}
]
static_analysis: [
{tool: "eslint", rule: "...", reason: "..."},
{tool: "ruff", rule: "...", reason: "..."}
],
type_system: [
{enhancement: "...", location: "...", benefit: "..."}
],
pre_commit_hooks: [
{hook: "...", purpose: "..."}
]
}
MONITORING_ADDED: {
alerts: [
{name: "...", threshold: "...", channel: "..."}
],
dashboards: [
{name: "...", metrics: [...], url: "..."}
],
slos: [
{service: "...", sli: "...", target: "...", window: "..."}
]
alerts: [
{name: "...", threshold: "...", channel: "..."}
],
dashboards: [
{name: "...", metrics: [...], url: "..."}
],
slos: [
{service: "...", sli: "...", target: "...", window: "..."}
]
}
ARCHITECTURAL_IMPROVEMENTS: [
{improvement: "...", reasoning: "...", effort: "small|medium|large"}
{improvement: "...", reasoning: "...", effort: "small|medium|large"}
]
SIMILAR_VULNERABILITIES: {
found: N,
locations: [...],
remediation_plan: "..."
found: N,
locations: [...],
remediation_plan: "..."
}
FOLLOW_UP_TASKS: [
{task: "...", priority: "high|medium|low", owner: "..."}
{task: "...", priority: "high|medium|low", owner: "..."}
]
POSTMORTEM: {
created: true/false,
location: "...",
incident_severity: "SEV1|SEV2|SEV3|SEV4"
created: true/false,
location: "...",
incident_severity: "SEV1|SEV2|SEV3|SEV4"
}
KNOWLEDGE_BASE_UPDATES: [
{article: "...", summary: "..."}
{article: "...", summary: "..."}
]
```
## Multi-Domain Coordination for Complex Issues
@@ -657,26 +710,32 @@ For issues spanning multiple domains, orchestrate specialized agents sequentiall
**Context Passing Template:**
```
Context for {next_agent}:
Completed by {previous_agent}:
- {summary_of_work}
- {key_findings}
- {changes_made}
Remaining work:
- {specific_tasks_for_next_agent}
- {files_to_modify}
- {constraints_to_follow}
Dependencies:
- {systems_or_components_affected}
- {data_needed}
- {integration_points}
Success criteria:
- {measurable_outcomes}
- {verification_steps}
```
## Configuration Options
@@ -721,13 +780,16 @@ Customize workflow behavior by setting priorities at invocation:
**Example Invocation:**
```
Issue: Users experiencing timeout errors on checkout page (500+ errors/hour)
Config:
- VERIFICATION_LEVEL: comprehensive (affects revenue)
- PREVENTION_FOCUS: comprehensive (high business impact)
- ROLLOUT_STRATEGY: canary (test on 5% traffic first)
- OBSERVABILITY_LEVEL: comprehensive (need detailed monitoring)
```
## Modern Debugging Tools Integration
@@ -832,3 +894,4 @@ A fix is considered complete when ALL of the following are met:
- Deployment success rate: > 95% (rollback rate < 5%)
Issue to resolve: $ARGUMENTS
```

@@ -20,12 +20,12 @@ Production-ready templates for incident response runbooks covering detection, tr
### 1. Incident Severity Levels
| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| **SEV1** | Complete outage, data loss | 15 min | Production down |
| **SEV2** | Major degradation | 30 min | Critical feature broken |
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |
| Severity | Impact | Response Time | Example |
| -------- | -------------------------- | ----------------- | ----------------------- |
| **SEV1** | Complete outage, data loss | 15 min | Production down |
| **SEV2** | Major degradation | 30 min | Critical feature broken |
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |
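The response times in the severity table above can be encoded so that tooling computes a concrete deadline when an incident is declared. A minimal Python sketch; the function and constant names are hypothetical, and SEV4 ("next business day") is left without a fixed duration because it depends on a business calendar:

```python
from datetime import datetime, timedelta
from typing import Optional

# Response targets taken from the severity table.
# SEV4 has no fixed duration, so it maps to None.
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=2),
    "SEV4": None,
}


def response_deadline(severity: str, declared_at: datetime) -> Optional[datetime]:
    """Return the time by which a responder should be engaged, if defined."""
    target = RESPONSE_TARGETS[severity]
    return None if target is None else declared_at + target
```

An unknown severity raises `KeyError`, which is deliberate here: a typo in the severity label should fail loudly rather than silently skip the deadline.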
### 2. Runbook Structure
@@ -45,28 +45,33 @@ Production-ready templates for incident response runbooks covering detection, tr
### Template 1: Service Outage Runbook
```markdown
````markdown
# [Service Name] Outage Runbook
## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall
## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?
## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
@@ -74,6 +79,7 @@ Production-ready templates for incident response runbooks covering detection, tr
## Initial Triage (First 5 Minutes)
### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
@@ -84,24 +90,28 @@ kubectl rollout history deployment/payment-service -n payments
# Check error rates
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))'
```
````
### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history
### 3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
| Symptom | Likely Cause | Go To Section |
| -------------------- | ------------------- | ------------- |
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
## Mitigation Procedures
### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments
@@ -123,6 +133,7 @@ kubectl rollout status deployment/payment-service -n payments
```
### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
@@ -147,6 +158,7 @@ kubectl set env deployment/payment-service \
```
### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
@@ -167,6 +179,7 @@ psql -h $DB_HOST -c "
```
### 4.4 Traffic Surge
```bash
# Step 1: Check current request rate
kubectl top pods -n payments
@@ -200,6 +213,7 @@ EOF
```
## Verification Steps
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq
@@ -215,6 +229,7 @@ curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(r
```
## Rollback Procedures
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments
@@ -229,16 +244,17 @@ curl -X POST https://api.company.com/internal/feature-flags \
## Escalation Matrix
| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
| Condition | Escalate To | Contact |
| ----------------------------- | ------------------- | ------------------- |
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
## Communication Templates
### Initial Notification (Internal)
```
🚨 INCIDENT: Payment Service Degradation
@@ -257,6 +273,7 @@ Updates in #payments-incidents
```
### Status Update
```
📊 UPDATE: Payment Service Incident
@@ -276,6 +293,7 @@ ETA to Resolution: ~15 minutes
```
### Resolution Notification
```
✅ RESOLVED: Payment Service Incident
@@ -291,7 +309,8 @@ Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
```
````
### Template 2: Database Incident Runbook
@@ -325,9 +344,10 @@ SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```
````
## Replication Lag
```sql
-- Check lag on replica
SELECT
@@ -343,6 +363,7 @@ SELECT
```
## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data
@@ -358,6 +379,7 @@ psql -c "VACUUM FULL large_table;"
# If emergency, delete old data or expand disk
```
```
## Best Practices
@@ -381,3 +403,4 @@ psql -c "VACUUM FULL large_table;"
- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
```

@@ -20,13 +20,13 @@ Effective patterns for on-call shift transitions, ensuring continuity, context t
### 1. Handoff Components
| Component | Purpose |
|-----------|---------|
| **Active Incidents** | What's currently broken |
| **Ongoing Investigations** | Issues being debugged |
| **Recent Changes** | Deployments, configs |
| **Known Issues** | Workarounds in place |
| **Upcoming Events** | Maintenance, releases |
| Component | Purpose |
| -------------------------- | ----------------------- |
| **Active Incidents** | What's currently broken |
| **Ongoing Investigations** | Issues being debugged |
| **Recent Changes** | Deployments, configs |
| **Known Issues** | Workarounds in place |
| **Upcoming Events** | Maintenance, releases |
### 2. Handoff Timing
@@ -47,7 +47,7 @@ Incoming:
### Template 1: Shift Handoff Document
```markdown
````markdown
# On-Call Handoff: Platform Team
**Outgoing**: @alice (2024-01-15 to 2024-01-22)
@@ -59,6 +59,7 @@ Incoming:
## 🔴 Active Incidents
### None currently active
No active incidents at handoff time.
---
@@ -66,40 +67,48 @@ No active incidents at handoff time.
## 🟡 Ongoing Investigations
### 1. Intermittent API Timeouts (ENG-1234)
**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out
**Context**:
- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)
**Next Steps**:
- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed
**Resources**:
- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)
---
### 2. Memory Growth in Auth Service (ENG-1235)
**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)
**Context**:
- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly
**Next Steps**:
- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%
**Resources**:
- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)
@@ -108,6 +117,7 @@ No active incidents at handoff time.
## 🟢 Resolved This Shift
### Payment Service Outage (2024-01-19)
- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
@@ -119,17 +129,20 @@ No active incidents at handoff time.
## 📋 Recent Changes
### Deployments
| Service | Version | Time | Notes |
|---------|---------|------|-------|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |
| Service | Version | Time | Notes |
| ------------ | ------- | ----------- | -------------------------- |
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |
### Configuration Changes
- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75
### Infrastructure
- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0
@@ -138,11 +151,13 @@ No active incidents at handoff time.
## ⚠️ Known Issues & Workarounds
### 1. Slow Dashboard Loading
**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)
### 2. Flaky Integration Test
**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)
@@ -151,28 +166,29 @@ No active incidents at handoff time.
## 📅 Upcoming Events
| Date | Event | Impact | Contact |
|------|-------|--------|---------|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |
| Date | Event | Impact | Contact |
| ----------- | -------------------- | ------------------- | ------------- |
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |
---
## 📞 Escalation Reminders
| Issue Type | First Escalation | Second Escalation |
|------------|------------------|-------------------|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |
| Issue Type | First Escalation | Second Escalation |
| --------------- | -------------------- | ----------------- |
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |
---
## 🔧 Quick Reference
### Common Commands
```bash
# Check service health
kubectl get pods -A | grep -v Running
@@ -186,8 +202,10 @@ psql -c "SELECT count(*) FROM pg_stat_activity;"
# Clear cache (emergency only)
redis-cli FLUSHDB
```
````
### Important Links
- [Runbooks](https://wiki/runbooks)
- [Service Catalog](https://wiki/services)
- [Incident Slack](https://slack.com/incidents)
@@ -198,6 +216,7 @@ redis-cli FLUSHDB
## Handoff Checklist
### Outgoing Engineer
- [x] Document active incidents
- [x] Document ongoing investigations
- [x] List recent changes
@@ -206,13 +225,15 @@ redis-cli FLUSHDB
- [x] Sync with incoming engineer
### Incoming Engineer
- [ ] Read this document
- [ ] Join sync call
- [ ] Verify PagerDuty is routing to you
- [ ] Verify Slack notifications working
- [ ] Check VPN/access working
- [ ] Review critical dashboards
```
````
### Template 2: Quick Handoff (Async)
@@ -238,7 +259,7 @@ redis-cli FLUSHDB
## Questions?
I'll be available on Slack until 17:00 today.
```
````
### Template 3: Incident Handoff (Mid-Incident)
@@ -252,36 +273,43 @@ I'll be available on Slack until 17:00 today.
---
## Current State
- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min
## What We Know
1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow
## What We've Done
- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features
## What Needs to Happen
1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation
## Key People
- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)
## Communication
- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware
## Resources
- Incident channel: #inc-20240122-payment
- Dashboard: [Payment Service](https://grafana/d/payments)
- Runbook: [Payment Degradation](https://wiki/runbooks/payments)
@@ -289,6 +317,7 @@ I'll be available on Slack until 17:00 today.
---
**Incoming on-call (@bob) - Please confirm you have:**
- [ ] Joined #inc-20240122-payment
- [ ] Access to dashboards
- [ ] Understand current state
@@ -331,6 +360,7 @@ I'll be available on Slack until 17:00 today.
## Pre-Shift Checklist
### Access Verification
- [ ] VPN working
- [ ] kubectl access to all clusters
- [ ] Database read access
@@ -338,18 +368,21 @@ I'll be available on Slack until 17:00 today.
- [ ] PagerDuty app installed and logged in
### Alerting Setup
- [ ] PagerDuty schedule shows you as primary
- [ ] Phone notifications enabled
- [ ] Slack notifications for incident channels
- [ ] Test alert received and acknowledged
### Knowledge Refresh
- [ ] Review recent incidents (past 2 weeks)
- [ ] Check service changelog
- [ ] Skim critical runbooks
- [ ] Know escalation contacts
### Environment Ready
- [ ] Laptop charged and accessible
- [ ] Phone charged
- [ ] Quiet space available for calls
@@ -362,18 +395,21 @@ I'll be available on Slack until 17:00 today.
## Daily On-Call Routine
### Morning (start of day)
- [ ] Check overnight alerts
- [ ] Review dashboards for anomalies
- [ ] Check for any P0/P1 tickets created
- [ ] Skim incident channels for context
### Throughout Day
- [ ] Respond to alerts within SLA
- [ ] Document investigation progress
- [ ] Update team on significant issues
- [ ] Triage incoming pages
### End of Day
- [ ] Hand off any active issues
- [ ] Update investigation docs
- [ ] Note anything for next shift
@@ -400,18 +436,21 @@ I'll be available on Slack until 17:00 today.
## Escalation Triggers
### Immediate Escalation
- SEV1 incident declared
- Data breach suspected
- Unable to diagnose within 30 min
- Customer or legal escalation received
### Consider Escalation
- Issue spans multiple teams
- Requires expertise you don't have
- Business impact exceeds threshold
- You're uncertain about next steps
### How to Escalate
1. Page the appropriate escalation path
2. Provide brief context in Slack
3. Stay engaged until escalation acknowledges
@@ -421,6 +460,7 @@ I'll be available on Slack until 17:00 today.
## Best Practices
### Do's
- **Document everything** - Future you will thank you
- **Escalate early** - Better safe than sorry
- **Take breaks** - Alert fatigue is real
@@ -428,6 +468,7 @@ I'll be available on Slack until 17:00 today.
- **Test your setup** - Before incidents, not during
### Don'ts
- **Don't skip handoffs** - Context loss causes incidents
- **Don't hero** - Escalate when needed
- **Don't ignore alerts** - Even if they seem minor

@@ -20,13 +20,13 @@ Comprehensive guide to writing effective, blameless postmortems that drive organ
### 1. Blameless Culture
| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Blame-Focused | Blameless |
| ------------------------ | --------------------------------- |
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
### 2. Postmortem Triggers
@@ -40,6 +40,7 @@ Comprehensive guide to writing effective, blameless postmortems that drive organ
## Quick Start
### Postmortem Timeline
```
Day 0: Incident occurs
Day 1-2: Draft postmortem document
@@ -67,6 +68,7 @@ Quarterly: Review patterns across incidents
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
**Impact**:
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
@@ -74,18 +76,18 @@ On January 15, 2024, the payment processing service experienced a 47-minute outa
## Timeline (All times UTC)
| Time | Event |
|------|-------|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| Time | Event |
| ----- | ----------------------------------------------- |
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to roll back deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to roll back deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
## Root Cause Analysis
@@ -111,13 +113,14 @@ The v2.3.4 deployment included a change to the database query pattern that inadv
- Why was developer unfamiliar? → No documentation on connection management patterns
### System Diagram
```
[Client] → [Load Balancer] → [Payment Service] → [Database]
Connection Pool (broken)
Direct connections (cause)
Connection Pool (broken)
Direct connections (cause)
```
## Detection
@@ -219,11 +222,13 @@ The deployment completed at 14:23, but the first alert didn't fire until 14:31 (
# 5 Whys Analysis: [Incident]
## Problem Statement
Payment service experienced 47-minute outage due to database connection exhaustion.
## Analysis
### Why #1: Why did the service fail?
**Answer**: Database connections were exhausted, causing all new requests to fail.
**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
@@ -231,6 +236,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
---
### Why #2: Why were database connections exhausted?
**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
@@ -238,6 +244,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
---
### Why #3: Why did the code bypass the connection pool?
**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
**Evidence**: PR #1234 shows the change, made while fixing a different bug.
@@ -245,6 +252,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
---
### Why #4: Why wasn't this caught in code review?
**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
**Evidence**: Review comments only discuss business logic.
@@ -252,6 +260,7 @@ Payment service experienced 47-minute outage due to database connection exhausti
---
### Why #5: Why isn't there a safety net for this type of change?
**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
@@ -264,12 +273,12 @@ Payment service experienced 47-minute outage due to database connection exhausti
## Systemic Improvements
| Root Cause | Improvement | Type |
|------------|-------------|------|
| Root Cause | Improvement | Type |
| ------------- | --------------------------------- | ---------- |
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
```
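The "Add infrastructure behavior tests" improvement in the template above could take the form of a test asserting that request handling never opens more connections than the pool holds. The Python sketch below uses a toy in-memory pool to show the shape of such a test; `CountingPool` and `handle_request` are hypothetical stand-ins, and a real test would count connections at the driver or database level instead:

```python
import queue


class CountingPool:
    """Toy connection pool that counts how many connections it has opened."""

    def __init__(self, size: int):
        self.opened = 0
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(self._open())

    def _open(self) -> object:
        self.opened += 1
        return object()  # stand-in for a real database connection

    def acquire(self) -> object:
        # Raises queue.Empty if no connection is returned within one second.
        return self._idle.get(timeout=1)

    def release(self, conn: object) -> None:
        self._idle.put(conn)


def handle_request(pool: CountingPool) -> None:
    """Hypothetical request handler: borrow a connection and always return it."""
    conn = pool.acquire()
    try:
        pass  # queries would run here
    finally:
        pool.release(conn)


def test_requests_reuse_pooled_connections() -> None:
    pool = CountingPool(size=5)
    for _ in range(500):
        handle_request(pool)
    # The number of opened connections must stay at the pool size,
    # no matter how many requests have been served.
    assert pool.opened == 5
```

A handler that forgot to release its connection would make `acquire` time out once the pool drained, so the same test also catches leaks.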
### Template 3: Quick Postmortem (Minor Incidents)
@@ -280,9 +289,11 @@ Payment service experienced 47-minute outage due to database connection exhausti
**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3
## What Happened
API latency spiked to 5s due to cache miss storm after cache flush.
## Timeline
- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
@@ -290,13 +301,16 @@ API latency spiked to 5s due to cache miss storm after cache flush.
- 10:12 - Latency normalized
## Root Cause
Full cache flush for minor config update caused thundering herd.
## Fix
- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)
## Lessons
Don't full-flush cache in production; use targeted invalidation.
```
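The lesson in the template above, targeted invalidation instead of a full flush, can be illustrated with a plain dictionary standing in for the cache. This is a sketch under that assumption; the key naming scheme (`config:`, `user:`) and the function name are hypothetical:

```python
from typing import Dict


def invalidate_prefix(cache: Dict[str, object], prefix: str) -> int:
    """Remove only the keys that start with `prefix`; return how many were removed."""
    stale = [key for key in cache if key.startswith(prefix)]
    for key in stale:
        del cache[key]
    return len(stale)
```

With a cache holding `config:rate_limit`, `user:1`, and `user:2`, calling `invalidate_prefix(cache, "config:")` drops only the config entry, so the user entries stay warm and no miss storm follows the config update.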
@@ -308,32 +322,38 @@ Don't full-flush cache in production; use targeted invalidation.
## Meeting Structure (60 minutes)
### 1. Opening (5 min)
- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms
### 2. Timeline Review (15 min)
- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline
### 3. Analysis Discussion (20 min)
- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?
### 4. Action Items (15 min)
- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates
### 5. Closing (5 min)
- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed
## Facilitation Tips
- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
@@ -343,17 +363,18 @@ Don't full-flush cache in production; use targeted invalidation.
## Anti-Patterns to Avoid
| Anti-Pattern            | Problem                    | Better Approach                 |
| ----------------------- | -------------------------- | ------------------------------- |
| **Blame game**          | Shuts down learning        | Focus on systems                |
| **Shallow analysis**    | Doesn't prevent recurrence | Ask "why" 5 times               |
| **No action items**     | Waste of time              | Always have concrete next steps |
| **Unrealistic actions** | Never completed            | Scope to achievable tasks       |
| **No follow-up**        | Actions forgotten          | Track in ticketing system       |
## Best Practices
### Do's
- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
@@ -361,6 +382,7 @@ Don't full-flush cache in production; use targeted invalidation.
- **Share widely** - Organizational learning
### Don'ts
- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning