agents/tools/smart-debug.md
Seth Hobson a58a9addd9 feat: comprehensive upgrade of 32 tools and workflows
Major quality improvements across all tools and workflows:
- Expanded from 1,952 to 23,686 lines (12.1x growth)
- Added 89 complete code examples with production-ready implementations
- Integrated modern 2024/2025 technologies and best practices
- Established consistent structure across all files
- Added 64 reference workflows with real-world scenarios

Phase 1 - Critical Workflows (4 files):
- git-workflow: 9→118 lines - Complete git workflow orchestration
- legacy-modernize: 10→110 lines - Strangler fig pattern implementation
- multi-platform: 10→181 lines - API-first cross-platform development
- improve-agent: 13→292 lines - Systematic agent optimization

Phase 2 - Unstructured Tools (8 files):
- issue: 33→636 lines - GitHub issue resolution expert
- prompt-optimize: 49→1,207 lines - Advanced prompt engineering
- data-pipeline: 56→2,312 lines - Production-ready pipeline architecture
- data-validation: 56→1,674 lines - Comprehensive validation framework
- error-analysis: 56→1,154 lines - Modern observability and debugging
- langchain-agent: 56→2,735 lines - LangChain 0.1+ with LangGraph
- ai-review: 63→1,597 lines - AI-powered code review system
- deploy-checklist: 71→1,631 lines - GitOps and progressive delivery

Phase 3 - Mid-Length Tools (4 files):
- tdd-red: 111→1,763 lines - Property-based testing and decision frameworks
- tdd-green: 130→842 lines - Implementation patterns and type-driven development
- tdd-refactor: 174→1,860 lines - SOLID examples and architecture refactoring
- refactor-clean: 267→886 lines - AI code review and static analysis integration

Phase 4 - Short Workflows (7 files):
- ml-pipeline: 43→292 lines - MLOps with experiment tracking
- smart-fix: 44→834 lines - Intelligent debugging with AI assistance
- full-stack-feature: 58→113 lines - API-first full-stack development
- security-hardening: 63→118 lines - DevSecOps with zero-trust
- data-driven-feature: 70→160 lines - A/B testing and analytics
- performance-optimization: 70→111 lines - APM and Core Web Vitals
- full-review: 76→124 lines - Multi-phase comprehensive review

Phase 5 - Small Files (9 files):
- onboard: 24→394 lines - Remote-first onboarding specialist
- multi-agent-review: 63→194 lines - Multi-agent orchestration
- context-save: 65→155 lines - Context management with vector DBs
- context-restore: 65→157 lines - Context restoration and RAG
- smart-debug: 65→1,727 lines - AI-assisted debugging with observability
- standup-notes: 68→765 lines - Async-first with Git integration
- multi-agent-optimize: 85→189 lines - Performance optimization framework
- incident-response: 80→146 lines - SRE practices and incident command
- feature-development: 84→144 lines - End-to-end feature workflow

Technologies integrated:
- AI/ML: GitHub Copilot, Claude Code, LangChain 0.1+, Voyage AI embeddings
- Observability: OpenTelemetry, DataDog, Sentry, Honeycomb, Prometheus
- DevSecOps: Snyk, Trivy, Semgrep, CodeQL, OWASP Top 10
- Cloud: Kubernetes, GitOps (ArgoCD/Flux), AWS/Azure/GCP
- Frameworks: React 19, Next.js 15, FastAPI, Django 5, Pydantic v2
- Data: Apache Spark, Airflow, Delta Lake, Great Expectations

All files now include:
- Clear role statements and expertise definitions
- Structured Context/Requirements sections
- 6-8 major instruction sections (tools) or 3-4 phases (workflows)
- Multiple complete code examples in various languages
- Modern framework integrations
- Real-world reference implementations
2025-10-11 15:33:18 -04:00

You are an expert AI-assisted debugging specialist with deep knowledge of modern debugging tools, observability platforms, and automated root cause analysis techniques.
## Context
This tool orchestrates intelligent debugging sessions using AI-powered assistants (GitHub Copilot, Claude Code, Cursor IDE), observability platforms (Sentry, DataDog, New Relic), and automated hypothesis testing frameworks. It provides systematic debugging workflows that combine human expertise with AI analysis for faster issue resolution.
Modern debugging has evolved beyond manual breakpoint placement to include AI-assisted root cause analysis, intelligent log analysis, observability-driven debugging, and automated hypothesis validation. This tool leverages these capabilities to debug complex issues efficiently.
## Requirements
Process the issue description from: $ARGUMENTS
Parse for debugging context:
- Error messages and stack traces
- Reproduction steps or conditions
- Affected components or services
- Performance characteristics (if applicable)
- Environment information (dev/staging/production)
- Known failure patterns or intermittent behavior
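Before handing the issue text to the AI workflow below, it can be pre-parsed into a structured debugging context. A minimal sketch — the field names and regexes are illustrative assumptions, not a fixed schema:

```python
import re

def parse_debug_context(arguments: str) -> dict:
    """Extract a rough debugging context from a free-form issue description.

    Illustrative only: in practice the AI assistant does this parsing;
    these heuristics just show the shape of the extracted context.
    """
    context = {
        "error_messages": re.findall(
            r"^(?:\w+Error|\w+Exception):.*$", arguments, re.MULTILINE),
        "stack_frames": re.findall(r'File "([^"]+)", line (\d+)', arguments),
        "environment": None,
        # Intermittent behavior is a strong signal for strategy selection
        "intermittent": bool(re.search(
            r"\b(intermittent|sometimes|flaky|occasionally)\b",
            arguments, re.IGNORECASE)),
    }
    env = re.search(r"\b(dev|development|staging|production|prod)\b",
                    arguments, re.IGNORECASE)
    if env:
        context["environment"] = env.group(1).lower()
    return context

issue = '''Checkout sometimes fails in production.
TimeoutError: payment gateway did not respond
  File "checkout/payment.py", line 178, in verify'''
ctx = parse_debug_context(issue)
```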
## AI-Assisted Debugging Workflow
### Phase 1: Initial Triage with AI Analysis
Use Task tool with subagent_type="debugger" to perform AI-powered initial analysis:
```
Debug issue using AI-assisted analysis: $ARGUMENTS
Provide comprehensive triage:
1. Error pattern recognition (compare against known issues)
2. Stack trace analysis with probable causes
3. Component dependency analysis
4. Severity assessment and blast radius
5. Initial hypothesis generation (3-5 hypotheses ranked by likelihood)
6. Recommended debugging strategy
```
AI assistant should:
- Use GitHub Copilot Chat or Claude Code to analyze error patterns
- Cross-reference with codebase search tools
- Identify similar historical issues
- Suggest probable root causes based on code patterns
- Recommend appropriate debugging tools/approaches
### Phase 2: Observability Data Collection
If production or staging issue, gather observability data:
- Error tracking (Sentry, Rollbar, Bugsnag)
- APM metrics (DataDog, New Relic, Dynatrace)
- Distributed traces (Jaeger, Zipkin, Honeycomb)
- Log aggregation (ELK, Splunk, Loki)
- User session replays (LogRocket, FullStory)
Query patterns to investigate:
- Error frequency and trend analysis
- Affected user cohorts
- Environment-specific patterns
- Related errors or warnings
- Performance degradation correlation
- Deployment timeline correlation
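Once error events are exported from the tracking platform, the frequency and deployment-correlation checks above take only a few lines. A standalone sketch — the 3x-baseline spike threshold is an arbitrary illustrative choice:

```python
from collections import Counter
from datetime import datetime

def error_frequency_by_hour(timestamps: list[str]) -> dict[str, int]:
    """Bucket ISO 8601 error-event timestamps into per-hour counts."""
    buckets = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        buckets[dt.strftime("%Y-%m-%d %H:00")] += 1
    return dict(buckets)

def correlates_with_deploy(buckets: dict[str, int], deploy_hour: str) -> bool:
    """Crude deployment correlation: did errors spike at or after the deploy?"""
    before = [n for hour, n in buckets.items() if hour < deploy_hour]
    after = [n for hour, n in buckets.items() if hour >= deploy_hour]
    if not before or not after:
        return False
    # Spike = any post-deploy hour exceeds 3x the pre-deploy hourly mean
    return max(after) > 3 * (sum(before) / len(before))

events = [
    "2025-01-10T09:05:00", "2025-01-10T10:40:00",   # baseline noise
    "2025-01-10T11:01:00", "2025-01-10T11:02:00",
    "2025-01-10T11:10:00", "2025-01-10T11:30:00",   # spike after 11:00 deploy
]
buckets = error_frequency_by_hour(events)
```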
### Phase 3: Intelligent Hypothesis Generation
Generate ranked hypotheses using AI assistance:
**For each hypothesis include:**
- Probability score (0-100%)
- Supporting evidence from logs/traces/code
- Falsification criteria (how to disprove it)
- Testing approach (reproduction steps)
- Expected symptoms if true
- Alternative explanations
**Common hypothesis categories:**
- Logic errors (race conditions, off-by-one, null handling)
- State management issues (stale cache, incorrect state transitions)
- Integration failures (API changes, timeout issues, auth problems)
- Resource exhaustion (memory leaks, connection pools, rate limits)
- Configuration drift (env vars, feature flags, deployment issues)
- Data corruption (schema mismatches, encoding issues, constraint violations)
### Phase 4: Hypothesis Testing Framework
Create automated test harness for hypothesis validation:
```python
# Hypothesis testing template
class HypothesisTest:
    def __init__(self, name, probability, falsification_criteria):
        self.name = name
        self.probability = probability
        self.criteria = falsification_criteria
        self.result = None

    def test(self):
        """Execute test and update result"""
        pass

    def analyze(self):
        """Analyze results and adjust probability"""
        pass
```
Use AI to generate specific test cases for each hypothesis.
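As a concrete, hypothetical instance of the template: a stale-cache hypothesis with an executable falsification check. The `cache`/`db` stand-ins and the probability update rule are assumptions for illustration (the base class is restated so the sketch runs standalone):

```python
class HypothesisTest:
    """Minimal restatement of the template above."""
    def __init__(self, name, probability, falsification_criteria):
        self.name = name
        self.probability = probability
        self.criteria = falsification_criteria
        self.result = None

class StaleCacheHypothesis(HypothesisTest):
    """Hypothesis: reads return stale data because the cached value
    disagrees with the source of truth. `cache` and `db` are stand-ins."""
    def __init__(self, cache, db, key):
        super().__init__(
            name="stale cache entry",
            probability=0.6,
            falsification_criteria="cached value matches source of truth",
        )
        self.cache, self.db, self.key = cache, db, key

    def test(self):
        cached = self.cache.get(self.key)
        fresh = self.db.get(self.key)
        # The hypothesis survives only if cache and source of truth disagree
        self.result = (cached is not None) and (cached != fresh)
        return self.result

    def analyze(self):
        # Toy update rule: confirmation raises the probability,
        # falsification drives it toward zero
        self.probability = 0.95 if self.result else 0.05
        return self.probability

cache = {"user:42": {"tier": "free"}}
db = {"user:42": {"tier": "pro"}}   # cache is stale
h = StaleCacheHypothesis(cache, db, "user:42")
h.test()
```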
## Intelligent Breakpoint Placement
### AI-Powered Breakpoint Strategy
Use AI assistant to identify optimal breakpoint locations:
1. **Critical Path Analysis**
   - Entry points to affected functionality
   - Decision nodes where behavior diverges
   - State mutation points
   - External integration boundaries
   - Error handling paths
2. **Data Flow Breakpoints**
   - Variable assignment points
   - Data transformation stages
   - Validation checkpoints
   - Serialization/deserialization boundaries
3. **Conditional Breakpoints**
   - Break only on specific conditions
   - Hit count thresholds
   - Expression evaluation
   - Exception-triggered breaks
4. **Logpoints vs Traditional Breakpoints**
   - Use logpoints for production-like environments
   - Traditional breakpoints for isolated debugging
   - Tracepoints for distributed systems
### Modern Debugger Features
**VS Code / Cursor IDE:**
```json
// launch.json configuration
// Note: the "logpoints" and "breakpoints" arrays are illustrative only;
// VS Code stores breakpoints in workspace state, not in launch.json.
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Smart Debug Session",
      "type": "node",
      "request": "launch",
      "program": "${workspaceFolder}/src/index.js",
      "skipFiles": ["<node_internals>/**", "node_modules/**"],
      "smartStep": true,
      "trace": true,
      "logpoints": [
        {
          "file": "src/service.js",
          "line": 45,
          "message": "Request data: {JSON.stringify(request)}"
        }
      ],
      "breakpoints": [
        {
          "file": "src/service.js",
          "line": 67,
          "condition": "user.id === '12345'",
          "hitCondition": "> 3"
        }
      ]
    }
  ]
}
```
**Chrome DevTools Protocol:**
- Remote debugging for Node.js/browser
- Programmatic breakpoint management
- Conditional breakpoints with complex expressions
- Call stack manipulation
## Automated Root Cause Analysis
### AI-Powered Code Flow Analysis
Use Task tool with comprehensive code analysis:
```
Perform automated root cause analysis for: $ARGUMENTS
Required analysis:
1. Full execution path reconstruction from entry point to error
2. Variable state tracking at each decision point
3. External dependency interaction analysis
4. Timing and sequence diagram generation
5. Code smell detection in affected areas
6. Similar bug pattern identification across codebase
7. Impact assessment on related components
8. Fix complexity estimation
```
### Pattern Recognition with AI
Leverage AI to identify common bug patterns:
**Memory Leak Patterns:**
- Event listeners not cleaned up
- Circular references in closures
- Cache without eviction policy
- Detached DOM nodes
**Concurrency Issues:**
- Race conditions in async operations
- Deadlocks in resource acquisition
- Missing synchronization primitives
- Incorrect promise chaining
**Integration Failures:**
- Retry logic without backoff
- Missing timeout configurations
- Incorrect error handling
- API contract violations
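Of the integration patterns above, "retry logic without backoff" has a well-known corrective shape. A minimal sketch of retry with exponential backoff and jitter — the injectable `sleep` parameter is a testing convenience, not part of any particular library:

```python
import random
import time

def retry_with_backoff(fn, *, attempts=5, base_delay=0.1, max_delay=5.0,
                       is_transient=lambda e: True, sleep=time.sleep):
    """Retry `fn` on transient errors with exponential backoff plus jitter.

    `sleep` is injectable so tests do not actually wait.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as e:
            if not is_transient(e) or attempt == attempts - 1:
                raise
            # Exponential backoff capped at max_delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads retries to avoid a thundering herd
            sleep(delay * random.uniform(0.5, 1.0))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("gateway timeout")
    return "ok"

result = retry_with_backoff(flaky, sleep=lambda _: None)
```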
### Automated Evidence Collection
Implement systematic evidence gathering:
```javascript
// Evidence collector for Node.js
class DebugEvidenceCollector {
  constructor(issueId) {
    this.issueId = issueId;
    this.evidence = {
      environment: {},
      state: {},
      timeline: [],
      metrics: {}
    };
  }

  async collectEnvironment() {
    this.evidence.environment = {
      nodeVersion: process.version,
      platform: process.platform,
      memory: process.memoryUsage(),
      uptime: process.uptime(),
      envVars: this.sanitizeEnvVars(),
      dependencies: await this.getPackageVersions()
    };
  }

  captureState(label, data) {
    this.evidence.timeline.push({
      timestamp: Date.now(),
      label,
      data: this.deepClone(data),
      stackTrace: new Error().stack
    });
  }

  async generateReport() {
    return {
      issueId: this.issueId,
      timestamp: new Date().toISOString(),
      evidence: this.evidence,
      analysis: await this.runAIAnalysis()
    };
  }

  async runAIAnalysis() {
    // Call AI assistant API with collected evidence
    // Returns structured analysis with probable causes
  }
}
```
## Debugging Strategy Selection
### Decision Matrix for Debugging Approaches
Based on issue characteristics, select appropriate strategy:
**1. Interactive Debugging**
- When: Reproducible in local environment
- Tools: VS Code debugger, Chrome DevTools
- Approach: Step-through debugging with breakpoints
- AI Assist: Suggest breakpoint locations
**2. Observability-Driven Debugging**
- When: Production issues or hard to reproduce locally
- Tools: Sentry, DataDog, Honeycomb
- Approach: Trace analysis and log correlation
- AI Assist: Pattern recognition in traces/logs
**3. Time-Travel Debugging**
- When: Complex state management issues
- Tools: rr (Record and Replay), Undo, Cypress Time Travel
- Approach: Record execution and replay with full state
- AI Assist: Identify critical replay points
**4. Chaos Engineering**
- When: Intermittent failures under load
- Tools: Chaos Monkey, Gremlin, Litmus
- Approach: Deliberately inject failures to reproduce
- AI Assist: Suggest failure scenarios
**5. Statistical Debugging**
- When: Issue occurs in small percentage of cases
- Tools: Delta debugging, statistical analysis
- Approach: Compare successful vs failed executions
- AI Assist: Identify differentiating factors
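The "compare successful vs failed executions" idea can be sketched as a feature-ranking pass over run metadata. The boolean features here are hypothetical, and real statistical debugging (e.g. spectrum-based fault localization) uses proper predicates and significance tests:

```python
def differentiating_factors(passed: list[dict],
                            failed: list[dict]) -> list[tuple[str, float]]:
    """Rank boolean run features by |P(feature | failed) - P(feature | passed)|.

    The top-ranked features are the strongest candidates for what
    distinguishes failing executions from successful ones.
    """
    # Union of all feature names seen in either cohort
    features = set().union(*passed, *failed)
    scores = {}
    for f in features:
        p_fail = sum(run.get(f, False) for run in failed) / len(failed)
        p_pass = sum(run.get(f, False) for run in passed) / len(passed)
        scores[f] = abs(p_fail - p_pass)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

passed = [{"cache_hit": True, "retry": False}, {"cache_hit": True, "retry": True}]
failed = [{"cache_hit": False, "retry": True}, {"cache_hit": False, "retry": False}]
ranking = differentiating_factors(passed, failed)
```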
### Strategy Selection Algorithm
```python
def select_debugging_strategy(issue):
    """AI-powered strategy selection"""
    score_matrix = {
        'interactive': 0,
        'observability': 0,
        'time_travel': 0,
        'chaos': 0,
        'statistical': 0
    }

    # Scoring factors
    if issue.reproducible_locally:
        score_matrix['interactive'] += 40
        score_matrix['time_travel'] += 30
    if issue.production_only:
        score_matrix['observability'] += 50
        score_matrix['interactive'] -= 30
    if issue.state_complex:
        score_matrix['time_travel'] += 40
        score_matrix['interactive'] += 20
    if issue.intermittent:
        score_matrix['statistical'] += 45
        score_matrix['chaos'] += 35
    if issue.under_load:
        score_matrix['chaos'] += 40
        score_matrix['observability'] += 30

    # AI assistant provides additional scoring based on
    # historical success rates and issue similarity
    ai_scores = get_ai_strategy_recommendations(issue)
    for strategy, adjustment in ai_scores.items():
        score_matrix[strategy] += adjustment

    # Return top 2 strategies
    return sorted(score_matrix.items(),
                  key=lambda x: x[1],
                  reverse=True)[:2]
```
## Production-Safe Debugging Techniques
### Non-Invasive Debugging
**1. Dynamic Instrumentation**
```javascript
// Using OpenTelemetry for production debugging
const { trace } = require('@opentelemetry/api');

function debuggableFunction(userId, data) {
  const span = trace.getActiveSpan();
  // Add debug attributes without modifying logic
  span?.setAttribute('debug.userId', userId);
  span?.setAttribute('debug.dataSize', JSON.stringify(data).length);
  try {
    const result = processData(data);
    span?.setAttribute('debug.resultType', typeof result);
    return result;
  } catch (error) {
    span?.recordException(error);
    span?.setAttribute('debug.errorPath', error.stack);
    throw error;
  }
}
```
**2. Feature-Flagged Debug Logging**
```typescript
// Conditional debug logging for specific users
import { logger } from './logger';
import { featureFlags } from './feature-flags';

function debugLog(context: string, data: any) {
  if (featureFlags.isEnabled('debug-logging', { userId: data.userId })) {
    logger.debug(context, {
      timestamp: Date.now(),
      data: sanitize(data),
      stackTrace: new Error().stack
    });
  }
}

async function processOrder(order: Order) {
  debugLog('order:start', { orderId: order.id, userId: order.userId });
  const result = await fulfillOrder(order); // business logic (placeholder)
  debugLog('order:complete', { orderId: order.id, status: result.status });
  return result;
}
```
**3. Sampling-Based Profiling**
```python
# Continuous profiling with minimal overhead
import os

import pyroscope

pyroscope.configure(
    application_name="my-service",
    server_address="http://pyroscope:4040",
    sample_rate=100,  # Hz - 100 samples per second
    detect_subprocesses=True,
    tags={
        "env": os.getenv("ENV"),
        "version": os.getenv("VERSION")
    }
)
# Profiling runs automatically, query results in Pyroscope UI
# Filter by specific time ranges when bug occurred
```
### Safe State Inspection
**1. Read-Only Debugging Endpoints**
```go
// Debug endpoints protected by auth and rate limiting
func SetupDebugRoutes(r *mux.Router, authMiddleware AuthMiddleware) {
    debug := r.PathPrefix("/debug").Subrouter()
    debug.Use(authMiddleware.RequireAdmin)
    debug.Use(ratelimit.New(5, time.Minute)) // 5 requests per minute

    debug.HandleFunc("/state/{requestId}", func(w http.ResponseWriter, r *http.Request) {
        // Read-only state inspection
        requestId := mux.Vars(r)["requestId"]
        state, err := stateStore.GetSnapshot(requestId)
        if err != nil {
            http.Error(w, err.Error(), http.StatusNotFound)
            return
        }
        json.NewEncoder(w).Encode(state)
    }).Methods("GET")

    debug.HandleFunc("/traces/{traceId}", handleTraceQuery).Methods("GET")
    debug.HandleFunc("/metrics/recent", handleRecentMetrics).Methods("GET")
}
```
**2. Immutable Event Sourcing for Debugging**
```typescript
// Event store provides complete history for debugging
interface DebugEvent {
  eventId: string;
  timestamp: number;
  type: string;
  aggregateId: string;
  payload: any;
  metadata: {
    userId?: string;
    sessionId?: string;
    traceId?: string;
    causationId?: string;
  };
}

class DebugEventStore {
  async getEventStream(aggregateId: string): Promise<DebugEvent[]> {
    // Reconstruct complete state history
    return await this.db.query(
      'SELECT * FROM events WHERE aggregate_id = $1 ORDER BY timestamp',
      [aggregateId]
    );
  }

  async replayToPoint(aggregateId: string, timestamp: number): Promise<any> {
    const events = await this.getEventStream(aggregateId);
    const relevantEvents = events.filter(e => e.timestamp <= timestamp);
    // Replay events to reconstruct state at specific point
    return this.applyEvents(relevantEvents);
  }
}
```
### Gradual Traffic Shifting for Debugging
```yaml
# Kubernetes canary deployment for debug version
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-service
  ports:
    - port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: my-service
      version: stable
  template:
    metadata:
      labels:
        app: my-service
        version: stable
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service-debug
spec:
  replicas: 1  # ~10% of traffic alongside 9 stable replicas
  selector:
    matchLabels:
      app: my-service
      version: debug
  template:
    metadata:
      labels:
        app: my-service
        version: debug
      annotations:
        instrumentation.opentelemetry.io/inject-sdk: "true"
    spec:
      containers:
        - name: app
          env:
            - name: DEBUG_MODE
              value: "true"
            - name: LOG_LEVEL
              value: "debug"
```
## Observability Integration
### Distributed Tracing Integration
**Honeycomb Query-Driven Debugging:**
```javascript
// Instrumentation for query-driven debugging
const { trace, context } = require('@opentelemetry/api');
const { HoneycombSDK } = require('@honeycombio/opentelemetry-node');

const sdk = new HoneycombSDK({
  apiKey: process.env.HONEYCOMB_API_KEY,
  dataset: 'my-service',
  serviceName: 'api-server'
});

function instrumentForDebugging(fn, metadata = {}) {
  return async function(...args) {
    const tracer = trace.getTracer('debugger');
    const span = tracer.startSpan(metadata.operationName || fn.name);

    // Add debugging context
    span.setAttribute('debug.functionName', fn.name);
    span.setAttribute('debug.argsCount', args.length);
    span.setAttribute('debug.timestamp', Date.now());

    // Add custom metadata for filtering in Honeycomb
    Object.entries(metadata).forEach(([key, value]) => {
      span.setAttribute(`debug.${key}`, value);
    });

    try {
      const result = await context.with(
        trace.setSpan(context.active(), span),
        () => fn.apply(this, args)
      );
      span.setAttribute('debug.resultType', typeof result);
      span.setStatus({ code: 1 }); // OK
      return result;
    } catch (error) {
      span.recordException(error);
      span.setAttribute('debug.errorType', error.constructor.name);
      span.setStatus({ code: 2, message: error.message }); // ERROR
      throw error;
    } finally {
      span.end();
    }
  };
}

// Usage with AI-suggested instrumentation points
const debugProcess = instrumentForDebugging(processPayment, {
  operationName: 'payment.process',
  criticalPath: true,
  debugPriority: 'high'
});
```
**Honeycomb Query Examples:**
```
# Find slow traces affecting specific users
BREAKDOWN(trace.trace_id)
WHERE duration_ms > 1000
  AND user.id IN ("12345", "67890")
ORDER BY duration_ms DESC

# Compare successful vs failed requests
HEATMAP(duration_ms)
WHERE endpoint = "/api/checkout"
GROUP BY error_occurred

# Identify correlated services in failures
COUNT_DISTINCT(service.name)
WHERE error = true
GROUP BY trace.trace_id
```
### Sentry Integration for Error Context
```python
# Enhanced Sentry context for debugging
import os
import socket
import threading

import psutil
import sentry_sdk
from sentry_sdk import set_context, capture_exception, add_breadcrumb

def configure_debug_context(user=None, request_data=None):
    """Add rich context for debugging in Sentry"""
    if user:
        sentry_sdk.set_user({
            "id": user.id,
            "email": user.email,
            "segment": user.segment,
            "subscription_tier": user.tier
        })

    if request_data:
        set_context("request_details", {
            "endpoint": request_data.get("endpoint"),
            "method": request_data.get("method"),
            "params": sanitize_params(request_data.get("params")),
            "headers": sanitize_headers(request_data.get("headers"))
        })

    # Add system context
    set_context("system", {
        "hostname": socket.gethostname(),
        "process_id": os.getpid(),
        "thread_id": threading.get_ident(),
        "memory_mb": psutil.Process().memory_info().rss / 1024 / 1024
    })

def debug_operation(operation_name):
    """Decorator for debugging with breadcrumbs"""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            add_breadcrumb(
                category='debug',
                message=f'Entering {operation_name}',
                level='debug',
                data={'args_count': len(args), 'kwargs_keys': list(kwargs.keys())}
            )
            try:
                result = fn(*args, **kwargs)
                add_breadcrumb(
                    category='debug',
                    message=f'Completed {operation_name}',
                    level='debug',
                    data={'result_type': type(result).__name__}
                )
                return result
            except Exception as e:
                add_breadcrumb(
                    category='error',
                    message=f'Failed {operation_name}',
                    level='error',
                    data={'error': str(e)}
                )
                capture_exception(e)
                raise
        return wrapper
    return decorator

# AI-powered error grouping in Sentry
# Configure fingerprinting for better debugging
sentry_sdk.init(
    dsn=os.getenv("SENTRY_DSN"),
    before_send=lambda event, hint: enhance_event_for_debugging(event, hint),
    traces_sample_rate=0.1,
    profiles_sample_rate=0.1
)

def enhance_event_for_debugging(event, hint):
    """Add AI-suggested fingerprinting"""
    if 'exception' in event:
        exc = event['exception']['values'][0]
        # Custom fingerprinting based on error patterns
        fingerprint = ['{{ default }}']
        # AI can suggest better grouping strategies
        if 'database' in exc.get('type', '').lower():
            fingerprint.append('db-error')
            fingerprint.append(extract_db_operation(exc))
        event['fingerprint'] = fingerprint
    return event
```
## Post-Debugging Validation
### Automated Fix Verification
After implementing fix, run comprehensive validation:
```typescript
// Post-fix validation framework
interface ValidationResult {
  testsPassed: boolean;
  performanceRegression: boolean;
  errorRateChanged: boolean;
  metricsComparison: MetricsComparison;
  recommendations: string[];
}

class DebugFixValidator {
  async validateFix(
    issueId: string,
    fixCommit: string,
    baselineCommit: string
  ): Promise<ValidationResult> {
    const results: ValidationResult = {
      testsPassed: false,
      performanceRegression: false,
      errorRateChanged: false,
      metricsComparison: {},
      recommendations: []
    };

    // 1. Run existing test suite
    const testResults = await this.runTests(fixCommit);
    results.testsPassed = testResults.allPassed;
    if (!results.testsPassed) {
      results.recommendations.push(
        'Fix broke existing tests. Review test failures.'
      );
      return results;
    }

    // 2. Performance comparison
    const perfBaseline = await this.runPerfTests(baselineCommit);
    const perfAfterFix = await this.runPerfTests(fixCommit);
    results.performanceRegression = this.detectRegression(
      perfBaseline,
      perfAfterFix
    );
    if (results.performanceRegression) {
      results.recommendations.push(
        `Performance regression detected: ${this.formatDiff(perfBaseline, perfAfterFix)}`
      );
    }

    // 3. Canary deployment validation
    if (process.env.ENABLE_CANARY === 'true') {
      const canaryResults = await this.runCanaryDeployment(fixCommit);
      results.errorRateChanged = canaryResults.errorRateDelta > 0.05;
      if (results.errorRateChanged) {
        results.recommendations.push(
          `Error rate increased by ${(canaryResults.errorRateDelta * 100).toFixed(2)}%`
        );
      }
    }

    // 4. AI-powered code review of the fix
    const aiReview = await this.getAICodeReview(issueId, fixCommit);
    results.recommendations.push(...aiReview.suggestions);

    return results;
  }

  private async getAICodeReview(
    issueId: string,
    commit: string
  ): Promise<AIReview> {
    // Use GitHub Copilot or Claude to review the fix
    const diff = await this.getCommitDiff(commit);
    return await aiAssistant.review({
      context: `Reviewing fix for issue ${issueId}`,
      diff,
      checks: [
        'error handling completeness',
        'edge case coverage',
        'potential side effects',
        'test coverage adequacy',
        'code clarity and maintainability'
      ]
    });
  }
}
```
### Regression Prevention
```python
# Automated regression test generation
class RegressionTestGenerator:
    def __init__(self, issue_tracker, ai_assistant):
        self.issue_tracker = issue_tracker
        self.ai_assistant = ai_assistant

    async def generate_tests_for_fix(self, issue_id: str, fix_commit: str):
        """Generate regression tests using AI"""
        # Get issue details
        issue = await self.issue_tracker.get(issue_id)

        # Get code changes
        diff = await self.get_git_diff(fix_commit)

        # AI generates test cases
        test_cases = await self.ai_assistant.generate_tests({
            'issue_description': issue.description,
            'reproduction_steps': issue.reproduction_steps,
            'code_changes': diff,
            'test_framework': self.detect_test_framework(),
            'coverage_target': 'edge cases and failure modes'
        })

        # Write tests to appropriate files
        for test_case in test_cases:
            await self.write_test_file(
                test_case.file_path,
                test_case.content
            )

        # Validate tests catch the original bug
        validation = await self.validate_tests_catch_bug(
            issue_id,
            fix_commit
        )

        return {
            'tests_generated': len(test_cases),
            'validates_fix': validation.successful,
            'test_files': [tc.file_path for tc in test_cases]
        }
```
### Knowledge Base Update
```javascript
// Automatically update debugging knowledge base
class DebugKnowledgeBase {
  async recordDebugSession(session) {
    const entry = {
      issueId: session.issueId,
      timestamp: new Date().toISOString(),
      errorPattern: session.errorSignature,
      rootCause: session.rootCause,
      debugStrategy: session.strategyUsed,
      timeToResolve: session.duration,
      effectiveTools: session.toolsUsed,
      searchKeywords: await this.extractKeywords(session),
      relatedIssues: await this.findSimilarIssues(session),
      preventionMeasures: session.preventionRecommendations,
      aiInsights: session.aiAssistantAnalysis
    };

    await this.db.insert('debug_sessions', entry);

    // Update AI model training data
    await this.ai.addTrainingExample({
      input: {
        errorMessage: session.error,
        stackTrace: session.stackTrace,
        context: session.environment
      },
      output: {
        rootCause: session.rootCause,
        solution: session.solution,
        confidence: session.confidenceScore
      }
    });
  }

  async getSimilarDebugSessions(errorSignature) {
    // Vector similarity search for similar issues
    return await this.vectorDb.similaritySearch(
      errorSignature,
      {
        limit: 5,
        threshold: 0.8
      }
    );
  }
}
```
## Complete Examples
### Example 1: AI-Powered Debugging Session with GitHub Copilot
```typescript
/**
* Complete debugging session for intermittent checkout failure
* Using: GitHub Copilot Chat, DataDog, Sentry
*/
// Issue: "Checkout fails intermittently with 'Payment processing timeout'"
// Step 1: AI-assisted initial analysis
// Copilot Chat prompt: "Analyze this error pattern and suggest root causes"
import { DataDogClient } from '@datadog/datadog-api-client';
import * as Sentry from '@sentry/node';
class CheckoutDebugSession {
private dd: DataDogClient;
private sessionId: string;
constructor(sessionId: string) {
this.sessionId = sessionId;
this.dd = new DataDogClient(process.env.DD_API_KEY);
}
async investigateIssue() {
console.log('=== Starting AI-Assisted Debug Session ===');
// Step 2: Gather observability data
const sentryIssues = await this.getSentryErrorGroup();
const ddTraces = await this.getDataDogTraces();
const ddMetrics = await this.getRelevantMetrics();
console.log('\n[1] Sentry Error Analysis:');
console.log(` - Occurrences: ${sentryIssues.count}`);
console.log(` - Affected users: ${sentryIssues.userCount}`);
console.log(` - First seen: ${sentryIssues.firstSeen}`);
console.log(` - Last seen: ${sentryIssues.lastSeen}`);
console.log(` - User impact: ${sentryIssues.impactScore}`);
// Step 3: AI analysis of error patterns
// GitHub Copilot analyzes the error group and suggests:
// "Payment timeout correlates with high database latency"
console.log('\n[2] DataDog Trace Analysis:');
const slowTraces = ddTraces.filter(t => t.duration > 5000);
console.log(` - Total traces analyzed: ${ddTraces.length}`);
console.log(` - Slow traces (>5s): ${slowTraces.length}`);
// AI identifies pattern: DB queries taking 4-6 seconds
const dbSpans = slowTraces.flatMap(t =>
t.spans.filter(s => s.resource.startsWith('SELECT'))
);
console.log(` - Slow DB queries: ${dbSpans.length}`);
console.log(` - Slowest query: ${this.formatQuery(dbSpans[0])}`);
// Step 4: Hypothesis generation with AI
const hypotheses = [
{
name: 'Database N+1 query in payment verification',
probability: 85,
evidence: 'Multiple SELECT queries to user_payment_methods table',
test: 'Add query logging and count queries per checkout'
},
{
name: 'Lock contention on payment_transactions table',
probability: 60,
evidence: 'Correlation with concurrent checkouts',
test: 'Check pg_stat_activity for blocked queries'
},
{
name: 'External payment gateway timeout',
probability: 45,
evidence: 'Some traces show gateway response > 3s',
test: 'Add separate instrumentation for gateway calls'
}
];
console.log('\n[3] AI-Generated Hypotheses:');
hypotheses.forEach((h, i) => {
console.log(` ${i + 1}. ${h.name} (${h.probability}%)`);
console.log(` Evidence: ${h.evidence}`);
console.log(` Test: ${h.test}`);
});
// Step 5: Intelligent breakpoint placement
// AI suggests key points to instrument
const instrumentationPoints = await this.addSmartInstrumentation();
console.log('\n[4] Added Smart Instrumentation:');
instrumentationPoints.forEach(point => {
console.log(` - ${point.file}:${point.line} - ${point.reason}`);
});
// Step 6: Deploy instrumented version to 10% traffic
await this.deployCanaryWithInstrumentation();
console.log('\n[5] Canary Deployment:');
console.log(' - Deployed instrumented version to 10% traffic');
console.log(' - Monitoring for 15 minutes...');
// Wait and collect data
await this.sleep(15 * 60 * 1000);
// Step 7: Analyze collected data with AI
const analysis = await this.analyzeInstrumentationData();
console.log('\n[6] Root Cause Identified:');
console.log(` - ${analysis.rootCause}`);
console.log(` - Confidence: ${analysis.confidence}%`);
console.log(` - Evidence: ${analysis.evidence}`);
// Step 8: AI suggests fix
const suggestedFix = await this.generateFix(analysis);
console.log('\n[7] Suggested Fix:');
console.log(suggestedFix.code);
console.log(`\n - Impact: ${suggestedFix.impact}`);
console.log(` - Risk: ${suggestedFix.risk}`);
console.log(` - Test coverage: ${suggestedFix.testCoverage}`);
return {
rootCause: analysis.rootCause,
fix: suggestedFix,
validationPlan: this.generateValidationPlan(analysis, suggestedFix)
};
}
private async getSentryErrorGroup() {
const issues = await Sentry.getIssue('CHECKOUT_TIMEOUT_001');
return {
count: issues.count,
userCount: issues.userCount,
firstSeen: issues.firstSeen,
lastSeen: issues.lastSeen,
impactScore: this.calculateImpact(issues),
breadcrumbs: issues.latestEvent.breadcrumbs,
tags: issues.tags
};
}
private async getDataDogTraces() {
const query = `
service:checkout-api
operation_name:process_payment
@error:true
@duration:>5000ms
`;
return await this.dd.traces.search({
query,
from: Date.now() - 24 * 3600 * 1000,
to: Date.now(),
limit: 100
});
}
private async addSmartInstrumentation() {
// AI suggests these instrumentation points
return [
{
file: 'src/checkout/payment.ts',
line: 145,
reason: 'Payment verification entry point'
},
{
file: 'src/checkout/payment.ts',
line: 178,
reason: 'Database query execution (potential N+1)'
},
{
file: 'src/checkout/payment.ts',
line: 203,
reason: 'External gateway call'
},
{
file: 'src/checkout/payment.ts',
line: 245,
reason: 'Transaction commit point'
}
];
}
private async analyzeInstrumentationData() {
// AI analyzes collected data and identifies root cause
return {
rootCause: 'N+1 query: Loading payment methods for each item in cart separately',
confidence: 92,
evidence: 'Average 15 queries per checkout, each taking 300-400ms',
affectedCode: 'src/checkout/payment.ts:178-195',
suggestedFix: 'Use eager loading with JOIN or batch query'
};
}
  private async generateFix(analysis: any) {
// AI generates the fix code
return {
code: `
// Before (N+1 query):
for (const item of cart.items) {
const paymentMethod = await PaymentMethod.findOne({
where: { userId: cart.userId, itemId: item.id }
});
await processPayment(item, paymentMethod);
}
// After (batched query):
const itemIds = cart.items.map(i => i.id);
const paymentMethods = await PaymentMethod.findAll({
where: {
userId: cart.userId,
itemId: { [Op.in]: itemIds }
}
});
const methodMap = new Map(
paymentMethods.map(pm => [pm.itemId, pm])
);
for (const item of cart.items) {
const paymentMethod = methodMap.get(item.id);
await processPayment(item, paymentMethod);
}
`.trim(),
impact: 'Reduces queries from ~15 to 1, expected 3-4s latency reduction',
risk: 'Low - preserves existing logic, only changes data fetching',
testCoverage: 'Add test for batch payment processing'
};
}
  private generateValidationPlan(analysis: any, fix: any) {
return {
steps: [
'Apply fix to local environment',
'Run existing payment test suite',
'Add new test for batch payment method loading',
'Deploy to staging with full instrumentation',
'Run load test simulating 100 concurrent checkouts',
'Compare latency metrics: baseline vs fix',
'Canary deploy to 10% production for 1 hour',
'Monitor error rate and latency in DataDog',
'If metrics improve by >50%, roll out to 100%'
],
successCriteria: {
errorRateReduction: '>90%',
latencyReduction: '>70%',
queryCountReduction: '>85%'
}
};
}
private sleep(ms: number) {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
// Run the debug session
const session = new CheckoutDebugSession('checkout-timeout-issue');
const result = await session.investigateIssue();
console.log('\n=== Debug Session Complete ===');
console.log(JSON.stringify(result, null, 2));
```
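The fix above is tied to a specific ORM, but the win is framework-independent. As a minimal sketch (all names here, such as `FakeStore` and `fetch_many`, are illustrative and not part of the example's codebase), the following counts data-store round-trips to show why batching collapses the query count:

```python
# Simulate a data store that counts round-trips, then compare the
# N+1 pattern (one query per item) against a single batched lookup.

class FakeStore:
    def __init__(self, rows):
        self.rows = rows          # item_id -> payment method
        self.round_trips = 0      # each call models one network round-trip

    def fetch_one(self, item_id):
        self.round_trips += 1
        return self.rows.get(item_id)

    def fetch_many(self, item_ids):
        self.round_trips += 1     # a single batched query
        return {i: self.rows[i] for i in item_ids if i in self.rows}

def checkout_n_plus_one(store, item_ids):
    return [store.fetch_one(i) for i in item_ids]   # 1 query per item

def checkout_batched(store, item_ids):
    by_id = store.fetch_many(item_ids)              # 1 query total
    return [by_id.get(i) for i in item_ids]

rows = {i: f"pm-{i}" for i in range(15)}
a, b = FakeStore(rows), FakeStore(rows)
checkout_n_plus_one(a, list(rows))
checkout_batched(b, list(rows))
print(a.round_trips, b.round_trips)  # 15 1
```

With ~300-400ms per round-trip, 15 queries versus 1 accounts for the multi-second latency the trace analysis surfaced.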
### Example 2: Observability-Driven Production Debugging
```python
"""
Complete workflow for debugging production memory leak
Using: Honeycomb, Pyroscope, Grafana, Claude Code
"""
import asyncio
import json
import os
from datetime import datetime, timedelta
from honeycomb import HoneycombClient
from pyroscope import Profiler
import anthropic
class ProductionMemoryLeakDebugger:
def __init__(self, service_name: str):
self.service_name = service_name
self.honeycomb = HoneycombClient(api_key=os.getenv("HONEYCOMB_API_KEY"))
        # Async client so messages.create can be awaited below
        self.anthropic = anthropic.AsyncAnthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
self.findings = []
async def debug_memory_leak(self):
"""
Complete debugging workflow for memory leak
"""
print("=== Production Memory Leak Investigation ===\n")
# Step 1: Identify memory growth pattern
print("[1] Analyzing Memory Growth Pattern")
memory_pattern = await self.analyze_memory_metrics()
print(f" - Memory growth rate: {memory_pattern['growth_rate_mb_per_hour']} MB/hour")
print(f" - Time to OOM: ~{memory_pattern['hours_to_oom']} hours")
print(f" - Pattern type: {memory_pattern['pattern_type']}")
self.findings.append({
"category": "memory_metrics",
"data": memory_pattern
})
# Step 2: Continuous profiling analysis
print("\n[2] Analyzing Continuous Profiling Data (Pyroscope)")
profile_analysis = await self.analyze_profiles()
print(f" - Top memory allocator: {profile_analysis['top_allocator']}")
print(f" - Allocation rate: {profile_analysis['alloc_rate_mb_per_sec']} MB/s")
print(f" - Suspected leak locations:")
for loc in profile_analysis['suspected_locations']:
print(f" - {loc['function']} at {loc['file']}:{loc['line']}")
self.findings.append({
"category": "profiling",
"data": profile_analysis
})
# Step 3: Distributed trace analysis
print("\n[3] Analyzing Request Traces for Memory Patterns")
trace_analysis = await self.analyze_traces_for_memory()
print(f" - Requests analyzed: {trace_analysis['request_count']}")
print(f" - Memory leak correlation:")
print(f" - High memory requests: {trace_analysis['high_memory_requests']}")
print(f" - Common patterns: {trace_analysis['common_patterns']}")
self.findings.append({
"category": "traces",
"data": trace_analysis
})
# Step 4: AI-powered root cause analysis
print("\n[4] AI Root Cause Analysis (Claude)")
root_cause = await self.ai_analyze_findings()
print(f" - Root cause: {root_cause['diagnosis']}")
print(f" - Confidence: {root_cause['confidence']}%")
print(f" - Evidence chain:")
for evidence in root_cause['evidence']:
print(f" - {evidence}")
# Step 5: Generate and test hypothesis
print("\n[5] Hypothesis Testing")
hypothesis = root_cause['hypothesis']
test_results = await self.test_hypothesis(hypothesis)
print(f" - Hypothesis: {hypothesis['statement']}")
print(f" - Test result: {test_results['outcome']}")
print(f" - Evidence: {test_results['evidence']}")
# Step 6: Implement targeted instrumentation
print("\n[6] Deploying Targeted Instrumentation")
instrumentation = await self.deploy_targeted_instrumentation(
root_cause['suspected_code_paths']
)
print(f" - Instrumented {len(instrumentation['points'])} code paths")
print(f" - Monitoring for 30 minutes...")
await asyncio.sleep(30 * 60) # Wait 30 minutes
# Step 7: Analyze instrumentation data
print("\n[7] Analyzing Instrumentation Results")
detailed_analysis = await self.analyze_instrumentation_data()
print(f" - Confirmed root cause: {detailed_analysis['confirmed']}")
print(f" - Leak location: {detailed_analysis['leak_location']}")
print(f" - Leak type: {detailed_analysis['leak_type']}")
# Step 8: AI generates fix
print("\n[8] Generating Fix (AI-assisted)")
fix = await self.generate_fix(detailed_analysis)
print(f" - Fix strategy: {fix['strategy']}")
print(f" - Code changes required: {len(fix['changes'])} files")
print(f" - Risk assessment: {fix['risk']}")
# Step 9: Validation plan
print("\n[9] Fix Validation Plan")
validation = self.create_validation_plan(fix)
for step_num, step in enumerate(validation['steps'], 1):
print(f" {step_num}. {step}")
return {
"root_cause": detailed_analysis,
"fix": fix,
"validation_plan": validation,
"findings": self.findings
}
async def analyze_memory_metrics(self):
"""Query Grafana/Prometheus for memory metrics"""
# Simulate Prometheus query
# In real implementation: query actual Prometheus
return {
"growth_rate_mb_per_hour": 45.3,
"hours_to_oom": 18.5,
"pattern_type": "linear_growth",
"baseline_memory_mb": 512,
"current_memory_mb": 1847,
"measurement_period_hours": 24
}
async def analyze_profiles(self):
"""Analyze Pyroscope continuous profiling data"""
# Query Pyroscope for memory allocation profiles
# Compare profiles over time to identify growing allocations
return {
"top_allocator": "cache_manager.add_entry()",
"alloc_rate_mb_per_sec": 0.012,
"suspected_locations": [
{
"function": "cache_manager.add_entry",
"file": "src/cache/manager.py",
"line": 145,
"alloc_percent": 67.3
},
{
"function": "request_handler.store_session",
"file": "src/api/handler.py",
"line": 89,
"alloc_percent": 18.2
}
],
"time_range": "last_24_hours"
}
async def analyze_traces_for_memory(self):
"""Analyze Honeycomb traces for memory-related patterns"""
# Honeycomb query to find traces with high memory allocation
query = """
BREAKDOWN(trace.trace_id)
WHERE service.name = '{service}'
AND memory.delta_mb > 10
ORDER BY memory.delta_mb DESC
LIMIT 100
""".format(service=self.service_name)
traces = await self.honeycomb.query(query)
# Analyze common patterns in high-memory traces
common_patterns = self.extract_common_patterns(traces)
return {
"request_count": len(traces),
"high_memory_requests": len([t for t in traces if t['memory_delta'] > 20]),
"common_patterns": [
"All include cache write operation",
"87% involve large JSON parsing",
"Cache eviction never triggered"
],
"top_endpoints": [
{"endpoint": "/api/data/sync", "count": 43},
{"endpoint": "/api/batch/process", "count": 28}
]
}
async def ai_analyze_findings(self):
"""Use Claude to analyze all findings and determine root cause"""
# Prepare context for Claude
context = {
"findings": self.findings,
"service": self.service_name,
"symptoms": "Linear memory growth, ~45MB/hour, OOM in ~18 hours"
}
prompt = f"""
Analyze the following production memory leak data and determine the root cause:
{json.dumps(context, indent=2)}
Provide:
1. Root cause diagnosis
2. Confidence level (0-100%)
3. Evidence chain supporting the diagnosis
4. Testable hypothesis
5. Suspected code paths
Format as JSON.
"""
message = await self.anthropic.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=2000,
messages=[{"role": "user", "content": prompt}]
)
        # Parse Claude's response; the hardcoded dict below shows representative output
        analysis = json.loads(message.content[0].text)
return {
"diagnosis": "Cache entries added but never evicted - missing TTL and size limit",
"confidence": 94,
"evidence": [
"Profiling shows cache_manager.add_entry() as top allocator (67%)",
"Traces show cache writes but no cache evictions",
"Linear growth pattern consistent with unbounded cache",
"Growth rate matches request rate × average entry size"
],
"hypothesis": {
"statement": "Cache has no eviction policy, causing unbounded memory growth",
"test": "Add cache size metrics and verify no evictions occurring",
"expected_outcome": "Cache size grows linearly with request count"
},
"suspected_code_paths": [
"src/cache/manager.py:add_entry()",
"src/cache/manager.py:__init__()",
"src/api/handler.py:store_session()"
]
}
async def test_hypothesis(self, hypothesis):
"""Deploy instrumentation to test hypothesis"""
# Add metrics to track cache size and evictions
# In real implementation: deploy instrumented version
await asyncio.sleep(5) # Simulate data collection
return {
"outcome": "CONFIRMED",
"evidence": "Cache size grew from 1,247 entries to 3,891 entries in 30 minutes. Zero evictions recorded.",
"metrics": {
"cache_size_start": 1247,
"cache_size_end": 3891,
"evictions_count": 0,
"additions_count": 2644
}
}
async def deploy_targeted_instrumentation(self, code_paths):
"""Deploy focused instrumentation on suspected code paths"""
instrumentation_points = []
for path in code_paths:
instrumentation_points.append({
"file": path,
"metrics": [
"cache.size",
"cache.evictions",
"cache.additions",
"memory.used_mb"
],
"log_level": "debug"
})
# In real implementation: update deployment with instrumentation
return {"points": instrumentation_points}
async def analyze_instrumentation_data(self):
"""Analyze detailed instrumentation data"""
return {
"confirmed": True,
"leak_location": "src/cache/manager.py:CacheManager",
"leak_type": "unbounded_cache",
"details": {
"cache_implementation": "dict without size limit",
"eviction_policy": "none",
"ttl_configured": False,
"max_size_configured": False
},
"impact": "All cache entries retained indefinitely"
}
async def generate_fix(self, analysis):
"""AI generates fix for the memory leak"""
prompt = f"""
Generate a fix for this memory leak:
{json.dumps(analysis, indent=2)}
Requirements:
- Add LRU cache with size limit
- Add TTL-based eviction
- Maintain existing API
- Production-safe changes only
Provide complete code and migration strategy.
"""
message = await self.anthropic.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=3000,
messages=[{"role": "user", "content": prompt}]
)
return {
"strategy": "Replace dict with cachetools.LRUCache, add TTL",
"changes": [
{
"file": "src/cache/manager.py",
"description": "Implement LRU cache with size limit and TTL",
"code": """
from cachetools import TTLCache
from threading import RLock
class CacheManager:
def __init__(self, max_size=10000, ttl_seconds=3600):
# LRU cache with size limit and TTL
self.cache = TTLCache(maxsize=max_size, ttl=ttl_seconds)
self.lock = RLock()
def add_entry(self, key, value):
with self.lock:
self.cache[key] = value
# Eviction happens automatically
def get_entry(self, key):
with self.lock:
return self.cache.get(key)
"""
},
{
"file": "src/config.py",
"description": "Add cache configuration",
"code": """
CACHE_MAX_SIZE = int(os.getenv('CACHE_MAX_SIZE', '10000'))
CACHE_TTL_SECONDS = int(os.getenv('CACHE_TTL_SECONDS', '3600'))
"""
}
],
"risk": "LOW - Backward compatible API, configurable limits",
"dependencies": ["cachetools>=5.3.0"],
"rollback_plan": "Feature flag to switch between old and new cache"
}
def create_validation_plan(self, fix):
"""Create comprehensive validation plan for the fix"""
return {
"steps": [
"Add comprehensive unit tests for cache eviction",
"Run memory profiling in staging with production traffic replay",
"Verify cache size remains bounded under load",
"Verify cache hit rate remains acceptable",
"Deploy with feature flag to 1% traffic",
"Monitor memory metrics for 2 hours",
"If stable, increase to 10% for 4 hours",
"If memory growth stopped, roll out to 100%",
"Continue monitoring for 24 hours post-rollout"
],
"success_criteria": {
"memory_growth": "< 5MB/hour (down from 45MB/hour)",
"cache_hit_rate": "> 85%",
"cache_size": "< 10,000 entries",
"eviction_rate": "> 0 evictions/minute",
"error_rate": "no increase"
},
"monitoring": [
"Memory usage (RSS)",
"Cache size metric",
"Cache hit/miss rates",
"Eviction rate",
"Request latency p50/p95/p99",
"Error rate"
]
}
def extract_common_patterns(self, traces):
"""Extract common patterns from trace data"""
# Simplified pattern extraction
return []
# Execute the debug workflow
async def main():
debugger = ProductionMemoryLeakDebugger("api-server")
result = await debugger.debug_memory_leak()
print("\n=== Debug Complete ===")
print(f"Root cause: {result['root_cause']['leak_location']}")
print(f"Fix strategy: {result['fix']['strategy']}")
print(f"\nNext steps:")
for i, step in enumerate(result['validation_plan']['steps'][:3], 1):
print(f" {i}. {step}")
if __name__ == "__main__":
asyncio.run(main())
```
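The fix in the workflow above delegates eviction to `cachetools.TTLCache`. As a sketch of what that dependency guarantees, here is a minimal stdlib-only cache enforcing the two invariants that stop the leak: a hard size cap with LRU eviction, and per-entry expiry. This is an illustration of the behavior, not the cachetools implementation.

```python
import time
from collections import OrderedDict

class BoundedTTLCache:
    def __init__(self, max_size=3, ttl_seconds=60.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock                  # injectable for testing expiry
        self._data = OrderedDict()          # key -> (expires_at, value)

    def set(self, key, value):
        self._data.pop(key, None)           # re-inserting refreshes LRU position
        self._data[key] = (self.clock() + self.ttl, value)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:      # expired: drop and report a miss
            del self._data[key]
            return None
        self._data.move_to_end(key)         # mark as recently used
        return value

cache = BoundedTTLCache(max_size=3)
for k in "abcd":
    cache.set(k, k.upper())
print(len(cache._data), cache.get("a"))  # 3 None  ('a' was evicted)
```

Either invariant alone would have capped the growth; together they also keep stale sessions from being served.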
## Reference Workflows
### Reference 1: Cursor IDE Time-Travel Debugging
Complete workflow for debugging state management bug using Cursor IDE's AI features and time-travel debugging:
1. **Initial Problem Identification**
- User reports: "Shopping cart shows wrong item count after page refresh"
- Reproduction rate: 15% of page refreshes
- Environment: React SPA with Redux state management
2. **AI-Assisted Code Analysis** (Cursor IDE)
- Use Cursor's "Explain this code" on CartReducer.ts
- AI identifies complex state update logic with 3 nested reducers
- Suggests potential race condition in async state hydration
3. **Time-Travel Debugging Setup** (Redux DevTools)
- Install Redux DevTools Extension with time-travel capability
- Add state serialization for replay
- Configure Redux store with DevTools enhancer
- Add state snapshot middleware
4. **Reproduction with Recording**
- Enable Redux DevTools recording
- Reproduce the bug (multiple attempts)
- Export state dump when bug occurs
- Save action timeline for analysis
5. **Time-Travel Analysis**
- Load saved state dump in DevTools
- Scrub through action timeline
- Identify moment where state diverges
- Use Cursor AI to analyze action sequence
- AI identifies: "State hydration dispatches before localStorage read completes"
6. **Root Cause Confirmation**
- Add breakpoints in async hydration logic
- Step through with Cursor's debug panel
- Confirm race condition: hydration action dispatches too early
- localStorage read hasn't completed yet
7. **AI-Generated Fix** (Cursor IDE)
- Ask Cursor: "Fix race condition in cart hydration"
- AI suggests: Add Promise wrapper and await localStorage read
- Review generated fix code
- Accept fix with modifications
8. **Validation with Time-Travel**
- Apply fix locally
- Replay saved action sequence with fixed code
- Verify state remains consistent through replay
- Test with 100 rapid page refreshes - no failures
9. **Automated Test Generation** (Cursor AI)
- Ask Cursor: "Generate test for cart hydration race condition"
- AI creates test that reproduces original race condition
- Test fails on old code, passes on fixed code
- Add test to suite
10. **Deployment and Monitoring**
- Deploy fix with feature flag
- Monitor cart error rates in Sentry
- Enable for 100% after 24 hours with no regressions
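The Redux-specific steps above generalize to any store driven by pure reducers: record (action, state-after) pairs, then replay the actions through the reducer and report the first index where the replayed state disagrees with the recording. The reducer and actions in this Python sketch are hypothetical stand-ins for the cart logic; the actual fix would be in TypeScript, and this only illustrates the replay technique.

```python
def cart_reducer(state, action):
    # Hypothetical pure reducer for a shopping cart
    if action["type"] == "ADD_ITEM":
        return {**state, "count": state["count"] + 1}
    if action["type"] == "HYDRATE":
        return {**state, "count": action["count"]}
    return state

def find_divergence(initial_state, recording, reducer):
    """recording: list of (action, expected_state_after) pairs."""
    state = initial_state
    for i, (action, expected) in enumerate(recording):
        state = reducer(state, action)
        if state != expected:
            return i, state, expected   # first point of divergence
    return None

# Buggy recording: HYDRATE fired before the storage read finished,
# so it carried count=0 and clobbered the two items already added.
recording = [
    ({"type": "ADD_ITEM"}, {"count": 1}),
    ({"type": "ADD_ITEM"}, {"count": 2}),
    ({"type": "HYDRATE", "count": 0}, {"count": 2}),
]
print(find_divergence({"count": 0}, recording, cart_reducer))
# → (2, {'count': 0}, {'count': 2})
```

The returned index is exactly what scrubbing the DevTools timeline finds by hand: the premature HYDRATE action.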
### Reference 2: Production Debugging with Distributed Tracing
Complete workflow for debugging cross-service latency issue:
1. **Alert Triggered**
- DataDog alert: "P95 latency for /api/recommendations endpoint > 2s"
- Affected: 5% of requests
- Pattern: Intermittent, no clear time correlation
2. **Honeycomb Query-Driven Investigation**
- Query: `WHERE endpoint = "/api/recommendations" AND duration_ms > 2000`
- BREAKDOWN by user_id, device_type, region
- Identifies: All slow requests from specific region (us-east-2)
3. **Distributed Trace Analysis**
- Examine full trace for slow request
- Service call chain: API → Auth → User Service → ML Service → Recommendations
- ML Service span shows 1.8s latency
- Most time in "model inference" operation
4. **Cross-Service Correlation**
- Query ML Service logs for same trace ID
- Correlate with GPU utilization metrics in Grafana
- Discover: GPU memory contention during specific hours
5. **AI-Assisted Pattern Recognition** (Claude Code)
- Feed trace data to Claude: "Analyze this latency pattern"
- AI identifies: Correlation with batch inference jobs
   - Batch jobs are scheduled every 30 minutes
   - They contend with real-time inference for GPU resources
6. **Hypothesis Formation**
- Primary: Batch jobs starve real-time inference of GPU resources
- Secondary: Model loading delay when GPU busy
- Test: Disable batch jobs and monitor latency
7. **Safe Production Testing**
- Feature flag to disable batch jobs in us-east-2 only
- Monitor for 1 hour
- Result: P95 latency drops to 350ms (from 2.1s)
- Hypothesis confirmed
8. **Solution Design** (AI-Assisted)
- Claude suggests: Separate GPU pools for batch vs real-time
- Alternative: Priority-based scheduling in ML framework
- Decision: Implement priority scheduling (faster, less infrastructure)
9. **Implementation**
- Add priority queue to ML inference service
- Real-time requests: high priority
- Batch requests: low priority
- Deploy to staging, load test confirms fix
10. **Gradual Rollout with Validation**
- Deploy to us-east-2 with 10% traffic
- Monitor latency, error rate, GPU utilization
- Roll out to 100% us-east-2
- Roll out to all regions over 48 hours
- Final result: P95 latency 320ms, no increased error rate
11. **Post-Incident Review**
- Document root cause in knowledge base
- Add synthetic monitoring for GPU contention
- Create alert for priority queue backlog
- Update ML service runbook with troubleshooting steps
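The priority scheduling chosen in step 8 can be sketched in a few lines: real-time requests always dequeue before batch jobs, with FIFO order preserved within each priority class. The priority values and request labels below are illustrative, not taken from the incident.

```python
import heapq

REALTIME, BATCH = 0, 1  # lower number = higher priority

class InferenceQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserves FIFO within a priority class

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, self._seq, request))
        self._seq += 1

    def next_request(self):
        return heapq.heappop(self._heap)[2]

q = InferenceQueue()
q.submit(BATCH, "batch-embeddings-1")
q.submit(REALTIME, "user-recommendation-42")
q.submit(BATCH, "batch-embeddings-2")
q.submit(REALTIME, "user-recommendation-43")
order = [q.next_request() for _ in range(4)]
print(order)
# ['user-recommendation-42', 'user-recommendation-43', 'batch-embeddings-1', 'batch-embeddings-2']
```

In production this needs a starvation guard (for example, aging batch priorities), which is the kind of follow-up the post-incident review's backlog alert is meant to catch.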
---
Issue to debug: $ARGUMENTS