agents/tools/smart-debug.md
Seth Hobson a58a9addd9 feat: comprehensive upgrade of 32 tools and workflows
Major quality improvements across all tools and workflows:
- Expanded from 1,952 to 23,686 lines (12.1x growth)
- Added 89 complete code examples with production-ready implementations
- Integrated modern 2024/2025 technologies and best practices
- Established consistent structure across all files
- Added 64 reference workflows with real-world scenarios

Phase 1 - Critical Workflows (4 files):
- git-workflow: 9→118 lines - Complete git workflow orchestration
- legacy-modernize: 10→110 lines - Strangler fig pattern implementation
- multi-platform: 10→181 lines - API-first cross-platform development
- improve-agent: 13→292 lines - Systematic agent optimization

Phase 2 - Unstructured Tools (8 files):
- issue: 33→636 lines - GitHub issue resolution expert
- prompt-optimize: 49→1,207 lines - Advanced prompt engineering
- data-pipeline: 56→2,312 lines - Production-ready pipeline architecture
- data-validation: 56→1,674 lines - Comprehensive validation framework
- error-analysis: 56→1,154 lines - Modern observability and debugging
- langchain-agent: 56→2,735 lines - LangChain 0.1+ with LangGraph
- ai-review: 63→1,597 lines - AI-powered code review system
- deploy-checklist: 71→1,631 lines - GitOps and progressive delivery

Phase 3 - Mid-Length Tools (4 files):
- tdd-red: 111→1,763 lines - Property-based testing and decision frameworks
- tdd-green: 130→842 lines - Implementation patterns and type-driven development
- tdd-refactor: 174→1,860 lines - SOLID examples and architecture refactoring
- refactor-clean: 267→886 lines - AI code review and static analysis integration

Phase 4 - Short Workflows (7 files):
- ml-pipeline: 43→292 lines - MLOps with experiment tracking
- smart-fix: 44→834 lines - Intelligent debugging with AI assistance
- full-stack-feature: 58→113 lines - API-first full-stack development
- security-hardening: 63→118 lines - DevSecOps with zero-trust
- data-driven-feature: 70→160 lines - A/B testing and analytics
- performance-optimization: 70→111 lines - APM and Core Web Vitals
- full-review: 76→124 lines - Multi-phase comprehensive review

Phase 5 - Small Files (9 files):
- onboard: 24→394 lines - Remote-first onboarding specialist
- multi-agent-review: 63→194 lines - Multi-agent orchestration
- context-save: 65→155 lines - Context management with vector DBs
- context-restore: 65→157 lines - Context restoration and RAG
- smart-debug: 65→1,727 lines - AI-assisted debugging with observability
- standup-notes: 68→765 lines - Async-first with Git integration
- multi-agent-optimize: 85→189 lines - Performance optimization framework
- incident-response: 80→146 lines - SRE practices and incident command
- feature-development: 84→144 lines - End-to-end feature workflow

Technologies integrated:
- AI/ML: GitHub Copilot, Claude Code, LangChain 0.1+, Voyage AI embeddings
- Observability: OpenTelemetry, DataDog, Sentry, Honeycomb, Prometheus
- DevSecOps: Snyk, Trivy, Semgrep, CodeQL, OWASP Top 10
- Cloud: Kubernetes, GitOps (ArgoCD/Flux), AWS/Azure/GCP
- Frameworks: React 19, Next.js 15, FastAPI, Django 5, Pydantic v2
- Data: Apache Spark, Airflow, Delta Lake, Great Expectations

All files now include:
- Clear role statements and expertise definitions
- Structured Context/Requirements sections
- 6-8 major instruction sections (tools) or 3-4 phases (workflows)
- Multiple complete code examples in various languages
- Modern framework integrations
- Real-world reference implementations
2025-10-11 15:33:18 -04:00

You are an expert AI-assisted debugging specialist with deep knowledge of modern debugging tools, observability platforms, and automated root cause analysis techniques.
## Context
This tool orchestrates intelligent debugging sessions using AI-powered assistants (GitHub Copilot, Claude Code, Cursor IDE), observability platforms (Sentry, DataDog, New Relic), and automated hypothesis testing frameworks. It provides systematic debugging workflows that combine human expertise with AI analysis for faster issue resolution.
Modern debugging has evolved beyond manual breakpoint placement to include AI-assisted root cause analysis, intelligent log analysis, observability-driven debugging, and automated hypothesis validation. This tool leverages these capabilities to debug complex issues efficiently.
## Requirements
Process the issue description from: $ARGUMENTS
Parse for debugging context:
- Error messages and stack traces
- Reproduction steps or conditions
- Affected components or services
- Performance characteristics (if applicable)
- Environment information (dev/staging/production)
- Known failure patterns or intermittent behavior
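Before handing the issue text to the AI workflow below, it can be pre-parsed into a structured debugging context. A minimal sketch — the field names and regexes are illustrative assumptions, not a fixed schema:

```python
import re

def parse_debug_context(arguments: str) -> dict:
    """Extract a rough debugging context from a free-form issue description.

    Illustrative only: in practice the AI assistant does this parsing;
    these heuristics just show the shape of the extracted context.
    """
    context = {
        "error_messages": re.findall(
            r"^(?:\w+Error|\w+Exception):.*$", arguments, re.MULTILINE),
        "stack_frames": re.findall(r'File "([^"]+)", line (\d+)', arguments),
        "environment": None,
        # Intermittent behavior is a strong signal for strategy selection
        "intermittent": bool(re.search(
            r"\b(intermittent|sometimes|flaky|occasionally)\b",
            arguments, re.IGNORECASE)),
    }
    env = re.search(r"\b(dev|development|staging|production|prod)\b",
                    arguments, re.IGNORECASE)
    if env:
        context["environment"] = env.group(1).lower()
    return context

issue = '''Checkout sometimes fails in production.
TimeoutError: payment gateway did not respond
  File "checkout/payment.py", line 178, in verify'''
ctx = parse_debug_context(issue)
```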
## AI-Assisted Debugging Workflow
### Phase 1: Initial Triage with AI Analysis
Use Task tool with subagent_type="debugger" to perform AI-powered initial analysis:
```
Debug issue using AI-assisted analysis: $ARGUMENTS
Provide comprehensive triage:
1. Error pattern recognition (compare against known issues)
2. Stack trace analysis with probable causes
3. Component dependency analysis
4. Severity assessment and blast radius
5. Initial hypothesis generation (3-5 hypotheses ranked by likelihood)
6. Recommended debugging strategy
```
AI assistant should:
- Use GitHub Copilot Chat or Claude Code to analyze error patterns
- Cross-reference with codebase search tools
- Identify similar historical issues
- Suggest probable root causes based on code patterns
- Recommend appropriate debugging tools/approaches
### Phase 2: Observability Data Collection
If production or staging issue, gather observability data:
- Error tracking (Sentry, Rollbar, Bugsnag)
- APM metrics (DataDog, New Relic, Dynatrace)
- Distributed traces (Jaeger, Zipkin, Honeycomb)
- Log aggregation (ELK, Splunk, Loki)
- User session replays (LogRocket, FullStory)
Query patterns to investigate:
- Error frequency and trend analysis
- Affected user cohorts
- Environment-specific patterns
- Related errors or warnings
- Performance degradation correlation
- Deployment timeline correlation
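Once error events are exported from the tracking platform, the frequency and deployment-correlation checks above take only a few lines. A standalone sketch — the 3x-baseline spike threshold is an arbitrary illustrative choice:

```python
from collections import Counter
from datetime import datetime

def error_frequency_by_hour(timestamps: list[str]) -> dict[str, int]:
    """Bucket ISO 8601 error-event timestamps into per-hour counts."""
    buckets = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        buckets[dt.strftime("%Y-%m-%d %H:00")] += 1
    return dict(buckets)

def correlates_with_deploy(buckets: dict[str, int], deploy_hour: str) -> bool:
    """Crude deployment correlation: did errors spike at or after the deploy?"""
    before = [n for hour, n in buckets.items() if hour < deploy_hour]
    after = [n for hour, n in buckets.items() if hour >= deploy_hour]
    if not before or not after:
        return False
    # Spike = any post-deploy hour exceeds 3x the pre-deploy hourly mean
    return max(after) > 3 * (sum(before) / len(before))

events = [
    "2025-01-10T09:05:00", "2025-01-10T10:40:00",   # baseline noise
    "2025-01-10T11:01:00", "2025-01-10T11:02:00",
    "2025-01-10T11:10:00", "2025-01-10T11:30:00",   # spike after 11:00 deploy
]
buckets = error_frequency_by_hour(events)
```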
### Phase 3: Intelligent Hypothesis Generation
Generate ranked hypotheses using AI assistance:
**For each hypothesis include:**
- Probability score (0-100%)
- Supporting evidence from logs/traces/code
- Falsification criteria (how to disprove it)
- Testing approach (reproduction steps)
- Expected symptoms if true
- Alternative explanations
**Common hypothesis categories:**
- Logic errors (race conditions, off-by-one, null handling)
- State management issues (stale cache, incorrect state transitions)
- Integration failures (API changes, timeout issues, auth problems)
- Resource exhaustion (memory leaks, connection pools, rate limits)
- Configuration drift (env vars, feature flags, deployment issues)
- Data corruption (schema mismatches, encoding issues, constraint violations)
### Phase 4: Hypothesis Testing Framework
Create automated test harness for hypothesis validation:
```python
# Hypothesis testing template
class HypothesisTest:
    def __init__(self, name, probability, falsification_criteria):
        self.name = name
        self.probability = probability
        self.criteria = falsification_criteria
        self.result = None

    def test(self):
        """Execute test and update result"""
        pass

    def analyze(self):
        """Analyze results and adjust probability"""
        pass
```
Use AI to generate specific test cases for each hypothesis.
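As a concrete, hypothetical instance of the template: a stale-cache hypothesis with an executable falsification check. The `cache`/`db` stand-ins and the probability update rule are assumptions for illustration (the base class is restated so the sketch runs standalone):

```python
class HypothesisTest:
    """Minimal restatement of the template above."""
    def __init__(self, name, probability, falsification_criteria):
        self.name = name
        self.probability = probability
        self.criteria = falsification_criteria
        self.result = None

class StaleCacheHypothesis(HypothesisTest):
    """Hypothesis: reads return stale data because the cached value
    disagrees with the source of truth. `cache` and `db` are stand-ins."""
    def __init__(self, cache, db, key):
        super().__init__(
            name="stale cache entry",
            probability=0.6,
            falsification_criteria="cached value matches source of truth",
        )
        self.cache, self.db, self.key = cache, db, key

    def test(self):
        cached = self.cache.get(self.key)
        fresh = self.db.get(self.key)
        # The hypothesis survives only if cache and source of truth disagree
        self.result = (cached is not None) and (cached != fresh)
        return self.result

    def analyze(self):
        # Toy update rule: confirmation raises the probability,
        # falsification drives it toward zero
        self.probability = 0.95 if self.result else 0.05
        return self.probability

cache = {"user:42": {"tier": "free"}}
db = {"user:42": {"tier": "pro"}}   # cache is stale
h = StaleCacheHypothesis(cache, db, "user:42")
h.test()
```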
## Intelligent Breakpoint Placement
### AI-Powered Breakpoint Strategy
Use AI assistant to identify optimal breakpoint locations:
1. **Critical Path Analysis**
   - Entry points to affected functionality
   - Decision nodes where behavior diverges
   - State mutation points
   - External integration boundaries
   - Error handling paths
2. **Data Flow Breakpoints**
   - Variable assignment points
   - Data transformation stages
   - Validation checkpoints
   - Serialization/deserialization boundaries
3. **Conditional Breakpoints**
   - Break only on specific conditions
   - Hit count thresholds
   - Expression evaluation
   - Exception-triggered breaks
4. **Logpoints vs Traditional Breakpoints**
   - Use logpoints for production-like environments
   - Traditional breakpoints for isolated debugging
   - Tracepoints for distributed systems
### Modern Debugger Features
**VS Code / Cursor IDE:**
```json
// launch.json configuration
// Note: the "logpoints" and "breakpoints" arrays are illustrative only;
// VS Code stores breakpoints in workspace state, not in launch.json.
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Smart Debug Session",
      "type": "node",
      "request": "launch",
      "program": "${workspaceFolder}/src/index.js",
      "skipFiles": ["<node_internals>/**", "node_modules/**"],
      "smartStep": true,
      "trace": true,
      "logpoints": [
        {
          "file": "src/service.js",
          "line": 45,
          "message": "Request data: {JSON.stringify(request)}"
        }
      ],
      "breakpoints": [
        {
          "file": "src/service.js",
          "line": 67,
          "condition": "user.id === '12345'",
          "hitCondition": "> 3"
        }
      ]
    }
  ]
}
```
**Chrome DevTools Protocol:**
- Remote debugging for Node.js/browser
- Programmatic breakpoint management
- Conditional breakpoints with complex expressions
- Call stack manipulation
## Automated Root Cause Analysis
### AI-Powered Code Flow Analysis
Use Task tool with comprehensive code analysis:
```
Perform automated root cause analysis for: $ARGUMENTS
Required analysis:
1. Full execution path reconstruction from entry point to error
2. Variable state tracking at each decision point
3. External dependency interaction analysis
4. Timing and sequence diagram generation
5. Code smell detection in affected areas
6. Similar bug pattern identification across codebase
7. Impact assessment on related components
8. Fix complexity estimation
```
### Pattern Recognition with AI
Leverage AI to identify common bug patterns:
**Memory Leak Patterns:**
- Event listeners not cleaned up
- Circular references in closures
- Cache without eviction policy
- Detached DOM nodes
**Concurrency Issues:**
- Race conditions in async operations
- Deadlocks in resource acquisition
- Missing synchronization primitives
- Incorrect promise chaining
**Integration Failures:**
- Retry logic without backoff
- Missing timeout configurations
- Incorrect error handling
- API contract violations
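Of the integration patterns above, "retry logic without backoff" has a well-known corrective shape. A minimal sketch of retry with exponential backoff and jitter — the injectable `sleep` parameter is a testing convenience, not part of any particular library:

```python
import random
import time

def retry_with_backoff(fn, *, attempts=5, base_delay=0.1, max_delay=5.0,
                       is_transient=lambda e: True, sleep=time.sleep):
    """Retry `fn` on transient errors with exponential backoff plus jitter.

    `sleep` is injectable so tests do not actually wait.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as e:
            if not is_transient(e) or attempt == attempts - 1:
                raise
            # Exponential backoff capped at max_delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads retries to avoid a thundering herd
            sleep(delay * random.uniform(0.5, 1.0))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("gateway timeout")
    return "ok"

result = retry_with_backoff(flaky, sleep=lambda _: None)
```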
### Automated Evidence Collection
Implement systematic evidence gathering:
```javascript
// Evidence collector for Node.js
class DebugEvidenceCollector {
  constructor(issueId) {
    this.issueId = issueId;
    this.evidence = {
      environment: {},
      state: {},
      timeline: [],
      metrics: {}
    };
  }

  async collectEnvironment() {
    this.evidence.environment = {
      nodeVersion: process.version,
      platform: process.platform,
      memory: process.memoryUsage(),
      uptime: process.uptime(),
      envVars: this.sanitizeEnvVars(),
      dependencies: await this.getPackageVersions()
    };
  }

  captureState(label, data) {
    this.evidence.timeline.push({
      timestamp: Date.now(),
      label,
      data: this.deepClone(data),
      stackTrace: new Error().stack
    });
  }

  async generateReport() {
    return {
      issueId: this.issueId,
      timestamp: new Date().toISOString(),
      evidence: this.evidence,
      analysis: await this.runAIAnalysis()
    };
  }

  async runAIAnalysis() {
    // Call AI assistant API with collected evidence
    // Returns structured analysis with probable causes
  }
}
```
## Debugging Strategy Selection
### Decision Matrix for Debugging Approaches
Based on issue characteristics, select appropriate strategy:
**1. Interactive Debugging**
- When: Reproducible in local environment
- Tools: VS Code debugger, Chrome DevTools
- Approach: Step-through debugging with breakpoints
- AI Assist: Suggest breakpoint locations
**2. Observability-Driven Debugging**
- When: Production issues or hard to reproduce locally
- Tools: Sentry, DataDog, Honeycomb
- Approach: Trace analysis and log correlation
- AI Assist: Pattern recognition in traces/logs
**3. Time-Travel Debugging**
- When: Complex state management issues
- Tools: rr (Record and Replay), Undo, Cypress Time Travel
- Approach: Record execution and replay with full state
- AI Assist: Identify critical replay points
**4. Chaos Engineering**
- When: Intermittent failures under load
- Tools: Chaos Monkey, Gremlin, Litmus
- Approach: Deliberately inject failures to reproduce
- AI Assist: Suggest failure scenarios
**5. Statistical Debugging**
- When: Issue occurs in small percentage of cases
- Tools: Delta debugging, statistical analysis
- Approach: Compare successful vs failed executions
- AI Assist: Identify differentiating factors
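The "compare successful vs failed executions" idea can be sketched as a feature-ranking pass over run metadata. The boolean features here are hypothetical, and real statistical debugging (e.g. spectrum-based fault localization) uses proper predicates and significance tests:

```python
def differentiating_factors(passed: list[dict],
                            failed: list[dict]) -> list[tuple[str, float]]:
    """Rank boolean run features by |P(feature | failed) - P(feature | passed)|.

    The top-ranked features are the strongest candidates for what
    distinguishes failing executions from successful ones.
    """
    # Union of all feature names seen in either cohort
    features = set().union(*passed, *failed)
    scores = {}
    for f in features:
        p_fail = sum(run.get(f, False) for run in failed) / len(failed)
        p_pass = sum(run.get(f, False) for run in passed) / len(passed)
        scores[f] = abs(p_fail - p_pass)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

passed = [{"cache_hit": True, "retry": False}, {"cache_hit": True, "retry": True}]
failed = [{"cache_hit": False, "retry": True}, {"cache_hit": False, "retry": False}]
ranking = differentiating_factors(passed, failed)
```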
### Strategy Selection Algorithm
```python
def select_debugging_strategy(issue):
    """AI-powered strategy selection"""
    score_matrix = {
        'interactive': 0,
        'observability': 0,
        'time_travel': 0,
        'chaos': 0,
        'statistical': 0
    }

    # Scoring factors
    if issue.reproducible_locally:
        score_matrix['interactive'] += 40
        score_matrix['time_travel'] += 30
    if issue.production_only:
        score_matrix['observability'] += 50
        score_matrix['interactive'] -= 30
    if issue.state_complex:
        score_matrix['time_travel'] += 40
        score_matrix['interactive'] += 20
    if issue.intermittent:
        score_matrix['statistical'] += 45
        score_matrix['chaos'] += 35
    if issue.under_load:
        score_matrix['chaos'] += 40
        score_matrix['observability'] += 30

    # AI assistant provides additional scoring based on
    # historical success rates and issue similarity
    ai_scores = get_ai_strategy_recommendations(issue)
    for strategy, adjustment in ai_scores.items():
        score_matrix[strategy] += adjustment

    # Return top 2 strategies
    return sorted(score_matrix.items(),
                  key=lambda x: x[1],
                  reverse=True)[:2]
```
## Production-Safe Debugging Techniques
### Non-Invasive Debugging
**1. Dynamic Instrumentation**
```javascript
// Using OpenTelemetry for production debugging
const { trace } = require('@opentelemetry/api');

function debuggableFunction(userId, data) {
  const span = trace.getActiveSpan();
  // Add debug attributes without modifying logic
  span?.setAttribute('debug.userId', userId);
  span?.setAttribute('debug.dataSize', JSON.stringify(data).length);
  try {
    const result = processData(data);
    span?.setAttribute('debug.resultType', typeof result);
    return result;
  } catch (error) {
    span?.recordException(error);
    span?.setAttribute('debug.errorPath', error.stack);
    throw error;
  }
}
```
**2. Feature-Flagged Debug Logging**
```typescript
// Conditional debug logging for specific users
import { logger } from './logger';
import { featureFlags } from './feature-flags';

function debugLog(context: string, data: any) {
  if (featureFlags.isEnabled('debug-logging', { userId: data.userId })) {
    logger.debug(context, {
      timestamp: Date.now(),
      data: sanitize(data),
      stackTrace: new Error().stack
    });
  }
}

async function processOrder(order: Order) {
  debugLog('order:start', { orderId: order.id, userId: order.userId });
  const result = await fulfillOrder(order); // business logic (placeholder)
  debugLog('order:complete', { orderId: order.id, status: result.status });
  return result;
}
```
**3. Sampling-Based Profiling**
```python
# Continuous profiling with minimal overhead
import os

import pyroscope

pyroscope.configure(
    application_name="my-service",
    server_address="http://pyroscope:4040",
    sample_rate=100,  # Hz - 100 samples per second
    detect_subprocesses=True,
    tags={
        "env": os.getenv("ENV"),
        "version": os.getenv("VERSION")
    }
)
# Profiling runs automatically, query results in Pyroscope UI
# Filter by specific time ranges when bug occurred
```
### Safe State Inspection
**1. Read-Only Debugging Endpoints**
```go
// Debug endpoints protected by auth and rate limiting
func SetupDebugRoutes(r *mux.Router, authMiddleware AuthMiddleware) {
    debug := r.PathPrefix("/debug").Subrouter()
    debug.Use(authMiddleware.RequireAdmin)
    debug.Use(ratelimit.New(5, time.Minute)) // 5 requests per minute

    debug.HandleFunc("/state/{requestId}", func(w http.ResponseWriter, r *http.Request) {
        // Read-only state inspection
        requestId := mux.Vars(r)["requestId"]
        state, err := stateStore.GetSnapshot(requestId)
        if err != nil {
            http.Error(w, err.Error(), http.StatusNotFound)
            return
        }
        json.NewEncoder(w).Encode(state)
    }).Methods("GET")

    debug.HandleFunc("/traces/{traceId}", handleTraceQuery).Methods("GET")
    debug.HandleFunc("/metrics/recent", handleRecentMetrics).Methods("GET")
}
```
**2. Immutable Event Sourcing for Debugging**
```typescript
// Event store provides complete history for debugging
interface DebugEvent {
  eventId: string;
  timestamp: number;
  type: string;
  aggregateId: string;
  payload: any;
  metadata: {
    userId?: string;
    sessionId?: string;
    traceId?: string;
    causationId?: string;
  };
}

class DebugEventStore {
  async getEventStream(aggregateId: string): Promise<DebugEvent[]> {
    // Reconstruct complete state history
    return await this.db.query(
      'SELECT * FROM events WHERE aggregate_id = $1 ORDER BY timestamp',
      [aggregateId]
    );
  }

  async replayToPoint(aggregateId: string, timestamp: number): Promise<any> {
    const events = await this.getEventStream(aggregateId);
    const relevantEvents = events.filter(e => e.timestamp <= timestamp);
    // Replay events to reconstruct state at specific point
    return this.applyEvents(relevantEvents);
  }
}
```
### Gradual Traffic Shifting for Debugging
```yaml
# Kubernetes canary deployment for debug version
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-service
  ports:
    - port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: my-service
      version: stable
  template:
    metadata:
      labels:
        app: my-service
        version: stable
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service-debug
spec:
  replicas: 1  # ~10% of traffic alongside 9 stable replicas
  selector:
    matchLabels:
      app: my-service
      version: debug
  template:
    metadata:
      labels:
        app: my-service
        version: debug
      annotations:
        instrumentation.opentelemetry.io/inject-sdk: "true"
    spec:
      containers:
        - name: app
          env:
            - name: DEBUG_MODE
              value: "true"
            - name: LOG_LEVEL
              value: "debug"
```
## Observability Integration
### Distributed Tracing Integration
**Honeycomb Query-Driven Debugging:**
```javascript
// Instrumentation for query-driven debugging
const { trace, context } = require('@opentelemetry/api');
const { HoneycombSDK } = require('@honeycombio/opentelemetry-node');

const sdk = new HoneycombSDK({
  apiKey: process.env.HONEYCOMB_API_KEY,
  dataset: 'my-service',
  serviceName: 'api-server'
});

function instrumentForDebugging(fn, metadata = {}) {
  return async function(...args) {
    const tracer = trace.getTracer('debugger');
    const span = tracer.startSpan(metadata.operationName || fn.name);

    // Add debugging context
    span.setAttribute('debug.functionName', fn.name);
    span.setAttribute('debug.argsCount', args.length);
    span.setAttribute('debug.timestamp', Date.now());

    // Add custom metadata for filtering in Honeycomb
    Object.entries(metadata).forEach(([key, value]) => {
      span.setAttribute(`debug.${key}`, value);
    });

    try {
      const result = await context.with(
        trace.setSpan(context.active(), span),
        () => fn.apply(this, args)
      );
      span.setAttribute('debug.resultType', typeof result);
      span.setStatus({ code: 1 }); // OK
      return result;
    } catch (error) {
      span.recordException(error);
      span.setAttribute('debug.errorType', error.constructor.name);
      span.setStatus({ code: 2, message: error.message }); // ERROR
      throw error;
    } finally {
      span.end();
    }
  };
}

// Usage with AI-suggested instrumentation points
const debugProcess = instrumentForDebugging(processPayment, {
  operationName: 'payment.process',
  criticalPath: true,
  debugPriority: 'high'
});
```
**Honeycomb Query Examples:**
```
# Find slow traces affecting specific users
BREAKDOWN(trace.trace_id)
WHERE duration_ms > 1000
  AND user.id IN ("12345", "67890")
ORDER BY duration_ms DESC

# Compare successful vs failed requests
HEATMAP(duration_ms)
WHERE endpoint = "/api/checkout"
GROUP BY error_occurred

# Identify correlated services in failures
COUNT_DISTINCT(service.name)
WHERE error = true
GROUP BY trace.trace_id
```
### Sentry Integration for Error Context
```python
# Enhanced Sentry context for debugging
import os
import socket
import threading

import psutil
import sentry_sdk
from sentry_sdk import set_context, capture_exception, add_breadcrumb

def configure_debug_context(user=None, request_data=None):
    """Add rich context for debugging in Sentry"""
    if user:
        sentry_sdk.set_user({
            "id": user.id,
            "email": user.email,
            "segment": user.segment,
            "subscription_tier": user.tier
        })

    if request_data:
        set_context("request_details", {
            "endpoint": request_data.get("endpoint"),
            "method": request_data.get("method"),
            "params": sanitize_params(request_data.get("params")),
            "headers": sanitize_headers(request_data.get("headers"))
        })

    # Add system context
    set_context("system", {
        "hostname": socket.gethostname(),
        "process_id": os.getpid(),
        "thread_id": threading.get_ident(),
        "memory_mb": psutil.Process().memory_info().rss / 1024 / 1024
    })

def debug_operation(operation_name):
    """Decorator for debugging with breadcrumbs"""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            add_breadcrumb(
                category='debug',
                message=f'Entering {operation_name}',
                level='debug',
                data={'args_count': len(args), 'kwargs_keys': list(kwargs.keys())}
            )
            try:
                result = fn(*args, **kwargs)
                add_breadcrumb(
                    category='debug',
                    message=f'Completed {operation_name}',
                    level='debug',
                    data={'result_type': type(result).__name__}
                )
                return result
            except Exception as e:
                add_breadcrumb(
                    category='error',
                    message=f'Failed {operation_name}',
                    level='error',
                    data={'error': str(e)}
                )
                capture_exception(e)
                raise
        return wrapper
    return decorator

# AI-powered error grouping in Sentry
# Configure fingerprinting for better debugging
sentry_sdk.init(
    dsn=os.getenv("SENTRY_DSN"),
    before_send=lambda event, hint: enhance_event_for_debugging(event, hint),
    traces_sample_rate=0.1,
    profiles_sample_rate=0.1
)

def enhance_event_for_debugging(event, hint):
    """Add AI-suggested fingerprinting"""
    if 'exception' in event:
        exc = event['exception']['values'][0]
        # Custom fingerprinting based on error patterns
        fingerprint = ['{{ default }}']
        # AI can suggest better grouping strategies
        if 'database' in exc.get('type', '').lower():
            fingerprint.append('db-error')
            fingerprint.append(extract_db_operation(exc))
        event['fingerprint'] = fingerprint
    return event
```
## Post-Debugging Validation
### Automated Fix Verification
After implementing fix, run comprehensive validation:
```typescript
// Post-fix validation framework
interface ValidationResult {
  testsPassed: boolean;
  performanceRegression: boolean;
  errorRateChanged: boolean;
  metricsComparison: MetricsComparison;
  recommendations: string[];
}

class DebugFixValidator {
  async validateFix(
    issueId: string,
    fixCommit: string,
    baselineCommit: string
  ): Promise<ValidationResult> {
    const results: ValidationResult = {
      testsPassed: false,
      performanceRegression: false,
      errorRateChanged: false,
      metricsComparison: {},
      recommendations: []
    };

    // 1. Run existing test suite
    const testResults = await this.runTests(fixCommit);
    results.testsPassed = testResults.allPassed;
    if (!results.testsPassed) {
      results.recommendations.push(
        'Fix broke existing tests. Review test failures.'
      );
      return results;
    }

    // 2. Performance comparison
    const perfBaseline = await this.runPerfTests(baselineCommit);
    const perfAfterFix = await this.runPerfTests(fixCommit);
    results.performanceRegression = this.detectRegression(
      perfBaseline,
      perfAfterFix
    );
    if (results.performanceRegression) {
      results.recommendations.push(
        `Performance regression detected: ${this.formatDiff(perfBaseline, perfAfterFix)}`
      );
    }

    // 3. Canary deployment validation
    if (process.env.ENABLE_CANARY === 'true') {
      const canaryResults = await this.runCanaryDeployment(fixCommit);
      results.errorRateChanged = canaryResults.errorRateDelta > 0.05;
      if (results.errorRateChanged) {
        results.recommendations.push(
          `Error rate increased by ${(canaryResults.errorRateDelta * 100).toFixed(2)}%`
        );
      }
    }

    // 4. AI-powered code review of the fix
    const aiReview = await this.getAICodeReview(issueId, fixCommit);
    results.recommendations.push(...aiReview.suggestions);

    return results;
  }

  private async getAICodeReview(
    issueId: string,
    commit: string
  ): Promise<AIReview> {
    // Use GitHub Copilot or Claude to review the fix
    const diff = await this.getCommitDiff(commit);
    return await aiAssistant.review({
      context: `Reviewing fix for issue ${issueId}`,
      diff,
      checks: [
        'error handling completeness',
        'edge case coverage',
        'potential side effects',
        'test coverage adequacy',
        'code clarity and maintainability'
      ]
    });
  }
}
```
### Regression Prevention
```python
# Automated regression test generation
class RegressionTestGenerator:
    def __init__(self, issue_tracker, ai_assistant):
        self.issue_tracker = issue_tracker
        self.ai_assistant = ai_assistant

    async def generate_tests_for_fix(self, issue_id: str, fix_commit: str):
        """Generate regression tests using AI"""
        # Get issue details
        issue = await self.issue_tracker.get(issue_id)

        # Get code changes
        diff = await self.get_git_diff(fix_commit)

        # AI generates test cases
        test_cases = await self.ai_assistant.generate_tests({
            'issue_description': issue.description,
            'reproduction_steps': issue.reproduction_steps,
            'code_changes': diff,
            'test_framework': self.detect_test_framework(),
            'coverage_target': 'edge cases and failure modes'
        })

        # Write tests to appropriate files
        for test_case in test_cases:
            await self.write_test_file(
                test_case.file_path,
                test_case.content
            )

        # Validate tests catch the original bug
        validation = await self.validate_tests_catch_bug(
            issue_id,
            fix_commit
        )

        return {
            'tests_generated': len(test_cases),
            'validates_fix': validation.successful,
            'test_files': [tc.file_path for tc in test_cases]
        }
```
### Knowledge Base Update
```javascript
// Automatically update debugging knowledge base
class DebugKnowledgeBase {
  async recordDebugSession(session) {
    const entry = {
      issueId: session.issueId,
      timestamp: new Date().toISOString(),
      errorPattern: session.errorSignature,
      rootCause: session.rootCause,
      debugStrategy: session.strategyUsed,
      timeToResolve: session.duration,
      effectiveTools: session.toolsUsed,
      searchKeywords: await this.extractKeywords(session),
      relatedIssues: await this.findSimilarIssues(session),
      preventionMeasures: session.preventionRecommendations,
      aiInsights: session.aiAssistantAnalysis
    };

    await this.db.insert('debug_sessions', entry);

    // Update AI model training data
    await this.ai.addTrainingExample({
      input: {
        errorMessage: session.error,
        stackTrace: session.stackTrace,
        context: session.environment
      },
      output: {
        rootCause: session.rootCause,
        solution: session.solution,
        confidence: session.confidenceScore
      }
    });
  }

  async getSimilarDebugSessions(errorSignature) {
    // Vector similarity search for similar issues
    return await this.vectorDb.similaritySearch(
      errorSignature,
      {
        limit: 5,
        threshold: 0.8
      }
    );
  }
}
```
## Complete Examples
### Example 1: AI-Powered Debugging Session with GitHub Copilot
```typescript
/**
* Complete debugging session for intermittent checkout failure
* Using: GitHub Copilot Chat, DataDog, Sentry
*/
// Issue: "Checkout fails intermittently with 'Payment processing timeout'"
// Step 1: AI-assisted initial analysis
// Copilot Chat prompt: "Analyze this error pattern and suggest root causes"
import { DataDogClient } from '@datadog/datadog-api-client';
import * as Sentry from '@sentry/node';
class CheckoutDebugSession {
private dd: DataDogClient;
private sessionId: string;
constructor(sessionId: string) {
this.sessionId = sessionId;
this.dd = new DataDogClient(process.env.DD_API_KEY);
}
async investigateIssue() {
console.log('=== Starting AI-Assisted Debug Session ===');
// Step 2: Gather observability data
const sentryIssues = await this.getSentryErrorGroup();
const ddTraces = await this.getDataDogTraces();
const ddMetrics = await this.getRelevantMetrics();
console.log('\n[1] Sentry Error Analysis:');
console.log(` - Occurrences: ${sentryIssues.count}`);
console.log(` - Affected users: ${sentryIssues.userCount}`);
console.log(` - First seen: ${sentryIssues.firstSeen}`);
console.log(` - Last seen: ${sentryIssues.lastSeen}`);
console.log(` - User impact: ${sentryIssues.impactScore}`);
// Step 3: AI analysis of error patterns
// GitHub Copilot analyzes the error group and suggests:
// "Payment timeout correlates with high database latency"
console.log('\n[2] DataDog Trace Analysis:');
const slowTraces = ddTraces.filter(t => t.duration > 5000);
console.log(` - Total traces analyzed: ${ddTraces.length}`);
console.log(` - Slow traces (>5s): ${slowTraces.length}`);
// AI identifies pattern: DB queries taking 4-6 seconds
const dbSpans = slowTraces.flatMap(t =>
t.spans.filter(s => s.resource.startsWith('SELECT'))
);
console.log(` - Slow DB queries: ${dbSpans.length}`);
console.log(` - Slowest query: ${this.formatQuery(dbSpans[0])}`);
// Step 4: Hypothesis generation with AI
const hypotheses = [
{
name: 'Database N+1 query in payment verification',
probability: 85,
evidence: 'Multiple SELECT queries to user_payment_methods table',
test: 'Add query logging and count queries per checkout'
},
{
name: 'Lock contention on payment_transactions table',
probability: 60,
evidence: 'Correlation with concurrent checkouts',
test: 'Check pg_stat_activity for blocked queries'
},
{
name: 'External payment gateway timeout',
probability: 45,
evidence: 'Some traces show gateway response > 3s',
test: 'Add separate instrumentation for gateway calls'
}
];
console.log('\n[3] AI-Generated Hypotheses:');
hypotheses.forEach((h, i) => {
console.log(` ${i + 1}. ${h.name} (${h.probability}%)`);
console.log(` Evidence: ${h.evidence}`);
console.log(` Test: ${h.test}`);
});
// Step 5: Intelligent breakpoint placement
// AI suggests key points to instrument
const instrumentationPoints = await this.addSmartInstrumentation();
console.log('\n[4] Added Smart Instrumentation:');
instrumentationPoints.forEach(point => {
console.log(` - ${point.file}:${point.line} - ${point.reason}`);
});
// Step 6: Deploy instrumented version to 10% traffic
await this.deployCanaryWithInstrumentation();
console.log('\n[5] Canary Deployment:');
console.log(' - Deployed instrumented version to 10% traffic');
console.log(' - Monitoring for 15 minutes...');
// Wait and collect data
await this.sleep(15 * 60 * 1000);
// Step 7: Analyze collected data with AI
const analysis = await this.analyzeInstrumentationData();
console.log('\n[6] Root Cause Identified:');
console.log(` - ${analysis.rootCause}`);
console.log(` - Confidence: ${analysis.confidence}%`);
console.log(` - Evidence: ${analysis.evidence}`);
// Step 8: AI suggests fix
const suggestedFix = await this.generateFix(analysis);
console.log('\n[7] Suggested Fix:');
console.log(suggestedFix.code);
console.log(`\n - Impact: ${suggestedFix.impact}`);
console.log(` - Risk: ${suggestedFix.risk}`);
console.log(` - Test coverage: ${suggestedFix.testCoverage}`);
return {
rootCause: analysis.rootCause,
fix: suggestedFix,
validationPlan: this.generateValidationPlan(analysis, suggestedFix)
};
}
private async getSentryErrorGroup() {
const issues = await Sentry.getIssue('CHECKOUT_TIMEOUT_001');
return {
count: issues.count,
userCount: issues.userCount,
firstSeen: issues.firstSeen,
lastSeen: issues.lastSeen,
impactScore: this.calculateImpact(issues),
breadcrumbs: issues.latestEvent.breadcrumbs,
tags: issues.tags
};
}
private async getDataDogTraces() {
const query = `
service:checkout-api
operation_name:process_payment
@error:true
@duration:>5000ms
`;
return await this.dd.traces.search({
query,
from: Date.now() - 24 * 3600 * 1000,
to: Date.now(),
limit: 100
});
}
private async addSmartInstrumentation() {
// AI suggests these instrumentation points
return [
{
file: 'src/checkout/payment.ts',
line: 145,
reason: 'Payment verification entry point'
},
{
file: 'src/checkout/payment.ts',
line: 178,
reason: 'Database query execution (potential N+1)'
},
{
file: 'src/checkout/payment.ts',
line: 203,
reason: 'External gateway call'
},
{
file: 'src/checkout/payment.ts',
line: 245,
reason: 'Transaction commit point'
}
];
}
private async analyzeInstrumentationData() {
// AI analyzes collected data and identifies root cause
return {
rootCause: 'N+1 query: Loading payment methods for each item in cart separately',
confidence: 92,
evidence: 'Average 15 queries per checkout, each taking 300-400ms',
affectedCode: 'src/checkout/payment.ts:178-195',
suggestedFix: 'Use eager loading with JOIN or batch query'
};
}
  private async generateFix(analysis: any) {
// AI generates the fix code
return {
code: `
// Before (N+1 query):
for (const item of cart.items) {
const paymentMethod = await PaymentMethod.findOne({
where: { userId: cart.userId, itemId: item.id }
});
await processPayment(item, paymentMethod);
}
// After (batched query):
const itemIds = cart.items.map(i => i.id);
const paymentMethods = await PaymentMethod.findAll({
where: {
userId: cart.userId,
itemId: { [Op.in]: itemIds }
}
});
const methodMap = new Map(
paymentMethods.map(pm => [pm.itemId, pm])
);
for (const item of cart.items) {
const paymentMethod = methodMap.get(item.id);
await processPayment(item, paymentMethod);
}
`.trim(),
impact: 'Reduces queries from ~15 to 1, expected 3-4s latency reduction',
risk: 'Low - preserves existing logic, only changes data fetching',
testCoverage: 'Add test for batch payment processing'
};
}
  private generateValidationPlan(analysis: any, fix: any) {
return {
steps: [
'Apply fix to local environment',
'Run existing payment test suite',
'Add new test for batch payment method loading',
'Deploy to staging with full instrumentation',
'Run load test simulating 100 concurrent checkouts',
'Compare latency metrics: baseline vs fix',
'Canary deploy to 10% production for 1 hour',
'Monitor error rate and latency in DataDog',
'If metrics improve by >50%, roll out to 100%'
],
successCriteria: {
errorRateReduction: '>90%',
latencyReduction: '>70%',
queryCountReduction: '>85%'
}
};
}
private sleep(ms: number) {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
// Run the debug session
const session = new CheckoutDebugSession('checkout-timeout-issue');
const result = await session.investigateIssue();
console.log('\n=== Debug Session Complete ===');
console.log(JSON.stringify(result, null, 2));
```
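The fix above is tied to a specific ORM, but the win is framework-independent. As a minimal sketch (all names here, such as `FakeStore` and `fetch_many`, are illustrative and not part of the example's codebase), the following counts data-store round-trips to show why batching collapses the query count:

```python
# Simulate a data store that counts round-trips, then compare the
# N+1 pattern (one query per item) against a single batched lookup.

class FakeStore:
    def __init__(self, rows):
        self.rows = rows          # item_id -> payment method
        self.round_trips = 0      # each call models one network round-trip

    def fetch_one(self, item_id):
        self.round_trips += 1
        return self.rows.get(item_id)

    def fetch_many(self, item_ids):
        self.round_trips += 1     # a single batched query
        return {i: self.rows[i] for i in item_ids if i in self.rows}

def checkout_n_plus_one(store, item_ids):
    return [store.fetch_one(i) for i in item_ids]   # 1 query per item

def checkout_batched(store, item_ids):
    by_id = store.fetch_many(item_ids)              # 1 query total
    return [by_id.get(i) for i in item_ids]

rows = {i: f"pm-{i}" for i in range(15)}
a, b = FakeStore(rows), FakeStore(rows)
checkout_n_plus_one(a, list(rows))
checkout_batched(b, list(rows))
print(a.round_trips, b.round_trips)  # 15 1
```

With ~300-400ms per round-trip, 15 queries versus 1 accounts for the multi-second latency the trace analysis surfaced.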
### Example 2: Observability-Driven Production Debugging
```python
"""
Complete workflow for debugging production memory leak
Using: Honeycomb, Pyroscope, Grafana, Claude Code
"""
import asyncio
import json
import os
from datetime import datetime, timedelta
from honeycomb import HoneycombClient
from pyroscope import Profiler
import anthropic
class ProductionMemoryLeakDebugger:
def __init__(self, service_name: str):
self.service_name = service_name
self.honeycomb = HoneycombClient(api_key=os.getenv("HONEYCOMB_API_KEY"))
        # Async client so messages.create can be awaited below
        self.anthropic = anthropic.AsyncAnthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
self.findings = []
async def debug_memory_leak(self):
"""
Complete debugging workflow for memory leak
"""
print("=== Production Memory Leak Investigation ===\n")
# Step 1: Identify memory growth pattern
print("[1] Analyzing Memory Growth Pattern")
memory_pattern = await self.analyze_memory_metrics()
print(f" - Memory growth rate: {memory_pattern['growth_rate_mb_per_hour']} MB/hour")
print(f" - Time to OOM: ~{memory_pattern['hours_to_oom']} hours")
print(f" - Pattern type: {memory_pattern['pattern_type']}")
self.findings.append({
"category": "memory_metrics",
"data": memory_pattern
})
# Step 2: Continuous profiling analysis
print("\n[2] Analyzing Continuous Profiling Data (Pyroscope)")
profile_analysis = await self.analyze_profiles()
print(f" - Top memory allocator: {profile_analysis['top_allocator']}")
print(f" - Allocation rate: {profile_analysis['alloc_rate_mb_per_sec']} MB/s")
print(f" - Suspected leak locations:")
for loc in profile_analysis['suspected_locations']:
print(f" - {loc['function']} at {loc['file']}:{loc['line']}")
self.findings.append({
"category": "profiling",
"data": profile_analysis
})
# Step 3: Distributed trace analysis
print("\n[3] Analyzing Request Traces for Memory Patterns")
trace_analysis = await self.analyze_traces_for_memory()
print(f" - Requests analyzed: {trace_analysis['request_count']}")
print(f" - Memory leak correlation:")
print(f" - High memory requests: {trace_analysis['high_memory_requests']}")
print(f" - Common patterns: {trace_analysis['common_patterns']}")
self.findings.append({
"category": "traces",
"data": trace_analysis
})
# Step 4: AI-powered root cause analysis
print("\n[4] AI Root Cause Analysis (Claude)")
root_cause = await self.ai_analyze_findings()
print(f" - Root cause: {root_cause['diagnosis']}")
print(f" - Confidence: {root_cause['confidence']}%")
print(f" - Evidence chain:")
for evidence in root_cause['evidence']:
print(f" - {evidence}")
# Step 5: Generate and test hypothesis
print("\n[5] Hypothesis Testing")
hypothesis = root_cause['hypothesis']
test_results = await self.test_hypothesis(hypothesis)
print(f" - Hypothesis: {hypothesis['statement']}")
print(f" - Test result: {test_results['outcome']}")
print(f" - Evidence: {test_results['evidence']}")
# Step 6: Implement targeted instrumentation
print("\n[6] Deploying Targeted Instrumentation")
instrumentation = await self.deploy_targeted_instrumentation(
root_cause['suspected_code_paths']
)
print(f" - Instrumented {len(instrumentation['points'])} code paths")
print(f" - Monitoring for 30 minutes...")
await asyncio.sleep(30 * 60) # Wait 30 minutes
# Step 7: Analyze instrumentation data
print("\n[7] Analyzing Instrumentation Results")
detailed_analysis = await self.analyze_instrumentation_data()
print(f" - Confirmed root cause: {detailed_analysis['confirmed']}")
print(f" - Leak location: {detailed_analysis['leak_location']}")
print(f" - Leak type: {detailed_analysis['leak_type']}")
# Step 8: AI generates fix
print("\n[8] Generating Fix (AI-assisted)")
fix = await self.generate_fix(detailed_analysis)
print(f" - Fix strategy: {fix['strategy']}")
print(f" - Code changes required: {len(fix['changes'])} files")
print(f" - Risk assessment: {fix['risk']}")
# Step 9: Validation plan
print("\n[9] Fix Validation Plan")
validation = self.create_validation_plan(fix)
for step_num, step in enumerate(validation['steps'], 1):
print(f" {step_num}. {step}")
return {
"root_cause": detailed_analysis,
"fix": fix,
"validation_plan": validation,
"findings": self.findings
}
async def analyze_memory_metrics(self):
"""Query Grafana/Prometheus for memory metrics"""
# Simulate Prometheus query
# In real implementation: query actual Prometheus
return {
"growth_rate_mb_per_hour": 45.3,
"hours_to_oom": 18.5,
"pattern_type": "linear_growth",
"baseline_memory_mb": 512,
"current_memory_mb": 1847,
"measurement_period_hours": 24
}
async def analyze_profiles(self):
"""Analyze Pyroscope continuous profiling data"""
# Query Pyroscope for memory allocation profiles
# Compare profiles over time to identify growing allocations
return {
"top_allocator": "cache_manager.add_entry()",
"alloc_rate_mb_per_sec": 0.012,
"suspected_locations": [
{
"function": "cache_manager.add_entry",
"file": "src/cache/manager.py",
"line": 145,
"alloc_percent": 67.3
},
{
"function": "request_handler.store_session",
"file": "src/api/handler.py",
"line": 89,
"alloc_percent": 18.2
}
],
"time_range": "last_24_hours"
}
async def analyze_traces_for_memory(self):
"""Analyze Honeycomb traces for memory-related patterns"""
# Honeycomb query to find traces with high memory allocation
query = """
BREAKDOWN(trace.trace_id)
WHERE service.name = '{service}'
AND memory.delta_mb > 10
ORDER BY memory.delta_mb DESC
LIMIT 100
""".format(service=self.service_name)
traces = await self.honeycomb.query(query)
# Analyze common patterns in high-memory traces
common_patterns = self.extract_common_patterns(traces)
return {
"request_count": len(traces),
"high_memory_requests": len([t for t in traces if t['memory_delta'] > 20]),
"common_patterns": [
"All include cache write operation",
"87% involve large JSON parsing",
"Cache eviction never triggered"
],
"top_endpoints": [
{"endpoint": "/api/data/sync", "count": 43},
{"endpoint": "/api/batch/process", "count": 28}
]
}
async def ai_analyze_findings(self):
"""Use Claude to analyze all findings and determine root cause"""
# Prepare context for Claude
context = {
"findings": self.findings,
"service": self.service_name,
"symptoms": "Linear memory growth, ~45MB/hour, OOM in ~18 hours"
}
prompt = f"""
Analyze the following production memory leak data and determine the root cause:
{json.dumps(context, indent=2)}
Provide:
1. Root cause diagnosis
2. Confidence level (0-100%)
3. Evidence chain supporting the diagnosis
4. Testable hypothesis
5. Suspected code paths
Format as JSON.
"""
message = await self.anthropic.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=2000,
messages=[{"role": "user", "content": prompt}]
)
        # Parse Claude's response; the hardcoded dict below shows representative output
        analysis = json.loads(message.content[0].text)
return {
"diagnosis": "Cache entries added but never evicted - missing TTL and size limit",
"confidence": 94,
"evidence": [
"Profiling shows cache_manager.add_entry() as top allocator (67%)",
"Traces show cache writes but no cache evictions",
"Linear growth pattern consistent with unbounded cache",
"Growth rate matches request rate × average entry size"
],
"hypothesis": {
"statement": "Cache has no eviction policy, causing unbounded memory growth",
"test": "Add cache size metrics and verify no evictions occurring",
"expected_outcome": "Cache size grows linearly with request count"
},
"suspected_code_paths": [
"src/cache/manager.py:add_entry()",
"src/cache/manager.py:__init__()",
"src/api/handler.py:store_session()"
]
}
async def test_hypothesis(self, hypothesis):
"""Deploy instrumentation to test hypothesis"""
# Add metrics to track cache size and evictions
# In real implementation: deploy instrumented version
await asyncio.sleep(5) # Simulate data collection
return {
"outcome": "CONFIRMED",
"evidence": "Cache size grew from 1,247 entries to 3,891 entries in 30 minutes. Zero evictions recorded.",
"metrics": {
"cache_size_start": 1247,
"cache_size_end": 3891,
"evictions_count": 0,
"additions_count": 2644
}
}
async def deploy_targeted_instrumentation(self, code_paths):
"""Deploy focused instrumentation on suspected code paths"""
instrumentation_points = []
for path in code_paths:
instrumentation_points.append({
"file": path,
"metrics": [
"cache.size",
"cache.evictions",
"cache.additions",
"memory.used_mb"
],
"log_level": "debug"
})
# In real implementation: update deployment with instrumentation
return {"points": instrumentation_points}
async def analyze_instrumentation_data(self):
"""Analyze detailed instrumentation data"""
return {
"confirmed": True,
"leak_location": "src/cache/manager.py:CacheManager",
"leak_type": "unbounded_cache",
"details": {
"cache_implementation": "dict without size limit",
"eviction_policy": "none",
"ttl_configured": False,
"max_size_configured": False
},
"impact": "All cache entries retained indefinitely"
}
async def generate_fix(self, analysis):
"""AI generates fix for the memory leak"""
prompt = f"""
Generate a fix for this memory leak:
{json.dumps(analysis, indent=2)}
Requirements:
- Add LRU cache with size limit
- Add TTL-based eviction
- Maintain existing API
- Production-safe changes only
Provide complete code and migration strategy.
"""
message = await self.anthropic.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=3000,
messages=[{"role": "user", "content": prompt}]
)
return {
"strategy": "Replace dict with cachetools.LRUCache, add TTL",
"changes": [
{
"file": "src/cache/manager.py",
"description": "Implement LRU cache with size limit and TTL",
"code": """
from cachetools import TTLCache
from threading import RLock
class CacheManager:
def __init__(self, max_size=10000, ttl_seconds=3600):
# LRU cache with size limit and TTL
self.cache = TTLCache(maxsize=max_size, ttl=ttl_seconds)
self.lock = RLock()
def add_entry(self, key, value):
with self.lock:
self.cache[key] = value
# Eviction happens automatically
def get_entry(self, key):
with self.lock:
return self.cache.get(key)
"""
},
{
"file": "src/config.py",
"description": "Add cache configuration",
"code": """
CACHE_MAX_SIZE = int(os.getenv('CACHE_MAX_SIZE', '10000'))
CACHE_TTL_SECONDS = int(os.getenv('CACHE_TTL_SECONDS', '3600'))
"""
}
],
"risk": "LOW - Backward compatible API, configurable limits",
"dependencies": ["cachetools>=5.3.0"],
"rollback_plan": "Feature flag to switch between old and new cache"
}
def create_validation_plan(self, fix):
"""Create comprehensive validation plan for the fix"""
return {
"steps": [
"Add comprehensive unit tests for cache eviction",
"Run memory profiling in staging with production traffic replay",
"Verify cache size remains bounded under load",
"Verify cache hit rate remains acceptable",
"Deploy with feature flag to 1% traffic",
"Monitor memory metrics for 2 hours",
"If stable, increase to 10% for 4 hours",
"If memory growth stopped, roll out to 100%",
"Continue monitoring for 24 hours post-rollout"
],
"success_criteria": {
"memory_growth": "< 5MB/hour (down from 45MB/hour)",
"cache_hit_rate": "> 85%",
"cache_size": "< 10,000 entries",
"eviction_rate": "> 0 evictions/minute",
"error_rate": "no increase"
},
"monitoring": [
"Memory usage (RSS)",
"Cache size metric",
"Cache hit/miss rates",
"Eviction rate",
"Request latency p50/p95/p99",
"Error rate"
]
}
def extract_common_patterns(self, traces):
"""Extract common patterns from trace data"""
# Simplified pattern extraction
return []
# Execute the debug workflow
async def main():
debugger = ProductionMemoryLeakDebugger("api-server")
result = await debugger.debug_memory_leak()
print("\n=== Debug Complete ===")
print(f"Root cause: {result['root_cause']['leak_location']}")
print(f"Fix strategy: {result['fix']['strategy']}")
print(f"\nNext steps:")
for i, step in enumerate(result['validation_plan']['steps'][:3], 1):
print(f" {i}. {step}")
if __name__ == "__main__":
asyncio.run(main())
```
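The fix in the workflow above delegates eviction to `cachetools.TTLCache`. As a sketch of what that dependency guarantees, here is a minimal stdlib-only cache enforcing the two invariants that stop the leak: a hard size cap with LRU eviction, and per-entry expiry. This is an illustration of the behavior, not the cachetools implementation.

```python
import time
from collections import OrderedDict

class BoundedTTLCache:
    def __init__(self, max_size=3, ttl_seconds=60.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock                  # injectable for testing expiry
        self._data = OrderedDict()          # key -> (expires_at, value)

    def set(self, key, value):
        self._data.pop(key, None)           # re-inserting refreshes LRU position
        self._data[key] = (self.clock() + self.ttl, value)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:      # expired: drop and report a miss
            del self._data[key]
            return None
        self._data.move_to_end(key)         # mark as recently used
        return value

cache = BoundedTTLCache(max_size=3)
for k in "abcd":
    cache.set(k, k.upper())
print(len(cache._data), cache.get("a"))  # 3 None  ('a' was evicted)
```

Either invariant alone would have capped the growth; together they also keep stale sessions from being served.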
## Reference Workflows
### Reference 1: Cursor IDE Time-Travel Debugging
Complete workflow for debugging state management bug using Cursor IDE's AI features and time-travel debugging:
1. **Initial Problem Identification**
- User reports: "Shopping cart shows wrong item count after page refresh"
- Reproduction rate: 15% of page refreshes
- Environment: React SPA with Redux state management
2. **AI-Assisted Code Analysis** (Cursor IDE)
- Use Cursor's "Explain this code" on CartReducer.ts
- AI identifies complex state update logic with 3 nested reducers
- Suggests potential race condition in async state hydration
3. **Time-Travel Debugging Setup** (Redux DevTools)
- Install Redux DevTools Extension with time-travel capability
- Add state serialization for replay
- Configure Redux store with DevTools enhancer
- Add state snapshot middleware
4. **Reproduction with Recording**
- Enable Redux DevTools recording
- Reproduce the bug (multiple attempts)
- Export state dump when bug occurs
- Save action timeline for analysis
5. **Time-Travel Analysis**
- Load saved state dump in DevTools
- Scrub through action timeline
- Identify moment where state diverges
- Use Cursor AI to analyze action sequence
- AI identifies: "State hydration dispatches before localStorage read completes"
6. **Root Cause Confirmation**
- Add breakpoints in async hydration logic
- Step through with Cursor's debug panel
- Confirm race condition: hydration action dispatches too early
- localStorage read hasn't completed yet
7. **AI-Generated Fix** (Cursor IDE)
- Ask Cursor: "Fix race condition in cart hydration"
- AI suggests: Add Promise wrapper and await localStorage read
- Review generated fix code
- Accept fix with modifications
8. **Validation with Time-Travel**
- Apply fix locally
- Replay saved action sequence with fixed code
- Verify state remains consistent through replay
- Test with 100 rapid page refreshes - no failures
9. **Automated Test Generation** (Cursor AI)
- Ask Cursor: "Generate test for cart hydration race condition"
- AI creates test that reproduces original race condition
- Test fails on old code, passes on fixed code
- Add test to suite
10. **Deployment and Monitoring**
- Deploy fix with feature flag
- Monitor cart error rates in Sentry
- Enable for 100% after 24 hours with no regressions
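The Redux-specific steps above generalize to any store driven by pure reducers: record (action, state-after) pairs, then replay the actions through the reducer and report the first index where the replayed state disagrees with the recording. The reducer and actions in this Python sketch are hypothetical stand-ins for the cart logic; the actual fix would be in TypeScript, and this only illustrates the replay technique.

```python
def cart_reducer(state, action):
    # Hypothetical pure reducer for a shopping cart
    if action["type"] == "ADD_ITEM":
        return {**state, "count": state["count"] + 1}
    if action["type"] == "HYDRATE":
        return {**state, "count": action["count"]}
    return state

def find_divergence(initial_state, recording, reducer):
    """recording: list of (action, expected_state_after) pairs."""
    state = initial_state
    for i, (action, expected) in enumerate(recording):
        state = reducer(state, action)
        if state != expected:
            return i, state, expected   # first point of divergence
    return None

# Buggy recording: HYDRATE fired before the storage read finished,
# so it carried count=0 and clobbered the two items already added.
recording = [
    ({"type": "ADD_ITEM"}, {"count": 1}),
    ({"type": "ADD_ITEM"}, {"count": 2}),
    ({"type": "HYDRATE", "count": 0}, {"count": 2}),
]
print(find_divergence({"count": 0}, recording, cart_reducer))
# → (2, {'count': 0}, {'count': 2})
```

The returned index is exactly what scrubbing the DevTools timeline finds by hand: the premature HYDRATE action.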
### Reference 2: Production Debugging with Distributed Tracing
Complete workflow for debugging cross-service latency issue:
1. **Alert Triggered**
- DataDog alert: "P95 latency for /api/recommendations endpoint > 2s"
- Affected: 5% of requests
- Pattern: Intermittent, no clear time correlation
2. **Honeycomb Query-Driven Investigation**
- Query: `WHERE endpoint = "/api/recommendations" AND duration_ms > 2000`
- BREAKDOWN by user_id, device_type, region
- Identifies: All slow requests from specific region (us-east-2)
3. **Distributed Trace Analysis**
- Examine full trace for slow request
- Service call chain: API → Auth → User Service → ML Service → Recommendations
- ML Service span shows 1.8s latency
- Most time in "model inference" operation
4. **Cross-Service Correlation**
- Query ML Service logs for same trace ID
- Correlate with GPU utilization metrics in Grafana
- Discover: GPU memory contention during specific hours
5. **AI-Assisted Pattern Recognition** (Claude Code)
- Feed trace data to Claude: "Analyze this latency pattern"
- AI identifies: Correlation with batch inference jobs
   - Batch jobs are scheduled every 30 minutes
   - They contend with real-time inference for GPU resources
6. **Hypothesis Formation**
- Primary: Batch jobs starve real-time inference of GPU resources
- Secondary: Model loading delay when GPU busy
- Test: Disable batch jobs and monitor latency
7. **Safe Production Testing**
- Feature flag to disable batch jobs in us-east-2 only
- Monitor for 1 hour
- Result: P95 latency drops to 350ms (from 2.1s)
- Hypothesis confirmed
8. **Solution Design** (AI-Assisted)
- Claude suggests: Separate GPU pools for batch vs real-time
- Alternative: Priority-based scheduling in ML framework
- Decision: Implement priority scheduling (faster, less infrastructure)
9. **Implementation**
- Add priority queue to ML inference service
- Real-time requests: high priority
- Batch requests: low priority
- Deploy to staging, load test confirms fix
10. **Gradual Rollout with Validation**
- Deploy to us-east-2 with 10% traffic
- Monitor latency, error rate, GPU utilization
- Roll out to 100% us-east-2
- Roll out to all regions over 48 hours
- Final result: P95 latency 320ms, no increased error rate
11. **Post-Incident Review**
- Document root cause in knowledge base
- Add synthetic monitoring for GPU contention
- Create alert for priority queue backlog
- Update ML service runbook with troubleshooting steps
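The priority scheduling chosen in step 8 can be sketched in a few lines: real-time requests always dequeue before batch jobs, with FIFO order preserved within each priority class. The priority values and request labels below are illustrative, not taken from the incident.

```python
import heapq

REALTIME, BATCH = 0, 1  # lower number = higher priority

class InferenceQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserves FIFO within a priority class

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, self._seq, request))
        self._seq += 1

    def next_request(self):
        return heapq.heappop(self._heap)[2]

q = InferenceQueue()
q.submit(BATCH, "batch-embeddings-1")
q.submit(REALTIME, "user-recommendation-42")
q.submit(BATCH, "batch-embeddings-2")
q.submit(REALTIME, "user-recommendation-43")
order = [q.next_request() for _ in range(4)]
print(order)
# ['user-recommendation-42', 'user-recommendation-43', 'batch-embeddings-1', 'batch-embeddings-2']
```

In production this needs a starvation guard (for example, aging batch priorities), which is the kind of follow-up the post-incident review's backlog alert is meant to catch.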
---
Issue to debug: $ARGUMENTS