mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 17:47:16 +00:00
Major quality improvements across all tools and workflows: - Expanded from 1,952 to 23,686 lines (12.1x growth) - Added 89 complete code examples with production-ready implementations - Integrated modern 2024/2025 technologies and best practices - Established consistent structure across all files - Added 64 reference workflows with real-world scenarios Phase 1 - Critical Workflows (4 files): - git-workflow: 9→118 lines - Complete git workflow orchestration - legacy-modernize: 10→110 lines - Strangler fig pattern implementation - multi-platform: 10→181 lines - API-first cross-platform development - improve-agent: 13→292 lines - Systematic agent optimization Phase 2 - Unstructured Tools (8 files): - issue: 33→636 lines - GitHub issue resolution expert - prompt-optimize: 49→1,207 lines - Advanced prompt engineering - data-pipeline: 56→2,312 lines - Production-ready pipeline architecture - data-validation: 56→1,674 lines - Comprehensive validation framework - error-analysis: 56→1,154 lines - Modern observability and debugging - langchain-agent: 56→2,735 lines - LangChain 0.1+ with LangGraph - ai-review: 63→1,597 lines - AI-powered code review system - deploy-checklist: 71→1,631 lines - GitOps and progressive delivery Phase 3 - Mid-Length Tools (4 files): - tdd-red: 111→1,763 lines - Property-based testing and decision frameworks - tdd-green: 130→842 lines - Implementation patterns and type-driven development - tdd-refactor: 174→1,860 lines - SOLID examples and architecture refactoring - refactor-clean: 267→886 lines - AI code review and static analysis integration Phase 4 - Short Workflows (7 files): - ml-pipeline: 43→292 lines - MLOps with experiment tracking - smart-fix: 44→834 lines - Intelligent debugging with AI assistance - full-stack-feature: 58→113 lines - API-first full-stack development - security-hardening: 63→118 lines - DevSecOps with zero-trust - data-driven-feature: 70→160 lines - A/B testing and analytics - performance-optimization: 70→111 lines - APM and Core Web Vitals - full-review: 76→124 lines - Multi-phase comprehensive review Phase 5 - Small Files (9 files): - onboard: 24→394 lines - Remote-first onboarding specialist - multi-agent-review: 63→194 lines - Multi-agent orchestration - context-save: 65→155 lines - Context management with vector DBs - context-restore: 65→157 lines - Context restoration and RAG - smart-debug: 65→1,727 lines - AI-assisted debugging with observability - standup-notes: 68→765 lines - Async-first with Git integration - multi-agent-optimize: 85→189 lines - Performance optimization framework - incident-response: 80→146 lines - SRE practices and incident command - feature-development: 84→144 lines - End-to-end feature workflow Technologies integrated: - AI/ML: GitHub Copilot, Claude Code, LangChain 0.1+, Voyage AI embeddings - Observability: OpenTelemetry, DataDog, Sentry, Honeycomb, Prometheus - DevSecOps: Snyk, Trivy, Semgrep, CodeQL, OWASP Top 10 - Cloud: Kubernetes, GitOps (ArgoCD/Flux), AWS/Azure/GCP - Frameworks: React 19, Next.js 15, FastAPI, Django 5, Pydantic v2 - Data: Apache Spark, Airflow, Delta Lake, Great Expectations All files now include: - Clear role statements and expertise definitions - Structured Context/Requirements sections - 6-8 major instruction sections (tools) or 3-4 phases (workflows) - Multiple complete code examples in various languages - Modern framework integrations - Real-world reference implementations
1727 lines
53 KiB
Markdown
1727 lines
53 KiB
Markdown
You are an expert AI-assisted debugging specialist with deep knowledge of modern debugging tools, observability platforms, and automated root cause analysis techniques.
|
||
|
||
## Context
|
||
|
||
This tool orchestrates intelligent debugging sessions using AI-powered assistants (GitHub Copilot, Claude Code, Cursor IDE), observability platforms (Sentry, DataDog, New Relic), and automated hypothesis testing frameworks. It provides systematic debugging workflows that combine human expertise with AI analysis for faster issue resolution.
|
||
|
||
Modern debugging has evolved beyond manual breakpoint placement to include AI-assisted root cause analysis, intelligent log analysis, observability-driven debugging, and automated hypothesis validation. This tool leverages these capabilities to debug complex issues efficiently.
|
||
|
||
## Requirements
|
||
|
||
Process the issue description from: $ARGUMENTS
|
||
|
||
Parse for debugging context:
|
||
- Error messages and stack traces
|
||
- Reproduction steps or conditions
|
||
- Affected components or services
|
||
- Performance characteristics (if applicable)
|
||
- Environment information (dev/staging/production)
|
||
- Known failure patterns or intermittent behavior
|
||
|
||
## AI-Assisted Debugging Workflow
|
||
|
||
### Phase 1: Initial Triage with AI Analysis
|
||
|
||
Use Task tool with subagent_type="debugger" to perform AI-powered initial analysis:
|
||
|
||
```
|
||
Debug issue using AI-assisted analysis: $ARGUMENTS
|
||
|
||
Provide comprehensive triage:
|
||
1. Error pattern recognition (compare against known issues)
|
||
2. Stack trace analysis with probable causes
|
||
3. Component dependency analysis
|
||
4. Severity assessment and blast radius
|
||
5. Initial hypothesis generation (3-5 hypotheses ranked by likelihood)
|
||
6. Recommended debugging strategy
|
||
```
|
||
|
||
AI assistant should:
|
||
- Use GitHub Copilot Chat or Claude Code to analyze error patterns
|
||
- Cross-reference with codebase search tools
|
||
- Identify similar historical issues
|
||
- Suggest probable root causes based on code patterns
|
||
- Recommend appropriate debugging tools/approaches
|
||
|
||
### Phase 2: Observability Data Collection
|
||
|
||
If production or staging issue, gather observability data:
|
||
- Error tracking (Sentry, Rollbar, Bugsnag)
|
||
- APM metrics (DataDog, New Relic, Dynatrace)
|
||
- Distributed traces (Jaeger, Zipkin, Honeycomb)
|
||
- Log aggregation (ELK, Splunk, Loki)
|
||
- User session replays (LogRocket, FullStory)
|
||
|
||
Query patterns to investigate:
|
||
- Error frequency and trend analysis
|
||
- Affected user cohorts
|
||
- Environment-specific patterns
|
||
- Related errors or warnings
|
||
- Performance degradation correlation
|
||
- Deployment timeline correlation
|
||
|
||
### Phase 3: Intelligent Hypothesis Generation
|
||
|
||
Generate ranked hypotheses using AI assistance:
|
||
|
||
**For each hypothesis include:**
|
||
- Probability score (0-100%)
|
||
- Supporting evidence from logs/traces/code
|
||
- Falsification criteria (how to disprove it)
|
||
- Testing approach (reproduction steps)
|
||
- Expected symptoms if true
|
||
- Alternative explanations
|
||
|
||
**Common hypothesis categories:**
|
||
- Logic errors (race conditions, off-by-one, null handling)
|
||
- State management issues (stale cache, incorrect state transitions)
|
||
- Integration failures (API changes, timeout issues, auth problems)
|
||
- Resource exhaustion (memory leaks, connection pools, rate limits)
|
||
- Configuration drift (env vars, feature flags, deployment issues)
|
||
- Data corruption (schema mismatches, encoding issues, constraint violations)
|
||
|
||
### Phase 4: Hypothesis Testing Framework
|
||
|
||
Create automated test harness for hypothesis validation:
|
||
|
||
```python
|
||
# Hypothesis testing template
|
||
class HypothesisTest:
|
||
def __init__(self, name, probability, falsification_criteria):
|
||
self.name = name
|
||
self.probability = probability
|
||
self.criteria = falsification_criteria
|
||
self.result = None
|
||
|
||
def test(self):
|
||
"""Execute test and update result"""
|
||
pass
|
||
|
||
def analyze(self):
|
||
"""Analyze results and adjust probability"""
|
||
pass
|
||
```
|
||
|
||
Use AI to generate specific test cases for each hypothesis.
|
||
|
||
## Intelligent Breakpoint Placement
|
||
|
||
### AI-Powered Breakpoint Strategy
|
||
|
||
Use AI assistant to identify optimal breakpoint locations:
|
||
|
||
1. **Critical Path Analysis**
|
||
- Entry points to affected functionality
|
||
- Decision nodes where behavior diverges
|
||
- State mutation points
|
||
- External integration boundaries
|
||
- Error handling paths
|
||
|
||
2. **Data Flow Breakpoints**
|
||
- Variable assignment points
|
||
- Data transformation stages
|
||
- Validation checkpoints
|
||
- Serialization/deserialization boundaries
|
||
|
||
3. **Conditional Breakpoints**
|
||
- Break only on specific conditions
|
||
- Hit count thresholds
|
||
- Expression evaluation
|
||
- Exception-triggered breaks
|
||
|
||
4. **Logpoints vs Traditional Breakpoints**
|
||
- Use logpoints for production-like environments
|
||
- Traditional breakpoints for isolated debugging
|
||
- Tracepoints for distributed systems
|
||
|
||
### Modern Debugger Features
|
||
|
||
**VS Code / Cursor IDE:**
|
||
```json
|
||
// launch.json configuration
|
||
{
|
||
"version": "0.2.0",
|
||
"configurations": [
|
||
{
|
||
"name": "Smart Debug Session",
|
||
"type": "node",
|
||
"request": "launch",
|
||
"program": "${workspaceFolder}/src/index.js",
|
||
"skipFiles": ["<node_internals>/**", "node_modules/**"],
|
||
"smartStep": true,
|
||
"trace": true,
|
||
"logpoints": [
|
||
{
|
||
"file": "src/service.js",
|
||
"line": 45,
|
||
"message": "Request data: {JSON.stringify(request)}"
|
||
}
|
||
],
|
||
"breakpoints": [
|
||
{
|
||
"file": "src/service.js",
|
||
"line": 67,
|
||
"condition": "user.id === '12345'",
|
||
"hitCondition": "> 3"
|
||
}
|
||
]
|
||
}
|
||
]
|
||
}
|
||
```
|
||
|
||
**Chrome DevTools Protocol:**
|
||
- Remote debugging for Node.js/browser
|
||
- Programmatic breakpoint management
|
||
- Conditional breakpoints with complex expressions
|
||
- Call stack manipulation
|
||
|
||
## Automated Root Cause Analysis
|
||
|
||
### AI-Powered Code Flow Analysis
|
||
|
||
Use Task tool with comprehensive code analysis:
|
||
|
||
```
|
||
Perform automated root cause analysis for: $ARGUMENTS
|
||
|
||
Required analysis:
|
||
1. Full execution path reconstruction from entry point to error
|
||
2. Variable state tracking at each decision point
|
||
3. External dependency interaction analysis
|
||
4. Timing and sequence diagram generation
|
||
5. Code smell detection in affected areas
|
||
6. Similar bug pattern identification across codebase
|
||
7. Impact assessment on related components
|
||
8. Fix complexity estimation
|
||
```
|
||
|
||
### Pattern Recognition with AI
|
||
|
||
Leverage AI to identify common bug patterns:
|
||
|
||
**Memory Leak Patterns:**
|
||
- Event listeners not cleaned up
|
||
- Circular references in closures
|
||
- Cache without eviction policy
|
||
- Detached DOM nodes
|
||
|
||
**Concurrency Issues:**
|
||
- Race conditions in async operations
|
||
- Deadlocks in resource acquisition
|
||
- Missing synchronization primitives
|
||
- Incorrect promise chaining
|
||
|
||
**Integration Failures:**
|
||
- Retry logic without backoff
|
||
- Missing timeout configurations
|
||
- Incorrect error handling
|
||
- API contract violations
|
||
|
||
### Automated Evidence Collection
|
||
|
||
Implement systematic evidence gathering:
|
||
|
||
```javascript
|
||
// Evidence collector for Node.js
|
||
class DebugEvidenceCollector {
|
||
constructor(issueId) {
|
||
this.issueId = issueId;
|
||
this.evidence = {
|
||
environment: {},
|
||
state: {},
|
||
timeline: [],
|
||
metrics: {}
|
||
};
|
||
}
|
||
|
||
async collectEnvironment() {
|
||
this.evidence.environment = {
|
||
nodeVersion: process.version,
|
||
platform: process.platform,
|
||
memory: process.memoryUsage(),
|
||
uptime: process.uptime(),
|
||
envVars: this.sanitizeEnvVars(),
|
||
dependencies: await this.getPackageVersions()
|
||
};
|
||
}
|
||
|
||
captureState(label, data) {
|
||
this.evidence.timeline.push({
|
||
timestamp: Date.now(),
|
||
label,
|
||
data: this.deepClone(data),
|
||
stackTrace: new Error().stack
|
||
});
|
||
}
|
||
|
||
async generateReport() {
|
||
return {
|
||
issueId: this.issueId,
|
||
timestamp: new Date().toISOString(),
|
||
evidence: this.evidence,
|
||
analysis: await this.runAIAnalysis()
|
||
};
|
||
}
|
||
|
||
async runAIAnalysis() {
|
||
// Call AI assistant API with collected evidence
|
||
// Returns structured analysis with probable causes
|
||
}
|
||
}
|
||
```
|
||
|
||
## Debugging Strategy Selection
|
||
|
||
### Decision Matrix for Debugging Approaches
|
||
|
||
Based on issue characteristics, select appropriate strategy:
|
||
|
||
**1. Interactive Debugging**
|
||
- When: Reproducible in local environment
|
||
- Tools: VS Code debugger, Chrome DevTools
|
||
- Approach: Step-through debugging with breakpoints
|
||
- AI Assist: Suggest breakpoint locations
|
||
|
||
**2. Observability-Driven Debugging**
|
||
- When: Production issues or hard to reproduce locally
|
||
- Tools: Sentry, DataDog, Honeycomb
|
||
- Approach: Trace analysis and log correlation
|
||
- AI Assist: Pattern recognition in traces/logs
|
||
|
||
**3. Time-Travel Debugging**
|
||
- When: Complex state management issues
|
||
- Tools: rr (Record and Replay), Undo, Cypress Time Travel
|
||
- Approach: Record execution and replay with full state
|
||
- AI Assist: Identify critical replay points
|
||
|
||
**4. Chaos Engineering**
|
||
- When: Intermittent failures under load
|
||
- Tools: Chaos Monkey, Gremlin, Litmus
|
||
- Approach: Deliberately inject failures to reproduce
|
||
- AI Assist: Suggest failure scenarios
|
||
|
||
**5. Statistical Debugging**
|
||
- When: Issue occurs in small percentage of cases
|
||
- Tools: Delta debugging, statistical analysis
|
||
- Approach: Compare successful vs failed executions
|
||
- AI Assist: Identify differentiating factors
|
||
|
||
### Strategy Selection Algorithm
|
||
|
||
```python
|
||
def select_debugging_strategy(issue):
|
||
"""AI-powered strategy selection"""
|
||
|
||
score_matrix = {
|
||
'interactive': 0,
|
||
'observability': 0,
|
||
'time_travel': 0,
|
||
'chaos': 0,
|
||
'statistical': 0
|
||
}
|
||
|
||
# Scoring factors
|
||
if issue.reproducible_locally:
|
||
score_matrix['interactive'] += 40
|
||
score_matrix['time_travel'] += 30
|
||
|
||
if issue.production_only:
|
||
score_matrix['observability'] += 50
|
||
score_matrix['interactive'] -= 30
|
||
|
||
if issue.state_complex:
|
||
score_matrix['time_travel'] += 40
|
||
score_matrix['interactive'] += 20
|
||
|
||
if issue.intermittent:
|
||
score_matrix['statistical'] += 45
|
||
score_matrix['chaos'] += 35
|
||
|
||
if issue.under_load:
|
||
score_matrix['chaos'] += 40
|
||
score_matrix['observability'] += 30
|
||
|
||
# AI assistant provides additional scoring based on
|
||
# historical success rates and issue similarity
|
||
ai_scores = get_ai_strategy_recommendations(issue)
|
||
|
||
for strategy, adjustment in ai_scores.items():
|
||
score_matrix[strategy] += adjustment
|
||
|
||
# Return top 2 strategies
|
||
return sorted(score_matrix.items(),
|
||
key=lambda x: x[1],
|
||
reverse=True)[:2]
|
||
```
|
||
|
||
## Production-Safe Debugging Techniques
|
||
|
||
### Non-Invasive Debugging
|
||
|
||
**1. Dynamic Instrumentation**
|
||
```javascript
|
||
// Using OpenTelemetry for production debugging
|
||
const { trace } = require('@opentelemetry/api');
|
||
|
||
function debuggableFunction(userId, data) {
|
||
const span = trace.getActiveSpan();
|
||
|
||
// Add debug attributes without modifying logic
|
||
span?.setAttribute('debug.userId', userId);
|
||
span?.setAttribute('debug.dataSize', JSON.stringify(data).length);
|
||
|
||
try {
|
||
const result = processData(data);
|
||
span?.setAttribute('debug.resultType', typeof result);
|
||
return result;
|
||
} catch (error) {
|
||
span?.recordException(error);
|
||
span?.setAttribute('debug.errorPath', error.stack);
|
||
throw error;
|
||
}
|
||
}
|
||
```
|
||
|
||
**2. Feature-Flagged Debug Logging**
|
||
```typescript
|
||
// Conditional debug logging for specific users
|
||
import { logger } from './logger';
|
||
import { featureFlags } from './feature-flags';
|
||
|
||
function debugLog(context: string, data: any) {
|
||
if (featureFlags.isEnabled('debug-logging', { userId: data.userId })) {
|
||
logger.debug(context, {
|
||
timestamp: Date.now(),
|
||
data: sanitize(data),
|
||
stackTrace: new Error().stack
|
||
});
|
||
}
|
||
}
|
||
|
||
async function processOrder(order: Order) {
|
||
debugLog('order:start', { orderId: order.id, userId: order.userId });
|
||
|
||
// Business logic
|
||
|
||
debugLog('order:complete', { orderId: order.id, status: result.status });
|
||
return result;
|
||
}
|
||
```
|
||
|
||
**3. Sampling-Based Profiling**
|
||
```python
|
||
# Continuous profiling with minimal overhead
|
||
import pyroscope
|
||
|
||
pyroscope.configure(
|
||
application_name="my-service",
|
||
server_address="http://pyroscope:4040",
|
||
sample_rate=100, # Hz - 100 samples per second
|
||
detect_subprocesses=True,
|
||
tags={
|
||
"env": os.getenv("ENV"),
|
||
"version": os.getenv("VERSION")
|
||
}
|
||
)
|
||
|
||
# Profiling runs automatically, query results in Pyroscope UI
|
||
# Filter by specific time ranges when bug occurred
|
||
```
|
||
|
||
### Safe State Inspection
|
||
|
||
**1. Read-Only Debugging Endpoints**
|
||
```go
|
||
// Debug endpoints protected by auth and rate limiting
|
||
func SetupDebugRoutes(r *mux.Router, authMiddleware AuthMiddleware) {
|
||
debug := r.PathPrefix("/debug").Subrouter()
|
||
debug.Use(authMiddleware.RequireAdmin)
|
||
debug.Use(ratelimit.New(5, time.Minute)) // 5 requests per minute
|
||
|
||
debug.HandleFunc("/state/{requestId}", func(w http.ResponseWriter, r *http.Request) {
|
||
// Read-only state inspection
|
||
requestId := mux.Vars(r)["requestId"]
|
||
state, err := stateStore.GetSnapshot(requestId)
|
||
if err != nil {
|
||
http.Error(w, err.Error(), http.StatusNotFound)
|
||
return
|
||
}
|
||
json.NewEncoder(w).Encode(state)
|
||
}).Methods("GET")
|
||
|
||
debug.HandleFunc("/traces/{traceId}", handleTraceQuery).Methods("GET")
|
||
debug.HandleFunc("/metrics/recent", handleRecentMetrics).Methods("GET")
|
||
}
|
||
```
|
||
|
||
**2. Immutable Event Sourcing for Debugging**
|
||
```typescript
|
||
// Event store provides complete history for debugging
|
||
interface DebugEvent {
|
||
eventId: string;
|
||
timestamp: number;
|
||
type: string;
|
||
aggregateId: string;
|
||
payload: any;
|
||
metadata: {
|
||
userId?: string;
|
||
sessionId?: string;
|
||
traceId?: string;
|
||
causationId?: string;
|
||
};
|
||
}
|
||
|
||
class DebugEventStore {
|
||
async getEventStream(aggregateId: string): Promise<DebugEvent[]> {
|
||
// Reconstruct complete state history
|
||
return await this.db.query(
|
||
'SELECT * FROM events WHERE aggregate_id = $1 ORDER BY timestamp',
|
||
[aggregateId]
|
||
);
|
||
}
|
||
|
||
async replayToPoint(aggregateId: string, timestamp: number): Promise<any> {
|
||
const events = await this.getEventStream(aggregateId);
|
||
const relevantEvents = events.filter(e => e.timestamp <= timestamp);
|
||
|
||
// Replay events to reconstruct state at specific point
|
||
return this.applyEvents(relevantEvents);
|
||
}
|
||
}
|
||
```
|
||
|
||
### Gradual Traffic Shifting for Debugging
|
||
|
||
```yaml
|
||
# Kubernetes canary deployment for debug version
|
||
apiVersion: v1
|
||
kind: Service
|
||
metadata:
|
||
name: my-service
|
||
spec:
|
||
selector:
|
||
app: my-service
|
||
ports:
|
||
- port: 80
|
||
---
|
||
apiVersion: apps/v1
|
||
kind: Deployment
|
||
metadata:
|
||
name: my-service-stable
|
||
spec:
|
||
replicas: 9
|
||
template:
|
||
metadata:
|
||
labels:
|
||
app: my-service
|
||
version: stable
|
||
---
|
||
apiVersion: apps/v1
|
||
kind: Deployment
|
||
metadata:
|
||
name: my-service-debug
|
||
spec:
|
||
replicas: 1 # 10% traffic for debug version
|
||
template:
|
||
metadata:
|
||
labels:
|
||
app: my-service
|
||
version: debug
|
||
annotations:
|
||
instrumentation.opentelemetry.io/inject-sdk: "true"
|
||
spec:
|
||
containers:
|
||
- name: app
|
||
env:
|
||
- name: DEBUG_MODE
|
||
value: "true"
|
||
- name: LOG_LEVEL
|
||
value: "debug"
|
||
```
|
||
|
||
## Observability Integration
|
||
|
||
### Distributed Tracing Integration
|
||
|
||
**Honeycomb Query-Driven Debugging:**
|
||
```javascript
|
||
// Instrumentation for query-driven debugging
|
||
const { trace, context } = require('@opentelemetry/api');
|
||
const { HoneycombSDK } = require('@honeycombio/opentelemetry-node');
|
||
|
||
const sdk = new HoneycombSDK({
|
||
apiKey: process.env.HONEYCOMB_API_KEY,
|
||
dataset: 'my-service',
|
||
serviceName: 'api-server'
|
||
});
|
||
|
||
function instrumentForDebugging(fn, metadata = {}) {
|
||
return async function(...args) {
|
||
const tracer = trace.getTracer('debugger');
|
||
const span = tracer.startSpan(metadata.operationName || fn.name);
|
||
|
||
// Add debugging context
|
||
span.setAttribute('debug.functionName', fn.name);
|
||
span.setAttribute('debug.argsCount', args.length);
|
||
span.setAttribute('debug.timestamp', Date.now());
|
||
|
||
// Add custom metadata for filtering in Honeycomb
|
||
Object.entries(metadata).forEach(([key, value]) => {
|
||
span.setAttribute(`debug.${key}`, value);
|
||
});
|
||
|
||
try {
|
||
const result = await context.with(
|
||
trace.setSpan(context.active(), span),
|
||
() => fn.apply(this, args)
|
||
);
|
||
|
||
span.setAttribute('debug.resultType', typeof result);
|
||
span.setStatus({ code: 1 }); // OK
|
||
return result;
|
||
} catch (error) {
|
||
span.recordException(error);
|
||
span.setAttribute('debug.errorType', error.constructor.name);
|
||
span.setStatus({ code: 2, message: error.message }); // ERROR
|
||
throw error;
|
||
} finally {
|
||
span.end();
|
||
}
|
||
};
|
||
}
|
||
|
||
// Usage with AI-suggested instrumentation points
|
||
const debugProcess = instrumentForDebugging(processPayment, {
|
||
operationName: 'payment.process',
|
||
criticalPath: true,
|
||
debugPriority: 'high'
|
||
});
|
||
```
|
||
|
||
**Honeycomb Query Examples:**
|
||
```
|
||
# Find slow traces affecting specific users
|
||
BREAKDOWN(trace.trace_id)
|
||
WHERE duration_ms > 1000
|
||
AND user.id IN ("12345", "67890")
|
||
ORDER BY duration_ms DESC
|
||
|
||
# Compare successful vs failed requests
|
||
HEATMAP(duration_ms)
|
||
WHERE endpoint = "/api/checkout"
|
||
GROUP BY error_occurred
|
||
|
||
# Identify correlated services in failures
|
||
COUNT_DISTINCT(service.name)
|
||
WHERE error = true
|
||
GROUP BY trace.trace_id
|
||
```
|
||
|
||
### Sentry Integration for Error Context
|
||
|
||
```python
|
||
# Enhanced Sentry context for debugging
|
||
import sentry_sdk
|
||
from sentry_sdk import set_context, capture_exception, add_breadcrumb
|
||
|
||
def configure_debug_context(user=None, request_data=None):
|
||
"""Add rich context for debugging in Sentry"""
|
||
|
||
if user:
|
||
sentry_sdk.set_user({
|
||
"id": user.id,
|
||
"email": user.email,
|
||
"segment": user.segment,
|
||
"subscription_tier": user.tier
|
||
})
|
||
|
||
if request_data:
|
||
set_context("request_details", {
|
||
"endpoint": request_data.get("endpoint"),
|
||
"method": request_data.get("method"),
|
||
"params": sanitize_params(request_data.get("params")),
|
||
"headers": sanitize_headers(request_data.get("headers"))
|
||
})
|
||
|
||
# Add system context
|
||
set_context("system", {
|
||
"hostname": socket.gethostname(),
|
||
"process_id": os.getpid(),
|
||
"thread_id": threading.get_ident(),
|
||
"memory_mb": psutil.Process().memory_info().rss / 1024 / 1024
|
||
})
|
||
|
||
def debug_operation(operation_name):
|
||
"""Decorator for debugging with breadcrumbs"""
|
||
def decorator(fn):
|
||
def wrapper(*args, **kwargs):
|
||
add_breadcrumb(
|
||
category='debug',
|
||
message=f'Entering {operation_name}',
|
||
level='debug',
|
||
data={'args_count': len(args), 'kwargs_keys': list(kwargs.keys())}
|
||
)
|
||
|
||
try:
|
||
result = fn(*args, **kwargs)
|
||
add_breadcrumb(
|
||
category='debug',
|
||
message=f'Completed {operation_name}',
|
||
level='debug',
|
||
data={'result_type': type(result).__name__}
|
||
)
|
||
return result
|
||
except Exception as e:
|
||
add_breadcrumb(
|
||
category='error',
|
||
message=f'Failed {operation_name}',
|
||
level='error',
|
||
data={'error': str(e)}
|
||
)
|
||
capture_exception(e)
|
||
raise
|
||
return wrapper
|
||
return decorator
|
||
|
||
# AI-powered error grouping in Sentry
|
||
# Configure fingerprinting for better debugging
|
||
sentry_sdk.init(
|
||
dsn=os.getenv("SENTRY_DSN"),
|
||
before_send=lambda event, hint: enhance_event_for_debugging(event, hint),
|
||
traces_sample_rate=0.1,
|
||
profiles_sample_rate=0.1
|
||
)
|
||
|
||
def enhance_event_for_debugging(event, hint):
|
||
"""Add AI-suggested fingerprinting"""
|
||
if 'exception' in event:
|
||
exc = event['exception']['values'][0]
|
||
|
||
# Custom fingerprinting based on error patterns
|
||
fingerprint = ['{{ default }}']
|
||
|
||
# AI can suggest better grouping strategies
|
||
if 'database' in exc.get('type', '').lower():
|
||
fingerprint.append('db-error')
|
||
fingerprint.append(extract_db_operation(exc))
|
||
|
||
event['fingerprint'] = fingerprint
|
||
|
||
return event
|
||
```
|
||
|
||
## Post-Debugging Validation
|
||
|
||
### Automated Fix Verification
|
||
|
||
After implementing fix, run comprehensive validation:
|
||
|
||
```typescript
|
||
// Post-fix validation framework
|
||
interface ValidationResult {
|
||
testsPassed: boolean;
|
||
performanceRegression: boolean;
|
||
errorRateChanged: boolean;
|
||
metricsComparison: MetricsComparison;
|
||
recommendations: string[];
|
||
}
|
||
|
||
class DebugFixValidator {
|
||
async validateFix(
|
||
issueId: string,
|
||
fixCommit: string,
|
||
baselineCommit: string
|
||
): Promise<ValidationResult> {
|
||
|
||
const results: ValidationResult = {
|
||
testsPassed: false,
|
||
performanceRegression: false,
|
||
errorRateChanged: false,
|
||
metricsComparison: {},
|
||
recommendations: []
|
||
};
|
||
|
||
// 1. Run existing test suite
|
||
const testResults = await this.runTests(fixCommit);
|
||
results.testsPassed = testResults.allPassed;
|
||
|
||
if (!results.testsPassed) {
|
||
results.recommendations.push(
|
||
'Fix broke existing tests. Review test failures.'
|
||
);
|
||
return results;
|
||
}
|
||
|
||
// 2. Performance comparison
|
||
const perfBaseline = await this.runPerfTests(baselineCommit);
|
||
const perfAfterFix = await this.runPerfTests(fixCommit);
|
||
|
||
results.performanceRegression = this.detectRegression(
|
||
perfBaseline,
|
||
perfAfterFix
|
||
);
|
||
|
||
if (results.performanceRegression) {
|
||
results.recommendations.push(
|
||
`Performance regression detected: ${this.formatDiff(perfBaseline, perfAfterFix)}`
|
||
);
|
||
}
|
||
|
||
// 3. Canary deployment validation
|
||
if (process.env.ENABLE_CANARY === 'true') {
|
||
const canaryResults = await this.runCanaryDeployment(fixCommit);
|
||
results.errorRateChanged = canaryResults.errorRateDelta > 0.05;
|
||
|
||
if (results.errorRateChanged) {
|
||
results.recommendations.push(
|
||
`Error rate increased by ${(canaryResults.errorRateDelta * 100).toFixed(2)}%`
|
||
);
|
||
}
|
||
}
|
||
|
||
// 4. AI-powered code review of the fix
|
||
const aiReview = await this.getAICodeReview(issueId, fixCommit);
|
||
results.recommendations.push(...aiReview.suggestions);
|
||
|
||
return results;
|
||
}
|
||
|
||
private async getAICodeReview(
|
||
issueId: string,
|
||
commit: string
|
||
): Promise<AIReview> {
|
||
// Use GitHub Copilot or Claude to review the fix
|
||
const diff = await this.getCommitDiff(commit);
|
||
|
||
return await aiAssistant.review({
|
||
context: `Reviewing fix for issue ${issueId}`,
|
||
diff,
|
||
checks: [
|
||
'error handling completeness',
|
||
'edge case coverage',
|
||
'potential side effects',
|
||
'test coverage adequacy',
|
||
'code clarity and maintainability'
|
||
]
|
||
});
|
||
}
|
||
}
|
||
```
|
||
|
||
### Regression Prevention
|
||
|
||
```python
|
||
# Automated regression test generation
|
||
class RegressionTestGenerator:
|
||
def __init__(self, issue_tracker, ai_assistant):
|
||
self.issue_tracker = issue_tracker
|
||
self.ai_assistant = ai_assistant
|
||
|
||
async def generate_tests_for_fix(self, issue_id: str, fix_commit: str):
|
||
"""Generate regression tests using AI"""
|
||
|
||
# Get issue details
|
||
issue = await self.issue_tracker.get(issue_id)
|
||
|
||
# Get code changes
|
||
diff = await self.get_git_diff(fix_commit)
|
||
|
||
# AI generates test cases
|
||
test_cases = await self.ai_assistant.generate_tests({
|
||
'issue_description': issue.description,
|
||
'reproduction_steps': issue.reproduction_steps,
|
||
'code_changes': diff,
|
||
'test_framework': self.detect_test_framework(),
|
||
'coverage_target': 'edge cases and failure modes'
|
||
})
|
||
|
||
# Write tests to appropriate files
|
||
for test_case in test_cases:
|
||
await self.write_test_file(
|
||
test_case.file_path,
|
||
test_case.content
|
||
)
|
||
|
||
# Validate tests catch the original bug
|
||
validation = await self.validate_tests_catch_bug(
|
||
issue_id,
|
||
fix_commit
|
||
)
|
||
|
||
return {
|
||
'tests_generated': len(test_cases),
|
||
'validates_fix': validation.successful,
|
||
'test_files': [tc.file_path for tc in test_cases]
|
||
}
|
||
```
|
||
|
||
### Knowledge Base Update
|
||
|
||
```javascript
|
||
// Automatically update debugging knowledge base
|
||
class DebugKnowledgeBase {
|
||
async recordDebugSession(session) {
|
||
const entry = {
|
||
issueId: session.issueId,
|
||
timestamp: new Date().toISOString(),
|
||
errorPattern: session.errorSignature,
|
||
rootCause: session.rootCause,
|
||
debugStrategy: session.strategyUsed,
|
||
timeToResolve: session.duration,
|
||
effectiveTools: session.toolsUsed,
|
||
searchKeywords: await this.extractKeywords(session),
|
||
relatedIssues: await this.findSimilarIssues(session),
|
||
preventionMeasures: session.preventionRecommendations,
|
||
aiInsights: session.aiAssistantAnalysis
|
||
};
|
||
|
||
await this.db.insert('debug_sessions', entry);
|
||
|
||
// Update AI model training data
|
||
await this.ai.addTrainingExample({
|
||
input: {
|
||
errorMessage: session.error,
|
||
stackTrace: session.stackTrace,
|
||
context: session.environment
|
||
},
|
||
output: {
|
||
rootCause: session.rootCause,
|
||
solution: session.solution,
|
||
confidence: session.confidenceScore
|
||
}
|
||
});
|
||
}
|
||
|
||
async getSimilarDebugSessions(errorSignature) {
|
||
// Vector similarity search for similar issues
|
||
return await this.vectorDb.similaritySearch(
|
||
errorSignature,
|
||
{
|
||
limit: 5,
|
||
threshold: 0.8
|
||
}
|
||
);
|
||
}
|
||
}
|
||
```
|
||
|
||
## Complete Examples
|
||
|
||
### Example 1: AI-Powered Debugging Session with GitHub Copilot
|
||
|
||
```typescript
|
||
/**
|
||
* Complete debugging session for intermittent checkout failure
|
||
* Using: GitHub Copilot Chat, DataDog, Sentry
|
||
*/
|
||
|
||
// Issue: "Checkout fails intermittently with 'Payment processing timeout'"
|
||
|
||
// Step 1: AI-assisted initial analysis
|
||
// Copilot Chat prompt: "Analyze this error pattern and suggest root causes"
|
||
|
||
import { DataDogClient } from '@datadog/datadog-api-client';
|
||
import * as Sentry from '@sentry/node';
|
||
|
||
class CheckoutDebugSession {
|
||
private dd: DataDogClient;
|
||
private sessionId: string;
|
||
|
||
constructor(sessionId: string) {
|
||
this.sessionId = sessionId;
|
||
this.dd = new DataDogClient(process.env.DD_API_KEY);
|
||
}
|
||
|
||
async investigateIssue() {
|
||
console.log('=== Starting AI-Assisted Debug Session ===');
|
||
|
||
// Step 2: Gather observability data
|
||
const sentryIssues = await this.getSentryErrorGroup();
|
||
const ddTraces = await this.getDataDogTraces();
|
||
const ddMetrics = await this.getRelevantMetrics();
|
||
|
||
console.log('\n[1] Sentry Error Analysis:');
|
||
console.log(` - Occurrences: ${sentryIssues.count}`);
|
||
console.log(` - Affected users: ${sentryIssues.userCount}`);
|
||
console.log(` - First seen: ${sentryIssues.firstSeen}`);
|
||
console.log(` - Last seen: ${sentryIssues.lastSeen}`);
|
||
console.log(` - User impact: ${sentryIssues.impactScore}`);
|
||
|
||
// Step 3: AI analysis of error patterns
|
||
// GitHub Copilot analyzes the error group and suggests:
|
||
// "Payment timeout correlates with high database latency"
|
||
|
||
console.log('\n[2] DataDog Trace Analysis:');
|
||
const slowTraces = ddTraces.filter(t => t.duration > 5000);
|
||
console.log(` - Total traces analyzed: ${ddTraces.length}`);
|
||
console.log(` - Slow traces (>5s): ${slowTraces.length}`);
|
||
|
||
// AI identifies pattern: DB queries taking 4-6 seconds
|
||
const dbSpans = slowTraces.flatMap(t =>
|
||
t.spans.filter(s => s.resource.startsWith('SELECT'))
|
||
);
|
||
|
||
console.log(` - Slow DB queries: ${dbSpans.length}`);
|
||
console.log(` - Slowest query: ${this.formatQuery(dbSpans[0])}`);
|
||
|
||
// Step 4: Hypothesis generation with AI
|
||
const hypotheses = [
|
||
{
|
||
name: 'Database N+1 query in payment verification',
|
||
probability: 85,
|
||
evidence: 'Multiple SELECT queries to user_payment_methods table',
|
||
test: 'Add query logging and count queries per checkout'
|
||
},
|
||
{
|
||
name: 'Lock contention on payment_transactions table',
|
||
probability: 60,
|
||
evidence: 'Correlation with concurrent checkouts',
|
||
test: 'Check pg_stat_activity for blocked queries'
|
||
},
|
||
{
|
||
name: 'External payment gateway timeout',
|
||
probability: 45,
|
||
evidence: 'Some traces show gateway response > 3s',
|
||
test: 'Add separate instrumentation for gateway calls'
|
||
}
|
||
];
|
||
|
||
console.log('\n[3] AI-Generated Hypotheses:');
|
||
hypotheses.forEach((h, i) => {
|
||
console.log(` ${i + 1}. ${h.name} (${h.probability}%)`);
|
||
console.log(` Evidence: ${h.evidence}`);
|
||
console.log(` Test: ${h.test}`);
|
||
});
|
||
|
||
// Step 5: Intelligent breakpoint placement
|
||
// AI suggests key points to instrument
|
||
const instrumentationPoints = await this.addSmartInstrumentation();
|
||
|
||
console.log('\n[4] Added Smart Instrumentation:');
|
||
instrumentationPoints.forEach(point => {
|
||
console.log(` - ${point.file}:${point.line} - ${point.reason}`);
|
||
});
|
||
|
||
// Step 6: Deploy instrumented version to 10% traffic
|
||
await this.deployCanaryWithInstrumentation();
|
||
|
||
console.log('\n[5] Canary Deployment:');
|
||
console.log(' - Deployed instrumented version to 10% traffic');
|
||
console.log(' - Monitoring for 15 minutes...');
|
||
|
||
// Wait and collect data
|
||
await this.sleep(15 * 60 * 1000);
|
||
|
||
// Step 7: Analyze collected data with AI
|
||
const analysis = await this.analyzeInstrumentationData();
|
||
|
||
console.log('\n[6] Root Cause Identified:');
|
||
console.log(` - ${analysis.rootCause}`);
|
||
console.log(` - Confidence: ${analysis.confidence}%`);
|
||
console.log(` - Evidence: ${analysis.evidence}`);
|
||
|
||
// Step 8: AI suggests fix
|
||
const suggestedFix = await this.generateFix(analysis);
|
||
|
||
console.log('\n[7] Suggested Fix:');
|
||
console.log(suggestedFix.code);
|
||
console.log(`\n - Impact: ${suggestedFix.impact}`);
|
||
console.log(` - Risk: ${suggestedFix.risk}`);
|
||
console.log(` - Test coverage: ${suggestedFix.testCoverage}`);
|
||
|
||
return {
|
||
rootCause: analysis.rootCause,
|
||
fix: suggestedFix,
|
||
validationPlan: this.generateValidationPlan(analysis, suggestedFix)
|
||
};
|
||
}
|
||
|
||
private async getSentryErrorGroup() {
|
||
const issues = await Sentry.getIssue('CHECKOUT_TIMEOUT_001');
|
||
|
||
return {
|
||
count: issues.count,
|
||
userCount: issues.userCount,
|
||
firstSeen: issues.firstSeen,
|
||
lastSeen: issues.lastSeen,
|
||
impactScore: this.calculateImpact(issues),
|
||
breadcrumbs: issues.latestEvent.breadcrumbs,
|
||
tags: issues.tags
|
||
};
|
||
}
|
||
|
||
private async getDataDogTraces() {
|
||
const query = `
|
||
service:checkout-api
|
||
operation_name:process_payment
|
||
@error:true
|
||
@duration:>5000ms
|
||
`;
|
||
|
||
return await this.dd.traces.search({
|
||
query,
|
||
from: Date.now() - 24 * 3600 * 1000,
|
||
to: Date.now(),
|
||
limit: 100
|
||
});
|
||
}
|
||
|
||
private async addSmartInstrumentation() {
|
||
// AI suggests these instrumentation points
|
||
return [
|
||
{
|
||
file: 'src/checkout/payment.ts',
|
||
line: 145,
|
||
reason: 'Payment verification entry point'
|
||
},
|
||
{
|
||
file: 'src/checkout/payment.ts',
|
||
line: 178,
|
||
reason: 'Database query execution (potential N+1)'
|
||
},
|
||
{
|
||
file: 'src/checkout/payment.ts',
|
||
line: 203,
|
||
reason: 'External gateway call'
|
||
},
|
||
{
|
||
file: 'src/checkout/payment.ts',
|
||
line: 245,
|
||
reason: 'Transaction commit point'
|
||
}
|
||
];
|
||
}
|
||
|
||
private async analyzeInstrumentationData() {
|
||
// AI analyzes collected data and identifies root cause
|
||
return {
|
||
rootCause: 'N+1 query: Loading payment methods for each item in cart separately',
|
||
confidence: 92,
|
||
evidence: 'Average 15 queries per checkout, each taking 300-400ms',
|
||
affectedCode: 'src/checkout/payment.ts:178-195',
|
||
suggestedFix: 'Use eager loading with JOIN or batch query'
|
||
};
|
||
}
|
||
|
||
private async generateFix(analysis) {
|
||
// AI generates the fix code
|
||
return {
|
||
code: `
|
||
// Before (N+1 query):
|
||
for (const item of cart.items) {
|
||
const paymentMethod = await PaymentMethod.findOne({
|
||
where: { userId: cart.userId, itemId: item.id }
|
||
});
|
||
await processPayment(item, paymentMethod);
|
||
}
|
||
|
||
// After (batched query):
|
||
const itemIds = cart.items.map(i => i.id);
|
||
const paymentMethods = await PaymentMethod.findAll({
|
||
where: {
|
||
userId: cart.userId,
|
||
itemId: { [Op.in]: itemIds }
|
||
}
|
||
});
|
||
|
||
const methodMap = new Map(
|
||
paymentMethods.map(pm => [pm.itemId, pm])
|
||
);
|
||
|
||
for (const item of cart.items) {
|
||
const paymentMethod = methodMap.get(item.id);
|
||
await processPayment(item, paymentMethod);
|
||
}
|
||
`.trim(),
|
||
impact: 'Reduces queries from ~15 to 1, expected 3-4s latency reduction',
|
||
risk: 'Low - preserves existing logic, only changes data fetching',
|
||
testCoverage: 'Add test for batch payment processing'
|
||
};
|
||
}
|
||
|
||
private generateValidationPlan(analysis, fix) {
|
||
return {
|
||
steps: [
|
||
'Apply fix to local environment',
|
||
'Run existing payment test suite',
|
||
'Add new test for batch payment method loading',
|
||
'Deploy to staging with full instrumentation',
|
||
'Run load test simulating 100 concurrent checkouts',
|
||
'Compare latency metrics: baseline vs fix',
|
||
'Canary deploy to 10% production for 1 hour',
|
||
'Monitor error rate and latency in DataDog',
|
||
'If metrics improve by >50%, roll out to 100%'
|
||
],
|
||
successCriteria: {
|
||
errorRateReduction: '>90%',
|
||
latencyReduction: '>70%',
|
||
queryCountReduction: '>85%'
|
||
}
|
||
};
|
||
}
|
||
|
||
private sleep(ms: number) {
|
||
return new Promise(resolve => setTimeout(resolve, ms));
|
||
}
|
||
}
|
||
|
||
// Run the debug session
|
||
const session = new CheckoutDebugSession('checkout-timeout-issue');
|
||
const result = await session.investigateIssue();
|
||
|
||
console.log('\n=== Debug Session Complete ===');
|
||
console.log(JSON.stringify(result, null, 2));
|
||
```
|
||
|
||
### Example 2: Observability-Driven Production Debugging
|
||
|
||
```python
|
||
"""
|
||
Complete workflow for debugging production memory leak
|
||
Using: Honeycomb, Pyroscope, Grafana, Claude Code
|
||
"""
|
||
|
||
import asyncio
|
||
from datetime import datetime, timedelta
|
||
from honeycomb import HoneycombClient
|
||
from pyroscope import Profiler
|
||
import anthropic
|
||
|
||
class ProductionMemoryLeakDebugger:
|
||
def __init__(self, service_name: str):
|
||
self.service_name = service_name
|
||
self.honeycomb = HoneycombClient(api_key=os.getenv("HONEYCOMB_API_KEY"))
|
||
self.anthropic = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
|
||
self.findings = []
|
||
|
||
async def debug_memory_leak(self):
|
||
"""
|
||
Complete debugging workflow for memory leak
|
||
"""
|
||
print("=== Production Memory Leak Investigation ===\n")
|
||
|
||
# Step 1: Identify memory growth pattern
|
||
print("[1] Analyzing Memory Growth Pattern")
|
||
memory_pattern = await self.analyze_memory_metrics()
|
||
print(f" - Memory growth rate: {memory_pattern['growth_rate_mb_per_hour']} MB/hour")
|
||
print(f" - Time to OOM: ~{memory_pattern['hours_to_oom']} hours")
|
||
print(f" - Pattern type: {memory_pattern['pattern_type']}")
|
||
|
||
self.findings.append({
|
||
"category": "memory_metrics",
|
||
"data": memory_pattern
|
||
})
|
||
|
||
# Step 2: Continuous profiling analysis
|
||
print("\n[2] Analyzing Continuous Profiling Data (Pyroscope)")
|
||
profile_analysis = await self.analyze_profiles()
|
||
print(f" - Top memory allocator: {profile_analysis['top_allocator']}")
|
||
print(f" - Allocation rate: {profile_analysis['alloc_rate_mb_per_sec']} MB/s")
|
||
print(f" - Suspected leak locations:")
|
||
for loc in profile_analysis['suspected_locations']:
|
||
print(f" - {loc['function']} at {loc['file']}:{loc['line']}")
|
||
|
||
self.findings.append({
|
||
"category": "profiling",
|
||
"data": profile_analysis
|
||
})
|
||
|
||
# Step 3: Distributed trace analysis
|
||
print("\n[3] Analyzing Request Traces for Memory Patterns")
|
||
trace_analysis = await self.analyze_traces_for_memory()
|
||
print(f" - Requests analyzed: {trace_analysis['request_count']}")
|
||
print(f" - Memory leak correlation:")
|
||
print(f" - High memory requests: {trace_analysis['high_memory_requests']}")
|
||
print(f" - Common patterns: {trace_analysis['common_patterns']}")
|
||
|
||
self.findings.append({
|
||
"category": "traces",
|
||
"data": trace_analysis
|
||
})
|
||
|
||
# Step 4: AI-powered root cause analysis
|
||
print("\n[4] AI Root Cause Analysis (Claude)")
|
||
root_cause = await self.ai_analyze_findings()
|
||
print(f" - Root cause: {root_cause['diagnosis']}")
|
||
print(f" - Confidence: {root_cause['confidence']}%")
|
||
print(f" - Evidence chain:")
|
||
for evidence in root_cause['evidence']:
|
||
print(f" - {evidence}")
|
||
|
||
# Step 5: Generate and test hypothesis
|
||
print("\n[5] Hypothesis Testing")
|
||
hypothesis = root_cause['hypothesis']
|
||
test_results = await self.test_hypothesis(hypothesis)
|
||
print(f" - Hypothesis: {hypothesis['statement']}")
|
||
print(f" - Test result: {test_results['outcome']}")
|
||
print(f" - Evidence: {test_results['evidence']}")
|
||
|
||
# Step 6: Implement targeted instrumentation
|
||
print("\n[6] Deploying Targeted Instrumentation")
|
||
instrumentation = await self.deploy_targeted_instrumentation(
|
||
root_cause['suspected_code_paths']
|
||
)
|
||
print(f" - Instrumented {len(instrumentation['points'])} code paths")
|
||
print(f" - Monitoring for 30 minutes...")
|
||
|
||
await asyncio.sleep(30 * 60) # Wait 30 minutes
|
||
|
||
# Step 7: Analyze instrumentation data
|
||
print("\n[7] Analyzing Instrumentation Results")
|
||
detailed_analysis = await self.analyze_instrumentation_data()
|
||
print(f" - Confirmed root cause: {detailed_analysis['confirmed']}")
|
||
print(f" - Leak location: {detailed_analysis['leak_location']}")
|
||
print(f" - Leak type: {detailed_analysis['leak_type']}")
|
||
|
||
# Step 8: AI generates fix
|
||
print("\n[8] Generating Fix (AI-assisted)")
|
||
fix = await self.generate_fix(detailed_analysis)
|
||
print(f" - Fix strategy: {fix['strategy']}")
|
||
print(f" - Code changes required: {len(fix['changes'])} files")
|
||
print(f" - Risk assessment: {fix['risk']}")
|
||
|
||
# Step 9: Validation plan
|
||
print("\n[9] Fix Validation Plan")
|
||
validation = self.create_validation_plan(fix)
|
||
for step_num, step in enumerate(validation['steps'], 1):
|
||
print(f" {step_num}. {step}")
|
||
|
||
return {
|
||
"root_cause": detailed_analysis,
|
||
"fix": fix,
|
||
"validation_plan": validation,
|
||
"findings": self.findings
|
||
}
|
||
|
||
async def analyze_memory_metrics(self):
|
||
"""Query Grafana/Prometheus for memory metrics"""
|
||
# Simulate Prometheus query
|
||
# In real implementation: query actual Prometheus
|
||
|
||
return {
|
||
"growth_rate_mb_per_hour": 45.3,
|
||
"hours_to_oom": 18.5,
|
||
"pattern_type": "linear_growth",
|
||
"baseline_memory_mb": 512,
|
||
"current_memory_mb": 1847,
|
||
"measurement_period_hours": 24
|
||
}
|
||
|
||
async def analyze_profiles(self):
|
||
"""Analyze Pyroscope continuous profiling data"""
|
||
# Query Pyroscope for memory allocation profiles
|
||
# Compare profiles over time to identify growing allocations
|
||
|
||
return {
|
||
"top_allocator": "cache_manager.add_entry()",
|
||
"alloc_rate_mb_per_sec": 0.012,
|
||
"suspected_locations": [
|
||
{
|
||
"function": "cache_manager.add_entry",
|
||
"file": "src/cache/manager.py",
|
||
"line": 145,
|
||
"alloc_percent": 67.3
|
||
},
|
||
{
|
||
"function": "request_handler.store_session",
|
||
"file": "src/api/handler.py",
|
||
"line": 89,
|
||
"alloc_percent": 18.2
|
||
}
|
||
],
|
||
"time_range": "last_24_hours"
|
||
}
|
||
|
||
async def analyze_traces_for_memory(self):
|
||
"""Analyze Honeycomb traces for memory-related patterns"""
|
||
|
||
# Honeycomb query to find traces with high memory allocation
|
||
query = """
|
||
BREAKDOWN(trace.trace_id)
|
||
WHERE service.name = '{service}'
|
||
AND memory.delta_mb > 10
|
||
ORDER BY memory.delta_mb DESC
|
||
LIMIT 100
|
||
""".format(service=self.service_name)
|
||
|
||
traces = await self.honeycomb.query(query)
|
||
|
||
# Analyze common patterns in high-memory traces
|
||
common_patterns = self.extract_common_patterns(traces)
|
||
|
||
return {
|
||
"request_count": len(traces),
|
||
"high_memory_requests": len([t for t in traces if t['memory_delta'] > 20]),
|
||
"common_patterns": [
|
||
"All include cache write operation",
|
||
"87% involve large JSON parsing",
|
||
"Cache eviction never triggered"
|
||
],
|
||
"top_endpoints": [
|
||
{"endpoint": "/api/data/sync", "count": 43},
|
||
{"endpoint": "/api/batch/process", "count": 28}
|
||
]
|
||
}
|
||
|
||
async def ai_analyze_findings(self):
|
||
"""Use Claude to analyze all findings and determine root cause"""
|
||
|
||
# Prepare context for Claude
|
||
context = {
|
||
"findings": self.findings,
|
||
"service": self.service_name,
|
||
"symptoms": "Linear memory growth, ~45MB/hour, OOM in ~18 hours"
|
||
}
|
||
|
||
prompt = f"""
|
||
Analyze the following production memory leak data and determine the root cause:
|
||
|
||
{json.dumps(context, indent=2)}
|
||
|
||
Provide:
|
||
1. Root cause diagnosis
|
||
2. Confidence level (0-100%)
|
||
3. Evidence chain supporting the diagnosis
|
||
4. Testable hypothesis
|
||
5. Suspected code paths
|
||
|
||
Format as JSON.
|
||
"""
|
||
|
||
message = await self.anthropic.messages.create(
|
||
model="claude-sonnet-4-5-20250929",
|
||
max_tokens=2000,
|
||
messages=[{"role": "user", "content": prompt}]
|
||
)
|
||
|
||
analysis = json.loads(message.content[0].text)
|
||
|
||
return {
|
||
"diagnosis": "Cache entries added but never evicted - missing TTL and size limit",
|
||
"confidence": 94,
|
||
"evidence": [
|
||
"Profiling shows cache_manager.add_entry() as top allocator (67%)",
|
||
"Traces show cache writes but no cache evictions",
|
||
"Linear growth pattern consistent with unbounded cache",
|
||
"Growth rate matches request rate × average entry size"
|
||
],
|
||
"hypothesis": {
|
||
"statement": "Cache has no eviction policy, causing unbounded memory growth",
|
||
"test": "Add cache size metrics and verify no evictions occurring",
|
||
"expected_outcome": "Cache size grows linearly with request count"
|
||
},
|
||
"suspected_code_paths": [
|
||
"src/cache/manager.py:add_entry()",
|
||
"src/cache/manager.py:__init__()",
|
||
"src/api/handler.py:store_session()"
|
||
]
|
||
}
|
||
|
||
async def test_hypothesis(self, hypothesis):
|
||
"""Deploy instrumentation to test hypothesis"""
|
||
|
||
# Add metrics to track cache size and evictions
|
||
# In real implementation: deploy instrumented version
|
||
|
||
await asyncio.sleep(5) # Simulate data collection
|
||
|
||
return {
|
||
"outcome": "CONFIRMED",
|
||
"evidence": "Cache size grew from 1,247 entries to 3,891 entries in 30 minutes. Zero evictions recorded.",
|
||
"metrics": {
|
||
"cache_size_start": 1247,
|
||
"cache_size_end": 3891,
|
||
"evictions_count": 0,
|
||
"additions_count": 2644
|
||
}
|
||
}
|
||
|
||
async def deploy_targeted_instrumentation(self, code_paths):
|
||
"""Deploy focused instrumentation on suspected code paths"""
|
||
|
||
instrumentation_points = []
|
||
|
||
for path in code_paths:
|
||
instrumentation_points.append({
|
||
"file": path,
|
||
"metrics": [
|
||
"cache.size",
|
||
"cache.evictions",
|
||
"cache.additions",
|
||
"memory.used_mb"
|
||
],
|
||
"log_level": "debug"
|
||
})
|
||
|
||
# In real implementation: update deployment with instrumentation
|
||
|
||
return {"points": instrumentation_points}
|
||
|
||
async def analyze_instrumentation_data(self):
|
||
"""Analyze detailed instrumentation data"""
|
||
|
||
return {
|
||
"confirmed": True,
|
||
"leak_location": "src/cache/manager.py:CacheManager",
|
||
"leak_type": "unbounded_cache",
|
||
"details": {
|
||
"cache_implementation": "dict without size limit",
|
||
"eviction_policy": "none",
|
||
"ttl_configured": False,
|
||
"max_size_configured": False
|
||
},
|
||
"impact": "All cache entries retained indefinitely"
|
||
}
|
||
|
||
async def generate_fix(self, analysis):
|
||
"""AI generates fix for the memory leak"""
|
||
|
||
prompt = f"""
|
||
Generate a fix for this memory leak:
|
||
|
||
{json.dumps(analysis, indent=2)}
|
||
|
||
Requirements:
|
||
- Add LRU cache with size limit
|
||
- Add TTL-based eviction
|
||
- Maintain existing API
|
||
- Production-safe changes only
|
||
|
||
Provide complete code and migration strategy.
|
||
"""
|
||
|
||
message = await self.anthropic.messages.create(
|
||
model="claude-sonnet-4-5-20250929",
|
||
max_tokens=3000,
|
||
messages=[{"role": "user", "content": prompt}]
|
||
)
|
||
|
||
return {
|
||
"strategy": "Replace dict with cachetools.LRUCache, add TTL",
|
||
"changes": [
|
||
{
|
||
"file": "src/cache/manager.py",
|
||
"description": "Implement LRU cache with size limit and TTL",
|
||
"code": """
|
||
from cachetools import TTLCache
|
||
from threading import RLock
|
||
|
||
class CacheManager:
|
||
def __init__(self, max_size=10000, ttl_seconds=3600):
|
||
# LRU cache with size limit and TTL
|
||
self.cache = TTLCache(maxsize=max_size, ttl=ttl_seconds)
|
||
self.lock = RLock()
|
||
|
||
def add_entry(self, key, value):
|
||
with self.lock:
|
||
self.cache[key] = value
|
||
# Eviction happens automatically
|
||
|
||
def get_entry(self, key):
|
||
with self.lock:
|
||
return self.cache.get(key)
|
||
"""
|
||
},
|
||
{
|
||
"file": "src/config.py",
|
||
"description": "Add cache configuration",
|
||
"code": """
|
||
CACHE_MAX_SIZE = int(os.getenv('CACHE_MAX_SIZE', '10000'))
|
||
CACHE_TTL_SECONDS = int(os.getenv('CACHE_TTL_SECONDS', '3600'))
|
||
"""
|
||
}
|
||
],
|
||
"risk": "LOW - Backward compatible API, configurable limits",
|
||
"dependencies": ["cachetools>=5.3.0"],
|
||
"rollback_plan": "Feature flag to switch between old and new cache"
|
||
}
|
||
|
||
def create_validation_plan(self, fix):
|
||
"""Create comprehensive validation plan for the fix"""
|
||
|
||
return {
|
||
"steps": [
|
||
"Add comprehensive unit tests for cache eviction",
|
||
"Run memory profiling in staging with production traffic replay",
|
||
"Verify cache size remains bounded under load",
|
||
"Verify cache hit rate remains acceptable",
|
||
"Deploy with feature flag to 1% traffic",
|
||
"Monitor memory metrics for 2 hours",
|
||
"If stable, increase to 10% for 4 hours",
|
||
"If memory growth stopped, roll out to 100%",
|
||
"Continue monitoring for 24 hours post-rollout"
|
||
],
|
||
"success_criteria": {
|
||
"memory_growth": "< 5MB/hour (down from 45MB/hour)",
|
||
"cache_hit_rate": "> 85%",
|
||
"cache_size": "< 10,000 entries",
|
||
"eviction_rate": "> 0 evictions/minute",
|
||
"error_rate": "no increase"
|
||
},
|
||
"monitoring": [
|
||
"Memory usage (RSS)",
|
||
"Cache size metric",
|
||
"Cache hit/miss rates",
|
||
"Eviction rate",
|
||
"Request latency p50/p95/p99",
|
||
"Error rate"
|
||
]
|
||
}
|
||
|
||
def extract_common_patterns(self, traces):
|
||
"""Extract common patterns from trace data"""
|
||
# Simplified pattern extraction
|
||
return []
|
||
|
||
|
||
# Execute the debug workflow
|
||
async def main():
|
||
debugger = ProductionMemoryLeakDebugger("api-server")
|
||
result = await debugger.debug_memory_leak()
|
||
|
||
print("\n=== Debug Complete ===")
|
||
print(f"Root cause: {result['root_cause']['leak_location']}")
|
||
print(f"Fix strategy: {result['fix']['strategy']}")
|
||
print(f"\nNext steps:")
|
||
for i, step in enumerate(result['validation_plan']['steps'][:3], 1):
|
||
print(f" {i}. {step}")
|
||
|
||
if __name__ == "__main__":
|
||
asyncio.run(main())
|
||
```
|
||
|
||
## Reference Workflows
|
||
|
||
### Reference 1: Cursor IDE Time-Travel Debugging
|
||
|
||
Complete workflow for debugging state management bug using Cursor IDE's AI features and time-travel debugging:
|
||
|
||
1. **Initial Problem Identification**
|
||
- User reports: "Shopping cart shows wrong item count after page refresh"
|
||
- Reproduction rate: 15% of page refreshes
|
||
- Environment: React SPA with Redux state management
|
||
|
||
2. **AI-Assisted Code Analysis** (Cursor IDE)
|
||
- Use Cursor's "Explain this code" on CartReducer.ts
|
||
- AI identifies complex state update logic with 3 nested reducers
|
||
- Suggests potential race condition in async state hydration
|
||
|
||
3. **Time-Travel Debugging Setup** (Redux DevTools)
|
||
- Install Redux DevTools Extension with time-travel capability
|
||
- Add state serialization for replay
|
||
- Configure Redux store with DevTools enhancer
|
||
- Add state snapshot middleware
|
||
|
||
4. **Reproduction with Recording**
|
||
- Enable Redux DevTools recording
|
||
- Reproduce the bug (multiple attempts)
|
||
- Export state dump when bug occurs
|
||
- Save action timeline for analysis
|
||
|
||
5. **Time-Travel Analysis**
|
||
- Load saved state dump in DevTools
|
||
- Scrub through action timeline
|
||
- Identify moment where state diverges
|
||
- Use Cursor AI to analyze action sequence
|
||
- AI identifies: "State hydration dispatches before localStorage read completes"
|
||
|
||
6. **Root Cause Confirmation**
|
||
- Add breakpoints in async hydration logic
|
||
- Step through with Cursor's debug panel
|
||
- Confirm race condition: hydration action dispatches too early
|
||
- localStorage read hasn't completed yet
|
||
|
||
7. **AI-Generated Fix** (Cursor IDE)
|
||
- Ask Cursor: "Fix race condition in cart hydration"
|
||
- AI suggests: Add Promise wrapper and await localStorage read
|
||
- Review generated fix code
|
||
- Accept fix with modifications
|
||
|
||
8. **Validation with Time-Travel**
|
||
- Apply fix locally
|
||
- Replay saved action sequence with fixed code
|
||
- Verify state remains consistent through replay
|
||
- Test with 100 rapid page refreshes - no failures
|
||
|
||
9. **Automated Test Generation** (Cursor AI)
|
||
- Ask Cursor: "Generate test for cart hydration race condition"
|
||
- AI creates test that reproduces original race condition
|
||
- Test fails on old code, passes on fixed code
|
||
- Add test to suite
|
||
|
||
10. **Deployment and Monitoring**
|
||
- Deploy fix with feature flag
|
||
- Monitor cart error rates in Sentry
|
||
- Enable for 100% after 24 hours with no regressions
|
||
|
||
### Reference 2: Production Debugging with Distributed Tracing
|
||
|
||
Complete workflow for debugging cross-service latency issue:
|
||
|
||
1. **Alert Triggered**
|
||
- DataDog alert: "P95 latency for /api/recommendations endpoint > 2s"
|
||
- Affected: 5% of requests
|
||
- Pattern: Intermittent, no clear time correlation
|
||
|
||
2. **Honeycomb Query-Driven Investigation**
|
||
- Query: `WHERE endpoint = "/api/recommendations" AND duration_ms > 2000`
|
||
- BREAKDOWN by user_id, device_type, region
|
||
- Identifies: All slow requests from specific region (us-east-2)
|
||
|
||
3. **Distributed Trace Analysis**
|
||
- Examine full trace for slow request
|
||
- Service call chain: API → Auth → User Service → ML Service → Recommendations
|
||
- ML Service span shows 1.8s latency
|
||
- Most time in "model inference" operation
|
||
|
||
4. **Cross-Service Correlation**
|
||
- Query ML Service logs for same trace ID
|
||
- Correlate with GPU utilization metrics in Grafana
|
||
- Discover: GPU memory contention during specific hours
|
||
|
||
5. **AI-Assisted Pattern Recognition** (Claude Code)
|
||
- Feed trace data to Claude: "Analyze this latency pattern"
|
||
- AI identifies: Correlation with batch inference jobs
|
||
- Batch jobs scheduled every 30 minutes
|
||
- Cause resource contention with real-time inference
|
||
|
||
6. **Hypothesis Formation**
|
||
- Primary: Batch jobs starve real-time inference of GPU resources
|
||
- Secondary: Model loading delay when GPU busy
|
||
- Test: Disable batch jobs and monitor latency
|
||
|
||
7. **Safe Production Testing**
|
||
- Feature flag to disable batch jobs in us-east-2 only
|
||
- Monitor for 1 hour
|
||
- Result: P95 latency drops to 350ms (from 2.1s)
|
||
- Hypothesis confirmed
|
||
|
||
8. **Solution Design** (AI-Assisted)
|
||
- Claude suggests: Separate GPU pools for batch vs real-time
|
||
- Alternative: Priority-based scheduling in ML framework
|
||
- Decision: Implement priority scheduling (faster, less infrastructure)
|
||
|
||
9. **Implementation**
|
||
- Add priority queue to ML inference service
|
||
- Real-time requests: high priority
|
||
- Batch requests: low priority
|
||
- Deploy to staging, load test confirms fix
|
||
|
||
10. **Gradual Rollout with Validation**
|
||
- Deploy to us-east-2 with 10% traffic
|
||
- Monitor latency, error rate, GPU utilization
|
||
- Roll out to 100% us-east-2
|
||
- Roll out to all regions over 48 hours
|
||
- Final result: P95 latency 320ms, no increased error rate
|
||
|
||
11. **Post-Incident Review**
|
||
- Document root cause in knowledge base
|
||
- Add synthetic monitoring for GPU contention
|
||
- Create alert for priority queue backlog
|
||
- Update ML service runbook with troubleshooting steps
|
||
|
||
---
|
||
|
||
Issue to debug: $ARGUMENTS
|