mirror of https://github.com/wshobson/agents.git
synced 2026-03-18 17:47:16 +00:00
style: format all files with prettier
@@ -19,12 +19,14 @@ The analysis scope may include specific error messages, stack traces, log files,
 Classify errors into these categories to inform your debugging strategy:

 **By Severity:**
+
 - **Critical**: System down, data loss, security breach, complete service unavailability
 - **High**: Major feature broken, significant user impact, data corruption risk
 - **Medium**: Partial feature degradation, workarounds available, performance issues
 - **Low**: Minor bugs, cosmetic issues, edge cases with minimal impact

 **By Type:**
+
 - **Runtime Errors**: Exceptions, crashes, segmentation faults, null pointer dereferences
 - **Logic Errors**: Incorrect behavior, wrong calculations, invalid state transitions
 - **Integration Errors**: API failures, network timeouts, external service issues
@@ -33,6 +35,7 @@ Classify errors into these categories to inform your debugging strategy:
 - **Security Errors**: Authentication failures, authorization violations, injection attempts

 **By Observability:**
+
 - **Deterministic**: Consistently reproducible with known inputs
 - **Intermittent**: Occurs sporadically, often timing or race condition related
 - **Environmental**: Only happens in specific environments or configurations
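
The classification axes above can be captured in a small data model so that triage order is computed rather than argued about. This is a minimal sketch, not part of the commit: the names `Severity`, `Observability`, `ClassifiedError`, and `triage_order` are hypothetical, and ranking deterministic errors first within a severity level is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Severity(IntEnum):
    # Lower value = more urgent, so an ascending sort puts CRITICAL first.
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


class Observability(Enum):
    DETERMINISTIC = "deterministic"
    INTERMITTENT = "intermittent"
    ENVIRONMENTAL = "environmental"


@dataclass(frozen=True)
class ClassifiedError:
    summary: str
    severity: Severity
    error_type: str  # e.g. "runtime", "logic", "integration", "security"
    observability: Observability


def triage_order(errors):
    """Sort by severity; within a severity, deterministic errors come first.

    Putting deterministic errors first is an illustrative assumption:
    they are usually the cheapest to reproduce.
    """
    return sorted(
        errors,
        key=lambda e: (e.severity, e.observability is not Observability.DETERMINISTIC),
    )
```

Using an `IntEnum` for severity keeps the ordering explicit and lets the classification feed directly into sorting or alert routing.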
@@ -106,6 +109,7 @@ For errors in microservices and distributed systems:
 Extract maximum information from stack traces:

 **Key Elements:**
+
 - **Error Type**: What kind of exception/error occurred
 - **Error Message**: Contextual information about the failure
 - **Origin Point**: The deepest frame where the error was thrown
@@ -114,6 +118,7 @@ Extract maximum information from stack traces:
 - **Async Boundaries**: Identify where asynchronous operations break the trace

 **Analysis Strategy:**
+
 1. Start at the top of the stack (origin of error)
 2. Identify the first frame in your application code (not framework/library)
 3. Examine that frame's context: input parameters, local variables, state
@@ -134,28 +139,34 @@ Modern error tracking tools provide enhanced stack traces:
 ### Common Stack Trace Patterns

 **Pattern: Null Pointer Exception Deep in Framework Code**
+
 ```
 NullPointerException
     at java.util.HashMap.hash(HashMap.java:339)
     at java.util.HashMap.get(HashMap.java:556)
     at com.myapp.service.UserService.findUser(UserService.java:45)
 ```
+
 Root Cause: Application passed null to framework code. Focus on UserService.java:45.
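
Finding the first application frame, as in the pattern above, can be automated. The following is a rough sketch, assuming Java-style frames of the form `at package.Class.method(File.java:NN)` and an application package prefix of `com.myapp.` (taken from the example trace); the helper name `first_app_frame` is hypothetical. Frames printed without a parenthesized source location are not matched by this pattern.

```python
import re

# Matches frames such as
# "at com.myapp.service.UserService.findUser(UserService.java:45)".
FRAME_RE = re.compile(
    r"^\s*at\s+(?P<method>[\w.$<>]+)\((?P<file>[^:()]+)(?::(?P<line>\d+))?\)"
)


def first_app_frame(trace, app_prefix="com.myapp."):
    """Return (method, file, line) for the first frame under app_prefix, or None."""
    for raw in trace.splitlines():
        match = FRAME_RE.match(raw)
        if match and match.group("method").startswith(app_prefix):
            line = int(match.group("line")) if match.group("line") else None
            return match.group("method"), match.group("file"), line
    return None


trace = """NullPointerException
    at java.util.HashMap.hash(HashMap.java:339)
    at java.util.HashMap.get(HashMap.java:556)
    at com.myapp.service.UserService.findUser(UserService.java:45)"""

print(first_app_frame(trace))
```

For the trace shown, the helper skips both `java.util.HashMap` frames and returns the `UserService.findUser` frame at line 45, which is where the investigation should start.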

 **Pattern: Timeout After Long Wait**
+
 ```
 TimeoutException: Operation timed out after 30000ms
     at okhttp3.internal.http2.Http2Stream.waitForIo
     at com.myapp.api.PaymentClient.processPayment(PaymentClient.java:89)
 ```
+
 Root Cause: External service slow/unresponsive. Need retry logic and circuit breaker.

 **Pattern: Race Condition in Concurrent Code**
+
 ```
 ConcurrentModificationException
     at java.util.ArrayList$Itr.checkForComodification
     at com.myapp.processor.BatchProcessor.process(BatchProcessor.java:112)
 ```
+
 Root Cause: Collection modified while being iterated. Need thread-safe data structures or synchronization.
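
The same failure mode and one common remedy can be shown in a few lines. This sketch uses Python rather than Java, so it is an analogy and not a reproduction of the trace above: mutating a `dict` during iteration raises `RuntimeError`, similar in spirit to Java's fail-fast iterators. The `BatchQueue` class and `demo_unsafe` function are hypothetical names; the remedy shown is iterating over a copy taken under a lock.

```python
import threading


class BatchQueue:
    """Hypothetical container that can be iterated while other threads add items."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def add(self, item):
        with self._lock:
            self._items.append(item)

    def snapshot(self):
        # Copy under the lock; callers iterate the copy, never the live list.
        with self._lock:
            return list(self._items)


def demo_unsafe():
    """Mutate a dict while iterating it and return the resulting error message."""
    data = {"a": 1, "b": 2}
    try:
        for key in data:
            data[key + "_copy"] = data[key]
    except RuntimeError as exc:
        return str(exc)
    return None


print(demo_unsafe())
```

Snapshot iteration trades a copy per pass for freedom from locking during processing; for large collections a concurrent data structure may be the better fit.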

 ## Log Aggregation and Pattern Matching
@@ -165,6 +176,7 @@ Root Cause: Collection modified while being iterated. Need thread-safe data stru
 Implement JSON-based structured logging for machine-readable logs:

 **Standard Log Schema:**
+
 ```json
 {
   "timestamp": "2025-10-11T14:23:45.123Z",
@@ -203,6 +215,7 @@ Implement JSON-based structured logging for machine-readable logs:
 ```

 **Key Fields to Always Include:**
+
 - `timestamp`: ISO 8601 format in UTC
 - `level`: ERROR, WARN, INFO, DEBUG, TRACE
 - `correlation_id`: Unique ID for the entire request chain
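
A record carrying the fields listed above can be built with the standard library alone. This is a minimal sketch: the function name `build_log_record`, the `message` field, and generating a fresh UUID when no correlation ID is supplied are assumptions for illustration, and only the fields visible in this hunk are covered.

```python
import json
import uuid
from datetime import datetime, timezone

ALLOWED_LEVELS = {"ERROR", "WARN", "INFO", "DEBUG", "TRACE"}


def build_log_record(level, message, correlation_id=None, **context):
    """Return a JSON string with the always-present fields plus extra context."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unknown level: {level}")
    ts = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    record = {
        # ISO 8601 in UTC, with the "+00:00" offset written as "Z".
        "timestamp": ts.replace("+00:00", "Z"),
        "level": level,
        # Assumption: start a new chain when the caller has no ID to propagate.
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "message": message,
        **context,
    }
    return json.dumps(record)


print(build_log_record("ERROR", "payment failed", correlation_id="abc-123", user_id=7))
```

Rejecting unknown levels at the call site keeps the `level` field queryable downstream, since aggregators can then rely on a closed set of values.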
@@ -216,48 +229,52 @@ Implement JSON-based structured logging for machine-readable logs:
 Implement correlation IDs to track requests across distributed systems:

 **Node.js/Express Middleware:**
+
 ```javascript
-const { v4: uuidv4 } = require('uuid');
-const asyncLocalStorage = require('async-local-storage');
+const { v4: uuidv4 } = require("uuid");
+const asyncLocalStorage = require("async-local-storage");

 // Middleware to generate/propagate correlation ID
 function correlationIdMiddleware(req, res, next) {
-  const correlationId = req.headers['x-correlation-id'] || uuidv4();
+  const correlationId = req.headers["x-correlation-id"] || uuidv4();
   req.correlationId = correlationId;
-  res.setHeader('x-correlation-id', correlationId);
+  res.setHeader("x-correlation-id", correlationId);

   // Store in async context for access in nested calls
   asyncLocalStorage.run(new Map(), () => {
-    asyncLocalStorage.set('correlationId', correlationId);
+    asyncLocalStorage.set("correlationId", correlationId);
     next();
   });
 }

 // Propagate to downstream services
 function makeApiCall(url, data) {
-  const correlationId = asyncLocalStorage.get('correlationId');
+  const correlationId = asyncLocalStorage.get("correlationId");
   return axios.post(url, data, {
     headers: {
-      'x-correlation-id': correlationId,
-      'x-source-service': 'api-gateway'
-    }
+      "x-correlation-id": correlationId,
+      "x-source-service": "api-gateway",
+    },
   });
 }

 // Include in all log statements
 function log(level, message, context = {}) {
-  const correlationId = asyncLocalStorage.get('correlationId');
-  console.log(JSON.stringify({
-    timestamp: new Date().toISOString(),
-    level,
-    correlation_id: correlationId,
-    message,
-    ...context
-  }));
+  const correlationId = asyncLocalStorage.get("correlationId");
+  console.log(
+    JSON.stringify({
+      timestamp: new Date().toISOString(),
+      level,
+      correlation_id: correlationId,
+      message,
+      ...context,
+    }),
+  );
 }
 ```

 **Python/Flask Implementation:**
+
 ```python
 import uuid
 import logging
@@ -302,6 +319,7 @@ def log_structured(level, message, **context):
 ### Log Aggregation Architecture

 **Centralized Logging Pipeline:**
+
 1. **Application**: Outputs structured JSON logs to stdout/stderr
 2. **Log Shipper**: Fluentd/Fluent Bit/Vector collects logs from containers
 3. **Log Aggregator**: Elasticsearch/Loki/DataDog receives and indexes logs
@@ -309,6 +327,7 @@ def log_structured(level, message, **context):
 5. **Alerting**: Trigger alerts on error patterns and thresholds

 **Log Query Examples (Elasticsearch DSL):**
+
 ```json
 // Find all errors for a specific correlation ID
 {
@@ -382,6 +401,7 @@ Use log analysis to identify patterns:
 For deterministic errors in development:

 **Debugger Setup:**
+
 1. Set breakpoint before the error occurs
 2. Step through code execution line by line
 3. Inspect variable values and object state
@@ -390,6 +410,7 @@ For deterministic errors in development:
 6. Modify variables to test hypotheses

 **Modern Debugging Tools:**
+
 - **VS Code Debugger**: Integrated debugging for JavaScript, Python, Go, Java, C++
 - **Chrome DevTools**: Frontend debugging with network, performance, and memory profiling
 - **pdb/ipdb (Python)**: Interactive debugger with post-mortem analysis
@@ -412,6 +433,7 @@ For errors in production environments where debuggers aren't available:
 8. **Traffic Mirroring**: Replay production traffic in staging for safe investigation

 **Remote Debugging (Use Cautiously):**
+
 - Attach debugger to running process only in non-critical services
 - Use read-only breakpoints that don't pause execution
 - Time-box debugging sessions strictly
@@ -420,10 +442,11 @@ For errors in production environments where debuggers aren't available:
 ### Memory and Performance Debugging

 **Memory Leak Detection:**
+
 ```javascript
 // Node.js heap snapshot comparison
-const v8 = require('v8');
-const fs = require('fs');
+const v8 = require("v8");
+const fs = require("fs");

 function takeHeapSnapshot(filename) {
   const snapshot = v8.writeHeapSnapshot(filename);
@@ -431,15 +454,16 @@ function takeHeapSnapshot(filename) {
 }

 // Take snapshots at intervals
-takeHeapSnapshot('heap-before.heapsnapshot');
+takeHeapSnapshot("heap-before.heapsnapshot");
 // ... run operations that might leak ...
-takeHeapSnapshot('heap-after.heapsnapshot');
+takeHeapSnapshot("heap-after.heapsnapshot");

 // Analyze in Chrome DevTools Memory profiler
 // Look for objects with increasing retained size
 ```
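
The before/after snapshot comparison shown above for Node.js has a close counterpart in Python's standard `tracemalloc` module. The sketch below is illustrative only: `measure_growth`, `suspicious_operation`, and `leaky_cache` are hypothetical names, and the "leak" is simulated by deliberately retaining about 1 MB in a module-level list.

```python
import tracemalloc


def measure_growth(operation, top_n=3):
    """Run operation between two snapshots and return the largest allocation diffs."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = operation()  # keep the result alive while the second snapshot is taken
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to returns StatisticDiff entries, largest absolute size change first.
    stats = after.compare_to(before, "lineno")
    return result, stats[:top_n]


leaky_cache = []


def suspicious_operation():
    # Deliberately retains memory to stand in for a leak.
    leaky_cache.extend(bytearray(1024) for _ in range(1000))
    return len(leaky_cache)


_, top = measure_growth(suspicious_operation)
for stat in top:
    print(stat)
```

Grouping by `"lineno"` points at the allocating source line, which plays the same role as sorting by retained size in the Chrome DevTools comparison view.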

 **Performance Profiling:**
+
 ```python
 # Python profiling with cProfile
 import cProfile
@@ -465,6 +489,7 @@ def profile_function():
 ### Input Validation and Type Safety

 **Defensive Programming:**
+
 ```typescript
 // TypeScript: Leverage type system for compile-time safety
 interface PaymentRequest {
@@ -477,19 +502,19 @@ interface PaymentRequest {
 function processPayment(request: PaymentRequest): PaymentResult {
   // Runtime validation for external inputs
   if (request.amount <= 0) {
-    throw new ValidationError('Amount must be positive');
+    throw new ValidationError("Amount must be positive");
   }

-  if (!['USD', 'EUR', 'GBP'].includes(request.currency)) {
-    throw new ValidationError('Unsupported currency');
+  if (!["USD", "EUR", "GBP"].includes(request.currency)) {
+    throw new ValidationError("Unsupported currency");
   }

   // Use Zod or Yup for complex validation
   const schema = z.object({
     amount: z.number().positive().max(1000000),
-    currency: z.enum(['USD', 'EUR', 'GBP']),
+    currency: z.enum(["USD", "EUR", "GBP"]),
     customerId: z.string().uuid(),
-    paymentMethodId: z.string().min(1)
+    paymentMethodId: z.string().min(1),
   });

   const validated = schema.parse(request);
@@ -500,6 +525,7 @@ function processPayment(request: PaymentRequest): PaymentResult {
 ```

 **Python Type Hints and Validation:**
+
 ```python
 from typing import Optional
 from pydantic import BaseModel, validator, Field
@@ -532,6 +558,7 @@ def process_payment(request: PaymentRequest) -> PaymentResult:
 ### Error Boundaries and Graceful Degradation

 **React Error Boundaries:**
+
 ```typescript
 import React, { Component, ErrorInfo, ReactNode } from 'react';
 import * as Sentry from '@sentry/react';
@@ -589,6 +616,7 @@ export default ErrorBoundary;
 ```

 **Circuit Breaker Pattern:**
+
 ```python
 from datetime import datetime, timedelta
 from enum import Enum
@@ -672,8 +700,8 @@ async function retryWithBackoff<T>(
     maxAttempts: 3,
     baseDelayMs: 1000,
     maxDelayMs: 30000,
-    exponentialBase: 2
-  }
+    exponentialBase: 2,
+  },
 ): Promise<T> {
   let lastError: Error;

@@ -684,23 +712,27 @@ async function retryWithBackoff<T>(
       lastError = error as Error;

       // Check if error is retryable
-      if (options.retryableErrors &&
-          !options.retryableErrors.includes(error.name)) {
+      if (
+        options.retryableErrors &&
+        !options.retryableErrors.includes(error.name)
+      ) {
         throw error; // Don't retry non-retryable errors
       }

       if (attempt < options.maxAttempts - 1) {
         const delay = Math.min(
           options.baseDelayMs * Math.pow(options.exponentialBase, attempt),
-          options.maxDelayMs
+          options.maxDelayMs,
         );

         // Add jitter to prevent thundering herd
         const jitter = Math.random() * 0.1 * delay;
         const actualDelay = delay + jitter;

-        console.log(`Attempt ${attempt + 1} failed, retrying in ${actualDelay}ms`);
-        await new Promise(resolve => setTimeout(resolve, actualDelay));
+        console.log(
+          `Attempt ${attempt + 1} failed, retrying in ${actualDelay}ms`,
+        );
+        await new Promise((resolve) => setTimeout(resolve, actualDelay));
       }
     }
   }
@@ -710,14 +742,14 @@ async function retryWithBackoff<T>(

 // Usage
 const result = await retryWithBackoff(
-  () => fetch('https://api.example.com/data'),
+  () => fetch("https://api.example.com/data"),
   {
     maxAttempts: 3,
     baseDelayMs: 1000,
     maxDelayMs: 10000,
     exponentialBase: 2,
-    retryableErrors: ['NetworkError', 'TimeoutError']
-  }
+    retryableErrors: ["NetworkError", "TimeoutError"],
+  },
 );
 ```
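
To make the delay formula in the retry helper concrete, the schedule can be computed without any network calls. This sketch mirrors `min(baseDelayMs * exponentialBase^attempt, maxDelayMs)` from the code above; the function name `backoff_delays` is hypothetical, and it assumes `attempt` counts from 0, which the visible `attempt + 1` log message and `attempt < maxAttempts - 1` check suggest but the elided loop header does not confirm. Jitter is left out so the values are deterministic.

```python
def backoff_delays(max_attempts, base_delay_ms, max_delay_ms, exponential_base=2):
    """Delays in ms (before jitter) slept after each failed attempt except the last."""
    return [
        min(base_delay_ms * exponential_base**attempt, max_delay_ms)
        for attempt in range(max_attempts - 1)
    ]


# Options from the usage example: 3 attempts, 1000 ms base, 10000 ms cap.
print(backoff_delays(3, 1000, 10000))  # [1000, 2000]

# With more attempts the cap takes effect: 16000 is clamped to 10000.
print(backoff_delays(6, 1000, 10000))  # [1000, 2000, 4000, 8000, 10000]
```

With the 10% jitter applied in the helper, each actual sleep falls between the listed delay and 1.1 times that value.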

@@ -726,6 +758,7 @@ const result = await retryWithBackoff(
 ### Modern Observability Stack (2025)

 **Recommended Architecture:**
+
 - **Metrics**: Prometheus + Grafana or DataDog
 - **Logs**: Elasticsearch/Loki + Fluentd or DataDog Logs
 - **Traces**: OpenTelemetry + Jaeger/Tempo or DataDog APM
@@ -736,9 +769,10 @@ const result = await retryWithBackoff(
 ### Sentry Integration

 **Node.js/Express Setup:**
+
 ```javascript
-const Sentry = require('@sentry/node');
-const { ProfilingIntegration } = require('@sentry/profiling-node');
+const Sentry = require("@sentry/node");
+const { ProfilingIntegration } = require("@sentry/profiling-node");

 Sentry.init({
   dsn: process.env.SENTRY_DSN,
@@ -766,11 +800,11 @@ Sentry.init({
     event.tags = {
       ...event.tags,
       region: process.env.AWS_REGION,
-      instance_id: process.env.INSTANCE_ID
+      instance_id: process.env.INSTANCE_ID,
     };

     return event;
-  }
+  },
 });

 // Express middleware
@@ -790,19 +824,19 @@ function processOrder(orderId) {
   } catch (error) {
     Sentry.captureException(error, {
       tags: {
-        operation: 'process_order',
-        order_id: orderId
+        operation: "process_order",
+        order_id: orderId,
       },
       contexts: {
         order: {
           id: orderId,
           status: order?.status,
-          amount: order?.amount
-        }
+          amount: order?.amount,
+        },
       },
       user: {
-        id: order?.customerId
-      }
+        id: order?.customerId,
+      },
     });
     throw error;
   }
@@ -812,6 +846,7 @@ function processOrder(orderId) {
 ### DataDog APM Integration

 **Python/Flask Setup:**
+
 ```python
 from ddtrace import patch_all, tracer
 from ddtrace.contrib.flask import TraceMiddleware
@@ -854,6 +889,7 @@ def charge_payment():
 ### OpenTelemetry Implementation

 **Go Service with OpenTelemetry:**
+
 ```go
 package main

@@ -968,7 +1004,7 @@ monitors:

   - name: "New Error Type Detected"
     type: log
-    query: "logs(\"level:ERROR service:payment-service\").rollup(\"count\").by(\"error.fingerprint\").last(\"5m\") > 0"
+    query: 'logs("level:ERROR service:payment-service").rollup("count").by("error.fingerprint").last("5m") > 0'
     message: |
       New error type detected in payment service: {{error.fingerprint}}

@@ -1001,6 +1037,7 @@ monitors:
 ### Incident Response Workflow

 **Phase 1: Detection and Triage (0-5 minutes)**
+
 1. Acknowledge the alert/incident
 2. Check incident severity and user impact
 3. Assign incident commander
@@ -1008,6 +1045,7 @@ monitors:
 5. Update status page if customer-facing

 **Phase 2: Investigation (5-30 minutes)**
+
 1. Gather observability data:
    - Error rates from Sentry/DataDog
    - Traces showing failed requests
@@ -1022,6 +1060,7 @@ monitors:
 4. Document findings in incident log

 **Phase 3: Mitigation (Immediate)**
+
 1. Implement immediate fix based on hypothesis:
    - Rollback recent deployment
    - Scale up resources
@@ -1032,6 +1071,7 @@ monitors:
 3. Monitor for 15-30 minutes to ensure stability

 **Phase 4: Recovery and Validation**
+
 1. Verify all systems operational
 2. Check data consistency
 3. Process queued/failed requests
@@ -1039,6 +1079,7 @@ monitors:
 5. Notify stakeholders

 **Phase 5: Post-Incident Review**
+
 1. Schedule postmortem within 48 hours
 2. Create detailed timeline of events
 3. Identify root cause (may differ from initial hypothesis)
@@ -1090,6 +1131,7 @@ GET /logs-*/_search
 ### Communication Templates

 **Initial Incident Notification:**
+
 ```
 🚨 INCIDENT: Payment Processing Errors

@@ -1113,6 +1155,7 @@ Status Page: https://status.company.com/incident/abc123
 ```

 **Mitigation Notification:**
+
 ```
 ✅ INCIDENT UPDATE: Mitigation Applied
