mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
* Add extra python skills covering code style, design patterns, resilience, resource management, testing patterns, and type safety ...etc * fix: correct code examples in Python skills - Clarify Python version requirements for type statement (3.10+ vs 3.12+) - Add missing ValidationError import in configuration example - Add missing httpx import and url parameter in async example --------- Co-authored-by: Seth Hobson <wshobson@gmail.com>
---
name: python-resilience
description: Python resilience patterns including automatic retries, exponential backoff, timeouts, and fault-tolerant decorators. Use when adding retry logic, implementing timeouts, building fault-tolerant services, or handling transient failures.
---

# Python Resilience Patterns

Build fault-tolerant Python applications that gracefully handle transient failures, network issues, and service outages. Resilience patterns keep systems running when dependencies are unreliable.

## When to Use This Skill

- Adding retry logic to external service calls
- Implementing timeouts for network operations
- Building fault-tolerant microservices
- Handling rate limiting and backpressure
- Creating infrastructure decorators
- Designing circuit breakers

## Core Concepts

### 1. Transient vs Permanent Failures

Retry transient errors (network timeouts, temporary service issues). Don't retry permanent errors (invalid credentials, bad requests).

### 2. Exponential Backoff

Increase the wait time between retries, typically doubling it after each failed attempt, to avoid overwhelming a recovering service.

### 3. Jitter

Add randomness to each backoff delay to prevent a thundering herd when many clients retry simultaneously.

### 4. Bounded Retries

Cap both attempt count and total duration to prevent infinite retry loops.

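To make the backoff and jitter concepts concrete, here is a minimal stdlib-only sketch. The helper names `backoff_delay` and `backoff_ceilings` and their default values are hypothetical, and the "full jitter" variant shown (a uniform draw between zero and the exponential ceiling) is one common choice, not necessarily the exact formula a given retry library uses.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a full-jitter delay in seconds for a 1-based attempt number."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    rng = rng or random.Random()
    # Exponential ceiling: base, 2*base, 4*base, ... capped at `cap`.
    ceiling = min(cap, base * 2 ** (attempt - 1))
    # Full jitter: pick uniformly between 0 and the ceiling.
    return rng.uniform(0.0, ceiling)


def backoff_ceilings(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Upper bound of the delay for each attempt (worst-case wait per retry)."""
    return [min(cap, base * 2 ** (n - 1)) for n in range(1, attempts + 1)]


print(backoff_ceilings(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Summing the ceilings gives a rough worst-case total wait, which helps when choosing a bound on total retry duration.
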
## Quick Start

```python
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

# With no `retry=` argument, tenacity retries on any exception.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential_jitter(initial=1, max=10),
)
def call_external_service(request: dict) -> dict:
    return httpx.post("https://api.example.com", json=request).json()
```

## Fundamental Patterns

### Pattern 1: Basic Retry with Tenacity

Use the `tenacity` library for production-grade retry logic. For simpler cases, consider the retry support built into your HTTP client or SDK, or a lightweight custom implementation.

```python
import httpx
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    stop_after_delay,
    wait_exponential_jitter,
)

# httpx raises its own exception types, which do not subclass the built-in
# ConnectionError/TimeoutError, so they are listed explicitly. A bare OSError
# is left out because it also matches permanent errors such as FileNotFoundError.
TRANSIENT_ERRORS = (
    ConnectionError,
    TimeoutError,
    httpx.NetworkError,
    httpx.TimeoutException,
)

@retry(
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    stop=stop_after_attempt(5) | stop_after_delay(60),
    wait=wait_exponential_jitter(initial=1, max=30),
)
def fetch_data(url: str) -> dict:
    """Fetch data with automatic retry on transient failures."""
    response = httpx.get(url, timeout=30)
    # HTTPStatusError is not in TRANSIENT_ERRORS, so bad statuses are not retried here.
    response.raise_for_status()
    return response.json()
```

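For cases where pulling in a library is not warranted, a lightweight custom implementation might look like the sketch below. It uses only the standard library; the name `simple_retry`, its parameters, and the injectable `sleep` argument are all hypothetical choices made for illustration, and the jitter formula is the simple "full jitter" variant rather than anything tenacity-specific.

```python
import random
import time
from functools import wraps


def simple_retry(exceptions, max_attempts=3, max_elapsed=60.0,
                 base=1.0, cap=30.0, sleep=time.sleep):
    """Hypothetical stdlib-only retry decorator with capped backoff and jitter."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            attempt = 1
            while True:
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    elapsed = time.monotonic() - start
                    # Give up once either bound is reached and re-raise the last error.
                    if attempt >= max_attempts or elapsed >= max_elapsed:
                        raise
                    ceiling = min(cap, base * 2 ** (attempt - 1))
                    sleep(random.uniform(0, ceiling))
                    attempt += 1
        return wrapper
    return decorator


# Usage: fail twice, then succeed. Delays are recorded instead of slept.
calls = []
delays = []

@simple_retry((ConnectionError,), max_attempts=3, sleep=delays.append)
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = flaky()
```

Passing `sleep` as a parameter keeps tests fast: the usage above records the computed delays in a list rather than actually waiting.
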
### Pattern 2: Retry Only Appropriate Errors

Whitelist specific transient exceptions. Never retry:

- `ValueError`, `TypeError` - These are bugs, not transient issues
- `AuthenticationError` - Invalid credentials won't become valid
- HTTP 4xx errors (except 429) - Client errors are permanent

```python
import httpx
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

# Define what's retryable
RETRYABLE_EXCEPTIONS = (
    ConnectionError,
    TimeoutError,
    httpx.ConnectTimeout,
    httpx.ReadTimeout,
)

@retry(
    retry=retry_if_exception_type(RETRYABLE_EXCEPTIONS),
    stop=stop_after_attempt(3),
    wait=wait_exponential_jitter(initial=1, max=10),
)
def resilient_api_call(endpoint: str) -> dict:
    """Make API call with retry on network issues."""
    return httpx.get(endpoint, timeout=10).json()
```

### Pattern 3: HTTP Status Code Retries

Retry specific HTTP status codes that indicate transient issues.

```python
import httpx
from tenacity import (
    retry,
    retry_if_result,
    stop_after_attempt,
    wait_exponential_jitter,
)

RETRY_STATUS_CODES = {429, 502, 503, 504}

def should_retry_response(response: httpx.Response) -> bool:
    """Check if response indicates a retryable error."""
    return response.status_code in RETRY_STATUS_CODES

@retry(
    retry=retry_if_result(should_retry_response),
    stop=stop_after_attempt(3),
    wait=wait_exponential_jitter(initial=1, max=10),
)
def http_request(method: str, url: str, **kwargs) -> httpx.Response:
    """Make HTTP request with retry on transient status codes."""
    # If every attempt returns a retryable status, tenacity raises RetryError.
    return httpx.request(method, url, timeout=30, **kwargs)
```

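Responses with status 429 or 503 sometimes carry a `Retry-After` header telling the client how long to wait. The sketch below parses that header with the standard library only, assuming the server sends either of the two standard forms (a number of seconds or an HTTP-date). The function name `retry_after_seconds` is hypothetical, and how the result feeds into a retry policy is left open.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After header value into a non-negative delay in seconds.

    Returns None when the header is missing or cannot be parsed.
    """
    if not header_value:
        return None
    value = header_value.strip()
    # Form 1: a delay in whole seconds, e.g. "120".
    if value.isdecimal():
        return float(value)
    # Form 2: an HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        # Treat a date without an offset as UTC.
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry immediately".
    return max(0.0, (target - now).total_seconds())
```

A caller would typically take the larger of this value and its own backoff delay, and still enforce its overall retry bounds.
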
### Pattern 4: Combined Exception and Status Retry

Handle both network exceptions and HTTP status codes.

```python
from tenacity import (
    retry,
    retry_if_exception_type,
    retry_if_result,
    stop_after_attempt,
    wait_exponential_jitter,
    before_sleep_log,
)
import logging
import httpx

logger = logging.getLogger(__name__)

TRANSIENT_EXCEPTIONS = (
    ConnectionError,
    TimeoutError,
    httpx.ConnectError,
    httpx.ReadTimeout,
)
RETRY_STATUS_CODES = {429, 500, 502, 503, 504}

def is_retryable_response(response: httpx.Response) -> bool:
    return response.status_code in RETRY_STATUS_CODES

@retry(
    retry=(
        retry_if_exception_type(TRANSIENT_EXCEPTIONS) |
        retry_if_result(is_retryable_response)
    ),
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=30),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def robust_http_call(
    method: str,
    url: str,
    **kwargs,
) -> httpx.Response:
    """HTTP call with comprehensive retry handling."""
    return httpx.request(method, url, timeout=30, **kwargs)
```

## Advanced Patterns

### Pattern 5: Logging Retry Attempts

Track retry behavior for debugging and alerting.

```python
from tenacity import retry, stop_after_attempt, wait_exponential
import structlog

logger = structlog.get_logger()

def log_retry_attempt(retry_state):
    """Log detailed retry information."""
    exception = retry_state.outcome.exception()
    logger.warning(
        "Retrying operation",
        attempt=retry_state.attempt_number,
        exception_type=type(exception).__name__,
        exception_message=str(exception),
        next_wait_seconds=retry_state.next_action.sleep if retry_state.next_action else None,
    )

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
    before_sleep=log_retry_attempt,
)
def call_with_logging(request: dict) -> dict:
    """External call with retry logging."""
    ...
```

### Pattern 6: Timeout Decorator

Create reusable timeout decorators for consistent timeout handling.

```python
import asyncio
from collections.abc import Awaitable, Callable
from functools import wraps
from typing import TypeVar

import httpx

T = TypeVar("T")

def with_timeout(seconds: float):
    """Decorator to add timeout to async functions."""
    def decorator(func: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        @wraps(func)
        async def wrapper(*args, **kwargs) -> T:
            # On expiry, wait_for cancels the call and raises asyncio.TimeoutError.
            return await asyncio.wait_for(
                func(*args, **kwargs),
                timeout=seconds,
            )
        return wrapper
    return decorator

@with_timeout(30)
async def fetch_with_timeout(url: str) -> dict:
    """Fetch URL with 30 second timeout."""
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.json()
```

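The behavior of `asyncio.wait_for` on expiry can be seen in a small self-contained sketch. `slow_operation` is a hypothetical stand-in for a network call, and returning the string `"timed out"` is just one way a caller might handle the error.

```python
import asyncio


async def slow_operation(delay: float) -> str:
    """Hypothetical stand-in for a slow network call."""
    await asyncio.sleep(delay)
    return "done"


async def call_with_deadline(delay: float, timeout: float) -> str:
    """Return the result, or a fallback marker if the deadline passes first."""
    try:
        return await asyncio.wait_for(slow_operation(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled slow_operation at this point.
        return "timed out"


fast = asyncio.run(call_with_deadline(delay=0.01, timeout=1.0))
slow = asyncio.run(call_with_deadline(delay=1.0, timeout=0.05))
print(fast, slow)  # done timed out
```
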
### Pattern 7: Cross-Cutting Concerns via Decorators

Stack decorators to separate infrastructure from business logic.

```python
from collections.abc import Awaitable, Callable
from functools import wraps
from typing import TypeVar

import structlog
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

logger = structlog.get_logger()
T = TypeVar("T")

def traced(name: str | None = None):
    """Add tracing to function calls."""
    def decorator(func: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        span_name = name or func.__name__

        @wraps(func)
        async def wrapper(*args, **kwargs) -> T:
            logger.info("Operation started", operation=span_name)
            try:
                result = await func(*args, **kwargs)
                logger.info("Operation completed", operation=span_name)
                return result
            except Exception as e:
                logger.error("Operation failed", operation=span_name, error=str(e))
                raise
        return wrapper
    return decorator

# Stack multiple concerns. Decorators apply bottom-up: retry is innermost,
# the 30-second timeout bounds all attempts together, and tracing wraps both.
# `with_timeout` is the decorator defined in Pattern 6.
@traced("fetch_user_data")
@with_timeout(30)
@retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter())
async def fetch_user_data(user_id: str) -> dict:
    """Fetch user with tracing, timeout, and retry."""
    ...
```

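The order in which timeout and retry decorators are stacked changes what the timeout covers. The stdlib-only sketch below illustrates this with a hypothetical `retry_async` helper (no backoff, retries only on timeout) and a hypothetical `make_flaky` factory whose coroutine hangs on its first call only. It is a demonstration of ordering, not a replacement for a real retry library.

```python
import asyncio
from functools import wraps


def with_timeout(seconds: float):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            return await asyncio.wait_for(func(*args, **kwargs), timeout=seconds)
        return wrapper
    return decorator


def retry_async(attempts: int):
    """Hypothetical minimal retry: re-run on timeout, no backoff."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except asyncio.TimeoutError:
                    if attempt == attempts:
                        raise
        return wrapper
    return decorator


def make_flaky():
    """Build a coroutine function that hangs on its first call only."""
    calls = {"n": 0}

    async def flaky() -> str:
        calls["n"] += 1
        await asyncio.sleep(1.0 if calls["n"] == 1 else 0.0)
        return "ok"
    return flaky


# Timeout inside retry: each attempt gets its own 0.05 s budget.
per_attempt = retry_async(3)(with_timeout(0.05)(make_flaky()))
# Timeout outside retry: all attempts share a single 0.05 s budget.
overall = with_timeout(0.05)(retry_async(3)(make_flaky()))


async def run(func) -> str:
    try:
        return await func()
    except asyncio.TimeoutError:
        return "timed out"


per_attempt_result = asyncio.run(run(per_attempt))
overall_result = asyncio.run(run(overall))
print(per_attempt_result, overall_result)  # ok timed out
```

With the timeout innermost, the hung first attempt is abandoned and the second attempt succeeds. With the timeout outermost, the single shared budget expires during the first attempt and the whole retry loop is cancelled.
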
### Pattern 8: Dependency Injection for Testability

Pass infrastructure components through constructors for easy testing.

```python
import time
from dataclasses import dataclass
from typing import Protocol

# UserRepository, User, and the Fake* classes below are application-defined
# types that are not shown here.

class Logger(Protocol):
    def info(self, msg: str, **kwargs) -> None: ...
    def error(self, msg: str, **kwargs) -> None: ...

class MetricsClient(Protocol):
    def increment(self, metric: str, tags: dict | None = None) -> None: ...
    def timing(self, metric: str, value: float) -> None: ...

@dataclass
class UserService:
    """Service with injected infrastructure."""

    repository: UserRepository
    logger: Logger
    metrics: MetricsClient

    async def get_user(self, user_id: str) -> User:
        self.logger.info("Fetching user", user_id=user_id)
        start = time.perf_counter()

        try:
            user = await self.repository.get(user_id)
            self.metrics.increment("user.fetch.success")
            return user
        except Exception as e:
            self.metrics.increment("user.fetch.error")
            self.logger.error("Failed to fetch user", user_id=user_id, error=str(e))
            raise
        finally:
            elapsed = time.perf_counter() - start
            self.metrics.timing("user.fetch.duration", elapsed)

# Easy to test with fakes
service = UserService(
    repository=FakeRepository(),
    logger=FakeLogger(),
    metrics=FakeMetrics(),
)
```

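The fakes passed to the service are not defined in the original example. One possible shape for them is sketched below: simple recording objects that satisfy the `Logger` and `MetricsClient` protocols structurally. The field names `records`, `counters`, and `timings` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakeLogger:
    """Records log calls instead of emitting them."""
    records: list[tuple[str, str, dict[str, Any]]] = field(default_factory=list)

    def info(self, msg: str, **kwargs: Any) -> None:
        self.records.append(("info", msg, kwargs))

    def error(self, msg: str, **kwargs: Any) -> None:
        self.records.append(("error", msg, kwargs))


@dataclass
class FakeMetrics:
    """Counts increments and stores timings for later assertions."""
    counters: dict[str, int] = field(default_factory=dict)
    timings: list[tuple[str, float]] = field(default_factory=list)

    def increment(self, metric: str, tags: Optional[dict] = None) -> None:
        self.counters[metric] = self.counters.get(metric, 0) + 1

    def timing(self, metric: str, value: float) -> None:
        self.timings.append((metric, value))


# Usage: exercise the fakes directly, then assert on what they recorded.
logger = FakeLogger()
metrics = FakeMetrics()
logger.info("Fetching user", user_id="u1")
metrics.increment("user.fetch.success")
metrics.timing("user.fetch.duration", 0.01)
```

A test can then assert on `logger.records` and `metrics.counters` after calling the service, without any real logging or metrics backend.
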
### Pattern 9: Fail-Safe Defaults

Degrade gracefully when non-critical operations fail.

```python
from collections.abc import Awaitable, Callable
from functools import wraps
from typing import TypeVar

import structlog

logger = structlog.get_logger()
T = TypeVar("T")

def fail_safe(default: T, log_failure: bool = True):
    """Return default value on failure instead of raising."""
    def decorator(func: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        @wraps(func)
        async def wrapper(*args, **kwargs) -> T:
            try:
                return await func(*args, **kwargs)
            except Exception as e:
                if log_failure:
                    logger.warning(
                        "Operation failed, using default",
                        function=func.__name__,
                        error=str(e),
                    )
                return default
        return wrapper
    return decorator

# The same `default` object is returned on every failure, so callers
# should not mutate it (relevant for mutable defaults such as []).
@fail_safe(default=[])
async def get_recommendations(user_id: str) -> list[str]:
    """Get recommendations, return empty list on failure."""
    ...
```

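When the default is mutable, such as a list, a variant that builds a fresh default per failure avoids sharing one object across callers. The sketch below is one way to do that using only the standard library. The name `fail_safe_factory` is hypothetical, and stdlib `logging` is substituted for `structlog` to keep the example self-contained.

```python
import asyncio
import logging
from functools import wraps

log = logging.getLogger(__name__)


def fail_safe_factory(default_factory, log_failure: bool = True):
    """Like a fail-safe decorator, but builds a fresh default on each failure."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                return await func(*args, **kwargs)
            except Exception as exc:
                if log_failure:
                    log.warning("%s failed, using default: %s", func.__name__, exc)
                return default_factory()
        return wrapper
    return decorator


@fail_safe_factory(list)
async def get_recommendations(user_id: str) -> list[str]:
    # Simulated outage so the fallback path is exercised.
    raise ConnectionError("recommendation service unavailable")


first = asyncio.run(get_recommendations("u1"))
first.append("mutated by caller")
second = asyncio.run(get_recommendations("u1"))
print(second)  # []
```

Because each failure calls `default_factory()`, mutating one returned list does not affect later fallbacks.
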
## Best Practices Summary

1. **Retry only transient errors** - Don't retry bugs or authentication failures
2. **Use exponential backoff** - Give services time to recover
3. **Add jitter** - Prevent thundering herd from synchronized retries
4. **Cap total duration** - `stop_after_attempt(5) | stop_after_delay(60)`
5. **Log every retry** - Silent retries hide systemic problems
6. **Use decorators** - Keep retry logic separate from business logic
7. **Inject dependencies** - Make infrastructure testable
8. **Set timeouts everywhere** - Every network call needs a timeout
9. **Fail gracefully** - Return cached/default values for non-critical paths
10. **Monitor retry rates** - High retry rates indicate underlying issues