mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
Add Comprehensive Python Development Skills (#419)
* Add extra python skills covering code style, design patterns, resilience, resource management, testing patterns, and type safety ...etc * fix: correct code examples in Python skills - Clarify Python version requirements for type statement (3.10+ vs 3.12+) - Add missing ValidationError import in configuration example - Add missing httpx import and url parameter in async example --------- Co-authored-by: Seth Hobson <wshobson@gmail.com>
This commit is contained in:
376
plugins/python-development/skills/python-resilience/SKILL.md
Normal file
376
plugins/python-development/skills/python-resilience/SKILL.md
Normal file
@@ -0,0 +1,376 @@
|
||||
---
|
||||
name: python-resilience
|
||||
description: Python resilience patterns including automatic retries, exponential backoff, timeouts, and fault-tolerant decorators. Use when adding retry logic, implementing timeouts, building fault-tolerant services, or handling transient failures.
|
||||
---
|
||||
|
||||
# Python Resilience Patterns
|
||||
|
||||
Build fault-tolerant Python applications that gracefully handle transient failures, network issues, and service outages. Resilience patterns keep systems running when dependencies are unreliable.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
- Adding retry logic to external service calls
|
||||
- Implementing timeouts for network operations
|
||||
- Building fault-tolerant microservices
|
||||
- Handling rate limiting and backpressure
|
||||
- Creating infrastructure decorators
|
||||
- Designing circuit breakers
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### 1. Transient vs Permanent Failures
|
||||
|
||||
Retry transient errors (network timeouts, temporary service issues). Don't retry permanent errors (invalid credentials, bad requests).
|
||||
|
||||
### 2. Exponential Backoff
|
||||
|
||||
Increase wait time between retries to avoid overwhelming recovering services.
|
||||
|
||||
### 3. Jitter
|
||||
|
||||
Add randomness to backoff to prevent thundering herd when many clients retry simultaneously.
|
||||
|
||||
### 4. Bounded Retries
|
||||
|
||||
Cap both attempt count and total duration to prevent infinite retry loops.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```python
|
||||
from tenacity import retry, stop_after_attempt, wait_exponential_jitter
|
||||
|
||||
@retry(
|
||||
stop=stop_after_attempt(3),
|
||||
wait=wait_exponential_jitter(initial=1, max=10),
|
||||
)
|
||||
def call_external_service(request: dict) -> dict:
|
||||
return httpx.post("https://api.example.com", json=request).json()
|
||||
```
|
||||
|
||||
## Fundamental Patterns
|
||||
|
||||
### Pattern 1: Basic Retry with Tenacity
|
||||
|
||||
Use the `tenacity` library for production-grade retry logic. For simpler cases, consider built-in retry functionality or a lightweight custom implementation.
|
||||
|
||||
```python
|
||||
from tenacity import (
|
||||
retry,
|
||||
stop_after_attempt,
|
||||
stop_after_delay,
|
||||
wait_exponential_jitter,
|
||||
retry_if_exception_type,
|
||||
)
|
||||
|
||||
TRANSIENT_ERRORS = (ConnectionError, TimeoutError, OSError)
|
||||
|
||||
@retry(
|
||||
retry=retry_if_exception_type(TRANSIENT_ERRORS),
|
||||
stop=stop_after_attempt(5) | stop_after_delay(60),
|
||||
wait=wait_exponential_jitter(initial=1, max=30),
|
||||
)
|
||||
def fetch_data(url: str) -> dict:
|
||||
"""Fetch data with automatic retry on transient failures."""
|
||||
response = httpx.get(url, timeout=30)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
```
|
||||
|
||||
### Pattern 2: Retry Only Appropriate Errors
|
||||
|
||||
Whitelist specific transient exceptions. Never retry:
|
||||
|
||||
- `ValueError`, `TypeError` - These are bugs, not transient issues
|
||||
- `AuthenticationError` - Invalid credentials won't become valid
|
||||
- HTTP 4xx errors (except 429) - Client errors are permanent
|
||||
|
||||
```python
|
||||
from tenacity import retry, retry_if_exception_type
|
||||
import httpx
|
||||
|
||||
# Define what's retryable
|
||||
RETRYABLE_EXCEPTIONS = (
|
||||
ConnectionError,
|
||||
TimeoutError,
|
||||
httpx.ConnectTimeout,
|
||||
httpx.ReadTimeout,
|
||||
)
|
||||
|
||||
@retry(
|
||||
retry=retry_if_exception_type(RETRYABLE_EXCEPTIONS),
|
||||
stop=stop_after_attempt(3),
|
||||
wait=wait_exponential_jitter(initial=1, max=10),
|
||||
)
|
||||
def resilient_api_call(endpoint: str) -> dict:
|
||||
"""Make API call with retry on network issues."""
|
||||
return httpx.get(endpoint, timeout=10).json()
|
||||
```
|
||||
|
||||
### Pattern 3: HTTP Status Code Retries
|
||||
|
||||
Retry specific HTTP status codes that indicate transient issues.
|
||||
|
||||
```python
|
||||
from tenacity import retry, retry_if_result, stop_after_attempt
|
||||
import httpx
|
||||
|
||||
RETRY_STATUS_CODES = {429, 502, 503, 504}
|
||||
|
||||
def should_retry_response(response: httpx.Response) -> bool:
|
||||
"""Check if response indicates a retryable error."""
|
||||
return response.status_code in RETRY_STATUS_CODES
|
||||
|
||||
@retry(
|
||||
retry=retry_if_result(should_retry_response),
|
||||
stop=stop_after_attempt(3),
|
||||
wait=wait_exponential_jitter(initial=1, max=10),
|
||||
)
|
||||
def http_request(method: str, url: str, **kwargs) -> httpx.Response:
|
||||
"""Make HTTP request with retry on transient status codes."""
|
||||
return httpx.request(method, url, timeout=30, **kwargs)
|
||||
```
|
||||
|
||||
### Pattern 4: Combined Exception and Status Retry
|
||||
|
||||
Handle both network exceptions and HTTP status codes.
|
||||
|
||||
```python
|
||||
from tenacity import (
|
||||
retry,
|
||||
retry_if_exception_type,
|
||||
retry_if_result,
|
||||
stop_after_attempt,
|
||||
wait_exponential_jitter,
|
||||
before_sleep_log,
|
||||
)
|
||||
import logging
|
||||
import httpx
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
TRANSIENT_EXCEPTIONS = (
|
||||
ConnectionError,
|
||||
TimeoutError,
|
||||
httpx.ConnectError,
|
||||
httpx.ReadTimeout,
|
||||
)
|
||||
RETRY_STATUS_CODES = {429, 500, 502, 503, 504}
|
||||
|
||||
def is_retryable_response(response: httpx.Response) -> bool:
|
||||
return response.status_code in RETRY_STATUS_CODES
|
||||
|
||||
@retry(
|
||||
retry=(
|
||||
retry_if_exception_type(TRANSIENT_EXCEPTIONS) |
|
||||
retry_if_result(is_retryable_response)
|
||||
),
|
||||
stop=stop_after_attempt(5),
|
||||
wait=wait_exponential_jitter(initial=1, max=30),
|
||||
before_sleep=before_sleep_log(logger, logging.WARNING),
|
||||
)
|
||||
def robust_http_call(
|
||||
method: str,
|
||||
url: str,
|
||||
**kwargs,
|
||||
) -> httpx.Response:
|
||||
"""HTTP call with comprehensive retry handling."""
|
||||
return httpx.request(method, url, timeout=30, **kwargs)
|
||||
```
|
||||
|
||||
## Advanced Patterns
|
||||
|
||||
### Pattern 5: Logging Retry Attempts
|
||||
|
||||
Track retry behavior for debugging and alerting.
|
||||
|
||||
```python
|
||||
from tenacity import retry, stop_after_attempt, wait_exponential
|
||||
import structlog
|
||||
|
||||
logger = structlog.get_logger()
|
||||
|
||||
def log_retry_attempt(retry_state):
|
||||
"""Log detailed retry information."""
|
||||
exception = retry_state.outcome.exception()
|
||||
logger.warning(
|
||||
"Retrying operation",
|
||||
attempt=retry_state.attempt_number,
|
||||
exception_type=type(exception).__name__,
|
||||
exception_message=str(exception),
|
||||
next_wait_seconds=retry_state.next_action.sleep if retry_state.next_action else None,
|
||||
)
|
||||
|
||||
@retry(
|
||||
stop=stop_after_attempt(3),
|
||||
wait=wait_exponential(multiplier=1, max=10),
|
||||
before_sleep=log_retry_attempt,
|
||||
)
|
||||
def call_with_logging(request: dict) -> dict:
|
||||
"""External call with retry logging."""
|
||||
...
|
||||
```
|
||||
|
||||
### Pattern 6: Timeout Decorator
|
||||
|
||||
Create reusable timeout decorators for consistent timeout handling.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from functools import wraps
|
||||
from typing import TypeVar, Callable
|
||||
|
||||
T = TypeVar("T")
|
||||
|
||||
def with_timeout(seconds: float):
|
||||
"""Decorator to add timeout to async functions."""
|
||||
def decorator(func: Callable[..., T]) -> Callable[..., T]:
|
||||
@wraps(func)
|
||||
async def wrapper(*args, **kwargs) -> T:
|
||||
return await asyncio.wait_for(
|
||||
func(*args, **kwargs),
|
||||
timeout=seconds,
|
||||
)
|
||||
return wrapper
|
||||
return decorator
|
||||
|
||||
@with_timeout(30)
|
||||
async def fetch_with_timeout(url: str) -> dict:
|
||||
"""Fetch URL with 30 second timeout."""
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.get(url)
|
||||
return response.json()
|
||||
```
|
||||
|
||||
### Pattern 7: Cross-Cutting Concerns via Decorators
|
||||
|
||||
Stack decorators to separate infrastructure from business logic.
|
||||
|
||||
```python
|
||||
from functools import wraps
|
||||
from typing import TypeVar, Callable
|
||||
import structlog
|
||||
|
||||
logger = structlog.get_logger()
|
||||
T = TypeVar("T")
|
||||
|
||||
def traced(name: str | None = None):
|
||||
"""Add tracing to function calls."""
|
||||
def decorator(func: Callable[..., T]) -> Callable[..., T]:
|
||||
span_name = name or func.__name__
|
||||
|
||||
@wraps(func)
|
||||
async def wrapper(*args, **kwargs) -> T:
|
||||
logger.info("Operation started", operation=span_name)
|
||||
try:
|
||||
result = await func(*args, **kwargs)
|
||||
logger.info("Operation completed", operation=span_name)
|
||||
return result
|
||||
except Exception as e:
|
||||
logger.error("Operation failed", operation=span_name, error=str(e))
|
||||
raise
|
||||
return wrapper
|
||||
return decorator
|
||||
|
||||
# Stack multiple concerns
|
||||
@traced("fetch_user_data")
|
||||
@with_timeout(30)
|
||||
@retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter())
|
||||
async def fetch_user_data(user_id: str) -> dict:
|
||||
"""Fetch user with tracing, timeout, and retry."""
|
||||
...
|
||||
```
|
||||
|
||||
### Pattern 8: Dependency Injection for Testability
|
||||
|
||||
Pass infrastructure components through constructors for easy testing.
|
||||
|
||||
```python
|
||||
from dataclasses import dataclass
|
||||
from typing import Protocol
|
||||
|
||||
class Logger(Protocol):
|
||||
def info(self, msg: str, **kwargs) -> None: ...
|
||||
def error(self, msg: str, **kwargs) -> None: ...
|
||||
|
||||
class MetricsClient(Protocol):
|
||||
def increment(self, metric: str, tags: dict | None = None) -> None: ...
|
||||
def timing(self, metric: str, value: float) -> None: ...
|
||||
|
||||
@dataclass
|
||||
class UserService:
|
||||
"""Service with injected infrastructure."""
|
||||
|
||||
repository: UserRepository
|
||||
logger: Logger
|
||||
metrics: MetricsClient
|
||||
|
||||
async def get_user(self, user_id: str) -> User:
|
||||
self.logger.info("Fetching user", user_id=user_id)
|
||||
start = time.perf_counter()
|
||||
|
||||
try:
|
||||
user = await self.repository.get(user_id)
|
||||
self.metrics.increment("user.fetch.success")
|
||||
return user
|
||||
except Exception as e:
|
||||
self.metrics.increment("user.fetch.error")
|
||||
self.logger.error("Failed to fetch user", user_id=user_id, error=str(e))
|
||||
raise
|
||||
finally:
|
||||
elapsed = time.perf_counter() - start
|
||||
self.metrics.timing("user.fetch.duration", elapsed)
|
||||
|
||||
# Easy to test with fakes
|
||||
service = UserService(
|
||||
repository=FakeRepository(),
|
||||
logger=FakeLogger(),
|
||||
metrics=FakeMetrics(),
|
||||
)
|
||||
```
|
||||
|
||||
### Pattern 9: Fail-Safe Defaults
|
||||
|
||||
Degrade gracefully when non-critical operations fail.
|
||||
|
||||
```python
|
||||
from typing import TypeVar
|
||||
from collections.abc import Callable
|
||||
|
||||
T = TypeVar("T")
|
||||
|
||||
def fail_safe(default: T, log_failure: bool = True):
|
||||
"""Return default value on failure instead of raising."""
|
||||
def decorator(func: Callable[..., T]) -> Callable[..., T]:
|
||||
@wraps(func)
|
||||
async def wrapper(*args, **kwargs) -> T:
|
||||
try:
|
||||
return await func(*args, **kwargs)
|
||||
except Exception as e:
|
||||
if log_failure:
|
||||
logger.warning(
|
||||
"Operation failed, using default",
|
||||
function=func.__name__,
|
||||
error=str(e),
|
||||
)
|
||||
return default
|
||||
return wrapper
|
||||
return decorator
|
||||
|
||||
@fail_safe(default=[])
|
||||
async def get_recommendations(user_id: str) -> list[str]:
|
||||
"""Get recommendations, return empty list on failure."""
|
||||
...
|
||||
```
|
||||
|
||||
## Best Practices Summary
|
||||
|
||||
1. **Retry only transient errors** - Don't retry bugs or authentication failures
|
||||
2. **Use exponential backoff** - Give services time to recover
|
||||
3. **Add jitter** - Prevent thundering herd from synchronized retries
|
||||
4. **Cap total duration** - `stop_after_attempt(5) | stop_after_delay(60)`
|
||||
5. **Log every retry** - Silent retries hide systemic problems
|
||||
6. **Use decorators** - Keep retry logic separate from business logic
|
||||
7. **Inject dependencies** - Make infrastructure testable
|
||||
8. **Set timeouts everywhere** - Every network call needs a timeout
|
||||
9. **Fail gracefully** - Return cached/default values for non-critical paths
|
||||
10. **Monitor retry rates** - High retry rates indicate underlying issues
|
||||
Reference in New Issue
Block a user