mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 17:47:16 +00:00
feat: add 5 new specialized agents with 20 skills
Add domain expert agents with comprehensive skill sets:
- service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability
- event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas
- vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search
- monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces
- threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements

Update all documentation to reflect correct counts:
- 67 plugins, 99 agents, 107 skills, 71 commands
@@ -0,0 +1,383 @@

---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---

# Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and other service mesh deployments.

## When to Use This Skill

- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity

## Core Concepts

### 1. Three Pillars of Observability

```
┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
```

### 2. Golden Signals for Mesh

| Signal | Description | Alert Threshold |
|--------|-------------|-----------------|
| **Latency** | Request duration P50, P99 | P99 > 500ms |
| **Traffic** | Requests per second | Anomaly detection |
| **Errors** | 5xx error rate | > 1% |
| **Saturation** | Resource utilization | > 80% |
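
The fixed thresholds in the table can be checked mechanically. The following is a minimal sketch, not part of any mesh tooling: the `MeshSnapshot` structure, its field names, and `breached_signals` are hypothetical names chosen for illustration, and the limits simply mirror the table. Traffic is left out because the table assigns it anomaly detection rather than a fixed limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshSnapshot:
    """One observation window for a single service (hypothetical structure)."""
    p99_latency_ms: float
    total_requests: int
    errors_5xx: int
    utilization: float  # fraction of capacity in use, 0.0 to 1.0


def breached_signals(snap, p99_limit_ms=500.0, error_limit=0.01, saturation_limit=0.80):
    """Return the names of golden signals that exceed the table's thresholds."""
    breached = []
    if snap.p99_latency_ms > p99_limit_ms:
        breached.append("latency")
    # Guard against division by zero when the window saw no traffic.
    error_rate = snap.errors_5xx / snap.total_requests if snap.total_requests else 0.0
    if error_rate > error_limit:
        breached.append("errors")
    if snap.utilization > saturation_limit:
        breached.append("saturation")
    return breached


if __name__ == "__main__":
    print(breached_signals(MeshSnapshot(620.0, 10_000, 50, 0.65)))
```

In practice these checks usually live in Prometheus alerting rules rather than application code; the sketch only makes the arithmetic behind each threshold explicit.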

## Templates

### Template 1: Istio with Prometheus & Grafana

```yaml
# Prometheus scrape configuration.
# Note: the istio-telemetry (Mixer) service was removed in Istio 1.5+,
# so this scrapes istiod and the Envoy sidecars instead.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      # Control-plane metrics from istiod
      - job_name: 'istiod'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: istiod;http-monitoring
      # Data-plane metrics (istio_requests_total, ...) from Envoy sidecars
      - job_name: 'envoy-stats'
        metrics_path: /stats/prometheus
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: '.*-envoy-prom'
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
```

### Template 2: Key Istio Metrics Queries

```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx), as a percentage of all requests
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency (milliseconds)
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections opened per second (the metric is a counter, so take its rate)
sum(rate(istio_tcp_connections_opened_total{reporter="destination"}[5m])) by (destination_service_name)

# P99 request size (bytes)
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
```
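
The P99 queries above rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside the cumulative bucket that contains the target rank. The sketch below illustrates that idea in plain Python under simplifying assumptions: all bucket bounds are positive, the lowest bucket starts at 0, and the sample bucket values are made up. It is an approximation of the technique, not Prometheus's actual implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include the (math.inf, total_count) bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, including +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    if count == below:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    # Hypothetical latency buckets in milliseconds.
    latency = [(50, 600), (100, 900), (250, 980), (500, 995), (math.inf, 1000)]
    print(histogram_quantile(0.99, latency))
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: a P99 that lands in a wide bucket (here 250 to 500 ms) can be off by a large fraction of that bucket's width.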

### Template 3: Jaeger Distributed Tracing

```yaml
# Istio tracing configuration: sidecars send spans to Jaeger's Zipkin-compatible port
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0 # 100% in dev, lower in prod
        zipkin:
          # Assumes a Service named jaeger-collector exposes port 9411
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment (all-in-one image; suited to dev and test)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775 # UDP
              protocol: UDP
            - containerPort: 6831 # Thrift compact (UDP)
              protocol: UDP
            - containerPort: 6832 # Thrift binary (UDP)
              protocol: UDP
            - containerPort: 5778 # Config
            - containerPort: 16686 # UI
            - containerPort: 14268 # HTTP
            - containerPort: 14250 # gRPC
            - containerPort: 9411 # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```

### Template 4: Linkerd Viz Dashboard

```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```

### Template 5: Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\", response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```
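
The `thresholds.steps` on the Error Rate gauge map a percentage to a color: as far as I understand Grafana's behavior, a reading takes the color of the highest step whose `value` it has reached. The sketch below restates that rule in Python as an assumption for illustration; `threshold_color` is a hypothetical helper, not a Grafana API.

```python
def threshold_color(value, steps):
    """Return the color of the highest step whose value is <= the reading."""
    ordered = sorted(steps, key=lambda step: step["value"])
    color = ordered[0]["color"]  # base color for readings below every step
    for step in ordered:
        if value >= step["value"]:
            color = step["color"]
    return color


if __name__ == "__main__":
    # Same steps as the Error Rate panel in the dashboard JSON.
    steps = [
        {"value": 0, "color": "green"},
        {"value": 1, "color": "yellow"},
        {"value": 5, "color": "red"},
    ]
    for reading in (0.4, 3.2, 7.5):
        print(reading, threshold_color(reading, steps))
```

With these steps the gauge turns yellow at the 1% error rate listed in the golden signals table and red at 5%, which matches the `HighErrorRate` alert threshold.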

### Template 6: Kiali Service Mesh Visualization

```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```

### Template 7: OpenTelemetry Integration

```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 10s

    exporters:
      # The dedicated `jaeger` exporter has been removed from recent Collector
      # releases; Jaeger accepts OTLP, so export over OTLP/gRPC instead.
      # Assumes Jaeger's OTLP receiver is enabled on port 4317.
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry API with OTel
# The provider name must match an entry in meshConfig.extensionProviders
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 10
```

## Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        # reporter="destination" avoids counting each request twice
        # (once per source sidecar, once per destination sidecar)
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"

        # Requires cert-manager metrics; applies only if cert-manager issues the mesh certificates
        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
```
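
The `for: 5m` clause on the first two rules means the expression must stay true across consecutive evaluations for five minutes before the alert moves from pending to firing; a single false evaluation resets the timer. The sketch below models that behavior in simplified form. `firing_times` and the sample error ratios are hypothetical, and real Prometheus evaluation has additional details (evaluation intervals, staleness) that are ignored here.

```python
def firing_times(samples, threshold, for_seconds):
    """Return the timestamps at which the alert would be firing.

    `samples` is an iterable of (timestamp_seconds, value), ordered by time.
    """
    pending_since = None
    firing = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: start pending
            if ts - pending_since >= for_seconds:
                firing.append(ts)
        else:
            pending_since = None  # condition cleared: reset the timer
    return firing


if __name__ == "__main__":
    # Hypothetical per-minute error ratios for one service.
    ratios = [0.02, 0.06, 0.07, 0.08, 0.09, 0.10, 0.06, 0.01, 0.09]
    samples = [(i * 60, r) for i, r in enumerate(ratios)]
    print(firing_times(samples, threshold=0.05, for_seconds=300))
```

`MeshCertExpiring` has no `for` clause, so under this model it would fire on the first evaluation where the certificate has fewer than 7 days left.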

## Best Practices

### Do's
- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers
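
Header propagation matters because a sidecar cannot tell which outbound call belongs to which inbound request; the application has to copy the trace headers across. The sketch below shows one way to do that, assuming B3 and W3C Trace Context headers as commonly listed for Istio. `TRACE_HEADERS` and `propagation_headers` are hypothetical names, and the exact header set should be checked against the tracer in use.

```python
# Headers commonly forwarded for Istio tracing (assumed list; verify for your tracer).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def propagation_headers(incoming):
    """Pick the trace headers from an inbound request to attach to outbound calls."""
    # HTTP header names are case-insensitive, so normalize before matching.
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


if __name__ == "__main__":
    inbound = {
        "X-Request-Id": "abc-123",
        "X-B3-TraceId": "463ac35c9f6413ad",
        "Authorization": "Bearer example-token",
    }
    print(propagation_headers(inbound))
```

The returned dictionary can be merged into the headers of any outbound HTTP call made while handling the request; unrelated headers such as `Authorization` are deliberately not copied.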

### Don'ts
- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs
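
Cardinality grows multiplicatively: in the worst case a metric produces one time series per combination of label values. The sketch below computes that upper bound; `estimated_series` is a hypothetical helper and the label counts are invented for illustration, so real series counts are usually lower because not every combination occurs.

```python
import math


def estimated_series(label_cardinalities):
    """Worst-case time series count: product of distinct values per label."""
    return math.prod(label_cardinalities.values())


if __name__ == "__main__":
    # Invented label cardinalities for one metric.
    base = {"destination_service_name": 40, "source_workload": 40, "response_code": 6}
    print(estimated_series(base))
    # Adding a single high-cardinality label multiplies the total.
    print(estimated_series({**base, "request_path": 500}))
```

This is why unbounded values such as raw request paths or user IDs are usually kept out of metric labels and left to traces or logs.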

## Resources

- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
- [OpenTelemetry](https://opentelemetry.io/)
- [Kiali](https://kiali.io/)