mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
Add domain expert agents with comprehensive skill sets: - service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability - event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas - vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search - monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces - threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements Update all documentation to reflect correct counts: - 67 plugins, 99 agents, 107 skills, 71 commands
384 lines
10 KiB
Markdown
384 lines
10 KiB
Markdown
---
|
|
name: service-mesh-observability
|
|
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
|
|
---
|
|
|
|
# Service Mesh Observability
|
|
|
|
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
|
|
|
|
## When to Use This Skill
|
|
|
|
- Setting up distributed tracing across services
|
|
- Implementing service mesh metrics and dashboards
|
|
- Debugging latency and error issues
|
|
- Defining SLOs for service communication
|
|
- Visualizing service dependencies
|
|
- Troubleshooting mesh connectivity
|
|
|
|
## Core Concepts
|
|
|
|
### 1. Three Pillars of Observability
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────┐
|
|
│ Observability │
|
|
├─────────────────┬─────────────────┬─────────────────┤
|
|
│ Metrics │ Traces │ Logs │
|
|
│ │ │ │
|
|
│ • Request rate │ • Span context │ • Access logs │
|
|
│ • Error rate │ • Latency │ • Error details │
|
|
│ • Latency P50 │ • Dependencies │ • Debug info │
|
|
│ • Saturation │ • Bottlenecks │ • Audit trail │
|
|
└─────────────────┴─────────────────┴─────────────────┘
|
|
```
|
|
|
|
### 2. Golden Signals for Mesh
|
|
|
|
| Signal | Description | Alert Threshold |
|
|
|--------|-------------|-----------------|
|
|
| **Latency** | Request duration P50, P99 | P99 > 500ms |
|
|
| **Traffic** | Requests per second | Anomaly detection |
|
|
| **Errors** | 5xx error rate | > 1% |
|
|
| **Saturation** | Resource utilization | > 80% |
|
|
|
|
## Templates
|
|
|
|
### Template 1: Istio with Prometheus & Grafana
|
|
|
|
```yaml
|
|
# Install Prometheus
|
|
apiVersion: v1
|
|
kind: ConfigMap
|
|
metadata:
|
|
name: prometheus
|
|
namespace: istio-system
|
|
data:
|
|
prometheus.yml: |
|
|
global:
|
|
scrape_interval: 15s
|
|
scrape_configs:
|
|
- job_name: 'istio-mesh'
|
|
kubernetes_sd_configs:
|
|
- role: endpoints
|
|
namespaces:
|
|
names:
|
|
- istio-system
|
|
relabel_configs:
|
|
- source_labels: [__meta_kubernetes_service_name]
|
|
action: keep
|
|
regex: istio-telemetry
|
|
---
|
|
# ServiceMonitor for Prometheus Operator
|
|
apiVersion: monitoring.coreos.com/v1
|
|
kind: ServiceMonitor
|
|
metadata:
|
|
name: istio-mesh
|
|
namespace: istio-system
|
|
spec:
|
|
selector:
|
|
matchLabels:
|
|
app: istiod
|
|
endpoints:
|
|
- port: http-monitoring
|
|
interval: 15s
|
|
```
|
|
|
|
### Template 2: Key Istio Metrics Queries
|
|
|
|
```promql
|
|
# Request rate by service
|
|
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
|
|
|
|
# Error rate (5xx)
|
|
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
|
|
/ sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
|
|
|
|
# P99 latency
|
|
histogram_quantile(0.99,
|
|
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
|
|
by (le, destination_service_name))
|
|
|
|
# TCP connections
|
|
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
|
|
|
|
# Request size
|
|
histogram_quantile(0.99,
|
|
sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
|
|
by (le, destination_service_name))
|
|
```
|
|
|
|
### Template 3: Jaeger Distributed Tracing
|
|
|
|
```yaml
|
|
# Jaeger installation for Istio
|
|
apiVersion: install.istio.io/v1alpha1
|
|
kind: IstioOperator
|
|
spec:
|
|
meshConfig:
|
|
enableTracing: true
|
|
defaultConfig:
|
|
tracing:
|
|
sampling: 100.0 # 100% in dev, lower in prod
|
|
zipkin:
|
|
address: jaeger-collector.istio-system:9411
|
|
---
|
|
# Jaeger deployment
|
|
apiVersion: apps/v1
|
|
kind: Deployment
|
|
metadata:
|
|
name: jaeger
|
|
namespace: istio-system
|
|
spec:
|
|
selector:
|
|
matchLabels:
|
|
app: jaeger
|
|
template:
|
|
metadata:
|
|
labels:
|
|
app: jaeger
|
|
spec:
|
|
containers:
|
|
- name: jaeger
|
|
image: jaegertracing/all-in-one:1.50
|
|
ports:
|
|
- containerPort: 5775 # UDP
|
|
- containerPort: 6831 # Thrift
|
|
- containerPort: 6832 # Thrift
|
|
- containerPort: 5778 # Config
|
|
- containerPort: 16686 # UI
|
|
- containerPort: 14268 # HTTP
|
|
- containerPort: 14250 # gRPC
|
|
- containerPort: 9411 # Zipkin
|
|
env:
|
|
- name: COLLECTOR_ZIPKIN_HOST_PORT
|
|
value: ":9411"
|
|
```
|
|
|
|
### Template 4: Linkerd Viz Dashboard
|
|
|
|
```bash
|
|
# Install Linkerd viz extension
|
|
linkerd viz install | kubectl apply -f -
|
|
|
|
# Access dashboard
|
|
linkerd viz dashboard
|
|
|
|
# CLI commands for observability
|
|
# Top requests
|
|
linkerd viz top deploy/my-app
|
|
|
|
# Per-route metrics
|
|
linkerd viz routes deploy/my-app --to deploy/backend
|
|
|
|
# Live traffic inspection
|
|
linkerd viz tap deploy/my-app --to deploy/backend
|
|
|
|
# Service edges (dependencies)
|
|
linkerd viz edges deployment -n my-namespace
|
|
```
|
|
|
|
### Template 5: Grafana Dashboard JSON
|
|
|
|
```json
|
|
{
|
|
"dashboard": {
|
|
"title": "Service Mesh Overview",
|
|
"panels": [
|
|
{
|
|
"title": "Request Rate",
|
|
"type": "graph",
|
|
"targets": [
|
|
{
|
|
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
|
|
"legendFormat": "{{destination_service_name}}"
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"title": "Error Rate",
|
|
"type": "gauge",
|
|
"targets": [
|
|
{
|
|
"expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
|
|
}
|
|
],
|
|
"fieldConfig": {
|
|
"defaults": {
|
|
"thresholds": {
|
|
"steps": [
|
|
{"value": 0, "color": "green"},
|
|
{"value": 1, "color": "yellow"},
|
|
{"value": 5, "color": "red"}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"title": "P99 Latency",
|
|
"type": "graph",
|
|
"targets": [
|
|
{
|
|
"expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
|
|
"legendFormat": "{{destination_service_name}}"
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"title": "Service Topology",
|
|
"type": "nodeGraph",
|
|
"targets": [
|
|
{
|
|
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
|
|
}
|
|
]
|
|
}
|
|
]
|
|
}
|
|
}
|
|
```
|
|
|
|
### Template 6: Kiali Service Mesh Visualization
|
|
|
|
```yaml
|
|
# Kiali installation
|
|
apiVersion: kiali.io/v1alpha1
|
|
kind: Kiali
|
|
metadata:
|
|
name: kiali
|
|
namespace: istio-system
|
|
spec:
|
|
auth:
|
|
strategy: anonymous # or openid, token
|
|
deployment:
|
|
accessible_namespaces:
|
|
- "**"
|
|
external_services:
|
|
prometheus:
|
|
url: http://prometheus.istio-system:9090
|
|
tracing:
|
|
url: http://jaeger-query.istio-system:16686
|
|
grafana:
|
|
url: http://grafana.istio-system:3000
|
|
```
|
|
|
|
### Template 7: OpenTelemetry Integration
|
|
|
|
```yaml
|
|
# OpenTelemetry Collector for mesh
|
|
apiVersion: v1
|
|
kind: ConfigMap
|
|
metadata:
|
|
name: otel-collector-config
|
|
data:
|
|
config.yaml: |
|
|
receivers:
|
|
otlp:
|
|
protocols:
|
|
grpc:
|
|
endpoint: 0.0.0.0:4317
|
|
http:
|
|
endpoint: 0.0.0.0:4318
|
|
zipkin:
|
|
endpoint: 0.0.0.0:9411
|
|
|
|
processors:
|
|
batch:
|
|
timeout: 10s
|
|
|
|
exporters:
|
|
jaeger:
|
|
endpoint: jaeger-collector:14250
|
|
tls:
|
|
insecure: true
|
|
prometheus:
|
|
endpoint: 0.0.0.0:8889
|
|
|
|
service:
|
|
pipelines:
|
|
traces:
|
|
receivers: [otlp, zipkin]
|
|
processors: [batch]
|
|
exporters: [jaeger]
|
|
metrics:
|
|
receivers: [otlp]
|
|
processors: [batch]
|
|
exporters: [prometheus]
|
|
---
|
|
# Istio Telemetry v2 with OTel
|
|
apiVersion: telemetry.istio.io/v1alpha1
|
|
kind: Telemetry
|
|
metadata:
|
|
name: mesh-default
|
|
namespace: istio-system
|
|
spec:
|
|
tracing:
|
|
- providers:
|
|
- name: otel
|
|
randomSamplingPercentage: 10
|
|
```
|
|
|
|
## Alerting Rules
|
|
|
|
```yaml
|
|
apiVersion: monitoring.coreos.com/v1
|
|
kind: PrometheusRule
|
|
metadata:
|
|
name: mesh-alerts
|
|
namespace: istio-system
|
|
spec:
|
|
groups:
|
|
- name: mesh.rules
|
|
rules:
|
|
- alert: HighErrorRate
|
|
expr: |
|
|
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
|
|
/ sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
|
|
for: 5m
|
|
labels:
|
|
severity: critical
|
|
annotations:
|
|
summary: "High error rate for {{ $labels.destination_service_name }}"
|
|
|
|
- alert: HighLatency
|
|
expr: |
|
|
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
|
|
by (le, destination_service_name)) > 1000
|
|
for: 5m
|
|
labels:
|
|
severity: warning
|
|
annotations:
|
|
summary: "High P99 latency for {{ $labels.destination_service_name }}"
|
|
|
|
- alert: MeshCertExpiring
|
|
expr: |
|
|
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
|
|
labels:
|
|
severity: warning
|
|
annotations:
|
|
summary: "Mesh certificate expiring in less than 7 days"
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
### Do's
|
|
- **Sample appropriately** - 100% in dev, 1-10% in prod
|
|
- **Use trace context** - Propagate headers consistently
|
|
- **Set up alerts** - For golden signals
|
|
- **Correlate metrics/traces** - Use exemplars
|
|
- **Retain strategically** - Hot/cold storage tiers
|
|
|
|
### Don'ts
|
|
- **Don't over-sample** - Storage costs add up
|
|
- **Don't ignore cardinality** - Limit label values
|
|
- **Don't skip dashboards** - Visualize dependencies
|
|
- **Don't forget costs** - Monitor observability costs
|
|
|
|
## Resources
|
|
|
|
- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
|
|
- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
|
|
- [OpenTelemetry](https://opentelemetry.io/)
|
|
- [Kiali](https://kiali.io/)
|