---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---

# Service Mesh Observability

A guide to observability patterns for Istio, Linkerd, and other service mesh deployments.

## When to Use This Skill

- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity

## Core Concepts

### 1. Three Pillars of Observability

```
┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
```

### 2. Golden Signals for Mesh

| Signal         | Description               | Alert Threshold   |
| -------------- | ------------------------- | ----------------- |
| **Latency**    | Request duration P50, P99 | P99 > 500ms       |
| **Traffic**    | Requests per second       | Anomaly detection |
| **Errors**     | 5xx error rate            | > 1%              |
| **Saturation** | Resource utilization      | > 80%             |

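As a rough illustration of how the example thresholds in the table above might be applied, the following Python sketch checks one service's measurements against them. The `MeshSnapshot` type and its field names are hypothetical, and traffic is left out because the table relies on anomaly detection rather than a fixed threshold.

```python
from dataclasses import dataclass


@dataclass
class MeshSnapshot:
    """Hypothetical per-service measurements over one evaluation window."""
    p99_latency_ms: float
    total_requests: int
    errors_5xx: int
    saturation_pct: float  # e.g. CPU or connection-pool utilization


def breached_signals(s: MeshSnapshot) -> list[str]:
    """Return the golden signals whose example thresholds are exceeded."""
    breaches = []
    if s.p99_latency_ms > 500:
        breaches.append("latency")
    # Guard against division by zero when a service saw no traffic.
    error_pct = 100 * s.errors_5xx / s.total_requests if s.total_requests else 0.0
    if error_pct > 1:
        breaches.append("errors")
    if s.saturation_pct > 80:
        breaches.append("saturation")
    return breaches


# 620 ms P99, 0.5% errors, 65% saturation: only latency is breached.
print(breached_signals(MeshSnapshot(620.0, 10_000, 50, 65.0)))  # ['latency']
```

The thresholds are the table's starting points, not universal values; tune them per service.
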
## Templates

### Template 1: Istio with Prometheus & Grafana

```yaml
# Prometheus scrape configuration for the mesh
# (ConfigMap only; assumes Prometheus itself is deployed separately)
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      # Control plane metrics from istiod
      - job_name: 'istiod'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: istiod;http-monitoring
      # Data plane metrics (istio_requests_total, etc.) from Envoy sidecars.
      # The Mixer-era istio-telemetry service no longer exists in current Istio.
      - job_name: 'envoy-stats'
        metrics_path: /stats/prometheus
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: '.*-envoy-prom'
---
# ServiceMonitor for Prometheus Operator (scrapes the istiod control plane)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
```

### Template 2: Key Istio Metrics Queries

```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx), as a percentage of all requests
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency (milliseconds)
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections opened per second
# (the metric is a counter, so take rate() rather than summing raw values)
sum(rate(istio_tcp_connections_opened_total{reporter="destination"}[5m])) by (destination_service_name)

# P99 request size (bytes)
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
```

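To make the P99 queries above easier to interpret, here is a simplified Python sketch of the linear interpolation that `histogram_quantile` performs over cumulative `le` buckets. It is an approximation for illustration, not Prometheus's implementation, and the bucket values in the usage line are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: expects 0 < q <= 1 and a final bucket whose upper
    bound is +Inf, mirroring the `le="+Inf"` bucket of a classic histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in milliseconds: (le, cumulative count).
latency_buckets = [(100, 900), (250, 980), (500, 995), (math.inf, 1000)]
print(round(histogram_quantile(0.99, latency_buckets), 1))  # 416.7
```

Because the result is interpolated inside a bucket, its precision depends on how finely the bucket boundaries are spaced around the quantile of interest.
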
### Template 3: Jaeger Distributed Tracing

```yaml
# Enable tracing in Istio and send spans to Jaeger's Zipkin-compatible endpoint
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0 # 100% in dev, lower in prod
        zipkin:
          # Assumes a Service named jaeger-collector exposes port 9411
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment (all-in-one image)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775 # Legacy Zipkin Thrift (UDP)
              protocol: UDP
            - containerPort: 6831 # Jaeger Thrift compact (UDP)
              protocol: UDP
            - containerPort: 6832 # Jaeger Thrift binary (UDP)
              protocol: UDP
            - containerPort: 5778 # Config
            - containerPort: 16686 # UI
            - containerPort: 14268 # HTTP
            - containerPort: 14250 # gRPC
            - containerPort: 9411 # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```

### Template 4: Linkerd Viz Dashboard

```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```

### Template 5: Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\", response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```

### Template 6: Kiali Service Mesh Visualization

```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```

### Template 7: OpenTelemetry Integration

```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 10s

    exporters:
      # The dedicated `jaeger` exporter has been removed from recent collector
      # releases; send OTLP to Jaeger instead (assumes Jaeger's OTLP gRPC
      # receiver is enabled on port 4317)
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry API: route traces to the `otel` provider
# (the provider must be defined under meshConfig.extensionProviders)
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 10
```

## Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        # Fires when more than 5% of requests to a service return 5xx
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"

        # Fires when P99 latency exceeds 1000 ms
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"

        # Assumes mesh certificates are issued by cert-manager
        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
```

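The `for: 5m` clause in the rules above means a condition must hold across consecutive rule evaluations before the alert fires. The Python sketch below models that pending-to-firing behavior in a simplified way; the function name and the evaluation timeline are hypothetical.

```python
from typing import Optional


def first_firing_time(evaluations: list[tuple[float, bool]],
                      for_seconds: float) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, if ever.

    `evaluations` is a time-ordered list of (timestamp_seconds, condition_true)
    pairs, one per rule evaluation.
    """
    pending_since: Optional[float] = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            pending_since = None  # condition cleared: the pending state resets
            continue
        if pending_since is None:
            pending_since = timestamp  # alert becomes pending
        if timestamp - pending_since >= for_seconds:
            return timestamp  # held long enough: alert fires
    return None


# Evaluated every 60 s: a short spike, one healthy evaluation, then a sustained breach.
timeline = [(0, True), (60, True), (120, True), (180, False)]
timeline += [(t, True) for t in range(240, 601, 60)]
print(first_firing_time(timeline, for_seconds=300))  # 540
```

The short spike at the start never fires because the healthy evaluation at 180 s resets the pending state, which is the noise reduction that `for` is meant to provide.
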
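## Best Practices

### Do's

- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Propagate trace context** - Forward tracing headers on every outbound call; the sidecar cannot link inbound and outbound requests on its own
- **Set up alerts** - Cover the golden signals: latency, traffic, errors, saturation
- **Correlate metrics/traces** - Use exemplars to jump from a metric spike to a sample trace
- **Retain strategically** - Keep recent data in hot storage and move older data to cheaper cold tiers
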
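For the trace-context item above, a minimal Python sketch of header forwarding might look like the following. The header list covers `x-request-id`, the B3 headers, and W3C Trace Context; treat the exact set as an assumption and check it against the tracer your mesh is configured with. The helper name is hypothetical.

```python
# Commonly forwarded tracing headers (assumed set; adjust to your tracer).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Copy tracing headers from an inbound request for use on outbound calls.

    HTTP header names are case-insensitive, so match on lowercase names.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


inbound = {"X-B3-TraceId": "80f198ee56343ba8", "Content-Type": "application/json"}
print(propagate_trace_headers(inbound))  # {'x-b3-traceid': '80f198ee56343ba8'}
```

The returned dictionary would be merged into the headers of each outbound HTTP call the service makes while handling the request.
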
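### Don'ts

- **Don't over-sample** - Trace storage costs grow with the sampling rate
- **Don't ignore cardinality** - Limit label values; every distinct label combination creates a new time series
- **Don't skip dashboards** - Visualize service dependencies, not just individual services
- **Don't forget costs** - Monitor what the observability stack itself costs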
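
To put rough numbers on the sampling advice, here is a back-of-the-envelope Python sketch of how many spans per day reach the tracing backend. The traffic figures are hypothetical, and the estimate assumes every request produces one trace with a fixed number of spans.

```python
def sampled_spans_per_day(requests_per_second: float,
                          sampling_pct: float,
                          spans_per_trace: float) -> float:
    """Estimate how many spans per day reach the tracing backend."""
    seconds_per_day = 86_400
    sampled_traces = requests_per_second * seconds_per_day * sampling_pct / 100
    return sampled_traces * spans_per_trace


# Hypothetical mesh: 500 requests/s, 8 spans per trace.
for pct in (100, 10, 1):
    print(f"{pct:>3}% sampling: {sampled_spans_per_day(500, pct, 8):,.0f} spans/day")
```

Multiplying the result by an average span size gives a first estimate of daily trace storage.
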