feat: add 5 new specialized agents with 20 skills

Add domain expert agents with comprehensive skill sets:
- service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability
- event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas
- vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search
- monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces
- threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements

Update all documentation to reflect correct counts:
- 67 plugins, 99 agents, 107 skills, 71 commands
This commit is contained in:
Seth Hobson
2025-12-16 16:00:58 -05:00
parent c7ad381360
commit 01d93fc227
58 changed files with 24830 additions and 50 deletions

View File

@@ -0,0 +1,325 @@
---
name: istio-traffic-management
description: Configure Istio traffic management including routing, load balancing, circuit breakers, and canary deployments. Use when implementing service mesh traffic policies, progressive delivery, or resilience patterns.
---
# Istio Traffic Management
Comprehensive guide to Istio traffic management for production service mesh deployments.
## When to Use This Skill
- Configuring service-to-service routing
- Implementing canary or blue-green deployments
- Setting up circuit breakers and retries
- Load balancing configuration
- Traffic mirroring for testing
- Fault injection for chaos engineering
## Core Concepts
### 1. Traffic Management Resources
| Resource | Purpose | Scope |
|----------|---------|-------|
| **VirtualService** | Route traffic to destinations | Host-based |
| **DestinationRule** | Define policies after routing | Service-based |
| **Gateway** | Configure ingress/egress | Cluster edge |
| **ServiceEntry** | Add external services | Mesh-wide |
### 2. Traffic Flow
```
Client → Gateway → VirtualService → DestinationRule → Service
(routing) (policies) (pods)
```
## Templates
### Template 1: Basic Routing
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews-route
namespace: bookinfo
spec:
hosts:
- reviews
http:
- match:
- headers:
end-user:
exact: jason
route:
- destination:
host: reviews
subset: v2
- route:
- destination:
host: reviews
subset: v1
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: reviews-destination
namespace: bookinfo
spec:
host: reviews
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
- name: v3
labels:
version: v3
```
### Template 2: Canary Deployment
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: my-service-canary
spec:
hosts:
- my-service
http:
- route:
- destination:
host: my-service
subset: stable
weight: 90
- destination:
host: my-service
subset: canary
weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: my-service-dr
spec:
host: my-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
h2UpgradePolicy: UPGRADE
http1MaxPendingRequests: 100
http2MaxRequests: 1000
subsets:
- name: stable
labels:
version: stable
- name: canary
labels:
version: canary
```
### Template 3: Circuit Breaker
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: circuit-breaker
spec:
host: my-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 100
http2MaxRequests: 1000
maxRequestsPerConnection: 10
maxRetries: 3
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
minHealthPercent: 30
```
### Template 4: Retry and Timeout
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: ratings-retry
spec:
hosts:
- ratings
http:
- route:
- destination:
host: ratings
timeout: 10s
retries:
attempts: 3
perTryTimeout: 3s
retryOn: connect-failure,refused-stream,unavailable,cancelled,retriable-4xx,503
retryRemoteLocalities: true
```
### Template 5: Traffic Mirroring
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: mirror-traffic
spec:
hosts:
- my-service
http:
- route:
- destination:
host: my-service
subset: v1
mirror:
host: my-service
subset: v2
mirrorPercentage:
value: 100.0
```
### Template 6: Fault Injection
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: fault-injection
spec:
hosts:
- ratings
http:
- fault:
delay:
percentage:
value: 10
fixedDelay: 5s
abort:
percentage:
value: 5
httpStatus: 503
route:
- destination:
host: ratings
```
### Template 7: Ingress Gateway
```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
name: my-gateway
spec:
selector:
istio: ingressgateway
servers:
- port:
number: 443
name: https
protocol: HTTPS
tls:
mode: SIMPLE
credentialName: my-tls-secret
hosts:
- "*.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: my-vs
spec:
hosts:
- "api.example.com"
gateways:
- my-gateway
http:
- match:
- uri:
prefix: /api/v1
route:
- destination:
host: api-service
port:
number: 8080
```
## Load Balancing Strategies
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: load-balancing
spec:
host: my-service
trafficPolicy:
loadBalancer:
simple: ROUND_ROBIN # or LEAST_CONN, RANDOM, PASSTHROUGH
---
# Consistent hashing for sticky sessions
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: sticky-sessions
spec:
host: my-service
trafficPolicy:
loadBalancer:
consistentHash:
httpHeaderName: x-user-id
# or: httpCookie, useSourceIp, httpQueryParameterName
```
## Best Practices
### Do's
- **Start simple** - Add complexity incrementally
- **Use subsets** - Version your services clearly
- **Set timeouts** - Always configure reasonable timeouts
- **Enable retries** - But with backoff and limits
- **Monitor** - Use Kiali and Jaeger for visibility
### Don'ts
- **Don't over-retry** - Can cause cascading failures
- **Don't ignore outlier detection** - Enable circuit breakers
- **Don't mirror to production** - Mirror to test environments
- **Don't skip canary** - Test with small traffic percentage first
## Debugging Commands
```bash
# Check VirtualService configuration
istioctl analyze
# View effective routes
istioctl proxy-config routes deploy/my-app -o json
# Check endpoint discovery
istioctl proxy-config endpoints deploy/my-app
# Debug traffic
istioctl proxy-config log deploy/my-app --level debug
```
## Resources
- [Istio Traffic Management](https://istio.io/latest/docs/concepts/traffic-management/)
- [Virtual Service Reference](https://istio.io/latest/docs/reference/config/networking/virtual-service/)
- [Destination Rule Reference](https://istio.io/latest/docs/reference/config/networking/destination-rule/)

View File

@@ -0,0 +1,309 @@
---
name: linkerd-patterns
description: Implement Linkerd service mesh patterns for lightweight, security-focused service mesh deployments. Use when setting up Linkerd, configuring traffic policies, or implementing zero-trust networking with minimal overhead.
---
# Linkerd Patterns
Production patterns for Linkerd service mesh - the lightweight, security-first service mesh for Kubernetes.
## When to Use This Skill
- Setting up a lightweight service mesh
- Implementing automatic mTLS
- Configuring traffic splits for canary deployments
- Setting up service profiles for per-route metrics
- Implementing retries and timeouts
- Multi-cluster service mesh
## Core Concepts
### 1. Linkerd Architecture
```
┌─────────────────────────────────────────────┐
│ Control Plane │
│ ┌─────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ destiny │ │ identity │ │ proxy-inject │ │
│ └─────────┘ └──────────┘ └──────────────┘ │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Data Plane │
│ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │proxy│────│proxy│────│proxy│ │
│ └─────┘ └─────┘ └─────┘ │
│ │ │ │ │
│ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ │
│ │ app │ │ app │ │ app │ │
│ └─────┘ └─────┘ └─────┘ │
└─────────────────────────────────────────────┘
```
### 2. Key Resources
| Resource | Purpose |
|----------|---------|
| **ServiceProfile** | Per-route metrics, retries, timeouts |
| **TrafficSplit** | Canary deployments, A/B testing |
| **Server** | Define server-side policies |
| **ServerAuthorization** | Access control policies |
## Templates
### Template 1: Mesh Installation
```bash
# Install CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
# Validate cluster
linkerd check --pre
# Install CRDs
linkerd install --crds | kubectl apply -f -
# Install control plane
linkerd install | kubectl apply -f -
# Verify installation
linkerd check
# Install viz extension (optional)
linkerd viz install | kubectl apply -f -
```
### Template 2: Inject Namespace
```yaml
# Automatic injection for namespace
apiVersion: v1
kind: Namespace
metadata:
name: my-app
annotations:
linkerd.io/inject: enabled
---
# Or inject specific deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
annotations:
linkerd.io/inject: enabled
spec:
template:
metadata:
annotations:
linkerd.io/inject: enabled
```
### Template 3: Service Profile with Retries
```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
name: my-service.my-namespace.svc.cluster.local
namespace: my-namespace
spec:
routes:
- name: GET /api/users
condition:
method: GET
pathRegex: /api/users
responseClasses:
- condition:
status:
min: 500
max: 599
isFailure: true
isRetryable: true
- name: POST /api/users
condition:
method: POST
pathRegex: /api/users
# POST not retryable by default
isRetryable: false
- name: GET /api/users/{id}
condition:
method: GET
pathRegex: /api/users/[^/]+
timeout: 5s
isRetryable: true
retryBudget:
retryRatio: 0.2
minRetriesPerSecond: 10
ttl: 10s
```
### Template 4: Traffic Split (Canary)
```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
name: my-service-canary
namespace: my-namespace
spec:
service: my-service
backends:
- service: my-service-stable
weight: 900m # 90%
- service: my-service-canary
weight: 100m # 10%
```
### Template 5: Server Authorization Policy
```yaml
# Define the server
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
name: my-service-http
namespace: my-namespace
spec:
podSelector:
matchLabels:
app: my-service
port: http
proxyProtocol: HTTP/1
---
# Allow traffic from specific clients
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
name: allow-frontend
namespace: my-namespace
spec:
server:
name: my-service-http
client:
meshTLS:
serviceAccounts:
- name: frontend
namespace: my-namespace
---
# Allow unauthenticated traffic (e.g., from ingress)
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
name: allow-ingress
namespace: my-namespace
spec:
server:
name: my-service-http
client:
unauthenticated: true
networks:
- cidr: 10.0.0.0/8
```
### Template 6: HTTPRoute for Advanced Routing
```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: HTTPRoute
metadata:
name: my-route
namespace: my-namespace
spec:
parentRefs:
- name: my-service
kind: Service
group: core
port: 8080
rules:
- matches:
- path:
type: PathPrefix
value: /api/v2
- headers:
- name: x-api-version
value: v2
backendRefs:
- name: my-service-v2
port: 8080
- matches:
- path:
type: PathPrefix
value: /api
backendRefs:
- name: my-service-v1
port: 8080
```
### Template 7: Multi-cluster Setup
```bash
# On each cluster, install with cluster credentials
linkerd multicluster install | kubectl apply -f -
# Link clusters
linkerd multicluster link --cluster-name west \
--api-server-address https://west.example.com:6443 \
| kubectl apply -f -
# Export a service to other clusters
kubectl label svc/my-service mirror.linkerd.io/exported=true
# Verify cross-cluster connectivity
linkerd multicluster check
linkerd multicluster gateways
```
## Monitoring Commands
```bash
# Live traffic view
linkerd viz top deploy/my-app
# Per-route metrics
linkerd viz routes deploy/my-app
# Check proxy status
linkerd viz stat deploy -n my-namespace
# View service dependencies
linkerd viz edges deploy -n my-namespace
# Dashboard
linkerd viz dashboard
```
## Debugging
```bash
# Check injection status
linkerd check --proxy -n my-namespace
# View proxy logs
kubectl logs deploy/my-app -c linkerd-proxy
# Debug identity/TLS
linkerd identity -n my-namespace
# Tap traffic (live)
linkerd viz tap deploy/my-app --to deploy/my-backend
```
## Best Practices
### Do's
- **Enable mTLS everywhere** - It's automatic with Linkerd
- **Use ServiceProfiles** - Get per-route metrics and retries
- **Set retry budgets** - Prevent retry storms
- **Monitor golden metrics** - Success rate, latency, throughput
### Don'ts
- **Don't skip check** - Always run `linkerd check` after changes
- **Don't over-configure** - Linkerd defaults are sensible
- **Don't ignore ServiceProfiles** - They unlock advanced features
- **Don't forget timeouts** - Set appropriate values per route
## Resources
- [Linkerd Documentation](https://linkerd.io/2.14/overview/)
- [Service Profiles](https://linkerd.io/2.14/features/service-profiles/)
- [Authorization Policy](https://linkerd.io/2.14/features/server-policy/)

View File

@@ -0,0 +1,347 @@
---
name: mtls-configuration
description: Configure mutual TLS (mTLS) for zero-trust service-to-service communication. Use when implementing zero-trust networking, certificate management, or securing internal service communication.
---
# mTLS Configuration
Comprehensive guide to implementing mutual TLS for zero-trust service mesh communication.
## When to Use This Skill
- Implementing zero-trust networking
- Securing service-to-service communication
- Certificate rotation and management
- Debugging TLS handshake issues
- Compliance requirements (PCI-DSS, HIPAA)
- Multi-cluster secure communication
## Core Concepts
### 1. mTLS Flow
```
┌─────────┐ ┌─────────┐
│ Service │ │ Service │
│ A │ │ B │
└────┬────┘ └────┬────┘
│ │
┌────┴────┐ TLS Handshake ┌────┴────┐
│ Proxy │◄───────────────────────────►│ Proxy │
│(Sidecar)│ 1. ClientHello │(Sidecar)│
│ │ 2. ServerHello + Cert │ │
│ │ 3. Client Cert │ │
│ │ 4. Verify Both Certs │ │
│ │ 5. Encrypted Channel │ │
└─────────┘ └─────────┘
```
### 2. Certificate Hierarchy
```
Root CA (Self-signed, long-lived)
├── Intermediate CA (Cluster-level)
│ │
│ ├── Workload Cert (Service A)
│ └── Workload Cert (Service B)
└── Intermediate CA (Multi-cluster)
└── Cross-cluster certs
```
## Templates
### Template 1: Istio mTLS (Strict Mode)
```yaml
# Enable strict mTLS mesh-wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system
spec:
mtls:
mode: STRICT
---
# Namespace-level override (permissive for migration)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: legacy-namespace
spec:
mtls:
mode: PERMISSIVE
---
# Workload-specific policy
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: payment-service
namespace: production
spec:
selector:
matchLabels:
app: payment-service
mtls:
mode: STRICT
portLevelMtls:
8080:
mode: STRICT
9090:
mode: DISABLE # Metrics port, no mTLS
```
### Template 2: Istio Destination Rule for mTLS
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: default
namespace: istio-system
spec:
host: "*.local"
trafficPolicy:
tls:
mode: ISTIO_MUTUAL
---
# TLS to external service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: external-api
spec:
host: api.external.com
trafficPolicy:
tls:
mode: SIMPLE
caCertificates: /etc/certs/external-ca.pem
---
# Mutual TLS to external service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: partner-api
spec:
host: api.partner.com
trafficPolicy:
tls:
mode: MUTUAL
clientCertificate: /etc/certs/client.pem
privateKey: /etc/certs/client-key.pem
caCertificates: /etc/certs/partner-ca.pem
```
### Template 3: Cert-Manager with Istio
```yaml
# Install cert-manager issuer for Istio
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: istio-ca
spec:
ca:
secretName: istio-ca-secret
---
# Create Istio CA secret
apiVersion: v1
kind: Secret
metadata:
name: istio-ca-secret
namespace: cert-manager
type: kubernetes.io/tls
data:
tls.crt: <base64-encoded-ca-cert>
tls.key: <base64-encoded-ca-key>
---
# Certificate for workload
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: my-service-cert
namespace: my-namespace
spec:
secretName: my-service-tls
duration: 24h
renewBefore: 8h
issuerRef:
name: istio-ca
kind: ClusterIssuer
commonName: my-service.my-namespace.svc.cluster.local
dnsNames:
- my-service
- my-service.my-namespace
- my-service.my-namespace.svc
- my-service.my-namespace.svc.cluster.local
usages:
- server auth
- client auth
```
### Template 4: SPIFFE/SPIRE Integration
```yaml
# SPIRE Server configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: spire-server
namespace: spire
data:
server.conf: |
server {
bind_address = "0.0.0.0"
bind_port = "8081"
trust_domain = "example.org"
data_dir = "/run/spire/data"
log_level = "INFO"
ca_ttl = "168h"
default_x509_svid_ttl = "1h"
}
plugins {
DataStore "sql" {
plugin_data {
database_type = "sqlite3"
connection_string = "/run/spire/data/datastore.sqlite3"
}
}
NodeAttestor "k8s_psat" {
plugin_data {
clusters = {
"demo-cluster" = {
service_account_allow_list = ["spire:spire-agent"]
}
}
}
}
KeyManager "memory" {
plugin_data {}
}
UpstreamAuthority "disk" {
plugin_data {
key_file_path = "/run/spire/secrets/bootstrap.key"
cert_file_path = "/run/spire/secrets/bootstrap.crt"
}
}
}
---
# SPIRE Agent DaemonSet (abbreviated)
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: spire-agent
namespace: spire
spec:
selector:
matchLabels:
app: spire-agent
template:
spec:
containers:
- name: spire-agent
image: ghcr.io/spiffe/spire-agent:1.8.0
volumeMounts:
- name: spire-agent-socket
mountPath: /run/spire/sockets
volumes:
- name: spire-agent-socket
hostPath:
path: /run/spire/sockets
type: DirectoryOrCreate
```
### Template 5: Linkerd mTLS (Automatic)
```yaml
# Linkerd enables mTLS automatically
# Verify with:
# linkerd viz edges deployment -n my-namespace
# For external services without mTLS
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
name: external-api
namespace: my-namespace
spec:
podSelector:
matchLabels:
app: my-app
port: external-api
proxyProtocol: HTTP/1 # or TLS for passthrough
---
# Skip TLS for specific port
apiVersion: v1
kind: Service
metadata:
name: my-service
annotations:
config.linkerd.io/skip-outbound-ports: "3306" # MySQL
```
## Certificate Rotation
```bash
# Istio - Check certificate expiry
istioctl proxy-config secret deploy/my-app -o json | \
jq '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' | \
tr -d '"' | base64 -d | openssl x509 -text -noout
# Force certificate rotation
kubectl rollout restart deployment/my-app
# Check Linkerd identity
linkerd identity -n my-namespace
```
## Debugging mTLS Issues
```bash
# Istio - Check if mTLS is enabled
istioctl authn tls-check my-service.my-namespace.svc.cluster.local
# Verify peer authentication
kubectl get peerauthentication --all-namespaces
# Check destination rules
kubectl get destinationrule --all-namespaces
# Debug TLS handshake
istioctl proxy-config log deploy/my-app --level debug
kubectl logs deploy/my-app -c istio-proxy | grep -i tls
# Linkerd - Check mTLS status
linkerd viz edges deployment -n my-namespace
linkerd viz tap deploy/my-app --to deploy/my-backend
```
## Best Practices
### Do's
- **Start with PERMISSIVE** - Migrate gradually to STRICT
- **Monitor certificate expiry** - Set up alerts
- **Use short-lived certs** - 24h or less for workloads
- **Rotate CA periodically** - Plan for CA rotation
- **Log TLS errors** - For debugging and audit
### Don'ts
- **Don't disable mTLS** - For convenience in production
- **Don't ignore cert expiry** - Automate rotation
- **Don't use self-signed certs** - Use proper CA hierarchy
- **Don't skip verification** - Verify the full chain
## Resources
- [Istio Security](https://istio.io/latest/docs/concepts/security/)
- [SPIFFE/SPIRE](https://spiffe.io/)
- [cert-manager](https://cert-manager.io/)
- [Zero Trust Architecture (NIST)](https://www.nist.gov/publications/zero-trust-architecture)

View File

@@ -0,0 +1,383 @@
---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---
# Service Mesh Observability
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
## When to Use This Skill
- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity
## Core Concepts
### 1. Three Pillars of Observability
```
┌─────────────────────────────────────────────────────┐
│ Observability │
├─────────────────┬─────────────────┬─────────────────┤
│ Metrics │ Traces │ Logs │
│ │ │ │
│ • Request rate │ • Span context │ • Access logs │
│ • Error rate │ • Latency │ • Error details │
│ • Latency P50 │ • Dependencies │ • Debug info │
│ • Saturation │ • Bottlenecks │ • Audit trail │
└─────────────────┴─────────────────┴─────────────────┘
```
### 2. Golden Signals for Mesh
| Signal | Description | Alert Threshold |
|--------|-------------|-----------------|
| **Latency** | Request duration P50, P99 | P99 > 500ms |
| **Traffic** | Requests per second | Anomaly detection |
| **Errors** | 5xx error rate | > 1% |
| **Saturation** | Resource utilization | > 80% |
## Templates
### Template 1: Istio with Prometheus & Grafana
```yaml
# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus
namespace: istio-system
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'istio-mesh'
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- istio-system
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
action: keep
regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: istio-mesh
namespace: istio-system
spec:
selector:
matchLabels:
app: istiod
endpoints:
- port: http-monitoring
interval: 15s
```
### Template 2: Key Istio Metrics Queries
```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
/ sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
# P99 latency
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
# Request size
histogram_quantile(0.99,
sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
```
### Template 3: Jaeger Distributed Tracing
```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
enableTracing: true
defaultConfig:
tracing:
sampling: 100.0 # 100% in dev, lower in prod
zipkin:
address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: istio-system
spec:
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.50
ports:
- containerPort: 5775 # UDP
- containerPort: 6831 # Thrift
- containerPort: 6832 # Thrift
- containerPort: 5778 # Config
- containerPort: 16686 # UI
- containerPort: 14268 # HTTP
- containerPort: 14250 # gRPC
- containerPort: 9411 # Zipkin
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
```
### Template 4: Linkerd Viz Dashboard
```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -
# Access dashboard
linkerd viz dashboard
# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app
# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend
# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend
# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```
### Template 5: Grafana Dashboard JSON
```json
{
"dashboard": {
"title": "Service Mesh Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 1, "color": "yellow"},
{"value": 5, "color": "red"}
]
}
}
}
},
{
"title": "P99 Latency",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Service Topology",
"type": "nodeGraph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
}
]
}
]
}
}
```
### Template 6: Kiali Service Mesh Visualization
```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
name: kiali
namespace: istio-system
spec:
auth:
strategy: anonymous # or openid, token
deployment:
accessible_namespaces:
- "**"
external_services:
prometheus:
url: http://prometheus.istio-system:9090
tracing:
url: http://jaeger-query.istio-system:16686
grafana:
url: http://grafana.istio-system:3000
```
### Template 7: OpenTelemetry Integration
```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
zipkin:
endpoint: 0.0.0.0:9411
processors:
batch:
timeout: 10s
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp, zipkin]
processors: [batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: otel
randomSamplingPercentage: 10
```
## Alerting Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: mesh-alerts
namespace: istio-system
spec:
groups:
- name: mesh.rules
rules:
- alert: HighErrorRate
expr: |
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
/ sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.destination_service_name }}"
- alert: HighLatency
expr: |
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
by (le, destination_service_name)) > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency for {{ $labels.destination_service_name }}"
- alert: MeshCertExpiring
expr: |
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
labels:
severity: warning
annotations:
summary: "Mesh certificate expiring in less than 7 days"
```
## Best Practices
### Do's
- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers
### Don'ts
- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs
## Resources
- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
- [OpenTelemetry](https://opentelemetry.io/)
- [Kiali](https://kiali.io/)