feat: add 5 new specialized agents with 20 skills

Add domain expert agents with comprehensive skill sets:
- service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability
- event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas
- vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search
- monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces
- threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements

Update all documentation to reflect correct counts:
- 67 plugins, 99 agents, 107 skills, 71 commands
Seth Hobson
2025-12-16 16:00:58 -05:00
parent c7ad381360
commit 01d93fc227
58 changed files with 24830 additions and 50 deletions


@@ -0,0 +1,41 @@
# Service Mesh Expert
Expert service mesh architect specializing in Istio, Linkerd, and cloud-native networking patterns. Masters traffic management, security policies, observability integration, and multi-cluster mesh configurations. Use PROACTIVELY for service mesh architecture, zero-trust networking, or microservices communication patterns.
## Capabilities
- Istio and Linkerd installation, configuration, and optimization
- Traffic management: routing, load balancing, circuit breaking, retries
- mTLS configuration and certificate management
- Service mesh observability with distributed tracing
- Multi-cluster and multi-cloud mesh federation
- Progressive delivery with canary and blue-green deployments
- Security policies and authorization rules
## When to Use
- Implementing service-to-service communication in Kubernetes
- Setting up zero-trust networking with mTLS
- Configuring traffic splitting for canary deployments
- Debugging service mesh connectivity issues
- Implementing rate limiting and circuit breakers
- Setting up cross-cluster service discovery
## Workflow
1. Assess current infrastructure and requirements
2. Design mesh topology and traffic policies
3. Implement security policies (mTLS, AuthorizationPolicy)
4. Configure observability (metrics, traces, logs)
5. Set up traffic management rules
6. Test failover and resilience patterns
7. Document operational runbooks
## Best Practices
- Start with permissive mode, gradually enforce strict mTLS
- Use namespaces for policy isolation
- Implement circuit breakers before they're needed
- Monitor mesh overhead (latency, resource usage)
- Keep sidecar resources appropriately sized
- Use destination rules for consistent load balancing


@@ -0,0 +1,325 @@
---
name: istio-traffic-management
description: Configure Istio traffic management including routing, load balancing, circuit breakers, and canary deployments. Use when implementing service mesh traffic policies, progressive delivery, or resilience patterns.
---
# Istio Traffic Management
Comprehensive guide to Istio traffic management for production service mesh deployments.
## When to Use This Skill
- Configuring service-to-service routing
- Implementing canary or blue-green deployments
- Setting up circuit breakers and retries
- Load balancing configuration
- Traffic mirroring for testing
- Fault injection for chaos engineering
## Core Concepts
### 1. Traffic Management Resources
| Resource | Purpose | Scope |
|----------|---------|-------|
| **VirtualService** | Route traffic to destinations | Host-based |
| **DestinationRule** | Define policies after routing | Service-based |
| **Gateway** | Configure ingress/egress | Cluster edge |
| **ServiceEntry** | Add external services | Mesh-wide |
### 2. Traffic Flow
```
Client → Gateway → VirtualService → DestinationRule → Service
                     (routing)        (policies)       (pods)
```
## Templates
### Template 1: Basic Routing
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-route
  namespace: bookinfo
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-destination
  namespace: bookinfo
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
  - name: v3
    labels:
      version: v3
```
### Template 2: Canary Deployment
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service-canary
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
        subset: stable
      weight: 90
    - destination:
        host: my-service
        subset: canary
      weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service-dr
spec:
  host: my-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary
```
### Template 3: Circuit Breaker
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: circuit-breaker
spec:
  host: my-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
```
### Template 4: Retry and Timeout
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-retry
spec:
  hosts:
  - ratings
  http:
  - route:
    - destination:
        host: ratings
    timeout: 10s
    retries:
      attempts: 3
      perTryTimeout: 3s
      retryOn: connect-failure,refused-stream,unavailable,cancelled,retriable-4xx,503
      retryRemoteLocalities: true
```
### Template 5: Traffic Mirroring
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mirror-traffic
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
        subset: v1
    mirror:
      host: my-service
      subset: v2
    mirrorPercentage:
      value: 100.0
```
### Template 6: Fault Injection
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fault-injection
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percentage:
          value: 10
        fixedDelay: 5s
      abort:
        percentage:
          value: 5
        httpStatus: 503
    route:
    - destination:
        host: ratings
```
### Template 7: Ingress Gateway
```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: my-tls-secret
    hosts:
    - "*.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-vs
spec:
  hosts:
  - "api.example.com"
  gateways:
  - my-gateway
  http:
  - match:
    - uri:
        prefix: /api/v1
    route:
    - destination:
        host: api-service
        port:
          number: 8080
```
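### Template 8: External Service (ServiceEntry)
The resource table under Core Concepts also lists **ServiceEntry**, which none of the templates above use. A minimal sketch (the hostname and port are placeholders) that registers an external API with the mesh so DestinationRule policies such as TLS origination or outlier detection can apply to it:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-payments-api  # hypothetical external dependency
spec:
  hosts:
  - api.payments.example.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: https
    protocol: TLS
```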
## Load Balancing Strategies
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: load-balancing
spec:
  host: my-service
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN  # or LEAST_CONN, RANDOM, PASSTHROUGH
---
# Consistent hashing for sticky sessions
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: sticky-sessions
spec:
  host: my-service
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: x-user-id
        # or: httpCookie, useSourceIp, httpQueryParameterName
```
## Best Practices
### Do's
- **Start simple** - Add complexity incrementally
- **Use subsets** - Version your services clearly
- **Set timeouts** - Always configure reasonable timeouts
- **Enable retries** - But with backoff and limits
- **Monitor** - Use Kiali and Jaeger for visibility
### Don'ts
- **Don't over-retry** - Can cause cascading failures
- **Don't ignore outlier detection** - Enable circuit breakers
- **Don't mirror to production** - Mirror to test environments
- **Don't skip canary** - Test with small traffic percentage first
## Debugging Commands
```bash
# Check VirtualService configuration
istioctl analyze
# View effective routes
istioctl proxy-config routes deploy/my-app -o json
# Check endpoint discovery
istioctl proxy-config endpoints deploy/my-app
# Debug traffic
istioctl proxy-config log deploy/my-app --level debug
```
## Resources
- [Istio Traffic Management](https://istio.io/latest/docs/concepts/traffic-management/)
- [Virtual Service Reference](https://istio.io/latest/docs/reference/config/networking/virtual-service/)
- [Destination Rule Reference](https://istio.io/latest/docs/reference/config/networking/destination-rule/)


@@ -0,0 +1,309 @@
---
name: linkerd-patterns
description: Implement Linkerd service mesh patterns for lightweight, security-focused service mesh deployments. Use when setting up Linkerd, configuring traffic policies, or implementing zero-trust networking with minimal overhead.
---
# Linkerd Patterns
Production patterns for Linkerd service mesh - the lightweight, security-first service mesh for Kubernetes.
## When to Use This Skill
- Setting up a lightweight service mesh
- Implementing automatic mTLS
- Configuring traffic splits for canary deployments
- Setting up service profiles for per-route metrics
- Implementing retries and timeouts
- Multi-cluster service mesh
## Core Concepts
### 1. Linkerd Architecture
```
┌─────────────────────────────────────────────────┐
│                  Control Plane                   │
│ ┌─────────────┐ ┌──────────┐ ┌────────────────┐ │
│ │ destination │ │ identity │ │ proxy-injector │ │
│ └─────────────┘ └──────────┘ └────────────────┘ │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│                   Data Plane                     │
│   ┌─────┐    ┌─────┐    ┌─────┐                  │
│   │proxy│────│proxy│────│proxy│                  │
│   └─────┘    └─────┘    └─────┘                  │
│      │          │          │                     │
│   ┌──┴──┐    ┌──┴──┐    ┌──┴──┐                  │
│   │ app │    │ app │    │ app │                  │
│   └─────┘    └─────┘    └─────┘                  │
└─────────────────────────────────────────────────┘
```
### 2. Key Resources
| Resource | Purpose |
|----------|---------|
| **ServiceProfile** | Per-route metrics, retries, timeouts |
| **TrafficSplit** | Canary deployments, A/B testing |
| **Server** | Define server-side policies |
| **ServerAuthorization** | Access control policies |
## Templates
### Template 1: Mesh Installation
```bash
# Install CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
# Validate cluster
linkerd check --pre
# Install CRDs
linkerd install --crds | kubectl apply -f -
# Install control plane
linkerd install | kubectl apply -f -
# Verify installation
linkerd check
# Install viz extension (optional)
linkerd viz install | kubectl apply -f -
```
### Template 2: Inject Namespace
```yaml
# Automatic injection for namespace
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  annotations:
    linkerd.io/inject: enabled
---
# Or inject specific deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    linkerd.io/inject: enabled
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
```
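A deployment that is already running can be meshed without editing YAML by piping it through the CLI — a short sketch, assuming the `my-app` namespace and deployment from the template above:
```bash
# Add the inject annotation to an existing deployment and re-apply it
kubectl get deploy/my-app -n my-app -o yaml | linkerd inject - | kubectl apply -f -

# Confirm the proxy is running alongside the app
linkerd check --proxy -n my-app
```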
### Template 3: Service Profile with Retries
```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: my-service.my-namespace.svc.cluster.local
  namespace: my-namespace
spec:
  routes:
  - name: GET /api/users
    condition:
      method: GET
      pathRegex: /api/users
    responseClasses:
    - condition:
        status:
          min: 500
          max: 599
      isFailure: true
    isRetryable: true
  - name: POST /api/users
    condition:
      method: POST
      pathRegex: /api/users
    # POST not retryable by default
    isRetryable: false
  - name: GET /api/users/{id}
    condition:
      method: GET
      pathRegex: /api/users/[^/]+
    timeout: 5s
    isRetryable: true
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
```
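Writing ServiceProfiles by hand is tedious; the `linkerd profile` command can generate a skeleton from an OpenAPI spec or from live traffic. A sketch with placeholder file, service, and namespace names:
```bash
# Generate a ServiceProfile from an OpenAPI/Swagger definition
linkerd profile --open-api swagger.json my-service -n my-namespace | kubectl apply -f -

# Or watch live traffic for 10 seconds and derive the routes from it
linkerd viz profile --tap deploy/my-service --tap-duration 10s my-service -n my-namespace | kubectl apply -f -
```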
### Template 4: Traffic Split (Canary)
```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: my-service-canary
  namespace: my-namespace
spec:
  service: my-service
  backends:
  - service: my-service-stable
    weight: 900m  # 90%
  - service: my-service-canary
    weight: 100m  # 10%
```
### Template 5: Server Authorization Policy
```yaml
# Define the server
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: my-service-http
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      app: my-service
  port: http
  proxyProtocol: HTTP/1
---
# Allow traffic from specific clients
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: allow-frontend
  namespace: my-namespace
spec:
  server:
    name: my-service-http
  client:
    meshTLS:
      serviceAccounts:
      - name: frontend
        namespace: my-namespace
---
# Allow unauthenticated traffic (e.g., from ingress)
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: allow-ingress
  namespace: my-namespace
spec:
  server:
    name: my-service-http
  client:
    unauthenticated: true
    networks:
    - cidr: 10.0.0.0/8
```
### Template 6: HTTPRoute for Advanced Routing
```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: HTTPRoute
metadata:
  name: my-route
  namespace: my-namespace
spec:
  parentRefs:
  - name: my-service
    kind: Service
    group: core
    port: 8080
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api/v2
    - headers:
      - name: x-api-version
        value: v2
    backendRefs:
    - name: my-service-v2
      port: 8080
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: my-service-v1
      port: 8080
```
### Template 7: Multi-cluster Setup
```bash
# On each cluster, install with cluster credentials
linkerd multicluster install | kubectl apply -f -
# Link clusters
linkerd multicluster link --cluster-name west \
--api-server-address https://west.example.com:6443 \
| kubectl apply -f -
# Export a service to other clusters
kubectl label svc/my-service mirror.linkerd.io/exported=true
# Verify cross-cluster connectivity
linkerd multicluster check
linkerd multicluster gateways
```
## Monitoring Commands
```bash
# Live traffic view
linkerd viz top deploy/my-app
# Per-route metrics
linkerd viz routes deploy/my-app
# Check proxy status
linkerd viz stat deploy -n my-namespace
# View service dependencies
linkerd viz edges deploy -n my-namespace
# Dashboard
linkerd viz dashboard
```
## Debugging
```bash
# Check injection status
linkerd check --proxy -n my-namespace
# View proxy logs
kubectl logs deploy/my-app -c linkerd-proxy
# Debug identity/TLS
linkerd identity -n my-namespace
# Tap traffic (live)
linkerd viz tap deploy/my-app --to deploy/my-backend
```
## Best Practices
### Do's
- **Enable mTLS everywhere** - It's automatic with Linkerd
- **Use ServiceProfiles** - Get per-route metrics and retries
- **Set retry budgets** - Prevent retry storms
- **Monitor golden metrics** - Success rate, latency, throughput
### Don'ts
- **Don't skip check** - Always run `linkerd check` after changes
- **Don't over-configure** - Linkerd defaults are sensible
- **Don't ignore ServiceProfiles** - They unlock advanced features
- **Don't forget timeouts** - Set appropriate values per route
## Resources
- [Linkerd Documentation](https://linkerd.io/2.14/overview/)
- [Service Profiles](https://linkerd.io/2.14/features/service-profiles/)
- [Authorization Policy](https://linkerd.io/2.14/features/server-policy/)


@@ -0,0 +1,347 @@
---
name: mtls-configuration
description: Configure mutual TLS (mTLS) for zero-trust service-to-service communication. Use when implementing zero-trust networking, certificate management, or securing internal service communication.
---
# mTLS Configuration
Comprehensive guide to implementing mutual TLS for zero-trust service mesh communication.
## When to Use This Skill
- Implementing zero-trust networking
- Securing service-to-service communication
- Certificate rotation and management
- Debugging TLS handshake issues
- Compliance requirements (PCI-DSS, HIPAA)
- Multi-cluster secure communication
## Core Concepts
### 1. mTLS Flow
```
┌─────────┐                             ┌─────────┐
│ Service │                             │ Service │
│    A    │                             │    B    │
└────┬────┘                             └────┬────┘
     │                                       │
┌────┴────┐        TLS Handshake        ┌────┴────┐
│  Proxy  │◄───────────────────────────►│  Proxy  │
│(Sidecar)│  1. ClientHello             │(Sidecar)│
│         │  2. ServerHello + Cert      │         │
│         │  3. Client Cert             │         │
│         │  4. Verify Both Certs       │         │
│         │  5. Encrypted Channel       │         │
└─────────┘                             └─────────┘
```
### 2. Certificate Hierarchy
```
Root CA (Self-signed, long-lived)
├── Intermediate CA (Cluster-level)
│   ├── Workload Cert (Service A)
│   └── Workload Cert (Service B)
└── Intermediate CA (Multi-cluster)
    └── Cross-cluster certs
```
## Templates
### Template 1: Istio mTLS (Strict Mode)
```yaml
# Enable strict mTLS mesh-wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Namespace-level override (permissive for migration)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy-namespace
spec:
  mtls:
    mode: PERMISSIVE
---
# Workload-specific policy
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: payment-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-service
  mtls:
    mode: STRICT
  portLevelMtls:
    8080:
      mode: STRICT
    9090:
      mode: DISABLE  # Metrics port, no mTLS
```
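STRICT mTLS authenticates workloads but does not restrict which identities may call a service; an AuthorizationPolicy is the usual companion. A hedged sketch that only admits the `frontend` service account (all names are placeholders matching Template 1):
```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payment-service-allow-frontend
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-service
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/production/sa/frontend"]
    to:
    - operation:
        methods: ["GET", "POST"]
```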
### Template 2: Istio Destination Rule for mTLS
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: default
  namespace: istio-system
spec:
  host: "*.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
# TLS to external service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: external-api
spec:
  host: api.external.com
  trafficPolicy:
    tls:
      mode: SIMPLE
      caCertificates: /etc/certs/external-ca.pem
---
# Mutual TLS to external service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: partner-api
spec:
  host: api.partner.com
  trafficPolicy:
    tls:
      mode: MUTUAL
      clientCertificate: /etc/certs/client.pem
      privateKey: /etc/certs/client-key.pem
      caCertificates: /etc/certs/partner-ca.pem
```
### Template 3: Cert-Manager with Istio
```yaml
# Install cert-manager issuer for Istio
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: istio-ca
spec:
  ca:
    secretName: istio-ca-secret
---
# Create Istio CA secret
apiVersion: v1
kind: Secret
metadata:
  name: istio-ca-secret
  namespace: cert-manager
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded-ca-cert>
  tls.key: <base64-encoded-ca-key>
---
# Certificate for workload
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-cert
  namespace: my-namespace
spec:
  secretName: my-service-tls
  duration: 24h
  renewBefore: 8h
  issuerRef:
    name: istio-ca
    kind: ClusterIssuer
  commonName: my-service.my-namespace.svc.cluster.local
  dnsNames:
  - my-service
  - my-service.my-namespace
  - my-service.my-namespace.svc
  - my-service.my-namespace.svc.cluster.local
  usages:
  - server auth
  - client auth
```
### Template 4: SPIFFE/SPIRE Integration
```yaml
# SPIRE Server configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-server
  namespace: spire
data:
  server.conf: |
    server {
      bind_address = "0.0.0.0"
      bind_port = "8081"
      trust_domain = "example.org"
      data_dir = "/run/spire/data"
      log_level = "INFO"
      ca_ttl = "168h"
      default_x509_svid_ttl = "1h"
    }
    plugins {
      DataStore "sql" {
        plugin_data {
          database_type = "sqlite3"
          connection_string = "/run/spire/data/datastore.sqlite3"
        }
      }
      NodeAttestor "k8s_psat" {
        plugin_data {
          clusters = {
            "demo-cluster" = {
              service_account_allow_list = ["spire:spire-agent"]
            }
          }
        }
      }
      KeyManager "memory" {
        plugin_data {}
      }
      UpstreamAuthority "disk" {
        plugin_data {
          key_file_path = "/run/spire/secrets/bootstrap.key"
          cert_file_path = "/run/spire/secrets/bootstrap.crt"
        }
      }
    }
---
# SPIRE Agent DaemonSet (abbreviated)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spire-agent
  namespace: spire
spec:
  selector:
    matchLabels:
      app: spire-agent
  template:
    metadata:
      labels:
        app: spire-agent
    spec:
      containers:
      - name: spire-agent
        image: ghcr.io/spiffe/spire-agent:1.8.0
        volumeMounts:
        - name: spire-agent-socket
          mountPath: /run/spire/sockets
      volumes:
      - name: spire-agent-socket
        hostPath:
          path: /run/spire/sockets
          type: DirectoryOrCreate
```
### Template 5: Linkerd mTLS (Automatic)
```yaml
# Linkerd enables mTLS automatically
# Verify with:
#   linkerd viz edges deployment -n my-namespace

# For external services without mTLS
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: external-api
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      app: my-app
  port: external-api
  proxyProtocol: HTTP/1  # or TLS for passthrough
---
# Bypass the proxy for specific outbound ports
# (set on the workload's pod template, not on a Service)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/skip-outbound-ports: "3306"  # MySQL
```
## Certificate Rotation
```bash
# Istio - Check certificate expiry
istioctl proxy-config secret deploy/my-app -o json | \
jq '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' | \
tr -d '"' | base64 -d | openssl x509 -text -noout
# Force certificate rotation
kubectl rollout restart deployment/my-app
# Check Linkerd identity
linkerd identity -n my-namespace
```
## Debugging mTLS Issues
```bash
# Istio - Check if mTLS is enabled for a workload
istioctl x describe pod <my-app-pod> -n my-namespace
# Verify peer authentication
kubectl get peerauthentication --all-namespaces
# Check destination rules
kubectl get destinationrule --all-namespaces
# Debug TLS handshake
istioctl proxy-config log deploy/my-app --level debug
kubectl logs deploy/my-app -c istio-proxy | grep -i tls
# Linkerd - Check mTLS status
linkerd viz edges deployment -n my-namespace
linkerd viz tap deploy/my-app --to deploy/my-backend
```
## Best Practices
### Do's
- **Start with PERMISSIVE** - Migrate gradually to STRICT
- **Monitor certificate expiry** - Set up alerts
- **Use short-lived certs** - 24h or less for workloads
- **Rotate CA periodically** - Plan for CA rotation
- **Log TLS errors** - For debugging and audit
### Don'ts
- **Don't disable mTLS for convenience** - Keep it enforced in production
- **Don't ignore cert expiry** - Automate rotation
- **Don't hand out ad-hoc self-signed workload certs** - Issue them from a proper CA hierarchy
- **Don't skip verification** - Verify the full chain
## Resources
- [Istio Security](https://istio.io/latest/docs/concepts/security/)
- [SPIFFE/SPIRE](https://spiffe.io/)
- [cert-manager](https://cert-manager.io/)
- [Zero Trust Architecture (NIST)](https://www.nist.gov/publications/zero-trust-architecture)


@@ -0,0 +1,383 @@
---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---
# Service Mesh Observability
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
## When to Use This Skill
- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity
## Core Concepts
### 1. Three Pillars of Observability
```
┌─────────────────────────────────────────────────────┐
│                    Observability                     │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
```
### 2. Golden Signals for Mesh
| Signal | Description | Alert Threshold |
|--------|-------------|-----------------|
| **Latency** | Request duration P50, P99 | P99 > 500ms |
| **Traffic** | Requests per second | Anomaly detection |
| **Errors** | 5xx error rate | > 1% |
| **Saturation** | Resource utilization | > 80% |
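The golden signals map directly onto SLO recording rules. A sketch of an availability SLI plus a simple burn alert built on the same `istio_requests_total` metric used in the templates below; the 99.9% target and 30m window are illustrative, not prescriptive:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-slo
  namespace: istio-system
spec:
  groups:
  - name: slo.rules
    rules:
    # Availability SLI: share of non-5xx responses over the last 30 minutes
    - record: service:availability:ratio_rate30m
      expr: |
        1 - (
          sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[30m])) by (destination_service_name)
          / sum(rate(istio_requests_total{reporter="destination"}[30m])) by (destination_service_name)
        )
    # Page when availability drops below the 99.9% target
    - alert: AvailabilitySLOBreach
      expr: service:availability:ratio_rate30m < 0.999
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "{{ $labels.destination_service_name }} is burning its availability SLO"
```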
## Templates
### Template 1: Istio with Prometheus & Grafana
```yaml
# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'istio-mesh'
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - istio-system
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
  - port: http-monitoring
    interval: 15s
```
### Template 2: Key Istio Metrics Queries
```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
/ sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
# P99 latency
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
# Request size
histogram_quantile(0.99,
sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
```
### Template 3: Jaeger Distributed Tracing
```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0  # 100% in dev, lower in prod
        zipkin:
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.50
        ports:
        - containerPort: 5775   # UDP
        - containerPort: 6831   # Thrift
        - containerPort: 6832   # Thrift
        - containerPort: 5778   # Config
        - containerPort: 16686  # UI
        - containerPort: 14268  # HTTP
        - containerPort: 14250  # gRPC
        - containerPort: 9411   # Zipkin
        env:
        - name: COLLECTOR_ZIPKIN_HOST_PORT
          value: ":9411"
```
### Template 4: Linkerd Viz Dashboard
```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -
# Access dashboard
linkerd viz dashboard
# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app
# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend
# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend
# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```
### Template 5: Grafana Dashboard JSON
```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```
### Template 6: Kiali Service Mesh Visualization
```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous  # or openid, token
  deployment:
    accessible_namespaces:
    - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```
### Template 7: OpenTelemetry Integration
```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411
    processors:
      batch:
        timeout: 10s
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: otel
    randomSamplingPercentage: 10
```
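The pillars diagram above lists logs alongside metrics and traces, but none of the templates enable them. Envoy access logging can be switched on with the same Telemetry API — a minimal sketch using the built-in `envoy` provider:
```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: access-logging
  namespace: istio-system
spec:
  accessLogging:
  - providers:
    - name: envoy  # built-in provider that writes to the sidecar's stdout
```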
## Alerting Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
  - name: mesh.rules
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
          / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate for {{ $labels.destination_service_name }}"
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
          by (le, destination_service_name)) > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High P99 latency for {{ $labels.destination_service_name }}"
    - alert: MeshCertExpiring
      expr: |
        (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
      labels:
        severity: warning
      annotations:
        summary: "Mesh certificate expiring in less than 7 days"
```
## Best Practices
### Do's
- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers
### Don'ts
- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs
## Resources
- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
- [OpenTelemetry](https://opentelemetry.io/)
- [Kiali](https://kiali.io/)