mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
feat: add 5 new specialized agents with 20 skills
Add domain expert agents with comprehensive skill sets:
- service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability
- event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas
- vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search
- monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces
- threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements

Update all documentation to reflect correct counts:
- 67 plugins, 99 agents, 107 skills, 71 commands
@@ -0,0 +1,325 @@
---
name: istio-traffic-management
description: Configure Istio traffic management including routing, load balancing, circuit breakers, and canary deployments. Use when implementing service mesh traffic policies, progressive delivery, or resilience patterns.
---

# Istio Traffic Management

Comprehensive guide to Istio traffic management for production service mesh deployments.

## When to Use This Skill

- Configuring service-to-service routing
- Implementing canary or blue-green deployments
- Setting up circuit breakers and retries
- Load balancing configuration
- Traffic mirroring for testing
- Fault injection for chaos engineering

## Core Concepts

### 1. Traffic Management Resources

| Resource | Purpose | Scope |
|----------|---------|-------|
| **VirtualService** | Route traffic to destinations | Host-based |
| **DestinationRule** | Define policies after routing | Service-based |
| **Gateway** | Configure ingress/egress | Cluster edge |
| **ServiceEntry** | Add external services | Mesh-wide |

### 2. Traffic Flow

```
Client → Gateway → VirtualService → DestinationRule → Service
                   (routing)        (policies)        (pods)
```

## Templates

### Template 1: Basic Routing

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-route
  namespace: bookinfo
spec:
  hosts:
    - reviews
  http:
    - match:
        - headers:
            end-user:
              exact: jason
      route:
        - destination:
            host: reviews
            subset: v2
    - route:
        - destination:
            host: reviews
            subset: v1
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-destination
  namespace: bookinfo
spec:
  host: reviews
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
    - name: v3
      labels:
        version: v3
```

### Template 2: Canary Deployment

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service-canary
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: stable
          weight: 90
        - destination:
            host: my-service
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service-dr
spec:
  host: my-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
```
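
To make the 90/10 split concrete, the following Python sketch shows one way weighted destination selection can work. It is an illustration under simplifying assumptions, not the proxy's actual algorithm; `WEIGHTS`, `pick_subset`, and `simulate` are hypothetical names introduced here.

```python
import random

# Hypothetical weights mirroring the VirtualService above (they sum to 100).
WEIGHTS = {"stable": 90, "canary": 10}


def pick_subset(weights: dict[str, int], point: int) -> str:
    """Map a point in [0, total weight) to a subset via cumulative weights."""
    cumulative = 0
    for subset, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return subset
    raise ValueError("point must be smaller than the total weight")


def simulate(requests: int, seed: int = 0) -> dict[str, int]:
    """Count how many simulated requests land on each subset."""
    rng = random.Random(seed)
    total = sum(WEIGHTS.values())
    counts = {subset: 0 for subset in WEIGHTS}
    for _ in range(requests):
        counts[pick_subset(WEIGHTS, rng.randrange(total))] += 1
    return counts


if __name__ == "__main__":
    print(simulate(10_000))
```

Over many requests the canary share converges on roughly 10%, but any short window can deviate, so judge a canary on enough traffic rather than on the first few requests.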

### Template 3: Circuit Breaker

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: circuit-breaker
spec:
  host: my-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
```
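
The interaction between `consecutive5xxErrors` and `maxEjectionPercent` can be sketched in Python as below. This is a deliberately simplified model: it ignores `interval`, `baseEjectionTime`, and `minHealthPercent`, and the `OutlierDetector` class is a hypothetical helper, not part of Istio or Envoy.

```python
from dataclasses import dataclass, field


@dataclass
class OutlierDetector:
    """Simplified model of consecutive-5xx ejection with an ejection cap."""

    hosts: list[str]
    consecutive_5xx_errors: int = 5  # mirrors consecutive5xxErrors
    max_ejection_percent: int = 50   # mirrors maxEjectionPercent
    failures: dict[str, int] = field(default_factory=dict)
    ejected: set[str] = field(default_factory=set)

    def record(self, host: str, status: int) -> None:
        """Record a response; any non-5xx response resets the host's streak."""
        if 500 <= status <= 599:
            self.failures[host] = self.failures.get(host, 0) + 1
        else:
            self.failures[host] = 0
        if self.failures[host] >= self.consecutive_5xx_errors:
            self._try_eject(host)

    def _try_eject(self, host: str) -> None:
        """Eject the host unless the cap on ejected hosts is already reached."""
        limit = len(self.hosts) * self.max_ejection_percent // 100
        if host not in self.ejected and len(self.ejected) < limit:
            self.ejected.add(host)

    def healthy_hosts(self) -> list[str]:
        return [h for h in self.hosts if h not in self.ejected]


if __name__ == "__main__":
    detector = OutlierDetector(hosts=["a", "b", "c", "d"])
    for _ in range(5):
        detector.record("a", 503)
    print(detector.healthy_hosts())
```

The cap matters during a widespread failure: without it, every host could be ejected at once and no traffic would be served at all.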

### Template 4: Retry and Timeout

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-retry
spec:
  hosts:
    - ratings
  http:
    - route:
        - destination:
            host: ratings
      timeout: 10s
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: connect-failure,refused-stream,unavailable,cancelled,retriable-4xx,503
        retryRemoteLocalities: true
```
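
It is worth checking how `timeout`, `attempts`, and `perTryTimeout` combine. The Python sketch below does the arithmetic, assuming that `attempts` counts retries on top of the initial try and ignoring the back-off delay between retries. The function names are hypothetical.

```python
def worst_case_seconds(timeout_s: float, attempts: int, per_try_timeout_s: float) -> float:
    """Upper bound on how long a caller waits before the request fails."""
    total_tries = 1 + attempts  # initial try plus the configured retries
    return min(timeout_s, total_tries * per_try_timeout_s)


def full_tries_within_timeout(timeout_s: float, attempts: int, per_try_timeout_s: float) -> int:
    """Number of tries that can run for their full per-try timeout."""
    return min(1 + attempts, int(timeout_s // per_try_timeout_s))


if __name__ == "__main__":
    # Values from the template above: timeout 10s, attempts 3, perTryTimeout 3s.
    print(worst_case_seconds(10, 3, 3))
    print(full_tries_within_timeout(10, 3, 3))
```

With the template's values, four tries of 3s each would need 12s, so the 10s overall timeout cuts the last try short. If every retry should get its full window, raise `timeout` or lower `perTryTimeout`.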

### Template 5: Traffic Mirroring

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mirror-traffic
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1
      mirror:
        host: my-service
        subset: v2
      mirrorPercentage:
        value: 100.0
```

### Template 6: Fault Injection

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fault-injection
spec:
  hosts:
    - ratings
  http:
    - fault:
        delay:
          percentage:
            value: 10
          fixedDelay: 5s
        abort:
          percentage:
            value: 5
          httpStatus: 503
      route:
        - destination:
            host: ratings
```

### Template 7: Ingress Gateway

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: my-tls-secret
      hosts:
        - "*.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-vs
spec:
  hosts:
    - "api.example.com"
  gateways:
    - my-gateway
  http:
    - match:
        - uri:
            prefix: /api/v1
      route:
        - destination:
            host: api-service
            port:
              number: 8080
```

## Load Balancing Strategies

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: load-balancing
spec:
  host: my-service
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN # or LEAST_REQUEST, RANDOM, PASSTHROUGH (LEAST_CONN is deprecated)
---
# Consistent hashing for sticky sessions
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: sticky-sessions
spec:
  host: my-service
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: x-user-id
        # or: httpCookie, useSourceIp, httpQueryParameterName
```
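
The idea behind `consistentHash` can be illustrated with a small hash ring in Python. This is a generic textbook sketch, not the algorithm the proxy uses internally; `HashRing`, the pod names, and the replica count are assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes per host."""

    def __init__(self, hosts: list[str], replicas: int = 100) -> None:
        # Each host is placed on the ring `replicas` times to even out the load.
        self._ring = sorted(
            (self._hash(f"{host}#{i}"), host)
            for host in hosts
            for i in range(replicas)
        )
        self._keys = [key for key, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def pick(self, header_value: str) -> str:
        """Return the host owning the first ring position after the key's hash."""
        index = bisect.bisect_right(self._keys, self._hash(header_value))
        return self._ring[index % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["pod-a", "pod-b", "pod-c"])
    # The same x-user-id value always maps to the same pod.
    print(ring.pick("user-42"), ring.pick("user-42"))
```

When a host leaves the ring, only the keys it owned move elsewhere. Stickiness is therefore best effort: scaling events and pod restarts still reshuffle part of the traffic.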

## Best Practices

### Do's
- **Start simple** - Add complexity incrementally
- **Use subsets** - Version your services clearly
- **Set timeouts** - Always configure reasonable timeouts
- **Enable retries** - But with backoff and limits
- **Monitor** - Use Kiali and Jaeger for visibility

### Don'ts
- **Don't over-retry** - Can cause cascading failures
- **Don't ignore outlier detection** - Enable circuit breakers
- **Don't mirror to production** - Mirror to test environments
- **Don't skip canary** - Test with small traffic percentage first

## Debugging Commands

```bash
# Check VirtualService configuration
istioctl analyze

# View effective routes
istioctl proxy-config routes deploy/my-app -o json

# Check endpoint discovery
istioctl proxy-config endpoints deploy/my-app

# Debug traffic
istioctl proxy-config log deploy/my-app --level debug
```

## Resources

- [Istio Traffic Management](https://istio.io/latest/docs/concepts/traffic-management/)
- [Virtual Service Reference](https://istio.io/latest/docs/reference/config/networking/virtual-service/)
- [Destination Rule Reference](https://istio.io/latest/docs/reference/config/networking/destination-rule/)
309
plugins/cloud-infrastructure/skills/linkerd-patterns/SKILL.md
Normal file
@@ -0,0 +1,309 @@
---
name: linkerd-patterns
description: Implement Linkerd service mesh patterns for lightweight, security-focused service mesh deployments. Use when setting up Linkerd, configuring traffic policies, or implementing zero-trust networking with minimal overhead.
---

# Linkerd Patterns

Production patterns for Linkerd service mesh - the lightweight, security-first service mesh for Kubernetes.

## When to Use This Skill

- Setting up a lightweight service mesh
- Implementing automatic mTLS
- Configuring traffic splits for canary deployments
- Setting up service profiles for per-route metrics
- Implementing retries and timeouts
- Multi-cluster service mesh

## Core Concepts

### 1. Linkerd Architecture

```
┌─────────────────────────────────────────────────┐
│                  Control Plane                  │
│ ┌─────────────┐ ┌──────────┐ ┌────────────────┐ │
│ │ destination │ │ identity │ │ proxy-injector │ │
│ └─────────────┘ └──────────┘ └────────────────┘ │
└─────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────┐
│                   Data Plane                    │
│          ┌─────┐    ┌─────┐    ┌─────┐          │
│          │proxy│────│proxy│────│proxy│          │
│          └─────┘    └─────┘    └─────┘          │
│             │          │          │             │
│          ┌──┴──┐    ┌──┴──┐    ┌──┴──┐          │
│          │ app │    │ app │    │ app │          │
│          └─────┘    └─────┘    └─────┘          │
└─────────────────────────────────────────────────┘
```

### 2. Key Resources

| Resource | Purpose |
|----------|---------|
| **ServiceProfile** | Per-route metrics, retries, timeouts |
| **TrafficSplit** | Canary deployments, A/B testing |
| **Server** | Define server-side policies |
| **ServerAuthorization** | Access control policies |

## Templates

### Template 1: Mesh Installation

```bash
# Install CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh

# Validate cluster
linkerd check --pre

# Install CRDs
linkerd install --crds | kubectl apply -f -

# Install control plane
linkerd install | kubectl apply -f -

# Verify installation
linkerd check

# Install viz extension (optional)
linkerd viz install | kubectl apply -f -
```

### Template 2: Inject Namespace

```yaml
# Automatic injection for namespace
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  annotations:
    linkerd.io/inject: enabled
---
# Or inject specific deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    linkerd.io/inject: enabled
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
```

### Template 3: Service Profile with Retries

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: my-service.my-namespace.svc.cluster.local
  namespace: my-namespace
spec:
  routes:
    - name: GET /api/users
      condition:
        method: GET
        pathRegex: /api/users
      responseClasses:
        - condition:
            status:
              min: 500
              max: 599
          isFailure: true
      isRetryable: true
    - name: POST /api/users
      condition:
        method: POST
        pathRegex: /api/users
      # POST not retryable by default
      isRetryable: false
    - name: GET /api/users/{id}
      condition:
        method: GET
        pathRegex: /api/users/[^/]+
      timeout: 5s
      isRetryable: true
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
```
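
As a rough way to reason about the `retryBudget` values, the Python sketch below estimates how many retries fit in one `ttl` window. It assumes a simplified model in which the budget is a fixed per-second reserve plus a fraction of the original requests seen in the window. `retry_allowance` is a hypothetical helper, not a Linkerd API.

```python
import math


def retry_allowance(
    original_requests: int,
    retry_ratio: float = 0.2,
    min_retries_per_second: int = 10,
    ttl_seconds: int = 10,
) -> int:
    """Approximate number of retries permitted within one ttl window."""
    reserve = min_retries_per_second * ttl_seconds  # always available
    proportional = retry_ratio * original_requests  # scales with traffic
    return reserve + math.floor(proportional)


if __name__ == "__main__":
    # 1000 original requests in a 10s window with the template's budget.
    print(retry_allowance(1000))
```

Because the allowance grows only in proportion to real traffic, retries cannot multiply load without bound during an outage, which is the retry-storm protection the budget exists for.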

### Template 4: Traffic Split (Canary)

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: my-service-canary
  namespace: my-namespace
spec:
  service: my-service
  backends:
    - service: my-service-stable
      weight: 900m # 90%
    - service: my-service-canary
      weight: 100m # 10%
```
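
The `900m` and `100m` weights use Kubernetes quantity notation, where the `m` suffix means thousandths. The Python sketch below converts such weights into traffic shares. It handles only plain numbers and the `m` suffix, and `parse_weight` and `traffic_shares` are hypothetical helpers.

```python
def parse_weight(weight: str) -> float:
    """Convert a weight such as '900m' (milli-units) or '1' into a float."""
    if weight.endswith("m"):
        return int(weight[:-1]) / 1000
    return float(weight)


def traffic_shares(backends: dict[str, str]) -> dict[str, float]:
    """Normalize backend weights so that the shares sum to 1."""
    parsed = {name: parse_weight(weight) for name, weight in backends.items()}
    total = sum(parsed.values())
    return {name: value / total for name, value in parsed.items()}


if __name__ == "__main__":
    print(traffic_shares({"my-service-stable": "900m", "my-service-canary": "100m"}))
```

Only the ratio between the weights matters, so `900m`/`100m` and `9`/`1` describe the same split.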

### Template 5: Server Authorization Policy

```yaml
# Define the server
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: my-service-http
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      app: my-service
  port: http
  proxyProtocol: HTTP/1
---
# Allow traffic from specific clients
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: allow-frontend
  namespace: my-namespace
spec:
  server:
    name: my-service-http
  client:
    meshTLS:
      serviceAccounts:
        - name: frontend
          namespace: my-namespace
---
# Allow unauthenticated traffic (e.g., from ingress)
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: allow-ingress
  namespace: my-namespace
spec:
  server:
    name: my-service-http
  client:
    unauthenticated: true
    networks:
      - cidr: 10.0.0.0/8
```

### Template 6: HTTPRoute for Advanced Routing

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: HTTPRoute
metadata:
  name: my-route
  namespace: my-namespace
spec:
  parentRefs:
    - name: my-service
      kind: Service
      group: core
      port: 8080
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api/v2
        - headers:
            - name: x-api-version
              value: v2
      backendRefs:
        - name: my-service-v2
          port: 8080
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: my-service-v1
          port: 8080
```

### Template 7: Multi-cluster Setup

```bash
# On each cluster, install with cluster credentials
linkerd multicluster install | kubectl apply -f -

# Link clusters
linkerd multicluster link --cluster-name west \
  --api-server-address https://west.example.com:6443 \
  | kubectl apply -f -

# Export a service to other clusters
kubectl label svc/my-service mirror.linkerd.io/exported=true

# Verify cross-cluster connectivity
linkerd multicluster check
linkerd multicluster gateways
```

## Monitoring Commands

```bash
# Live traffic view
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app

# Check proxy status
linkerd viz stat deploy -n my-namespace

# View service dependencies
linkerd viz edges deploy -n my-namespace

# Dashboard
linkerd viz dashboard
```

## Debugging

```bash
# Check injection status
linkerd check --proxy -n my-namespace

# View proxy logs
kubectl logs deploy/my-app -c linkerd-proxy

# Debug identity/TLS
linkerd identity -n my-namespace

# Tap traffic (live)
linkerd viz tap deploy/my-app --to deploy/my-backend
```

## Best Practices

### Do's
- **Enable mTLS everywhere** - It's automatic with Linkerd
- **Use ServiceProfiles** - Get per-route metrics and retries
- **Set retry budgets** - Prevent retry storms
- **Monitor golden metrics** - Success rate, latency, throughput

### Don'ts
- **Don't skip check** - Always run `linkerd check` after changes
- **Don't over-configure** - Linkerd defaults are sensible
- **Don't ignore ServiceProfiles** - They unlock advanced features
- **Don't forget timeouts** - Set appropriate values per route

## Resources

- [Linkerd Documentation](https://linkerd.io/2.14/overview/)
- [Service Profiles](https://linkerd.io/2.14/features/service-profiles/)
- [Authorization Policy](https://linkerd.io/2.14/features/server-policy/)
347
plugins/cloud-infrastructure/skills/mtls-configuration/SKILL.md
Normal file
@@ -0,0 +1,347 @@
---
name: mtls-configuration
description: Configure mutual TLS (mTLS) for zero-trust service-to-service communication. Use when implementing zero-trust networking, certificate management, or securing internal service communication.
---

# mTLS Configuration

Comprehensive guide to implementing mutual TLS for zero-trust service mesh communication.

## When to Use This Skill

- Implementing zero-trust networking
- Securing service-to-service communication
- Certificate rotation and management
- Debugging TLS handshake issues
- Compliance requirements (PCI-DSS, HIPAA)
- Multi-cluster secure communication

## Core Concepts

### 1. mTLS Flow

```
┌─────────┐                             ┌─────────┐
│ Service │                             │ Service │
│    A    │                             │    B    │
└────┬────┘                             └────┬────┘
     │                                       │
┌────┴────┐        TLS Handshake        ┌────┴────┐
│  Proxy  │◄───────────────────────────►│  Proxy  │
│(Sidecar)│  1. ClientHello             │(Sidecar)│
│         │  2. ServerHello + Cert      │         │
│         │  3. Client Cert             │         │
│         │  4. Verify Both Certs       │         │
│         │  5. Encrypted Channel       │         │
└─────────┘                             └─────────┘
```

### 2. Certificate Hierarchy

```
Root CA (Self-signed, long-lived)
    │
    ├── Intermediate CA (Cluster-level)
    │       │
    │       ├── Workload Cert (Service A)
    │       └── Workload Cert (Service B)
    │
    └── Intermediate CA (Multi-cluster)
            │
            └── Cross-cluster certs
```

## Templates

### Template 1: Istio mTLS (Strict Mode)

```yaml
# Enable strict mTLS mesh-wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Namespace-level override (permissive for migration)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy-namespace
spec:
  mtls:
    mode: PERMISSIVE
---
# Workload-specific policy
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: payment-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-service
  mtls:
    mode: STRICT
  portLevelMtls:
    8080:
      mode: STRICT
    9090:
      mode: DISABLE # Metrics port, no mTLS
```
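
The three policies above layer on top of each other: the most specific one wins. The Python sketch below models that resolution order (port-level, then workload, then namespace, then mesh). It is an assumption-labeled illustration that ignores the `UNSET` mode, and `effective_mode` is a hypothetical helper rather than an Istio API.

```python
from typing import Optional


def effective_mode(
    port: int,
    workload: Optional[str] = None,
    namespace: Optional[str] = None,
    mesh: Optional[str] = None,
    port_level: Optional[dict[int, str]] = None,
) -> str:
    """Return the mTLS mode that applies to traffic arriving on `port`."""
    # A port-level setting on the workload policy overrides everything else.
    if port_level and port in port_level:
        return port_level[port]
    # Otherwise the narrowest policy that is defined takes effect.
    for mode in (workload, namespace, mesh):
        if mode is not None:
            return mode
    # Assumed default when no policy applies.
    return "PERMISSIVE"


if __name__ == "__main__":
    # payment-service: metrics port 9090 is exempt, everything else is STRICT.
    print(effective_mode(9090, workload="STRICT", mesh="STRICT",
                         port_level={8080: "STRICT", 9090: "DISABLE"}))
```

Reading policies this way helps during migration: a `PERMISSIVE` namespace override keeps legacy clients working even while the mesh-wide default is already `STRICT`.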

### Template 2: Istio Destination Rule for mTLS

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: default
  namespace: istio-system
spec:
  host: "*.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
# TLS to external service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: external-api
spec:
  host: api.external.com
  trafficPolicy:
    tls:
      mode: SIMPLE
      caCertificates: /etc/certs/external-ca.pem
---
# Mutual TLS to external service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: partner-api
spec:
  host: api.partner.com
  trafficPolicy:
    tls:
      mode: MUTUAL
      clientCertificate: /etc/certs/client.pem
      privateKey: /etc/certs/client-key.pem
      caCertificates: /etc/certs/partner-ca.pem
```

### Template 3: Cert-Manager with Istio

```yaml
# Install cert-manager issuer for Istio
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: istio-ca
spec:
  ca:
    secretName: istio-ca-secret
---
# Create Istio CA secret
apiVersion: v1
kind: Secret
metadata:
  name: istio-ca-secret
  namespace: cert-manager
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded-ca-cert>
  tls.key: <base64-encoded-ca-key>
---
# Certificate for workload
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-cert
  namespace: my-namespace
spec:
  secretName: my-service-tls
  duration: 24h
  renewBefore: 8h
  issuerRef:
    name: istio-ca
    kind: ClusterIssuer
  commonName: my-service.my-namespace.svc.cluster.local
  dnsNames:
    - my-service
    - my-service.my-namespace
    - my-service.my-namespace.svc
    - my-service.my-namespace.svc.cluster.local
  usages:
    - server auth
    - client auth
```
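
To see what `duration: 24h` and `renewBefore: 8h` imply for rotation timing, the Python sketch below computes when renewal is due, assuming renewal starts `renewBefore` ahead of expiry. The function name and the issuance timestamp are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(issued_at: datetime, duration: timedelta, renew_before: timedelta) -> datetime:
    """Return the moment at which renewal is due for a certificate."""
    expires_at = issued_at + duration
    return expires_at - renew_before


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)  # example timestamp
    due = renewal_time(issued, timedelta(hours=24), timedelta(hours=8))
    print(due.isoformat())
```

With these values a certificate is replaced about 16 hours after issuance, which leaves an 8-hour margin to notice and fix a failed renewal before the old certificate expires.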

### Template 4: SPIFFE/SPIRE Integration

```yaml
# SPIRE Server configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-server
  namespace: spire
data:
  server.conf: |
    server {
      bind_address = "0.0.0.0"
      bind_port = "8081"
      trust_domain = "example.org"
      data_dir = "/run/spire/data"
      log_level = "INFO"
      ca_ttl = "168h"
      default_x509_svid_ttl = "1h"
    }

    plugins {
      DataStore "sql" {
        plugin_data {
          database_type = "sqlite3"
          connection_string = "/run/spire/data/datastore.sqlite3"
        }
      }

      NodeAttestor "k8s_psat" {
        plugin_data {
          clusters = {
            "demo-cluster" = {
              service_account_allow_list = ["spire:spire-agent"]
            }
          }
        }
      }

      KeyManager "memory" {
        plugin_data {}
      }

      UpstreamAuthority "disk" {
        plugin_data {
          key_file_path = "/run/spire/secrets/bootstrap.key"
          cert_file_path = "/run/spire/secrets/bootstrap.crt"
        }
      }
    }
---
# SPIRE Agent DaemonSet (abbreviated)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spire-agent
  namespace: spire
spec:
  selector:
    matchLabels:
      app: spire-agent
  template:
    spec:
      containers:
        - name: spire-agent
          image: ghcr.io/spiffe/spire-agent:1.8.0
          volumeMounts:
            - name: spire-agent-socket
              mountPath: /run/spire/sockets
      volumes:
        - name: spire-agent-socket
          hostPath:
            path: /run/spire/sockets
            type: DirectoryOrCreate
```

### Template 5: Linkerd mTLS (Automatic)

```yaml
# Linkerd enables mTLS automatically
# Verify with:
# linkerd viz edges deployment -n my-namespace

# For external services without mTLS
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: external-api
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      app: my-app
  port: external-api
  proxyProtocol: HTTP/1 # or TLS for passthrough
---
# Bypass the proxy (and therefore mTLS) for a specific outbound port.
# The annotation is set on the workload's pod template (or its namespace).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/skip-outbound-ports: "3306" # MySQL
```

## Certificate Rotation

```bash
# Istio - Check certificate expiry
istioctl proxy-config secret deploy/my-app -o json | \
  jq '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' | \
  tr -d '"' | base64 -d | openssl x509 -text -noout

# Force certificate rotation
kubectl rollout restart deployment/my-app

# Check Linkerd identity
linkerd identity -n my-namespace
```

## Debugging mTLS Issues

```bash
# Istio - Check if mTLS is enabled
# (`authn tls-check` exists only in older istioctl releases; newer ones
#  report the effective mTLS policy via `istioctl x describe pod <pod-name>`)
istioctl authn tls-check my-service.my-namespace.svc.cluster.local

# Verify peer authentication
kubectl get peerauthentication --all-namespaces

# Check destination rules
kubectl get destinationrule --all-namespaces

# Debug TLS handshake
istioctl proxy-config log deploy/my-app --level debug
kubectl logs deploy/my-app -c istio-proxy | grep -i tls

# Linkerd - Check mTLS status
linkerd viz edges deployment -n my-namespace
linkerd viz tap deploy/my-app --to deploy/my-backend
```

## Best Practices

### Do's
- **Start with PERMISSIVE** - Migrate gradually to STRICT
- **Monitor certificate expiry** - Set up alerts
- **Use short-lived certs** - 24h or less for workloads
- **Rotate CA periodically** - Plan for CA rotation
- **Log TLS errors** - For debugging and audit

### Don'ts
- **Don't disable mTLS** - Not even for convenience in production
- **Don't ignore cert expiry** - Automate rotation
- **Don't use self-signed workload certs** - Issue them from a proper CA hierarchy
- **Don't skip verification** - Verify the full chain

## Resources

- [Istio Security](https://istio.io/latest/docs/concepts/security/)
- [SPIFFE/SPIRE](https://spiffe.io/)
- [cert-manager](https://cert-manager.io/)
- [Zero Trust Architecture (NIST)](https://www.nist.gov/publications/zero-trust-architecture)
@@ -0,0 +1,383 @@
---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---

# Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.

## When to Use This Skill

- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity

## Core Concepts

### 1. Three Pillars of Observability

```
┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
```

### 2. Golden Signals for Mesh

| Signal | Description | Alert Threshold |
|--------|-------------|-----------------|
| **Latency** | Request duration P50, P99 | P99 > 500ms |
| **Traffic** | Requests per second | Anomaly detection |
| **Errors** | 5xx error rate | > 1% |
| **Saturation** | Resource utilization | > 80% |

## Templates

### Template 1: Istio with Prometheus & Grafana

```yaml
# Prometheus scrape configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            # istio-telemetry exists only in Mixer-based Istio releases;
            # adjust the service name for newer installations
            regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
```

### Template 2: Key Istio Metrics Queries

```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

# Request size
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
```
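
The error-rate query divides one per-second rate by another. The Python sketch below reproduces that arithmetic on raw counter samples as a simplified stand-in for `rate()`: it ignores counter resets and the extrapolation Prometheus applies, and the function names and sample values are assumptions made for illustration.

```python
def per_second_rate(first_sample: float, last_sample: float, window_seconds: float) -> float:
    """Average per-second increase of a counter over the window."""
    return (last_sample - first_sample) / window_seconds


def error_percentage(total_rate: float, error_rate: float) -> float:
    """Share of failing requests in percent; 0 when there is no traffic."""
    if total_rate == 0:
        return 0.0
    return error_rate / total_rate * 100


if __name__ == "__main__":
    window = 300  # seconds, matching the [5m] range in the queries above
    total = per_second_rate(1_000, 4_000, window)  # all requests
    errors = per_second_rate(10, 40, window)       # 5xx responses only
    print(total, error_percentage(total, errors))
```

The guard for zero traffic matters in practice: the PromQL division yields no usable value for an idle service, which alert rules should handle explicitly.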

### Template 3: Jaeger Distributed Tracing

```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0  # 100% in dev, lower in prod
        zipkin:
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775   # UDP
            - containerPort: 6831   # Thrift
            - containerPort: 6832   # Thrift
            - containerPort: 5778   # Config
            - containerPort: 16686  # UI
            - containerPort: 14268  # HTTP
            - containerPort: 14250  # gRPC
            - containerPort: 9411   # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```

### Template 4: Linkerd Viz Dashboard

```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```

### Template 5: Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```
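
The `thresholds.steps` list on the Error Rate gauge maps a reading to a color: the highest step whose `value` is at or below the reading applies. A small Python sketch of that lookup follows, assuming this "last matching step wins" behavior. The function name `threshold_color` is hypothetical, and the steps are copied from the panel above.

```python
def threshold_color(value, steps):
    """Return the color of the highest step whose value is <= the reading."""
    ordered = sorted(steps, key=lambda step: step["value"])
    # Readings below every step fall back to the lowest step's color.
    color = ordered[0]["color"]
    for step in ordered:
        if value >= step["value"]:
            color = step["color"]
    return color


# Steps taken from the Error Rate gauge (values are percentages).
error_rate_steps = [
    {"value": 0, "color": "green"},
    {"value": 1, "color": "yellow"},
    {"value": 5, "color": "red"},
]

for reading in (0.4, 1.0, 7.2):
    print(reading, threshold_color(reading, error_rate_steps))
```

With these steps the gauge stays green below 1% errors, turns yellow from 1% up to 5%, and turns red at 5%, which matches the 5% threshold of the `HighErrorRate` alert.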

### Template 6: Kiali Service Mesh Visualization

```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous  # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```

### Template 7: OpenTelemetry Integration

```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 10s

    exporters:
      # Note: recent Collector releases removed the dedicated jaeger exporter.
      # On those versions, use an otlp exporter pointed at Jaeger's OTLP endpoint.
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel  # must match a provider defined in meshConfig.extensionProviders
      randomSamplingPercentage: 10
```
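
`randomSamplingPercentage: 10` means roughly one in ten traces is kept, with the decision made once at the start of the trace and then honored downstream. The Python sketch below illustrates one way such a head-based decision can be made deterministically from the trace ID, so that every hop reaches the same verdict. The function name and the modulo scheme are assumptions for illustration and do not describe how Envoy or Istio implement sampling.

```python
def should_sample(trace_id_hex, percentage):
    """Keep a trace if its ID falls into the lowest `percentage` of ID space.

    trace_id_hex: hex-encoded trace ID (for example 32 hex characters).
    percentage:   sampling rate from 0 to 100, with 0.01% resolution.
    """
    # Map the low 32 bits of the trace ID onto 10,000 slots (0.01% each).
    slot = int(trace_id_hex[-8:], 16) % 10000
    return slot < percentage * 100


# Hypothetical trace IDs: sequential values rendered as 128-bit hex strings.
trace_ids = [format(number, "032x") for number in range(10000)]
kept = sum(should_sample(trace_id, 10) for trace_id in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Deriving the decision from the trace ID rather than from a fresh random number keeps a trace either fully recorded or fully dropped, which avoids partial traces with missing spans.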

## Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"

        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
```
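
The `for: 5m` clause on `HighErrorRate` and `HighLatency` means a breach must persist across evaluations for five minutes before the alert fires; until then the alert is only pending, and a single healthy evaluation resets the clock. The following Python sketch models that behavior under simplified assumptions (one value per evaluation, no stale-sample handling). The function name and sample data are hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Classify each evaluation as "inactive", "pending", or "firing".

    samples: list of (timestamp_seconds, value) pairs in time order,
    one pair per rule evaluation.
    """
    states = []
    active_since = None  # timestamp of the first breach in the current run
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical error ratios, evaluated once per minute.
ratios = [0.01, 0.08, 0.09, 0.07, 0.02, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06]
samples = [(index * 60, ratio) for index, ratio in enumerate(ratios)]
states = alert_states(samples, threshold=0.05, for_seconds=300)
for (timestamp, ratio), state in zip(samples, states):
    print(timestamp, ratio, state)
```

In this sample the first three-minute spike never fires because the ratio recovers before five minutes elapse; only the second, sustained breach reaches the firing state. `MeshCertExpiring` has no `for` clause, so it would fire on the first evaluation that matches.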

## Best Practices

### Do's
- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate trace headers consistently across every service hop
- **Set up alerts** - Cover the golden signals (latency, traffic, errors, saturation)
- **Correlate metrics/traces** - Use exemplars to link metric samples to traces
- **Retain strategically** - Use hot/cold storage tiers

### Don'ts
- **Don't over-sample** - Trace storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize service dependencies
- **Don't forget costs** - Monitor what observability itself costs
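
Propagating trace context consistently matters because sidecars cannot link an inbound request to the outbound calls it triggers; the application has to copy the trace headers itself. Below is a minimal Python sketch of that copy step, assuming headers are available as a plain dictionary. The helper name `propagate_trace_headers` is hypothetical, and the header list covers the B3 and W3C Trace Context headers plus `x-request-id`; adjust it to the tracer in use.

```python
# Headers that carry trace context (B3 and W3C) plus Envoy's request ID.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(incoming):
    """Return the trace headers to copy from an inbound request onto outbound calls."""
    # HTTP header names are case-insensitive, so normalize before matching.
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Hypothetical inbound request headers.
inbound = {
    "Host": "frontend.default.svc.cluster.local",
    "X-Request-Id": "7f3c2a9e-1b4d-4c8a-9f61-2d5e8b7a0c13",
    "X-B3-TraceId": "463ac35c9f6413ad48485a3953bb6124",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "X-B3-Sampled": "1",
    "Authorization": "Bearer example-token",
}
outbound_headers = propagate_trace_headers(inbound)
print(outbound_headers)
```

Only the trace-related headers are forwarded; unrelated headers such as `Authorization` are left for the application to handle deliberately.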

## Resources

- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
- [OpenTelemetry](https://opentelemetry.io/)
- [Kiali](https://kiali.io/)