feat: add 5 new specialized agents with 20 skills

Add domain expert agents with comprehensive skill sets:
- service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability
- event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas
- vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search
- monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces
- threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements

Update all documentation to reflect correct counts:
- 67 plugins, 99 agents, 107 skills, 71 commands
Seth Hobson
2025-12-16 16:00:58 -05:00
parent c7ad381360
commit 01d93fc227
58 changed files with 24830 additions and 50 deletions


@@ -0,0 +1,41 @@
# Service Mesh Expert
Expert service mesh architect specializing in Istio, Linkerd, and cloud-native networking patterns. Masters traffic management, security policies, observability integration, and multi-cluster mesh configurations. Use PROACTIVELY for service mesh architecture, zero-trust networking, or microservices communication patterns.
## Capabilities
- Istio and Linkerd installation, configuration, and optimization
- Traffic management: routing, load balancing, circuit breaking, retries
- mTLS configuration and certificate management
- Service mesh observability with distributed tracing
- Multi-cluster and multi-cloud mesh federation
- Progressive delivery with canary and blue-green deployments
- Security policies and authorization rules
## When to Use
- Implementing service-to-service communication in Kubernetes
- Setting up zero-trust networking with mTLS
- Configuring traffic splitting for canary deployments
- Debugging service mesh connectivity issues
- Implementing rate limiting and circuit breakers
- Setting up cross-cluster service discovery
## Workflow
1. Assess current infrastructure and requirements
2. Design mesh topology and traffic policies
3. Implement security policies (mTLS, AuthorizationPolicy)
4. Configure observability (metrics, traces, logs)
5. Set up traffic management rules
6. Test failover and resilience patterns
7. Document operational runbooks
## Best Practices
- Start with permissive mode, gradually enforce strict mTLS
- Use namespaces for policy isolation
- Implement circuit breakers before they're needed
- Monitor mesh overhead (latency, resource usage)
- Keep sidecar resources appropriately sized
- Use destination rules for consistent load balancing


@@ -0,0 +1,325 @@
---
name: istio-traffic-management
description: Configure Istio traffic management including routing, load balancing, circuit breakers, and canary deployments. Use when implementing service mesh traffic policies, progressive delivery, or resilience patterns.
---
# Istio Traffic Management
Comprehensive guide to Istio traffic management for production service mesh deployments.
## When to Use This Skill
- Configuring service-to-service routing
- Implementing canary or blue-green deployments
- Setting up circuit breakers and retries
- Load balancing configuration
- Traffic mirroring for testing
- Fault injection for chaos engineering
## Core Concepts
### 1. Traffic Management Resources
| Resource | Purpose | Scope |
|----------|---------|-------|
| **VirtualService** | Route traffic to destinations | Host-based |
| **DestinationRule** | Define policies after routing | Service-based |
| **Gateway** | Configure ingress/egress | Cluster edge |
| **ServiceEntry** | Add external services | Mesh-wide |
### 2. Traffic Flow
```
Client → Gateway → VirtualService → DestinationRule → Service
                     (routing)        (policies)       (pods)
```
## Templates
### Template 1: Basic Routing
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-route
  namespace: bookinfo
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-destination
  namespace: bookinfo
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
  - name: v3
    labels:
      version: v3
```
### Template 2: Canary Deployment
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service-canary
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
        subset: stable
      weight: 90
    - destination:
        host: my-service
        subset: canary
      weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service-dr
spec:
  host: my-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary
```
### Template 3: Circuit Breaker
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: circuit-breaker
spec:
  host: my-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
```
### Template 4: Retry and Timeout
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-retry
spec:
  hosts:
  - ratings
  http:
  - route:
    - destination:
        host: ratings
    timeout: 10s
    retries:
      attempts: 3
      perTryTimeout: 3s
      retryOn: connect-failure,refused-stream,unavailable,cancelled,retriable-4xx,503
      retryRemoteLocalities: true
```
### Template 5: Traffic Mirroring
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mirror-traffic
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
        subset: v1
    mirror:
      host: my-service
      subset: v2
    mirrorPercentage:
      value: 100.0
```
### Template 6: Fault Injection
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fault-injection
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percentage:
          value: 10
        fixedDelay: 5s
      abort:
        percentage:
          value: 5
        httpStatus: 503
    route:
    - destination:
        host: ratings
```
### Template 7: Ingress Gateway
```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: my-tls-secret
    hosts:
    - "*.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-vs
spec:
  hosts:
  - "api.example.com"
  gateways:
  - my-gateway
  http:
  - match:
    - uri:
        prefix: /api/v1
    route:
    - destination:
        host: api-service
        port:
          number: 8080
```
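### Template 8: External Service (ServiceEntry)
The resource table under Core Concepts also lists **ServiceEntry**, which none of the templates above use. A minimal sketch (the hostname and port are placeholders) that registers an external API with the mesh so DestinationRule policies such as TLS origination or outlier detection can apply to it:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-payments-api  # hypothetical external dependency
spec:
  hosts:
  - api.payments.example.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: https
    protocol: TLS
```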
## Load Balancing Strategies
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: load-balancing
spec:
  host: my-service
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN  # or LEAST_CONN, RANDOM, PASSTHROUGH
---
# Consistent hashing for sticky sessions
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: sticky-sessions
spec:
  host: my-service
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: x-user-id
        # or: httpCookie, useSourceIp, httpQueryParameterName
```
## Best Practices
### Do's
- **Start simple** - Add complexity incrementally
- **Use subsets** - Version your services clearly
- **Set timeouts** - Always configure reasonable timeouts
- **Enable retries** - But with backoff and limits
- **Monitor** - Use Kiali and Jaeger for visibility
### Don'ts
- **Don't over-retry** - Can cause cascading failures
- **Don't ignore outlier detection** - Enable circuit breakers
- **Don't mirror to production** - Mirror to test environments
- **Don't skip canary** - Test with small traffic percentage first
## Debugging Commands
```bash
# Check VirtualService configuration
istioctl analyze
# View effective routes
istioctl proxy-config routes deploy/my-app -o json
# Check endpoint discovery
istioctl proxy-config endpoints deploy/my-app
# Debug traffic
istioctl proxy-config log deploy/my-app --level debug
```
## Resources
- [Istio Traffic Management](https://istio.io/latest/docs/concepts/traffic-management/)
- [Virtual Service Reference](https://istio.io/latest/docs/reference/config/networking/virtual-service/)
- [Destination Rule Reference](https://istio.io/latest/docs/reference/config/networking/destination-rule/)


@@ -0,0 +1,309 @@
---
name: linkerd-patterns
description: Implement Linkerd service mesh patterns for lightweight, security-focused service mesh deployments. Use when setting up Linkerd, configuring traffic policies, or implementing zero-trust networking with minimal overhead.
---
# Linkerd Patterns
Production patterns for Linkerd service mesh - the lightweight, security-first service mesh for Kubernetes.
## When to Use This Skill
- Setting up a lightweight service mesh
- Implementing automatic mTLS
- Configuring traffic splits for canary deployments
- Setting up service profiles for per-route metrics
- Implementing retries and timeouts
- Multi-cluster service mesh
## Core Concepts
### 1. Linkerd Architecture
```
┌─────────────────────────────────────────────────┐
│                  Control Plane                   │
│ ┌─────────────┐ ┌──────────┐ ┌────────────────┐ │
│ │ destination │ │ identity │ │ proxy-injector │ │
│ └─────────────┘ └──────────┘ └────────────────┘ │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│                   Data Plane                     │
│   ┌─────┐    ┌─────┐    ┌─────┐                  │
│   │proxy│────│proxy│────│proxy│                  │
│   └─────┘    └─────┘    └─────┘                  │
│      │          │          │                     │
│   ┌──┴──┐    ┌──┴──┐    ┌──┴──┐                  │
│   │ app │    │ app │    │ app │                  │
│   └─────┘    └─────┘    └─────┘                  │
└─────────────────────────────────────────────────┘
```
### 2. Key Resources
| Resource | Purpose |
|----------|---------|
| **ServiceProfile** | Per-route metrics, retries, timeouts |
| **TrafficSplit** | Canary deployments, A/B testing |
| **Server** | Define server-side policies |
| **ServerAuthorization** | Access control policies |
## Templates
### Template 1: Mesh Installation
```bash
# Install CLI
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
# Validate cluster
linkerd check --pre
# Install CRDs
linkerd install --crds | kubectl apply -f -
# Install control plane
linkerd install | kubectl apply -f -
# Verify installation
linkerd check
# Install viz extension (optional)
linkerd viz install | kubectl apply -f -
```
### Template 2: Inject Namespace
```yaml
# Automatic injection for namespace
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  annotations:
    linkerd.io/inject: enabled
---
# Or inject specific deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    linkerd.io/inject: enabled
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
```
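A deployment that is already running can be meshed without editing YAML by piping it through the CLI — a short sketch, assuming the `my-app` namespace and deployment from the template above:
```bash
# Add the inject annotation to an existing deployment and re-apply it
kubectl get deploy/my-app -n my-app -o yaml | linkerd inject - | kubectl apply -f -

# Confirm the proxy is running alongside the app
linkerd check --proxy -n my-app
```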
### Template 3: Service Profile with Retries
```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: my-service.my-namespace.svc.cluster.local
  namespace: my-namespace
spec:
  routes:
  - name: GET /api/users
    condition:
      method: GET
      pathRegex: /api/users
    responseClasses:
    - condition:
        status:
          min: 500
          max: 599
      isFailure: true
    isRetryable: true
  - name: POST /api/users
    condition:
      method: POST
      pathRegex: /api/users
    # POST not retryable by default
    isRetryable: false
  - name: GET /api/users/{id}
    condition:
      method: GET
      pathRegex: /api/users/[^/]+
    timeout: 5s
    isRetryable: true
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
```
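Writing ServiceProfiles by hand is tedious; the `linkerd profile` command can generate a skeleton from an OpenAPI spec or from live traffic. A sketch with placeholder file, service, and namespace names:
```bash
# Generate a ServiceProfile from an OpenAPI/Swagger definition
linkerd profile --open-api swagger.json my-service -n my-namespace | kubectl apply -f -

# Or watch live traffic for 10 seconds and derive the routes from it
linkerd viz profile --tap deploy/my-service --tap-duration 10s my-service -n my-namespace | kubectl apply -f -
```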
### Template 4: Traffic Split (Canary)
```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: my-service-canary
  namespace: my-namespace
spec:
  service: my-service
  backends:
  - service: my-service-stable
    weight: 900m  # 90%
  - service: my-service-canary
    weight: 100m  # 10%
```
### Template 5: Server Authorization Policy
```yaml
# Define the server
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: my-service-http
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      app: my-service
  port: http
  proxyProtocol: HTTP/1
---
# Allow traffic from specific clients
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: allow-frontend
  namespace: my-namespace
spec:
  server:
    name: my-service-http
  client:
    meshTLS:
      serviceAccounts:
      - name: frontend
        namespace: my-namespace
---
# Allow unauthenticated traffic (e.g., from ingress)
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: allow-ingress
  namespace: my-namespace
spec:
  server:
    name: my-service-http
  client:
    unauthenticated: true
    networks:
    - cidr: 10.0.0.0/8
```
### Template 6: HTTPRoute for Advanced Routing
```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: HTTPRoute
metadata:
  name: my-route
  namespace: my-namespace
spec:
  parentRefs:
  - name: my-service
    kind: Service
    group: core
    port: 8080
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api/v2
    - headers:
      - name: x-api-version
        value: v2
    backendRefs:
    - name: my-service-v2
      port: 8080
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: my-service-v1
      port: 8080
```
### Template 7: Multi-cluster Setup
```bash
# On each cluster, install with cluster credentials
linkerd multicluster install | kubectl apply -f -
# Link clusters
linkerd multicluster link --cluster-name west \
--api-server-address https://west.example.com:6443 \
| kubectl apply -f -
# Export a service to other clusters
kubectl label svc/my-service mirror.linkerd.io/exported=true
# Verify cross-cluster connectivity
linkerd multicluster check
linkerd multicluster gateways
```
## Monitoring Commands
```bash
# Live traffic view
linkerd viz top deploy/my-app
# Per-route metrics
linkerd viz routes deploy/my-app
# Check proxy status
linkerd viz stat deploy -n my-namespace
# View service dependencies
linkerd viz edges deploy -n my-namespace
# Dashboard
linkerd viz dashboard
```
## Debugging
```bash
# Check injection status
linkerd check --proxy -n my-namespace
# View proxy logs
kubectl logs deploy/my-app -c linkerd-proxy
# Debug identity/TLS
linkerd identity -n my-namespace
# Tap traffic (live)
linkerd viz tap deploy/my-app --to deploy/my-backend
```
## Best Practices
### Do's
- **Enable mTLS everywhere** - It's automatic with Linkerd
- **Use ServiceProfiles** - Get per-route metrics and retries
- **Set retry budgets** - Prevent retry storms
- **Monitor golden metrics** - Success rate, latency, throughput
### Don'ts
- **Don't skip check** - Always run `linkerd check` after changes
- **Don't over-configure** - Linkerd defaults are sensible
- **Don't ignore ServiceProfiles** - They unlock advanced features
- **Don't forget timeouts** - Set appropriate values per route
## Resources
- [Linkerd Documentation](https://linkerd.io/2.14/overview/)
- [Service Profiles](https://linkerd.io/2.14/features/service-profiles/)
- [Authorization Policy](https://linkerd.io/2.14/features/server-policy/)


@@ -0,0 +1,347 @@
---
name: mtls-configuration
description: Configure mutual TLS (mTLS) for zero-trust service-to-service communication. Use when implementing zero-trust networking, certificate management, or securing internal service communication.
---
# mTLS Configuration
Comprehensive guide to implementing mutual TLS for zero-trust service mesh communication.
## When to Use This Skill
- Implementing zero-trust networking
- Securing service-to-service communication
- Certificate rotation and management
- Debugging TLS handshake issues
- Compliance requirements (PCI-DSS, HIPAA)
- Multi-cluster secure communication
## Core Concepts
### 1. mTLS Flow
```
┌─────────┐                             ┌─────────┐
│ Service │                             │ Service │
│    A    │                             │    B    │
└────┬────┘                             └────┬────┘
     │                                       │
┌────┴────┐        TLS Handshake        ┌────┴────┐
│  Proxy  │◄───────────────────────────►│  Proxy  │
│(Sidecar)│  1. ClientHello             │(Sidecar)│
│         │  2. ServerHello + Cert      │         │
│         │  3. Client Cert             │         │
│         │  4. Verify Both Certs       │         │
│         │  5. Encrypted Channel       │         │
└─────────┘                             └─────────┘
```
### 2. Certificate Hierarchy
```
Root CA (Self-signed, long-lived)
├── Intermediate CA (Cluster-level)
│   ├── Workload Cert (Service A)
│   └── Workload Cert (Service B)
└── Intermediate CA (Multi-cluster)
    └── Cross-cluster certs
```
## Templates
### Template 1: Istio mTLS (Strict Mode)
```yaml
# Enable strict mTLS mesh-wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Namespace-level override (permissive for migration)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy-namespace
spec:
  mtls:
    mode: PERMISSIVE
---
# Workload-specific policy
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: payment-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-service
  mtls:
    mode: STRICT
  portLevelMtls:
    8080:
      mode: STRICT
    9090:
      mode: DISABLE  # Metrics port, no mTLS
```
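STRICT mTLS authenticates workloads but does not restrict which identities may call a service; an AuthorizationPolicy is the usual companion. A hedged sketch that only admits the `frontend` service account (all names are placeholders matching Template 1):
```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payment-service-allow-frontend
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-service
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/production/sa/frontend"]
    to:
    - operation:
        methods: ["GET", "POST"]
```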
### Template 2: Istio Destination Rule for mTLS
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: default
  namespace: istio-system
spec:
  host: "*.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
# TLS to external service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: external-api
spec:
  host: api.external.com
  trafficPolicy:
    tls:
      mode: SIMPLE
      caCertificates: /etc/certs/external-ca.pem
---
# Mutual TLS to external service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: partner-api
spec:
  host: api.partner.com
  trafficPolicy:
    tls:
      mode: MUTUAL
      clientCertificate: /etc/certs/client.pem
      privateKey: /etc/certs/client-key.pem
      caCertificates: /etc/certs/partner-ca.pem
```
### Template 3: Cert-Manager with Istio
```yaml
# Install cert-manager issuer for Istio
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: istio-ca
spec:
  ca:
    secretName: istio-ca-secret
---
# Create Istio CA secret
apiVersion: v1
kind: Secret
metadata:
  name: istio-ca-secret
  namespace: cert-manager
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded-ca-cert>
  tls.key: <base64-encoded-ca-key>
---
# Certificate for workload
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-cert
  namespace: my-namespace
spec:
  secretName: my-service-tls
  duration: 24h
  renewBefore: 8h
  issuerRef:
    name: istio-ca
    kind: ClusterIssuer
  commonName: my-service.my-namespace.svc.cluster.local
  dnsNames:
  - my-service
  - my-service.my-namespace
  - my-service.my-namespace.svc
  - my-service.my-namespace.svc.cluster.local
  usages:
  - server auth
  - client auth
```
### Template 4: SPIFFE/SPIRE Integration
```yaml
# SPIRE Server configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: spire-server
  namespace: spire
data:
  server.conf: |
    server {
      bind_address = "0.0.0.0"
      bind_port = "8081"
      trust_domain = "example.org"
      data_dir = "/run/spire/data"
      log_level = "INFO"
      ca_ttl = "168h"
      default_x509_svid_ttl = "1h"
    }
    plugins {
      DataStore "sql" {
        plugin_data {
          database_type = "sqlite3"
          connection_string = "/run/spire/data/datastore.sqlite3"
        }
      }
      NodeAttestor "k8s_psat" {
        plugin_data {
          clusters = {
            "demo-cluster" = {
              service_account_allow_list = ["spire:spire-agent"]
            }
          }
        }
      }
      KeyManager "memory" {
        plugin_data {}
      }
      UpstreamAuthority "disk" {
        plugin_data {
          key_file_path = "/run/spire/secrets/bootstrap.key"
          cert_file_path = "/run/spire/secrets/bootstrap.crt"
        }
      }
    }
---
# SPIRE Agent DaemonSet (abbreviated)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spire-agent
  namespace: spire
spec:
  selector:
    matchLabels:
      app: spire-agent
  template:
    metadata:
      labels:
        app: spire-agent
    spec:
      containers:
      - name: spire-agent
        image: ghcr.io/spiffe/spire-agent:1.8.0
        volumeMounts:
        - name: spire-agent-socket
          mountPath: /run/spire/sockets
      volumes:
      - name: spire-agent-socket
        hostPath:
          path: /run/spire/sockets
          type: DirectoryOrCreate
```
### Template 5: Linkerd mTLS (Automatic)
```yaml
# Linkerd enables mTLS automatically
# Verify with:
#   linkerd viz edges deployment -n my-namespace

# For external services without mTLS
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: external-api
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      app: my-app
  port: external-api
  proxyProtocol: HTTP/1  # or TLS for passthrough
---
# Bypass the proxy for specific outbound ports
# (set on the workload's pod template, not on a Service)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/skip-outbound-ports: "3306"  # MySQL
```
## Certificate Rotation
```bash
# Istio - Check certificate expiry
istioctl proxy-config secret deploy/my-app -o json | \
jq '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' | \
tr -d '"' | base64 -d | openssl x509 -text -noout
# Force certificate rotation
kubectl rollout restart deployment/my-app
# Check Linkerd identity
linkerd identity -n my-namespace
```
## Debugging mTLS Issues
```bash
# Istio - Check if mTLS is enabled for a workload
istioctl x describe pod <my-app-pod> -n my-namespace
# Verify peer authentication
kubectl get peerauthentication --all-namespaces
# Check destination rules
kubectl get destinationrule --all-namespaces
# Debug TLS handshake
istioctl proxy-config log deploy/my-app --level debug
kubectl logs deploy/my-app -c istio-proxy | grep -i tls
# Linkerd - Check mTLS status
linkerd viz edges deployment -n my-namespace
linkerd viz tap deploy/my-app --to deploy/my-backend
```
## Best Practices
### Do's
- **Start with PERMISSIVE** - Migrate gradually to STRICT
- **Monitor certificate expiry** - Set up alerts
- **Use short-lived certs** - 24h or less for workloads
- **Rotate CA periodically** - Plan for CA rotation
- **Log TLS errors** - For debugging and audit
### Don'ts
- **Don't disable mTLS for convenience** - Keep it enforced in production
- **Don't ignore cert expiry** - Automate rotation
- **Don't hand out ad-hoc self-signed workload certs** - Issue them from a proper CA hierarchy
- **Don't skip verification** - Verify the full chain
## Resources
- [Istio Security](https://istio.io/latest/docs/concepts/security/)
- [SPIFFE/SPIRE](https://spiffe.io/)
- [cert-manager](https://cert-manager.io/)
- [Zero Trust Architecture (NIST)](https://www.nist.gov/publications/zero-trust-architecture)


@@ -0,0 +1,383 @@
---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---
# Service Mesh Observability
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
## When to Use This Skill
- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity
## Core Concepts
### 1. Three Pillars of Observability
```
┌─────────────────────────────────────────────────────┐
│                    Observability                     │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
```
### 2. Golden Signals for Mesh
| Signal | Description | Alert Threshold |
|--------|-------------|-----------------|
| **Latency** | Request duration P50, P99 | P99 > 500ms |
| **Traffic** | Requests per second | Anomaly detection |
| **Errors** | 5xx error rate | > 1% |
| **Saturation** | Resource utilization | > 80% |
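The golden signals map directly onto SLO recording rules. A sketch of an availability SLI plus a simple burn alert built on the same `istio_requests_total` metric used in the templates below; the 99.9% target and 30m window are illustrative, not prescriptive:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-slo
  namespace: istio-system
spec:
  groups:
  - name: slo.rules
    rules:
    # Availability SLI: share of non-5xx responses over the last 30 minutes
    - record: service:availability:ratio_rate30m
      expr: |
        1 - (
          sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[30m])) by (destination_service_name)
          / sum(rate(istio_requests_total{reporter="destination"}[30m])) by (destination_service_name)
        )
    # Page when availability drops below the 99.9% target
    - alert: AvailabilitySLOBreach
      expr: service:availability:ratio_rate30m < 0.999
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "{{ $labels.destination_service_name }} is burning its availability SLO"
```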
## Templates
### Template 1: Istio with Prometheus & Grafana
```yaml
# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'istio-mesh'
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - istio-system
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
  - port: http-monitoring
    interval: 15s
```
### Template 2: Key Istio Metrics Queries
```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
/ sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
# P99 latency
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
# Request size
histogram_quantile(0.99,
sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
```
### Template 3: Jaeger Distributed Tracing
```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0  # 100% in dev, lower in prod
        zipkin:
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.50
        ports:
        - containerPort: 5775   # UDP
        - containerPort: 6831   # Thrift
        - containerPort: 6832   # Thrift
        - containerPort: 5778   # Config
        - containerPort: 16686  # UI
        - containerPort: 14268  # HTTP
        - containerPort: 14250  # gRPC
        - containerPort: 9411   # Zipkin
        env:
        - name: COLLECTOR_ZIPKIN_HOST_PORT
          value: ":9411"
```
### Template 4: Linkerd Viz Dashboard
```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -
# Access dashboard
linkerd viz dashboard
# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app
# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend
# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend
# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```
### Template 5: Grafana Dashboard JSON
```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```
### Template 6: Kiali Service Mesh Visualization
```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous  # or openid, token
  deployment:
    accessible_namespaces:
    - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```
### Template 7: OpenTelemetry Integration
```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411
    processors:
      batch:
        timeout: 10s
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: otel
    randomSamplingPercentage: 10
```
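The pillars diagram above lists logs alongside metrics and traces, but none of the templates enable them. Envoy access logging can be switched on with the same Telemetry API — a minimal sketch using the built-in `envoy` provider:
```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: access-logging
  namespace: istio-system
spec:
  accessLogging:
  - providers:
    - name: envoy  # built-in provider that writes to the sidecar's stdout
```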
## Alerting Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
  - name: mesh.rules
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
          / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate for {{ $labels.destination_service_name }}"
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
          by (le, destination_service_name)) > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High P99 latency for {{ $labels.destination_service_name }}"
    - alert: MeshCertExpiring
      expr: |
        (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
      labels:
        severity: warning
      annotations:
        summary: "Mesh certificate expiring in less than 7 days"
```
## Best Practices
### Do's
- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers
### Don'ts
- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs
## Resources
- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
- [OpenTelemetry](https://opentelemetry.io/)
- [Kiali](https://kiali.io/)