feat: add 5 new specialized agents with 20 skills

Add domain expert agents with comprehensive skill sets: - service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability - event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas - vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search - monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces - threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements Update all documentation to reflect correct counts: - 67 plugins, 99 agents, 107 skills, 71 commands
2026-03-18 17:47:16 +00:00 · 2025-12-16 16:00:58 -05:00
parent c7ad381360
commit 01d93fc227
58 changed files with 24830 additions and 50 deletions
--- a/plugins/cloud-infrastructure/skills/service-mesh-observability/SKILL.md
+++ b/plugins/cloud-infrastructure/skills/service-mesh-observability/SKILL.md
@@ -0,0 +1,383 @@
+---
+name: service-mesh-observability
+description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
+---
+
+# Service Mesh Observability
+
+Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
+
+## When to Use This Skill
+
+- Setting up distributed tracing across services
+- Implementing service mesh metrics and dashboards
+- Debugging latency and error issues
+- Defining SLOs for service communication
+- Visualizing service dependencies
+- Troubleshooting mesh connectivity
+
+## Core Concepts
+
+### 1. Three Pillars of Observability
+
+```
+┌─────────────────────────────────────────────────────┐
+│                  Observability                       │
+├─────────────────┬─────────────────┬─────────────────┤
+│     Metrics     │     Traces      │      Logs       │
+│                 │                 │                 │
+│ • Request rate  │ • Span context  │ • Access logs   │
+│ • Error rate    │ • Latency       │ • Error details │
+│ • Latency P50   │ • Dependencies  │ • Debug info    │
+│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
+└─────────────────┴─────────────────┴─────────────────┘
+```
+
+### 2. Golden Signals for Mesh
+
+| Signal | Description | Alert Threshold |
+|--------|-------------|-----------------|
+| **Latency** | Request duration P50, P99 | P99 > 500ms |
+| **Traffic** | Requests per second | Anomaly detection |
+| **Errors** | 5xx error rate | > 1% |
+| **Saturation** | Resource utilization | > 80% |
+
+## Templates
+
+### Template 1: Istio with Prometheus & Grafana
+
+```yaml
+# Install Prometheus
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: prometheus
+  namespace: istio-system
+data:
+  prometheus.yml: |
+    global:
+      scrape_interval: 15s
+    scrape_configs:
+      - job_name: 'istio-mesh'
+        kubernetes_sd_configs:
+          - role: endpoints
+            namespaces:
+              names:
+                - istio-system
+        relabel_configs:
+          - source_labels: [__meta_kubernetes_service_name]
+            action: keep
+            regex: istio-telemetry
+---
+# ServiceMonitor for Prometheus Operator
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: istio-mesh
+  namespace: istio-system
+spec:
+  selector:
+    matchLabels:
+      app: istiod
+  endpoints:
+    - port: http-monitoring
+      interval: 15s
+```
+
+### Template 2: Key Istio Metrics Queries
+
+```promql
+# Request rate by service
+sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
+
+# Error rate (5xx)
+sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
+  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
+
+# P99 latency
+histogram_quantile(0.99,
+  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
+  by (le, destination_service_name))
+
+# TCP connections
+sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
+
+# Request size
+histogram_quantile(0.99,
+  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
+  by (le, destination_service_name))
+```
+
+### Template 3: Jaeger Distributed Tracing
+
+```yaml
+# Jaeger installation for Istio
+apiVersion: install.istio.io/v1alpha1
+kind: IstioOperator
+spec:
+  meshConfig:
+    enableTracing: true
+    defaultConfig:
+      tracing:
+        sampling: 100.0  # 100% in dev, lower in prod
+        zipkin:
+          address: jaeger-collector.istio-system:9411
+---
+# Jaeger deployment
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: jaeger
+  namespace: istio-system
+spec:
+  selector:
+    matchLabels:
+      app: jaeger
+  template:
+    metadata:
+      labels:
+        app: jaeger
+    spec:
+      containers:
+        - name: jaeger
+          image: jaegertracing/all-in-one:1.50
+          ports:
+            - containerPort: 5775   # UDP
+            - containerPort: 6831   # Thrift
+            - containerPort: 6832   # Thrift
+            - containerPort: 5778   # Config
+            - containerPort: 16686  # UI
+            - containerPort: 14268  # HTTP
+            - containerPort: 14250  # gRPC
+            - containerPort: 9411   # Zipkin
+          env:
+            - name: COLLECTOR_ZIPKIN_HOST_PORT
+              value: ":9411"
+```
+
+### Template 4: Linkerd Viz Dashboard
+
+```bash
+# Install Linkerd viz extension
+linkerd viz install | kubectl apply -f -
+
+# Access dashboard
+linkerd viz dashboard
+
+# CLI commands for observability
+# Top requests
+linkerd viz top deploy/my-app
+
+# Per-route metrics
+linkerd viz routes deploy/my-app --to deploy/backend
+
+# Live traffic inspection
+linkerd viz tap deploy/my-app --to deploy/backend
+
+# Service edges (dependencies)
+linkerd viz edges deployment -n my-namespace
+```
+
+### Template 5: Grafana Dashboard JSON
+
+```json
+{
+  "dashboard": {
+    "title": "Service Mesh Overview",
+    "panels": [
+      {
+        "title": "Request Rate",
+        "type": "graph",
+        "targets": [
+          {
+            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
+            "legendFormat": "{{destination_service_name}}"
+          }
+        ]
+      },
+      {
+        "title": "Error Rate",
+        "type": "gauge",
+        "targets": [
+          {
+            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
+          }
+        ],
+        "fieldConfig": {
+          "defaults": {
+            "thresholds": {
+              "steps": [
+                {"value": 0, "color": "green"},
+                {"value": 1, "color": "yellow"},
+                {"value": 5, "color": "red"}
+              ]
+            }
+          }
+        }
+      },
+      {
+        "title": "P99 Latency",
+        "type": "graph",
+        "targets": [
+          {
+            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
+            "legendFormat": "{{destination_service_name}}"
+          }
+        ]
+      },
+      {
+        "title": "Service Topology",
+        "type": "nodeGraph",
+        "targets": [
+          {
+            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+
+### Template 6: Kiali Service Mesh Visualization
+
+```yaml
+# Kiali installation
+apiVersion: kiali.io/v1alpha1
+kind: Kiali
+metadata:
+  name: kiali
+  namespace: istio-system
+spec:
+  auth:
+    strategy: anonymous  # or openid, token
+  deployment:
+    accessible_namespaces:
+      - "**"
+  external_services:
+    prometheus:
+      url: http://prometheus.istio-system:9090
+    tracing:
+      url: http://jaeger-query.istio-system:16686
+    grafana:
+      url: http://grafana.istio-system:3000
+```
+
+### Template 7: OpenTelemetry Integration
+
+```yaml
+# OpenTelemetry Collector for mesh
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: otel-collector-config
+data:
+  config.yaml: |
+    receivers:
+      otlp:
+        protocols:
+          grpc:
+            endpoint: 0.0.0.0:4317
+          http:
+            endpoint: 0.0.0.0:4318
+      zipkin:
+        endpoint: 0.0.0.0:9411
+
+    processors:
+      batch:
+        timeout: 10s
+
+    exporters:
+      jaeger:
+        endpoint: jaeger-collector:14250
+        tls:
+          insecure: true
+      prometheus:
+        endpoint: 0.0.0.0:8889
+
+    service:
+      pipelines:
+        traces:
+          receivers: [otlp, zipkin]
+          processors: [batch]
+          exporters: [jaeger]
+        metrics:
+          receivers: [otlp]
+          processors: [batch]
+          exporters: [prometheus]
+---
+# Istio Telemetry v2 with OTel
+apiVersion: telemetry.istio.io/v1alpha1
+kind: Telemetry
+metadata:
+  name: mesh-default
+  namespace: istio-system
+spec:
+  tracing:
+    - providers:
+        - name: otel
+      randomSamplingPercentage: 10
+```
+
+## Alerting Rules
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: mesh-alerts
+  namespace: istio-system
+spec:
+  groups:
+    - name: mesh.rules
+      rules:
+        - alert: HighErrorRate
+          expr: |
+            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
+            / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            summary: "High error rate for {{ $labels.destination_service_name }}"
+
+        - alert: HighLatency
+          expr: |
+            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
+            by (le, destination_service_name)) > 1000
+          for: 5m
+          labels:
+            severity: warning
+          annotations:
+            summary: "High P99 latency for {{ $labels.destination_service_name }}"
+
+        - alert: MeshCertExpiring
+          expr: |
+            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
+          labels:
+            severity: warning
+          annotations:
+            summary: "Mesh certificate expiring in less than 7 days"
+```
+
+## Best Practices
+
+### Do's
+- **Sample appropriately** - 100% in dev, 1-10% in prod
+- **Use trace context** - Propagate headers consistently
+- **Set up alerts** - For golden signals
+- **Correlate metrics/traces** - Use exemplars
+- **Retain strategically** - Hot/cold storage tiers
+
+### Don'ts
+- **Don't over-sample** - Storage costs add up
+- **Don't ignore cardinality** - Limit label values
+- **Don't skip dashboards** - Visualize dependencies
+- **Don't forget costs** - Monitor observability costs
+
+## Resources
+
+- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
+- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
+- [OpenTelemetry](https://opentelemetry.io/)
+- [Kiali](https://kiali.io/)