mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
Remove references to non-existent resource files (references/, assets/, scripts/, examples/) from 115 skill SKILL.md files. These sections pointed to directories and files that were never created, causing confusion when users install skills. Also fix broken Code of Conduct links in issue templates to use absolute GitHub URLs instead of relative paths that 404.
395 lines
9.8 KiB
Markdown
---
name: prometheus-configuration
description: Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
---

# Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

## Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

## When to Use

- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery

## Prometheus Architecture

```
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ AlertManager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)
```

## Installation

### Kubernetes with Helm

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Persistent storage is requested through the chart's storageSpec volume claim template
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

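For repeatable installs, chart settings can live in a values file instead of `--set` flags. The sketch below assumes the chart's `prometheus.prometheusSpec.storageSpec.volumeClaimTemplate` layout for persistent storage; the file name `values.yaml` and the `ReadWriteOnce` access mode are illustrative choices, not something this guide prescribes.

```yaml
# values.yaml (hypothetical file name)
prometheus:
  prometheusSpec:
    # Keep 30 days of samples
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # assumption: single-node volume
          resources:
            requests:
              storage: 50Gi
```

Pass the file to Helm with `-f values.yaml` on the `helm install` or `helm upgrade` command line.
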
### Docker Compose

```yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

volumes:
  prometheus-data:
```

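The `prometheus.yml` in this guide loads rule files from `/etc/prometheus/rules/` and sends alerts to `alertmanager:9093`, and the Compose file above provides neither. One possible extension is sketched here, assuming a local `./rules` directory (a hypothetical path) and the `prom/alertmanager` image running with its default configuration:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Assumed local directory holding recording and alert rule files
      - ./rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

  # The service name doubles as the DNS name used by the alertmanager:9093 target
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"

volumes:
  prometheus-data:
```
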
## Configuration File

**prometheus.yml:**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Kubernetes pods with annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - "app1.example.com:9090"
          - "app2.example.com:9090"
    metrics_path: "/metrics"
    scheme: "https"
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key
```

**Reference:** See `assets/prometheus.yml.template`

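The `kubernetes-pods` job keeps only pods that opt in through annotations. Kubernetes service discovery exposes each annotation as a `__meta_kubernetes_pod_annotation_<name>` label with unsupported characters replaced by underscores, so `prometheus.io/scrape` becomes `prometheus_io_scrape`. Below is a minimal sketch of a pod this job would scrape; the pod name, image, port, and path are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # hypothetical name
  annotations:
    prometheus.io/scrape: "true"    # matched by the `keep` rule
    prometheus.io/path: "/metrics"  # copied into __metrics_path__
    prometheus.io/port: "8080"      # combined with the pod IP into __address__
spec:
  containers:
    - name: app
      image: example/app:1.0        # hypothetical image
      ports:
        - containerPort: 8080
```

Annotation values must be strings, which is why `"true"` and `"8080"` are quoted.
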
## Scrape Configurations

### Static Targets

```yaml
scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets: ["host1:9100", "host2:9100"]
        labels:
          env: "production"
          region: "us-west-2"
```

### File-based Service Discovery

```yaml
scrape_configs:
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m
```

**targets/production.json:**

```json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
```

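Because the `file-sd` job also globs `*.yml`, target files can be written in YAML with the same structure as the JSON form. A hypothetical `targets/staging.yml`, with placeholder host names and label values:

```yaml
# /etc/prometheus/targets/staging.yml (hypothetical file)
- targets:
    - "app3:9090"
    - "app4:9090"
  labels:
    env: "staging"
    service: "api"
```

Prometheus watches these files for changes and also re-reads them every `refresh_interval`, so edits take effect without a restart.
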
### Kubernetes Service Discovery

```yaml
scrape_configs:
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

**Reference:** See `references/scrape-configs.md`

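A Service opts in to the `kubernetes-services` job through annotations on the Service object. The sketch below uses hypothetical names, ports, and a hypothetical metrics path to show which annotation feeds which relabel rule.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-api                          # hypothetical name
  annotations:
    prometheus.io/scrape: "true"             # matched by the `keep` rule
    prometheus.io/scheme: "https"            # copied into __scheme__
    prometheus.io/path: "/internal/metrics"  # copied into __metrics_path__
spec:
  selector:
    app: example-api
  ports:
    - name: metrics
      port: 8443
      targetPort: 8443
```

With `role: service`, Prometheus discovers one target per service port and scrapes it through the Service's DNS name, so the request reaches whichever backing pod the Service routes to.
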
## Recording Rules

Create pre-computed metrics for frequently queried expressions:

```yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
```

**Reference:** See `references/recording-rules.md`

## Alert Rules

```yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
```

## Validation

```bash
# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'
```

**Reference:** See `scripts/validate-prometheus.sh`

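Beyond syntax checks, `promtool test rules` can evaluate rules against synthetic series. The sketch below exercises the `ServiceDown` alert from the Alert Rules section. The test file name `alert_rules_test.yml`, its location next to `alert_rules.yml`, and the instance value are assumptions made for illustration.

```yaml
# alert_rules_test.yml (hypothetical file, placed next to alert_rules.yml)
rule_files:
  - alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m   # spacing between the synthetic samples below
    input_series:
      # The target reports down for five consecutive samples (0m to 4m)
      - series: 'up{job="my-app", instance="app1.example.com:9090"}'
        values: "0 0 0 0 0"
    alert_rule_test:
      # ServiceDown has `for: 1m`, so it should be firing well before 3m
      - eval_time: 3m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1.example.com:9090
            exp_annotations:
              summary: "Service app1.example.com:9090 is down"
              description: "my-app has been down for more than 1 minute"
```

Run it with `promtool test rules alert_rules_test.yml`. Paths in `rule_files` are resolved relative to the test file.
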
## Best Practices

1. **Use consistent naming** for metrics (prefix_name_unit)
2. **Set appropriate scrape intervals** (15-60s typical)
3. **Use recording rules** for expensive queries
4. **Implement high availability** (multiple Prometheus instances)
5. **Configure retention** based on storage capacity
6. **Use relabeling** for metric cleanup
7. **Monitor Prometheus itself**
8. **Implement federation** for large deployments
9. **Use Thanos/Cortex** for long-term storage
10. **Document custom metrics**

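Two of these practices translate directly into scrape configuration. In the sketch below, `metric_relabel_configs` discards unwanted series after the scrape and before storage (practice 6), and a `/federate` job pulls aggregated series from another Prometheus server (practice 8). The target names and the choice of metrics to drop are hypothetical.

```yaml
scrape_configs:
  # Practice 6: drop series that are never queried before they are stored
  - job_name: "my-app"
    static_configs:
      - targets: ["app1.example.com:9090"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_gc_.*"        # example pattern; adjust to your own metrics
        action: drop

  # Practice 8: a global server pulling recording-rule results from a regional one
  - job_name: "federate"
    honor_labels: true           # keep the labels set by the source server
    metrics_path: "/federate"
    params:
      "match[]":
        - '{__name__=~"job:.*"}' # series written by recording rules named job:...
    static_configs:
      - targets:
          - "prometheus-us-west-2:9090"   # hypothetical source server
```

Federating only pre-aggregated series keeps the global server small, which is the usual reason to pair federation with recording rules.
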
## Troubleshooting

**Check scrape targets:**

```bash
curl http://localhost:9090/api/v1/targets
```

**Check configuration:**

```bash
curl http://localhost:9090/api/v1/status/config
```

**Test query:**

```bash
curl 'http://localhost:9090/api/v1/query?query=up'
```

## Related Skills

- `grafana-dashboards` - For visualization
- `slo-implementation` - For SLO monitoring
- `distributed-tracing` - For request tracing