prometheus

Skill from cosmix/claude-loom

What it does

Configures and manages Prometheus monitoring, enabling metrics collection, PromQL querying, alerting rules, and service discovery for cloud-native observability.

πŸ“¦

Part of

cosmix/claude-loom(12 items)

prometheus

Installation

Run the install script directly:

```bash
curl -fsSL https://raw.githubusercontent.com/cosmix/loom/main/install.sh | bash
```

Or clone the repository and run the install script:

```bash
git clone https://github.com/cosmix/loom.git
bash install.sh
```
πŸ“– Extracted from docs: cosmix/claude-loom
1Installs
-
AddedFeb 4, 2026

Skill Details (SKILL.md)

# Prometheus Monitoring and Alerting

Overview

Prometheus is a powerful open-source monitoring and alerting system designed for reliability and scalability in cloud-native environments. It is built around multi-dimensional time-series data with flexible querying via PromQL.

Architecture Components

  • Prometheus Server: Core component that scrapes and stores time-series data with local TSDB
  • Alertmanager: Handles alerts, deduplication, grouping, routing, and notifications to receivers
  • Pushgateway: Allows ephemeral jobs to push metrics (use sparingly - prefer pull model)
  • Exporters: Convert metrics from third-party systems to Prometheus format (node, blackbox, etc.)
  • Client Libraries: Instrument application code (Go, Java, Python, Rust, etc.)
  • Prometheus Operator: Kubernetes-native deployment and management via CRDs
  • Remote Storage: Long-term storage via Thanos, Cortex, Mimir for multi-cluster federation
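
When a Pushgateway is in the mix for ephemeral jobs, the Prometheus side is just another scrape job; setting honor_labels: true preserves the job/instance labels that the batch job pushed. A minimal sketch, assuming a Pushgateway reachable at pushgateway:9091 (host and port are illustrative):

```yaml
scrape_configs:
  - job_name: "pushgateway"
    honor_labels: true # keep the job/instance labels supplied by the pushing job
    static_configs:
      - targets: ["pushgateway:9091"]
```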

Data Model

  • Metrics: Time-series data identified by metric name and key-value labels
  • Format: metric_name{label1="value1", label2="value2"} sample_value timestamp
  • Metric Types:

- Counter: Monotonically increasing value (requests, errors) - use rate() or increase() for querying

- Gauge: Value that can go up/down (temperature, memory usage, queue length)

- Histogram: Observations in configurable buckets (latency, request size) - exposes _bucket, _sum, _count

- Summary: Similar to histogram but calculates quantiles client-side - use histograms for aggregation

Setup and Configuration

Basic Prometheus Server Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-east-1"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - "alerts/*.yml"
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application services
  - job_name: "application"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "app-1:8080"
          - "app-2:8080"
        labels:
          env: "production"
          team: "backend"

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use custom port if specified
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      # Add service name label
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app

  # Node Exporter for host metrics
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node-exporter:9100"
```

Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

# Template files for custom notifications
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Route alerts to appropriate receivers
route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: "default"
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true
    # Database alerts to DBA team
    - match:
        team: database
      receiver: "dba-team"
      group_by: ["alertname", "instance"]
    # Development environment alerts
    - match:
        env: development
      receiver: "slack-dev"
      group_wait: 5m
      repeat_interval: 4h

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Suppress warning alerts if critical alert is firing
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]
  # Suppress instance alerts if entire service is down
  - source_match:
      alertname: "ServiceDown"
    target_match_re:
      alertname: ".*"
    equal: ["service"]

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: "Alert: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        description: "{{ .GroupLabels.alertname }}"

  - name: "dba-team"
    slack_configs:
      - channel: "#database-alerts"
    email_configs:
      - to: "dba-team@example.com"
        headers:
          Subject: "Database Alert: {{ .GroupLabels.alertname }}"

  - name: "slack-dev"
    slack_configs:
      - channel: "#dev-alerts"
        send_resolved: true
```

Best Practices

Metric Naming Conventions

Follow these naming patterns for consistency:

```text
# Format: <namespace>_<subsystem>_<name>_<unit>

# Counters (always use _total suffix)
http_requests_total
http_request_errors_total
cache_hits_total

# Gauges
memory_usage_bytes
active_connections
queue_size

# Histograms (expose _bucket, _sum, _count suffixes automatically)
http_request_duration_seconds
response_size_bytes
db_query_duration_seconds

# Use consistent base units:
# - seconds for duration (not milliseconds)
# - bytes for size (not kilobytes)
# - ratio for percentages (0.0-1.0, not 0-100)
```

Label Cardinality Management

#### DO

```text
# Good: Bounded cardinality
http_requests_total{method="GET", status="200", endpoint="/api/users"}

# Good: Reasonable number of label values
db_queries_total{table="users", operation="select"}
```

#### DON'T

```text
# Bad: Unbounded cardinality (user IDs, email addresses, timestamps)
http_requests_total{user_id="12345"}
http_requests_total{email="user@example.com"}
http_requests_total{timestamp="1234567890"}

# Bad: High cardinality (full URLs, IP addresses)
http_requests_total{url="/api/users/12345/profile"}
http_requests_total{client_ip="192.168.1.100"}
```

#### Guidelines

  • Keep label values to < 10 per label (ideally)
  • Total unique time-series per metric should be < 10,000
  • Use recording rules to pre-aggregate high-cardinality metrics
  • Avoid labels with unbounded values (IDs, timestamps, user input)

Recording Rules for Performance

Use recording rules to pre-compute expensive queries:

```yaml
# rules/recording_rules.yml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-calculate request rates
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Pre-calculate error rates
      - record: job:http_request_errors:rate5m
        expr: sum(rate(http_request_errors_total[5m])) by (job)

      # Pre-calculate error ratio
      - record: job:http_request_error_ratio:rate5m
        expr: |
          job:http_request_errors:rate5m
          /
          job:http_requests:rate5m

      # Pre-aggregate latency percentiles
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

  - name: aggregation_rules
    interval: 1m
    rules:
      # Multi-level aggregation for dashboards
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      - record: cluster:node_cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)

      # Memory aggregation
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )
```

Alert Design (Symptoms vs Causes)

#### Alert on symptoms (user-facing impact), not causes

```yaml
# alerts/symptom_based.yml
groups:
  - name: symptom_alerts
    rules:
      # GOOD: Alert on user-facing symptoms
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"
          impact: "Users experiencing slow page loads"

      # GOOD: SLO-based alerting
      - alert: SLOBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * (1 - 0.999)) # 14.4x burn rate for 99.9% SLO
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "SLO budget burning too fast"
          description: "Error ratio over the last hour is {{ $value | humanizePercentage }}, above the 14.4x burn-rate threshold for a 99.9% SLO"
```

#### Cause-based alerts (use for debugging, not paging)

```yaml
# alerts/cause_based.yml
groups:
  - name: infrastructure_alerts
    rules:
      # Lower severity for infrastructure issues
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning # Not critical unless symptoms appear
          team: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            node_filesystem_size_bytes{mountpoint="/"}
          ) < 0.1
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          action: "Clean up logs or expand disk"
```

Alert Best Practices

  1. Use a for duration: require the condition to hold for the for period before firing, to avoid flapping
  2. Meaningful annotations: Include summary, description, runbook URL, impact
  3. Proper severity levels: critical (page immediately), warning (ticket), info (log)
  4. Actionable alerts: Every alert should require human action
  5. Include context: Add labels for team ownership, service, environment

PromQL Query Patterns

PromQL is the query language for Prometheus. Key concepts: instant vectors, range vectors, scalars, string literals, selectors, operators, functions, and aggregation.

Selectors and Matchers

```promql
# Instant vector selector (latest sample for each time-series)
http_requests_total

# Filter by label values
http_requests_total{method="GET", status="200"}

# Regex matching (=~) and negative regex (!~)
http_requests_total{status=~"5.."}        # 5xx errors
http_requests_total{endpoint!~"/admin.*"} # exclude admin endpoints

# Label absence/presence
http_requests_total{job="api", status=""}  # empty label
http_requests_total{job="api", status!=""} # non-empty label

# Range vector selector (samples over time)
http_requests_total[5m] # last 5 minutes of samples
```

Rate Calculations

```promql
# Request rate (requests per second) - ALWAYS use rate() for counters
rate(http_requests_total[5m])

# Sum by service
sum(rate(http_requests_total[5m])) by (service)

# Increase over time window (total count) - for alerts/dashboards showing totals
increase(http_requests_total[1h])

# irate() for volatile, fast-moving counters (more sensitive to spikes)
irate(http_requests_total[5m])
```

Error Ratios

```promql
# Error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

Histogram Queries

```promql
# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P50, P95, P99 latency by service
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Average request duration
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)
```

Aggregation Operations

```promql
# Sum across all instances
sum(node_memory_MemTotal_bytes) by (cluster)

# Average idle CPU rate per instance
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Maximum value
max(http_request_duration_seconds) by (service)

# Minimum value
min(node_filesystem_avail_bytes) by (instance)

# Count number of instances
count(up == 1) by (job)

# Standard deviation
stddev(http_request_duration_seconds) by (service)
```

Advanced Queries

```promql
# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))

# Bottom 3 instances by available memory
bottomk(3, node_memory_MemAvailable_bytes)

# Predict disk full time (linear regression)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0

# Compare with 1 day ago
http_requests_total - http_requests_total offset 1d

# Rate of change (derivative)
deriv(node_memory_MemAvailable_bytes[5m])

# Absent metric detection
absent(up{job="critical-service"})
```
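
The predict_linear() query above is commonly wrapped in an alerting rule; a minimal sketch, where the alert name, for duration, and severity are illustrative choices rather than part of the original:

```yaml
groups:
  - name: capacity_alerts
    rules:
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours"
```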

Complex Aggregations

```promql
# Calculate Apdex score (Application Performance Index)
# Buckets are cumulative, so (satisfied + tolerated) averages the le="0.1" and le="0.5" buckets
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

# Multi-window multi-burn-rate SLO
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
  > 0.001 * 14.4
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
  > 0.001 * 14.4
)
```

Binary Operators and Vector Matching

```promql
# Arithmetic operators (+, -, *, /, %, ^)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Comparison operators (==, !=, >, <, >=, <=) - filter to matching values
http_request_duration_seconds > 1

# Logical operators (and, or, unless)
up{job="api"} and rate(http_requests_total[5m]) > 100

# One-to-one matching (default)
method:http_requests:rate5m / method:http_requests:total

# Many-to-one matching with group_left
sum(rate(http_requests_total[5m])) by (instance, method)
/ on(instance) group_left
sum(rate(http_requests_total[5m])) by (instance)

# One-to-many matching with group_right
sum(rate(http_requests_total[5m])) by (instance)
/ on(instance) group_right
sum(rate(http_requests_total[5m])) by (instance, method)
```

Time Functions and Offsets

```promql
# Compare with previous time period
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)

# Day-over-day comparison
http_requests_total - http_requests_total offset 1d

# Time-based filtering (on() is needed because hour()/day_of_week() return label-less vectors)
http_requests_total and on() (hour() >= 9 and hour() < 17)  # business hours
day_of_week() == 0 or day_of_week() == 6                    # weekends

# Timestamp functions
time() - process_start_time_seconds # uptime in seconds
```

Service Discovery

Prometheus supports multiple service discovery mechanisms for dynamic environments where targets appear and disappear.

Static Configuration

```yaml
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      - targets:
          - 'host1:9100'
          - 'host2:9100'
        labels:
          env: production
          region: us-east-1
```

File-based Service Discovery

```yaml
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s

# targets/webservers.json
[
  {
    "targets": ["web1:8080", "web2:8080"],
    "labels": {
      "job": "web",
      "env": "prod"
    }
  }
]
```

Kubernetes Service Discovery

```yaml
scrape_configs:
  # Pod-based discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - production
            - staging
    relabel_configs:
      # Keep only pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Extract custom scrape path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Extract custom port from annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add standard Kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name

  # Service-based discovery
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # Node-based discovery (for node exporters)
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

  # Endpoints discovery (for service endpoints)
  - job_name: 'kubernetes-endpoints'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics
```

Consul Service Discovery

```yaml
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        datacenter: 'dc1'
        services: ['web', 'api', 'cache']
        tags: ['production']
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_tags]
        target_label: tags
```

EC2 Service Discovery

```yaml
scrape_configs:
  - job_name: 'ec2-instances'
    ec2_sd_configs:
      - region: us-east-1
        access_key: YOUR_ACCESS_KEY
        secret_key: YOUR_SECRET_KEY
        port: 9100
        filters:
          - name: tag:Environment
            values: [production]
          - name: instance-state-name
            values: [running]
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type
```

DNS Service Discovery

```yaml
scrape_configs:
  - job_name: 'dns-srv-records'
    dns_sd_configs:
      - names:
          - '_prometheus._tcp.example.com'
        type: 'SRV'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: instance
```

Relabeling Actions Reference

| Action | Description | Use Case |
|--------|-------------|----------|
| keep | Keep targets where regex matches source labels | Filter targets by annotation/label |
| drop | Drop targets where regex matches source labels | Exclude specific targets |
| replace | Replace target label with value from source labels | Extract custom labels/paths/ports |
| labelmap | Map source label names to target labels via regex | Copy all Kubernetes labels |
| labeldrop | Drop labels matching regex | Remove internal metadata labels |
| labelkeep | Keep only labels matching regex | Reduce cardinality |
| hashmod | Set target label to hash of source labels modulo N | Sharding/routing |
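
The table above only names the actions; drop and labeldrop are not demonstrated elsewhere in this document, so here is a minimal sketch using metric_relabel_configs (the job, target, and label names are illustrative):

```yaml
scrape_configs:
  - job_name: 'application'
    static_configs:
      - targets: ['app-1:8080']
    metric_relabel_configs:
      # drop: discard entire series whose metric name matches the regex
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
      # labeldrop: strip a label (matched by name) from every ingested series
      - regex: 'pod_template_hash'
        action: labeldrop
```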

High Availability and Scalability

Prometheus High Availability Setup

```yaml
# Deploy multiple identical Prometheus instances scraping the same targets.
# Use external labels to distinguish instances.
global:
  external_labels:
    replica: prometheus-1 # Change to prometheus-2, etc.
    cluster: production

# Alertmanager will deduplicate alerts from multiple Prometheus instances
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
            - alertmanager-3:9093
```

Alertmanager Clustering

```yaml
# alertmanager.yml - HA cluster configuration
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
        channel: '#alerts'

# Start Alertmanager cluster members:
# alertmanager-1: --cluster.peer=alertmanager-2:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-2: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-3: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-2:9094
```

Federation for Hierarchical Monitoring

```yaml
# Global Prometheus federating from regional instances
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pull aggregated metrics only
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}' # Recording rules
        - 'up'
    static_configs:
      - targets:
          - 'prometheus-us-east-1:9090'
          - 'prometheus-us-west-2:9090'
          - 'prometheus-eu-west-1:9090'
        labels:
          region: 'us-east-1'
```

Remote Storage for Long-term Retention

```yaml
# Prometheus remote write to Thanos/Cortex/Mimir
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 50
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 100ms
    write_relabel_configs:
      # Drop high-cardinality metrics before remote write
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop

# Prometheus remote read from long-term storage
remote_read:
  - url: "http://thanos-query:9090/api/v1/read"
    read_recent: true
```

Thanos Architecture for Global View

```bash
# Thanos Sidecar - runs alongside Prometheus
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/prometheus \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Thanos Store - queries object storage
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Thanos Query - global query interface
thanos query \
  --http-address=0.0.0.0:9090 \
  --grpc-address=0.0.0.0:10901 \
  --store=prometheus-1-sidecar:10901 \
  --store=prometheus-2-sidecar:10901 \
  --store=thanos-store:10901

# Thanos Compactor - downsample and compact blocks
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d
```

Horizontal Sharding with Hashmod

```yaml
# Split scrape targets across multiple Prometheus instances using hashmod
scrape_configs:
  - job_name: 'kubernetes-pods-shard-0'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash pod name and keep only shard 0 (mod 3)
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep

  - job_name: 'kubernetes-pods-shard-1'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "1"
        action: keep

# shard-2 follows the same pattern...
```

Kubernetes Integration

ServiceMonitor for Prometheus Operator

```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    app: myapp
    release: prometheus
spec:
  # Select services to monitor
  selector:
    matchLabels:
      app: myapp
  # Define namespaces to search
  namespaceSelector:
    matchNames:
      - production
      - staging
  # Endpoint configuration
  endpoints:
    - port: metrics # Service port name
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
      # Metric relabeling (filter/modify metrics)
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop # Drop Go runtime metrics
        - sourceLabels: [status]
          regex: "[45].."
          targetLabel: error
          replacement: "true"
      # Optional: TLS configuration
      # tlsConfig:
      #   insecureSkipVerify: true
      #   ca:
      #     secret:
      #       name: prometheus-tls
      #       key: ca.crt
```

PodMonitor for Direct Pod Scraping

```yaml
# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: app-pods
  namespace: monitoring
  labels:
    release: prometheus
spec:
  # Select pods to monitor
  selector:
    matchLabels:
      app: myapp
  # Namespace selection
  namespaceSelector:
    matchNames:
      - production
  # Pod metrics endpoints
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s
      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_version]
          targetLabel: version
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
```

PrometheusRule for Alerts and Recording Rules

```yaml
# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-rules
  namespace: monitoring
  labels:
    release: prometheus
    role: alert-rules
spec:
  groups:
    - name: app_alerts
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..", app="myapp"}[5m]))
              /
              sum(rate(http_requests_total{app="myapp"}[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "High error rate on {{ $labels.namespace }}/{{ $labels.pod }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
            dashboard: "https://grafana.example.com/d/app-overview"
            runbook: "https://wiki.example.com/runbooks/high-error-rate"

        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} has restarted {{ $value }} times in 15m"

    - name: app_recording_rules
      interval: 30s
      rules:
        - record: app:http_requests:rate5m
          expr: sum(rate(http_requests_total{app="myapp"}[5m])) by (namespace, pod, method, status)

        - record: app:http_request_duration_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le, namespace, pod)
            )
```

Prometheus Custom Resource

```yaml
# prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  version: v2.45.0

  # Service account for Kubernetes API access
  serviceAccountName: prometheus

  # Select ServiceMonitors
  serviceMonitorSelector:
    matchLabels:
      release: prometheus

  # Select PodMonitors
  podMonitorSelector:
    matchLabels:
      release: prometheus

  # Select PrometheusRules
  ruleSelector:
    matchLabels:
      release: prometheus
      role: alert-rules

  # Resource limits
  resources:
    requests:
      memory: 2Gi
      cpu: 1000m
    limits:
      memory: 4Gi
      cpu: 2000m

  # Storage
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: fast-ssd

  # Retention
  retention: 30d
  retentionSize: 45GB

  # Alertmanager configuration
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager
        port: web

  # External labels
  externalLabels:
    cluster: production
    region: us-east-1

  # Security context
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000

  # Enable admin API for management operations
  enableAdminAPI: false

  # Additional scrape configs (from Secret)
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
```
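
additionalScrapeConfigs points at a key inside a Kubernetes Secret that holds raw scrape_configs YAML. A minimal sketch of that Secret, matching the name and key referenced above (the scrape job inside it is illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: additional-scrape-configs
  namespace: monitoring
stringData:
  prometheus-additional.yaml: |
    - job_name: "external-blackbox"
      static_configs:
        - targets: ["blackbox-exporter:9115"]
```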

Application Instrumentation Examples

Go Application

```go
// main.go
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter for total requests
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	// Histogram for request duration
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "endpoint"},
	)

	// Gauge for active connections
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)

	// Summary for response sizes
	responseSizeBytes = promauto.NewSummaryVec(
		prometheus.SummaryOpts{
			Name:       "http_response_size_bytes",
			Help:       "HTTP response size in bytes",
			Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		},
		[]string{"endpoint"},
	)
)

// Middleware to instrument HTTP handlers
func instrumentHandler(endpoint string, handler http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		activeConnections.Inc()
		defer activeConnections.Dec()

		// Wrap response writer to capture status code
		wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
		handler(wrapped, r)

		duration := time.Since(start).Seconds()
		httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
		httpRequestsTotal.WithLabelValues(r.Method, endpoint,
			strconv.Itoa(wrapped.statusCode)).Inc()
	}
}

type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func handleUsers(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`{"users": []}`))
}

func main() {
	// Register handlers
	http.HandleFunc("/api/users", instrumentHandler("/api/users", handleUsers))
	http.Handle("/metrics", promhttp.Handler())

	// Start server
	http.ListenAndServe(":8080", nil)
}
```

Python Application (Flask)

```python
# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

active_requests = Gauge(
    'active_requests',
    'Number of active requests'
)

# Middleware for instrumentation
@app.before_request
def before_request():
    active_requests.inc()
    request.start_time = time.time()

@app.after_request
def after_request(response):
    active_requests.dec()
    duration = time.time() - request.start_time
    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(duration)
    request_count.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    return response

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

@app.route('/api/users')
def users():
    return {'users': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```

Production Deployment Checklist

  • [ ] Set appropriate retention period (balance storage vs history needs)
  • [ ] Configure persistent storage with adequate size
  • [ ] Enable high availability (multiple Prometheus replicas or federation)
  • [ ] Set up remote storage for long-term retention (Thanos, Cortex, Mimir)
  • [ ] Configure service discovery for dynamic environments
  • [ ] Implement recording rules for frequently-used queries
  • [ ] Create symptom-based alerts with proper annotations
  • [ ] Set up Alertmanager with appropriate routing and receivers
  • [ ] Configure inhibition rules to reduce alert noise
  • [ ] Add runbook URLs to all critical alerts
  • [ ] Implement proper label hygiene (avoid high cardinality)
  • [ ] Monitor Prometheus itself (meta-monitoring)
  • [ ] Set up authentication and authorization
  • [ ] Enable TLS for scrape targets and remote storage
  • [ ] Configure rate limiting for queries
  • [ ] Test alert and recording rule validity (promtool check rules)
  • [ ] Implement backup and disaster recovery procedures
  • [ ] Document metric naming conventions for the team
  • [ ] Create dashboards in Grafana for common queries
  • [ ] Set up log aggregation alongside metrics (Loki)
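
For the meta-monitoring item in the checklist above, a minimal sketch of rules that fire when a Prometheus replica stops being scraped or rule evaluations start failing (job name, durations, and severities are illustrative):

```yaml
groups:
  - name: meta_monitoring
    rules:
      - alert: PrometheusTargetMissing
        expr: absent(up{job="prometheus"} == 1)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No healthy Prometheus instance is being scraped"
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[15m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is failing to evaluate recording/alerting rules"
```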

Troubleshooting Commands

```bash
# Check Prometheus configuration syntax
promtool check config prometheus.yml

# Check rules file syntax
promtool check rules alerts/*.yml

# Test PromQL queries
promtool query instant http://localhost:9090 'up'

# Check which targets are up
curl http://localhost:9090/api/v1/targets

# Query current metric values
curl 'http://localhost:9090/api/v1/query?query=up'

# Check service discovery
curl http://localhost:9090/api/v1/targets/metadata

# View TSDB stats
curl http://localhost:9090/api/v1/status/tsdb

# Check runtime information
curl http://localhost:9090/api/v1/status/runtimeinfo
```

Quick Reference

Common PromQL Patterns

```promql
# Request rate per second
rate(http_requests_total[5m])

# Error ratio percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# P95 latency from histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency from histogram
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

# Memory utilization percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU utilization (non-idle)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Disk space remaining percentage
100 * node_filesystem_avail_bytes / node_filesystem_size_bytes

# Top 5 endpoints by request rate
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))

# Service uptime in days
(time() - process_start_time_seconds) / 86400

# Request rate growth compared to 1 hour ago
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)
```

Alert Rule Patterns

```yaml
# High error rate (symptom-based)
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate is {{ $value | humanizePercentage }}"
    runbook: "https://runbooks.example.com/high-error-rate"

# High latency P95
- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 1
  for: 5m
  labels:
    severity: warning

# Service down
- alert: ServiceDown
  expr: up{job="critical-service"} == 0
  for: 2m
  labels:
    severity: critical

# Disk space low (cause-based, warning only)
- alert: DiskSpaceLow
  expr: |
    node_filesystem_avail_bytes{mountpoint="/"}
    / node_filesystem_size_bytes{mountpoint="/"} < 0.1
  for: 10m
  labels:
    severity: warning

# Pod crash looping
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: warning
```

Recording Rule Naming Convention

```yaml
# Format: level:metric:operations
#   level      = aggregation level (job, instance, cluster)
#   metric     = base metric name
#   operations = transformations applied (rate5m, sum, ratio)
groups:
  - name: aggregation_rules
    rules:
      # Instance-level aggregation
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      # Job-level aggregation
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Job-level error ratio
      - record: job:http_request_errors:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)

      # Cluster-level aggregation
      - record: cluster:cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)
```

Metric Naming Best Practices

| Pattern | Good Example | Bad Example |
|---------|--------------|-------------|
| Counter suffix | http_requests_total | http_requests |
| Base units | http_request_duration_seconds | http_request_duration_ms |
| Ratio range | cache_hit_ratio (0.0-1.0) | cache_hit_percentage (0-100) |
| Byte units | response_size_bytes | response_size_kb |
| Namespace prefix | myapp_http_requests_total | http_requests_total |
| Label naming | {method="GET", status="200"} | {httpMethod="GET", statusCode="200"} |

Label Cardinality Guidelines

| Cardinality | Examples | Recommendation |
|-------------|----------|----------------|
| Low (<10) | HTTP method, status code, environment | Safe for all labels |
| Medium (10-100) | API endpoint, service name, pod name | Safe with aggregation |
| High (100-1000) | Container ID, hostname | Use only when necessary |
| Unbounded | User ID, IP address, timestamp, URL path | Never use as label |

Kubernetes Annotation-based Scraping

```yaml
# Pod annotations for automatic Prometheus scraping
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
    prometheus.io/scheme: "http"
spec:
  containers:
    - name: app
      image: myapp:latest
      ports:
        - containerPort: 8080
          name: metrics
```

Alertmanager Routing Patterns

```yaml
route:
  receiver: default
  group_by: ['alertname', 'cluster']
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true # Also send to default
    # Team-based routing
    - match:
        team: database
      receiver: dba-team
      group_by: ['alertname', 'instance']
    # Environment-based routing
    - match:
        env: development
      receiver: slack-dev
      repeat_interval: 4h
    # Time-based routing (office hours only)
    - match:
        severity: warning
      receiver: email
      active_time_intervals:
        - business-hours

time_intervals:
  - name: business-hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '17:00'
        weekdays: ['monday:friday']
```

Additional Resources

  • [Prometheus Documentation](https://prometheus.io/docs/)
  • [PromQL Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/)
  • [Best Practices](https://prometheus.io/docs/practices/)
  • [Alerting Rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
  • [Recording Rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/)
  • [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator)
  • [Thanos Documentation](https://thanos.io/tip/thanos/getting-started.md/)
  • [Google SRE Book - Monitoring](https://sre.google/sre-book/monitoring-distributed-systems/)

More from this repository10

🎯
data-validation🎯Skill

Validates and sanitizes data across various formats and use cases, ensuring data integrity and security.

🎯
refactoring🎯Skill

Systematically restructures code to enhance readability, maintainability, and performance while preserving its original behavior.

🎯
event-driven🎯Skill

Enables scalable, loosely-coupled systems by implementing event-driven architectures with message queues, pub/sub patterns, and distributed transaction management across various messaging platforms.

🎯
logging-observability🎯Skill

Enables comprehensive logging and observability by providing structured logging, distributed tracing, metrics collection, and centralized log management patterns.

🎯
background-jobs🎯Skill

Manages asynchronous task processing with robust job queues, scheduling, worker pools, and advanced retry strategies across various frameworks.

🎯
testing🎯Skill

Comprehensively tests software across domains, implementing unit, integration, and end-to-end tests with TDD/BDD workflows and robust test architecture.

🎯
auth🎯Skill

Implements robust authentication and authorization patterns including OAuth2, JWT, MFA, access control, and identity management.

🎯
feature-flags🎯Skill

Enables runtime feature control through configurable flags for gradual rollouts, A/B testing, user targeting, and dynamic system configuration.

🎯
grafana🎯Skill

Designs and configures Grafana dashboards, panels, and visualizations for observability using LGTM stack technologies like Loki, Tempo, and Mimir.

🎯
karpenter🎯Skill

Dynamically provisions and optimizes Kubernetes nodes using intelligent instance selection, spot management, and cost-efficient scaling strategies.