# Grafana and LGTM Stack Skill
## Overview
The LGTM stack provides a complete observability solution with comprehensive visualization and dashboard capabilities:
- Loki: Log aggregation and querying (LogQL)
- Grafana: Visualization, dashboarding, alerting, and exploration
- Tempo: Distributed tracing (TraceQL)
- Mimir: Long-term metrics storage (Prometheus-compatible)
This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.
## When to Use This Skill
### Primary Use Cases
- Creating or modifying Grafana dashboards
- Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
- Writing queries (PromQL, LogQL, TraceQL)
- Configuring data sources (Prometheus, Loki, Tempo, Mimir)
- Setting up alerting rules and notification policies
- Implementing dashboard variables and templates
- Dashboard provisioning and GitOps workflows
- Troubleshooting observability queries
- Analyzing application performance, errors, or system behavior
### Who Uses This Skill
- senior-software-engineer (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture (use infrastructure skills for deployment)
- software-engineer: Application dashboards, service metrics visualization
## LGTM Stack Components
### Loki - Log Aggregation
#### Architecture - Loki
Horizontally scalable log aggregation inspired by Prometheus
- Indexes only metadata (labels), not log content
- Cost-effective storage with object stores (S3, GCS, etc.)
- LogQL query language similar to PromQL
#### Key Concepts - Loki
- Labels for indexing (low cardinality)
- Log streams identified by unique label sets
- Parsers: logfmt, JSON, regex, pattern
- Line filters and label filters
### Grafana - Visualization
#### Features
- Multi-datasource dashboarding
- Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
- Templating and variables for dynamic dashboards
- Alerting (unified alerting with contact points and notification policies)
- Dashboard provisioning and GitOps integration
- Role-based access control (RBAC)
- Explore mode for ad-hoc queries
- Annotations for event markers
- Dashboard folders and organization
### Tempo - Distributed Tracing
#### Architecture - Tempo
Scalable distributed tracing backend
- Cost-effective trace storage
- TraceQL for trace querying
- Integration with logs and metrics (trace-to-logs, trace-to-metrics)
- OpenTelemetry compatible
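Because Tempo ingests OpenTelemetry data natively, shipping traces to it usually only requires pointing an OpenTelemetry Collector (or SDK exporter) at Tempo's OTLP receiver. A minimal sketch, assuming the Collector and Tempo run in the same cluster and Tempo's OTLP gRPC receiver listens on the default `tempo:4317`:
```yaml
# otel-collector config (sketch): receive OTLP from applications, forward to Tempo
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: tempo:4317      # assumption: in-cluster Tempo service name
    tls:
      insecure: true          # assumption: plaintext inside the cluster; use TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```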
### Mimir - Metrics Storage
#### Architecture - Mimir
Horizontally scalable long-term Prometheus storage
- Multi-tenancy support
- Query federation
- High availability
- Prometheus remote_write compatible
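Since Mimir is `remote_write`-compatible, existing Prometheus servers can forward their samples without any scrape changes. A minimal sketch for `prometheus.yml`, assuming Mimir's distributor/gateway is reachable at `mimir:9009` (the same endpoint the Tempo metrics generator uses later in this document):
```yaml
# prometheus.yml (fragment): forward all scraped samples to Mimir
remote_write:
  - url: http://mimir:9009/api/v1/push   # assumption: adjust to your Mimir gateway address
    headers:
      X-Scope-OrgID: tenant-1            # only required when multi-tenancy is enabled
    queue_config:
      max_samples_per_send: 2000
```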
## Dashboard Design and Best Practices
### Dashboard Organization Principles
- Hierarchy: Overview -> Service -> Component -> Deep Dive
- Golden Signals: Latency, Traffic, Errors, Saturation, complemented by the RED/USE methods (see the PromQL sketch after this list)
- Variable-driven: Use templates for flexibility across environments
- Consistent Layouts: Grid alignment (24-column grid), logical top-to-bottom flow
- Performance: Limit queries, use query caching, appropriate time intervals
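As a starting point, the golden signals map to a handful of PromQL queries. A sketch using the `http_requests_total` and `http_request_duration_seconds_bucket` metrics referenced throughout this skill (the saturation query assumes node_exporter is installed):
```promql
# Traffic: requests per second by application
sum by (app) (rate(http_requests_total[$__rate_interval]))

# Errors: ratio of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[$__rate_interval]))
  / sum(rate(http_requests_total[$__rate_interval]))

# Latency: P95 request duration
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[$__rate_interval])))

# Saturation: CPU busy fraction per node (assumes node_exporter metrics)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval]))
```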
### Panel Types and When to Use Them
| Panel Type | Use Case | Best For |
|------------|----------|----------|
| Time Series / Graph | Trends over time | Request rates, latency, resource usage |
| Stat | Single metric value | Error rates, current values, percentage |
| Gauge | Progress toward limit | CPU usage, memory, disk space |
| Bar Gauge | Comparative values | Top N items, distribution |
| Table | Structured data | Service lists, error details, resource inventory |
| Pie Chart | Proportions | Traffic distribution, error breakdown |
| Heatmap | Distribution over time | Latency percentiles, request patterns |
| Logs | Log streams | Error investigation, debugging |
| Traces | Distributed tracing | Performance analysis, dependency mapping |
### Panel Configuration Best Practices
#### Titles and Descriptions
- Clear, descriptive titles: Include units and metric context
- Tooltips: Add description fields for panel documentation
- Examples:
- Good: "P95 Latency (seconds) by Endpoint"
- Bad: "Latency"
#### Legends and Labels
- Show legends only when needed (multiple series)
- Use `{{label}}` format for dynamic legend names
- Place legends appropriately (bottom, right, or hidden)
- Sort by value when showing Top N
#### Axes and Units
- Always label axes with units
- Use appropriate unit formats (seconds, bytes, percent, requests/sec)
- Set reasonable min/max ranges to avoid misleading scales
- Use logarithmic scales for wide value ranges
#### Thresholds and Colors
- Use thresholds for visual cues (green/yellow/red)
- Standard threshold pattern:
- Green: Normal operation
- Yellow: Warning (action may be needed)
- Red: Critical (immediate attention required)
- Examples:
- Error rate: 0% (green), 1% (yellow), 5% (red)
- P95 latency: <1s (green), 1-3s (yellow), >3s (red)
#### Links and Drilldowns
- Link panels to related dashboards
- Use data links for context (logs, traces, related services)
- Create drill-down paths: Overview -> Service -> Component -> Details
- Link to runbooks for alert panels
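Putting several of these practices together, a single panel definition can carry the descriptive title, unit, thresholds, and a drill-down link. A minimal sketch of a Stat panel in current dashboard JSON (the target dashboard UID and label names are illustrative):
```json
{
  "type": "stat",
  "title": "Error Rate (%) - last 5m",
  "description": "Ratio of 5xx responses; click through to the error-investigation dashboard",
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 0.01 },
          { "color": "red", "value": 0.05 }
        ]
      },
      "links": [
        {
          "title": "Error logs for ${__field.labels.app}",
          "url": "/d/app-errors/errors?var-app=${__field.labels.app}"
        }
      ]
    }
  }
}
```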
### Dashboard Variables and Templating
Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.
#### Variable Types
| Type | Purpose | Example |
|------|---------|---------|
| Query | Populate from data source | Namespaces, services, pods |
| Custom | Static list of options | Environments (prod/staging/dev) |
| Interval | Time interval selection | Auto-adjusted query intervals |
| Datasource | Switch between data sources | Multiple Prometheus instances |
| Constant | Hidden values for queries | Cluster name, region |
| Text box | Free-form input | Custom filters |
#### Common Variable Patterns
```json
{
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"description": "Select Prometheus data source"
},
{
"name": "namespace",
"type": "query",
"datasource": "${datasource}",
"query": "label_values(kube_pod_info, namespace)",
"multi": true,
"includeAll": true,
"description": "Kubernetes namespace filter"
},
{
"name": "app",
"type": "query",
"datasource": "${datasource}",
"query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
"multi": true,
"includeAll": true,
"description": "Application filter (depends on namespace)"
},
{
"name": "interval",
"type": "interval",
"auto": true,
"auto_count": 30,
"auto_min": "10s",
"options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
"description": "Query resolution interval"
},
{
"name": "environment",
"type": "custom",
"options": [
{ "text": "Production", "value": "prod" },
{ "text": "Staging", "value": "staging" },
{ "text": "Development", "value": "dev" }
],
"current": { "text": "Production", "value": "prod" }
}
]
}
}
```
#### Variable Usage in Queries
Variables are referenced with `$variable_name` or `${variable_name}` syntax:
```promql
# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])
# Multi-select with regex match
rate(http_requests_total{namespace=~"$namespace"}[5m])
# Variable in legend
sum by (method) (rate(http_requests_total{app="$app"}[5m]))
# Legend format: "{{method}}"
# Using interval variable for adaptive queries
rate(http_requests_total[$__interval])
# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])
```
#### Advanced Variable Techniques
Regex filtering:
```json
{
"name": "pod",
"type": "query",
"query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
"regex": "/^$app-.*/",
"description": "Filter pods by app prefix"
}
```
All option with custom value:
```json
{
"name": "status",
"type": "custom",
"options": ["200", "404", "500"],
"includeAll": true,
"allValue": ".*",
"description": "HTTP status code filter"
}
```
Dependent variables (variable chain):
`$datasource` (datasource type) -> `$cluster` (query: depends on datasource) -> `$namespace` (query: depends on cluster) -> `$app` (query: depends on namespace) -> `$pod` (query: depends on app)
### Annotations
Annotations display events as vertical markers on time series panels:
```json
{
"annotations": {
"list": [
{
"name": "Deployments",
"datasource": "Prometheus",
"expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
"tagKeys": "deployment,namespace",
"textFormat": "Deployment: {{deployment}}",
"iconColor": "blue"
},
{
"name": "Alerts",
"datasource": "Loki",
"expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
"textFormat": "Alert: {{alertname}}",
"iconColor": "red"
}
]
}
}
```
### Dashboard Performance Optimization
#### Query Optimization
- Limit number of panels (< 15 per dashboard)
- Use appropriate time ranges (avoid queries over months)
- Leverage `$__interval` for adaptive sampling
- Avoid high-cardinality grouping (too many series)
- Use query caching when available
#### Panel Performance
- Set max data points to reasonable values
- Use instant queries for current-state panels
- Combine related metrics into single queries when possible
- Disable auto-refresh on heavy dashboards
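The max-data-points and instant-query recommendations translate into a couple of fields on the panel and target. A sketch (metric and variable names are illustrative; `instant` applies to Prometheus-style data sources):
```json
{
  "type": "stat",
  "title": "Running Pods (current)",
  "maxDataPoints": 100,
  "interval": "1m",
  "targets": [
    {
      "refId": "A",
      "expr": "sum(kube_pod_status_phase{phase=\"Running\", namespace=~\"$namespace\"})",
      "instant": true
    }
  ]
}
```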
## Dashboard as Code and Provisioning
### Dashboard Provisioning
Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.
#### Provisioning Provider Configuration
File: /etc/grafana/provisioning/dashboards/dashboards.yaml
```yaml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /etc/grafana/provisioning/dashboards
- name: 'application'
orgId: 1
folder: 'Applications'
type: file
disableDeletion: true
editable: false
options:
path: /var/lib/grafana/dashboards/application
- name: 'infrastructure'
orgId: 1
folder: 'Infrastructure'
type: file
options:
path: /var/lib/grafana/dashboards/infrastructure
```
#### Dashboard JSON Structure
Complete dashboard JSON with metadata and provisioning:
```json
{
"dashboard": {
"title": "Application Observability - ${app}",
"uid": "app-observability",
"tags": ["observability", "application"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "30s",
"templating": { "list": [] },
"panels": [],
"links": []
},
"overwrite": true,
"folderId": null,
"folderUid": null
}
```
#### Kubernetes ConfigMap Provisioning
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
application-dashboard.json: |
{
"dashboard": {
"title": "Application Metrics",
"uid": "app-metrics",
"tags": ["application"],
"panels": []
}
}
```
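The `grafana_dashboard: "1"` label above is what the Grafana Helm chart's dashboard sidecar watches for. A sketch of the matching chart values, assuming the upstream `grafana` chart (verify the key names against the chart version in use):
```yaml
# values.yaml (grafana Helm chart): let the sidecar load labeled ConfigMaps
sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard
    labelValue: "1"
    searchNamespace: ALL              # watch all namespaces, not just Grafana's
    folderAnnotation: grafana_folder  # optional: route dashboards into folders
```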
#### Grafana Operator (CRD)
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: application-observability
namespace: monitoring
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"dashboard": {
"title": "Application Observability",
"panels": []
}
}
```
### Data Source Provisioning
#### Loki Data Source
File: /etc/grafana/provisioning/datasources/loki.yaml
```yaml
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki:3100
jsonData:
maxLines: 1000
derivedFields:
- datasourceUid: tempo_uid
matcherRegex: "trace_id=(\\w+)"
name: TraceID
url: "$${__value.raw}"
editable: false
```
#### Tempo Data Source
File: /etc/grafana/provisioning/datasources/tempo.yaml
```yaml
apiVersion: 1
datasources:
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3200
uid: tempo_uid
jsonData:
httpMethod: GET
tracesToLogs:
datasourceUid: loki_uid
tags: ["job", "instance", "pod", "namespace"]
mappedTags: [{ key: "service.name", value: "service" }]
spanStartTimeShift: "1h"
spanEndTimeShift: "1h"
tracesToMetrics:
datasourceUid: prometheus_uid
tags: [{ key: "service.name", value: "service" }]
serviceMap:
datasourceUid: prometheus_uid
nodeGraph:
enabled: true
editable: false
```
#### Mimir/Prometheus Data Source
File: /etc/grafana/provisioning/datasources/mimir.yaml
```yaml
apiVersion: 1
datasources:
- name: Mimir
type: prometheus
access: proxy
url: http://mimir:8080/prometheus
uid: prometheus_uid
jsonData:
httpMethod: POST
exemplarTraceIdDestinations:
- datasourceUid: tempo_uid
name: trace_id
prometheusType: Mimir
prometheusVersion: 2.40.0
cacheLevel: "High"
incrementalQuerying: true
incrementalQueryOverlapWindow: 10m
editable: false
```
## Alerting
### Alert Rule Configuration
Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.
#### Prometheus/Mimir Alert Rule
File: /etc/grafana/provisioning/alerting/rules.yaml
```yaml
apiVersion: 1
groups:
- name: application_alerts
interval: 1m
rules:
- uid: error_rate_high
title: High Error Rate
condition: A
data:
- refId: A
queryType: ""
relativeTimeRange:
from: 300
to: 0
datasourceUid: prometheus_uid
model:
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
intervalMs: 1000
maxDataPoints: 43200
noDataState: NoData
execErrState: Error
for: 5m
annotations:
description: 'Error rate is {{ printf "%.2f" $values.A.Value }}% (threshold: 5%)'
summary: Application error rate is above threshold
runbook_url: https://wiki.company.com/runbooks/high-error-rate
labels:
severity: critical
team: platform
isPaused: false
- uid: high_latency
title: High P95 Latency
condition: A
data:
- refId: A
datasourceUid: prometheus_uid
model:
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
) > 2
for: 10m
annotations:
description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
runbook_url: https://wiki.company.com/runbooks/high-latency
labels:
severity: warning
```
#### Loki Alert Rule
```yaml
apiVersion: 1
groups:
- name: log_based_alerts
interval: 1m
rules:
- uid: error_spike
title: Error Log Spike
condition: A
data:
- refId: A
queryType: ""
datasourceUid: loki_uid
model:
expr: |
sum(rate({app="api"} | json | level="error" [5m]))
> 10
for: 2m
annotations:
description: "Error log rate is {{ $values.A.Value }} logs/sec"
summary: Spike in error logs detected
labels:
severity: warning
- uid: critical_error_pattern
title: Critical Error Pattern Detected
condition: A
data:
- refId: A
datasourceUid: loki_uid
model:
expr: |
sum(count_over_time({app="api"}
|~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
)) > 0
for: 0m
annotations:
description: "Critical error pattern found in logs"
labels:
severity: critical
page: true
```
### Contact Points and Notification Policies
File: /etc/grafana/provisioning/alerting/contactpoints.yaml
```yaml
apiVersion: 1
contactPoints:
- orgId: 1
name: slack-critical
receivers:
- uid: slack_critical
type: slack
settings:
url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
title: "{{ .GroupLabels.alertname }}"
text: |
{{ range .Alerts }}
Alert: {{ .Labels.alertname }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
{{ end }}
disableResolveMessage: false
- orgId: 1
name: pagerduty-oncall
receivers:
- uid: pagerduty_oncall
type: pagerduty
settings:
integrationKey: YOUR_INTEGRATION_KEY
severity: critical
class: infrastructure
- orgId: 1
name: email-team
receivers:
- uid: email_team
type: email
settings:
addresses: team@company.com
singleEmail: true
notificationPolicies:
- orgId: 1
receiver: slack-critical
group_by: ["alertname", "namespace"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- receiver: pagerduty-oncall
matchers:
- severity = critical
- page = true
group_wait: 10s
repeat_interval: 1h
continue: true
- receiver: email-team
matchers:
- severity = warning
- team = platform
group_interval: 10m
repeat_interval: 12h
```
## LogQL Query Patterns
### Basic Log Queries
#### Stream Selection
```logql
# Simple label matching
{namespace="production", app="api"}
# Regex matching
{app=~"api|web|worker"}
# Not equal
{env!="staging"}
# Multiple conditions
{namespace="production", app="api", level!="debug"}
```
#### Line Filters
```logql
# Contains
{app="api"} |= "error"
# Does not contain
{app="api"} != "debug"
# Regex match
{app="api"} |~ "error|exception|fatal"
# Case insensitive
{app="api"} |~ "(?i)error"
# Chaining filters
{app="api"} |= "error" != "timeout"
```
### Parsing and Extraction
#### JSON Parsing
```logql
# Parse JSON logs
{app="api"} | json
# Extract specific fields
{app="api"} | json message="msg", level="severity"
# Filter on extracted field
{app="api"} | json | level="error"
# Nested JSON
{app="api"} | json | line_format "{{.response.status}}"
```
#### Logfmt Parsing
```logql
# Parse logfmt (key=value)
{app="api"} | logfmt
# Extract specific fields
{app="api"} | logfmt level, caller, msg
# Filter parsed fields
{app="api"} | logfmt | level="error"
```
#### Pattern Parsing
```logql
# Extract with pattern
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <size>`
# Filter on extracted values
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <size>` | status >= 400
# Extract a single field from an otherwise unstructured line
{app="api"} | pattern `<_> level=<level> <_>`
```
### Aggregations and Metrics
#### Count Queries
```logql
# Count log lines over time
count_over_time({app="api"}[5m])
# Rate of logs
rate({app="api"}[5m])
# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)
# Error ratio
sum(rate({app="api"} |= "error" [5m]))
/
sum(rate({app="api"}[5m]))
```
#### Extracted Metrics
```logql
# Average duration
avg_over_time({app="api"}
| logfmt
| unwrap duration [5m]) by (endpoint)
# P95 latency
quantile_over_time(0.95, {app="api"}
| regexp `duration=(?P<duration>[0-9.]+)`
| unwrap duration [5m]) by (method)
# Top 10 error messages
topk(10,
sum by (msg) (
count_over_time({app="api"}
| json
| level="error" [1h]
)
)
)
```
## TraceQL Query Patterns
### Basic Trace Queries
```traceql
# Find traces by service
{ .service.name = "api" }
# HTTP status codes
{ .http.status_code = 500 }
# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }
# Duration filter
{ duration > 1s }
```
### Advanced TraceQL
```traceql
# Parent-child relationship
{ .service.name = "frontend" }
>> { .service.name = "backend" && .http.status_code = 500 }
# Descendant spans
{ .service.name = "api" }
>>+ { .db.system = "postgresql" && duration > 1s }
# Failed database queries
{ .service.name = "api" }
>> { .db.system = "postgresql" && status = "error" }
```
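TraceQL also supports pipelined aggregates for selecting whole traces by span statistics. A sketch, assuming a Tempo version with aggregate support:
```traceql
# Traces containing more than 5 PostgreSQL spans
{ .db.system = "postgresql" } | count() > 5

# Traces from the api service whose average span duration exceeds 200ms
{ .service.name = "api" } | avg(duration) > 200ms
```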
## Complete Dashboard Examples
### Application Observability Dashboard
```json
{
"dashboard": {
"title": "Application Observability - ${app}",
"tags": ["observability", "application"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-1h",
"to": "now"
},
"templating": {
"list": [
{
"name": "app",
"type": "query",
"datasource": "Mimir",
"query": "label_values(up, app)",
"current": {
"selected": false,
"text": "api",
"value": "api"
}
},
{
"name": "namespace",
"type": "query",
"datasource": "Mimir",
"query": "label_values(up{app=\"$app\"}, namespace)",
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
"legendFormat": "{{method}} - {{status}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"yaxes": [
{
"format": "reqps",
"label": "Requests/sec"
}
]
},
{
"id": 2,
"title": "P95 Latency",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
"legendFormat": "{{endpoint}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"yaxes": [
{
"format": "s",
"label": "Duration"
}
],
"thresholds": [
{
"value": 1,
"colorMode": "critical",
"fill": true,
"line": true,
"op": "gt"
}
]
},
{
"id": 3,
"title": "Error Rate",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
"legendFormat": "Error %"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"yaxes": [
{
"format": "percentunit",
"max": 1,
"min": 0
}
],
"alert": {
"conditions": [
{
"evaluator": {
"params": [0.01],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"type": "avg"
},
"type": "query"
}
],
"frequency": "1m",
"handler": 1,
"name": "Error Rate Alert",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 4,
"title": "Recent Error Logs",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": false,
"showCommonLabels": false,
"wrapLogMessage": true,
"dedupStrategy": "none",
"enableLogDetails": true
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
}
}
],
"links": [
{
"title": "Explore Logs",
"url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
"type": "link",
"icon": "doc"
},
{
"title": "Explore Traces",
"url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
"type": "link",
"icon": "gf-traces"
}
]
}
}
```
## LGTM Stack Configuration
### Loki Configuration
File: loki.yaml
```yaml
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
log_level: info
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: s3
schema: v13
index:
prefix: index_
period: 24h
storage_config:
aws:
s3: s3://us-east-1/my-loki-bucket
s3forcepathstyle: true
tsdb_shipper:
active_index_directory: /loki/tsdb-index
cache_location: /loki/tsdb-cache
shared_store: s3
limits_config:
retention_period: 744h # 31 days
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
max_query_series: 500
max_query_lookback: 30d
reject_old_samples: true
reject_old_samples_max_age: 168h
compactor:
working_directory: /loki/compactor
shared_store: s3
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
```
### Tempo Configuration
File: tempo.yaml
```yaml
server:
http_listen_port: 3200
grpc_listen_port: 9096
distributor:
receivers:
otlp:
protocols:
http:
grpc:
jaeger:
protocols:
thrift_http:
grpc:
ingester:
max_block_duration: 5m
compactor:
compaction:
block_retention: 720h # 30 days
storage:
trace:
backend: s3
s3:
bucket: tempo-traces
endpoint: s3.amazonaws.com
region: us-east-1
wal:
path: /var/tempo/wal
metrics_generator:
registry:
external_labels:
source: tempo
cluster: primary
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://mimir:9009/api/v1/push
send_exemplars: true
```
## Production Best Practices
### Performance Optimization
#### Query Optimization
- Narrow streams with label matchers (stream selectors) before line filters and parsers
- Limit time ranges for expensive queries
- Use `unwrap` instead of parsing when possible
- Cache query results with the query frontend
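In practice, the ordering advice above means narrowing the stream selector and applying cheap line filters before any parser or label filter. A small before/after example:
```logql
# Slower: parse every line, then filter
{namespace="production"} | json | level="error" |= "timeout"

# Faster: narrow the stream and drop lines before parsing
{namespace="production", app="api"} |= "timeout" | json | level="error"
```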
#### Dashboard Performance
- Limit number of panels (< 15 per dashboard)
- Use appropriate time intervals
- Avoid high-cardinality grouping
- Use `$__interval` for adaptive sampling
#### Storage Optimization
- Configure retention policies
- Use compaction for Loki and Tempo
- Implement tiered storage (hot/warm/cold)
- Monitor storage growth
### Security Best Practices
#### Authentication
- Enable auth (`auth_enabled: true` in Loki/Tempo)
- Use OAuth/LDAP for Grafana
- Implement multi-tenancy with org isolation
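With `auth_enabled: true`, Loki and Mimir expect a tenant ID on every request. One common pattern is to provision one data source per tenant and inject the `X-Scope-OrgID` header; a sketch (the tenant value is illustrative):
```yaml
apiVersion: 1
datasources:
  - name: Loki (tenant-1)
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: tenant-1
```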
#### Authorization
- Configure RBAC in Grafana
- Limit datasource access by team
- Use folder permissions for dashboards
#### Network Security
- TLS for all components
- Network policies in Kubernetes
- Rate limiting at ingress
### Troubleshooting
#### Common Issues
- High Cardinality: Too many unique label combinations
  - Solution: Reduce label dimensions, use log parsing instead
- Query Timeouts: Complex queries on large datasets
  - Solution: Reduce time range, use aggregations, add query limits
- Storage Growth: Unbounded retention
  - Solution: Configure retention policies, enable compaction
- Missing Traces: Incomplete trace data
  - Solution: Check sampling rates, verify instrumentation
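To investigate high cardinality on the metrics side, a couple of PromQL queries show which metrics and labels contribute the most series. These scan every series, so run them over short time ranges:
```promql
# Top 10 metric names by series count
topk(10, count by (__name__) ({__name__=~".+"}))

# Series count for one suspect metric, broken down by a label
count by (namespace) (http_requests_total)
```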
## Resources
- [Loki Documentation](https://grafana.com/docs/loki/latest/)
- [Tempo Documentation](https://grafana.com/docs/tempo/latest/)
- [Grafana Documentation](https://grafana.com/docs/grafana/latest/)
- [LogQL Cheat Sheet](https://grafana.com/docs/loki/latest/logql/)
- [TraceQL Guide](https://grafana.com/docs/tempo/latest/traceql/)
- [Grafana Operator](https://github.com/grafana-operator/grafana-operator)