# Grafana and LGTM Stack Skill
## Overview
The LGTM stack provides a complete observability solution with comprehensive visualization and dashboard capabilities:
- Loki: Log aggregation and querying (LogQL)
- Grafana: Visualization, dashboarding, alerting, and exploration
- Tempo: Distributed tracing (TraceQL)
- Mimir: Long-term metrics storage (Prometheus-compatible)
This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.
## When to Use This Skill
### Primary Use Cases
- Creating or modifying Grafana dashboards
- Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
- Writing queries (PromQL, LogQL, TraceQL)
- Configuring data sources (Prometheus, Loki, Tempo, Mimir)
- Setting up alerting rules and notification policies
- Implementing dashboard variables and templates
- Dashboard provisioning and GitOps workflows
- Troubleshooting observability queries
- Analyzing application performance, errors, or system behavior
### Who Uses This Skill
- senior-software-engineer (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture (use infrastructure skills for deployment)
- software-engineer: Application dashboards, service metrics visualization
## LGTM Stack Components
### Loki - Log Aggregation
#### Architecture - Loki
Horizontally scalable log aggregation inspired by Prometheus
- Indexes only metadata (labels), not log content
- Cost-effective storage with object stores (S3, GCS, etc.)
- LogQL query language similar to PromQL
#### Key Concepts - Loki
- Labels for indexing (low cardinality)
- Log streams identified by unique label sets
- Parsers: logfmt, JSON, regex, pattern
- Line filters and label filters
### Grafana - Visualization
#### Features
- Multi-datasource dashboarding
- Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
- Templating and variables for dynamic dashboards
- Alerting (unified alerting with contact points and notification policies)
- Dashboard provisioning and GitOps integration
- Role-based access control (RBAC)
- Explore mode for ad-hoc queries
- Annotations for event markers
- Dashboard folders and organization
### Tempo - Distributed Tracing
#### Architecture - Tempo
Scalable distributed tracing backend
- Cost-effective trace storage
- TraceQL for trace querying
- Integration with logs and metrics (trace-to-logs, trace-to-metrics)
- OpenTelemetry compatible
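Because Tempo ingests OpenTelemetry data natively, shipping traces to it usually only requires pointing an OpenTelemetry Collector (or SDK exporter) at Tempo's OTLP receiver. A minimal sketch, assuming the Collector and Tempo run in the same cluster and Tempo's OTLP gRPC receiver listens on the default `tempo:4317`:
```yaml
# otel-collector config (sketch): receive OTLP from applications, forward to Tempo
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: tempo:4317      # assumption: in-cluster Tempo service name
    tls:
      insecure: true          # assumption: plaintext inside the cluster; use TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```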
### Mimir - Metrics Storage
#### Architecture - Mimir
Horizontally scalable long-term Prometheus storage
- Multi-tenancy support
- Query federation
- High availability
- Prometheus remote_write compatible
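Since Mimir is `remote_write`-compatible, existing Prometheus servers can forward their samples without any scrape changes. A minimal sketch for `prometheus.yml`, assuming Mimir's distributor/gateway is reachable at `mimir:9009` (the same endpoint the Tempo metrics generator uses later in this document):
```yaml
# prometheus.yml (fragment): forward all scraped samples to Mimir
remote_write:
  - url: http://mimir:9009/api/v1/push   # assumption: adjust to your Mimir gateway address
    headers:
      X-Scope-OrgID: tenant-1            # only required when multi-tenancy is enabled
    queue_config:
      max_samples_per_send: 2000
```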
## Dashboard Design and Best Practices
### Dashboard Organization Principles
- Hierarchy: Overview -> Service -> Component -> Deep Dive
- Golden Signals: Latency, Traffic, Errors, Saturation, complemented by the RED/USE methods (see the PromQL sketch after this list)
- Variable-driven: Use templates for flexibility across environments
- Consistent Layouts: Grid alignment (24-column grid), logical top-to-bottom flow
- Performance: Limit queries, use query caching, appropriate time intervals
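As a starting point, the golden signals map to a handful of PromQL queries. A sketch using the `http_requests_total` and `http_request_duration_seconds_bucket` metrics referenced throughout this skill (the saturation query assumes node_exporter is installed):
```promql
# Traffic: requests per second by application
sum by (app) (rate(http_requests_total[$__rate_interval]))

# Errors: ratio of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[$__rate_interval]))
  / sum(rate(http_requests_total[$__rate_interval]))

# Latency: P95 request duration
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[$__rate_interval])))

# Saturation: CPU busy fraction per node (assumes node_exporter metrics)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval]))
```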
### Panel Types and When to Use Them
| Panel Type | Use Case | Best For |
|------------|----------|----------|
| Time Series / Graph | Trends over time | Request rates, latency, resource usage |
| Stat | Single metric value | Error rates, current values, percentage |
| Gauge | Progress toward limit | CPU usage, memory, disk space |
| Bar Gauge | Comparative values | Top N items, distribution |
| Table | Structured data | Service lists, error details, resource inventory |
| Pie Chart | Proportions | Traffic distribution, error breakdown |
| Heatmap | Distribution over time | Latency percentiles, request patterns |
| Logs | Log streams | Error investigation, debugging |
| Traces | Distributed tracing | Performance analysis, dependency mapping |
### Panel Configuration Best Practices
#### Titles and Descriptions
- Clear, descriptive titles: Include units and metric context
- Tooltips: Add description fields for panel documentation
- Examples:
- Good: "P95 Latency (seconds) by Endpoint"
- Bad: "Latency"
#### Legends and Labels
- Show legends only when needed (multiple series)
- Use `{{label}}` format for dynamic legend names
- Place legends appropriately (bottom, right, or hidden)
- Sort by value when showing Top N
#### Axes and Units
- Always label axes with units
- Use appropriate unit formats (seconds, bytes, percent, requests/sec)
- Set reasonable min/max ranges to avoid misleading scales
- Use logarithmic scales for wide value ranges
#### Thresholds and Colors
- Use thresholds for visual cues (green/yellow/red)
- Standard threshold pattern:
- Green: Normal operation
- Yellow: Warning (action may be needed)
- Red: Critical (immediate attention required)
- Examples:
- Error rate: 0% (green), 1% (yellow), 5% (red)
- P95 latency: <1s (green), 1-3s (yellow), >3s (red)
#### Links and Drilldowns
- Link panels to related dashboards
- Use data links for context (logs, traces, related services)
- Create drill-down paths: Overview -> Service -> Component -> Details
- Link to runbooks for alert panels
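Putting several of these practices together, a single panel definition can carry the descriptive title, unit, thresholds, and a drill-down link. A minimal sketch of a Stat panel in current dashboard JSON (the target dashboard UID and label names are illustrative):
```json
{
  "type": "stat",
  "title": "Error Rate (%) - last 5m",
  "description": "Ratio of 5xx responses; click through to the error-investigation dashboard",
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 0.01 },
          { "color": "red", "value": 0.05 }
        ]
      },
      "links": [
        {
          "title": "Error logs for ${__field.labels.app}",
          "url": "/d/app-errors/errors?var-app=${__field.labels.app}"
        }
      ]
    }
  }
}
```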
### Dashboard Variables and Templating
Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.
#### Variable Types
| Type | Purpose | Example |
|------|---------|---------|
| Query | Populate from data source | Namespaces, services, pods |
| Custom | Static list of options | Environments (prod/staging/dev) |
| Interval | Time interval selection | Auto-adjusted query intervals |
| Datasource | Switch between data sources | Multiple Prometheus instances |
| Constant | Hidden values for queries | Cluster name, region |
| Text box | Free-form input | Custom filters |
#### Common Variable Patterns
```json
{
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"description": "Select Prometheus data source"
},
{
"name": "namespace",
"type": "query",
"datasource": "${datasource}",
"query": "label_values(kube_pod_info, namespace)",
"multi": true,
"includeAll": true,
"description": "Kubernetes namespace filter"
},
{
"name": "app",
"type": "query",
"datasource": "${datasource}",
"query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
"multi": true,
"includeAll": true,
"description": "Application filter (depends on namespace)"
},
{
"name": "interval",
"type": "interval",
"auto": true,
"auto_count": 30,
"auto_min": "10s",
"options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
"description": "Query resolution interval"
},
{
"name": "environment",
"type": "custom",
"options": [
{ "text": "Production", "value": "prod" },
{ "text": "Staging", "value": "staging" },
{ "text": "Development", "value": "dev" }
],
"current": { "text": "Production", "value": "prod" }
}
]
}
}
```
#### Variable Usage in Queries
Variables are referenced with `$variable_name` or `${variable_name}` syntax:
```promql
# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])
# Multi-select with regex match
rate(http_requests_total{namespace=~"$namespace"}[5m])
# Variable in legend
sum by (method) (rate(http_requests_total{app="$app"}[5m]))
# Legend format: "{{method}}"
# Using interval variable for adaptive queries
rate(http_requests_total[$__interval])
# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])
```
#### Advanced Variable Techniques
Regex filtering:
```json
{
"name": "pod",
"type": "query",
"query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
"regex": "/^$app-.*/",
"description": "Filter pods by app prefix"
}
```
All option with custom value:
```json
{
"name": "status",
"type": "custom",
"options": ["200", "404", "500"],
"includeAll": true,
"allValue": ".*",
"description": "HTTP status code filter"
}
```
Dependent variables (variable chain):
`$datasource` (datasource type) -> `$cluster` (query: depends on datasource) -> `$namespace` (query: depends on cluster) -> `$app` (query: depends on namespace) -> `$pod` (query: depends on app)
### Annotations
Annotations display events as vertical markers on time series panels:
```json
{
"annotations": {
"list": [
{
"name": "Deployments",
"datasource": "Prometheus",
"expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
"tagKeys": "deployment,namespace",
"textFormat": "Deployment: {{deployment}}",
"iconColor": "blue"
},
{
"name": "Alerts",
"datasource": "Loki",
"expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
"textFormat": "Alert: {{alertname}}",
"iconColor": "red"
}
]
}
}
```
### Dashboard Performance Optimization
#### Query Optimization
- Limit number of panels (< 15 per dashboard)
- Use appropriate time ranges (avoid queries over months)
- Leverage `$__interval` for adaptive sampling
- Avoid high-cardinality grouping (too many series)
- Use query caching when available
#### Panel Performance
- Set max data points to reasonable values
- Use instant queries for current-state panels
- Combine related metrics into single queries when possible
- Disable auto-refresh on heavy dashboards
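The max-data-points and instant-query recommendations translate into a couple of fields on the panel and target. A sketch (metric and variable names are illustrative; `instant` applies to Prometheus-style data sources):
```json
{
  "type": "stat",
  "title": "Running Pods (current)",
  "maxDataPoints": 100,
  "interval": "1m",
  "targets": [
    {
      "refId": "A",
      "expr": "sum(kube_pod_status_phase{phase=\"Running\", namespace=~\"$namespace\"})",
      "instant": true
    }
  ]
}
```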
## Dashboard as Code and Provisioning
### Dashboard Provisioning
Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.
#### Provisioning Provider Configuration
File: /etc/grafana/provisioning/dashboards/dashboards.yaml
```yaml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /etc/grafana/provisioning/dashboards
- name: 'application'
orgId: 1
folder: 'Applications'
type: file
disableDeletion: true
editable: false
options:
path: /var/lib/grafana/dashboards/application
- name: 'infrastructure'
orgId: 1
folder: 'Infrastructure'
type: file
options:
path: /var/lib/grafana/dashboards/infrastructure
```
#### Dashboard JSON Structure
Complete dashboard JSON with metadata and provisioning:
```json
{
"dashboard": {
"title": "Application Observability - ${app}",
"uid": "app-observability",
"tags": ["observability", "application"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "30s",
"templating": { "list": [] },
"panels": [],
"links": []
},
"overwrite": true,
"folderId": null,
"folderUid": null
}
```
#### Kubernetes ConfigMap Provisioning
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
application-dashboard.json: |
{
"dashboard": {
"title": "Application Metrics",
"uid": "app-metrics",
"tags": ["application"],
"panels": []
}
}
```
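The `grafana_dashboard: "1"` label above is what the Grafana Helm chart's dashboard sidecar watches for. A sketch of the matching chart values, assuming the upstream `grafana` chart (verify the key names against the chart version in use):
```yaml
# values.yaml (grafana Helm chart): let the sidecar load labeled ConfigMaps
sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard
    labelValue: "1"
    searchNamespace: ALL              # watch all namespaces, not just Grafana's
    folderAnnotation: grafana_folder  # optional: route dashboards into folders
```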
#### Grafana Operator (CRD)
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: application-observability
namespace: monitoring
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
json: |
{
"dashboard": {
"title": "Application Observability",
"panels": []
}
}
```
### Data Source Provisioning
#### Loki Data Source
File: /etc/grafana/provisioning/datasources/loki.yaml
```yaml
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki:3100
jsonData:
maxLines: 1000
derivedFields:
- datasourceUid: tempo_uid
matcherRegex: "trace_id=(\\w+)"
name: TraceID
url: "$${__value.raw}"
editable: false
```
#### Tempo Data Source
File: /etc/grafana/provisioning/datasources/tempo.yaml
```yaml
apiVersion: 1
datasources:
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3200
uid: tempo_uid
jsonData:
httpMethod: GET
tracesToLogs:
datasourceUid: loki_uid
tags: ["job", "instance", "pod", "namespace"]
mappedTags: [{ key: "service.name", value: "service" }]
spanStartTimeShift: "1h"
spanEndTimeShift: "1h"
tracesToMetrics:
datasourceUid: prometheus_uid
tags: [{ key: "service.name", value: "service" }]
serviceMap:
datasourceUid: prometheus_uid
nodeGraph:
enabled: true
editable: false
```
#### Mimir/Prometheus Data Source
File: /etc/grafana/provisioning/datasources/mimir.yaml
```yaml
apiVersion: 1
datasources:
- name: Mimir
type: prometheus
access: proxy
url: http://mimir:8080/prometheus
uid: prometheus_uid
jsonData:
httpMethod: POST
exemplarTraceIdDestinations:
- datasourceUid: tempo_uid
name: trace_id
prometheusType: Mimir
prometheusVersion: 2.40.0
cacheLevel: "High"
incrementalQuerying: true
incrementalQueryOverlapWindow: 10m
editable: false
```
## Alerting
### Alert Rule Configuration
Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.
#### Prometheus/Mimir Alert Rule
File: /etc/grafana/provisioning/alerting/rules.yaml
```yaml
apiVersion: 1
groups:
- name: application_alerts
interval: 1m
rules:
- uid: error_rate_high
title: High Error Rate
condition: A
data:
- refId: A
queryType: ""
relativeTimeRange:
from: 300
to: 0
datasourceUid: prometheus_uid
model:
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
intervalMs: 1000
maxDataPoints: 43200
noDataState: NoData
execErrState: Error
for: 5m
annotations:
description: 'Error rate is {{ printf "%.2f" $values.A.Value }}% (threshold: 5%)'
summary: Application error rate is above threshold
runbook_url: https://wiki.company.com/runbooks/high-error-rate
labels:
severity: critical
team: platform
isPaused: false
- uid: high_latency
title: High P95 Latency
condition: A
data:
- refId: A
datasourceUid: prometheus_uid
model:
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
) > 2
for: 10m
annotations:
description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
runbook_url: https://wiki.company.com/runbooks/high-latency
labels:
severity: warning
```
#### Loki Alert Rule
```yaml
apiVersion: 1
groups:
- name: log_based_alerts
interval: 1m
rules:
- uid: error_spike
title: Error Log Spike
condition: A
data:
- refId: A
queryType: ""
datasourceUid: loki_uid
model:
expr: |
sum(rate({app="api"} | json | level="error" [5m]))
> 10
for: 2m
annotations:
description: "Error log rate is {{ $values.A.Value }} logs/sec"
summary: Spike in error logs detected
labels:
severity: warning
- uid: critical_error_pattern
title: Critical Error Pattern Detected
condition: A
data:
- refId: A
datasourceUid: loki_uid
model:
expr: |
sum(count_over_time({app="api"}
|~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
)) > 0
for: 0m
annotations:
description: "Critical error pattern found in logs"
labels:
severity: critical
page: true
```
### Contact Points and Notification Policies
File: /etc/grafana/provisioning/alerting/contactpoints.yaml
```yaml
apiVersion: 1
contactPoints:
- orgId: 1
name: slack-critical
receivers:
- uid: slack_critical
type: slack
settings:
url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
title: "{{ .GroupLabels.alertname }}"
text: |
{{ range .Alerts }}
Alert: {{ .Labels.alertname }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
{{ end }}
disableResolveMessage: false
- orgId: 1
name: pagerduty-oncall
receivers:
- uid: pagerduty_oncall
type: pagerduty
settings:
integrationKey: YOUR_INTEGRATION_KEY
severity: critical
class: infrastructure
- orgId: 1
name: email-team
receivers:
- uid: email_team
type: email
settings:
addresses: team@company.com
singleEmail: true
notificationPolicies:
- orgId: 1
receiver: slack-critical
group_by: ["alertname", "namespace"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- receiver: pagerduty-oncall
matchers:
- severity = critical
- page = true
group_wait: 10s
repeat_interval: 1h
continue: true
- receiver: email-team
matchers:
- severity = warning
- team = platform
group_interval: 10m
repeat_interval: 12h
```
## LogQL Query Patterns
### Basic Log Queries
#### Stream Selection
```logql
# Simple label matching
{namespace="production", app="api"}
# Regex matching
{app=~"api|web|worker"}
# Not equal
{env!="staging"}
# Multiple conditions
{namespace="production", app="api", level!="debug"}
```
#### Line Filters
```logql
# Contains
{app="api"} |= "error"
# Does not contain
{app="api"} != "debug"
# Regex match
{app="api"} |~ "error|exception|fatal"
# Case insensitive
{app="api"} |~ "(?i)error"
# Chaining filters
{app="api"} |= "error" != "timeout"
```
### Parsing and Extraction
#### JSON Parsing
```logql
# Parse JSON logs
{app="api"} | json
# Extract specific fields
{app="api"} | json message="msg", level="severity"
# Filter on extracted field
{app="api"} | json | level="error"
# Nested JSON
{app="api"} | json | line_format "{{.response.status}}"
```
#### Logfmt Parsing
```logql
# Parse logfmt (key=value)
{app="api"} | logfmt
# Extract specific fields
{app="api"} | logfmt level, caller, msg
# Filter parsed fields
{app="api"} | logfmt | level="error"
```
#### Pattern Parsing
```logql
# Extract with pattern
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <size>`
# Filter on extracted values
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <size>` | status >= 400
# Extract a single field from an otherwise unstructured line
{app="api"} | pattern `<_> level=<level> <_>`
```
### Aggregations and Metrics
#### Count Queries
```logql
# Count log lines over time
count_over_time({app="api"}[5m])
# Rate of logs
rate({app="api"}[5m])
# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)
# Error ratio
sum(rate({app="api"} |= "error" [5m]))
/
sum(rate({app="api"}[5m]))
```
#### Extracted Metrics
```logql
# Average duration
avg_over_time({app="api"}
| logfmt
| unwrap duration [5m]) by (endpoint)
# P95 latency
quantile_over_time(0.95, {app="api"}
| regexp `duration=(?P<duration>[0-9.]+)`
| unwrap duration [5m]) by (method)
# Top 10 error messages
topk(10,
sum by (msg) (
count_over_time({app="api"}
| json
| level="error" [1h]
)
)
)
```
## TraceQL Query Patterns
### Basic Trace Queries
```traceql
# Find traces by service
{ .service.name = "api" }
# HTTP status codes
{ .http.status_code = 500 }
# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }
# Duration filter
{ duration > 1s }
```
### Advanced TraceQL
```traceql
# Parent-child relationship
{ .service.name = "frontend" }
>> { .service.name = "backend" && .http.status_code = 500 }
# Descendant spans
{ .service.name = "api" }
>>+ { .db.system = "postgresql" && duration > 1s }
# Failed database queries
{ .service.name = "api" }
>> { .db.system = "postgresql" && status = "error" }
```
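TraceQL also supports pipelined aggregates for selecting whole traces by span statistics. A sketch, assuming a Tempo version with aggregate support:
```traceql
# Traces containing more than 5 PostgreSQL spans
{ .db.system = "postgresql" } | count() > 5

# Traces from the api service whose average span duration exceeds 200ms
{ .service.name = "api" } | avg(duration) > 200ms
```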
## Complete Dashboard Examples
### Application Observability Dashboard
```json
{
"dashboard": {
"title": "Application Observability - ${app}",
"tags": ["observability", "application"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-1h",
"to": "now"
},
"templating": {
"list": [
{
"name": "app",
"type": "query",
"datasource": "Mimir",
"query": "label_values(up, app)",
"current": {
"selected": false,
"text": "api",
"value": "api"
}
},
{
"name": "namespace",
"type": "query",
"datasource": "Mimir",
"query": "label_values(up{app=\"$app\"}, namespace)",
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
"legendFormat": "{{method}} - {{status}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"yaxes": [
{
"format": "reqps",
"label": "Requests/sec"
}
]
},
{
"id": 2,
"title": "P95 Latency",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
"legendFormat": "{{endpoint}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"yaxes": [
{
"format": "s",
"label": "Duration"
}
],
"thresholds": [
{
"value": 1,
"colorMode": "critical",
"fill": true,
"line": true,
"op": "gt"
}
]
},
{
"id": 3,
"title": "Error Rate",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
"legendFormat": "Error %"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"yaxes": [
{
"format": "percentunit",
"max": 1,
"min": 0
}
],
"alert": {
"conditions": [
{
"evaluator": {
"params": [0.01],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"type": "avg"
},
"type": "query"
}
],
"frequency": "1m",
"handler": 1,
"name": "Error Rate Alert",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 4,
"title": "Recent Error Logs",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": false,
"showCommonLabels": false,
"wrapLogMessage": true,
"dedupStrategy": "none",
"enableLogDetails": true
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
}
}
],
"links": [
{
"title": "Explore Logs",
"url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
"type": "link",
"icon": "doc"
},
{
"title": "Explore Traces",
"url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
"type": "link",
"icon": "gf-traces"
}
]
}
}
```
## LGTM Stack Configuration
### Loki Configuration
File: loki.yaml
```yaml
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
log_level: info
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: s3
schema: v13
index:
prefix: index_
period: 24h
storage_config:
aws:
s3: s3://us-east-1/my-loki-bucket
s3forcepathstyle: true
tsdb_shipper:
active_index_directory: /loki/tsdb-index
cache_location: /loki/tsdb-cache
shared_store: s3
limits_config:
retention_period: 744h # 31 days
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
max_query_series: 500
max_query_lookback: 30d
reject_old_samples: true
reject_old_samples_max_age: 168h
compactor:
working_directory: /loki/compactor
shared_store: s3
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
```
### Tempo Configuration
File: tempo.yaml
```yaml
server:
http_listen_port: 3200
grpc_listen_port: 9096
distributor:
receivers:
otlp:
protocols:
http:
grpc:
jaeger:
protocols:
thrift_http:
grpc:
ingester:
max_block_duration: 5m
compactor:
compaction:
block_retention: 720h # 30 days
storage:
trace:
backend: s3
s3:
bucket: tempo-traces
endpoint: s3.amazonaws.com
region: us-east-1
wal:
path: /var/tempo/wal
metrics_generator:
registry:
external_labels:
source: tempo
cluster: primary
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://mimir:9009/api/v1/push
send_exemplars: true
```
## Production Best Practices
### Performance Optimization
#### Query Optimization
- Narrow streams with label matchers (stream selectors) before line filters and parsers
- Limit time ranges for expensive queries
- Use `unwrap` instead of parsing when possible
- Cache query results with the query frontend
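In practice, the ordering advice above means narrowing the stream selector and applying cheap line filters before any parser or label filter. A small before/after example:
```logql
# Slower: parse every line, then filter
{namespace="production"} | json | level="error" |= "timeout"

# Faster: narrow the stream and drop lines before parsing
{namespace="production", app="api"} |= "timeout" | json | level="error"
```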
#### Dashboard Performance
- Limit number of panels (< 15 per dashboard)
- Use appropriate time intervals
- Avoid high-cardinality grouping
- Use `$__interval` for adaptive sampling
#### Storage Optimization
- Configure retention policies
- Use compaction for Loki and Tempo
- Implement tiered storage (hot/warm/cold)
- Monitor storage growth
### Security Best Practices
#### Authentication
- Enable auth (`auth_enabled: true` in Loki/Tempo)
- Use OAuth/LDAP for Grafana
- Implement multi-tenancy with org isolation
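With `auth_enabled: true`, Loki and Mimir expect a tenant ID on every request. One common pattern is to provision one data source per tenant and inject the `X-Scope-OrgID` header; a sketch (the tenant value is illustrative):
```yaml
apiVersion: 1
datasources:
  - name: Loki (tenant-1)
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: tenant-1
```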
#### Authorization
- Configure RBAC in Grafana
- Limit datasource access by team
- Use folder permissions for dashboards
#### Network Security
- TLS for all components
- Network policies in Kubernetes
- Rate limiting at ingress
### Troubleshooting
#### Common Issues
- High Cardinality: Too many unique label combinations
  - Solution: Reduce label dimensions, use log parsing instead
- Query Timeouts: Complex queries on large datasets
  - Solution: Reduce time range, use aggregations, add query limits
- Storage Growth: Unbounded retention
  - Solution: Configure retention policies, enable compaction
- Missing Traces: Incomplete trace data
  - Solution: Check sampling rates, verify instrumentation
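To investigate high cardinality on the metrics side, a couple of PromQL queries show which metrics and labels contribute the most series. These scan every series, so run them over short time ranges:
```promql
# Top 10 metric names by series count
topk(10, count by (__name__) ({__name__=~".+"}))

# Series count for one suspect metric, broken down by a label
count by (namespace) (http_requests_total)
```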
## Resources
- [Loki Documentation](https://grafana.com/docs/loki/latest/)
- [Tempo Documentation](https://grafana.com/docs/tempo/latest/)
- [Grafana Documentation](https://grafana.com/docs/grafana/latest/)
- [LogQL Cheat Sheet](https://grafana.com/docs/loki/latest/logql/)
- [TraceQL Guide](https://grafana.com/docs/tempo/latest/traceql/)
- [Grafana Operator](https://github.com/grafana-operator/grafana-operator)