grafana (skill from cosmix/claude-loom)

What it does

Designs and configures Grafana dashboards, panels, and visualizations for observability using LGTM stack technologies such as Loki, Tempo, and Mimir.


Installation

Run the install script:

curl -fsSL https://raw.githubusercontent.com/cosmix/loom/main/install.sh | bash

Or clone the repository and run the install script locally:

git clone https://github.com/cosmix/loom.git
bash install.sh

# Grafana and LGTM Stack Skill

Overview

The LGTM stack provides a complete observability solution with comprehensive visualization and dashboard capabilities:

  • Loki: Log aggregation and querying (LogQL)
  • Grafana: Visualization, dashboarding, alerting, and exploration
  • Tempo: Distributed tracing (TraceQL)
  • Mimir: Long-term metrics storage (Prometheus-compatible)

This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.

When to Use This Skill

Primary Use Cases

  • Creating or modifying Grafana dashboards
  • Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
  • Writing queries (PromQL, LogQL, TraceQL)
  • Configuring data sources (Prometheus, Loki, Tempo, Mimir)
  • Setting up alerting rules and notification policies
  • Implementing dashboard variables and templates
  • Dashboard provisioning and GitOps workflows
  • Troubleshooting observability queries
  • Analyzing application performance, errors, or system behavior

Who Uses This Skill

  • senior-software-engineer (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture (use infrastructure skills for deployment)
  • software-engineer: Application dashboards, service metrics visualization

LGTM Stack Components

Loki - Log Aggregation

#### Architecture - Loki

Horizontally scalable log aggregation inspired by Prometheus

  • Indexes only metadata (labels), not log content
  • Cost-effective storage with object stores (S3, GCS, etc.)
  • LogQL query language similar to PromQL

#### Key Concepts - Loki

  • Labels for indexing (low cardinality)
  • Log streams identified by unique label sets
  • Parsers: logfmt, JSON, regex, pattern
  • Line filters and label filters

Grafana - Visualization

#### Features

  • Multi-datasource dashboarding
  • Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
  • Templating and variables for dynamic dashboards
  • Alerting (unified alerting with contact points and notification policies)
  • Dashboard provisioning and GitOps integration
  • Role-based access control (RBAC)
  • Explore mode for ad-hoc queries
  • Annotations for event markers
  • Dashboard folders and organization

Tempo - Distributed Tracing

#### Architecture - Tempo

Scalable distributed tracing backend

  • Cost-effective trace storage
  • TraceQL for trace querying
  • Integration with logs and metrics (trace-to-logs, trace-to-metrics)
  • OpenTelemetry compatible

Mimir - Metrics Storage

#### Architecture - Mimir

Horizontally scalable long-term Prometheus storage

  • Multi-tenancy support
  • Query federation
  • High availability
  • Prometheus remote_write compatible
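
The four components above can be run together for local evaluation. The sketch below is a minimal Docker Compose layout, assuming the upstream grafana/loki, grafana/tempo, grafana/mimir, and grafana/grafana images, their default ports, and the loki.yaml/tempo.yaml configuration files shown later in this skill; it is not a production topology.

```yaml
# Minimal local LGTM stack for evaluation only (single instance of each component)
services:
  loki:
    image: grafana/loki:latest            # pin a specific version in real deployments
    command: ["-config.file=/etc/loki/loki.yaml"]
    volumes:
      - ./loki.yaml:/etc/loki/loki.yaml:ro
    ports:
      - "3100:3100"                       # Loki push + LogQL API

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml:ro
    ports:
      - "3200:3200"                       # Tempo query API
      - "4317:4317"                       # OTLP gRPC ingest

  mimir:
    image: grafana/mimir:latest
    command: ["-config.file=/etc/mimir/mimir.yaml"]
    volumes:
      - ./mimir.yaml:/etc/mimir/mimir.yaml:ro
    ports:
      - "8080:8080"                       # Prometheus-compatible remote_write + query API

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./provisioning:/etc/grafana/provisioning:ro   # datasources + dashboards from this skill
    ports:
      - "3000:3000"
    depends_on: [loki, tempo, mimir]
```

Grafana then picks up the data sources and dashboards from the mounted provisioning directory described in the provisioning sections below.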

Dashboard Design and Best Practices

Dashboard Organization Principles

  1. Hierarchy: Overview -> Service -> Component -> Deep Dive
  2. Golden Signals: Latency, Traffic, Errors, Saturation (RED/USE method)
  3. Variable-driven: Use templates for flexibility across environments
  4. Consistent Layouts: Grid alignment (24-column grid), logical top-to-bottom flow
  5. Performance: Limit queries, use query caching, appropriate time intervals
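
The golden signals above can be precomputed as recording rules so every dashboard panel queries the same definitions. A sketch for Prometheus or the Mimir ruler follows; the http_requests_total and http_request_duration_seconds_bucket metric names match the examples later in this skill but are assumptions about your instrumentation.

```yaml
# Sketch: golden-signal recording rules (RED: rate, errors, duration), loadable by Prometheus or Mimir's ruler.
groups:
  - name: golden_signals
    interval: 1m
    rules:
      # Traffic: requests per second per service
      - record: service:http_requests:rate5m
        expr: sum by (app) (rate(http_requests_total[5m]))
      # Errors: ratio of 5xx responses
      - record: service:http_errors:ratio_rate5m
        expr: |
          sum by (app) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (app) (rate(http_requests_total[5m]))
      # Latency: P95 per service
      - record: service:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (app, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```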

Panel Types and When to Use Them

| Panel Type | Use Case | Best For |
|------------|----------|----------|
| Time Series / Graph | Trends over time | Request rates, latency, resource usage |
| Stat | Single metric value | Error rates, current values, percentages |
| Gauge | Progress toward limit | CPU usage, memory, disk space |
| Bar Gauge | Comparative values | Top N items, distribution |
| Table | Structured data | Service lists, error details, resource inventory |
| Pie Chart | Proportions | Traffic distribution, error breakdown |
| Heatmap | Distribution over time | Latency percentiles, request patterns |
| Logs | Log streams | Error investigation, debugging |
| Traces | Distributed tracing | Performance analysis, dependency mapping |

Panel Configuration Best Practices

#### Titles and Descriptions

  • Clear, descriptive titles: Include units and metric context
  • Tooltips: Add description fields for panel documentation
  • Examples:
    - Good: "P95 Latency (seconds) by Endpoint"
    - Bad: "Latency"

#### Legends and Labels

  • Show legends only when needed (multiple series)
  • Use {{label}} format for dynamic legend names
  • Place legends appropriately (bottom, right, or hidden)
  • Sort by value when showing Top N

#### Axes and Units

  • Always label axes with units
  • Use appropriate unit formats (seconds, bytes, percent, requests/sec)
  • Set reasonable min/max ranges to avoid misleading scales
  • Use logarithmic scales for wide value ranges

#### Thresholds and Colors

  • Use thresholds for visual cues (green/yellow/red)
  • Standard threshold pattern:
    - Green: Normal operation
    - Yellow: Warning (action may be needed)
    - Red: Critical (immediate attention required)
  • Examples:
    - Error rate: 0% (green), 1% (yellow), 5% (red)
    - P95 latency: <1s (green), 1-3s (yellow), >3s (red)

#### Links and Drilldowns

  • Link panels to related dashboards
  • Use data links for context (logs, traces, related services)
  • Create drill-down paths: Overview -> Service -> Component -> Details
  • Link to runbooks for alert panels

Dashboard Variables and Templating

Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.

#### Variable Types

| Type | Purpose | Example |
|------|---------|---------|
| Query | Populate from data source | Namespaces, services, pods |
| Custom | Static list of options | Environments (prod/staging/dev) |
| Interval | Time interval selection | Auto-adjusted query intervals |
| Datasource | Switch between data sources | Multiple Prometheus instances |
| Constant | Hidden values for queries | Cluster name, region |
| Text box | Free-form input | Custom filters |

#### Common Variable Patterns

```json
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "description": "Select Prometheus data source"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "description": "Kubernetes namespace filter"
      },
      {
        "name": "app",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
        "multi": true,
        "includeAll": true,
        "description": "Application filter (depends on namespace)"
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
        "description": "Query resolution interval"
      },
      {
        "name": "environment",
        "type": "custom",
        "options": [
          { "text": "Production", "value": "prod" },
          { "text": "Staging", "value": "staging" },
          { "text": "Development", "value": "dev" }
        ],
        "current": { "text": "Production", "value": "prod" }
      }
    ]
  }
}
```

#### Variable Usage in Queries

Variables are referenced with $variable_name or ${variable_name} syntax:

```promql
# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])

# Multi-select with regex match
rate(http_requests_total{namespace=~"$namespace"}[5m])

# Variable in query, grouped so the legend can use {{method}}
sum by (method) (rate(http_requests_total{app="$app"}[5m]))
# Legend format: "{{method}}"

# Using the interval variable for adaptive queries
rate(http_requests_total[$__interval])

# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])
```

#### Advanced Variable Techniques

Regex filtering:

```json
{
  "name": "pod",
  "type": "query",
  "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
  "regex": "/^$app-.*/",
  "description": "Filter pods by app prefix"
}
```

All option with custom value:

```json
{
  "name": "status",
  "type": "custom",
  "options": ["200", "404", "500"],
  "includeAll": true,
  "allValue": ".*",
  "description": "HTTP status code filter"
}
```

Dependent variables (variable chain):

  1. $datasource (datasource type)
  2. $cluster (query: depends on datasource)
  3. $namespace (query: depends on cluster)
  4. $app (query: depends on namespace)
  5. $pod (query: depends on app)

Annotations

Annotations display events as vertical markers on time series panels:

```json
{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
        "tagKeys": "deployment,namespace",
        "textFormat": "Deployment: {{deployment}}",
        "iconColor": "blue"
      },
      {
        "name": "Alerts",
        "datasource": "Loki",
        "expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
        "textFormat": "Alert: {{alertname}}",
        "iconColor": "red"
      }
    ]
  }
}
```

Dashboard Performance Optimization

#### Query Optimization

  • Limit number of panels (< 15 per dashboard)
  • Use appropriate time ranges (avoid queries over months)
  • Leverage $__interval for adaptive sampling
  • Avoid high-cardinality grouping (too many series)
  • Use query caching when available

#### Panel Performance

  • Set max data points to reasonable values
  • Use instant queries for current-state panels
  • Combine related metrics into single queries when possible
  • Disable auto-refresh on heavy dashboards

Dashboard as Code and Provisioning

Dashboard Provisioning

Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.

#### Provisioning Provider Configuration

File: /etc/grafana/provisioning/dashboards/dashboards.yaml

```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
  - name: 'application'
    orgId: 1
    folder: 'Applications'
    type: file
    disableDeletion: true
    editable: false
    options:
      path: /var/lib/grafana/dashboards/application
  - name: 'infrastructure'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    options:
      path: /var/lib/grafana/dashboards/infrastructure
```

#### Dashboard JSON Structure

Complete dashboard JSON with metadata and provisioning:

```json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "uid": "app-observability",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s",
    "templating": { "list": [] },
    "panels": [],
    "links": []
  },
  "overwrite": true,
  "folderId": null,
  "folderUid": null
}
```

#### Kubernetes ConfigMap Provisioning

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Metrics",
        "uid": "app-metrics",
        "tags": ["application"],
        "panels": []
      }
    }
```

#### Grafana Operator (CRD)

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: application-observability
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "dashboard": {
        "title": "Application Observability",
        "panels": []
      }
    }
```

Data Source Provisioning

#### Loki Data Source

File: /etc/grafana/provisioning/datasources/loki.yaml

```yaml
apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - datasourceUid: tempo_uid
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
    editable: false
```

#### Tempo Data Source

File: /etc/grafana/provisioning/datasources/tempo.yaml

```yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo_uid
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: loki_uid
        tags: ["job", "instance", "pod", "namespace"]
        mappedTags: [{ key: "service.name", value: "service" }]
        spanStartTimeShift: "1h"
        spanEndTimeShift: "1h"
      tracesToMetrics:
        datasourceUid: prometheus_uid
        tags: [{ key: "service.name", value: "service" }]
      serviceMap:
        datasourceUid: prometheus_uid
      nodeGraph:
        enabled: true
    editable: false
```

#### Mimir/Prometheus Data Source

File: /etc/grafana/provisioning/datasources/mimir.yaml

```yaml
apiVersion: 1

datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:8080/prometheus
    uid: prometheus_uid
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - datasourceUid: tempo_uid
          name: trace_id
      prometheusType: Mimir
      prometheusVersion: 2.40.0
      cacheLevel: "High"
      incrementalQuerying: true
      incrementalQueryOverlapWindow: 10m
    editable: false
```

Alerting

Alert Rule Configuration

Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.

#### Prometheus/Mimir Alert Rule

File: /etc/grafana/provisioning/alerting/rules.yaml

```yaml
apiVersion: 1

groups:
  - name: application_alerts
    interval: 1m
    rules:
      - uid: error_rate_high
        title: High Error Rate
        condition: A
        data:
          - refId: A
            queryType: ""
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus_uid
            model:
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
                > 0.05
              intervalMs: 1000
              maxDataPoints: 43200
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          description: 'Error rate is {{ printf "%.2f" $values.A.Value }}% (threshold: 5%)'
          summary: Application error rate is above threshold
          runbook_url: https://wiki.company.com/runbooks/high-error-rate
        labels:
          severity: critical
          team: platform
        isPaused: false

      - uid: high_latency
        title: High P95 Latency
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus_uid
            model:
              expr: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
                ) > 2
        for: 10m
        annotations:
          description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
          runbook_url: https://wiki.company.com/runbooks/high-latency
        labels:
          severity: warning
```

#### Loki Alert Rule

```yaml
apiVersion: 1

groups:
  - name: log_based_alerts
    interval: 1m
    rules:
      - uid: error_spike
        title: Error Log Spike
        condition: A
        data:
          - refId: A
            queryType: ""
            datasourceUid: loki_uid
            model:
              expr: |
                sum(rate({app="api"} | json | level="error" [5m]))
                > 10
        for: 2m
        annotations:
          description: "Error log rate is {{ $values.A.Value }} logs/sec"
          summary: Spike in error logs detected
        labels:
          severity: warning

      - uid: critical_error_pattern
        title: Critical Error Pattern Detected
        condition: A
        data:
          - refId: A
            datasourceUid: loki_uid
            model:
              expr: |
                sum(count_over_time({app="api"}
                  |~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
                )) > 0
        for: 0m
        annotations:
          description: "Critical error pattern found in logs"
        labels:
          severity: critical
          page: true
```

Contact Points and Notification Policies

File: /etc/grafana/provisioning/alerting/contactpoints.yaml

```yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack_critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          title: "{{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            Alert: {{ .Labels.alertname }}
            Summary: {{ .Annotations.summary }}
            Description: {{ .Annotations.description }}
            Severity: {{ .Labels.severity }}
            {{ end }}
        disableResolveMessage: false
  - orgId: 1
    name: pagerduty-oncall
    receivers:
      - uid: pagerduty_oncall
        type: pagerduty
        settings:
          integrationKey: YOUR_INTEGRATION_KEY
          severity: critical
          class: infrastructure
  - orgId: 1
    name: email-team
    receivers:
      - uid: email_team
        type: email
        settings:
          addresses: team@company.com
          singleEmail: true

notificationPolicies:
  - orgId: 1
    receiver: slack-critical
    group_by: ["alertname", "namespace"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
          - page = true
        group_wait: 10s
        repeat_interval: 1h
        continue: true
      - receiver: email-team
        matchers:
          - severity = warning
          - team = platform
        group_interval: 10m
        repeat_interval: 12h
```
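
Message bodies can also be kept out of individual contact points by provisioning shared notification templates. Below is a sketch, assuming Grafana's file-provisioned templates section; the template name and content are illustrative only.

```yaml
apiVersion: 1

templates:
  - orgId: 1
    name: slack.title
    template: |
      {{ define "slack.title" }}
      [{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
      {{ end }}
```

A contact point can then reference it, for example by setting its title to `{{ template "slack.title" . }}`.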

LogQL Query Patterns

Basic Log Queries

#### Stream Selection

```logql
# Simple label matching
{namespace="production", app="api"}

# Regex matching
{app=~"api|web|worker"}

# Not equal
{env!="staging"}

# Multiple conditions
{namespace="production", app="api", level!="debug"}
```

#### Line Filters

```logql
# Contains
{app="api"} |= "error"

# Does not contain
{app="api"} != "debug"

# Regex match
{app="api"} |~ "error|exception|fatal"

# Case insensitive
{app="api"} |~ "(?i)error"

# Chaining filters
{app="api"} |= "error" != "timeout"
```

Parsing and Extraction

#### JSON Parsing

```logql
# Parse JSON logs
{app="api"} | json

# Extract specific fields
{app="api"} | json message="msg", level="severity"

# Filter on an extracted field
{app="api"} | json | level="error"

# Nested JSON (nested keys are flattened with underscores)
{app="api"} | json | line_format "{{.response_status}}"
```

#### Logfmt Parsing

```logql
# Parse logfmt (key=value)
{app="api"} | logfmt

# Extract specific fields
{app="api"} | logfmt level, caller, msg

# Filter parsed fields
{app="api"} | logfmt | level="error"
```

#### Pattern Parsing

```logql
# Extract fields with a pattern (common log format)
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`

# Filter on extracted values
{app="nginx"} | pattern `<_> - - <_> "<method> <uri> <_>" <status> <_>` | status >= 400

# Key-value style pattern
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`
```

Aggregations and Metrics

#### Count Queries

```logql
# Count log lines over time
count_over_time({app="api"}[5m])

# Rate of logs
rate({app="api"}[5m])

# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)

# Error ratio
sum(rate({app="api"} |= "error" [5m]))
/
sum(rate({app="api"}[5m]))
```

#### Extracted Metrics

```logql
# Average duration
avg_over_time({app="api"}
  | logfmt
  | unwrap duration [5m]) by (endpoint)

# P95 latency
quantile_over_time(0.95, {app="api"}
  | regexp `duration=(?P<duration>[0-9.]+)ms`
  | unwrap duration [5m]) by (method)

# Top 10 error messages
topk(10,
  sum by (msg) (
    count_over_time({app="api"}
      | json
      | level="error" [1h]
    )
  )
)
```

TraceQL Query Patterns

Basic Trace Queries

```traceql
# Find traces by service
{ .service.name = "api" }

# HTTP status codes
{ .http.status_code = 500 }

# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }

# Duration filter
{ duration > 1s }
```

Advanced TraceQL

```traceql
# Parent-child relationship (direct child spans)
{ .service.name = "frontend" } > { .service.name = "backend" && .http.status_code = 500 }

# Descendant spans at any depth
{ .service.name = "api" } >> { .db.system = "postgresql" && duration > 1s }

# Failed database queries anywhere below the API service
{ .service.name = "api" } >> { .db.system = "postgresql" && status = error }
```

Complete Dashboard Examples

Application Observability Dashboard

```json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "templating": {
      "list": [
        {
          "name": "app",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up, app)",
          "current": {
            "selected": false,
            "text": "api",
            "value": "api"
          }
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up{app=\"$app\"}, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
            "legendFormat": "{{method}} - {{status}}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "yaxes": [
          { "format": "reqps", "label": "Requests/sec" }
        ]
      },
      {
        "id": 2,
        "title": "P95 Latency",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
        "yaxes": [
          { "format": "s", "label": "Duration" }
        ],
        "thresholds": [
          {
            "value": 1,
            "colorMode": "critical",
            "fill": true,
            "line": true,
            "op": "gt"
          }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
        "yaxes": [
          { "format": "percentunit", "max": 1, "min": 0 }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [0.01], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "avg" },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 4,
        "title": "Recent Error Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
            "refId": "A"
          }
        ],
        "options": {
          "showTime": true,
          "showLabels": false,
          "showCommonLabels": false,
          "wrapLogMessage": true,
          "dedupStrategy": "none",
          "enableLogDetails": true
        },
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
      }
    ],
    "links": [
      {
        "title": "Explore Logs",
        "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
        "type": "link",
        "icon": "doc"
      },
      {
        "title": "Explore Traces",
        "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
        "type": "link",
        "icon": "gf-traces"
      }
    ]
  }
}
```

LGTM Stack Configuration

Loki Configuration

File: loki.yaml

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3

limits_config:
  retention_period: 744h # 31 days
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_series: 500
  max_query_lookback: 30d
  reject_old_samples: true
  reject_old_samples_max_age: 168h

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
```

Tempo Configuration

File: tempo.yaml

```yaml
server:
  http_listen_port: 3200
  grpc_listen_port: 9096

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
    jaeger:
      protocols:
        thrift_http:
        grpc:

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h # 30 days

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: primary
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:9009/api/v1/push
        send_exemplars: true
```
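
Loki and Tempo are configured above; for completeness, a minimal single-process Mimir configuration might look like the sketch below. The bucket names are placeholders, the listen port matches the http://mimir:8080/prometheus data source URL used earlier, and the exact keys should be verified against your Mimir version.

```yaml
# mimir.yaml - minimal monolithic-mode sketch, not a production layout
multitenancy_enabled: false

server:
  http_listen_port: 8080

common:
  storage:
    backend: s3
    s3:
      endpoint: s3.amazonaws.com
      region: us-east-1

blocks_storage:
  s3:
    bucket_name: mimir-blocks        # placeholder bucket
  tsdb:
    dir: /data/tsdb

ruler_storage:
  s3:
    bucket_name: mimir-ruler         # placeholder bucket

compactor:
  data_dir: /data/compactor

limits:
  max_global_series_per_user: 1000000
  compactor_blocks_retention_period: 1y
```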

Production Best Practices

Performance Optimization

#### Query Optimization

  • Use label filters before line filters
  • Limit time ranges for expensive queries
  • Use unwrap instead of parsing when possible
  • Cache query results with query frontend

#### Dashboard Performance

  • Limit number of panels (< 15 per dashboard)
  • Use appropriate time intervals
  • Avoid high-cardinality grouping
  • Use $__interval for adaptive sampling

#### Storage Optimization

  • Configure retention policies
  • Use compaction for Loki and Tempo
  • Implement tiered storage (hot/warm/cold)
  • Monitor storage growth

Security Best Practices

#### Authentication

  • Enable auth (auth_enabled: true in Loki/Tempo)
  • Use OAuth/LDAP for Grafana
  • Implement multi-tenancy with org isolation
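
For generic OAuth, Grafana reads the [auth.generic_oauth] settings, which can be injected as GF_* environment variables. Below is a hedged sketch of a Kubernetes container `env` fragment; the IdP URLs, client ID, and Secret name are placeholders.

```yaml
# Fragment of the Grafana container spec: GF_* variables map to grafana.ini sections.
env:
  - name: GF_AUTH_GENERIC_OAUTH_ENABLED
    value: "true"
  - name: GF_AUTH_GENERIC_OAUTH_NAME
    value: "Company SSO"
  - name: GF_AUTH_GENERIC_OAUTH_CLIENT_ID
    value: "grafana"                                   # placeholder client ID
  - name: GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET
    valueFrom:
      secretKeyRef:
        name: grafana-oauth                            # placeholder Secret
        key: client-secret
  - name: GF_AUTH_GENERIC_OAUTH_SCOPES
    value: "openid profile email"
  - name: GF_AUTH_GENERIC_OAUTH_AUTH_URL
    value: "https://sso.example.com/oauth/authorize"   # placeholder IdP endpoints
  - name: GF_AUTH_GENERIC_OAUTH_TOKEN_URL
    value: "https://sso.example.com/oauth/token"
  - name: GF_AUTH_GENERIC_OAUTH_API_URL
    value: "https://sso.example.com/oauth/userinfo"
```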

#### Authorization

  • Configure RBAC in Grafana
  • Limit datasource access by team
  • Use folder permissions for dashboards

#### Network Security

  • TLS for all components
  • Network policies in Kubernetes
  • Rate limiting at ingress
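
As an illustration of the Kubernetes side (namespace and pod labels are assumptions about your deployment), a NetworkPolicy can limit which workloads may reach Loki's HTTP port:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-ingress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: loki                 # assumes Loki pods carry this label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: grafana      # dashboard and Explore queries
        - podSelector:
            matchLabels:
              app: promtail     # log shipping
      ports:
        - protocol: TCP
          port: 3100
```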

Troubleshooting

#### Common Issues

  1. High Cardinality: Too many unique label combinations
     - Solution: Reduce label dimensions, use log parsing instead (see the Loki limits sketch below)
  2. Query Timeouts: Complex queries on large datasets
     - Solution: Reduce time range, use aggregations, add query limits (see the Loki limits sketch below)
  3. Storage Growth: Unbounded retention
     - Solution: Configure retention policies, enable compaction
  4. Missing Traces: Incomplete trace data
     - Solution: Check sampling rates, verify instrumentation
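
The first two issues can also be contained with server-side guardrails. Below is a sketch of Loki limits_config settings; the keys exist in recent Loki releases, and the values are starting points rather than recommendations.

```yaml
# loki.yaml fragment: guardrails against cardinality explosions and runaway queries
limits_config:
  max_label_names_per_series: 15       # reject streams with too many labels
  max_label_value_length: 2048
  max_streams_per_user: 10000          # cap active streams per tenant
  max_query_series: 500                # fail queries that would return too many series
  max_query_parallelism: 32
  query_timeout: 2m
  split_queries_by_interval: 30m       # let the query frontend shard long time ranges
```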

Resources

  • [Loki Documentation](https://grafana.com/docs/loki/latest/)
  • [Tempo Documentation](https://grafana.com/docs/tempo/latest/)
  • [Grafana Documentation](https://grafana.com/docs/grafana/latest/)
  • [LogQL Cheat Sheet](https://grafana.com/docs/loki/latest/logql/)
  • [TraceQL Guide](https://grafana.com/docs/tempo/latest/traceql/)
  • [Grafana Operator](https://github.com/grafana-operator/grafana-operator)
