railway-observability

from adaptationio/skrillz

Last Updated: Jan 16, 2026

Railway.com built-in metrics, monitoring dashboards, alerting (Pro plan), and external OTEL integration with Grafana. Use when setting up monitoring, creating dashboards, configuring alerts, integrating Prometheus/Loki/Tempo, deploying Grafana stack, or analyzing Railway service metrics.

# Railway Observability

Comprehensive guide for Railway.com observability including built-in metrics, customizable dashboards, alerting (Pro plan), and external OTEL integration with Grafana/Prometheus/Loki/Tempo.

Overview

Railway provides multi-tier observability capabilities:

Built-in metrics (CPU, memory, disk, network)
Customizable dashboards with drag-and-drop widgets
Alerting with email and webhook integrations (Pro plan)
External OTEL integration via Grafana Alloy
Full observability stack deployment (Grafana template)

Keywords: metrics, monitoring, observability, dashboard, alerts, Grafana, Prometheus, Loki, Tempo, OTEL, Alloy, Railway

When to Use This Skill

Setting up monitoring for Railway services
Creating custom observability dashboards
Configuring alerts for resource thresholds
Integrating external monitoring systems
Deploying Grafana observability stack
Analyzing service performance metrics
Troubleshooting resource issues
Setting up distributed tracing

Quick Start

1. Access Built-in Observability Dashboard

Navigate to your Railway project:

```

Railway Dashboard → Project → Service → Metrics Tab

```

What you see:

CPU usage (% of allocated resources)
Memory usage (MB/GB over time)
Disk usage (storage consumption)
Network I/O (ingress/egress)
30-day metric retention

2. Configure Custom Dashboard Widgets

Add and customize metric widgets:

```

Metrics Tab → Add Widget → Select Metric Type

```

Available widgets:

Line charts (time-series metrics)
Bar charts (deployment comparisons)
Gauges (current values)
Tables (multi-metric views)

Customization:

Drag-and-drop layout
Adjustable time ranges (1h, 6h, 24h, 7d, 30d)
Multi-replica aggregation
Deployment markers on charts

3. Set Up Alerting (Pro Plan Required)

Configure alerts for threshold violations:

```

Service Settings → Alerts → Create Alert Rule

```

Alert types:

CPU threshold (e.g., > 80% for 5 minutes)
Memory threshold (e.g., > 90% for 10 minutes)
Disk usage (e.g., > 85%)
Custom metric thresholds

Notification channels:

Email (default)
Discord webhook
Slack webhook
Custom webhook (JSON payload)

4. Integrate External OTEL Collectors

Export metrics to external systems:

```

Service Settings → Observability → OTEL Integration

```

Configure environment variables:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector:4318
# <token> is a placeholder for your collector's credential
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
OTEL_SERVICE_NAME=my-railway-service
```

See references/otel-integration.md for complete setup.
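
To make the header format concrete, the sketch below parses an `OTEL_EXPORTER_OTLP_HEADERS` value (a comma-separated list of `key=value` pairs) into a dictionary. The function name `parse_otlp_headers` and the sample token are hypothetical, and OpenTelemetry SDKs do this parsing themselves, so treat it as a debugging aid rather than something a service needs to ship.

```python
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Parse a comma-separated list of key=value pairs into a dict."""
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed header pair: {pair!r}")
        # Values may be percent-encoded (for example %20 for a space).
        headers[key.strip()] = unquote(value.strip())
    return headers


if __name__ == "__main__":
    # "example-token" is a placeholder, not a real credential.
    print(parse_otlp_headers("Authorization=Bearer%20example-token,X-Scope=dev"))
```

Only the first `=` in each pair splits key from value, so values that themselves contain `=` survive intact.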

5. Deploy Full Grafana Stack

Use Railway template for complete observability:

```bash

# Option 1: Deploy via Railway Dashboard

# Template ID: 8TLSQD (Grafana Stack)

# Includes: Grafana, Prometheus, Loki, Tempo, Alloy

# Option 2: Deploy via script

.claude/skills/railway-observability/scripts/deploy-grafana-stack.sh

```

Stack components:

Grafana (visualization)
Prometheus (metrics)
Loki (logs)
Tempo (traces)
Alloy (collector)

Workflow

Step 1: Access Built-in Metrics

Railway provides instant metrics without configuration.

Navigate to metrics:

```

Project → Service → Metrics

```

Available metrics:

CPU: % usage, allocated cores
Memory: MB used, % of limit
Disk: GB used, read/write IOPS
Network: MB/s ingress/egress

Retention: 30 days for all metrics

Step 2: Customize Dashboard

Create personalized monitoring views.

Add widgets:

Click "Add Widget"
Select metric type
Configure aggregation
Set time range
Position via drag-and-drop

Best practices:

Group related metrics (CPU + Memory)
Use deployment markers for correlation
Configure multi-replica views
Set appropriate time ranges

Step 3: Configure Alerts (Pro Plan)

Set up proactive monitoring.

Create alert rule:

```

Service → Settings → Alerts → New Rule

```

Alert configuration:

```yaml

Metric: CPU Usage

Condition: Greater than 80%

Duration: 5 minutes

Notification: Slack webhook

```

Webhook payload example:

```json
{
  "service": "backend-production",
  "metric": "cpu_usage",
  "threshold": 80,
  "current": 87.5,
  "timestamp": "2025-11-26T10:30:00Z"
}
```

See references/dashboard-widgets.md for all alert types.
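
A receiver for the custom webhook could decode the payload along the following lines. This is a minimal sketch that assumes the body carries exactly the fields shown in the example payload; the `Alert` dataclass and `parse_alert` function are hypothetical names, and the real schema should be confirmed against what Railway actually sends.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alert:
    service: str
    metric: str
    threshold: float
    current: float
    timestamp: datetime

    @property
    def exceeded_by(self) -> float:
        """How far the current value is above the threshold."""
        return self.current - self.threshold


def parse_alert(body: str) -> Alert:
    """Decode a JSON alert body, raising KeyError if a field is missing."""
    data = json.loads(body)
    return Alert(
        service=data["service"],
        metric=data["metric"],
        threshold=float(data["threshold"]),
        current=float(data["current"]),
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
        timestamp=datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00")),
    )
```

Failing loudly on a missing field keeps a schema change from silently producing half-populated alerts.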

Step 4: Integrate OTEL Exporters

Send metrics to external systems.

Configure Alloy collector:

```bash

# Use template from templates/alloy-config.river

# Deploy as Railway service

# Configure OTEL endpoints

```

Environment setup:

```bash

# In your Railway service

OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4318

OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

OTEL_METRICS_EXPORTER=otlp

OTEL_LOGS_EXPORTER=otlp

OTEL_TRACES_EXPORTER=otlp

```
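
With `http/protobuf`, OpenTelemetry SDKs generally derive one URL per signal from the base endpoint by appending `/v1/metrics`, `/v1/logs`, and `/v1/traces`. The hypothetical helper below sketches that derivation, which can be handy when an exporter needs an explicit URL or when probing the collector by hand. It is an illustration of the convention, not code taken from any SDK.

```python
def signal_urls(base_endpoint: str) -> dict[str, str]:
    """Build the conventional OTLP/HTTP URL for each signal type."""
    base = base_endpoint.rstrip("/")  # avoid a double slash when joining
    return {signal: f"{base}/v1/{signal}" for signal in ("metrics", "logs", "traces")}


if __name__ == "__main__":
    for signal, url in signal_urls("http://alloy:4318").items():
        print(f"{signal}: {url}")
```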

Verify integration:

```bash

# Check Alloy logs

railway logs -s alloy

# Look for a line similar to "Successfully received OTLP metrics" (exact wording may vary by Alloy version)

```

See references/otel-integration.md for complete guide.

Step 5: Deploy Grafana Observability Stack

Full monitoring solution on Railway.

Deploy stack:

```bash

# Run deployment script

cd .claude/skills/railway-observability/scripts

./deploy-grafana-stack.sh

# Or deploy manually via Railway Dashboard

# Template: 8TLSQD (Grafana Stack)

```

Stack includes:

Grafana (port 3000) - Dashboards
Prometheus (port 9090) - Metrics storage
Loki (port 3100) - Log aggregation
Tempo (port 3200) - Distributed tracing
Alloy (port 4318) - OTLP collector

Access Grafana:

```

URL: https://<your-grafana-service>.up.railway.app

Username: admin

Password: (set during deployment)

```

Architecture

Progressive Disclosure Structure

- SKILL.md (this file): Quick start and workflow
- references/: Detailed documentation
  - metrics-reference.md - Complete metrics catalog
  - dashboard-widgets.md - Widget configuration guide
  - otel-integration.md - External integration setup
- scripts/: Automation tools
  - deploy-grafana-stack.sh - Deploy observability stack
- templates/: Configuration files
  - alloy-config.river - Grafana Alloy collector config

Key Features

Built-in Metrics (Always Available)

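| Metric | Reported as | Unit | Retention |
|--------|-------------|------|-----------|
| CPU | Usage relative to allocated resources | % | 30 days |
| Memory | Usage over time, share of limit | MB/GB, % | 30 days |
| Disk | Storage consumption, read/write IOPS | GB | 30 days |
| Network | Ingress/egress throughput | MB/s | 30 days |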

No configuration required - Metrics collected automatically for all services.

Dashboard Customization

Drag-and-drop widgets:

Resize and reposition
Multiple chart types
Custom time ranges
Deployment markers

Multi-replica support:

Aggregated views (avg, min, max, sum)
Per-replica breakdowns
Replica scaling events
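
As an illustration of what the aggregated views compute, the sketch below collapses per-replica samples for a single timestamp into avg, min, max, and sum. The replica names and values are made up, and this is not Railway's implementation, only the arithmetic behind the four aggregation modes.

```python
from statistics import fmean


def aggregate_replicas(samples: dict[str, float]) -> dict[str, float]:
    """Collapse per-replica values for one timestamp into summary statistics."""
    if not samples:
        raise ValueError("at least one replica sample is required")
    values = list(samples.values())
    return {
        "avg": fmean(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
    }


if __name__ == "__main__":
    # Hypothetical CPU percentages for three replicas.
    print(aggregate_replicas({"replica-1": 40.0, "replica-2": 60.0, "replica-3": 80.0}))
```

An average can hide one saturated replica, which is why the max view and per-replica breakdowns are worth checking alongside it.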

Time range options:

Last 1 hour
Last 6 hours
Last 24 hours
Last 7 days
Last 30 days
Custom range

Alerting (Pro Plan Only)

Threshold alerts:

CPU > X% for Y minutes
Memory > X MB for Y minutes
Disk > X% capacity
Network > X MB/s

Notification channels:

```bash

# Email

alert@example.com

# Discord webhook

https://discord.com/api/webhooks/...

# Slack webhook

https://hooks.slack.com/services/...

# Custom webhook

https://your-api.com/alerts

```

Alert states:

Pending (condition met, waiting duration)
Firing (threshold exceeded)
Resolved (back to normal)
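
The three states can be read as a small state machine driven by the threshold and the duration. The sketch below is one plausible reading of the descriptions above, not Railway's actual evaluation logic; the `AlertRule` class and the extra `"ok"` state (no alert active) are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float
    duration_s: float
    state: str = "ok"  # hypothetical idle state, not listed in the docs
    breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> str:
        """Feed one sample (taken at time `now`, in seconds) and return the new state."""
        if value > self.threshold:
            if self.breach_started is None:
                self.breach_started = now
            elapsed = now - self.breach_started
            # Pending until the condition has held for the full duration.
            self.state = "firing" if elapsed >= self.duration_s else "pending"
        else:
            # Only an alert that actually fired reports "resolved".
            self.state = "resolved" if self.state == "firing" else "ok"
            self.breach_started = None
        return self.state


if __name__ == "__main__":
    rule = AlertRule(threshold=80, duration_s=300)
    for t, cpu in [(0, 85), (120, 90), (300, 88), (360, 70)]:
        print(t, cpu, rule.observe(cpu, now=t))
```

A dip below the threshold resets the timer, which is how the duration setting filters out short spikes.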

OTEL Integration

Supported protocols:

OTLP/HTTP (port 4318)
OTLP/gRPC (port 4317)

Signal types:

Metrics (sent as OTLP, stored in Prometheus)
Logs (structured, stored in Loki)
Traces (sent as OTLP, stored in Tempo)

Collector options:

Grafana Alloy (recommended)
OpenTelemetry Collector
Custom OTLP exporters

Grafana Stack Details

Template Deployment (8TLSQD)

One-click deployment:

```

Railway Dashboard → New Project → Deploy Template → Search "8TLSQD"

```

Services deployed:

Grafana (visualization)
Prometheus (time-series DB)
Loki (log aggregation)
Tempo (trace storage)
Alloy (OTLP collector)

Configuration:

Pre-configured data sources
Sample dashboards included
Persistent storage volumes
Internal networking enabled

Alloy Collector

Purpose: Receive OTLP signals from Railway services and forward to Grafana stack.

Configuration (templates/alloy-config.river):

```river
// OTLP receiver
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }

  http {
    endpoint = "0.0.0.0:4318"
  }

  output {
    metrics = [otelcol.exporter.prometheus.default.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}

// Prometheus exporter
// (prometheus.remote_write "railway" must be defined elsewhere in the config)
otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.railway.receiver]
}

// Loki exporter
// (loki.write "railway" must be defined elsewhere in the config)
otelcol.exporter.loki "default" {
  forward_to = [loki.write.railway.receiver]
}

// Tempo exporter
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
  }
}
```

Grafana Dashboards

Pre-built dashboards:

Railway Service Metrics (CPU, Memory, Disk, Network)
OTLP Metrics Overview
Log Analytics (Loki)
Trace Explorer (Tempo)

Custom dashboards:

Create via Grafana UI
Import from Grafana.com
Build with JSON models

Integration Examples

Example 1: Monitor Node.js Service

```bash

# Install OTEL SDK

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node

# Configure OTEL (in Railway service)

export OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4318

export OTEL_SERVICE_NAME=nodejs-backend

export OTEL_METRICS_EXPORTER=otlp

export OTEL_LOGS_EXPORTER=otlp

export OTEL_TRACES_EXPORTER=otlp

# Auto-instrumentation

node --require @opentelemetry/auto-instrumentations-node/register app.js

```

View in Grafana:

Metrics: Grafana → Dashboards → Railway Service Metrics
Logs: Grafana → Explore → Loki
Traces: Grafana → Explore → Tempo

Example 2: Alert on High Memory

```yaml

# Pro Plan: Service → Alerts → New Rule

Name: High Memory Alert

Metric: Memory Usage

Condition: Greater than 512 MB

Duration: 10 minutes

Notification: Slack webhook

Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK/URL

```

Slack notification:

```json
{
  "text": "🚨 High Memory Alert",
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "Service: backend-production\nMemory: 567 MB (> 512 MB threshold)\nDuration: 12 minutes"
      }
    }
  ]
}
```
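
If alerts are relayed through a custom webhook, a message of the same shape could be assembled programmatically. The sketch below builds that structure; `build_slack_message` is a hypothetical helper, and the exact text Railway sends to Slack may differ from this layout.

```python
import json


def build_slack_message(service: str, used_mb: float, threshold_mb: float, minutes: int) -> dict:
    """Build a Slack message with one mrkdwn section summarising the alert."""
    summary = (
        f"Service: {service}\n"
        f"Memory: {used_mb:g} MB (> {threshold_mb:g} MB threshold)\n"
        f"Duration: {minutes} minutes"
    )
    return {
        "text": "🚨 High Memory Alert",  # fallback text for notifications
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": summary}}
        ],
    }


if __name__ == "__main__":
    message = build_slack_message("backend-production", 567, 512, 12)
    print(json.dumps(message, indent=2, ensure_ascii=False))
```

The resulting dictionary can be serialised with `json.dumps` and posted to the Slack webhook URL configured for the alert.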

Example 3: Custom Metrics from Python

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

# Configure OTLP exporter
exporter = OTLPMetricExporter(
    endpoint="http://alloy:4318/v1/metrics"
)

# Create meter provider (the reader exports collected metrics every 60 s)
provider = MeterProvider(metric_readers=[
    PeriodicExportingMetricReader(exporter, export_interval_millis=60000)
])
metrics.set_meter_provider(provider)

# Create custom metrics
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("api_requests_total")
response_time = meter.create_histogram("api_response_time_seconds")

# Record metrics
request_counter.add(1, {"endpoint": "/api/users", "method": "GET"})
response_time.record(0.125, {"endpoint": "/api/users"})
```

View in Grafana:

```

Explore → Prometheus → Metrics Browser → api_requests_total

```

Cross-References

railway-api: Query metrics via GraphQL API
railway-logs: Log aggregation (complements metrics)
railway-deployment: Deployment marker correlation
observability-stack-setup: Local LGTM stack (similar to Railway stack)

Troubleshooting

Metrics Not Showing

Check service status:

```bash
railway status -s <service>
```

Verify metrics enabled:

Metrics are always enabled
Check service is running
Wait 1-2 minutes for first data points

Alerts Not Firing

Requirements:

Pro plan subscription required
Alert rule properly configured
Condition met for full duration
Valid notification channel

Debug checklist:

```

Verify Pro plan active
Check threshold configuration
Confirm duration setting
Test webhook URL manually
Check Railway dashboard for alert status

```

OTEL Integration Issues

Common problems:

Incorrect endpoint URL
Network connectivity (use internal Railway URLs)
Authentication headers missing
Protocol mismatch (HTTP vs gRPC)
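
Two of these problems, a malformed endpoint and a protocol/port mismatch, can be caught before deploying. The sketch below is a hypothetical pre-flight check (`check_otlp_config` is not part of any SDK or CLI) that compares the endpoint's port with the conventional port for the chosen protocol, using the port numbers listed under "Supported protocols".

```python
from urllib.parse import urlparse

# Conventional OTLP ports: 4318 for HTTP, 4317 for gRPC.
DEFAULT_PORTS = {"http/protobuf": 4318, "http/json": 4318, "grpc": 4317}


def check_otlp_config(endpoint: str, protocol: str) -> list[str]:
    """Return human-readable warnings (an empty list if nothing looks off)."""
    warnings: list[str] = []
    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https"):
        warnings.append(f"endpoint should start with http:// or https://, got {endpoint!r}")
        return warnings
    expected = DEFAULT_PORTS.get(protocol)
    if expected is None:
        warnings.append(f"unrecognised protocol {protocol!r}")
    elif parsed.port is not None and parsed.port != expected:
        warnings.append(
            f"port {parsed.port} is unusual for {protocol}; {expected} is the conventional port"
        )
    return warnings


if __name__ == "__main__":
    print(check_otlp_config("http://alloy:4317", "http/protobuf"))
```

A non-default port is only a warning, since a collector can legitimately be configured to listen elsewhere.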

Verify Alloy receiving data:

```bash

# Check Alloy logs

railway logs -s alloy

# Look for lines similar to these (exact wording may vary by Alloy version):

# ✅ "OTLP receiver started"

# ✅ "Received X metric points"

# ❌ "Connection refused" = endpoint issue

# ❌ "Unauthorized" = auth issue

```

See references/otel-integration.md for detailed troubleshooting.

Best Practices

Dashboard Design

Group related metrics (CPU + Memory together)
Use deployment markers to correlate changes
Set appropriate time ranges (7d for trends, 1h for incidents)
Create separate dashboards for different services
Use multi-replica views for scaled services

Alerting Strategy

Set thresholds based on actual usage patterns
Use duration to avoid false positives (5-10 min)
Test notification channels before production
Document escalation procedures
Review and tune alerts regularly

OTEL Integration

Use internal Railway URLs (no egress costs)
Batch metrics for efficiency (60s intervals)
Add service tags for filtering
Use sampling for high-volume traces
Monitor collector resource usage
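
To show what sampling high-volume traces means in practice, the sketch below implements a simplified ratio-based decision on the trace ID, in the spirit of OpenTelemetry's trace-ID-ratio sampler. The function `should_sample` is a hypothetical illustration and not the SDK's code; in a real service the SDK's built-in sampler is usually selected with `OTEL_TRACES_SAMPLER=traceidratio` and `OTEL_TRACES_SAMPLER_ARG`.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # decide on the low 64 bits of the trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deterministically per trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = int(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


if __name__ == "__main__":
    # Random 128-bit IDs stand in for real trace IDs.
    ids = [random.getrandbits(128) for _ in range(10_000)]
    kept = sum(should_sample(i, 0.1) for i in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Because the decision depends only on the trace ID, every service that sees the same trace makes the same choice, so sampled traces stay complete.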

Grafana Stack

Enable persistent volumes for data retention
Configure backup strategy
Secure Grafana with strong password
Use Railway internal networking
Monitor stack resource consumption

Quick Reference

Accessing Observability

```bash
# Railway Dashboard
https://railway.app/project/<project-id>/service/<service-id>/metrics

# Via Railway CLI
railway status -s <service>
railway metrics -s <service>
```

Common Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| CPU | 70% | 90% |
| Memory | 75% | 90% |
| Disk | 80% | 95% |
| Network | 80% bandwidth | 95% bandwidth |
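
The table can be applied mechanically when triaging a reading. The sketch below encodes the thresholds and classifies a percentage as `ok`, `warning`, or `critical`; the `classify` helper is hypothetical, and treating a value exactly at a threshold as having crossed it is an assumption.

```python
# (warning, critical) thresholds in percent, copied from the table above.
THRESHOLDS = {
    "cpu": (70, 90),
    "memory": (75, 90),
    "disk": (80, 95),
    "network": (80, 95),
}


def classify(metric: str, percent: float) -> str:
    """Map a utilisation percentage to 'ok', 'warning', or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if percent >= critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for metric, value in [("cpu", 87.5), ("disk", 96), ("memory", 50)]:
        print(metric, value, classify(metric, value))
```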

OTEL Environment Variables

```bash

OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4318

OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

OTEL_SERVICE_NAME=my-service

OTEL_METRICS_EXPORTER=otlp

OTEL_LOGS_EXPORTER=otlp

OTEL_TRACES_EXPORTER=otlp

```

Grafana Stack Template

```

Template ID: 8TLSQD

Components: Grafana, Prometheus, Loki, Tempo, Alloy

Deployment: Railway Dashboard → New Project → Deploy Template

Cost: ~$20-30/month (depends on usage)

```

Resources

Railway Observability Docs: https://docs.railway.com/reference/observability
Grafana Template: https://railway.app/template/8TLSQD
OTEL Documentation: https://opentelemetry.io/docs/
Grafana Alloy: https://grafana.com/docs/alloy/

Notes

Built-in metrics are free for all plans
Alerting requires Pro plan ($20/month)
OTEL integration works on all plans
Grafana stack deploys as separate Railway services
Metrics retained for 30 days
Use internal Railway URLs for OTEL (no egress costs)
Deployment markers automatically added to charts
Multi-replica metrics aggregated automatically

---

Updated: November 26, 2025

Template ID: 8TLSQD (Grafana Stack)

Retention: 30 days (built-in metrics)
