railway-troubleshooting

from adaptationio/skrillz

What it does

Systematically diagnoses and resolves Railway.com deployment issues across builds, services, networking, and database problems using structured troubleshooting workflows.

Installation

Install skill:
npx skills add https://github.com/adaptationio/skrillz --skill railway-troubleshooting
Added Jan 27, 2026

Railway debugging and issue resolution. Use when deployments fail, builds error, services crash, performance degrades, or networking issues occur.

# Railway Troubleshooting

Systematic debugging and issue resolution for Railway.com deployments.

Overview

This skill provides decision trees, diagnostic workflows, and recovery procedures for Railway platform issues. It covers build failures, runtime crashes, networking problems, database issues, and performance degradation.

Quick Start

Use this decision tree to diagnose and resolve Railway issues:

```
Railway Issue?
│
├── Deployment Failed?
│   ├── Build Error → Operation 4: Fix Build Errors
│   ├── Deploy Error → Operation 1: Diagnose Deployment Failures
│   ├── Health Check Failed → Check service health endpoint
│   └── Timeout → Check build/deploy timeouts in settings
│
├── Service Crashing?
│   ├── Immediate crash → Operation 2: Debug Runtime Crashes
│   ├── Crash after time → Check memory limits, memory leaks
│   ├── Restart loop → Check startup command, dependencies
│   └── Exit code errors → Check application logs for specifics
│
├── Networking Issues?
│   ├── Service unreachable → Operation 3: Troubleshoot Networking
│   ├── Intermittent connectivity → Check DNS, service discovery
│   ├── SSL errors → Check domain configuration, certificates
│   └── Timeout errors → Check port configuration, firewalls
│
├── Build Issues?
│   ├── Nixpacks detection wrong → Operation 4: Fix Build Errors
│   ├── Dependencies failing → Check package.json, requirements.txt
│   ├── Build commands failing → Verify build scripts
│   └── Cache issues → Clear build cache, force rebuild
│
└── Database Problems?
    ├── Connection refused → Operation 5: Resolve Database Issues
    ├── Timeout errors → Check connection pools, query performance
    ├── Performance slow → Check indices, query optimization
    └── Data corruption → Check backups, recovery procedures
```

Operations

Operation 1: Diagnose Deployment Failures

Identify and resolve deployment failures through systematic log analysis.

When to use: Deployment status shows failed, builds succeed but deploys fail, health checks failing.

Workflow:

  1. Check Deployment Status

```bash
# CLI approach
railway status
railway logs --deployment

# API approach (see references/debug-workflow.md for GraphQL)
# Query deployment status and recent deploys
```

  2. Analyze Deploy Logs

- Check for port binding issues (Railway expects PORT env var)

- Verify health check endpoint responding

- Check startup command execution

- Identify timeout issues

  3. Common Deploy Failures

- Port not bound: App must listen on process.env.PORT

- Health check timeout: Increase timeout or fix endpoint

- Missing environment variables: Check service variables

- Startup command wrong: Verify start command in settings

  4. Fix and Redeploy

- Apply fix to code/configuration

- Trigger new deployment

- Monitor deployment logs

- Verify service healthy

See: references/common-errors.md for specific error messages and solutions.
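The port and environment-variable checks in this workflow can be sketched as a small pre-start guard. This is a minimal sketch and not part of the skill's scripts: the function name `check_required_vars` is hypothetical, and which variables to pass to it depends on your service.

```bash
# Hypothetical pre-start guard: report required variables that are unset or empty.
# Returns 0 when every named variable has a value, 1 otherwise.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: confirm PORT (which the app must listen on) is present before starting
if check_required_vars PORT; then
  echo "PORT is set to $PORT"
else
  echo "PORT is not set; the app would likely fail to bind"
fi
```

Calling something like `check_required_vars PORT DATABASE_URL` at the top of a start script would surface a missing variable in the deploy logs before the application itself fails.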

Operation 2: Debug Runtime Crashes

Investigate and resolve service crashes and restart loops.

When to use: Service shows restarting, exit codes in logs, OOM errors, crash reports.

Workflow:

  1. Gather Crash Information

```bash
# Get runtime logs
railway logs --tail 500

# Check service metrics
railway metrics

# Use diagnostic script
./scripts/diagnose.sh [service-id] --verbose
```

  2. Identify Crash Pattern

- Immediate crash: Startup issue (missing deps, config error)

- Crash after time: Memory leak, resource exhaustion

- Intermittent crash: Race condition, external dependency

- Exit code 137: Out of Memory (OOM) killed

  3. Check Resource Limits

- Memory usage trending up → Memory leak

- CPU at 100% → Infinite loop, CPU-intensive operation

- Disk full → Log rotation issue, temp files

- Connection limits → Database pool exhausted

  4. Common Crash Causes

- OOM: Increase memory limit or fix memory leak

- Missing dependencies: Check package installation

- Uncaught exceptions: Add error handling

- External service down: Add retry logic, circuit breakers

See: references/debug-workflow.md for systematic debugging steps.
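Exit codes above 128 conventionally mean the process was ended by a signal (128 plus the signal number), which is why 137 (128 + 9, SIGKILL) usually points to an OOM kill. The sketch below turns that rule into a helper; the function name `describe_exit_code` and its message wording are illustrative assumptions, not something the skill ships.

```bash
# Hypothetical helper: translate a process exit code into a likely cause.
describe_exit_code() {
  code=$1
  case "$code" in
    0)   echo "clean exit" ;;
    137) echo "killed by SIGKILL (128 + 9), commonly the OOM killer" ;;
    143) echo "terminated by SIGTERM (128 + 15)" ;;
    *)
      if [ "$code" -gt 128 ]; then
        # Shell convention: exit code = 128 + signal number
        echo "terminated by signal $((code - 128))"
      else
        echo "application error, check logs"
      fi
      ;;
  esac
}

describe_exit_code 137
```

An ordinary non-zero code such as 1 comes from the application itself, so the runtime logs are the place to look in that case.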

Operation 3: Troubleshoot Networking

Resolve networking issues including service discovery, DNS, and connectivity.

When to use: Services can't reach each other, DNS resolution fails, external access issues, SSL errors.

Workflow:

  1. Verify Service Discovery

```bash
# Check private networking enabled
# Services use: [service-name].railway.internal

# Test DNS resolution from inside a deployed service; `railway run` executes
# the command on your local machine, where railway.internal names do not resolve
nslookup [service-name].railway.internal
```

  2. Check Network Configuration

- Private networking enabled in project settings

- Service names correct (use Railway-provided names)

- Port configuration matches application

- Environment variables for service URLs set

  3. Debug External Access

- Domain configured correctly in service settings

- DNS records pointing to Railway

- SSL certificate provisioned (check domain settings)

- Generate domain option enabled for public access

  4. Common Network Issues

- Service discovery: Use full internal domain name

- Port mismatch: App must listen on PORT env var

- SSL not working: Allow time for cert provisioning (5-10 min)

- Timeout: Check for firewall rules, rate limiting

See: references/common-errors.md Network Errors section.
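When connectivity is intermittent, or a certificate is still being provisioned, a single failed request says little. A bounded retry loop gives a clearer signal. The following is a minimal sketch; the `retry` function is hypothetical and not part of the skill's scripts.

```bash
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Offline demonstration with a command that always succeeds
retry 3 0 true && echo "command succeeded"
```

In practice the command would be a probe such as `retry 5 10 curl -fsS --max-time 5 https://example.com/health`, where the URL and the `/health` path are placeholders for your own service domain and health endpoint.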

Operation 4: Fix Build Errors

Resolve build failures, nixpacks configuration issues, and dependency problems.

When to use: Build fails, wrong builder detected, dependencies not installing, build commands fail.

Workflow:

  1. Check Build Logs

```bash
railway logs --build

# Identify build phase failure:
# - Detection phase: Nixpacks provider detection
# - Install phase: Dependencies installation
# - Build phase: Build commands execution
```

  2. Verify Builder Configuration

- Check nixpacks.toml or railway.toml for custom config

- Verify build command in service settings

- Check for language version specification

- Ensure correct provider detected (Node, Python, Go, etc.)

  3. Fix Dependency Issues

- Lock file present (package-lock.json, yarn.lock, requirements.txt)

- Dependencies compatible with build environment

- Private packages have auth configured

- Build dependencies vs runtime dependencies separated

  4. Force Rebuild if Needed

```bash
# Clear cache and rebuild
./scripts/force-rebuild.sh [service-id] --no-cache

# Or trigger a new deployment via the CLI
# (this redeploys, but on its own does not clear the build cache)
railway up --detach
```

Common Build Errors:

  • Wrong nixpacks provider: Add nixpacks.toml with correct provider
  • Dependency resolution: Update lock files, fix version conflicts
  • Build timeout: Optimize build, increase timeout in settings
  • Cache issues: Clear build cache with force rebuild

See: references/common-errors.md Build Errors section.
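The "lock file present" check from the dependency step can be automated before pushing a build. This is a minimal sketch under stated assumptions: the function name `find_lockfile` is hypothetical, and the file list simply mirrors the three names mentioned above, so extend it for other package managers.

```bash
# Hypothetical helper: print the first dependency lock file found in a directory.
# Returns 1 (with a message on stderr) when none of the known files exist.
find_lockfile() {
  dir=${1:-.}
  for f in package-lock.json yarn.lock requirements.txt; do
    if [ -f "$dir/$f" ]; then
      echo "$f"
      return 0
    fi
  done
  echo "no lock file found in $dir" >&2
  return 1
}

# Check the current directory without aborting if nothing is found
find_lockfile . || echo "add a lock file before deploying"
```

Running this in a pre-deploy hook or CI step catches a missing lock file locally instead of partway through the install phase.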

Operation 5: Resolve Database Issues

Debug database connection problems, timeouts, and performance issues.

When to use: Connection refused, database timeouts, slow queries, connection pool exhausted.

Workflow:

  1. Verify Database Connection

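```bash
# Check database service status
railway status

# Test connection with database URL
# Note: your local shell expands $DATABASE_URL before `railway run` starts,
# so the variable must also be set in the shell that runs this command
railway run psql "$DATABASE_URL" -c "SELECT 1"
```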

  2. Check Connection Configuration

- DATABASE_URL environment variable set correctly

- Connection pool size appropriate for service plan

- Connection timeout settings reasonable

- SSL mode configured if required

  3. Debug Connection Issues

- Connection refused: Database not started, wrong host/port

- Timeout: Network issue, slow queries, pool exhausted

- Auth failed: Wrong credentials, user permissions

- Too many connections: Pool size exceeded, connection leak

  4. Performance Troubleshooting

- Slow queries: Check query plans, add indices

- High CPU: Identify expensive queries, optimize

- Connection pool exhausted: Increase pool size or fix leaks

- Disk space: Clean up old data, increase storage

Emergency Recovery:

  • Restart database service: railway restart [service-id]
  • Check backups: Railway auto-backups available
  • Scale vertically: Upgrade database plan if needed
  • Connection leak: Restart application services

See: references/recovery-procedures.md for emergency procedures.
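For the "wrong host/port" case, it helps to see which host and port a connection string actually targets without printing the credentials. The sketch below does this with plain parameter expansion; the function name `db_host_port` is hypothetical, and it assumes a URL of the usual `scheme://user:password@host:port/database` shape with any `@` in the password URL-encoded.

```bash
# Hypothetical helper: print only the host:port part of a database URL.
db_host_port() {
  url=$1
  rest=${url#*://}          # strip the scheme (e.g. postgresql://)
  rest=${rest#*@}           # strip user:password@ if present
  hostport=${rest%%/*}      # drop /database and anything after it
  hostport=${hostport%%\?*} # drop query parameters if there was no path
  echo "$hostport"
}

# Example with a made-up URL
db_host_port "postgresql://user:secret@db.example.internal:5432/app"
```

Comparing this output against the host and port shown for the database service is a quick way to spot a stale or mistyped `DATABASE_URL`.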

Related Skills

  • railway-auth: Authentication setup for Railway CLI/API
  • railway-logs: Advanced log querying and analysis
  • railway-deployment: Deployment workflows and strategies
  • railway-api: GraphQL API queries and operations

When to Use This Skill

Use railway-troubleshooting when you encounter:

  • ❌ Deployment failures or build errors
  • 🔄 Service restart loops or crashes
  • 🌐 Networking or connectivity issues
  • 🐛 Runtime errors or performance problems
  • 💾 Database connection or query issues
  • ⚡ Performance degradation
  • 🔧 Configuration or environment issues

Quick Diagnostic

Run the diagnostic script for automated issue detection:

```bash
# The path below is the author's local checkout; adjust it to wherever the skill is installed
cd /mnt/c/data/github/skrillz/.claude/skills/railway-troubleshooting/scripts
./diagnose.sh [service-id] --verbose
```

The script will:

  • Check service health status
  • Analyze recent deployment logs
  • Scan for common error patterns
  • Check resource utilization
  • Provide specific recommendations
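To illustrate what "scan for common error patterns" can look like, here is a minimal sketch that counts matching lines in a saved log file. It is not the contents of `diagnose.sh`, which are not shown here; the function name `scan_log` and the pattern list are illustrative assumptions.

```bash
# Hypothetical helper: count lines matching a few well-known error strings.
# Prints "PATTERN: COUNT" per pattern found; returns 0 if any matched,
# 1 if none matched, 2 if the file cannot be read.
scan_log() {
  logfile=$1
  [ -r "$logfile" ] || { echo "cannot read $logfile" >&2; return 2; }
  found=1
  for pattern in "EADDRINUSE" "ECONNREFUSED" "out of memory" "MODULE_NOT_FOUND"; do
    # grep -c prints the number of matching lines; -i ignores case
    count=$(grep -ci -- "$pattern" "$logfile" || true)
    if [ "$count" -gt 0 ]; then
      echo "$pattern: $count"
      found=0
    fi
  done
  return "$found"
}

# Demonstration on a small temporary log
demo_log=$(mktemp)
printf 'Error: listen EADDRINUSE: address already in use\n' > "$demo_log"
scan_log "$demo_log"
rm -f "$demo_log"
```

New patterns discovered while debugging can be appended to the loop, in the same spirit as documenting them in common-errors.md.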

Additional Resources

  • Common Errors Guide: references/common-errors.md - 20+ documented errors with solutions
  • Debug Workflow: references/debug-workflow.md - Systematic debugging methodology
  • Recovery Procedures: references/recovery-procedures.md - Emergency recovery steps
  • Diagnostic Script: scripts/diagnose.sh - Automated diagnostics
  • Force Rebuild: scripts/force-rebuild.sh - Clear cache and rebuild

Best Practices

  1. Always check logs first: Build logs, deploy logs, runtime logs
  2. Verify environment variables: Missing vars are a common cause of deployment failures
  3. Check resource limits: Memory/CPU limits appropriate for workload
  4. Test locally first: Reproduce issues locally when possible
  5. Monitor metrics: Use Railway dashboard for trends
  6. Document solutions: Update common-errors.md with new patterns
  7. Use private networking: For inter-service communication
  8. Enable health checks: Catch deployment issues early

Support

For issues not covered by this skill:

  • Railway Documentation: https://docs.railway.com
  • Railway Discord: Active community support
  • Railway Status: https://status.railway.com
  • GitHub Issues: https://github.com/railwayapp/railway/issues
