LLM Load Balancer is a centralized system that provides unified management and a single API endpoint for multiple LLM inference runtimes running across different machines. It features intelligent load balancing, automatic failure detection, and real-time monitoring, and it scales by adding runtimes behind the same endpoint.
Vision
LLM Load Balancer is designed to serve three primary use cases:
- Private LLM Server - For individuals and small teams who want to run their own LLM infrastructure with full control over their data and models
- Enterprise Gateway - For organizations requiring centralized management, access control, and monitoring of LLM resources across departments
- Cloud Provider Integration - For routing requests to OpenAI, Google, or Anthropic APIs through the same unified endpoint
Multi-Engine Architecture
LLM Load Balancer uses a manager-based multi-engine architecture:
| Engine | Status | Models | Hardware |
|--------|--------|--------|----------|
| llama.cpp | Production | GGUF format (LLaMA, Mistral, etc.) | CPU, CUDA, Metal |
| GPT-OSS | Production (Metal/CUDA) | Safetensors (official GPU artifacts) | Apple Silicon (Metal), Windows (CUDA) |
| Whisper | Production | Speech-to-Text (ASR) | CPU, CUDA, Metal |
| Stable Diffusion | Production | Image Generation | CUDA, Metal |
| Nemotron | Validation | Safetensors format | CUDA |
Manager-based runtimes replace the legacy plugin system. See docs/manager-migration.md
for migration steps.
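For orientation, here is a minimal Python sketch of what a manager-based runtime could look like. The class and method names are hypothetical, not the project's actual interface; see docs/manager-migration.md for the real one.

```python
from abc import ABC, abstractmethod


class EngineManager(ABC):
    """Hypothetical manager interface: one manager per engine type."""

    @abstractmethod
    def start(self, model_path: str) -> None:
        """Launch the underlying inference runtime for a model."""

    @abstractmethod
    def stop(self) -> None:
        """Shut the runtime down and release hardware resources."""

    @abstractmethod
    def healthy(self) -> bool:
        """Report liveness so the balancer can route around failures."""


class LlamaCppManager(EngineManager):
    """Would wrap a llama.cpp runtime serving GGUF models."""

    def start(self, model_path: str) -> None: ...
    def stop(self) -> None: ...
    def healthy(self) -> bool:
        return True
```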
Engine Selection Policy:
- Models with GGUF available → Use llama.cpp (Metal/CUDA ready)
- Models with safetensors only → Implement built-in engine (Metal/CUDA support required)
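The policy reduces to a small decision function. A minimal Python sketch, assuming simple format tags like "gguf" and "safetensors" (the actual model metadata schema may differ):

```python
def select_engine(available_formats: set[str]) -> str:
    """Sketch of the engine selection policy described above."""
    if "gguf" in available_formats:
        return "llama.cpp"   # GGUF available -> llama.cpp (Metal/CUDA ready)
    if "safetensors" in available_formats:
        return "built-in"    # safetensors only -> built-in engine
    raise ValueError("no supported model format found")


assert select_engine({"gguf", "safetensors"}) == "llama.cpp"
assert select_engine({"safetensors"}) == "built-in"
```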
Safetensors Architecture Support (Implementation-Aligned)
| Architecture | Status | Notes |
|-------------|--------|-------|
| gpt-oss (MoE + MXFP4) | Implemented | Uses `mlp.router.*` and `mlp.experts.*_(blocks\|scales\|bias)` with MoE forward |
| nemotron3 (Mamba-Transformer MoE) | Staged (not wired) | Not connected to the forward pass yet |
See /blob/main/specs/SPEC-69549000/spec.md for the authoritative list and updates.
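As an illustration of the tensor-naming convention in the table above, the following Python snippet matches MoE parameter keys against those patterns. The sample key names follow the published GPT-OSS checkpoint layout, but the spec linked above is authoritative:

```python
import re

# Illustrative patterns for the gpt-oss MoE tensor names listed above.
EXPERT_KEY = re.compile(r"mlp\.experts\..*_(blocks|scales|bias)$")
ROUTER_KEY = re.compile(r"mlp\.router\.")

keys = [
    "model.layers.0.mlp.router.weight",
    "model.layers.0.mlp.experts.gate_up_proj_blocks",
    "model.layers.0.mlp.experts.down_proj_scales",
]
moe_keys = [k for k in keys if EXPERT_KEY.search(k) or ROUTER_KEY.search(k)]
print(moe_keys)  # all three keys match
```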
GGUF Architecture Coverage (llama.cpp, Examples)
These are representative examples of model families supported via GGUF/llama.cpp. This list is
non-exhaustive and follows upstream llama.cpp compatibility.
| Architecture | Example models | Notes |
|-------------|----------------|-------|
| llama | Llama 3.1, Llama 3.2, Llama 3.3, DeepSeek-R1-Distill-Llama | Meta Llama family |
| mistral | Mistral, Mistral-Nemo | Mistral AI family |
| gemma | Gemma3, Gemma3n, Gemma3-QAT, FunctionGemma, EmbeddingGemma | Google Gemma family |
| qwen | Qwen2.5, Qwen3, QwQ, Qwen3-VL, Qwen3-Coder, Qwen3-Embedding, Qwen3-Reranker | Alibaba Qwen family |
| phi | Phi-4 | Microsoft Phi family |
| nemotron | Nemotron | NVIDIA Nemotron family |
| deepseek | DeepSeek-V3.2, DeepCoder-Preview | DeepSeek family |
| gpt-oss | GPT-OSS, GPT-OSS-Safeguard | OpenAI GPT-OSS family |
| granite | Granite-4.0-H-Small/Tiny/Micro, Granite-Docling | IBM Granite family |
| smollm | SmolLM2, SmolLM3, SmolVLM | HuggingFace SmolLM family |
| kimi | Kimi-K2 | Moonshot Kimi family |
| moondream | Moondream2 | Moondream family |
| devstral | Devstral-Small | Mistral derivative (coding-focused) |
| magistral | Magistral-Small-3.2 | Mistral derivative (multimodal) |
Multimodal Support
Beyond text generation, LLM Load Balancer provides OpenAI-compatible APIs for:
- Text-to-Speech (TTS): /v1/audio/speech - Generate natural speech from text
- Speech-to-Text (ASR): /v1/audio/transcriptions - Transcribe audio to text
- Image Generation: /v1/images/generations - Generate images from text prompts
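Because the endpoints are OpenAI-compatible, any HTTP client works. A minimal sketch for the TTS endpoint, where the base URL, API key, model, and voice are placeholders for your deployment:

```python
import requests

# Request speech synthesis from the OpenAI-compatible TTS endpoint.
resp = requests.post(
    "http://localhost:8080/v1/audio/speech",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "tts-1",
        "input": "Hello from the load balancer.",
        "voice": "alloy",
    },
    timeout=60,
)
resp.raise_for_status()

# The endpoint returns raw audio bytes.
with open("speech.mp3", "wb") as f:
    f.write(resp.content)
```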
Text generation should use the Responses API (/v1/responses) by default. Chat Completions remains
available for compatibility.
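Since the endpoint is OpenAI-compatible, the official OpenAI Python SDK can be pointed directly at the balancer. A minimal sketch, where the base URL, API key, and model name are placeholders:

```python
from openai import OpenAI

# Point the OpenAI SDK at the balancer's unified endpoint.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_API_KEY")

# Text generation via the Responses API (the recommended default).
resp = client.responses.create(
    model="llama-3.1-8b-instruct",
    input="Summarize the engine selection policy in one sentence.",
)
print(resp.output_text)
```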