# drawio

Skill from akiojin/llmlb

## What it does

Generates or visualizes architecture and system design diagrams for the LLM Load Balancer project using Draw.io/diagrams.net format.



## Skill Details (SKILL.md)

# LLM Load Balancer

A centralized management system for coordinating LLM inference runtimes across multiple machines

English | [日本語](./README.ja.md)

## Overview

LLM Load Balancer is a powerful centralized system that provides unified management and a single API endpoint for multiple LLM inference runtimes running across different machines. It features intelligent load balancing, automatic failure detection, real-time monitoring capabilities, and seamless integration for enhanced scalability.

## Vision

LLM Load Balancer is designed to serve three primary use cases:

  1. Private LLM Server - For individuals and small teams who want to run their own LLM infrastructure with full control over their data and models
  2. Enterprise Gateway - For organizations requiring centralized management, access control, and monitoring of LLM resources across departments
  3. Cloud Provider Integration - Seamlessly route requests to OpenAI, Google, or Anthropic APIs through the same unified endpoint

## Multi-Engine Architecture

LLM Load Balancer uses a manager-based multi-engine architecture:

| Engine | Status | Models | Hardware |
|--------|--------|--------|----------|
| llama.cpp | Production | GGUF format (LLaMA, Mistral, etc.) | CPU, CUDA, Metal |
| GPT-OSS | Production (Metal/CUDA) | Safetensors (official GPU artifacts) | Apple Silicon, Windows |
| Whisper | Production | Speech-to-Text (ASR) | CPU, CUDA, Metal |
| Stable Diffusion | Production | Image Generation | CUDA, Metal |
| Nemotron | Validation | Safetensors format | CUDA |

Manager-based runtimes replace the legacy plugin system. See docs/manager-migration.md for migration steps.

Engine Selection Policy:

  - Models with GGUF available → Use llama.cpp (Metal/CUDA ready)
  - Models with safetensors only → Implement built-in engine (Metal/CUDA support required)

## Safetensors Architecture Support (Implementation-Aligned)

| Architecture | Status | Notes |
|--------------|--------|-------|
| gpt-oss (MoE + MXFP4) | Implemented | Uses `mlp.router.*` and `mlp.experts.*_(blocks\|scales\|bias)` with MoE forward |
| nemotron3 (Mamba-Transformer MoE) | Staged (not wired) | Not connected to the forward pass yet |

See /blob/main/specs/SPEC-69549000/spec.md for the authoritative list and updates.

## GGUF Architecture Coverage (llama.cpp, Examples)

These are representative examples of model families supported via GGUF/llama.cpp. This list is non-exhaustive and follows upstream llama.cpp compatibility.

| Architecture | Example models | Notes |
|--------------|----------------|-------|
| llama | Llama 3.1, Llama 3.2, Llama 3.3, DeepSeek-R1-Distill-Llama | Meta Llama family |
| mistral | Mistral, Mistral-Nemo | Mistral AI family |
| gemma | Gemma3, Gemma3n, Gemma3-QAT, FunctionGemma, EmbeddingGemma | Google Gemma family |
| qwen | Qwen2.5, Qwen3, QwQ, Qwen3-VL, Qwen3-Coder, Qwen3-Embedding, Qwen3-Reranker | Alibaba Qwen family |
| phi | Phi-4 | Microsoft Phi family |
| nemotron | Nemotron | NVIDIA Nemotron family |
| deepseek | DeepSeek-V3.2, DeepCoder-Preview | DeepSeek family |
| gpt-oss | GPT-OSS, GPT-OSS-Safeguard | OpenAI GPT-OSS family |
| granite | Granite-4.0-H-Small/Tiny/Micro, Granite-Docling | IBM Granite family |
| smollm | SmolLM2, SmolLM3, SmolVLM | HuggingFace SmolLM family |
| kimi | Kimi-K2 | Moonshot Kimi family |
| moondream | Moondream2 | Moondream family |
| devstral | Devstral-Small | Mistral derivative (coding-focused) |
| magistral | Magistral-Small-3.2 | Mistral derivative (multimodal) |

## Multimodal Support

Beyond text generation, LLM Load Balancer provides OpenAI-compatible APIs for:

  - Text-to-Speech (TTS): /v1/audio/speech - Generate natural speech from text
  - Speech-to-Text (ASR): /v1/audio/transcriptions - Transcribe audio to text
  - Image Generation: /v1/images/generations - Generate images from text prompts
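
As an illustration only (not taken from the project docs): assuming these endpoints accept the standard OpenAI request shapes and an OpenAI-style Bearer API key, calls could look like the sketch below. The model names and key are placeholders for whatever is registered on your instance.

```bash
BASE_URL=http://localhost:32768
API_KEY=sk_your_api_key   # placeholder; use a key issued by your deployment

# Text-to-Speech (fields follow the OpenAI schema; extra fields such as "voice" may apply)
curl -s "$BASE_URL/v1/audio/speech" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "your-tts-model", "input": "Hello from LLM Load Balancer"}' \
  -o speech-output.bin

# Image Generation
curl -s "$BASE_URL/v1/images/generations" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "your-stable-diffusion-model", "prompt": "a lighthouse at sunset"}'
```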

Text generation should use the Responses API (/v1/responses) by default. Chat Completions remains available for compatibility.
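
A minimal text-generation request against the Responses API might look like this (a sketch assuming the standard OpenAI Responses request shape; the model name and API key are placeholders):

```bash
# Placeholders: "your-local-model" and the Bearer key must match your deployment
curl -s http://localhost:32768/v1/responses \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"model": "your-local-model", "input": "Summarize what a load balancer does."}'
```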

## Key Features

  - Unified API Endpoint: Access multiple LLM runtime instances through a single URL
  - Automatic Load Balancing: Latency-based request distribution across available endpoints
  - Endpoint Management: Centralized management of Ollama, vLLM, xLLM, and other OpenAI-compatible servers
  - Model Sync: Automatic model discovery via GET /v1/models from registered endpoints
  - Automatic Failure Detection: Detect offline endpoints and exclude them from routing
  - Real-time Monitoring: Comprehensive visualization of endpoint states and performance metrics via web dashboard
  - Request History Tracking: Complete request/response logging with 7-day retention
  - WebUI Management: Manage endpoints, monitoring, and control through a browser-based dashboard
  - Cross-Platform Support: Works on Windows 10+, macOS 12+, and Linux
  - GPU-Aware Routing: Intelligent request routing based on GPU capabilities and availability
  - Cloud Model Prefixes: Add an openai:, google:, or anthropic: prefix to the model name to proxy to the corresponding cloud provider while keeping the same OpenAI-compatible endpoint (see the example after this list).
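
As a sketch of model discovery and the cloud prefixes (assuming OpenAI-style Bearer authentication; the key and local model names are placeholders):

```bash
# List models discovered from all registered endpoints
curl -s http://localhost:32768/v1/models \
  -H "Authorization: Bearer sk_your_api_key"

# The "openai:" prefix proxies this request to OpenAI through the same endpoint
curl -s http://localhost:32768/v1/chat/completions \
  -H "Authorization: Bearer sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"model": "openai:gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'
```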

## MCP Server for LLM Assistants

LLM assistants (like Claude Code) can interact with LLM Load Balancer through a dedicated MCP server. This is the recommended approach over using Bash with curl commands directly.

The MCP server is installed and run with npm/npx; the repository root uses pnpm for workspace tasks.

### Why MCP Server over Bash + curl?

| Feature | MCP Server | Bash + curl |
|---------|------------|-------------|
| Authentication | Auto-injected | Manual header management |
| Security | Host whitelist, injection prevention | No built-in protection |
| Shell injection | Protected (shell: false) | Vulnerable |
| API documentation | Built-in as MCP resources | External reference needed |
| Credential handling | Automatic masking in logs | Exposed in command history |
| Timeout management | Configurable per-request | Manual implementation |
| Error handling | Structured JSON responses | Raw text parsing |

### Installation

```bash
npm install -g @llmlb/mcp-server
# or
npx @llmlb/mcp-server
```

### Configuration (.mcp.json)

```json
{
  "mcpServers": {
    "llmlb": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@llmlb/mcp-server"],
      "env": {
        "LLMLB_URL": "http://localhost:32768",
        "LLMLB_API_KEY": "sk_your_api_key"
      }
    }
  }
}
```

For detailed documentation, see [mcp-server/README.md](./mcp-server/README.md).

## Quick Start

### LLM Load Balancer (llmlb)

```bash
# Build
cargo build --release -p llmlb

# Run
./target/release/llmlb
# Default: http://0.0.0.0:32768

# Access dashboard
# Open http://localhost:32768/dashboard?internal_token=YOUR_TOKEN in browser
# (LLMLB_INTERNAL_API_TOKEN is required)
```

Environment Variables:

| Variable | Default | Description |
|----------|---------|-------------|
| LLMLB_HOST | 0.0.0.0 | Bind address |
| LLMLB_PORT | 32768 | Listen port |
| LLMLB_LOG_LEVEL | info | Log level |
| LLMLB_JWT_SECRET | (auto-generated) | JWT signing secret |
| LLMLB_ADMIN_USERNAME | admin | Initial admin username |
| LLMLB_ADMIN_PASSWORD | (required) | Initial admin password |
| LLMLB_INTERNAL_API_TOKEN | (required) | Internal token for /api, /dashboard, /ws |

Backward compatibility: Legacy env var names (LLMLB_PORT etc.) are supported but deprecated.
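
For example, a minimal launch with the required variables could look like this (values are placeholders):

```bash
export LLMLB_ADMIN_PASSWORD='change-me'                 # required: initial admin password
export LLMLB_INTERNAL_API_TOKEN='your-internal-token'   # required: protects /api, /dashboard, /ws
export LLMLB_HOST=0.0.0.0                               # optional; defaults shown in the table above
export LLMLB_PORT=32768
./target/release/llmlb
```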

System Tray (Windows/macOS only):

On Windows 10+ and macOS 12+, the load balancer displays a system tray icon. Double-click to open the dashboard. Docker/Linux runs as a headless CLI process.

## CLI Reference

The load balancer CLI currently exposes only basic flags (--help, --version). Day-to-day management is done via the Dashboard UI (/dashboard) or the HTTP APIs.
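
For reference, the two flags can be exercised against the release binary built in Quick Start:

```bash
./target/release/llmlb --version   # print the version and exit
./target/release/llmlb --help      # list the available flags
```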

## xLLM (C++)

The xLLM
