🎯

runtime-skills

🎯Skill

from llama-farm/llamafarm

VibeIndex|
What it does

Optimizes ML inference runtime with best practices for PyTorch, Transformers, and FastAPI, focusing on device management, model loading, and performance tuning.

runtime-skills

Installation

Install skill:
npx skills add https://github.com/llama-farm/llamafarm --skill runtime-skills
8
AddedJan 27, 2026

Skill Details

SKILL.md

Universal Runtime best practices for PyTorch inference, Transformers models, and FastAPI serving. Covers device management, model loading, memory optimization, and performance tuning.

Overview

# Universal Runtime Skills

Best practices and code review checklists for the Universal Runtime - LlamaFarm's local ML inference server.

Overview

The Universal Runtime provides OpenAI-compatible endpoints for HuggingFace models:

  • Text generation (Causal LMs: GPT, Llama, Mistral, Qwen)
  • Text embeddings (BERT, sentence-transformers, ModernBERT)
  • Classification, NER, and reranking
  • OCR and document understanding
  • Anomaly detection

Directory: runtimes/universal/

Python: 3.11+

Key Dependencies: PyTorch, Transformers, FastAPI, llama-cpp-python

Links to Shared Skills

This skill extends the shared Python practices. Always apply these first:

| Topic | File | Priority |

|-------|------|----------|

| Patterns | [python-skills/patterns.md](../python-skills/patterns.md) | Medium |

| Async | [python-skills/async.md](../python-skills/async.md) | High |

| Typing | [python-skills/typing.md](../python-skills/typing.md) | Medium |

| Testing | [python-skills/testing.md](../python-skills/testing.md) | Medium |

| Errors | [python-skills/error-handling.md](../python-skills/error-handling.md) | High |

| Security | [python-skills/security.md](../python-skills/security.md) | Critical |

Runtime-Specific Checklists

| Topic | File | Key Points |

|-------|------|------------|

| PyTorch | [pytorch.md](pytorch.md) | Device management, dtype, memory cleanup |

| Transformers | [transformers.md](transformers.md) | Model loading, tokenization, inference |

| FastAPI | [fastapi.md](fastapi.md) | API design, streaming, lifespan |

| Performance | [performance.md](performance.md) | Batching, caching, optimizations |

Architecture

```

runtimes/universal/

β”œβ”€β”€ server.py # FastAPI app, model caching, endpoints

β”œβ”€β”€ core/

β”‚ └── logging.py # UniversalRuntimeLogger (structlog)

β”œβ”€β”€ models/

β”‚ β”œβ”€β”€ base.py # BaseModel ABC with device management

β”‚ β”œβ”€β”€ language_model.py # Transformers text generation

β”‚ β”œβ”€β”€ gguf_language_model.py # llama-cpp-python for GGUF

β”‚ β”œβ”€β”€ encoder_model.py # Embeddings, classification, NER, reranking

β”‚ └── ... # OCR, anomaly, document models

β”œβ”€β”€ routers/

β”‚ └── chat_completions/ # Chat completions with streaming

β”œβ”€β”€ utils/

β”‚ β”œβ”€β”€ device.py # Device detection (CUDA/MPS/CPU)

β”‚ β”œβ”€β”€ model_cache.py # TTL-based model caching

β”‚ β”œβ”€β”€ model_format.py # GGUF vs transformers detection

β”‚ └── context_calculator.py # GGUF context size computation

└── tests/

```

Key Patterns

1. Model Loading with Double-Checked Locking

```python

_model_load_lock = asyncio.Lock()

async def load_encoder(model_id: str, task: str = "embedding"):

cache_key = f"encoder:{task}:{model_id}"

if cache_key not in _models:

async with _model_load_lock:

# Double-check after acquiring lock

if cache_key not in _models:

model = EncoderModel(model_id, device, task=task)

await model.load()

_models[cache_key] = model

return _models.get(cache_key)

```

2. Device-Aware Tensor Operations

```python

class BaseModel(ABC):

def get_dtype(self, force_float32: bool = False):

if force_float32:

return torch.float32

if self.device in ("cuda", "mps"):

return torch.float16

return torch.float32

def to_device(self, tensor: torch.Tensor, dtype=None):

# Don't change dtype for integer tensors

if tensor.dtype in (torch.int32, torch.int64, torch.long):

return tensor.to(device=self.device)

dtype = dtype or self.get_dtype()

return tensor.to(device=self.device, dtype=dtype)

```

3. TTL-Based Model Caching

```python

_models: ModelCache[BaseModel] = ModelCache(ttl=300) # 5 min TTL

async def _cleanup_idle_models():

while True:

await asyncio.sleep(CLEANUP_CHECK_INTERVAL)

for cache_key, model in _models.pop_expired():

await model.unload()

```

4. Async Generation with Thread Pools

```python

# GGUF models use blocking llama-cpp, run in executor

self._executor = ThreadPoolExecutor(max_workers=1)

async def generate(self, messages, max_tokens=512, ...):

loop = asyncio.get_running_loop()

return await loop.run_in_executor(self._executor, self._generate_sync)

```

Review Priority

When reviewing Universal Runtime code:

  1. Critical - Security

- Path traversal prevention in file endpoints

- Input sanitization for model IDs

  1. High - Memory & Device

- Proper CUDA/MPS cache clearing on unload

- torch.no_grad() for inference

- Correct dtype for device

  1. Medium - Performance

- Model caching patterns

- Batch processing where applicable

- Streaming implementation

  1. Low - Code Style

- Consistent with patterns.md

- Proper type hints

More from this repository10

🎯
common-skills🎯Skill

Manages shared Python utilities for LlamaFarm, focusing on HuggingFace model handling, GGUF file management, and cross-service consistency.

🎯
rag-skills🎯Skill

Implements robust RAG document processing and retrieval using LlamaIndex, ChromaDB, and Celery for efficient, scalable AI document workflows.

🎯
generate-subsystem-skills🎯Skill

Generates specialized Claude Code skills for each subsystem, creating shared language and subsystem-specific checklists to optimize AI code generation across the monorepo.

🎯
electron-skills🎯Skill

Configures secure Electron desktop application architecture with isolated processes, type-safe IPC, and cross-platform packaging for LlamaFarm.

🎯
go-skills🎯Skill

Enforces Go best practices and idiomatic patterns for secure, maintainable LlamaFarm CLI development.

🎯
typescript-skills🎯Skill

Enforces strict TypeScript best practices for React and Electron frontend applications, ensuring type safety, immutability, and clean code patterns.

🎯
cli-skills🎯Skill

Provides comprehensive Go CLI development guidelines using Cobra, Bubbletea, and Lipgloss for creating robust, interactive command-line interfaces in LlamaFarm projects.

🎯
commit-push-pr🎯Skill

Automates git workflow by committing changes, pushing to GitHub, and opening a PR with intelligent checks and handling of edge cases.

🎯
server-skills🎯Skill

Provides server-side best practices and code review guidelines for FastAPI, Celery, and Pydantic frameworks in Python.

🎯
python-skills🎯Skill

Provides comprehensive Python best practices and code review guidelines for ensuring high-quality, secure, and maintainable code across LlamaFarm's Python components.