runtime-skills
Skill from llama-farm/llamafarm
Optimizes ML inference runtime with best practices for PyTorch, Transformers, and FastAPI, focusing on device management, model loading, and performance tuning.
Installation
npx skills add https://github.com/llama-farm/llamafarm --skill runtime-skills
Skill Details
Universal Runtime best practices for PyTorch inference, Transformers models, and FastAPI serving. Covers device management, model loading, memory optimization, and performance tuning.
Overview
# Universal Runtime Skills
Best practices and code review checklists for the Universal Runtime - LlamaFarm's local ML inference server.
## Overview
The Universal Runtime provides OpenAI-compatible endpoints for HuggingFace models:
- Text generation (Causal LMs: GPT, Llama, Mistral, Qwen)
- Text embeddings (BERT, sentence-transformers, ModernBERT)
- Classification, NER, and reranking
- OCR and document understanding
- Anomaly detection
Directory: runtimes/universal/
Python: 3.11+
Key Dependencies: PyTorch, Transformers, FastAPI, llama-cpp-python
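Because the endpoints are OpenAI-compatible, any standard client can talk to the runtime. A minimal request sketch follows; the host, port, and model name are illustrative placeholders, not defaults taken from the repo:
```python
import httpx

# Hypothetical call against a locally running Universal Runtime instance,
# assuming it serves the standard OpenAI-style /v1/chat/completions route.
resp = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",  # any supported HuggingFace model ID
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```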
## Links to Shared Skills
This skill extends the shared Python practices. Always apply these first:
| Topic | File | Priority |
|-------|------|----------|
| Patterns | [python-skills/patterns.md](../python-skills/patterns.md) | Medium |
| Async | [python-skills/async.md](../python-skills/async.md) | High |
| Typing | [python-skills/typing.md](../python-skills/typing.md) | Medium |
| Testing | [python-skills/testing.md](../python-skills/testing.md) | Medium |
| Errors | [python-skills/error-handling.md](../python-skills/error-handling.md) | High |
| Security | [python-skills/security.md](../python-skills/security.md) | Critical |
## Runtime-Specific Checklists
| Topic | File | Key Points |
|-------|------|------------|
| PyTorch | [pytorch.md](pytorch.md) | Device management, dtype, memory cleanup |
| Transformers | [transformers.md](transformers.md) | Model loading, tokenization, inference |
| FastAPI | [fastapi.md](fastapi.md) | API design, streaming, lifespan |
| Performance | [performance.md](performance.md) | Batching, caching, optimizations |
## Architecture
```
runtimes/universal/
├── server.py                  # FastAPI app, model caching, endpoints
├── core/
│   └── logging.py             # UniversalRuntimeLogger (structlog)
├── models/
│   ├── base.py                # BaseModel ABC with device management
│   ├── language_model.py      # Transformers text generation
│   ├── gguf_language_model.py # llama-cpp-python for GGUF
│   ├── encoder_model.py       # Embeddings, classification, NER, reranking
│   └── ...                    # OCR, anomaly, document models
├── routers/
│   └── chat_completions/      # Chat completions with streaming
├── utils/
│   ├── device.py              # Device detection (CUDA/MPS/CPU)
│   ├── model_cache.py         # TTL-based model caching
│   ├── model_format.py        # GGUF vs transformers detection
│   └── context_calculator.py  # GGUF context size computation
└── tests/
```
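The device helper referenced above can be as small as the following sketch, built from the usual PyTorch availability checks; the real `utils/device.py` may do more (e.g. respect environment overrides):
```python
import torch

def detect_device() -> str:
    # Prefer CUDA, then Apple Silicon's MPS backend, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```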
## Key Patterns
### 1. Model Loading with Double-Checked Locking
```python
import asyncio

# _models is the shared model cache (see pattern 3); `device` and
# EncoderModel come from the runtime's utils and models packages.
_model_load_lock = asyncio.Lock()

async def load_encoder(model_id: str, task: str = "embedding"):
    cache_key = f"encoder:{task}:{model_id}"
    if cache_key not in _models:
        async with _model_load_lock:
            # Double-check after acquiring the lock: another coroutine
            # may have finished loading while we waited.
            if cache_key not in _models:
                model = EncoderModel(model_id, device, task=task)
                await model.load()
                _models[cache_key] = model
    return _models.get(cache_key)
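The unlocked first check keeps cache hits lock-free on the hot path; the second check inside the lock prevents two coroutines that raced for the lock from loading the same model twice.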
### 2. Device-Aware Tensor Operations
```python
from abc import ABC

import torch

class BaseModel(ABC):
    def get_dtype(self, force_float32: bool = False) -> torch.dtype:
        if force_float32:
            return torch.float32
        if self.device in ("cuda", "mps"):
            return torch.float16
        return torch.float32

    def to_device(self, tensor: torch.Tensor, dtype: torch.dtype | None = None) -> torch.Tensor:
        # Don't change dtype for integer tensors (e.g. token IDs)
        if tensor.dtype in (torch.int32, torch.int64, torch.long):
            return tensor.to(device=self.device)
        dtype = dtype or self.get_dtype()
        return tensor.to(device=self.device, dtype=dtype)
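The integer guard matters because casting token IDs or attention masks to float16 would silently corrupt them; only floating-point tensors follow the device's preferred dtype, where half precision roughly halves memory use on CUDA and MPS.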
### 3. TTL-Based Model Caching
```python
_models: ModelCache[BaseModel] = ModelCache(ttl=300)  # 5 min TTL

async def _cleanup_idle_models():
    while True:
        await asyncio.sleep(CLEANUP_CHECK_INTERVAL)
        for cache_key, model in _models.pop_expired():
            await model.unload()
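For context, a minimal sketch of the `ModelCache` interface these snippets rely on (membership test, item assignment, `get`, and `pop_expired`) could look like the following; the actual `utils/model_cache.py` may differ in details such as refresh policy:
```python
import time
from typing import Generic, TypeVar

T = TypeVar("T")

class ModelCache(Generic[T]):
    """Minimal TTL-cache sketch; refresh and eviction behavior are assumptions."""

    def __init__(self, ttl: float) -> None:
        self._ttl = ttl
        self._entries: dict[str, tuple[T, float]] = {}  # key -> (model, last_used)

    def __contains__(self, key: str) -> bool:
        return key in self._entries

    def __setitem__(self, key: str, model: T) -> None:
        self._entries[key] = (model, time.monotonic())

    def get(self, key: str) -> T | None:
        entry = self._entries.get(key)
        if entry is None:
            return None
        model, _ = entry
        # Refresh the timestamp so actively used models stay cached.
        self._entries[key] = (model, time.monotonic())
        return model

    def pop_expired(self) -> list[tuple[str, T]]:
        # Remove and return entries idle for longer than the TTL.
        now = time.monotonic()
        expired = [k for k, (_, ts) in self._entries.items() if now - ts > self._ttl]
        return [(k, self._entries.pop(k)[0]) for k in expired]
```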
### 4. Async Generation with Thread Pools
```python
from concurrent.futures import ThreadPoolExecutor

# GGUF models wrap the blocking llama-cpp API, so generation runs in a
# dedicated executor (inside the GGUF model class):
self._executor = ThreadPoolExecutor(max_workers=1)

async def generate(self, messages, max_tokens=512, ...):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(self._executor, self._generate_sync)
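`max_workers=1` serializes generation per model instance: a single llama-cpp context is not safe for concurrent use, and the executor keeps the blocking call off the event loop so other requests stay responsive.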
## Review Priority
When reviewing Universal Runtime code:
- **Critical - Security**
  - Path traversal prevention in file endpoints
  - Input sanitization for model IDs
- **High - Memory & Device** (see the unload sketch after this list)
  - Proper CUDA/MPS cache clearing on unload
  - `torch.no_grad()` for inference
  - Correct dtype for device
- **Medium - Performance**
  - Model caching patterns
  - Batch processing where applicable
  - Streaming implementation
- **Low - Code Style**
  - Consistent with patterns.md
  - Proper type hints
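As a reference point for the memory checks above, here is a minimal unload sketch; the method name and attributes are assumptions for illustration, not the runtime's actual code:
```python
import gc

import torch

async def unload(self) -> None:
    # Drop references so the weights become garbage-collectable.
    self.model = None
    self.tokenizer = None
    gc.collect()
    # Return freed blocks to the device allocator.
    if self.device == "cuda":
        torch.cuda.empty_cache()
    elif self.device == "mps":
        torch.mps.empty_cache()
```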