# megatron-memory-estimator

Skill from yzlnew/infra-skills

## Installation

`npx skills add https://github.com/yzlnew/infra-skills --skill megatron-memory-estimator`

## Skill Details
# AI Infrastructure Agent Skills
> ⚠️ WARNING
> This project is under active development and heavily generated by LLMs without strict proofreading. Use with caution and verify all code before production use.
A collection of specialized agent skills for AI infrastructure development, enabling Claude Code to write, optimize, and debug high-performance systems.
## Overview
This repository provides expert-level skills for AI infrastructure engineering tasks. Each skill packages domain knowledge, code examples, and best practices to transform Claude into a specialized developer for specific frameworks and tools.
## Construction Methodology (Unless Otherwise Specified)
- Knowledge Gathering: Use Gemini DeepResearch to collect comprehensive, up-to-date information on target frameworks
- Skill Development: Transform research into structured skills using `skill-creator` in Claude Code
- Validation: Test skill-generated code examples to ensure correctness
- Maintenance: Regular updates based on latest official documentation
## Available Skills
### TileLang Developer
Write high-performance GPU kernels using TileLang for NVIDIA, AMD, and Ascend hardware.
Capabilities:
- Matrix multiplication (GEMM) kernels
- FlashAttention implementations
- DeepSeek MLA operators
- Performance optimization (swizzle layouts, pipelining, warp specialization)
- Cross-platform kernel development
Status: ✅ Complete
### Megatron Memory Estimator
Estimate GPU memory usage for Megatron-based MoE and dense models. Built upon [megatron_memory_estimator](https://huggingface.co/spaces/ISEEKYAN/megatron_memory_estimator).
Capabilities:
- Estimate memory from HuggingFace configs
- Support for MoE models (DeepSeek-V3, Qwen, etc.)
- Parallelism strategy comparison (TP/PP/EP/CP); see the rough memory-arithmetic sketch below
- Memory optimization recommendations
Status: ✅ Complete
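As a rough illustration of the arithmetic such an estimator performs (this is not the tool's actual API), the sketch below counts parameters for a dense GPT-style model and converts them to per-GPU static memory under TP/PP sharding with mixed-precision Adam. The function names, the 18-bytes-per-parameter assumption, and the example shape are all illustrative; activations, MoE experts, CP, and distributed-optimizer sharding are ignored, so use the linked estimator for real planning.

```python
# Illustrative back-of-envelope sketch only; not the estimator's code.

def dense_param_count(hidden: int, layers: int, vocab: int, ffn_mult: float = 4.0) -> float:
    """Rough parameter count: attention (~4*h^2) + MLP (~2*ffn_mult*h^2) per layer, plus embeddings."""
    per_layer = 4 * hidden * hidden + 2 * ffn_mult * hidden * hidden
    return layers * per_layer + vocab * hidden

def static_mem_gib(params: float, tp: int, pp: int, bytes_per_param: float = 18.0) -> float:
    """Per-GPU static memory (weights + grads + Adam states) in GiB.

    ~18 bytes/param assumes BF16 weights (2) + FP32 grads (4) + FP32 master
    weights and Adam moments (4 + 4 + 4); TP and PP shard parameters roughly
    evenly across tp * pp model-parallel ranks.
    """
    return params / (tp * pp) * bytes_per_param / 2**30

if __name__ == "__main__":
    # Hypothetical ~7B-parameter dense configuration.
    p = dense_param_count(hidden=4096, layers=32, vocab=32_000)
    print(f"~{p / 1e9:.1f}B params")
    for tp, pp in [(1, 1), (2, 1), (4, 1), (4, 2)]:  # compare a few layouts
        print(f"TP={tp} PP={pp}: ~{static_mem_gib(p, tp, pp):.1f} GiB static per GPU")
```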
### SLIME User
Guide for using SLIME (LLM post-training framework for RL Scaling). Built upon [THUDM/slime](https://github.com/THUDM/slime).
Capabilities:
- RL training setup and configuration (GRPO, GSPO, PPO, Reinforce++); a minimal GRPO sketch follows this entry
- Multi-turn tool calling and agent workflows
- Custom reward models and generation functions
- Megatron and FSDP backend configuration
- SGLang integration and optimization
- Dynamic sampling and partial rollout
- Multi-node distributed training
Status: ✅ Complete
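As background for the GRPO bullet above, here is a minimal, framework-agnostic sketch of the group-relative advantage computation that GRPO-style methods use: several completions are sampled per prompt, and each completion's reward is z-scored against its own group. This illustrates the idea only and is not SLIME's implementation or API.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: z-score each reward within its prompt group.

    group_rewards has shape (num_prompts, samples_per_prompt); the result has
    the same shape and weights the policy-gradient loss of each completion.
    """
    mean = group_rewards.mean(axis=1, keepdims=True)
    std = group_rewards.std(axis=1, keepdims=True)
    return (group_rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.2, 0.4, 0.9, 0.1]])
print(grpo_advantages(rewards))
```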
Prompt to create this skill, with Sonnet 4.5:
```
Use skill-creator to create a skill called slime-user at this repo. slime is an LLM
post-training framework for RL Scaling. Its repo is https://github.com/THUDM/slime.
Skill creation procedure:
- Git clone the latest repo
- Analyze docs/en, understand basic structure and write a doc navigation guide for user
getting started or finding docs for advanced usage
- Gather valuable examples from the docs and examples dir, write key ideas and script
path down for quick reference
- Checkout some important source code, for example slime/slime/utils/arguments.py and
slime/rollout/sglang_rollout.py, provide its path and functions for a quick find.
```
### TikZ Flowchart
Create professional flowcharts and architecture diagrams using LaTeX TikZ with standardized styles.
Capabilities:
- Professional flowcharts with Google Material-like color palette
- Standardized node types (data, memory, operation, kernel boxes)
- Architecture diagrams and process flows
- Grouping and layout best practices
- Clean orthogonal edges and relative positioning
Status: ✅ Complete
## Planned Skills
### SGLang Developer
Development skill for SGLang (Structured Generation Language) runtime and optimization.
Planned capabilities:
- SGLang runtime configuration
- Custom sampling strategies
- Performance tuning for LLM inference
- Multi-GPU serving optimization
Status: 🚧 Planned
### vLLM Developer
Skill for vLLM engine development and deployment.
Planned capabilities:
- PagedAttention implementation
- Custom scheduler development
- Multi-LoRA