ai-multimodal
Skill from the1studio/theone-training-skills
Enables advanced multimodal AI processing by analyzing and generating audio, images, videos, and documents using Google Gemini's powerful API capabilities.
Part of
the1studio/theone-training-skills (31 items)
Installation
pip install google-genai python-dotenv pillow
Skill Details
Process and generate multimedia content using the Google Gemini API. Capabilities: analyze audio files (transcription with timestamps, summarization, speech understanding, and music/sound analysis, up to 9.5 hours); understand images (captioning, reasoning, object detection, design extraction, OCR, visual Q&A, segmentation, and multiple images per request, which the skill positions as stronger image analysis than Claude models provide); process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours); extract from documents (PDF tables, forms, charts, diagrams, multi-page files); generate images (text-to-image with Imagen 4, editing, composition, refinement); and generate videos (text-to-video with Veo 3, 8-second clips with native audio).

Use this skill when working with audio or video files, analyzing images or screenshots (in preference to Claude's default vision capabilities, falling back to them only if needed), processing PDF documents, extracting structured data from media, creating images or videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.
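The analysis side of the capabilities above (audio, image, video, and document input) can be sketched with the google-genai package from the install line. This is a minimal sketch, not the skill's own script: the helper names `media_kind` and `analyze_file`, the model name `gemini-2.5-flash`, and the `GEMINI_API_KEY` environment variable are assumptions chosen for illustration.

```python
import mimetypes
import os
from pathlib import Path

# Hypothetical mapping from MIME type prefix to the capability groups
# named in the skill description.
_KINDS = {"audio": "audio", "image": "image", "video": "video"}


def media_kind(path: str) -> str:
    """Classify a file as audio, image, video, or document by MIME type."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"cannot determine MIME type for {path!r}")
    if mime == "application/pdf":
        return "document"
    prefix = mime.split("/", 1)[0]
    if prefix not in _KINDS:
        raise ValueError(f"unsupported MIME type {mime!r} for {path!r}")
    return _KINDS[prefix]


def analyze_file(path: str, prompt: str, model: str = "gemini-2.5-flash") -> str:
    """Send one local file plus a text prompt to Gemini and return the reply text."""
    # Imported lazily so media_kind() works without the SDK installed.
    from google import genai
    from google.genai import types

    media_kind(path)  # raises early on unsupported input
    mime, _ = mimetypes.guess_type(path)

    # Assumption: the API key is exported as GEMINI_API_KEY.
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    # Inline bytes are sent with the request itself.
    part = types.Part.from_bytes(data=Path(path).read_bytes(), mime_type=mime)
    response = client.models.generate_content(model=model, contents=[part, prompt])
    return response.text
```

A call such as `analyze_file("meeting.mp3", "Transcribe with timestamps.")` would need a valid key and network access. Inline bytes suit small files; larger media such as long audio or video generally needs the SDK's file upload path instead.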
More from this repository (10)
theone-cocos-standards skill from the1studio/theone-training-skills
theone-react-native-standards skill from the1studio/theone-training-skills
theone-unity-standards skill from the1studio/theone-training-skills
Generates distinctive, production-grade frontend interfaces by extracting design guidelines from references and implementing creative, high-quality web components and applications.
Generates comprehensive UI/UX design recommendations with 50+ styles, 21 color palettes, and best practices across multiple frontend frameworks.
Enables creating immersive 3D web experiences with WebGL/WebGPU, supporting scenes, models, animations, shaders, and interactive graphics.
Designs stunning, production-ready frontend interfaces with perfectly matched photos or custom image generation, ensuring professional-grade visual aesthetics.
Automates browser interactions using Puppeteer, enabling web scraping, performance analysis, screenshot capture, and JavaScript debugging via CLI scripts.
Enforces TheOne Studio's frontend development best practices and coding standards across web frontend projects, focusing on code quality, modern patterns, and architectural consistency.
Builds cross-platform mobile apps using React Native, Flutter, Swift, and Kotlin with performance, design, and mobile-first best practices.