ai-multimodal
Skill from duonglx/chanmayfoods
Processes and generates multimedia content using Google Gemini API, enabling advanced audio, image, video, and document analysis with AI-powered multimodal capabilities.
Part of
duonglx/chanmayfoods (32 items)
Installation
pip install google-genai python-dotenv pillow
Skill Details
Process and generate multimedia content using the Google Gemini API, which this skill prefers for vision tasks.
Capabilities:
Audio analysis: transcription with timestamps, summarization, speech understanding, and music/sound analysis for files up to 9.5 hours.
Image understanding: captioning, reasoning, object detection, design extraction, OCR, visual Q&A, segmentation, and handling of multiple images. The skill describes this as better image analysis than Claude models provide.
Video processing: scene detection, Q&A, temporal analysis, and YouTube URLs, for videos up to 6 hours.
Document extraction: PDF tables, forms, charts, diagrams, and multi-page documents.
Image generation: text-to-image with Imagen 4, plus editing, composition, and refinement.
Video generation: text-to-video with Veo 3, producing 8-second clips with native audio.
Use this skill when working with audio or video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images or videos from text prompts, or implementing multimodal AI features. For images and screenshots, use it instead of Claude's default vision capabilities, and fall back to Claude's vision only if needed.
Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.
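As a rough illustration of the analysis path described above, the following Python sketch sends one local media file and a text prompt to Gemini through the google-genai package from the install line. It is not taken from the skill itself. The helper names (guess_media_type, analyze_media), the GEMINI_API_KEY variable name, and the gemini-2.5-flash default are assumptions chosen for the example. It passes the file inline as bytes, which suits small files only; larger media would likely need an upload step instead.

```python
import mimetypes
import os
from pathlib import Path


def guess_media_type(path: str) -> str:
    """Infer a MIME type from the file extension (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"cannot infer MIME type for {path!r}")
    return mime


def analyze_media(path: str, prompt: str, model: str = "gemini-2.5-flash") -> str:
    """Send one local file plus a prompt to Gemini and return the text reply.

    Assumptions: the API key is stored in GEMINI_API_KEY (for example in a
    .env file), and the file is small enough to send inline as bytes.
    """
    # Third-party imports are kept local so the MIME helper above can be
    # used without these packages installed.
    from dotenv import load_dotenv
    from google import genai
    from google.genai import types

    load_dotenv()  # read variables from a local .env file, if one exists
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    # Wrap the raw bytes in a Part tagged with the inferred MIME type.
    media_part = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_media_type(path),
    )
    response = client.models.generate_content(
        model=model,
        contents=[media_part, prompt],
    )
    return response.text
```

A call such as analyze_media("screenshot.png", "Describe this UI.") would return the model's text answer, given a valid key and network access. The same function shape applies to short audio clips and PDFs, since only the MIME type changes.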
More from this repository (10)
Generates comprehensive UI/UX design recommendations with 50+ styles, 21 color palettes, font pairings, and best practices across multiple tech stacks.
Enables creating immersive 3D web experiences with WebGL/WebGPU, supporting scenes, models, animations, rendering, and advanced graphics techniques.
Enforces rigorous code review practices by systematically receiving feedback, requesting reviews, and implementing strict verification gates before claiming task completion.
Generates distinctive, production-grade frontend interfaces by extracting design guidelines from references and implementing creative, high-quality code with exceptional aesthetic attention.
No description is available for the "frontend-dev-guidelines" skill.
Systematically investigates and traces root causes of bugs, ensuring comprehensive validation and verification before implementing fixes.
Automates browser interactions, performance analysis, and web debugging using Puppeteer CLI scripts for comprehensive web testing and inspection.
Provides a customizable template for creating new Claude skills with structured guidance and best practices.
Crafts beautiful, accessible user interfaces using shadcn/ui components, Tailwind CSS utility styling, and canvas-based visual design systems.
Designs and implements robust, scalable backend systems using modern technologies, best practices, and secure architectural patterns.