ai-multimodal
Skill from binhmuc/autobot-review
Processes and generates multimedia content using the Google Gemini API, enabling audio, image, video, and document analysis with multimodal models.
Part of binhmuc/autobot-review (29 items)
Installation
pip install google-genai python-dotenv pillow
Skill Details
Process and generate multimedia content using the Google Gemini API for better vision capabilities. Capabilities include:
Audio analysis: transcription with timestamps, summarization, speech understanding, and music/sound analysis, for files up to 9.5 hours long.
Image understanding: better image analysis than Claude models, captioning, reasoning, object detection, design extraction, OCR, visual Q&A, segmentation, and handling of multiple images.
Video processing: scene detection, Q&A, temporal analysis, and YouTube URLs, for videos up to 6 hours long.
Document extraction: PDF tables, forms, charts, diagrams, and multi-page documents.
Image generation: text-to-image with Imagen 4, editing, composition, and refinement.
Video generation: text-to-video with Veo 3, producing 8-second clips with native audio.
Use when working with audio/video files, analyzing images or screenshots (in place of Claude's default vision capabilities, falling back to them only if needed), processing PDF documents, extracting structured data from media, creating images/videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.
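As a rough illustration of the analysis capabilities described above, the following minimal sketch sends one local file plus a text prompt to a Gemini model through the google-genai package from the install line. It is not the skill's own implementation. The helper names media_kind and analyze, the GEMINI_API_KEY variable name in .env, and the default model "gemini-2.5-flash" are all assumptions made for this example.

```python
import mimetypes
import os
from pathlib import Path


def media_kind(path: str) -> str:
    """Map a file name to one of the capability groups listed above.

    Hypothetical helper: the grouping is an assumption for this sketch.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    if mime == "application/pdf":
        return "document"
    top_level = mime.split("/", 1)[0]
    return top_level if top_level in {"audio", "image", "video"} else "unknown"


def analyze(path: str, prompt: str, model: str = "gemini-2.5-flash") -> str:
    """Send one local file and a prompt to Gemini and return the text reply.

    The model name and the GEMINI_API_KEY variable are assumptions.
    """
    # Imported lazily so media_kind() works without the SDK installed.
    from dotenv import load_dotenv
    from google import genai
    from google.genai import types

    if media_kind(path) == "unknown":
        raise ValueError(f"Unsupported media type for {path!r}")

    load_dotenv()  # reads a local .env file into the environment, if present
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    mime, _ = mimetypes.guess_type(path)
    # Inline bytes are only suitable for small files.
    part = types.Part.from_bytes(data=Path(path).read_bytes(), mime_type=mime)
    response = client.models.generate_content(model=model, contents=[part, prompt])
    return response.text
```

A call such as analyze("screenshot.png", "Describe this UI") would return the model's text answer, assuming a valid key is configured. Large audio or video files would normally be uploaded through the SDK's Files API rather than sent inline; that path is left out here to keep the sketch short.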
More from this repository (10)
mobile-development skill from binhmuc/autobot-review
planning skill from binhmuc/autobot-review
payment-integration skill from binhmuc/autobot-review
Systematically researches technical solutions by gathering multi-source information, analyzing content, and validating findings to provide scalable, secure, and maintainable recommendations.
Automates browser tasks using Puppeteer, enabling web scraping, performance analysis, screenshots, and debugging with JSON output.
Crafts beautiful, accessible user interfaces using shadcn/ui components, Tailwind CSS utility styling, and canvas-based visual design systems.
Deploys and manages cloud infrastructure across Cloudflare, Docker, and Google Cloud Platform with comprehensive edge computing and containerization strategies.
Builds and deploys Shopify applications, extensions, and themes using GraphQL/REST APIs, Shopify CLI, and Liquid templating for comprehensive e-commerce platform customization.
Packages entire code repositories into single AI-friendly files with customizable filters, formats, and optimizations for LLM context.
Guides developers in selecting and mastering MongoDB and PostgreSQL databases for optimal data management and performance.