gemini-3-multimodal

Skill from adaptationio/skrillz (191 items)

Installation

Add the marketplace to Claude Code:

/plugin marketplace add adaptationio/Skrillz

Install the plugin from the marketplace:

/plugin install skrillz@adaptationio-Skrillz

Enable the plugin in Claude Code:

/plugin enable skrillz@adaptationio-Skrillz

To use a local copy instead, add the marketplace by path:

/plugin marketplace add /path/to/skrillz

Then install the plugin from the local marketplace:

/plugin install skrillz@local


Last updated: Jan 16, 2026


Process multimodal inputs (images, video, audio, PDFs) with Gemini 3 Pro. Covers image understanding, video analysis, audio processing, document extraction, media resolution control, OCR, and token optimization. Use when analyzing images, processing video, transcribing audio, extracting PDF content, or working with multimodal data.


# Gemini 3 Pro Multimodal Input Processing

Comprehensive guide for processing multimodal inputs with Gemini 3 Pro, including image understanding, video analysis, audio processing, and PDF document extraction. This skill focuses on INPUT processing (analyzing media) - see gemini-3-image-generation for OUTPUT (generating images).

Overview

Gemini 3 Pro provides native multimodal capabilities for understanding and analyzing various media types. This skill covers all input processing operations with granular control over quality, performance, and token consumption.

Key Capabilities

Image Understanding: Object detection, OCR, visual Q&A, code from screenshots
Video Processing: Up to 1 hour of video, frame analysis, OCR
Audio Processing: Up to 9.5 hours of audio, speech understanding
PDF Documents: Native PDF support, multi-page analysis, text extraction
Media Resolution Control: Low/medium/high resolution for token optimization
Token Optimization: Granular control over processing costs

When to Use This Skill

Analyzing images, photos, or screenshots
Processing video content for insights
Transcribing or understanding audio/speech
Extracting information from PDF documents
Building multimodal applications
Optimizing media processing costs

---

Quick Start

Prerequisites

Gemini API setup (see gemini-3-pro-api skill)
Media files in supported formats

Python Quick Start

```python

import google.generativeai as genai

from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload and analyze image

image_file = genai.upload_file(Path("photo.jpg"))

response = model.generate_content([

"What's in this image?",

image_file

])

print(response.text)

```

Node.js Quick Start

```typescript

import { GoogleGenerativeAI } from "@google/generative-ai";

import { GoogleAIFileManager } from "@google/generative-ai/server";

import fs from "fs";

const genAI = new GoogleGenerativeAI("YOUR_API_KEY");

const fileManager = new GoogleAIFileManager("YOUR_API_KEY");

// Upload and analyze image

const uploadResult = await fileManager.uploadFile("photo.jpg", {

mimeType: "image/jpeg"

});

const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });

const result = await model.generateContent([

"What's in this image?",

{ fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }

]);

console.log(result.response.text());

```

---

Core Tasks

Task 1: Analyze Image Content

Goal: Extract information, objects, text, or insights from images.

Use Cases:

Object detection and recognition
OCR (text extraction from images)
Visual Q&A
Code generation from UI screenshots
Chart/diagram analysis
Product identification

Python Example:

```python

import google.generativeai as genai

from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

# Configure model with high resolution for best quality

model = genai.GenerativeModel(

"gemini-3-pro-preview",

generation_config={

"thinking_level": "high",

"media_resolution": "high" # 1,120 tokens per image

}

)

# Upload image

image_path = Path("screenshot.png")

image_file = genai.upload_file(image_path)

# Analyze with specific prompt

response = model.generate_content([

"""Analyze this image and provide:

1. Main objects and their locations

2. Any visible text (OCR)

3. Overall context and purpose

4. If code/UI: describe the functionality

""",

image_file

])

print(response.text)

# Check token usage

print(f"Tokens used: {response.usage_metadata.total_token_count}")

```

Node.js Example:

```typescript

import { GoogleGenerativeAI } from "@google/generative-ai";

import { GoogleAIFileManager } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload image

const uploadResult = await fileManager.uploadFile("screenshot.png", {

mimeType: "image/png"

});

// Configure model with high resolution

const model = genAI.getGenerativeModel({

model: "gemini-3-pro-preview",

generationConfig: {

thinking_level: "high",

media_resolution: "high" // Best quality for OCR

}

});

const result = await model.generateContent([

`Analyze this image and provide:

1. Main objects and their locations

2. Any visible text (OCR)

3. Overall context and purpose`,

{ fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }

]);

console.log(result.response.text());

```

Resolution Options:

| Resolution | Tokens per Image | Best For |
|------------|------------------|----------|
| low | 280 tokens | Quick analysis, low detail |
| medium | 560 tokens | Balanced quality/cost |
| high | 1,120 tokens | OCR, fine details, small text |

Supported Formats: JPEG, PNG, WEBP, HEIC, HEIF
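The per-image figures in the table make budgeting a matter of multiplication. The following is a minimal sketch, assuming the token counts above are flat per-image costs; `IMAGE_TOKENS` and `estimate_image_tokens` are illustrative names and not part of any SDK.

```python
# Per-image token costs, copied from the resolution table above.
IMAGE_TOKENS = {"low": 280, "medium": 560, "high": 1120}


def estimate_image_tokens(image_count: int, resolution: str = "medium") -> int:
    """Estimate input tokens for a batch of images at one resolution."""
    if resolution not in IMAGE_TOKENS:
        raise ValueError(f"Unknown resolution: {resolution!r}")
    if image_count < 0:
        raise ValueError("image_count must be non-negative")
    return image_count * IMAGE_TOKENS[resolution]


if __name__ == "__main__":
    # Compare the cost of the same 10-image batch at each level.
    for level in IMAGE_TOKENS:
        print(f"10 images at {level}: {estimate_image_tokens(10, level)} tokens")
```

The estimate covers only the media tokens; the prompt text and the model's output add to the total reported in `usage_metadata`.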

See: references/image-understanding.md for advanced patterns

---

Task 2: Process Video Content

Goal: Analyze video content, extract insights, perform frame-by-frame analysis.

Use Cases:

Video summarization
Object tracking
Scene detection
Video OCR
Content moderation
Educational video analysis

Python Example:

```python

import time
from pathlib import Path

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Configure for video processing
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "medium"  # 70 tokens/frame (balanced)
    }
)

# Upload video (up to 1 hour supported)
video_path = Path("tutorial.mp4")
video_file = genai.upload_file(video_path)

# Poll until the upload leaves the PROCESSING state
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")

# Analyze video
response = model.generate_content([
    """Analyze this video and provide:
1. Overall summary of content
2. Key scenes and timestamps
3. Main topics covered
4. Any visible text throughout the video
""",
    video_file
])

print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")

```

Node.js Example:

```typescript

import { GoogleGenerativeAI } from "@google/generative-ai";

import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload video

const uploadResult = await fileManager.uploadFile("tutorial.mp4", {

mimeType: "video/mp4"

});

// Wait for processing

let file = await fileManager.getFile(uploadResult.file.name);

while (file.state === FileState.PROCESSING) {

await new Promise(resolve => setTimeout(resolve, 5000));

file = await fileManager.getFile(uploadResult.file.name);

}

if (file.state === FileState.FAILED) {

throw new Error("Video processing failed");

}

// Analyze video

const model = genAI.getGenerativeModel({

model: "gemini-3-pro-preview",

generationConfig: {

media_resolution: "medium"

}

});

const result = await model.generateContent([

`Analyze this video and provide:

1. Overall summary

2. Key scenes and timestamps

3. Main topics covered`,

{ fileData: { fileUri: file.uri, mimeType: file.mimeType } }

]);

console.log(result.response.text());

```

Video Specs:

Max Duration: 1 hour
Formats: MP4, MOV, AVI, etc.
Resolution Options: Low (70 tokens/frame), Medium (70 tokens/frame), High (280 tokens/frame)
OCR: Available with high resolution
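The per-frame figures above translate into a rough token budget for a whole video. This is a sketch under stated assumptions: the sampling rate of one frame per second is an assumption (it is not stated in this guide), audio-track tokens are ignored, and `estimate_video_tokens` is a hypothetical helper name.

```python
import math

# Per-frame token costs from the Video Specs list above.
VIDEO_TOKENS_PER_FRAME = {"low": 70, "medium": 70, "high": 280}
MAX_VIDEO_SECONDS = 3600  # 1 hour limit stated in this guide


def estimate_video_tokens(
    duration_seconds: float,
    resolution: str = "medium",
    frames_per_second: float = 1.0,  # ASSUMPTION: sampling rate, not from this guide
) -> int:
    """Estimate frame tokens for a video; ignores any audio-track tokens."""
    if resolution not in VIDEO_TOKENS_PER_FRAME:
        raise ValueError(f"Unknown resolution: {resolution!r}")
    if not 0 <= duration_seconds <= MAX_VIDEO_SECONDS:
        raise ValueError("duration must be between 0 and 3600 seconds")
    frames = math.ceil(duration_seconds * frames_per_second)
    return frames * VIDEO_TOKENS_PER_FRAME[resolution]


if __name__ == "__main__":
    for level in ("medium", "high"):
        print(f"1 hour at {level}: {estimate_video_tokens(3600, level)} tokens")
```

Under these assumptions, a full hour at high resolution costs four times as many frame tokens as at low or medium, which is why high is best reserved for video OCR.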

See: references/video-processing.md for advanced patterns

---

Task 3: Process Audio/Speech

Goal: Transcribe and understand audio content, process speech.

Use Cases:

Audio transcription
Speech analysis
Podcast summarization
Meeting notes
Language understanding
Audio classification

Python Example:

```python

import time
from pathlib import Path

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload audio file (up to 9.5 hours supported)
audio_path = Path("podcast.mp3")
audio_file = genai.upload_file(audio_path)

# Poll until the upload leaves the PROCESSING state
while audio_file.state.name == "PROCESSING":
    time.sleep(5)
    audio_file = genai.get_file(audio_file.name)

if audio_file.state.name == "FAILED":
    raise ValueError("Audio processing failed")

# Process audio
response = model.generate_content([
    """Process this audio and provide:
1. Full transcription
2. Summary of main points
3. Key speakers (if multiple)
4. Important timestamps
5. Action items or conclusions
""",
    audio_file
])

print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")

```

Node.js Example:

```typescript

import { GoogleGenerativeAI } from "@google/generative-ai";

import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload audio

const uploadResult = await fileManager.uploadFile("podcast.mp3", {

mimeType: "audio/mp3"

});

// Wait for processing

let file = await fileManager.getFile(uploadResult.file.name);

while (file.state === FileState.PROCESSING) {

await new Promise(resolve => setTimeout(resolve, 5000));

file = await fileManager.getFile(uploadResult.file.name);

}

const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });

const result = await model.generateContent([

`Process this audio and provide:

1. Full transcription

2. Summary of main points

3. Key timestamps`,

{ fileData: { fileUri: file.uri, mimeType: file.mimeType } }

]);

console.log(result.response.text());

```

Audio Specs:

Max Duration: 9.5 hours
Formats: WAV, MP3, FLAC, AAC, etc.
Languages: Supports multiple languages
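The duration limits in this guide (1 hour of video, 9.5 hours of audio) can be checked before uploading. The sketch below is one possible approach, not something the skill prescribes: `fits_limit` and `plan_segments` are hypothetical helpers, and splitting an over-length recording into consecutive segments is an assumed workaround. Actually cutting the file would need an external tool.

```python
# Duration limits stated in this guide, in seconds.
MAX_DURATION_SECONDS = {"video": 3600, "audio": 34200}  # 1 hour, 9.5 hours


def fits_limit(media_type: str, duration_seconds: float) -> bool:
    """Return True if the media fits in a single request."""
    return 0 <= duration_seconds <= MAX_DURATION_SECONDS[media_type]


def plan_segments(duration_seconds: float, limit_seconds: float) -> list:
    """Split a duration into consecutive (start, end) spans of at most limit_seconds."""
    if duration_seconds <= 0 or limit_seconds <= 0:
        raise ValueError("durations must be positive")
    segments = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + limit_seconds, duration_seconds)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    ten_hours = 10 * 3600
    print(fits_limit("audio", ten_hours))
    print(plan_segments(ten_hours, MAX_DURATION_SECONDS["audio"]))
```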

See: references/audio-processing.md for advanced patterns

---

Task 4: Process PDF Documents

Goal: Extract and analyze content from PDF documents.

Use Cases:

Document analysis
Information extraction
Form processing
Research paper analysis
Contract review
Multi-page document understanding

Python Example:

```python

import time
from pathlib import Path

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Configure with medium resolution (recommended for PDFs)
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "medium"  # 560 tokens/page (saturation point)
    }
)

# Upload PDF
pdf_path = Path("research_paper.pdf")
pdf_file = genai.upload_file(pdf_path)

# Poll until the upload leaves the PROCESSING state
while pdf_file.state.name == "PROCESSING":
    time.sleep(5)
    pdf_file = genai.get_file(pdf_file.name)

if pdf_file.state.name == "FAILED":
    raise ValueError("PDF processing failed")

# Analyze PDF
response = model.generate_content([
    """Analyze this PDF document and provide:
1. Document type and purpose
2. Main sections and structure
3. Key findings or arguments
4. Important data or statistics
5. Conclusions or recommendations
""",
    pdf_file
])

print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")

```

Node.js Example:

```typescript

import { GoogleGenerativeAI } from "@google/generative-ai";

import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload PDF

const uploadResult = await fileManager.uploadFile("research_paper.pdf", {

mimeType: "application/pdf"

});

// Wait for processing

let file = await fileManager.getFile(uploadResult.file.name);

while (file.state === FileState.PROCESSING) {

await new Promise(resolve => setTimeout(resolve, 5000));

file = await fileManager.getFile(uploadResult.file.name);

}

// Analyze with medium resolution (recommended)

const model = genAI.getGenerativeModel({

model: "gemini-3-pro-preview",

generationConfig: {

media_resolution: "medium"

}

});

const result = await model.generateContent([

`Analyze this PDF and extract:

1. Main sections

2. Key findings

3. Important data`,

{ fileData: { fileUri: file.uri, mimeType: file.mimeType } }

]);

console.log(result.response.text());

```

PDF Processing Tips:

Recommended Resolution: medium (560 tokens/page) - saturation point for quality
Multi-page: Automatically processes all pages
Native Support: No conversion to images needed
Text Extraction: High-quality text extraction built-in
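Since the recommended medium setting costs 560 tokens per page, the input cost of a document can be estimated from its page count. A minimal sketch, assuming the $2.00 per million input tokens used in this guide's cost examples; `estimate_pdf_input` is an illustrative name, and prompt and output tokens are not included.

```python
TOKENS_PER_PAGE_MEDIUM = 560    # from the PDF tips above
INPUT_PRICE_PER_MILLION = 2.00  # USD, the input price used in this guide's cost examples


def estimate_pdf_input(page_count: int) -> tuple:
    """Return (tokens, cost_usd) for a PDF processed at medium resolution."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    tokens = page_count * TOKENS_PER_PAGE_MEDIUM
    cost = tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    return tokens, cost


if __name__ == "__main__":
    tokens, cost = estimate_pdf_input(100)
    print(f"100 pages: {tokens} tokens, about ${cost:.3f} of input")
```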

See: references/document-processing.md for advanced patterns

---

Task 5: Optimize Media Processing Costs

Goal: Balance quality and token consumption based on use case.

Strategy:

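| Media Type | Resolution | Tokens | Best For |
|------------|------------|--------|----------|
| Images | low | 280 | Quick scan, thumbnails |
| Images | medium | 560 | Balanced quality/cost |
| Images | high | 1,120 | OCR, fine details, small text |
| PDFs | medium | 560 per page | Recommended; saturation point for quality |
| Video | low / medium | 70 per frame | General analysis |
| Video | high | 280 per frame | Video OCR |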

Python Optimization Example:

```python

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")


# Different resolutions for different use cases
def analyze_image_optimized(image_path, need_ocr=False):
    """Analyze an image at a resolution matched to the task."""
    # High resolution only when text must be read; medium otherwise
    resolution = "high" if need_ocr else "medium"

    model = genai.GenerativeModel(
        "gemini-3-pro-preview",
        generation_config={
            "media_resolution": resolution
        }
    )

    image_file = genai.upload_file(image_path)
    response = model.generate_content([
        "Extract all text from this image" if need_ocr else "Describe this image",
        image_file
    ])

    # Log input token usage for cost tracking (output tokens are priced separately)
    tokens = response.usage_metadata.prompt_token_count
    cost = (tokens / 1_000_000) * 2.00  # Input pricing
    print(f"Resolution: {resolution}, Input tokens: {tokens}, Input cost: ${cost:.6f}")

    return response.text


# Use appropriate resolution
analyze_image_optimized("photo.jpg", need_ocr=False)    # medium
analyze_image_optimized("document.png", need_ocr=True)  # high

```

Per-Item Resolution Control:

```python

# Set different resolutions for different media in same request

response = model.generate_content([

"Compare these images",

{"file": image1, "media_resolution": "high"}, # High detail

{"file": image2, "media_resolution": "low"}, # Low detail OK

])

```

Cost Monitoring:

```python

def log_media_costs(response):
    """Log media processing costs"""
    usage = response.usage_metadata

    # Pricing for ≤200k context
    input_cost = (usage.prompt_token_count / 1_000_000) * 2.00
    output_cost = (usage.candidates_token_count / 1_000_000) * 12.00

    print(f"Input tokens: {usage.prompt_token_count} (${input_cost:.6f})")
    print(f"Output tokens: {usage.candidates_token_count} (${output_cost:.6f})")
    print(f"Total cost: ${input_cost + output_cost:.6f}")

```

See: references/token-optimization.md for comprehensive strategies

---

Media Resolution Control

Resolution Options

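| Resolution | Images (tokens each) | Video (tokens per frame) | Typical Use |
|------------|----------------------|--------------------------|-------------|
| low | 280 | 70 | Quick analysis, low detail |
| medium | 560 | 70 | Balanced quality/cost |
| high | 1,120 | 280 | OCR, fine details, small text |

PDFs at medium resolution cost 560 tokens per page.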

Configuration

Global Setting (all media):

```python

model = genai.GenerativeModel(

"gemini-3-pro-preview",

generation_config={

"media_resolution": "high" # Applies to all media

}

)

```

Per-Item Setting (mixed resolutions):

```python

response = model.generate_content([

"Analyze these files",

{"file": high_detail_image, "media_resolution": "high"},

{"file": low_detail_image, "media_resolution": "low"}

])

```

Best Practices

Images: Use high for OCR/text extraction, medium for general analysis
PDFs: Use medium (saturation point - higher resolutions show diminishing returns)
Video: Use low or medium unless OCR needed
Cost Control: Start with low, increase only if quality insufficient
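The best practices above can be encoded as a small lookup plus an escalation step. This is a sketch of one way to express them; `choose_resolution` and `next_resolution` are hypothetical helper names, and the returned strings are the `media_resolution` values used throughout this guide.

```python
from typing import Optional

RESOLUTION_ORDER = ["low", "medium", "high"]


def choose_resolution(media_type: str, needs_ocr: bool = False) -> str:
    """Pick a starting resolution following the best practices above."""
    if media_type == "image":
        return "high" if needs_ocr else "medium"
    if media_type == "pdf":
        return "medium"  # saturation point; higher shows diminishing returns
    if media_type == "video":
        return "high" if needs_ocr else "low"
    raise ValueError(f"Unsupported media type: {media_type!r}")


def next_resolution(current: str) -> Optional[str]:
    """Return the next level up, or None when already at the highest level."""
    index = RESOLUTION_ORDER.index(current)
    if index + 1 < len(RESOLUTION_ORDER):
        return RESOLUTION_ORDER[index + 1]
    return None


if __name__ == "__main__":
    print(choose_resolution("image", needs_ocr=True))
    print(next_resolution("low"))
```

`next_resolution` supports the "start with low, increase only if quality insufficient" rule: rerun a request at the next level only when the first answer misses detail.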

See: references/media-resolution.md for detailed guide

---

File Management

Upload Files

```python

import time

import google.generativeai as genai

# Upload file
file = genai.upload_file("path/to/file.jpg")
print(f"Uploaded: {file.name}")

# Check processing status
while file.state.name == "PROCESSING":
    time.sleep(5)
    file = genai.get_file(file.name)

print(f"Status: {file.state.name}")

```

List Uploaded Files

```python

# List all files
for file in genai.list_files():
    print(f"{file.name} - {file.display_name}")

```

Delete Files

```python

# Delete specific file
genai.delete_file(file.name)

# Delete all files
for file in genai.list_files():
    genai.delete_file(file.name)
    print(f"Deleted: {file.name}")

```

File Lifecycle

Upload: Immediate
Processing: Async (especially for video/audio)
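Storage: Files persist until deleted or expired
Expiration: Uploaded files are removed automatically after a retention period (48 hours in the Gemini Files API documentation; check the current docs)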

---

Multi-File Processing

Process Multiple Images

```python

# Upload multiple images

images = [

genai.upload_file("photo1.jpg"),

genai.upload_file("photo2.jpg"),

genai.upload_file("photo3.jpg")

]

# Analyze together

response = model.generate_content([

"Compare these images and identify common elements",

*images

])

print(response.text)

```

Mixed Media Types

```python

# Combine different media types

image = genai.upload_file("chart.png")

pdf = genai.upload_file("report.pdf")

response = model.generate_content([

"Does the chart match the data in the report?",

image,

pdf

])

```
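When a request mixes media types, it helps to sort local files by type before uploading so each group can get a suitable resolution. A minimal sketch using only the formats this guide lists; the extension table and the helper names `media_type_for` and `group_by_media_type` are illustrative, and real MIME detection may need more than the file extension.

```python
from collections import defaultdict
from pathlib import Path

# Extensions for the formats listed in this guide (not an exhaustive list).
EXTENSION_TYPES = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".webp": "image", ".heic": "image", ".heif": "image",
    ".mp4": "video", ".mov": "video", ".avi": "video",
    ".wav": "audio", ".mp3": "audio", ".flac": "audio", ".aac": "audio",
    ".pdf": "pdf",
}


def media_type_for(path: str) -> str:
    """Classify a file as image, video, audio, or pdf by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TYPES:
        raise ValueError(f"Unsupported file type: {path!r}")
    return EXTENSION_TYPES[suffix]


def group_by_media_type(paths: list) -> dict:
    """Group file paths by media type, preserving input order within each group."""
    groups = defaultdict(list)
    for path in paths:
        groups[media_type_for(path)].append(path)
    return dict(groups)


if __name__ == "__main__":
    print(group_by_media_type(["chart.png", "report.pdf", "photo1.jpg"]))
```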

---

References

Core Guides

[Image Understanding](references/image-understanding.md) - Complete image analysis patterns
[Video Processing](references/video-processing.md) - Video analysis and frame extraction
[Audio Processing](references/audio-processing.md) - Audio transcription and analysis
[Document Processing](references/document-processing.md) - PDF and document extraction

Optimization

[Media Resolution](references/media-resolution.md) - Resolution control and quality tuning
[OCR Extraction](references/ocr-extraction.md) - Text extraction best practices
[Token Optimization](references/token-optimization.md) - Cost control and efficiency

Scripts

[Analyze Image Script](scripts/analyze-image.py) - Production-ready image analysis
[Process Video Script](scripts/process-video.py) - Video processing automation
[Process Audio Script](scripts/process-audio.py) - Audio transcription
[Process PDF Script](scripts/process-pdf.py) - PDF extraction

Official Resources

[Vision Capabilities](https://ai.google.dev/gemini-api/docs/vision)
[Prompting with Media](https://ai.google.dev/gemini-api/docs/prompting_with_media)
[File API](https://ai.google.dev/gemini-api/docs/file-api)

---

Related Skills

gemini-3-pro-api - Basic setup, authentication, text generation
gemini-3-image-generation - Image OUTPUT (generating images)
gemini-3-advanced - Function calling, tools, caching, batch processing

---

Common Use Cases

Visual Q&A Application

Combine image understanding with chat:

```python

model = genai.GenerativeModel("gemini-3-pro-preview")

chat = model.start_chat()

# Upload image

image = genai.upload_file("product.jpg")

# Ask questions about it

response1 = chat.send_message(["What product is this?", image])

response2 = chat.send_message("What are its main features?")

response3 = chat.send_message("What's the price range for similar products?")

```

Document Analysis Pipeline

Process multiple PDFs and extract insights:

```python

import json
import time
from pathlib import Path

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={"media_resolution": "medium"}
)

# Process all PDFs in directory
pdf_dir = Path("documents/")
results = {}

for pdf_path in pdf_dir.glob("*.pdf"):
    pdf_file = genai.upload_file(pdf_path)

    # Wait for processing
    while pdf_file.state.name == "PROCESSING":
        time.sleep(5)
        pdf_file = genai.get_file(pdf_file.name)

    # Extract key information
    response = model.generate_content([
        "Extract: 1) Document type, 2) Key dates, 3) Important numbers, 4) Summary",
        pdf_file
    ])
    results[pdf_path.name] = response.text

    # Clean up the uploaded copy once its result is stored
    genai.delete_file(pdf_file.name)

# Save results
with open("analysis_results.json", "w") as f:
    json.dump(results, f, indent=2)

```

Video Content Moderation

Analyze video for specific content:

```python

import time

# Assumes genai is configured and model is created as in the earlier examples
video = genai.upload_file("user_upload.mp4")

# Wait for processing
while video.state.name == "PROCESSING":
    time.sleep(10)
    video = genai.get_file(video.name)

response = model.generate_content([
    """Analyze this video for:
1. Inappropriate content (yes/no)
2. Violence or harmful content (yes/no)
3. Overall content rating (G/PG/PG-13/R)
4. Brief justification
Provide structured response.
""",
    video
])

print(response.text)

```

---

Troubleshooting

Issue: File processing stuck at "PROCESSING"

Solution: Large files (especially video) can take time. Wait 30-60 seconds between checks. If stuck > 5 minutes, file may have failed.
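The polling loops in this guide run forever if a file never leaves PROCESSING. A sketch of a bounded wait, using the 5-minute cutoff suggested above as the default timeout; `wait_until_processed` is a hypothetical helper, and the state is read through a caller-supplied function so the loop itself does not depend on any SDK.

```python
import time
from typing import Callable


def wait_until_processed(
    get_state: Callable[[], str],
    timeout_seconds: float = 300.0,  # 5 minutes, per the guidance above
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state() until it stops returning "PROCESSING" or the timeout passes.

    Returns the final state name (for example "ACTIVE" or "FAILED").
    Raises TimeoutError if the file is still processing at the deadline.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state != "PROCESSING":
            return state
        if clock() >= deadline:
            raise TimeoutError(f"Still PROCESSING after {timeout_seconds} seconds")
        sleep(poll_interval)


if __name__ == "__main__":
    # Simulated file that finishes on the third check.
    states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
    print(wait_until_processed(lambda: next(states), sleep=lambda _: None))
```

With the SDK calls used elsewhere in this guide, the state function could be `lambda: genai.get_file(video_file.name).state.name`; the caller should still check whether the returned state is "FAILED".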

Issue: Low quality OCR results

Solution: Use media_resolution: "high" for images with text. Ensure image is clear and high resolution.

Issue: High token costs

Solution: Use appropriate media resolution. Start with low, increase only if needed. For PDFs, medium is usually sufficient.

Issue: Video analysis missing details

Solution: Use media_resolution: "high" for better frame analysis, or provide more specific prompts about what to look for.

Issue: Audio transcription inaccurate

Solution: Ensure audio quality is good (no excessive background noise). Provide context in prompt about accent, language, or domain.

---

Summary

This skill provides comprehensive multimodal input processing capabilities:

✅ Image analysis with OCR and object detection

✅ Video processing up to 1 hour

✅ Audio transcription up to 9.5 hours

✅ Native PDF document processing

✅ Granular media resolution control

✅ Token optimization strategies

✅ Multi-file processing

✅ Production-ready examples

Ready to analyze multimodal content? Start with the task that matches your use case above!
