speech-use

Skill from cnemri/google-genai-skills

What it does

Generates, transcribes, and clones voices using Google's GenAI and Cloud Speech SDKs with support for Gemini-TTS, Chirp 3, and custom voice models.



7 installs

Added Feb 4, 2026

Skill Details

SKILL.md

"Generate (TTS), Transcribe (STT), and Clone voices using Google's GenAI and Cloud Speech SDKs. Supports Gemini-TTS, Chirp 3, and Instant Custom Voice."

Overview

# Speech Use

Use this skill to perform Text-to-Speech (TTS), Speech-to-Text (STT), and Voice Cloning operations.

This skill uses portable Python scripts managed by uv.

Prerequisites

Environment Variables:

* GOOGLE_API_KEY (for TTS via Gemini)

* GOOGLE_CLOUD_PROJECT (Required for STT and Voice Cloning)

* GOOGLE_APPLICATION_CREDENTIALS (Recommended for STT/Voice Cloning)

APIs Enabled:

* Text-to-Speech API (texttospeech.googleapis.com)

* Speech-to-Text API (speech.googleapis.com)
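A quick pre-flight check can catch a missing variable before a script runs. The sketch below is not part of the skill; the task labels (`tts`, `stt`, `voice-cloning`) and the `check_env` helper are hypothetical names chosen for illustration, and only the variable names come from the list above.

```python
import os

# Variables named in the prerequisites; the task labels are hypothetical.
REQUIRED = {
    "tts": ["GOOGLE_API_KEY"],
    "stt": ["GOOGLE_CLOUD_PROJECT"],
    "voice-cloning": ["GOOGLE_CLOUD_PROJECT"],
}
RECOMMENDED = {
    "stt": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "voice-cloning": ["GOOGLE_APPLICATION_CREDENTIALS"],
}


def check_env(task, env=None):
    """Return (missing_required, missing_recommended) for one task."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED[task] if not env.get(name)]
    advised = [name for name in RECOMMENDED.get(task, []) if not env.get(name)]
    return missing, advised


if __name__ == "__main__":
    for task in REQUIRED:
        missing, advised = check_env(task)
        print(f"{task}: missing={missing} recommended-but-unset={advised}")
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.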

Usage

1. Generate Speech (TTS)

Generate audio from text using Gemini-TTS.

Standard Voice:

```bash
uv run skills/speech-use/scripts/generate_speech.py "Hello world, this is a test." --voice Puck --output hello.wav
```

Custom Voice (Cloned):

```bash
uv run skills/speech-use/scripts/generate_speech.py "This is my custom voice speaking." --voice-cloning-key "YOUR_KEY_HERE" --output custom.wav
```
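The internals of generate_speech.py are not shown here. As a rough illustration of the final step, the sketch below wraps raw audio bytes in a WAV container using only the standard library. It assumes the model returns 16-bit mono PCM at 24 kHz, as in Google's published Gemini TTS example; `write_wav` is a hypothetical helper, not a function from the skill.

```python
import wave


def write_wav(path, pcm, rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container.

    The defaults (24 kHz, mono, 16-bit) are an assumption about the
    audio returned by Gemini-TTS, not something this page confirms.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)


if __name__ == "__main__":
    # Half a second of silence stands in for real model output.
    write_wav("silence.wav", b"\x00\x00" * 12000)
```

If the sample rate or width passed here does not match the bytes actually returned, the file will still be written but will play at the wrong speed or as noise.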

2. Create Custom Voice (Voice Cloning)

Generate a voiceCloningKey from a reference audio file and a consent file.

Requirements:

* reference.wav: 10-30s of clear speech (the voice to clone).

* consent.wav: The speaker saying: "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model."

```bash
uv run skills/speech-use/scripts/create_custom_voice.py --reference-audio reference.wav --consent-audio consent.wav
```

Save the output key to use with generate_speech.py.
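The 10-30s requirement for reference.wav can be checked locally before running the script. This is a minimal sketch using the standard `wave` module, which reads uncompressed PCM WAV files only; `wav_seconds` and `reference_ok` are hypothetical helpers and are not part of create_custom_voice.py. It cannot verify that the speech is clear or that consent.wav contains the required statement.

```python
import wave

# Duration range stated in the requirements above.
MIN_SECONDS, MAX_SECONDS = 10.0, 30.0


def wav_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())


def reference_ok(path):
    """True if the file's duration falls inside the 10-30s range."""
    return MIN_SECONDS <= wav_seconds(path) <= MAX_SECONDS


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(f"{name}: {wav_seconds(name):.1f}s ok={reference_ok(name)}")
```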

3. Transcribe Audio (STT)

Transcribe audio files using Chirp 3.

```bash
uv run skills/speech-use/scripts/transcribe_audio.py audio.wav --language en-US --output transcript.txt
```
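How transcribe_audio.py talks to the API is not shown on this page. As a hedged illustration, the sketch below derives the values a Speech-to-Text V2 request typically needs from the documented options: a location-specific endpoint, the path of the default recognizer (`_`), and the model and language settings. The helper names and the returned dictionary layout are hypothetical.

```python
def speech_endpoint(location):
    """Host for a Speech-to-Text V2 location; 'global' has no prefix."""
    if location == "global":
        return "speech.googleapis.com"
    return f"{location}-speech.googleapis.com"


def recognizer_path(project, location, recognizer="_"):
    """Resource path; '_' selects the implicit default recognizer."""
    return f"projects/{project}/locations/{location}/recognizers/{recognizer}"


def request_settings(project, language="auto", model="chirp_3", location="us"):
    """Collect the settings implied by the CLI options (layout is illustrative)."""
    return {
        "api_endpoint": speech_endpoint(location),
        "recognizer": recognizer_path(project, location),
        "model": model,
        "language_codes": [language],
    }


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    print(request_settings("my-project", language="en-US"))
```

The default arguments mirror the defaults listed under Options; whether the script builds its request this way is an assumption.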

Options

generate_speech.py

* --voice: Prebuilt voice (e.g., Kore, Puck, Fenrir, Aoede).

* --voice-cloning-key: Key from create_custom_voice.py.

* --model: Default gemini-2.5-flash-preview-tts.
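When calling the script from other Python code, the options above can be assembled into an argument list for `subprocess.run`. This is a sketch only: `build_generate_command` is a hypothetical helper, it uses just the flags documented on this page, and it does not check whether --voice and --voice-cloning-key may be combined, since that is not stated.

```python
import subprocess

SCRIPT = "skills/speech-use/scripts/generate_speech.py"


def build_generate_command(text, output, voice=None, voice_cloning_key=None, model=None):
    """Build the uv command line for generate_speech.py as a list of arguments."""
    cmd = ["uv", "run", SCRIPT, text]
    if voice:
        cmd += ["--voice", voice]
    if voice_cloning_key:
        cmd += ["--voice-cloning-key", voice_cloning_key]
    if model:
        # Omitting --model leaves the script's own default in effect.
        cmd += ["--model", model]
    cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    cmd = build_generate_command("Hello world, this is a test.", "hello.wav", voice="Puck")
    # Passing a list avoids shell quoting problems with the text argument.
    subprocess.run(cmd, check=True)
```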

transcribe_audio.py

* --model: Default chirp_3.

* --language: Default auto.

* --location: Cloud region (default us).
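The documented defaults can be mirrored in a small `argparse` parser, which is one plausible way such a script would expose them. The real parser in transcribe_audio.py is not shown here, so `build_parser` is a hypothetical reconstruction; in particular, the behaviour when --output is omitted is unknown and is left as `None`.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed for transcribe_audio.py."""
    parser = argparse.ArgumentParser(prog="transcribe_audio.py")
    parser.add_argument("audio", help="path to the audio file to transcribe")
    parser.add_argument("--model", default="chirp_3")
    parser.add_argument("--language", default="auto")
    parser.add_argument("--location", default="us", help="Cloud region")
    # The script's behaviour without --output is not documented; None is a placeholder.
    parser.add_argument("--output", default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["audio.wav", "--language", "en-US"])
    print(args)
```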

More from this repository (8)

google-adk-python (Skill)

Provides expert guidance and Python code examples for building, configuring, and deploying intelligent agents using the Google Agent Development Kit (ADK).

google-genai-sdk-python (Skill)

Provides expert Python code guidance for leveraging Google's Gemini API with the official GenAI SDK, covering text, chat, multimodal, and generative AI tasks.

deep-research (Skill)

Autonomously conducts multi-step research by searching the web, analyzing files, and generating comprehensive, cited reports using Gemini.

veo-use (Skill)

Generates and edits videos using Google's Veo AI models with text, image, and reference-based inputs across multiple creative modes.

nano-banana-use (Skill)

Generates and edits images using Gemini's Nano Banana models.

veo-build (Skill)

Generates and edits videos using Google's Veo AI models, supporting text-to-video, image-to-video, and advanced video manipulation techniques.

nano-banana-build (Skill)

Generates and edits high-quality images using Gemini's Nano Banana models, supporting text-to-image, style transfer, and character consistency.

speech-build (Skill)

Generates speech audio from text using Google's text-to-speech technology, enabling easy audio conversion for various applications.