elevenlabs

Skill from digitalsamba/claude-code-video-toolkit
What it does

Generates AI voiceovers and audio content using ElevenLabs APIs, supporting text-to-speech, voice cloning, and custom audio synthesis across multiple languages and styles.

Part of digitalsamba/claude-code-video-toolkit (6 items)

Installation

Python: run the voiceover tool
python tools/voiceover.py --scene-dir public/audio/scenes --json

Skill Details

SKILL.md

Generate AI voiceovers, sound effects, and music using ElevenLabs APIs. Use when creating audio content for videos, podcasts, or games. Triggers include generating voiceovers, narration, dialogue, sound effects from descriptions, background music, soundtrack generation, voice cloning, or any audio synthesis task.

Overview

# ElevenLabs Audio Generation

Requires ELEVENLABS_API_KEY in .env.
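
A note on the key: os.getenv reads the process environment, so the .env file has to be loaded first. A minimal sketch, assuming the python-dotenv package is installed:

```python
import os

from dotenv import load_dotenv

# Read .env from the working directory into the process environment
load_dotenv()

assert os.getenv("ELEVENLABS_API_KEY"), "ELEVENLABS_API_KEY is not set"
```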

Text-to-Speech

```python
from elevenlabs.client import ElevenLabs
from elevenlabs import save, VoiceSettings
import os

client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

audio = client.text_to_speech.convert(
    text="Welcome to my video!",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    voice_settings=VoiceSettings(
        stability=0.5,
        similarity_boost=0.75,
        style=0.5,
        speed=1.0,
    ),
)

save(audio, "voiceover.mp3")
```

Models

| Model | Quality | SSML Support | Notes |
|-------|---------|--------------|-------|
| eleven_multilingual_v2 | Highest consistency | None | Stable, production-ready, 29 languages |
| eleven_flash_v2_5 | Good | `<break>`, `<phoneme>` | Fast, supports pause/pronunciation tags |
| eleven_turbo_v2_5 | Good | `<break>`, `<phoneme>` | Fastest latency |
| eleven_v3 | Most expressive | None | Alpha: unreliable, needs prompt engineering |

Choose: multilingual_v2 for reliability, flash/turbo for SSML control, v3 for maximum expressiveness (expect retakes).

Voice Settings by Style

| Style | stability | similarity_boost | style | speed |
|-------|-----------|------------------|-------|-------|
| Natural/professional | 0.75-0.85 | 0.9 | 0.0-0.1 | 1.0 |
| Conversational | 0.5-0.6 | 0.85 | 0.3-0.4 | 0.9-1.0 |
| Energetic/YouTuber | 0.3-0.5 | 0.75 | 0.5-0.7 | 1.0-1.1 |
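
The rows map directly onto VoiceSettings; a minimal sketch of the Conversational row (the constant name and the mid-range values chosen are illustrative):

```python
from elevenlabs import VoiceSettings

# "Conversational" row from the table above, picking mid-range values
CONVERSATIONAL = VoiceSettings(
    stability=0.55,
    similarity_boost=0.85,
    style=0.35,
    speed=0.95,
)

# Then: client.text_to_speech.convert(..., voice_settings=CONVERSATIONAL)
```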

Pauses Between Sections

With flash/turbo models: Use SSML break tags inline:

```
...end of section. <break time="1.0s" /> Start of next...
```

Max 3 seconds per break. Excessive breaks can cause speed artifacts.

With multilingual_v2 / v3: No SSML support. Options:

  • Paragraph breaks (blank lines) create a natural ~0.3-0.5s pause
  • Post-process with ffmpeg: split the audio and insert silence

WARNING: ... (ellipsis) is NOT a reliable pause. It can be vocalized as a word or sound; do not use ellipsis as a pause mechanism.
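
For the paragraph-break option, the pause is encoded in the input text itself; a minimal sketch (the narration text is illustrative):

```python
# A blank line between paragraphs yields a ~0.3-0.5s pause on multilingual_v2
text = (
    "First section of the narration.\n"
    "\n"
    "Second section, after a natural pause."
)
```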

Pronunciation Control

Phonetic spelling (any model): Write words as you want them pronounced:

  • Janus → Jan-us
  • nginx → engine-x
  • Use dashes, capitals, apostrophes to guide pronunciation

SSML phoneme tags (flash/turbo only):

```
<phoneme alphabet="cmu-arpabet" ph="JH EY1 N AH0 S">Janus</phoneme>
```

Iterative Workflow

  1. Generate → listen → identify pronunciation/pacing issues
  2. Adjust: phonetic spellings, break tags, voice settings
  3. Regenerate. If pauses aren't precise enough, add silence in post with ffmpeg rather than fighting the TTS engine (see the sketch below).
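
A minimal post-processing sketch for step 3, assuming ffmpeg is on PATH (the file names and the one-second gap are illustrative):

```python
import subprocess

# Join two generated clips with 1s of silence between them:
# apad pads silence onto the end of the first clip, concat joins the streams.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "part1.mp3", "-i", "part2.mp3",
    "-filter_complex",
    "[0:a]apad=pad_dur=1[a0];[a0][1:a]concat=n=2:v=0:a=1[out]",
    "-map", "[out]",
    "joined.mp3",
], check=True)
```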

Voice Cloning

Instant Voice Clone

```python
# Create an instant voice clone from one or more recorded samples
with open("sample.mp3", "rb") as f:
    voice = client.voices.ivc.create(
        name="My Voice",
        files=[f],
        remove_background_noise=True,
    )

print(f"Voice ID: {voice.voice_id}")
```

  • Use client.voices.ivc.create() (not client.voices.clone())
  • Pass file handles in binary mode ("rb"), not paths
  • Convert m4a first: ffmpeg -i input.m4a -codec:a libmp3lame -qscale:a 2 output.mp3
  • Multiple samples (2-3 clips) improve accuracy
  • Save the voice ID for reuse (see the sketch below)
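
A minimal reuse sketch, continuing with the client from the Text-to-Speech section (the narration text is illustrative):

```python
from elevenlabs import save

# Reuse the clone by its ID in any text-to-speech call
audio = client.text_to_speech.convert(
    text="Narrated in the cloned voice.",
    voice_id=voice.voice_id,  # or a voice ID saved from an earlier run
    model_id="eleven_multilingual_v2",
)
save(audio, "cloned-voiceover.mp3")
```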

Professional Voice Clone: Requires Creator plan+, 30+ min audio. See [reference.md](reference.md).

Sound Effects

Max 22 seconds per generation.

```python
result = client.text_to_sound_effects.convert(
    text="Thunder rumbling followed by heavy rain",
    duration_seconds=10,
    prompt_influence=0.3,
)

# The result is a chunk iterator; stream it to a file
with open("thunder.mp3", "wb") as f:
    for chunk in result:
        f.write(chunk)
```

Prompt tips: be specific, e.g. "Heavy footsteps on wooden floorboards, slow and deliberate, with creaking"

Music Generation

10 seconds to 5 minutes. Use client.music.compose() (not .generate()).

```python
result = client.music.compose(
    prompt="Upbeat indie rock, catchy guitar riff, energetic drums, travel vlog",
    music_length_ms=60000,  # 60 seconds
    force_instrumental=True,
)

with open("music.mp3", "wb") as f:
    for chunk in result:
        f.write(chunk)
```

Prompt structure: Genre, mood, instruments, tempo, use case. Add "no vocals" or use force_instrumental=True for background music.

Remotion Integration

Complete Workflow: Script to Synchronized Scene

```
VOICEOVER-SCRIPT.md  →  voiceover.py  →  public/audio/   →  Remotion composition
  Scene narration       Generate MP3     Audio files         Timing synced
                        per scene        with durations      to scenes
```

Step 1: Generate Per-Scene Audio

Use the toolkit's voiceover tool to generate audio for each scene:

```bash
# Generate voiceover files for each scene
python tools/voiceover.py --scene-dir public/audio/scenes --json

# Output:
# public/audio/scenes/
# ├── scene-01-title.mp3
# ├── scene-02-problem.mp3
# ├── scene-03-solution.mp3
# └── manifest.json (durations for each file)
```

The manifest.json contains timing info:

```json
{
  "scenes": [
    { "file": "scene-01-title.mp3", "duration": 4.2 },
    { "file": "scene-02-problem.mp3", "duration": 12.8 },
    { "file": "scene-03-solution.mp3", "duration": 15.3 }
  ],
  "totalDuration": 32.3
}
```

Step 2: Use Audio in Remotion Composition

```tsx
// src/Composition.tsx
import React from 'react';
import { Audio, staticFile, Series } from 'remotion';

// Import scene components
import { TitleSlide } from './scenes/TitleSlide';
import { ProblemSlide } from './scenes/ProblemSlide';
import { SolutionSlide } from './scenes/SolutionSlide';

// Scene durations (from manifest.json, converted to frames at 30fps)
const SCENE_DURATIONS = {
  title: Math.ceil(4.2 * 30), // 126 frames
  problem: Math.ceil(12.8 * 30), // 384 frames
  solution: Math.ceil(15.3 * 30), // 459 frames
};

export const MainComposition: React.FC = () => {
  return (
    <>
      {/* Scene sequence */}
      <Series>
        <Series.Sequence durationInFrames={SCENE_DURATIONS.title}>
          <TitleSlide />
        </Series.Sequence>
        <Series.Sequence durationInFrames={SCENE_DURATIONS.problem}>
          <ProblemSlide />
        </Series.Sequence>
        <Series.Sequence durationInFrames={SCENE_DURATIONS.solution}>
          <SolutionSlide />
        </Series.Sequence>
      </Series>
      {/* Audio track - plays continuously across all scenes (path illustrative) */}
      <Audio src={staticFile('audio/voiceover.mp3')} />
      {/* Optional: Background music at lower volume */}
      <Audio src={staticFile('audio/music.mp3')} volume={0.2} />
    </>
  );
};
```

Step 3: Per-Scene Audio (Alternative)

For more control, add audio to each scene individually:

```tsx
// src/scenes/ProblemSlide.tsx
import React from 'react';
import { Audio, staticFile, useCurrentFrame } from 'remotion';

export const ProblemSlide: React.FC = () => {
  const frame = useCurrentFrame(); // available for frame-based animation

  return (
    <div style={{ /* slide styles */ }}>
      <h1>The Problem</h1>
      {/* Scene content */}
      {/* Audio starts when this scene starts (frame 0 of this sequence) */}
      <Audio src={staticFile('audio/scenes/scene-02-problem.mp3')} />
    </div>
  );
};
```

Syncing Visuals to Voiceover

Calculate scene duration from audio, not the other way around:

```tsx
// src/config/timing.ts
import manifest from '../../public/audio/scenes/manifest.json';

const FPS = 30;

// Convert audio durations to frame counts, keyed by scene name
// ("scene-02-problem.mp3" -> "problem")
export const sceneDurations = manifest.scenes.reduce((acc, scene) => {
  const name = scene.file.replace(/^scene-\d+-/, '').replace('.mp3', '');
  acc[name] = Math.ceil(scene.duration * FPS);
  return acc;
}, {} as Record<string, number>);

// Usage in a composition:
// <Series.Sequence durationInFrames={sceneDurations.problem}>
```

Audio Timing Patterns

```tsx
import React from 'react';
import { Audio, Sequence, interpolate, useCurrentFrame } from 'remotion';

// Fade in audio
export const FadeInAudio: React.FC<{ src: string; fadeFrames?: number }> = ({
  src,
  fadeFrames = 30,
}) => {
  const frame = useCurrentFrame();
  const volume = interpolate(frame, [0, fadeFrames], [0, 1], {
    extrapolateRight: 'clamp',
  });
  return <Audio src={src} volume={volume} />;
};

// Delayed audio start
export const DelayedAudio: React.FC<{ src: string; delayFrames: number }> = ({
  src,
  delayFrames,
}) => (
  <Sequence from={delayFrames}>
    <Audio src={src} />
  </Sequence>
);

// Usage:
// <FadeInAudio src={staticFile('audio/voiceover.mp3')} />
// <DelayedAudio src={staticFile('audio/sting.mp3')} delayFrames={45} />
```

Voiceover + Demo Video Sync

When a scene has both voiceover and demo video:

```tsx
import React from 'react';
import { Audio, OffthreadVideo, staticFile, useVideoConfig } from 'remotion';

export const DemoScene: React.FC = () => {
  const { durationInFrames, fps } = useVideoConfig();

  // Calculate playback rate to fit demo into voiceover duration
  const demoDuration = 45; // seconds (original demo length)
  const sceneDuration = durationInFrames / fps; // seconds (from voiceover)
  const playbackRate = demoDuration / sceneDuration;

  return (
    <>
      <OffthreadVideo
        src={staticFile('demos/feature-demo.mp4')}
        playbackRate={playbackRate}
        muted
      />
      {/* Voiceover for this scene (path illustrative) */}
      <Audio src={staticFile('audio/scenes/scene-04-demo.mp3')} />
    </>
  );
};
```

Error Handling

```tsx
import React, { useEffect, useState } from 'react';
import { Audio, delayRender, continueRender } from 'remotion';

export const SafeAudio: React.FC<{ src: string }> = ({ src }) => {
  const [handle] = useState(() => delayRender());
  const [audioReady, setAudioReady] = useState(false);

  useEffect(() => {
    const audio = new window.Audio(src);
    audio.oncanplaythrough = () => {
      setAudioReady(true);
      continueRender(handle);
    };
    audio.onerror = () => {
      console.error(`Failed to load audio: ${src}`);
      continueRender(handle); // Continue without audio rather than hang
    };
  }, [src, handle]);

  if (!audioReady) return null;
  return <Audio src={src} />;
};
```

Toolkit Command: /generate-voiceover

The /generate-voiceover command handles the full workflow:

  1. Reads VOICEOVER-SCRIPT.md
  2. Extracts narration for each scene
  3. Generates audio via ElevenLabs API
  4. Saves to public/audio/scenes/
  5. Creates manifest.json with durations
  6. Updates project.json with timing info

Popular Voices

  • George: JBFqnCBsd6RMkjVDRZzb (warm narrator)
  • Rachel: 21m00Tcm4TlvDq8ikWAM (clear female)
  • Adam: pNInz6obpgDQGcFmaJgB (professional male)

List all: client.voices.get_all()
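
A minimal listing sketch, continuing with the client from the Text-to-Speech section:

```python
# Print each available voice's name and ID
for voice in client.voices.get_all().voices:
    print(voice.name, voice.voice_id)
```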

For full API docs, see [reference.md](reference.md).