# voice.generate.ready — Seed Planning

## Purpose
Root capability gate: "Can this agent produce audio from text?"
## TTS Landscape (April 2026)

### Cloud APIs (agent-friendly)
| Provider | Latency | Price/1M chars | Voices | Languages | Voice Clone | Agent-fit |
|---|---|---|---|---|---|---|
| OpenAI TTS | ~300ms | $15-30 | 13 preset | 50+ | No | ★★★★★ (same SDK as LLM) |
| ElevenLabs | 75-200ms | $103-206 | 10K+ library | 32 | Yes (30s) | ★★★★ (mature API) |
| Inworld TTS | <200ms | $10 | Zero-shot clone | 15 | Yes (5-15s) | ★★★ (new, cheap) |
| Deepgram Aura-2 | 90ms | pay-as-you-go | 40+ EN | 7 | No | ★★★ |
| Google Cloud TTS | ~400ms | $4-16 | 300+ | 50+ | Yes (10s) | ★★★ (SSML power) |
| Cartesia Sonic | 40-90ms | ~$50/1M | SSM arch | 15+ | Yes | ★★★★ (fastest) |
| Fish Audio S1 | varies | $11/mo+ | #1 TTS-Arena2 | 80+ | Yes (10s) | ★★★ |
### Local / Open-Source (no API key)
| Model | Hardware | Languages | Clone | Quality | License |
|---|---|---|---|---|---|
| Orpheus TTS 3B | 6-8GB VRAM | EN+ | No | ★★★★★ (emotional) | Open |
| Piper | CPU only | 30+ | No | ★★★ | MIT |
| Coqui XTTS v2 | 8GB GPU | 17 | 6s sample | ★★★★ | MPL 2.0 |
| Bark | 12GB GPU | 13+ | Limited | ★★★★ (+ sound FX) | MIT |
| Kokoro | 82M params, light | EN+ | No | ★★★★ | Apache 2.0 |
| Chatterbox-Turbo | 350M, light | EN | Yes | ★★★★ | MIT |
| MeloTTS | CPU | 6 | No | ★★★ | MIT |
| Voxtral TTS | 16GB GPU | 9 | 3s sample | ★★★★★ | CC open |
### Browser (Web Speech API)
- `window.speechSynthesis` — uses OS voices, zero cost, no API key
- Limitations: no file output (plays to speaker only), no streaming capture without hacks, quality varies by OS, experimental status, no voice cloning
- Agent verdict: ❌ Not suitable. Agents need audio files, not speaker playback. No programmatic audio capture.
## What Makes This Seed Interesting
This is a capability gate, not a task seed. The agent needs to confirm: "I can turn text into an audio file." The verification must be concrete — an audio file must exist after execution.
### Key Parameters That Matter
- Output format — MP3/WAV/OGG (MP3 most universal)
- Voice selection — at minimum, agent should know what voice it's using
- Latency class — real-time (<300ms TTFB) vs batch (doesn't matter for the gate)
- File persistence — audio saved to accessible path
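The four gate parameters above can be captured as a small record the agent fills in when it runs the gate. This is an illustrative sketch, not part of any provider SDK; the field and class names (`TTSGateConfig`, etc.) are invented here.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class TTSGateConfig:
    """Illustrative record of what the capability gate cares about."""
    output_format: str = "mp3"              # mp3 | wav | ogg; mp3 is most universal
    voice: str = "default"                  # agent should record which voice it used
    realtime: bool = False                  # <300ms TTFB class vs batch; irrelevant to the gate itself
    output_path: Path = Path("gate_check.mp3")  # audio must persist at a known path

cfg = TTSGateConfig(voice="alloy", output_format="mp3")
print(cfg)
```

Everything a child seed would add (emotion tags, SSML, clone samples) deliberately has no field here.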
### Parameters That DON'T Matter (for the gate)
- Emotional control, SSML, voice cloning — those are child seeds
- Streaming vs batch — irrelevant for capability confirmation
- Cost optimization — that's operational, not gating
## Minimal Verification
Generate a short audio clip (1 sentence) and confirm:
- An audio file exists at a known path
- The file is valid (non-zero bytes, correct MIME type)
- The agent knows which TTS provider/model was used
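The three checks above can be approximated with a few lines of stdlib Python. Magic-byte sniffing stands in for "correct MIME type"; the signature table below is abridged (e.g. MP3 frame-sync bytes vary), so treat it as a sketch rather than a complete sniffer.

```python
from pathlib import Path
from typing import Optional

# Abridged magic-byte signatures for the formats named above:
# MP3 (with/without an ID3 tag), WAV, and OGG.
AUDIO_MAGIC = {
    b"ID3": "audio/mpeg",
    b"\xff\xfb": "audio/mpeg",
    b"\xff\xf3": "audio/mpeg",
    b"RIFF": "audio/x-wav",
    b"OggS": "audio/ogg",
}

def verify_audio(path: str) -> Optional[str]:
    """Gate check: return the sniffed MIME type if the file exists,
    is non-empty, and starts with a known audio signature; else None."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return None
    head = p.read_bytes()[:4]
    for sig, mime in AUDIO_MAGIC.items():
        if head.startswith(sig):
            return mime
    return None
```

The third check (knowing the provider) is metadata the agent records itself; it cannot be recovered from the file bytes.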
## Contract Direction
State after execution: "an audio file generated from text exists at a known path using a confirmed tts provider"
This is provider-agnostic by design. Whether the agent used OpenAI, ElevenLabs, Piper, or curl to a self-hosted model — the contract is the same. Audio file from text. Verified.
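One way to keep the gate provider-agnostic in practice is to probe the environment for whatever path is available before generating. A minimal sketch: the env-var names are each provider's conventional ones and the binary names (`piper`, Coqui's `tts`) are the usual CLI entry points, but all of these may differ per deployment.

```python
import os
import shutil
from typing import Optional

def pick_tts_provider() -> Optional[str]:
    """Probe for any workable text-to-audio path, cloud first, then local.
    Names here are conventional, not guaranteed for every install."""
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    if os.environ.get("ELEVENLABS_API_KEY"):
        return "elevenlabs"
    if shutil.which("piper"):   # Piper CLI: CPU-only, MIT-licensed
        return "piper"
    if shutil.which("tts"):     # Coqui TTS CLI
        return "coqui"
    return None                 # gate fails: no capability available
```

Whichever branch fires, the downstream contract is identical: text in, audio file out at a known path.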
## Potential Shoots (children)
- `voice.clone.ready` — can clone a voice from a sample
- `voice.stream.ready` — can stream audio in real time
- `voice.emotion.control` — can modulate tone/emotion
- `audio.podcast.generate` — generate a multi-speaker podcast
- `voice.narrate.document` — narrate a full document
## Prompt Direction
The seed prompt should:
- Be provider-agnostic (mention the major options but don't require a specific one)
- Require generating a test utterance and saving it to a file
- Require confirming the output is valid audio
- Note the key decision points: cloud vs local, voice selection, output format
- Stay under 2048 chars but be rich with expert knowledge (Base 16 v3 standard)
## Open Questions
- Should the seed prefer cloud APIs (simpler for most agents) or stay truly neutral?
- Should voice selection be part of this gate or a child seed?
- Should we require the agent to test playback, or just file existence?
Recommendation: Stay neutral on provider. Include voice selection as a "note what you chose" step, not a child. File existence + valid MIME is sufficient — playback is environment-dependent.