voice.generate.ready — Seed Planning

Purpose

Root capability gate: "Can this agent produce audio from text?"

TTS Landscape (April 2026)

Cloud APIs (agent-friendly)

| Provider | Latency | Price/1M chars | Voices | Languages | Voice Clone | Agent-fit |
|---|---|---|---|---|---|---|
| OpenAI TTS | ~300ms | $15-30 | 13 preset | 50+ | No | ★★★★★ (same SDK as LLM) |
| ElevenLabs | 75-200ms | $103-206 | 10K+ library | 32 | Yes (30s) | ★★★★ (mature API) |
| Inworld TTS | <200ms | $10 | Zero-shot clone | 15 | Yes (5-15s) | ★★★ (new, cheap) |
| Deepgram Aura-2 | 90ms | pay-as-you-go | 40+ EN | 7 | No | ★★★ |
| Google Cloud TTS | ~400ms | $4-16 | 300+ | 50+ | Yes (10s) | ★★★ (SSML power) |
| Cartesia Sonic | 40-90ms | ~$50 | SSM arch | 15+ | Yes | ★★★★ (fastest) |
| Fish Audio S1 | varies | $11/mo+ | #1 TTS-Arena2 | 80+ | Yes (10s) | ★★★ |

Local / Open-Source (no API key)

| Model | Hardware | Languages | Clone | Quality | License |
|---|---|---|---|---|---|
| Orpheus TTS 3B | 6-8GB VRAM | EN+ | No | ★★★★★ (emotional) | Open |
| Piper | CPU only | 30+ | No | ★★★ | MIT |
| Coqui XTTS v2 | 8GB GPU | 17 | 6s sample | ★★★★ | MPL 2.0 |
| Bark | 12GB GPU | 13+ | Limited | ★★★★ (+ sound FX) | MIT |
| Kokoro | 82M params, light | EN+ | No | ★★★★ | Apache 2.0 |
| Chatterbox-Turbo | 350M, light | EN | Yes | ★★★★ | MIT |
| MeloTTS | CPU | 6 | No | ★★★ | MIT |
| Voxtral TTS | 16GB GPU | 9 | 3s sample | ★★★★★ | CC open |

Browser (Web Speech API)

What Makes This Seed Interesting

This is a capability gate, not a task seed. The agent needs to confirm: "I can turn text into an audio file." The verification must be concrete — an audio file must exist after execution.

Key Parameters That Matter

  1. Output format — MP3/WAV/OGG (MP3 most universal)
  2. Voice selection — at minimum, agent should know what voice it's using
  3. Latency class — real-time (<300ms TTFB) vs batch (doesn't matter for gate)
  4. File persistence — audio saved to accessible path
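
One way to keep these four parameters together is a small record written next to the audio file. The sketch below is illustrative only: the names `GenerationParams`, `write_manifest`, and `REALTIME_TTFB_MS` are hypothetical, and the 300 ms cutoff is taken from the latency note above rather than from any provider's definition.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional

# Assumption: cutoff copied from the "<300ms TTFB" note in the list above.
REALTIME_TTFB_MS = 300
SUPPORTED_FORMATS = {"mp3", "wav", "ogg"}


@dataclass(frozen=True)
class GenerationParams:
    """Hypothetical record of the parameters the gate cares about."""
    output_format: str               # "mp3", "wav", or "ogg"
    voice: str                       # whatever identifier the provider uses
    output_path: str                 # where the audio file was saved
    ttfb_ms: Optional[float] = None  # time to first byte, if it was measured

    def __post_init__(self) -> None:
        if self.output_format not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {self.output_format!r}")
        if not self.voice:
            raise ValueError("the agent should know which voice it used")

    def latency_class(self) -> str:
        """Classify latency; informational only, it does not affect the gate."""
        if self.ttfb_ms is None:
            return "unknown"
        return "real-time" if self.ttfb_ms < REALTIME_TTFB_MS else "batch"


def write_manifest(params: GenerationParams, manifest_path: str) -> Path:
    """Persist the parameters as JSON so a later step can read them back."""
    record = asdict(params)
    record["latency_class"] = params.latency_class()
    path = Path(manifest_path)
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

A call such as `write_manifest(GenerationParams("mp3", "example-voice", "out/clip.mp3"), "out/clip.json")` would leave a sidecar file recording what was chosen, provided the `out` directory already exists.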

Parameters That DON'T Matter (for the gate)

Minimal Verification

Generate a short audio clip (1 sentence) and confirm:

  1. An audio file exists at a known path
  2. The file is valid (non-zero bytes, correct MIME type)
  3. The agent knows which TTS provider/model was used
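
The three checks above could be sketched roughly as follows, using only the Python standard library. The function names `sniff_audio_mime` and `verify_audio_file` are hypothetical. Sniffing the leading bytes is one possible reading of "correct MIME type", assumed here because an extension-based guess says nothing about the file's contents; the signatures checked are the common WAV, Ogg, and MP3 ones and are not exhaustive.

```python
from pathlib import Path
from typing import Optional


def sniff_audio_mime(header: bytes) -> Optional[str]:
    """Guess an audio MIME type from the first bytes of a file."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "audio/wav"
    if header[:4] == b"OggS":
        return "audio/ogg"
    # MP3 starts with an ID3v2 tag or an MPEG frame sync (eleven set bits).
    if header[:3] == b"ID3":
        return "audio/mpeg"
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "audio/mpeg"
    return None


def verify_audio_file(path: str, provider: str,
                      expected_mime: Optional[str] = None) -> dict:
    """Run the three gate checks and return a small report, or raise."""
    if not provider:
        raise ValueError("the TTS provider/model must be known")
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"no audio file at {p}")
    size = p.stat().st_size
    if size == 0:
        raise ValueError(f"{p} is empty")
    with p.open("rb") as fh:
        mime = sniff_audio_mime(fh.read(12))
    if mime is None:
        raise ValueError(f"{p} does not look like WAV, Ogg, or MP3 audio")
    if expected_mime is not None and mime != expected_mime:
        raise ValueError(f"expected {expected_mime}, found {mime}")
    return {"path": str(p.resolve()), "bytes": size,
            "mime": mime, "provider": provider}
```

For example, `verify_audio_file("out/clip.mp3", provider="some-provider", expected_mime="audio/mpeg")` either returns the report or raises with the reason the gate failed. Playback is deliberately not tested.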

Contract Direction

State after execution: "an audio file generated from text exists at a known path using a confirmed tts provider"

This is provider-agnostic by design. Whether the agent used OpenAI, ElevenLabs, Piper, or curl against a self-hosted model, the contract is the same: an audio file generated from text, verified.
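
To illustrate what provider-agnostic could mean in practice, here is a minimal sketch in which the gate depends only on a small interface. Everything named here (`TextToSpeech`, `SilentWavStub`, `run_gate`) is hypothetical. The stub writes silence rather than speech so the example runs without an API key or model; a real wrapper around any provider would take its place.

```python
import wave
from pathlib import Path
from typing import Protocol


class TextToSpeech(Protocol):
    """Hypothetical interface: any provider wrapper with this shape qualifies."""
    provider: str

    def synthesize(self, text: str, out_path: Path) -> None: ...


class SilentWavStub:
    """Placeholder backend. Writes silence, NOT real speech."""
    provider = "stub-silent-wav"

    def synthesize(self, text: str, out_path: Path) -> None:
        # 400 frames at 8 kHz is 50 ms of silence per character of input.
        n_frames = max(1, len(text)) * 400
        with wave.open(str(out_path), "wb") as wav:
            wav.setnchannels(1)      # mono
            wav.setsampwidth(2)      # 16-bit samples
            wav.setframerate(8000)   # 8 kHz
            wav.writeframes(b"\x00\x00" * n_frames)


def run_gate(tts: TextToSpeech, text: str, out_path: Path) -> str:
    """Generate audio with whichever backend was passed in, then report state."""
    tts.synthesize(text, out_path)
    if not out_path.is_file() or out_path.stat().st_size == 0:
        raise RuntimeError(f"{tts.provider} produced no audio at {out_path}")
    return (f"an audio file generated from text exists at {out_path} "
            f"using {tts.provider}")
```

Because `run_gate` never names a provider, swapping the stub for a cloud or local backend changes only the object passed in, not the contract being checked.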

Potential Shoots (children)

Prompt Direction

The seed prompt should:

Open Questions

  1. Should the seed prefer cloud APIs (simpler for most agents) or stay truly neutral?
  2. Should voice selection be part of this gate or a child seed?
  3. Should we require the agent to test playback, or just file existence?

Recommendation: Stay neutral on provider. Include voice selection as a "note what you chose" step inside this gate, not as a child seed. File existence plus a valid MIME type is sufficient; playback is environment-dependent and should not be required.