# voice.generate.ready — Seed Planning

## Purpose
Root capability gate: "Can this agent produce audio from text?"
## TTS Landscape (April 2026)

### Cloud APIs (agent-friendly)
| Provider | Latency | Price/1M chars | Voices | Languages | Voice Clone | Agent-fit |
|---|---|---|---|---|---|---|
| OpenAI TTS | ~300ms | $15-30 | 13 preset | 50+ | No | ★★★★★ (same SDK as LLM) |
| ElevenLabs | 75-200ms | $103-206 | 10K+ library | 32 | Yes (30s) | ★★★★ (mature API) |
| Inworld TTS | <200ms | $10 | Zero-shot clone | 15 | Yes (5-15s) | ★★★ (new, cheap) |
| Deepgram Aura-2 | 90ms | pay-as-you-go | 40+ EN | 7 | No | ★★★ |
| Google Cloud TTS | ~400ms | $4-16 | 300+ | 50+ | Yes (10s) | ★★★ (SSML power) |
| Cartesia Sonic | 40-90ms | ~$50/1M | SSM arch | 15+ | Yes | ★★★★ (fastest) |
| Fish Audio S1 | varies | $11/mo+ | #1 TTS-Arena2 | 80+ | Yes (10s) | ★★★ |
### Local / Open-Source (no API key)
| Model | Hardware | Languages | Clone | Quality | License |
|---|---|---|---|---|---|
| Orpheus TTS 3B | 6-8GB VRAM | EN+ | No | ★★★★★ (emotional) | Open |
| Piper | CPU only | 30+ | No | ★★★ | MIT |
| Coqui XTTS v2 | 8GB GPU | 17 | 6s sample | ★★★★ | MPL 2.0 |
| Bark | 12GB GPU | 13+ | Limited | ★★★★ (+ sound FX) | MIT |
| Kokoro | 82M params, light | EN+ | No | ★★★★ | Apache 2.0 |
| Chatterbox-Turbo | 350M, light | EN | Yes | ★★★★ | MIT |
| MeloTTS | CPU | 6 | No | ★★★ | MIT |
| Voxtral TTS | 16GB GPU | 9 | 3s sample | ★★★★★ | CC open |
### Browser (Web Speech API)
- `window.speechSynthesis` — uses OS voices, zero cost, no API key
- Limitations: no file output (plays to speaker only), no streaming capture without hacks, quality varies by OS, experimental status, no voice cloning
- Agent verdict: ❌ Not suitable. Agents need audio files, not speaker playback. No programmatic audio capture.
## What Makes This Seed Interesting
This is a capability gate, not a task seed. The agent needs to confirm: "I can turn text into an audio file." The verification must be concrete — an audio file must exist after execution.
### Key Parameters That Matter
- Output format — MP3/WAV/OGG (MP3 most universal)
- Voice selection — at minimum, agent should know what voice it's using
- Latency class — real-time (<300ms TTFB) vs batch (doesn't matter for the gate)
- File persistence — audio saved to accessible path
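The four gate parameters above can be captured as a small record the agent fills in when it runs the gate. This is an illustrative sketch, not part of any provider SDK; the field and class names (`TTSGateConfig`, etc.) are invented here.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class TTSGateConfig:
    """Illustrative record of what the capability gate cares about."""
    output_format: str = "mp3"              # mp3 | wav | ogg; mp3 is most universal
    voice: str = "default"                  # agent should record which voice it used
    realtime: bool = False                  # <300ms TTFB class vs batch; irrelevant to the gate itself
    output_path: Path = Path("gate_check.mp3")  # audio must persist at a known path

cfg = TTSGateConfig(voice="alloy", output_format="mp3")
print(cfg)
```

Everything a child seed would add (emotion tags, SSML, clone samples) deliberately has no field here.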
### Parameters That DON'T Matter (for the gate)
- Emotional control, SSML, voice cloning — those are child seeds
- Streaming vs batch — irrelevant for capability confirmation
- Cost optimization — that's operational, not gating
## Minimal Verification
Generate a short audio clip (1 sentence) and confirm:
- An audio file exists at a known path
- The file is valid (non-zero bytes, correct MIME type)
- The agent knows which TTS provider/model was used
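The three checks above can be approximated with a few lines of stdlib Python. Magic-byte sniffing stands in for "correct MIME type"; the signature table below is abridged (e.g. MP3 frame-sync bytes vary), so treat it as a sketch rather than a complete sniffer.

```python
from pathlib import Path
from typing import Optional

# Abridged magic-byte signatures for the formats named above:
# MP3 (with/without an ID3 tag), WAV, and OGG.
AUDIO_MAGIC = {
    b"ID3": "audio/mpeg",
    b"\xff\xfb": "audio/mpeg",
    b"\xff\xf3": "audio/mpeg",
    b"RIFF": "audio/x-wav",
    b"OggS": "audio/ogg",
}

def verify_audio(path: str) -> Optional[str]:
    """Gate check: return the sniffed MIME type if the file exists,
    is non-empty, and starts with a known audio signature; else None."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return None
    head = p.read_bytes()[:4]
    for sig, mime in AUDIO_MAGIC.items():
        if head.startswith(sig):
            return mime
    return None
```

The third check (knowing the provider) is metadata the agent records itself; it cannot be recovered from the file bytes.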
## Contract Direction
State after execution: "an audio file generated from text exists at a known path using a confirmed tts provider"
This is provider-agnostic by design. Whether the agent used OpenAI, ElevenLabs, Piper, or curl to a self-hosted model — the contract is the same. Audio file from text. Verified.
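One way to keep the gate provider-agnostic in practice is to probe the environment for whatever path is available before generating. A minimal sketch: the env-var names are each provider's conventional ones and the binary names (`piper`, Coqui's `tts`) are the usual CLI entry points, but all of these may differ per deployment.

```python
import os
import shutil
from typing import Optional

def pick_tts_provider() -> Optional[str]:
    """Probe for any workable text-to-audio path, cloud first, then local.
    Names here are conventional, not guaranteed for every install."""
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    if os.environ.get("ELEVENLABS_API_KEY"):
        return "elevenlabs"
    if shutil.which("piper"):   # Piper CLI: CPU-only, MIT-licensed
        return "piper"
    if shutil.which("tts"):     # Coqui TTS CLI
        return "coqui"
    return None                 # gate fails: no capability available
```

Whichever branch fires, the downstream contract is identical: text in, audio file out at a known path.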
## Potential Shoots (children)
- `voice.clone.ready` — can clone a voice from a sample
- `voice.stream.ready` — can stream audio in real time
- `voice.emotion.control` — can modulate tone/emotion
- `audio.podcast.generate` — generate a multi-speaker podcast
- `voice.narrate.document` — narrate a full document
## Prompt Direction
The seed prompt should:
- Be provider-agnostic (mention the major options but don't require a specific one)
- Require generating a test utterance and saving it to a file
- Require confirming the output is valid audio
- Note the key decision points: cloud vs local, voice selection, output format
- Stay under 2048 chars but be rich with expert knowledge (Base 16 v3 standard)
## Open Questions
- Should the seed prefer cloud APIs (simpler for most agents) or stay truly neutral?
- Should voice selection be part of this gate or a child seed?
- Should we require the agent to test playback, or just file existence?
Recommendation: Stay neutral on provider. Include voice selection as a "note what you chose" step, not a child. File existence + valid MIME is sufficient — playback is environment-dependent.