Topic · Advanced
Voice Agents
Build voice agents with an STT + LLM + TTS pipeline or the OpenAI Realtime API, for phone, call-center, and hands-free use.
4 hours · 3 resources
Two architectures:
Pipeline (classic): mic → Whisper/Deepgram (STT) → LLM agent (tool use) → ElevenLabs/Cartesia (TTS) → speaker. Roughly 1-3 s of latency per turn; modular (each stage can be swapped independently) and comparatively cheap.
Realtime (modern): OpenAI Realtime API / Gemini Live. A single model handles audio in and audio out, with ~300 ms latency. Premium UX, but pricey.
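The classic pipeline can be sketched as three stages run in sequence, with each stage timed so you can see where the per-turn latency goes. This is a minimal sketch, not any provider's SDK: the `STT`, `LLM`, and `TTS` callable types, `run_turn`, `TurnResult`, and the `fake_*` stand-in stages are all hypothetical names chosen for illustration. In a real agent each callable would wrap a provider call.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stage signatures; real STT/LLM/TTS SDKs will differ.
STT = Callable[[bytes], str]   # audio in -> transcript
LLM = Callable[[str], str]     # transcript -> reply text
TTS = Callable[[str], bytes]   # reply text -> audio out


@dataclass
class TurnResult:
    transcript: str
    reply: str
    audio: bytes
    timings_ms: dict = field(default_factory=dict)

    @property
    def total_ms(self) -> float:
        # Stages run sequentially, so turn latency is the sum of stage latencies.
        return sum(self.timings_ms.values())


def run_turn(audio_in: bytes, stt: STT, llm: LLM, tts: TTS) -> TurnResult:
    """Run one conversational turn: STT, then LLM, then TTS."""
    timings = {}

    def timed(name, fn, arg):
        start = time.perf_counter()
        out = fn(arg)
        timings[name] = (time.perf_counter() - start) * 1000
        return out

    transcript = timed("stt", stt, audio_in)
    reply = timed("llm", llm, transcript)
    audio_out = timed("tts", tts, reply)
    return TurnResult(transcript, reply, audio_out, timings)


# Stand-in stages so the sketch runs without any provider account.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_turn(b"book a table for two", fake_stt, fake_llm, fake_tts)
print(result.reply)
print({stage: round(ms, 3) for stage, ms in result.timings_ms.items()})
```

Because the stages are plain callables, swapping one provider for another only changes the function passed in. This is the modularity the pipeline trades latency for.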
Stack:
- Telephony: Twilio Voice, Vapi.ai, Retell, Bland.ai
- STT: Deepgram, AssemblyAI, Whisper API
- TTS: ElevenLabs, Cartesia, OpenAI TTS
- Voice agent framework: Pipecat (Daily), LiveKit Agents, Vapi
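One way to keep a stack choice explicit is a small config object with one field per layer. The sketch below is illustrative only: `STACK_OPTIONS` and `VoiceStack` are hypothetical names, the options are copied from the list above, and no provider SDK is involved.

```python
from dataclasses import dataclass

# Options copied from the list above; the layer keys are illustrative names.
STACK_OPTIONS = {
    "telephony": {"Twilio Voice", "Vapi.ai", "Retell", "Bland.ai"},
    "stt": {"Deepgram", "AssemblyAI", "Whisper API"},
    "tts": {"ElevenLabs", "Cartesia", "OpenAI TTS"},
    "framework": {"Pipecat", "LiveKit Agents", "Vapi"},
}


@dataclass(frozen=True)
class VoiceStack:
    telephony: str
    stt: str
    tts: str
    framework: str

    def __post_init__(self):
        # Reject any choice that is not in the catalogue for its layer.
        for layer, options in STACK_OPTIONS.items():
            choice = getattr(self, layer)
            if choice not in options:
                raise ValueError(f"{choice!r} is not a listed {layer} option")


stack = VoiceStack(
    telephony="Twilio Voice",
    stt="Deepgram",
    tts="Cartesia",
    framework="Pipecat",
)
print(stack)
```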
Use cases: restaurant reservations, doctor appointments, lead qualification, phone customer support, IVR replacement.
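For a use case like restaurant reservations, the tool-use step of the LLM agent can be sketched as a small dispatcher. It looks up the tool the model asked for and calls it with the model's arguments. Everything here is an assumption for illustration: the `book_table` tool, the in-memory `BOOKINGS` store, and the `{"name": ..., "arguments": {...}}` call shape. Real providers each define their own tool-call format.

```python
import json
from typing import Callable

# Hypothetical in-memory store; a real agent would call a reservation system.
BOOKINGS: list = []


def book_table(name: str, party_size: int, time: str) -> str:
    BOOKINGS.append({"name": name, "party_size": party_size, "time": time})
    return f"Booked a table for {party_size} under {name} at {time}."


# Registry of tools the agent is allowed to call.
TOOLS: dict = {"book_table": book_table}


def dispatch(tool_call_json: str) -> str:
    """Execute a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    tool: Callable[..., str] | None = TOOLS.get(call["name"])
    if tool is None:
        return f"Unknown tool: {call['name']}"
    return tool(**call["arguments"])


print(dispatch(
    '{"name": "book_table", '
    '"arguments": {"name": "Ana", "party_size": 2, "time": "19:30"}}'
))
# prints: Booked a table for 2 under Ana at 19:30.
```

The string returned by `dispatch` is what would be fed back to the LLM so it can phrase the spoken confirmation.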