Topic · Advanced

Voice Agents

Build voice agents with either an STT + LLM + TTS pipeline or the OpenAI Realtime API, for phone, call-center, and hands-free use.

4 hours · 3 resources

Two architectures:

Pipeline (classic): mic → Whisper/Deepgram (STT) → LLM agent (tool use) → ElevenLabs/Cartesia (TTS) → speaker. Typically 1-3 s of latency, but modular and cheap.
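
The three-stage flow above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins for real providers, and `VoicePipeline` is a name invented here. It shows why the design is modular (each stage is a swappable callable) and where the latency accumulates (the stages run one after another).

```python
import time
from dataclasses import dataclass, field
from typing import Callable


# Hypothetical stand-ins: a real system would call an STT, LLM, and TTS provider here.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")          # pretend the audio is already text


def fake_llm(text: str) -> str:
    return f"You said: {text}"            # pretend agent reply, no tool use


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")           # pretend synthesized audio


@dataclass
class VoicePipeline:
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]
    timings: dict[str, float] = field(default_factory=dict)

    def _timed(self, name, fn, arg):
        # Record how long each stage takes; total latency is the sum of all three.
        start = time.perf_counter()
        result = fn(arg)
        self.timings[name] = time.perf_counter() - start
        return result

    def handle_turn(self, audio_in: bytes) -> bytes:
        text = self._timed("stt", self.stt, audio_in)
        reply = self._timed("llm", self.llm, text)
        return self._timed("tts", self.tts, reply)


pipeline = VoicePipeline(fake_stt, fake_llm, fake_tts)
audio_out = pipeline.handle_turn(b"book a table for two")
print(audio_out.decode("utf-8"))
print(sorted(pipeline.timings))
```

Swapping a provider means passing a different callable for one stage; the other two stay untouched.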

Realtime (modern): OpenAI Realtime API or Gemini Live. A single model handles audio in and audio out, with ~300 ms latency. Premium UX, but pricey.
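
The shape of a realtime session can be sketched as one duplex stream: audio chunks go in and audio chunks come out of the same model, with no separate STT or TTS step. The sketch below uses only `asyncio` from the standard library; `fake_realtime_model` and `run_session` are hypothetical names, and nothing here calls the OpenAI Realtime API or Gemini Live, whose actual interfaces are network streams with their own event formats.

```python
import asyncio


async def fake_realtime_model(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Hypothetical speech-to-speech model: emits a reply chunk per input chunk."""
    while True:
        chunk = await inbound.get()
        if chunk is None:                  # end-of-stream sentinel
            await outbound.put(None)
            return
        await outbound.put(b"reply:" + chunk)


async def run_session(mic_chunks: list[bytes]) -> list[bytes]:
    inbound: asyncio.Queue = asyncio.Queue()
    outbound: asyncio.Queue = asyncio.Queue()
    model = asyncio.create_task(fake_realtime_model(inbound, outbound))

    async def send() -> None:
        # Stream microphone audio without waiting for replies.
        for chunk in mic_chunks:
            await inbound.put(chunk)
        await inbound.put(None)

    async def receive() -> list[bytes]:
        # Play reply audio as soon as it arrives.
        played = []
        while (chunk := await outbound.get()) is not None:
            played.append(chunk)
        return played

    _, played = await asyncio.gather(send(), receive())
    await model
    return played


played = asyncio.run(run_session([b"a", b"b"]))
print(played)
```

Because sending and receiving run concurrently, replies can start before the caller has finished speaking, which is where the lower latency of this architecture comes from.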

Stack:

  • Telephony: Twilio Voice, Vapi.ai, Retell, Bland.ai
  • STT: Deepgram, AssemblyAI, Whisper API
  • TTS: ElevenLabs, Cartesia, OpenAI TTS
  • Voice agent framework: Pipecat (Daily), LiveKit Agents, Vapi
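
One way to think about the stack is as a configuration with one choice per layer. The sketch below is an assumption-laden illustration: `VoiceStack` and `STACK_OPTIONS` are hypothetical names, the options are copied from the list above, and the check only confirms a choice appears in that list, not that the chosen providers actually integrate with each other.

```python
from dataclasses import dataclass

# Options copied from the stack list above; not an exhaustive catalog.
STACK_OPTIONS = {
    "telephony": {"Twilio Voice", "Vapi.ai", "Retell", "Bland.ai"},
    "stt": {"Deepgram", "AssemblyAI", "Whisper API"},
    "tts": {"ElevenLabs", "Cartesia", "OpenAI TTS"},
    "framework": {"Pipecat", "LiveKit Agents", "Vapi"},
}


@dataclass(frozen=True)
class VoiceStack:
    telephony: str
    stt: str
    tts: str
    framework: str

    def __post_init__(self) -> None:
        # Reject any choice that is not in the list for its layer.
        for layer, options in STACK_OPTIONS.items():
            choice = getattr(self, layer)
            if choice not in options:
                raise ValueError(f"{choice!r} is not a listed {layer} option")


stack = VoiceStack(
    telephony="Twilio Voice",
    stt="Deepgram",
    tts="Cartesia",
    framework="Pipecat",
)
print(stack)
```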

Use cases: restaurant reservations, doctor appointments, lead qualification, phone customer support, IVR replacement.


Voice Agents · AI Agent Engineer Roadmap | Şükrü Yusuf Kaya