# Voice AI Engineering Training (OpenAI Realtime + ElevenLabs + Cartesia Sonic + Sesame Maya + Whisper + Vapi + LiveKit Agents + Moshi)

> Source: https://sukruyusufkaya.com/en/training/voice-ai-muhendisligi-egitimi
> Updated: 2026-05-19T19:04:33.533Z
> Level: advanced
> Topics: voice ai, openai realtime api, claude voice, gemini 2.5 live api, sesame maya, hume evi, whisper v3, elevenlabs, cartesia sonic, vapi voice agent, retell ai, livekit agents, pipecat, moshi kyutai, f5-tts, higgs audio, speech-to-speech llm, voice cloning, twilio voice ai, kvkk-compliant voice ai
**TLDR:** A 3-day advanced Turkish training that covers end to end the real-time speech-to-speech LLM + Voice AI ecosystem — one of 2024-2026's hottest frontiers. Includes OpenAI GPT-4o Realtime API, Anthropic Claude Voice, Google Gemini 2.5 Live API, Sesame Maya (2025), Hume EVI 3, Whisper v3 Large, ElevenLabs Conversational AI, Cartesia Sonic 2 (sub-100ms), Vapi (YC W24), Retell AI, LiveKit Agents, Pipecat, Moshi (Kyutai open-source), F5-TTS, Higgs Audio v2; Twilio + Telnyx + Turkish telecom SIP telephony; banking IVR + healthcare triage + e-commerce call-center use cases; KVKK + BDDK-compliant deployment.

## Description

The Voice AI Engineering Training is a 3-day advanced program, taught in Turkish, that covers end to end the real-time speech-to-speech LLM and voice-agent ecosystem that defined the 2024-2026 period. It is aimed at AI Engineers, Voice Engineers, Backend Developers, Conversational AI Designers, Senior Product Engineers, and call-center managers.

## Learning Outcomes

- Manage the paradigm shift from the classical STT+LLM+TTS pipeline to native S2S LLMs.
- Make team-appropriate choices among OpenAI Realtime + Gemini Live + Claude Voice + Sesame Maya.
- Build a Whisper v3 + faster-whisper + WhisperX + Azure Speech Turkish STT production pipeline.
- Deploy ElevenLabs Multilingual v2 + voice cloning + Conversational AI.
- Build ultra-low-latency voice agents with Cartesia Sonic 2 sub-100ms TTS.
- Orchestrate voice agents with Vapi + Retell AI + LiveKit Agents + Pipecat.
- Deploy Moshi + F5-TTS + Higgs Audio v2 self-hosted in a KVKK-compliant setup.
- Build production telephony integration with Twilio + Telnyx + Turkish telecom SIP.
- Apply conversation design + turn-taking + interruption + hallucination-prevention discipline.
- Measure voice quality with WER + MOS + Voice Arena + custom Turkish benchmark + Langfuse observability.

<p>This training teaches, end to end and in Turkish, the voice-AI discipline that opened a new agent layer in the 2024-2026 period. The ecosystem became a production-grade discipline through a series of releases: OpenAI's GPT-4o Realtime API launch in October 2024; Anthropic Claude Voice and Google Gemini 2.5 Live API arriving in 2025; Sesame Maya opening the conversational-presence paradigm; Hume EVI 3's empathic voice interface; Cartesia Sonic 2's sub-100ms TTS; ElevenLabs' ultra-natural TTS + Conversational AI platform in 32 languages; the Vapi (YC W24) + Retell AI (YC S23) voice-agent orchestrators; the LiveKit Agents + Pipecat open-source frameworks; and the Moshi (Kyutai), F5-TTS, and Higgs Audio v2 open-source alternatives. Voice AI automation offers a critical advantage for the Turkish banking (BDDK IVR), healthcare (SBSGM emergency-call triage), e-commerce (Trendyol/Hepsiburada call center), and public-services (444 hotlines) sectors, yet training that addresses this discipline end to end in Turkish is virtually nonexistent. This program is designed to fill that gap as Turkey's most comprehensive production-grade voice-AI reference training.</p>

<p>The program's strategic backbone is the first module, which explains the rationale for moving from the classical 3-stage pipeline (STT → LLM text → TTS) to the native real-time speech-to-speech (S2S) LLM paradigm. In the classical pipeline the latency budget is high (STT 200ms + LLM 500ms + TTS 200ms = 900ms TTFB), and emotion + prosody information is lost because only text passes between the stages; native S2S LLMs (GPT-4o Realtime, Gemini 2.5 Live, Claude Voice, Sesame Maya, Moshi) provide <500ms TTFB + emotion preservation + interruption handling. The module compares the 2026 ecosystem map: commercial S2S (OpenAI Realtime, Claude Voice, Gemini Live), specialized voice (Sesame Maya, Hume EVI 3, ElevenLabs Conversational), open-source (Moshi, F5-TTS, Higgs Audio v2). Turkish market use cases: banking BDDK IVR automation + KVKK-compliant voice authentication, healthcare SBSGM emergency-call triage + appointment system, e-commerce customer support + returns management, public 444 hotlines + e-Government voice access.</p>
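
<p>The latency arithmetic above can be sketched in a few lines. The stage figures are the ones quoted in the text; the <code>Stage</code> type, the helper name, and the 500 ms budget constant are illustrative assumptions, not part of any vendor SDK.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int


def time_to_first_byte(stages: list[Stage]) -> int:
    """Sum stage latencies: in a cascaded pipeline the stages run one after another."""
    return sum(stage.latency_ms for stage in stages)


# Figures quoted in the text for the classical cascaded pipeline.
CLASSICAL = [Stage("stt", 200), Stage("llm", 500), Stage("tts", 200)]

# Assumed budget for a native S2S model (the text quotes "under 500 ms").
S2S_BUDGET_MS = 500


def overshoot_ms(stages: list[Stage], budget_ms: int = S2S_BUDGET_MS) -> int:
    """How far a pipeline exceeds the budget (negative means it fits)."""
    return time_to_first_byte(stages) - budget_ms
```

<p>Keeping the budget per stage makes it easy to see which stage to attack first when a pipeline misses its target.</p>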

<p>The second module covers the Realtime API end to end. OpenAI launched it in October 2024, opening the S2S paradigm, and it became the production standard in 2025-2026. WebSocket protocol + bidirectional streaming: session.update + input_audio_buffer + conversation.item events; pcm16 24kHz mono base64 audio format. gpt-4o-realtime-preview and production gpt-realtime models. 8 native voices (alloy, ash, ballad, coral, echo, sage, shimmer, verse); Turkish-accent quality improved dramatically in 2025-2026 versions. Function calling (tools array + response.function_call_arguments.delta event), interruption handling (response.cancel + input_audio_buffer.clear), server-side VAD (voice activity detection) + turn detection. Browser integration + echo cancellation over WebRTC + ephemeral-key authentication. Pricing: $0.06/min audio input + $0.24/min audio output, 50-70% cost reduction with prompt caching. A production recipe for Turkish enterprise banking + call-center practice is provided.</p>
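
<p>As a rough illustration of the event flow described above, the sketch below builds the JSON for a <code>session.update</code> event and an <code>input_audio_buffer.append</code> event carrying base64-encoded pcm16 audio. The field names follow the beta-era event shape as best recalled and should be checked against the current API reference; the helper names and the cost function are assumptions, and no network connection is opened.</p>

```python
import base64
import json
import struct


def session_update_event(voice: str, silence_ms: int) -> str:
    """Build a session.update event that enables server-side VAD with pcm16 audio."""
    event = {
        "type": "session.update",
        "session": {
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {
                "type": "server_vad",
                "silence_duration_ms": silence_ms,
            },
        },
    }
    return json.dumps(event)


def audio_append_event(samples: list[int]) -> str:
    """Pack signed 16-bit little-endian mono samples and wrap them as base64."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm).decode("ascii"),
        }
    )


def audio_cost_usd(input_min: float, output_min: float) -> float:
    """Estimate cost from the per-minute prices quoted in the text (no caching)."""
    return round(input_min * 0.06 + output_min * 0.24, 2)
```

<p>In a real client these strings would be sent over the WebSocket in order: one <code>session.update</code> after connecting, then a stream of append events as microphone audio arrives.</p>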

<p>The third module covers in detail OpenAI Whisper v3 Large (November 2023 release, 2024-2026 production standard) and the modern STT ecosystem. Whisper v3 Large: 1550M params, 99 languages (including Turkish), open-source weights. Production optimization: faster-whisper (CTranslate2 + INT8 + 4x speed), insanely-fast-whisper (batched + Flash Attention 2), WhisperX (word-level timestamps + pyannote diarization), Distil-Whisper (about 6x faster and roughly half the size, with accuracy close to the original). Fine-tuning for Turkish (Turkish Common Voice + custom dataset), biased decoding (context bias via initial_prompt parameter), custom vocabulary (medical drug names, banking IBAN format). Alternative production STT: AssemblyAI Universal-2, Deepgram Nova-3, Google Speech-to-Text v2 Chirp, Azure Speech (best commercial option, especially for Turkish), ElevenLabs Speech to Text (2024 launch). Diarization (speaker separation) via pyannote-audio + WhisperX combination.</p>
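
<p>Biased decoding through <code>initial_prompt</code> might look like the sketch below, which uses faster-whisper. The prompt wording, the <code>max_chars</code> limit, and both helper names are assumptions made for illustration; Whisper only attends to a limited prompt window, so the term list is kept short. The library import is deferred so the prompt helper can be used on its own.</p>

```python
def build_initial_prompt(terms: list[str], max_chars: int = 600) -> str:
    """Join domain terms into a short Turkish context sentence for biased decoding."""
    prefix = "Bu görüşmede geçebilecek terimler: "
    kept: list[str] = []
    for term in terms:
        candidate = prefix + ", ".join(kept + [term]) + "."
        if len(candidate) > max_chars:
            break  # stop before the prompt grows past the assumed limit
        kept.append(term)
    return prefix + ", ".join(kept) + "."


def transcribe_turkish(path: str, terms: list[str]) -> str:
    """Transcribe a Turkish audio file with INT8 faster-whisper and a biased prompt."""
    # Imported lazily so build_initial_prompt works without the dependency installed.
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(
        path,
        language="tr",
        initial_prompt=build_initial_prompt(terms),
    )
    return " ".join(segment.text.strip() for segment in segments)
```

<p>A call such as <code>transcribe_turkish("call.wav", ["IBAN", "EFT", "parasetamol"])</code> would nudge the decoder toward those spellings; the file name here is hypothetical.</p>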

<p>The fourth module covers end to end the ElevenLabs ecosystem — the TTS leader. Multilingual v2 (32 languages + Turkish ultra-natural), Turbo v2.5 (270ms latency), Flash v2.5 (75ms TTFB — sub-100ms fastest). Voice cloning: Instant Voice Cloning (1-minute sample → cloned voice) + Professional Voice Cloning (30-minute sample, high quality). Voice Design: voice generation from text prompts ('warm middle-aged male Turkish voice'). Production quality with stability + similarity_boost + style parameter tuning. ElevenLabs Conversational AI platform (2024 launch): agent + STT + TTS + LLM in single API; WebSocket streaming + chunk audio + low-latency production; Twilio integration + telephony (PSTN) call routing. Turkish enterprise voice cloning + custom brand voice + KVKK biometric data compliance are shown practically.</p>
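
<p>The parameter tuning mentioned above can be made concrete with a small request-body builder. This is a sketch under assumptions: the body shape (<code>text</code>, <code>model_id</code>, <code>voice_settings</code>) follows the ElevenLabs text-to-speech REST API as best recalled, the default values are illustrative rather than recommended, and no request is actually sent.</p>

```python
from typing import Any


def tts_request_body(
    text: str,
    stability: float = 0.5,
    similarity_boost: float = 0.75,
    style: float = 0.0,
    model_id: str = "eleven_multilingual_v2",
) -> dict[str, Any]:
    """Build a JSON body for a text-to-speech request; no network call is made."""
    settings = {
        "stability": stability,
        "similarity_boost": similarity_boost,
        "style": style,
    }
    # Each setting is treated as a 0..1 slider; reject anything outside that range.
    for name, value in settings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return {"text": text, "model_id": model_id, "voice_settings": settings}
```

<p>Validating the ranges locally catches configuration mistakes before they reach the API, which is useful when settings are edited by non-engineers.</p>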

<p>The fifth module covers in detail the specialized voice-AI platforms of the 2024-2026 ecosystem. Cartesia Sonic 2: sub-100ms TTS thanks to state-space model (Mamba) architecture — the fastest production-grade TTS; multilingual (15+ languages incl. Turkish); voice cloning; WebSocket streaming + 384-sample chunks + ultra-low latency. Hume EVI 3 (Empathic Voice Interface): 24 emotions + prosody analysis; semantic + paralinguistic dual-channel understanding; customer-support empathy + mental-health applications + cognitive-behavioral-therapy bot. Sesame Maya (2025 launch): Conversational Speech Model (CSM-1B) architecture; natural pauses + filler word ('um', 'uh') generation; Maya + Miles voices; interruption-handling mastery. Speed + empathy + naturalness decision matrix (which is optimal in which scenario) is presented in detail.</p>

<p>The sixth module covers in detail Google's March 2025 launch of Gemini 2.5 Live API and Anthropic Claude Voice. Gemini 2.5 Live: native audio + video bidirectional streaming, Affective Dialog (emotion-aware response), Proactive Audio (selective listening — responds only to relevant input), 30+ voices, multi-language seamless code-switching (Turkish + English mix). Setup with google-genai SDK. Claude Voice (Claude Sonnet 4.6 voice mode 2025): natural conversation, document grounding (Projects + Skills voice integration), function calling, interruption. Multimodal scenarios: voice + screen sharing + camera input (native in Gemini Live). OpenAI Realtime vs Gemini Live vs Claude Voice benchmark comparison: pricing, latency, quality, multilingual support, agent capabilities.</p>

<p>The seventh module covers end to end voice-agent orchestration platforms. Vapi (YC W24): voice AI orchestrator, abstraction of 50+ STT/LLM/TTS providers, Twilio + Telnyx telephony, custom function tools, assistant config (JSON-based). Retell AI (YC S23): phone-agent specialist, $0.08/min flat pricing, Telnyx + Twilio + Vonage telephony, agent + voice + handover analytics. LiveKit Agents (open-source): WebRTC + agent framework, Python SDK, Anthropic + OpenAI + custom backend orchestration, Pipecat alternative. Pipecat (open-source Daily.co): real-time voice + video AI pipeline framework. PSTN telephony integration: Twilio Voice + Telnyx + Plivo + Sinch + Vonage; Turkish telecom operator integration (TT, Vodafone, Turkcell SIP trunk); 0850 numbers + 444 hotlines + IVR routing + queue management + handover to human.</p>

<p>The eighth module covers in detail the open-source voice-AI ecosystem. Moshi (Kyutai 2024): full-duplex S2S LLM, 7B parameters, 160ms theoretical / 200ms real latency, audio-token paradigm with Mimi codec (12.5 Hz audio token, RVQ neural audio codec). Higgs Audio v2 (2025): multi-speaker conversation generation, voice cloning, open-source. F5-TTS (Shanghai Jiao Tong University, 2024): diffusion-based ultra-high-quality TTS, 8 languages incl. Turkish, voice cloning. MeloTTS open-source multilingual, CosyVoice (Alibaba 2024) + ChatTTS. Self-hosted deployment: Moshi Docker + Hugging Face weights; F5-TTS GPU inference; KVKK-compliant on-premise + banking BDDK + healthcare SBSGM data sovereignty. Cost analysis: OpenAI Realtime $0.24/min output vs self-hosted Moshi ~$0.02/min (roughly 12x cost reduction at enterprise scale).</p>
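
<p>Two of the numbers above are easy to verify by hand. The sketch below derives the Mimi frame duration from the 12.5 Hz token rate and compares the two per-minute prices quoted in the text. Reading the 160 ms theoretical latency as two 80 ms frames is an interpretation, and the function names and the monthly volume used in the usage note are hypothetical.</p>

```python
MIMI_FRAME_RATE_HZ = 12.5

# Per-minute prices quoted in the text.
API_OUTPUT_USD_PER_MIN = 0.24
SELF_HOSTED_USD_PER_MIN = 0.02


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one audio-token frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def cost_ratio(api_rate: float, self_hosted_rate: float) -> float:
    """How many times cheaper the self-hosted rate is than the API rate."""
    return api_rate / self_hosted_rate


def monthly_saving_usd(minutes: float, api_rate: float, self_hosted_rate: float) -> float:
    """Absolute monthly saving for a given number of output minutes."""
    return minutes * (api_rate - self_hosted_rate)
```

<p>For example, <code>monthly_saving_usd(100_000, API_OUTPUT_USD_PER_MIN, SELF_HOSTED_USD_PER_MIN)</code> estimates the saving at a hypothetical 100,000 output minutes per month; the self-hosted figure excludes engineering and operations effort, which the text does not quantify.</p>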

<p>The ninth module covers in detail the conversation-design discipline critical for voice agents to feel 'natural'. System prompt engineering: persona ('Mehmet, bank customer representative'), tone (formal/informal), formality (siz/sen, critical for Turkish), boundaries (refusal patterns), few-shot examples. Turn-taking strategy: server-side VAD (default 700ms silence threshold) vs client push-to-talk; end-of-utterance prediction (LLM-based vs acoustic); semantic VAD (sentence-completion detection). Interruption handling: graceful response.cancel + input_audio_buffer.clear + context preservation. Conversation flow state machine: greeting → discovery → resolution → confirmation → goodbye. Hallucination prevention: tool-grounded answer (function calling), RAG (knowledge-base query during conversation), refusal handling, escalation-to-human pattern. Turkish specifics: address rules (sayın/Mehmet bey/sen), cultural appropriateness, formal/informal tone switching, regional-accent handling.</p>
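
<p>The conversation flow named above (greeting → discovery → resolution → confirmation → goodbye) can be modelled as a small state machine. The event names and the extra handover state for escalation-to-human are assumptions added for illustration; a real agent would derive events from tool results or intent classification.</p>

```python
from enum import Enum


class State(Enum):
    GREETING = "greeting"
    DISCOVERY = "discovery"
    RESOLUTION = "resolution"
    CONFIRMATION = "confirmation"
    GOODBYE = "goodbye"
    HANDOVER = "handover_to_human"  # assumed terminal state for escalation


# Allowed transitions per state; event names are hypothetical.
TRANSITIONS: dict[State, dict[str, State]] = {
    State.GREETING: {"greeted": State.DISCOVERY},
    State.DISCOVERY: {"intent_found": State.RESOLUTION, "escalate": State.HANDOVER},
    State.RESOLUTION: {"answer_given": State.CONFIRMATION, "escalate": State.HANDOVER},
    State.CONFIRMATION: {"confirmed": State.GOODBYE, "rejected": State.DISCOVERY},
    State.GOODBYE: {},
    State.HANDOVER: {},
}


def step(state: State, event: str) -> State:
    """Apply one event; unknown events leave the state unchanged."""
    return TRANSITIONS[state].get(event, state)


def run(events: list[str]) -> State:
    """Replay a list of events from the greeting state and return the final state."""
    state = State.GREETING
    for event in events:
        state = step(state, event)
    return state
```

<p>Making the flow explicit keeps the system prompt shorter: each state can carry its own instructions, and an unexpected event cannot move the call into an undefined stage.</p>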

<p>The tenth module covers end to end the discipline of taking voice AI to production. Telephony stack: Twilio Voice + Telnyx + Plivo SIP trunk + 0850/444 number routing; Turkish Telecom + Vodafone + Turkcell SIP integration; IVR menu + queue management + skill-based routing + handover to human. Latency budget breakdown: STT 200ms + LLM 300ms + TTS 200ms = 700ms TTFB (acceptable classical pipeline), S2S LLM 500ms TTFB (excellent native paradigm). Geographic deployment: <100ms network latency via edge POPs (Istanbul AWS DC + GCP europe-west3 Frankfurt + Azure Turkey). Concurrent call scaling: WebSocket connection pool + Kubernetes HPA + 1000+ calls/sec scaling architecture. Reliability patterns: dropped-call handling + reconnect logic + fallback voice (in case of model rate limit or outage). KVKK + BDDK + banking voice-biometric compliance: call recording + transcription archive + S3 encryption + retention policy (6 months - 2 years) + audit log.</p>
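
<p>The reconnect-plus-fallback pattern listed above might be sketched as follows. The delay values, function names, and the choice of <code>ConnectionError</code> as the retryable failure are assumptions; production code would usually add random jitter and distinguish rate limits from outages.</p>

```python
import time
from typing import Callable


def backoff_delays(
    base_s: float = 0.5, factor: float = 2.0, cap_s: float = 8.0, attempts: int = 5
) -> list[float]:
    """Exponential backoff schedule, capped so a caller never waits too long."""
    return [min(cap_s, base_s * factor**i) for i in range(attempts)]


def call_with_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    delays: list[float],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the primary voice backend once per delay, then switch to the fallback."""
    for delay in delays:
        try:
            return primary()
        except ConnectionError:
            sleep(delay)  # wait before the next attempt
    return fallback()
```

<p>Passing <code>sleep</code> as a parameter keeps the retry logic testable without real waiting; on a live call the total wait should stay well below the point where the caller hangs up.</p>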

<p>The eleventh module addresses the evaluation discipline that systematically measures voice-AI system quality. STT metrics: WER (Word Error Rate, target <5% for Turkish, calculation with jiwer library), CER (Character Error Rate, more sensitive for Turkish suffixes), domain-specific accuracy (banking IBAN format recognition, medical drug-name accuracy). TTS metrics: MOS (Mean Opinion Score 1-5 human rating), naturalness, similarity to reference voice; DNSMOS + UTMOS automatic perceptual quality estimation. Conversational metrics: TTFB (Time to First Byte, how quickly voice starts), turn-taking accuracy, interruption-handling success rate, task-completion rate, user satisfaction (NPS, post-call survey). End-to-end public benchmarks: Voice Arena (LMSYS Voice equivalent), ChatbotArena Voice. Custom Turkish benchmark production: banking IVR test set (100+ scenarios), healthcare triage scenarios, e-commerce support dialogues. A/B test framework + voice observability + production trace with Langfuse + Phoenix.</p>
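
<p>To show what WER and CER measure, here is a dependency-free sketch based on Levenshtein edit distance. It is not the jiwer implementation mentioned above, and it applies no text normalization (casing, punctuation), which a real evaluation would need to define. The Turkish sample sentence is invented for illustration.</p>

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(
                min(
                    prev[j] + 1,             # deletion
                    cur[j - 1] + 1,          # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


REFERENCE = "hesabımdan para yatırmak istiyorum"
HYPOTHESIS = "hesabımdan para yatırmak istiyor"
```

<p>In this sample a single dropped suffix costs one whole word in WER but only two characters in CER, which is why the two metrics are worth reporting together for an agglutinative language such as Turkish.</p>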

<p>In the capstone module, each participant designs an end-to-end production Turkish voice-agent system for their own scenario: use-case selection (banking IVR retail / corporate, healthcare SBSGM triage, e-commerce Trendyol/Hepsiburada support, restaurant reservation agent, public 444 hotline + e-Government voice access, education student assistant); stack selection (OpenAI Realtime vs Vapi managed vs LiveKit + Cartesia vs self-hosted Moshi vs ElevenLabs Conversational AI); telephony integration (Twilio + Turkish telecom SIP trunk); conversation design + persona + tone + flow; KVKK + BDDK + sectoral compliance audit; cost + latency + quality benchmark; 90-day production deployment + scaling roadmap. By the end of the training, participants reach a level of technical competence to skillfully manage the paradigm shift from classical STT+LLM+TTS pipeline to native S2S LLM; make the right choice among OpenAI Realtime + Gemini Live + Claude Voice + Sesame Maya + Cartesia Sonic + ElevenLabs Conversational; build a Whisper v3 + faster-whisper + WhisperX + Azure Speech Turkish STT production pipeline; perform voice-agent orchestration with Vapi + Retell AI + LiveKit Agents + Pipecat; build production telephony integration with Twilio + Telnyx + Turkish telecom SIP; perform Moshi + F5-TTS + Higgs Audio v2 self-hosted KVKK-compliant deployment; apply conversation design + turn-taking + interruption + hallucination-prevention discipline; and measure production quality with WER + MOS + Voice Arena + custom Turkish benchmarks. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.</p>