
Streaming ASR: faster-whisper + distil-whisper — Real-Time Latency Budget < 200ms

Whisper is fast in offline (batch) mode but is not optimized for streaming. Two remedies: **faster-whisper** (a CTranslate2 reimplementation with INT8 quantization) and **distil-whisper** (a distilled student model with roughly half the layers). The target is a first-token latency under 200 ms at about 70× real time. The lesson covers a Turkish streaming setup on an RTX 4090: chunking, VAD, and partial hypotheses.

Şükrü Yusuf KAYA
24 min read
Advanced
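To make the latency target concrete, the back-of-the-envelope sketch below converts a real-time speed factor into per-chunk compute time and shows how much of a 200 ms budget would remain for buffering, VAD, and I/O. The function names and the linear-scaling assumption are illustrative and not part of faster-whisper; Whisper pads short inputs to a 30 s window, so small chunks will likely cost more than this estimate suggests.

```python
def compute_ms(audio_seconds: float, speed_factor: float) -> float:
    """Estimated processing time in ms for `audio_seconds` of audio.

    Assumption: compute scales linearly with audio duration, i.e. a model
    running at `speed_factor`x real time needs duration / speed_factor.
    """
    return audio_seconds / speed_factor * 1000.0


def remaining_budget_ms(chunk_seconds: float, speed_factor: float,
                        budget_ms: float = 200.0) -> float:
    """Budget left after model compute; everything else must fit in it."""
    return budget_ms - compute_ms(chunk_seconds, speed_factor)


# Example: chunk sizes evaluated at the 70x real-time target
for chunk_s in (0.5, 1.0, 2.0):
    print(f"{chunk_s:.1f} s chunk: compute ~{compute_ms(chunk_s, 70):.1f} ms, "
          f"headroom ~{remaining_budget_ms(chunk_s, 70):.1f} ms")
```

Under these assumptions a 1 s chunk costs about 14 ms of compute at 70× real time, so the model itself is unlikely to be the dominant term; the time spent waiting for a chunk to fill usually is.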
python
# === faster-whisper TR streaming ===
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3-turbo",
    device="cuda",
    compute_type="int8_float16",  # INT8 quantized
)

# Streaming generator: transcribe one audio chunk and yield its segments
def transcribe_stream(audio_chunk):
    segments, _ = model.transcribe(
        audio_chunk,
        language="tr",
        beam_size=5,
        vad_filter=True,  # enable VAD
        vad_parameters=dict(min_silence_duration_ms=300),
        word_timestamps=True,
    )
    # `segments` is a lazy generator; decoding happens while iterating
    for seg in segments:
        yield seg.text, seg.start, seg.end

# RTX 4090 bench:
# - large-v3 official: 12× realtime (CUDA fp16)
# - faster-whisper large-v3 INT8: 80× realtime
# - faster-whisper turbo INT8: 110× realtime
# - distil-whisper large-v3: 90× realtime
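The listing transcribes a single chunk; it does not show how chunks are cut or how partial hypotheses are stabilized. One possible approach, sketched here as an assumption rather than the lesson's method, is to re-transcribe a growing buffer at a fixed hop and commit only the word prefix that two consecutive hypotheses agree on (a simplified "local agreement" policy). `growing_buffers` and `PartialHypothesis` are hypothetical helper names, and the model call is left out so the logic can be read on its own.

```python
from typing import Iterator, List, Sequence


def growing_buffers(samples: Sequence[float], sample_rate: int,
                    hop_s: float) -> Iterator[Sequence[float]]:
    """Yield the audio seen so far every `hop_s` seconds.

    Trailing samples shorter than one hop are held back until more audio
    arrives. A real system would also trim the buffer once text is committed.
    """
    hop = int(hop_s * sample_rate)
    for end in range(hop, len(samples) + 1, hop):
        yield samples[:end]


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Longest shared leading run of words in `a` and `b`."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class PartialHypothesis:
    """Commit words only after two consecutive hypotheses agree on them."""

    def __init__(self) -> None:
        self.previous: List[str] = []   # last raw hypothesis
        self.committed: List[str] = []  # words already shown as final

    def update(self, words: List[str]) -> List[str]:
        """Feed the latest hypothesis; return the newly committed words."""
        stable = common_prefix(self.previous, words)
        newly = stable[len(self.committed):]
        self.committed.extend(newly)
        self.previous = list(words)
        return newly


# Usage with made-up hypotheses for three successive buffers
tracker = PartialHypothesis()
for hyp in (["merhaba"], ["merhaba", "dunya"], ["merhaba", "dunya", "nasilsin"]):
    print(tracker.update(hyp), tracker.committed)
```

The trade-off is one hop of extra delay before a word becomes final, in exchange for fewer visible corrections. In a real pipeline each buffer would be passed to `transcribe_stream` and the resulting segment text split into words before calling `update`.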
✅ Deliverable
  1. Install faster-whisper and transcribe a microphone stream.
  2. Measure latency.
  3. Next lesson: 7.5, Audio LLM (Qwen2-Audio).
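For the latency measurement in step 2, a minimal timing harness could look like the sketch below. `measure_latency` is a hypothetical helper, not a faster-whisper API; it accepts any callable that returns an iterable of results, such as a generator like `transcribe_stream`. Because faster-whisper returns segments lazily, the timer has to cover the iteration and not just the call.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_latency(transcribe: Callable[[object], Iterable],
                    audio_chunk: object,
                    audio_seconds: float) -> Tuple[float, float, float]:
    """Return (first_result_ms, total_ms, speed_factor) for one chunk.

    first_result_ms: time until the first item is produced
    total_ms:        time until the iterable is exhausted
    speed_factor:    audio duration divided by total processing time
    """
    start = time.perf_counter()
    first_ms = None
    for _ in transcribe(audio_chunk):
        if first_ms is None:
            first_ms = (time.perf_counter() - start) * 1000.0
    total_s = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    if first_ms is None:  # nothing was produced, e.g. VAD removed all audio
        first_ms = total_s * 1000.0
    return first_ms, total_s * 1000.0, audio_seconds / total_s


# Usage with a stand-in transcriber; swap in the real generator on a GPU
def fake_transcribe(chunk):
    time.sleep(0.01)
    yield "merhaba"
    time.sleep(0.01)
    yield "dunya"


first, total, speed = measure_latency(fake_transcribe, None, audio_seconds=1.0)
print(f"first result: {first:.1f} ms, total: {total:.1f} ms, {speed:.0f}x real time")
```

It is worth discarding the first run or two as warm-up, since model loading and CUDA initialization can inflate the initial numbers, and reporting a median over several chunks rather than a single sample.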
