Whisper Architecture: Log-Mel Spectrogram + Encoder-Decoder + Language Tokens

Whisper (OpenAI, 2022) is the de-facto open-source standard for speech recognition. Anatomy: 80-bin log-mel spectrogram input (128 bins in large-v3), a 4-32 layer encoder plus decoder transformer, and a BPE tokenizer (~50K vocabulary) extended with language, task, and timestamp tokens. Model variants: tiny (39M) → large-v3 (1.55B) → turbo (809M).

Şükrü Yusuf KAYA
30 min read
Advanced

1. Whisper Pipeline

    audio (16 kHz mono)
      → 30-second window
      → STFT → power spectrum
      → mel filterbank (80 bins; 128 in large-v3)
      → log → log-mel spectrogram
      → conv1d → conv1d (stride 2, downsamples 2×)
      → encoder transformer (4-32 layers)
      → cross-attention with decoder
      → decoder transformer (4-32 layers)
      → text tokens (BPE, ~50K vocabulary)

    Special tokens:
      <|startoftranscript|>
      <|tr|>                           # language code (Turkish)
      <|transcribe|> / <|translate|>   # task
      <|notimestamps|> / <|0.00|> ...  # timestamp tokens
      <|endoftext|>
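These stages map one-to-one onto the reference openai-whisper package. A minimal sketch, assuming `pip install openai-whisper` and a placeholder `speech.wav` file:

```python
# Front end + decode sketch with the openai-whisper package.
# "speech.wav" is a placeholder path.
import whisper

audio = whisper.load_audio("speech.wav")   # resampled to 16 kHz mono float32
audio = whisper.pad_or_trim(audio)         # pad/cut to the 30 s window (480_000 samples)
mel = whisper.log_mel_spectrogram(audio)   # shape (80, 3000): 80 mel bins, 10 ms hop
# large-v3 checkpoints expect 128 bins: whisper.log_mel_spectrogram(audio, n_mels=128)

model = whisper.load_model("tiny")         # 4 encoder + 4 decoder layers
# inside the encoder: conv1d → conv1d (stride 2) halves 3000 frames to 1500 positions
options = whisper.DecodingOptions(language="tr", task="transcribe", fp16=False)
result = whisper.decode(model, mel.to(model.device), options)
print(result.text)                         # special tokens are stripped from .text
```

Passing `language="tr"` and `task="transcribe"` is what places the `<|tr|>` and `<|transcribe|>` tokens after `<|startoftranscript|>` in the decoder prompt.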

Model Variants

| Model | Params | Multilingual (TR quality) | RTX 4090 inference |
|---|---|---|---|
| whisper-tiny | 39M | OK (TR limited) | 35× realtime |
| whisper-base | 74M | OK | 30× |
| whisper-small | 244M | good | 22× |
| whisper-medium | 769M | good | 16× |
| whisper-large-v3 | 1.55B | very good | 12× |
| whisper-large-v3-turbo | 809M | very good | 14× (4× faster than large-v3) |
Cookbook recommendation: use whisper-large-v3-turbo in production; quality stays close to large-v3 at roughly 4× the speed.
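A production-style sketch, assuming a CUDA GPU and the Hugging Face transformers package; the model id `openai/whisper-large-v3-turbo` is on the Hub, and `speech.wav` is a placeholder:

```python
# Hedged sketch: serving whisper-large-v3-turbo via the transformers ASR pipeline.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,   # fp16 halves memory and speeds up inference on GPU
    device="cuda:0",
    chunk_length_s=30,           # chunked decoding for audio longer than one 30 s window
)
out = asr("speech.wav", generate_kwargs={"language": "turkish", "task": "transcribe"})
print(out["text"])
```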
✅ Deliverables
  1. Transcribe a Turkish audio file with Whisper Large-v3 (see the sketch after this list).
  2. Manually set the language token (<|tr|>).
  3. Next lesson: 7.2, Whisper TR fine-tuning (Common Voice + Bilkent).
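A minimal sketch for deliverable items 1 and 2, assuming the transformers and librosa packages; `turkish_speech.wav` is a placeholder path:

```python
# Hedged sketch: transcribe a Turkish file with large-v3 and pin the language
# token to <|tr|> rather than relying on Whisper's language auto-detection.
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

audio, _ = librosa.load("turkish_speech.wav", sr=16000)   # 16 kHz mono
# the large-v3 processor produces 128-bin log-mel features automatically
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# language="tr" forces <|tr|> (and task="transcribe" forces <|transcribe|>)
# right after <|startoftranscript|> in the decoder prompt.
ids = model.generate(inputs.input_features, language="tr", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```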
