Whisper Architecture: Log-Mel Spectrogram + Encoder-Decoder + Language Tokens
Whisper (OpenAI, 2022) is speech recognition's de facto gold standard. Anatomy: 80-bin log-mel spectrogram input (128 bins in large-v3), an encoder-decoder transformer (4–32 layers per side depending on size), and a ~50K-entry BPE tokenizer extended with multilingual, language, task, and timestamp special tokens. Model variants: tiny (39M) → large-v3 (1.5B) → turbo (809M).
Şükrü Yusuf KAYA
30 min read
Advanced

1. Whisper Pipeline#
```
audio (16 kHz mono) → 30-second window → STFT → power spectrum
  → mel filterbank (80 bins) → log → log-mel spectrogram
  → conv1d → conv1d (stride 2, downsamples time 2×)
  → encoder transformer (4–32 layers)
  → cross-attention → decoder transformer (4–32 layers)
  → text tokens (BPE, ~50K vocab)

Special tokens:
<|startoftranscript|>
<|tr|>                           # language code (Turkish)
<|transcribe|> / <|translate|>   # task
<|notimestamps|> / <|0.00|> ...  # timestamp tokens
<|endoftext|>
```
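The front half of this pipeline (STFT → power spectrum → mel filterbank → log) is straightforward to sketch in plain NumPy. The sketch below assumes Whisper's published constants (16 kHz audio, n_fft=400, hop=160, 80 mel bins); the filterbank is a generic HTK-style triangular construction, not a byte-for-byte copy of Whisper's precomputed filters, so values will differ slightly from the reference implementation:

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 16_000, 400, 160, 80  # Whisper's front-end constants

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SR):
    # Triangular filters spaced evenly on the mel scale, 0 Hz .. Nyquist.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for j in range(lo, ctr):
            fb[i, j] = (j - lo) / max(ctr - lo, 1)
        for j in range(ctr, hi):
            fb[i, j] = (hi - j) / max(hi - ctr, 1)
    return fb

def log_mel(audio):
    # STFT → power spectrum → mel projection → log10 → dynamic-range clamp.
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    window = np.hanning(N_FFT)
    frames = np.stack([audio[i * HOP : i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank().T                      # (frames, 80)
    log_spec = np.log10(np.maximum(mel, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)  # clamp to top 80 dB
    return ((log_spec + 4.0) / 4.0).T                     # (80, frames)

# One full 30 s window of noise → an 80 × ~3000-frame spectrogram
spec = log_mel(np.random.randn(SR * 30).astype(np.float32))
print(spec.shape)  # → (80, 2998)
```

With a 160-sample hop at 16 kHz, each column covers 10 ms, so a 30 s window yields roughly 3000 frames; the two conv1d layers then halve that along time before the encoder sees it.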
Model Variants#
| Model | Params | Multilingual | TR audio quality | RTX 4090 inference |
|---|---|---|---|---|
| whisper-tiny | 39M | ✅ | OK (limited for TR) | 35× realtime |
| whisper-base | 74M | ✅ | OK | 30× |
| whisper-small | 244M | ✅ | good | 22× |
| whisper-medium | 769M | ✅ | good | 16× |
| whisper-large-v3 | 1.55B | ✅ | very good | 12× |
| whisper-large-v3-turbo | 809M | ✅ | very good | 14× (4× faster than large-v3) |
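The inference column reads as a realtime factor (RTF): at N× realtime, processing takes audio duration divided by N. A trivial sanity check using the table's RTX 4090 figures:

```python
# Estimate wall-clock transcription time from a realtime factor (RTF).
def transcribe_seconds(audio_seconds: float, rtf: float) -> float:
    """At rtf× realtime, processing takes audio_seconds / rtf."""
    return audio_seconds / rtf

hour = 3600
print(transcribe_seconds(hour, 12))  # large-v3: 300.0 s per hour of audio
print(transcribe_seconds(hour, 35))  # tiny: ~103 s per hour of audio
```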
Cookbook recommendation: in production, use whisper-large-v3-turbo; it is close to large-v3 in quality and roughly 4× faster.
✅ Deliverables
1. Transcribe a Turkish audio file with Whisper large-v3.
2. Set the language token manually (`<|tr|>`).
3. Next lesson: 7.2, Whisper TR fine-tuning (Common Voice + Bilkent).
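Setting the language token manually comes down to building the decoder's forced prompt. A minimal sketch of how the special tokens from the pipeline section line up (the token strings match the spec; the helper name is ours, not a Whisper API):

```python
# Hypothetical helper: assemble Whisper's forced decoder prompt as token strings.
# The decoder is conditioned on exactly this prefix before emitting any text.
def build_prompt(language: str = "tr", task: str = "transcribe",
                 timestamps: bool = False) -> list[str]:
    assert task in ("transcribe", "translate")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt

print(build_prompt())
# → ['<|startoftranscript|>', '<|tr|>', '<|transcribe|>', '<|notimestamps|>']
```

In practice you rarely assemble these strings yourself: the `openai-whisper` package accepts `model.transcribe(path, language="tr", task="transcribe")`, and Hugging Face transformers exposes `processor.get_decoder_prompt_ids(language="tr", task="transcribe")` to the same effect.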