Whisper Architecture: Log-Mel Spectrogram + Encoder-Decoder + Language Tokens

Whisper (OpenAI, 2022) is the de-facto open-source standard for speech recognition. Anatomy: 80-bin log-mel spectrogram input (128 bins in large-v3), a 4-32 layer encoder plus decoder transformer, and a BPE tokenizer (~50K vocabulary) extended with language, task, and timestamp tokens. Model variants: tiny (39M) → large-v3 (1.55B) → turbo (809M).

Şükrü Yusuf KAYA
30 min read
Advanced

1. Whisper Pipeline

    audio (16 kHz mono)
      → 30-second window
      → STFT → power spectrum
      → mel filterbank (80 bins; 128 in large-v3)
      → log → log-mel spectrogram
      → conv1d → conv1d (stride 2, downsamples 2×)
      → encoder transformer (4-32 layers)
      → cross-attention with decoder
      → decoder transformer (4-32 layers)
      → text tokens (BPE, ~50K vocabulary)

    Special tokens:
      <|startoftranscript|>
      <|tr|>                           # language code (Turkish)
      <|transcribe|> / <|translate|>   # task
      <|notimestamps|> / <|0.00|> ...  # timestamp tokens
      <|endoftext|>
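These stages map one-to-one onto the reference openai-whisper package. A minimal sketch, assuming `pip install openai-whisper` and a placeholder `speech.wav` file:

```python
# Front end + decode sketch with the openai-whisper package.
# "speech.wav" is a placeholder path.
import whisper

audio = whisper.load_audio("speech.wav")   # resampled to 16 kHz mono float32
audio = whisper.pad_or_trim(audio)         # pad/cut to the 30 s window (480_000 samples)
mel = whisper.log_mel_spectrogram(audio)   # shape (80, 3000): 80 mel bins, 10 ms hop
# large-v3 checkpoints expect 128 bins: whisper.log_mel_spectrogram(audio, n_mels=128)

model = whisper.load_model("tiny")         # 4 encoder + 4 decoder layers
# inside the encoder: conv1d → conv1d (stride 2) halves 3000 frames to 1500 positions
options = whisper.DecodingOptions(language="tr", task="transcribe", fp16=False)
result = whisper.decode(model, mel.to(model.device), options)
print(result.text)                         # special tokens are stripped from .text
```

Passing `language="tr"` and `task="transcribe"` is what places the `<|tr|>` and `<|transcribe|>` tokens after `<|startoftranscript|>` in the decoder prompt.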

Model Variants

| Model | Params | Multilingual (TR quality) | RTX 4090 inference |
|---|---|---|---|
| whisper-tiny | 39M | OK (TR limited) | 35× realtime |
| whisper-base | 74M | OK | 30× |
| whisper-small | 244M | good | 22× |
| whisper-medium | 769M | good | 16× |
| whisper-large-v3 | 1.55B | very good | 12× |
| whisper-large-v3-turbo | 809M | very good | 14× (4× faster than large-v3) |
Cookbook recommendation: use whisper-large-v3-turbo in production; quality stays close to large-v3 at roughly 4× the speed.
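A production-style sketch, assuming a CUDA GPU and the Hugging Face transformers package; the model id `openai/whisper-large-v3-turbo` is on the Hub, and `speech.wav` is a placeholder:

```python
# Hedged sketch: serving whisper-large-v3-turbo via the transformers ASR pipeline.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,   # fp16 halves memory and speeds up inference on GPU
    device="cuda:0",
    chunk_length_s=30,           # chunked decoding for audio longer than one 30 s window
)
out = asr("speech.wav", generate_kwargs={"language": "turkish", "task": "transcribe"})
print(out["text"])
```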
✅ Deliverables
  1. Transcribe a Turkish audio file with Whisper Large-v3 (see the sketch after this list).
  2. Manually set the language token (<|tr|>).
  3. Next lesson: 7.2, Whisper TR fine-tuning (Common Voice + Bilkent).
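A minimal sketch for deliverable items 1 and 2, assuming the transformers and librosa packages; `turkish_speech.wav` is a placeholder path:

```python
# Hedged sketch: transcribe a Turkish file with large-v3 and pin the language
# token to <|tr|> rather than relying on Whisper's language auto-detection.
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

audio, _ = librosa.load("turkish_speech.wav", sr=16000)   # 16 kHz mono
# the large-v3 processor produces 128-bin log-mel features automatically
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# language="tr" forces <|tr|> (and task="transcribe" forces <|transcribe|>)
# right after <|startoftranscript|> in the decoder prompt.
ids = model.generate(inputs.input_features, language="tr", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```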
