Skip to content

Whisper Large-v3 / Turbo TR FT: Common Voice + Bilkent + Mozilla TR + Custom Corpus

Turkish Whisper FT — comfortable on RTX 4090 (large-v3 ~6 GB, turbo ~3 GB). Common Voice TR (180h), Bilkent TR corpus, Mozilla TR. WER (Word Error Rate) measurement, TR-specific tokenize fixes. Baseline WER 12% → FT WER 6% (~2× improvement).

Şükrü Yusuf KAYA
32 min read
Advanced
Whisper Large-v3 / Turbo TR FT: Common Voice + Bilkent + Mozilla TR + Custom Corpus
python
# === Whisper Large-v3-Turbo TR FT — RTX 4090 ===
import torch
from transformers import (
WhisperForConditionalGeneration,
WhisperProcessor,
Seq2SeqTrainingArguments,
Seq2SeqTrainer,
DataCollatorForSeq2Seq,
)
from datasets import load_dataset, Audio
 
# 1. Model + processor
model = WhisperForConditionalGeneration.from_pretrained(
"openai/whisper-large-v3-turbo",
torch_dtype=torch.bfloat16,
device_map="cuda",
)
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")
processor.tokenizer.set_prefix_tokens(language="turkish", task="transcribe")
 
# 2. Dataset — Common Voice TR
cv_tr = load_dataset("mozilla-foundation/common_voice_17_0", "tr", split="train+validation")
cv_tr = cv_tr.cast_column("audio", Audio(sampling_rate=16000))
 
# 3. Preprocess
def prepare(batch):
audio = batch["audio"]
inputs = processor.feature_extractor(audio["array"], sampling_rate=16000)
batch["input_features"] = inputs.input_features[0]
batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
return batch
 
cv_tr = cv_tr.map(prepare, num_proc=8, remove_columns=cv_tr.column_names)
 
# 4. Training args
args = Seq2SeqTrainingArguments(
output_dir="whisper-large-v3-turbo-tr",
per_device_train_batch_size=8,
gradient_accumulation_steps=2,
learning_rate=1e-5, # Whisper FT için düşük lr
warmup_steps=500,
max_steps=5000,
bf16=True,
optim="adamw_torch",
predict_with_generate=True,
generation_max_length=225,
save_steps=500, eval_steps=500,
logging_steps=25,
report_to="wandb",
)
 
trainer = Seq2SeqTrainer(
model=model, args=args,
train_dataset=cv_tr,
data_collator=DataCollatorForSeq2Seq(processor),
tokenizer=processor.feature_extractor,
)
trainer.train()
 
# Bench (RTX 4090 + Common Voice TR 180h):
# - Wall-clock: 18-22 saat
# - Peak GB: 16-18
# - WER pre-FT: 12.3%
# - WER post-FT: 5.8%
Whisper Large-v3-Turbo TR FT — RTX 4090

1. WER Evaluation#

from evaluate import load wer_metric = load("wer") predictions = trainer.predict(test_dataset) pred_str = processor.batch_decode(predictions.predictions, skip_special_tokens=True) label_str = processor.batch_decode(test_dataset["labels"], skip_special_tokens=True) # Normalize (Whisper standardı) from whisper.normalizers import BasicTextNormalizer norm = BasicTextNormalizer() pred_str_norm = [norm(p) for p in pred_str] label_str_norm = [norm(l) for l in label_str] wer = wer_metric.compute(predictions=pred_str_norm, references=label_str_norm) print(f"WER: {wer * 100:.2f}%")
✅ Teslim
  1. Common Voice TR ile mini Whisper Turbo FT (1000 sample, 1 saat). 2) WER pre/post ölç. 3) Sonraki ders: 7.3 — Türkçe Lehçe FT.

Yorumlar & Soru-Cevap

(0)
Yorum yazmak için giriş yap.
Yorumlar yükleniyor...

Related Content