Whisper Large-v3 / Turbo TR FT: Common Voice + Bilkent + Mozilla TR + Custom Corpus
Turkish Whisper FT — comfortable on RTX 4090 (large-v3 ~6 GB, turbo ~3 GB). Common Voice TR (180h), Bilkent TR corpus, Mozilla TR. WER (Word Error Rate) measurement, TR-specific tokenize fixes. Baseline WER 12% → FT WER 6% (~2× improvement).
Şükrü Yusuf KAYA
32 min read
Advancedpython
# === Whisper Large-v3-Turbo TR FT — RTX 4090 ===import torchfrom transformers import ( WhisperForConditionalGeneration, WhisperProcessor, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq,)from datasets import load_dataset, Audio # 1. Model + processormodel = WhisperForConditionalGeneration.from_pretrained( "openai/whisper-large-v3-turbo", torch_dtype=torch.bfloat16, device_map="cuda",)processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")processor.tokenizer.set_prefix_tokens(language="turkish", task="transcribe") # 2. Dataset — Common Voice TRcv_tr = load_dataset("mozilla-foundation/common_voice_17_0", "tr", split="train+validation")cv_tr = cv_tr.cast_column("audio", Audio(sampling_rate=16000)) # 3. Preprocessdef prepare(batch): audio = batch["audio"] inputs = processor.feature_extractor(audio["array"], sampling_rate=16000) batch["input_features"] = inputs.input_features[0] batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids return batch cv_tr = cv_tr.map(prepare, num_proc=8, remove_columns=cv_tr.column_names) # 4. Training argsargs = Seq2SeqTrainingArguments( output_dir="whisper-large-v3-turbo-tr", per_device_train_batch_size=8, gradient_accumulation_steps=2, learning_rate=1e-5, # Whisper FT için düşük lr warmup_steps=500, max_steps=5000, bf16=True, optim="adamw_torch", predict_with_generate=True, generation_max_length=225, save_steps=500, eval_steps=500, logging_steps=25, report_to="wandb",) trainer = Seq2SeqTrainer( model=model, args=args, train_dataset=cv_tr, data_collator=DataCollatorForSeq2Seq(processor), tokenizer=processor.feature_extractor,)trainer.train() # Bench (RTX 4090 + Common Voice TR 180h):# - Wall-clock: 18-22 saat# - Peak GB: 16-18# - WER pre-FT: 12.3%# - WER post-FT: 5.8%Whisper Large-v3-Turbo TR FT — RTX 4090
1. WER Evaluation#
from evaluate import load wer_metric = load("wer") predictions = trainer.predict(test_dataset) pred_str = processor.batch_decode(predictions.predictions, skip_special_tokens=True) label_str = processor.batch_decode(test_dataset["labels"], skip_special_tokens=True) # Normalize (Whisper standardı) from whisper.normalizers import BasicTextNormalizer norm = BasicTextNormalizer() pred_str_norm = [norm(p) for p in pred_str] label_str_norm = [norm(l) for l in label_str] wer = wer_metric.compute(predictions=pred_str_norm, references=label_str_norm) print(f"WER: {wer * 100:.2f}%")
✅ Teslim
- Common Voice TR ile mini Whisper Turbo FT (1000 sample, 1 saat). 2) WER pre/post ölç. 3) Sonraki ders: 7.3 — Türkçe Lehçe FT.
Yorumlar & Soru-Cevap
(0)Yorum yazmak için giriş yap.
Yorumlar yükleniyor...
Related Content
Part 0 — Engineering Foundations
Welcome to the Fine-Tuning Cookbook: System, Stage Taxonomy, and the Reproducibility Contract
Start LearningPart 0 — Engineering Foundations
Reproducibility Stack: Seeds, cuDNN Flags, and Deterministic CUDA — End the 'Works on My Machine' Problem
Start LearningPart 0 — Engineering Foundations