
Speaker ID + Diarization FT: pyannote.audio + WavLM — Multi-Speaker Separation

Meeting and call-center transcripts need both "who is speaking" and "what is being said". This part covers pyannote.audio (Hugging Face), WavLM speaker embeddings, and the diarization pipeline (VAD → embedding → clustering), plus a call-center case study: customer vs. operator separation, fine-tuned on an RTX 4090 with a 100-hour Turkish call dataset.

Şükrü Yusuf KAYA
24 min read
Advanced
```python
# === pyannote.audio + Whisper diarization pipeline ===
from pyannote.audio import Audio, Pipeline
import whisper

# Diarization pipeline (requires a Hugging Face access token)
diar = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_xxx",
)

# ASR model
asr = whisper.load_model("large-v3-turbo")

# Helper to crop speaker turns (Whisper expects 16 kHz mono)
audio = Audio(sample_rate=16000, mono="downmix")

def transcribe_with_speakers(audio_path):
    # 1. Diarization: who speaks when
    diarization = diar(audio_path)

    # 2. Per-segment ASR
    result = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        # Crop this speaker turn from the audio file
        waveform, sample_rate = audio.crop(audio_path, turn)
        text = asr.transcribe(
            waveform.squeeze().numpy(), language="tr"
        )["text"]
        result.append({
            "speaker": speaker,
            "start": turn.start,
            "end": turn.end,
            "text": text,
        })
    return result

# Output:
# [{"speaker": "SPEAKER_00", "start": 0.0, "end": 3.4, "text": "Merhaba, nasıl yardımcı olabilirim?"},
#  {"speaker": "SPEAKER_01", "start": 3.4, "end": 8.2, "text": "Telefonumda problem var."},
#  ...]
```
pyannote + Whisper diarization pipeline
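For the call-center use case (customer vs. operator), the raw `SPEAKER_00`/`SPEAKER_01` labels still need role mapping and basic statistics. A minimal post-processing sketch over the output format shown above — the "operator speaks first" rule is a heuristic assumption for illustration, not something the pipeline provides:

```python
from collections import defaultdict

# Example segments in the same shape as the pipeline output above
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 3.4,
     "text": "Merhaba, nasıl yardımcı olabilirim?"},
    {"speaker": "SPEAKER_01", "start": 3.4, "end": 8.2,
     "text": "Telefonumda problem var."},
    {"speaker": "SPEAKER_00", "start": 8.2, "end": 12.0,
     "text": "Hemen kontrol ediyorum."},
]

# Per-speaker talk time in seconds
talk_time = defaultdict(float)
for seg in segments:
    talk_time[seg["speaker"]] += seg["end"] - seg["start"]

# Heuristic role assignment: whoever speaks first is the operator
first = segments[0]["speaker"]
roles = {spk: ("operator" if spk == first else "customer")
         for spk in talk_time}

print(dict(talk_time))
print(roles)
```

In production one would replace the first-speaker heuristic with something sturdier, e.g. matching against enrolled operator embeddings or keyword cues in the opening turn.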
✅ Part VII complete
  1. Diarization + transcription on a sample call-center recording. 2. Next part: Part VIII — Code Models & Repo-Level FT.
