TR Quality Pipeline: KenLM Perplexity + Slur/PII Filter + Educational-Value
From a raw Turkish (TR) corpus to fine-tuning-quality data: a KenLM 5-gram TR perplexity filter (catches gibberish and machine-translation artifacts), a TR slur filter, TR PII detection (TC ID, phone, email, IBAN), and an educational-value scorer (FineWeb adaptation). Cleans a 100 GB TR corpus in 4 h on an RTX 4090.
Şükrü Yusuf KAYA
32 min read
Advanced
1. 5-Stage TR Pipeline
Raw TR corpus (100 GB)
  ↓
1. Length filter (50 < len < 100K): about 5% dropped
  ↓
2. KenLM perplexity (< 250): removes gibberish and poorly machine-translated TR, 15-20% dropped
  ↓
3. TR slur/profanity filter (toxicity): 1-2% dropped
  ↓
4. PII filter (TC ID number, phone, e-mail, IBAN): matches are masked, 0% dropped
  ↓
5. Edu-value scorer (>= 2.5): 30-40% dropped
  ↓
Final: ~40-50 GB cleaned
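The five stages can be chained as a single streaming pass over the corpus. The following is a minimal sketch under stated assumptions, not the course's implementation: `run_pipeline`, `PipelineStats`, and the four callables (`perplexity`, `is_toxic`, `mask_pii`, `edu_score`) are hypothetical names, and the length bounds are assumed to be character counts, which the overview does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, Optional

MIN_LEN, MAX_LEN = 50, 100_000  # assumption: bounds are in characters
MAX_PPL = 250.0                 # stage 2 threshold from the overview
MIN_EDU = 2.5                   # stage 5 threshold from the overview


@dataclass
class PipelineStats:
    """Counts how many documents each stage dropped."""
    seen: int = 0
    kept: int = 0
    dropped: Dict[str, int] = field(default_factory=dict)


def run_pipeline(
    docs: Iterable[str],
    *,
    perplexity: Callable[[str], float],   # hypothetical: KenLM-based scorer
    is_toxic: Callable[[str], bool],      # hypothetical: slur/profanity check
    mask_pii: Callable[[str], str],       # hypothetical: PII masker
    edu_score: Callable[[str], float],    # hypothetical: edu-value scorer
    stats: Optional[PipelineStats] = None,
) -> Iterator[str]:
    """Yield cleaned documents, applying the stages in the overview's order."""
    if stats is None:
        stats = PipelineStats()
    for text in docs:
        stats.seen += 1
        reason = None
        if not (MIN_LEN < len(text) < MAX_LEN):
            reason = "length"
        elif perplexity(text) >= MAX_PPL:
            reason = "perplexity"
        elif is_toxic(text):
            reason = "toxicity"
        else:
            text = mask_pii(text)  # stage 4 rewrites the text, never drops it
            if edu_score(text) < MIN_EDU:
                reason = "edu_value"
        if reason is not None:
            stats.dropped[reason] = stats.dropped.get(reason, 0) + 1
            continue
        stats.kept += 1
        yield text
```

Passing a `PipelineStats` instance in and reading it after the generator is exhausted gives per-stage drop counts, which can be compared against the percentages listed above.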
bash
# === Train a KenLM 5-gram language model for Turkish ===
# Training data: Turkish Wikipedia + KAPAR
git clone https://github.com/kpu/kenlm
cd kenlm && mkdir build && cd build && cmake .. && make -j8

# Train (KenLM does not tokenize; feed it already-tokenized text on stdin)
zcat wiki-tr.txt.gz | ./bin/lmplz -o 5 \
    --skip_symbols --interpolate_unigrams 0 \
    --discount_fallback \
    > tr_5gram.arpa

# Quantization (-q, -b) and pointer compression (-a) require the trie structure
./bin/build_binary -a 22 -q 8 -b 8 trie tr_5gram.arpa tr_5gram.bin
# Final binary ~2.5 GB
KenLM TR 5-gram training
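Once the binary model exists, stage 2 reduces to scoring each document and keeping those under the perplexity threshold of 250. Below is a minimal sketch, assuming document-level perplexity is computed by summing per-line log10 scores; `doc_perplexity`, `keep_low_perplexity`, and `load_kenlm_scorer` are hypothetical helper names. The only KenLM calls used are `kenlm.Model(path)` and `Model.score(sentence, bos=True, eos=True)` from the `kenlm` Python module, which returns a log10 probability.

```python
from typing import Callable, Iterable, Iterator

MAX_PPL = 250.0  # stage 2 threshold from the pipeline overview


def doc_perplexity(text: str, score_log10: Callable[[str], float]) -> float:
    """Perplexity of a document, scoring each non-empty line as one sentence.

    score_log10 must return the total log10 probability of a sentence,
    including the end-of-sentence token.
    """
    total_log10, n_tokens = 0.0, 0
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        total_log10 += score_log10(" ".join(words))
        n_tokens += len(words) + 1  # +1 for the end-of-sentence token
    if n_tokens == 0:
        return float("inf")  # empty documents never pass the filter
    return 10.0 ** (-total_log10 / n_tokens)


def keep_low_perplexity(
    docs: Iterable[str],
    score_log10: Callable[[str], float],
    max_ppl: float = MAX_PPL,
) -> Iterator[str]:
    """Yield only documents whose perplexity is below max_ppl."""
    for doc in docs:
        if doc_perplexity(doc, score_log10) < max_ppl:
            yield doc


def load_kenlm_scorer(model_path: str) -> Callable[[str], float]:
    """Wrap a KenLM model (e.g. tr_5gram.bin) as a sentence scorer."""
    import kenlm  # third-party module, imported lazily

    model = kenlm.Model(model_path)
    return lambda sentence: model.score(sentence, bos=True, eos=True)
```

Text should be normalized and tokenized the same way as the training data before scoring; otherwise clean documents can receive inflated perplexity.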
python
# === TR PII Filter ===
import re


def is_tc_kimlik_no(s):
    """Validate an 11-digit TC Kimlik No (Turkish national ID) via its check digits."""
    if not s.isdigit() or len(s) != 11:
        return False
    digits = [int(c) for c in s]
    if digits[0] == 0:
        return False
    sum1 = sum(digits[i] for i in range(0, 9, 2)) * 7  # 1st, 3rd, ..., 9th digits
    sum2 = sum(digits[i] for i in range(1, 9, 2))      # 2nd, 4th, 6th, 8th digits
    check10 = (sum1 - sum2) % 10
    if digits[9] != check10:
        return False
    check11 = sum(digits[:10]) % 10
    return digits[10] == check11


def detect_and_mask_tr_pii(text):
    """Detect and mask Turkish PII."""
    # TC Kimlik No: mask only 11-digit runs that pass the checksum
    text = re.sub(r"\b\d{11}\b",
                  lambda m: "[TC_KIMLIK_MASKED]" if is_tc_kimlik_no(m.group()) else m.group(),
                  text)
    # IBAN (TR + 2 check digits, 26 characters total). Runs before the phone
    # pattern so a phone-like digit run inside an IBAN is not masked first.
    text = re.sub(r"TR\d{2}[\s\-]?(?:\d{4}[\s\-]?){5}\d{2}", "[IBAN_MASKED]", text)
    # Phone (TR formats: 05XX-XXX-XXXX, +90 5XX XXX XX XX, etc.)
    text = re.sub(r"(?:\+90\s?)?(?:0\s?)?5\d{2}[\s\-]?\d{3}[\s\-]?\d{2}[\s\-]?\d{2}",
                  "[PHONE_MASKED]", text)
    # E-mail
    text = re.sub(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
                  "[EMAIL_MASKED]", text)
    # License plate (e.g. 34 ABC 123)
    text = re.sub(r"\b\d{2}\s?[A-Z]{1,3}\s?\d{2,4}\b", "[PLATE_MASKED]", text)
    return text


# Test
sample = "Ahmet'in numarası 0532 123 45 67, e-postası ahmet@example.com, TC kimlik 12345678901."
masked = detect_and_mask_tr_pii(sample)
print(masked)
# Ahmet'in numarası [PHONE_MASKED], e-postası [EMAIL_MASKED], TC kimlik 12345678901.
# (12345678901 fails the TC Kimlik checksum, so it is not masked)
TR PII detection + masking
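The sample above only exercises the negative TC Kimlik case. To test the positive case, a checksum-valid number is needed. The sketch below derives both check digits from a 9-digit prefix using the same formula as the validator. `make_tc_kimlik_no` is a hypothetical helper, intended only for producing synthetic test inputs.

```python
def make_tc_kimlik_no(prefix9: str) -> str:
    """Append both check digits to a 9-digit prefix (synthetic test data only)."""
    if len(prefix9) != 9 or not prefix9.isdigit() or prefix9[0] == "0":
        raise ValueError("prefix must be 9 digits and must not start with 0")
    d = [int(c) for c in prefix9]
    # 10th digit: (7 * sum of 1st, 3rd, ..., 9th digits - sum of 2nd, ..., 8th) mod 10
    d10 = (sum(d[0::2]) * 7 - sum(d[1::2])) % 10
    # 11th digit: sum of the first ten digits mod 10
    d11 = (sum(d) + d10) % 10
    return f"{prefix9}{d10}{d11}"


valid = make_tc_kimlik_no("100000001")
print(valid)  # 10000000146
```

Passing a string containing `valid` through `detect_and_mask_tr_pii` should yield `[TC_KIMLIK_MASKED]`, while changing its last digit should leave the number untouched.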
✅ Deliverables
1) Test the PII filter above.
2) Train the KenLM TR model on Wiki + KAPAR.
3) Run a 1 GB sample corpus through the full pipeline.
4) Next lesson (9.3): Tokenizer Extension Lab.