
TR Quality Pipeline: KenLM Perplexity + Slur/PII Filter + Educational-Value

From raw TR corpus to quality FT data: KenLM 5-gram TR perplexity (gibberish/MT artifact filter), TR slur filter, TR PII detection (TC ID, phone, email), educational-value scorer (FineWeb adaptation). Clean 100GB TR corpus in 4h on RTX 4090.

Şükrü Yusuf KAYA
32 min read
Advanced

1. The 5-Stage TR Pipeline

Raw TR corpus (100 GB)
  ↓ 1. Length filter (50 < len < 100K): drops ~5%
  ↓ 2. KenLM perplexity (< 250): gibberish and poor machine-translated TR, drops ~15-20%
  ↓ 3. TR slur/profanity filter: toxicity, drops ~1-2%
  ↓ 4. PII filter (TC ID, phone, e-mail, IBAN): masked in place, drops ~0%
  ↓ 5. Edu-value scorer (>= 2.5): drops ~30-40%
  ↓
Final: ~40-50 GB cleaned
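The five stages above can be sketched as one streaming filter. This is a minimal sketch: `score_perplexity`, `mask_pii`, `edu_value`, and `SLUR_WORDS` are hypothetical stand-ins you would replace with the real KenLM model, the PII masker, a trained edu-value scorer, and an actual TR slur list.

```python
# Sketch of the 5-stage pipeline as a generator chain.
# Thresholds come from the diagram above; all scorers are stand-ins.
import re

SLUR_WORDS = {"ornek_kufur"}  # placeholder blocklist, not a real slur list

def length_ok(doc: str) -> bool:
    return 50 < len(doc) < 100_000          # stage 1

def score_perplexity(doc: str) -> float:
    return 100.0                            # stand-in for the KenLM model

def has_slur(doc: str) -> bool:
    tokens = set(re.findall(r"\w+", doc.lower()))
    return bool(tokens & SLUR_WORDS)        # stage 3

def mask_pii(doc: str) -> str:
    return doc                              # stand-in for the PII masker

def edu_value(doc: str) -> float:
    return 3.0                              # stand-in for the edu-value scorer

def clean(corpus):
    for doc in corpus:
        if not length_ok(doc):
            continue
        if score_perplexity(doc) >= 250:    # stage 2
            continue
        if has_slur(doc):
            continue
        doc = mask_pii(doc)                 # stage 4: mask, don't drop
        if edu_value(doc) < 2.5:            # stage 5
            continue
        yield doc

docs = ["x" * 10, "Türkçe eğitim metni. " * 10, "ornek_kufur içeren metin " * 5]
print(len(list(clean(docs))))  # 1 — the short doc and the slur doc are dropped
```

Note that stage 4 is the only stage that transforms rather than drops: PII-bearing documents stay in the corpus with the PII masked.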
```bash
# === Train a KenLM TR 5-gram language model ===
# Trained on Wikipedia TR + KAPAR
git clone https://github.com/kpu/kenlm
cd kenlm && mkdir build && cd build && cmake .. && make -j8

# Tokenize + train
zcat wiki-tr.txt.gz | ./bin/lmplz -o 5 \
    --skip_symbols --interpolate_unigrams 0 \
    --discount_fallback \
    > tr_5gram.arpa

# Quantized binary (the -q/-b quantization flags require the trie structure)
./bin/build_binary -a 22 -q 8 -b 8 trie tr_5gram.arpa tr_5gram.bin
# Final binary ~2.5 GB
```
KenLM TR 5-gram training
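With the binary built, the stage-2 gate is a per-document perplexity cutoff. The conversion from a summed log10 probability to per-token perplexity is pure math and runs as-is; the commented lines sketch how the pip `kenlm` module would plug in, assuming `Model.score` returns a base-10 log probability as in the stock bindings.

```python
import math

def perplexity_from_log10(total_log10: float, n_tokens: int) -> float:
    """Convert a summed log10 probability into per-token perplexity."""
    return 10.0 ** (-total_log10 / n_tokens)

# With the pip `kenlm` module, scoring would look roughly like this
# (assumption: score() returns log10 prob with <s>/</s> included):
#
#   import kenlm
#   model = kenlm.Model("tr_5gram.bin")
#   def doc_perplexity(text):
#       n = len(text.split()) + 1            # +1 for </s>
#       return perplexity_from_log10(model.score(text, bos=True, eos=True), n)
#
#   keep = doc_perplexity(doc) < 250         # stage-2 threshold

# Sanity check: 10 tokens, each with probability 0.01 -> perplexity 100
total = 10 * math.log10(0.01)
print(round(perplexity_from_log10(total, 10), 1))  # 100.0
```

A hard cutoff of 250 is corpus-dependent; in practice it is worth histogramming perplexities on a sample before fixing the threshold.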
```python
# === TR PII Filter ===
import re

# TC ID number (11 digits, checksum formula)
def is_tc_kimlik_no(s):
    if not s.isdigit() or len(s) != 11:
        return False
    digits = [int(c) for c in s]
    if digits[0] == 0:
        return False
    sum1 = sum(digits[i] for i in range(0, 9, 2)) * 7
    sum2 = sum(digits[i] for i in range(1, 9, 2))
    check10 = (sum1 - sum2) % 10
    if digits[9] != check10:
        return False
    check11 = sum(digits[:10]) % 10
    return digits[10] == check11

def detect_and_mask_tr_pii(text):
    """Detect and mask TR PII."""
    # TC ID number
    text = re.sub(r"\b\d{11}\b",
                  lambda m: "[TC_KIMLIK_MASKED]" if is_tc_kimlik_no(m.group()) else m.group(),
                  text)

    # Phone (TR formats: 05XX-XXX-XXXX, +90 5XX XXX XX XX, etc.)
    text = re.sub(r"(?:\+90\s?)?(?:0\s?)?5\d{2}[\s\-]?\d{3}[\s\-]?\d{2}[\s\-]?\d{2}",
                  "[PHONE_MASKED]", text)

    # E-mail
    text = re.sub(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
                  "[EMAIL_MASKED]", text)

    # IBAN (starts with TR + 2 check digits, 26 characters total)
    text = re.sub(r"TR\d{2}[\s\-]?(?:\d{4}[\s\-]?){5}\d{2}",
                  "[IBAN_MASKED]", text)

    # License plate (e.g. 34 ABC 123)
    text = re.sub(r"\b\d{2}\s?[A-Z]{1,3}\s?\d{2,4}\b",
                  "[PLATE_MASKED]", text)

    return text

# Test
sample = "Ahmet'in numarası 0532 123 45 67, e-postası ahmet@example.com, TC kimlik 12345678901."
masked = detect_and_mask_tr_pii(sample)
print(masked)
# "Ahmet'in numarası [PHONE_MASKED], e-postası [EMAIL_MASKED], TC kimlik 12345678901."
# (the TC ID fails the checksum, so it is not masked)
```
TR PII detection + masking
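Stage 5 is the edu-value gate. A minimal sketch of the thresholding, assuming scores on FineWeb-Edu's 0-5 scale; the classifier shown in the comment (FineWeb-Edu's `HuggingFaceFW/fineweb-edu-classifier` via `transformers.pipeline`) is an English model, so a TR adaptation would need its own fine-tune, and its use here is an assumption rather than the author's exact setup.

```python
# Stage-5 sketch: keep documents whose educational-value score >= 2.5.
# Scores are assumed to be on FineWeb-Edu's 0-5 scale; the scorer
# internals (a fine-tuned classifier) are out of scope here.

def filter_by_edu_value(docs_with_scores, threshold=2.5):
    kept = [doc for doc, score in docs_with_scores if score >= threshold]
    drop_rate = 1 - len(kept) / len(docs_with_scores)
    return kept, drop_rate

# The scores themselves would come from something like the FineWeb-Edu
# classifier (assumption, English-only; a TR run needs a TR fine-tune):
#
#   from transformers import pipeline
#   scorer = pipeline("text-classification",
#                     model="HuggingFaceFW/fineweb-edu-classifier")

scored = [("ders notu", 3.8), ("spam", 0.4), ("forum", 2.1), ("makale", 2.5)]
kept, rate = filter_by_edu_value(scored)
print(len(kept), round(rate, 2))  # 2 0.5
```

The ~30-40% drop rate quoted in the pipeline diagram depends heavily on this threshold; 2.5 sits at the midpoint of the 0-5 scale.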
✅ Assignment
  1. Test the PII filter above.
  2. Train the KenLM TR model on Wiki+KAPAR.
  3. Run a 1 GB sample corpus through the full pipeline.
  4. Next lesson: 9.3, Tokenizer Extension Lab.

