TR Corpus Building: mC4-TR + OSCAR-TR + KAPAR + Wikipedia + Common Crawl + Library Scraping

Collecting a 100GB+ Turkish corpus from mC4-TR (35GB), OSCAR-TR (45GB), KAPAR (parliamentary transcripts), Wikipedia TR (2GB), a Common Crawl filter (50-200GB potential), and library scraping (TR State Library, open works). Covers licensing and KVKK (Turkey's personal data protection law) considerations, plus a practical download/tokenize pipeline.

Şükrü Yusuf KAYA

36 min read

5/14/2026

Advanced

1. Open TR Corpus Sources

Source	Size (raw)	Quality	License	Notes
mC4-TR (Google)	35 GB	medium	ODC-BY 1.0	TR subset of multilingual mC4
OSCAR-TR (INRIA)	45 GB	medium	CC0	filtered from Common Crawl
CulturaX-TR (Nguyen et al.)	60 GB	high	Apache 2.0	deduplicated + filtered
Wikipedia TR	2 GB	high	CC-BY-SA	highest quality, small volume
KAPAR (TBMM transcripts)	1.5 GB	medium	TBMM public record	parliamentary transcripts
Common Crawl TR (filter)	50-200 GB potential	low-medium	CC ToS	manual filtering required
TR books (open)	0.5 GB	high	varies	works whose copyright has expired
TR news (open)	5 GB	high	varies	some carry no license
Social media (Twitter API)	varies	low	KVKK risk	warning
GitHub TR comments	1 GB	high (code domain)	varies	code + TR hybrid

Total (cookbook default mix): 60-100 GB after filtering.

python

# === TR Corpus Download Pipeline ===
import os

import webdataset as wds
from datasets import load_dataset

# Some of these datasets may be gated on the Hugging Face Hub:
# accept their terms and log in before running.

# 1. mC4-TR (35 GB raw → ~25 GB cleaned)
# streaming=True reads records on the fly, so the dump never has to fit on local disk
mc4_tr = load_dataset("allenai/c4", "tr", split="train", streaming=True)

# 2. OSCAR-TR
oscar_tr = load_dataset("oscar-corpus/OSCAR-2301", "tr",
                        split="train", streaming=True)

# 3. CulturaX-TR (highest quality)
culturax_tr = load_dataset("uonlp/CulturaX", "tr",
                           split="train", streaming=True)

# 4. Wikipedia TR (the 20231101 dumps are hosted under wikimedia/wikipedia)
wiki_tr = load_dataset("wikimedia/wikipedia", "20231101.tr", split="train")

# 5. KAPAR (TBMM)
# Requires a manual scrape: https://www.tbmm.gov.tr/develop/owa/tutanak_g.tutanaklar
# Script provided by the cookbook: scripts/scrape_kapar.py


# Combine + shard
def to_shards(streams, shard_path, max_per_shard=10000):
    """Write each example's text into WebDataset tar shards."""
    os.makedirs(shard_path, exist_ok=True)
    shard_idx = 0
    count = 0
    writer = None  # opened lazily so no empty trailing shard is left behind
    for s in streams:
        for example in s:
            text = example["text"]
            # Keep documents longer than 50 and shorter than 100,000 characters
            if 50 < len(text) < 100000:
                if writer is None:
                    writer = wds.TarWriter(
                        os.path.join(shard_path, f"tr-{shard_idx:06d}.tar"))
                writer.write({"__key__": f"{shard_idx:06d}-{count}",
                              "text.txt": text})
                count += 1
                if count >= max_per_shard:
                    writer.close()
                    writer = None
                    shard_idx += 1
                    count = 0
    if writer is not None:
        writer.close()


to_shards([mc4_tr, oscar_tr, culturax_tr, wiki_tr], "/data/tr-shards/")

TR corpus download + shard pipeline
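
CulturaX is itself built from mC4 and OSCAR, so feeding all three streams into one set of shards will likely repeat many documents. A minimal sketch of exact-duplicate removal that could sit in front of the sharding step is shown below. The helper name `dedupe_exact` and the whitespace/case normalization are assumptions made for illustration, not the cookbook's method, and near-duplicates would still need a fuzzy technique such as MinHash.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def dedupe_exact(stream: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only examples whose normalized text has not been seen before."""
    seen = set()
    for example in stream:
        # Collapse whitespace and lowercase so trivial variants hash identically
        normalized = " ".join(example["text"].split()).lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield example


docs = [
    {"text": "Ankara Türkiye'nin başkentidir."},
    {"text": "Ankara   Türkiye'nin  başkentidir."},  # whitespace variant
    {"text": "İstanbul en kalabalık şehirdir."},
]
unique_docs = list(dedupe_exact(docs))
print(len(unique_docs))
```

Because the function is a generator, it can wrap a chained stream (for example `itertools.chain` over the loaded datasets) without materializing it. The `seen` set keeps one 20-byte digest per unique document, so memory grows with corpus size.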

2. KVKK & License Considerations

Source type	KVKK risk	Action
Wikipedia / KAPAR / public sector	low	use directly
mC4-TR / OSCAR-TR	low (general web)	add PII filter
Common Crawl	medium	aggressive filter
Twitter scraping	high	not KVKK-compliant; avoid
Reddit/forum scraping	high	not KVKK-compliant
Telegram/WhatsApp	illegal	do not use
Personal blogs / social media	high	avoid without consent
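
The table above can be encoded as a small lookup so that a pipeline refuses unknown or high-risk sources by default. This is only a sketch: the source labels (`"mc4"`, `"forum"`, and so on), the `Action` enum, and the default-to-avoid behavior are hypothetical choices, not part of the cookbook.

```python
from enum import Enum


class Action(Enum):
    USE_DIRECTLY = "use directly"
    ADD_PII_FILTER = "add PII filter"
    AGGRESSIVE_FILTER = "aggressive filter"
    AVOID = "avoid"


# Keys are hypothetical labels for the rows of the KVKK table
SOURCE_POLICY = {
    "wikipedia": Action.USE_DIRECTLY,
    "kapar": Action.USE_DIRECTLY,
    "mc4": Action.ADD_PII_FILTER,
    "oscar": Action.ADD_PII_FILTER,
    "common_crawl": Action.AGGRESSIVE_FILTER,
    "twitter": Action.AVOID,
    "forum": Action.AVOID,
    "messaging": Action.AVOID,
    "personal_blog": Action.AVOID,  # unless explicit consent exists
}


def action_for(source: str) -> Action:
    """Return the table's action; sources not listed default to AVOID."""
    return SOURCE_POLICY.get(source, Action.AVOID)


print(action_for("oscar").value)
```

Defaulting unlisted sources to `AVOID` keeps a newly added crawler from slipping into the corpus before someone has assessed its KVKK risk.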

The cookbook's rule:

Open public sources (Wikipedia, TBMM, libraries) → OK to use directly
Web crawl (mC4/OSCAR/CulturaX) → cookbook PII filter (Lesson 9.2) required
Social media → avoid, or limit to data with explicit consent
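
To make the web-crawl rule concrete, here is a rough sketch of what a PII masking pass might look like. It is not the cookbook's Lesson 9.2 filter: the names `mask_pii` and `is_valid_tckn`, the placeholder tokens, and the choice to cover only e-mail addresses and Turkish national ID numbers (TCKN) are assumptions for illustration. The TCKN check uses the standard two-digit checksum so that arbitrary 11-digit numbers are left alone.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 11 digits, first digit non-zero, not embedded in a longer digit run
TCKN_RE = re.compile(r"(?<!\d)[1-9]\d{10}(?!\d)")


def is_valid_tckn(number: str) -> bool:
    """Verify the two checksum digits of an 11-digit TCKN candidate."""
    digits = [int(c) for c in number]
    odd_sum = sum(digits[0:9:2])    # digits 1, 3, 5, 7, 9
    even_sum = sum(digits[1:8:2])   # digits 2, 4, 6, 8
    check10 = (odd_sum * 7 - even_sum) % 10
    check11 = sum(digits[:10]) % 10
    return digits[9] == check10 and digits[10] == check11


def mask_pii(text: str) -> str:
    """Replace e-mail addresses and checksum-valid TCKNs with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return TCKN_RE.sub(
        lambda m: "<TCKN>" if is_valid_tckn(m.group()) else m.group(), text)


print(mask_pii("Mail: ali@example.com, TC: 10000000146"))
```

A real filter would also need to handle phone numbers, addresses, IBANs, and names, which regular expressions alone handle poorly.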

✅ Deliverables

1) Download mC4-TR, Wikipedia TR, and KAPAR.
2) Split them into WebDataset shards.
3) Report the total size.
4) Next lesson: 9.2, TR Quality Pipeline.
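
For the size report, WebDataset shards are ordinary tar files, so the standard library is enough to count what was written. The sketch below assumes every sample is stored as a single file inside the shard (as in the pipeline above); the function name `report_shards` and the returned keys are hypothetical.

```python
import glob
import os
import tarfile


def report_shards(shard_dir: str) -> dict:
    """Count samples and bytes across all .tar shards in shard_dir."""
    shards = sorted(glob.glob(os.path.join(shard_dir, "*.tar")))
    samples = 0
    text_bytes = 0
    for path in shards:
        with tarfile.open(path) as tar:
            for member in tar:
                if member.isfile():
                    samples += 1              # one file per sample (assumption)
                    text_bytes += member.size
    disk_bytes = sum(os.path.getsize(p) for p in shards)
    return {
        "shards": len(shards),
        "samples": samples,
        "text_gb": text_bytes / 1e9,   # payload only
        "disk_gb": disk_bytes / 1e9,   # includes tar headers and padding
    }
```

Calling `report_shards("/data/tr-shards/")` after the sharding step returns the shard count, sample count, and both size figures in one dictionary.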
