# TR Corpus Building: mC4-TR + OSCAR-TR + KAPAR + Wikipedia + Common Crawl + Library Scraping

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-tr-corpus-insasi-multi-source
> Updated: 2026-05-14T14:42:56.025Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part IX — Turkish-First & Localization Engineering

**TLDR:** How to assemble a Turkish corpus of 100GB or more from multiple sources: mC4-TR (35GB), OSCAR-TR (45GB), KAPAR (parliamentary transcripts), Wikipedia TR (2GB), a filtered Common Crawl extract (potentially 50-200GB), and library scraping (TR State Library, open works). Covers licensing and KVKK (Turkey's personal data protection law) considerations, plus a practical download-and-tokenize pipeline.
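
As a rough sanity check on the 100GB+ target, the sizes quoted in the summary can be tallied in a small manifest. This is only a sketch: the `Source` class and `size_range_gb` helper are hypothetical names introduced here, and KAPAR and library scraping are left out of the totals because the summary gives no size for them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Source:
    """One corpus source with its estimated size range in GB (hypothetical helper)."""
    name: str
    min_gb: Optional[float]  # None means the summary states no size
    max_gb: Optional[float]


# Sizes copied from the summary; unsized sources are kept in the manifest as None.
SOURCES = [
    Source("mC4-TR", 35, 35),
    Source("OSCAR-TR", 45, 45),
    Source("KAPAR", None, None),
    Source("Wikipedia TR", 2, 2),
    Source("Common Crawl filter", 50, 200),
    Source("Library scraping", None, None),
]


def size_range_gb(sources: Sequence[Source]) -> Tuple[float, float]:
    """Sum the lower and upper size estimates over sources with a stated size."""
    known = [s for s in sources if s.min_gb is not None and s.max_gb is not None]
    return sum(s.min_gb for s in known), sum(s.max_gb for s in known)


low, high = size_range_gb(SOURCES)
unsized = [s.name for s in SOURCES if s.min_gb is None]
print(f"Sized sources: {low:.0f}-{high:.0f} GB; not counted: {', '.join(unsized)}")
```

Under these figures the sized sources alone come to 132-282 GB before any deduplication or quality filtering, so the 100GB+ target depends mainly on how much of the Common Crawl extract survives filtering.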

