
Pre-training Pipeline End-to-End: Corpus → Tokenize → Pack → Train — Llama-3 Production Recipe

All stages of the pre-training pipeline: corpus collection (Common Crawl, Wikipedia, code), data cleaning (deduplication, language filtering, quality scoring), tokenization and batching, sequence packing strategy, and document boundary handling. Llama-3 production recipe: 15T tokens, 24K H100 days of compute, 70 days of training.

Şükrü Yusuf KAYA
75 min read
Advanced
🏗️ Pre-training pipeline — the construction site of the modern LLM
You want to train Llama-3-8B from scratch. Compute: 24K H100 days (~$6M). Data: 15T tokens. Duration: 70 days. It is hard to convey how large this scale is in production AI. But before the compute comes the pipeline: corpus collection, cleaning, dedup, tokenize, pack, shuffle. Get any stage wrong and training either runs far too slowly or produces a bad model. 75 minutes from now: you will understand how 24K H100 days of compute is used efficiently, know the details of the Llama-3 production recipe, and be able to build your own mini pre-training pipeline.

Lesson Map (12 Sections)#

  1. Pre-training overview — high-level pipeline diagram
  2. Corpus collection — kaynaklar (Common Crawl, Wikipedia, GitHub)
  3. Quality filtering — heuristic + classifier-based
  4. Deduplication — exact + fuzzy (MinHash + LSH)
  5. PII removal — privacy compliance
  6. Tokenization — corpus → token IDs
  7. Sequence packing — multiple docs into max_seq_len
  8. Document boundary — separator tokens, attention mask
  9. Shuffle + epochs — data ordering strategy
  10. Llama-3 corpus mix — actual proportions
  11. Llama-3 training recipe — hyperparameters, schedule
  12. Cost economics — $M-level compute budget

1. Pre-training Pipeline#

1.1 High-level flow#

[Stage 1: Data Collection]
  ↓ Common Crawl (10B+ pages)
  ↓ Wikipedia (60+ languages)
  ↓ GitHub (code)
  ↓ Books (scientific, literature)
  ↓ ~100TB raw text
[Stage 2: Quality Filter]
  ↓ Language detection (filter target languages)
  ↓ Heuristic filters (length, structure, gibberish)
  ↓ Classifier-based quality scoring (FastText)
  ↓ ~50TB clean
[Stage 3: Deduplication]
  ↓ Exact dedup (hash-based)
  ↓ Fuzzy dedup (MinHash + LSH)
  ↓ ~30TB unique
[Stage 4: PII Removal]
  ↓ Email, phone, SSN scrubbing
  ↓ ~30TB sanitized
[Stage 5: Tokenization]
  ↓ HF tokenizer (Llama-3 128K vocab)
  ↓ 15T tokens (50% compression)
[Stage 6: Sequence Packing]
  ↓ Pack into 8192-token sequences
  ↓ ~2B sequences
[Stage 7: Shuffle + Epoch]
  ↓ Distributed shuffle
  ↓ 1.5 epochs typical (Llama-3)
[Stage 8: Training]
  ↓ 24K H100 GPU days
  ↓ Final Llama-3-8B model
Every stage needs optimization; no stage should become the bottleneck.

1.2 Time budget#

Production Llama-3 pipeline:
  • Corpus collection: 2-3 months (crawl, license review)
  • Cleaning + dedup: 2-3 weeks (massive parallel processing)
  • Tokenization: 1 week (parallel)
  • Training: 70 days
  • Total: ~6 months

1.3 Compute breakdown#

Llama-3-8B (Meta paper):
  • Total: 1.3M GPU hours H100
  • = 24K H100 days
  • = ~$6M cloud cost (spot rate $2.5/hour)
  • = ~$3M on-prem (depending on amortization)
Llama-3-70B: ~10x more expensive, ~$60M. Llama-3-405B: ~100x more expensive, ~$500M+.

2. Corpus Collection#

2.1 Common Crawl#

Web crawl archive — public dataset.
  • 250B+ web pages
  • Updated monthly
  • ~50 TB compressed per snapshot
  • ~100 TB extracted text
from datasets import load_dataset

cc = load_dataset("allenai/c4", "en", split="train", streaming=True)
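A quick sanity check on the stream (a minimal sketch; islice just peeks at the first few documents without downloading the full dataset):

from itertools import islice

# Inspect a few documents from the streaming dataset
for example in islice(cc, 3):
    print(len(example["text"].split()), example["text"][:80])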

2.2 Wikipedia#

  • 60+ languages
  • High quality text
  • ~30 GB compressed (all languages)
  • ~200 GB extracted text
wiki = load_dataset("wikimedia/wikipedia", "20231101.tr", split="train")  # prebuilt Turkish dump

2.3 GitHub (code)#

  • 100M+ repositories
  • Permissive licenses (MIT, Apache, BSD)
  • Code: Python, JavaScript, C++, Rust, Go
  • ~5 TB code text
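For code, a common starting point is a pre-filtered permissive-license dump such as The Stack (a sketch; the dataset is gated and requires accepting its terms on the Hugging Face Hub):

from datasets import load_dataset

# Python subset of The Stack (permissive licenses only, per the dataset card)
stack = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)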

2.4 Books#

  • Project Gutenberg (public domain)
  • Scientific papers (arXiv, PubMed)
  • ~500 GB

2.5 Turkish corpus#

  • Turkish Wikipedia (4 GB)
  • OSCAR Turkish (10-15 GB)
  • Turkish news (1-2 GB)
  • BounWebCorpus (academic)
  • ~30 GB total Turkish
Llama-3 multilingual: 5%+ non-English content.

2.6 License compliance#

Critical: training data must be legal to use:
  • Common Crawl: Robots.txt compliance
  • Wikipedia: CC-BY-SA (attribution required)
  • GitHub: permissive licenses only
  • Books: public domain or licensed
EU AI Act 2024: transparency requirements — training data composition disclosed.

3-4. Quality + Deduplication#

3.1 Heuristic filters#

  • Min line length: 5 words
  • Max line length: 1000 words
  • Min ratio of alphanumeric: 80%
  • No more than 30% punctuation
  • Detect gibberish: random character sequences
  • Detect lists/tables (not natural text)
def is_quality(text):
    # Example thresholds; production pipelines combine many more heuristics
    if not text:
        return False
    words = text.split()
    if len(words) < 10 or len(words) > 10000:
        return False
    alnum = sum(1 for c in text if c.isalnum())
    if alnum / len(text) < 0.7:
        return False
    return True

3.2 Classifier-based filtering#

FastText classifier: high-quality vs low-quality text. Training data: Wikipedia (high) vs raw Common Crawl (low). Documents scoring > 0.5 are kept; a sketch follows below.
Llama-3 paper: classifier-based filtering significantly improves model quality.
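A minimal fastText sketch, assuming a training file quality_train.txt whose lines are labeled __label__hq (Wikipedia) or __label__lq (raw Common Crawl); the file name, labels, and hyperparameters are illustrative:

import fasttext

# Supervised classifier: one labeled line per training document
model = fasttext.train_supervised(input="quality_train.txt", epoch=5, wordNgrams=2)

def keep(text, threshold=0.5):
    # predict() expects a single line, so strip newlines first
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__hq" and probs[0] > threshold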

3.3 Exact deduplication#

Hash each document, drop duplicates.
import hashlib

seen = set()

def dedupe_exact(docs):
    # Drop byte-identical documents via a content hash
    for doc in docs:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            yield doc
Roughly 30% of raw Common Crawl is exact-duplicate content, so this step alone is a dramatic reduction.

3.4 Fuzzy dedup (MinHash + LSH)#

Near-duplicates (paraphrases, copy-pasted articles) survive exact dedup. MinHash + Locality-Sensitive Hashing catches them:
from datasketch import MinHash, MinHashLSH

def get_shingles(text, k=5):
    # Overlapping k-word shingles of a document
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
for i, doc in enumerate(docs):
    m = MinHash(num_perm=128)
    for shingle in get_shingles(doc, k=5):
        m.update(shingle.encode())
    lsh.insert(i, m)
# lsh.query(m) returns keys of candidate near-duplicates for a MinHash

3.5 Llama-3 dedup outcome#

Meta reports:
  • Common Crawl raw: 100 TB
  • After heuristic filter: 50 TB
  • After classifier filter: 30 TB
  • After exact dedup: 20 TB
  • After fuzzy dedup: 15 TB
85% of the data is dropped. Quality > quantity.

7-9. Sequence Packing#

7.1 Naive: one doc per sequence#

seq_1 = doc_1 padded to max_seq_len
seq_2 = doc_2 padded to max_seq_len
...
Problem: padding wastes compute. With an average doc length of 500 tokens and max_seq = 8192, about 94% of every sequence is padding (1 − 500/8192 ≈ 0.94).

7.2 Sequence packing#

Multiple docs pack into one sequence:
seq_1 = doc_1 + <|sep|> + doc_2 + <|sep|> + doc_3 + ...
Fill max_seq_len fully. No padding waste.

7.3 Document boundary handling#

Key: attention shouldn't cross document boundaries.
Option A: special separator token + attention mask
[doc1_tokens] [<|endoftext|>] [doc2_tokens] [<|endoftext|>] ...
attention_mask: doc1 attends only to doc1, doc2 attends only to doc2.
Option B: position reset
position_ids = [0, 1, ..., len(doc1)-1, 0, 1, ..., len(doc2)-1, ...]
Llama-3's approach: position_ids reset (RoPE-friendly). A sketch of both options follows below.
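A minimal sketch of both mechanisms for a single packed sequence, assuming every document ends with an EOT token (the function name and tensor layout are illustrative):

import torch

def boundary_tensors(token_ids, eot_id):
    # Assign each position a document index (increments after every EOT)
    doc_ids, d = [], 0
    for t in token_ids:
        doc_ids.append(d)
        if t == eot_id:
            d += 1
    doc_ids = torch.tensor(doc_ids)
    n = len(token_ids)

    # Option B: position_ids reset to 0 at every document start
    position_ids = torch.zeros(n, dtype=torch.long)
    for doc in doc_ids.unique():
        idx = (doc_ids == doc).nonzero(as_tuple=True)[0]
        position_ids[idx] = torch.arange(len(idx))

    # Option A: causal mask that also blocks cross-document attention
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    attention_mask = causal & (doc_ids[:, None] == doc_ids[None, :])
    return position_ids, attention_mask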

7.4 Pack algorithm#

def pack_sequences(docs, max_seq_len, eot_token):
    # Greedy packing: append docs (each followed by EOT) until the next
    # doc would overflow, then flush and start a new sequence
    sequences = []
    current = []
    for doc in docs:
        doc = doc[: max_seq_len - 1]  # clip docs longer than one sequence
        if len(current) + len(doc) + 1 > max_seq_len:
            sequences.append(current)  # flush current sequence
            current = []
        current.extend(doc)
        current.append(eot_token)
    if current:
        sequences.append(current)
    return sequences

7.5 Greedy vs optimal packing#

Greedy: fill until nothing more fits, which reaches 95%+ efficiency. Optimal packing is a bin-packing problem (NP-hard), so exact 100% utilization is not worth chasing.

7.6 Shuffle strategy#

Global shuffle ideal:
import random

random.shuffle(sequences)
For a large corpus: chunk-level shuffle + within-chunk shuffle (memory-efficient); see the sketch below.
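A two-level shuffle sketch (load_chunk is a hypothetical loader; at corpus scale the chunks would live in sharded files on disk):

import random

def chunked_shuffle(chunk_paths, seed=42):
    rng = random.Random(seed)
    order = list(range(len(chunk_paths)))
    rng.shuffle(order)                      # level 1: shuffle chunk order
    for i in order:
        chunk = load_chunk(chunk_paths[i])  # hypothetical: one chunk fits in memory
        rng.shuffle(chunk)                  # level 2: shuffle within the chunk
        yield from chunk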

10-12. Llama-3 Training Recipe#

10.1 Corpus mix (Llama-3 paper)#

  • Web (filtered Common Crawl): 60%
  • Books: 15%
  • Code (GitHub): 15%
  • Wikipedia + scientific: 5%
  • Multilingual: 5%
Total: 15T tokens unique.
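With Hugging Face datasets, a mixture like this can be approximated by weighted interleaving (a sketch; the five dataset variables are placeholders for the sources above):

from datasets import interleave_datasets

# web, books, code, wiki, multi: streaming datasets for each source (placeholders)
mixed = interleave_datasets(
    [web, books, code, wiki, multi],
    probabilities=[0.60, 0.15, 0.15, 0.05, 0.05],
    seed=42,
)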

10.2 Tokenization config#

  • Tokenizer: tiktoken-style BPE, 128K vocab
  • Pre-train sequence length: 8192
  • Total sequences: ~1.8B (15T / 8192)
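To inspect this tokenizer yourself (gated repo; requires accepting the Llama-3 license on the Hugging Face Hub):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(len(tok))                            # vocab size, ~128K
print(tok("Merhaba dünya!")["input_ids"])  # text -> token IDs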

10.3 Optimizer#

AdamW:
  • lr: 3e-4 (peak)
  • β_1: 0.9, β_2: 0.95
  • weight_decay: 0.1
  • ε: 1e-5
  • gradient_clip: 1.0
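In PyTorch, this configuration looks like the following sketch (model is assumed to exist):

import torch

optimizer = torch.optim.AdamW(
    model.parameters(),   # `model` assumed to be defined
    lr=3e-4,              # peak LR; the schedule is applied separately
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

# Applied every step, just before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)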

10.4 Learning rate schedule#

Cosine decay with warmup:
  • Warmup: 2000 steps linear 0 → 3e-4
  • Cosine decay: 3e-4 → 3e-5
  • Total: 1.4M steps
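The schedule as a function of step, matching the numbers above (a minimal sketch):

import math

def lr_at(step, peak=3e-4, floor=3e-5, warmup=2000, total=1_400_000):
    if step < warmup:                       # linear warmup: 0 -> peak
        return peak * step / warmup
    t = (step - warmup) / (total - warmup)  # cosine decay: peak -> floor
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))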

10.5 Batch size#

  • Global batch: 16M tokens
  • = 16M / 8192 = ~2000 sequences
  • Per-GPU batch: 32 sequences (in a 64-way data-parallel example)
  • Gradient accumulation: ~1

10.6 Distributed setup#

  • 24K H100 GPUs (1024 nodes × 24 GPUs/node, hypothetical)
  • Tensor parallelism: 8
  • Pipeline parallelism: 16
  • Data parallelism: 192 (24K / 128)
  • ZeRO-3 sharding

10.7 Training schedule#

  • Start: lr warmup, fp32 master weights, bf16 forward
  • Day 1-7: stability checks, frequent checkpoints
  • Day 8-60: stable training
  • Day 61-70: lr decay phase
  • Total: 1.4M steps, 70 days

10.8 Cost#

  • 24K H100 × 70 days × 24 hours × $4/hour ≈ $161M
  • Discounted enterprise rate: ~$60M
  • Meta on-prem amortized: ~$30M (DC + GPU amortization)
Very rough estimate. Meta actual cost not public.

10.9 Output#

  • Llama-3-8B base model checkpoint
  • 16 GB (bf16)
  • Used downstream for instruct fine-tune, multiple variants
✅ Lesson 11.1 Summary — Pre-training Pipeline
The pre-training pipeline is a large operation spanning roughly 6 months. Corpus (Common Crawl, Wikipedia, code) → quality filter (heuristic + classifier) → dedup (exact + fuzzy MinHash) → PII removal → tokenize → sequence pack (max_seq 8192) → shuffle → train. Llama-3 recipe: 15T tokens, AdamW (lr 3e-4 peak, cosine decay), batch 16M tokens, 70 days on 24K H100 GPUs, $30M-60M cost. Quality > quantity: 85% of raw data is dropped by the filters. Turkish lives in the 5% multilingual slice. Lesson 11.2 moves on to the AdamW optimizer math.

Next Lesson: AdamW Optimizer Math#

Lesson 11.2: AdamW (Adam + decoupled weight decay, Loshchilov 2019), momentum + variance estimates, why β1=0.9 and β2=0.95, learning rate schedules (cosine, linear, warmup), gradient clipping.

Frequently Asked Questions

Is the exact Llama-3 training data composition public?
No. Meta has not released the exact training data composition. The paper gives a general breakdown (60% web, 15% books, etc.), but the actual data files are closed. EU AI Act 2024 transparency requirements may change this.
