Pre-training Pipeline End-to-End: Corpus → Tokenize → Pack → Train — Llama-3 Production Recipe
All stages of pre-training pipeline: corpus collection (Common Crawl, Wikipedia, code), data cleaning (deduplication, language filtering, quality scoring), tokenization batching, sequence packing strategy, document boundary handling. Llama-3 production recipe: 15T tokens, 24K H100 days compute, 70 days training.
Şükrü Yusuf KAYA
75 min read
Advanced 🏗️ Pre-training pipeline: the construction site of a modern LLM
You want to train Llama-3-8B from scratch. Compute: 24K H100-days (~$6M). Data: 15T tokens. Duration: 70 days. It is hard to convey how large this scale is in production AI. But before compute comes the pipeline: corpus collection, cleaning, dedup, tokenize, pack, shuffle. Get any stage wrong and training is either far too slow or produces a poor model. 75 minutes from now you will know how that compute budget is used efficiently, the details of the Llama-3 production recipe, and how to build your own mini pre-training pipeline.
Lesson Map (12 Sections)#
- Pre-training overview — high-level pipeline diagram
- Corpus collection — sources (Common Crawl, Wikipedia, GitHub)
- Quality filtering — heuristic + classifier-based
- Deduplication — exact + fuzzy (MinHash + LSH)
- PII removal — privacy compliance
- Tokenization — corpus → token IDs
- Sequence packing — multiple docs into max_seq_len
- Document boundary — separator tokens, attention mask
- Shuffle + epochs — data ordering strategy
- Llama-3 corpus mix — actual proportions
- Llama-3 training recipe — hyperparameters, schedule
- Cost economics — $M-level compute budget
1. Pre-training Pipeline#
1.1 High-level flow#
```
[Stage 1: Data Collection]
    Common Crawl (10B+ pages), Wikipedia (60+ languages),
    GitHub (code), Books (scientific, literature)
    ↓  ~100 TB raw text
[Stage 2: Quality Filter]
    Language detection (keep target languages),
    heuristic filters (length, structure, gibberish),
    classifier-based quality scoring (FastText)
    ↓  ~50 TB clean
[Stage 3: Deduplication]
    Exact dedup (hash-based), fuzzy dedup (MinHash + LSH)
    ↓  ~30 TB unique
[Stage 4: PII Removal]
    Email, phone, SSN scrubbing
    ↓  ~30 TB sanitized
[Stage 5: Tokenization]
    HF tokenizer (Llama-3 128K vocab)
    ↓  15T tokens (~50% compression)
[Stage 6: Sequence Packing]
    Pack into 8192-token sequences
    ↓  ~2B sequences
[Stage 7: Shuffle + Epoch]
    Distributed shuffle, ~1.5 epochs typical (Llama-3)
    ↓
[Stage 8: Training]
    24K H100 GPU-days
    ↓  final Llama-3-8B model
```
Every stage needs optimization; none should become the bottleneck.
1.2 Time budget#
A production Llama-3-scale pipeline:
- Corpus collection: 2-3 months (crawl, license review)
- Cleaning + dedup: 2-3 weeks (massively parallel processing)
- Tokenization: 1 week (parallel)
- Training: 70 days
- Total: ~6 months
1.3 Compute breakdown#
Llama-3-8B (Meta paper):
- Total: ~1.3M H100 GPU-hours
- ≈ 54K H100-days
- ≈ $3.3M on cloud (at ~$2.5/GPU-hour)
- ≈ $3M on-prem (depending on amortization)
Llama-3-70B: roughly 10x more expensive, ~$30M+.
2. Corpus Collection#
2.1 Common Crawl#
Web crawl archive — public dataset.
- 250B+ web pages
- Updated monthly
- ~50 TB compressed per snapshot
- ~100 TB extracted text
Download: https://commoncrawl.org
```python
from datasets import load_dataset

# C4: a cleaned Common Crawl snapshot, streamed to avoid a full download
cc = load_dataset("allenai/c4", "en", split="train", streaming=True)
```
2.2 Wikipedia#
- 60+ languages
- High quality text
- ~30 GB compressed (all languages)
- ~200 GB extracted text
```python
wiki = load_dataset("wikipedia", "20240401.tr", split="train")
```
2.3 GitHub (code)#
- 100M+ repositories
- Permissive licenses (MIT, Apache, BSD)
- Code: Python, JavaScript, C++, Rust, Go
- ~5 TB code text
2.4 Books#
- Project Gutenberg (public domain)
- Scientific papers (arXiv, PubMed)
- ~500 GB
2.5 Turkish corpus#
- Turkish Wikipedia (4 GB)
- OSCAR Turkish (10-15 GB)
- Turkish news (1-2 GB)
- BounWebCorpus (academic)
- ~30 GB total Turkish
Llama-3 multilingual: 5%+ non-English content.
2.6 License compliance#
Critical: the training data must be legally usable:
- Common Crawl: Robots.txt compliance
- Wikipedia: CC-BY-SA (attribution required)
- GitHub: permissive licenses only
- Books: public domain or licensed
EU AI Act 2024: transparency requirements — training data composition disclosed.
3-4. Quality + Deduplication#
3.1 Heuristic filters#
- Min line length: 5 words
- Max line length: 1000 words
- Min ratio of alphanumeric: 80%
- No more than 30% punctuation
- Detect gibberish: random character sequences
- Detect lists/tables (not natural text)
```python
def is_quality(text):
    # reject empty/very short/very long documents and mostly non-alphanumeric text
    if not text:
        return False
    words = text.split()
    if len(words) < 10 or len(words) > 10_000:
        return False
    alnum = sum(1 for c in text if c.isalnum())
    if alnum / len(text) < 0.7:
        return False
    return True
```
3.2 Classifier-based filtering#
FastText classifier: high-quality vs low-quality text.
Training data: Wikipedia as positives (high quality) vs raw Common Crawl as negatives (low). Keep documents scoring > 0.5.
Llama-3 paper: classifier-based filtering significantly improves model quality.
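FastText aside, the mechanism is just a binary text classifier plus a threshold. A toy word-count Naive Bayes stand-in (the function name and training snippets below are illustrative, not Llama-3's actual filter):

```python
import math
from collections import Counter

def train_quality_scorer(docs, labels):
    """Tiny Naive Bayes over word counts; a stand-in for a FastText classifier.
    labels: 1 = high quality (e.g., Wikipedia), 0 = low quality (raw web).
    Assumes both classes are present in the training data."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(doc.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    totals = {y: sum(counts[y].values()) for y in (0, 1)}

    def score(doc):
        # log P(y) + sum log P(w|y) with add-1 smoothing, then normalize
        logp = {y: math.log(priors[y] / len(labels)) for y in (0, 1)}
        for w in doc.lower().split():
            for y in (0, 1):
                logp[y] += math.log((counts[y][w] + 1) / (totals[y] + len(vocab)))
        m = max(logp.values())
        p = {y: math.exp(logp[y] - m) for y in (0, 1)}
        return p[1] / (p[0] + p[1])  # P(high quality | doc)

    return score
```

A `keep = [d for d in corpus if score(d) > 0.5]` pass then mirrors the threshold rule above.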
3.3 Exact deduplication#
Hash each document, drop duplicates.
```python
import hashlib

def dedupe_exact(docs):
    # drop byte-identical documents via SHA-256 content hashing
    seen = set()
    for doc in docs:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            yield doc
```
Raw Common Crawl is roughly 30% exact duplicates; hashing them out alone gives a dramatic reduction.
3.4 Fuzzy dedup (MinHash + LSH)#
Near-duplicates (paraphrases, copy-pasted articles). MinHash + Locality Sensitive Hashing:
```python
from datasketch import MinHash, MinHashLSH

def get_shingles(text, k=5):
    # contiguous k-word shingles: the unit of overlap for near-dup detection
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
for i, doc in enumerate(docs):
    m = MinHash(num_perm=128)
    for shingle in get_shingles(doc, k=5):
        m.update(shingle.encode())
    lsh.insert(str(i), m)
    # lsh.query(m) returns keys of previously inserted near-duplicates
```
3.5 Llama-3 dedup outcome#
Meta reports:
- Common Crawl raw: 100 TB
- After heuristic filter: 50 TB
- After classifier filter: 30 TB
- After exact dedup: 20 TB
- After fuzzy dedup: 15 TB
85% of the raw data gets dropped. Quality > quantity.
7-9. Sequence Packing#
7.1 Naive: one doc per sequence#
```
seq_1 = doc_1 padded to max_seq_len
seq_2 = doc_2 padded to max_seq_len
...
```
Problem: padding wastes compute. With an average doc length of 500 tokens and max_seq_len = 8192, ~94% of each sequence is padding.
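That padding figure is quick to verify; a sketch using the 500-token average from above:

```python
def slot_utilization(doc_lens, max_seq_len):
    # one doc per sequence, padded: fraction of token slots actually used
    return sum(doc_lens) / (len(doc_lens) * max_seq_len)

util = slot_utilization([500] * 1000, 8192)
print(f"{1 - util:.1%} of compute goes to padding")  # → 93.9%
```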
7.2 Sequence packing#
Pack multiple docs into one sequence:

```
seq_1 = doc_1 + <|sep|> + doc_2 + <|sep|> + doc_3 + ...
```

Fill max_seq_len completely; no padding waste.
7.3 Document boundary handling#
Key: attention shouldn't cross document boundaries.
Option A: special separator token + attention mask
```
[doc1_tokens] [<|endoftext|>] [doc2_tokens] [<|endoftext|>] ...
```
attention_mask: doc1 attends only to doc1, doc2 only to doc2.
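Concretely, the packed attention mask is block-diagonal and causal. A pure-Python sketch (real trainers build this as a tensor or use a variable-length attention kernel instead):

```python
def block_causal_mask(doc_lens):
    # mask[i][j] is True iff query token i may attend to key token j:
    # same document, and j <= i (causal)
    total = sum(doc_lens)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in doc_lens:
        for i in range(start, start + n):
            for j in range(start, i + 1):
                mask[i][j] = True
        start += n
    return mask

m = block_causal_mask([2, 2])
assert not m[2][1]  # first token of doc2 cannot see doc1
assert m[1][0]      # within doc1, later tokens see earlier ones
```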
Option B: position reset
```
position_ids = [0, 1, ..., len(doc1)-1, 0, 1, ..., len(doc2)-1, ...]
```
Llama-3's approach: position_ids reset (RoPE-friendly).
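The position reset itself is tiny; a sketch:

```python
def packed_position_ids(doc_lens):
    # RoPE positions restart at 0 for every document in the packed sequence
    pos = []
    for n in doc_lens:
        pos.extend(range(n))
    return pos

print(packed_position_ids([3, 2]))  # → [0, 1, 2, 0, 1]
```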
7.4 Pack algorithm#
```python
def pack_sequences(docs, max_seq_len, eot_token):
    # greedily concatenate tokenized docs, each followed by an end-of-text token
    sequences = []
    current = []
    for doc in docs:
        doc = doc[:max_seq_len - 1]  # truncate docs longer than one sequence
        if len(current) + len(doc) + 1 > max_seq_len:
            sequences.append(current)  # flush the full sequence
            current = []
        current.extend(doc)
        current.append(eot_token)
    if current:
        sequences.append(current)
    return sequences
```
7.5 Greedy vs optimal packing#
Greedy: keep adding docs until the next one no longer fits. Achieves 95%+ slot efficiency.
Optimal: bin packing is NP-hard; reaching exactly 100% is impractical at corpus scale.
7.6 Shuffle strategy#
A global shuffle is ideal:

```python
import random

random.shuffle(sequences)
```
For a large corpus: chunk-level shuffle combined with within-chunk shuffle (memory-efficient).
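The two-level shuffle can be sketched as follows (the chunk size is an assumed knob; production pipelines shuffle shards on disk rather than lists in memory):

```python
import random

def two_level_shuffle(seqs, chunk_size, seed=0):
    # shuffle the order of fixed-size chunks, then shuffle within each chunk;
    # approximates a global shuffle while only touching one chunk at a time
    rng = random.Random(seed)
    chunks = [seqs[i:i + chunk_size] for i in range(0, len(seqs), chunk_size)]
    rng.shuffle(chunks)
    out = []
    for chunk in chunks:
        rng.shuffle(chunk)
        out.extend(chunk)
    return out
```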
10-12. Llama-3 Training Recipe#
10.1 Corpus mix (Llama-3 paper)#
- Web (filtered Common Crawl): 60%
- Books: 15%
- Code (GitHub): 15%
- Wikipedia + scientific: 5%
- Multilingual: 5%
Total: 15T tokens unique.
10.2 Tokenization config#
- Tokenizer: tiktoken-style BPE, 128K vocab
- Pre-train sequence length: 8192
- Total sequences: ~1.8B (15T / 8192)
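The sequence count is simple arithmetic:

```python
tokens = 15e12   # 15T training tokens
seq_len = 8192   # pre-train sequence length
print(tokens / seq_len / 1e9)  # ≈ 1.83 billion sequences
```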
10.3 Optimizer#
AdamW:
- lr: 3e-4 (peak)
- β_1: 0.9, β_2: 0.95
- weight_decay: 0.1
- ε: 1e-5
- gradient_clip: 1.0
10.4 Learning rate schedule#
Cosine decay with warmup:
- Warmup: 2000 steps linear 0 → 3e-4
- Cosine decay: 3e-4 → 3e-5
- Total: 1.4M steps
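The schedule maps to a small function; a sketch using the numbers listed above:

```python
import math

def lr_at(step, peak=3e-4, floor=3e-5, warmup=2000, total=1_400_000):
    # linear warmup 0 → peak, then cosine decay peak → floor
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

assert lr_at(0) == 0.0                         # start of warmup
assert abs(lr_at(2000) - 3e-4) < 1e-12         # peak at end of warmup
assert abs(lr_at(1_400_000) - 3e-5) < 1e-12    # floor at final step
```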
10.5 Batch size#
- Global batch: 16M tokens
- = 16M / 8192 = ~2000 sequences
- Per-GPU micro-batch: 32 sequences (illustrative; depends on the parallelism layout)
- Gradient accumulation: ~1
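These numbers tie together; one pass over 15T tokens at 16M tokens per step:

```python
tokens_total = 15e12   # unique training tokens
batch_tokens = 16e6    # global batch size in tokens
steps_per_epoch = tokens_total / batch_tokens
print(steps_per_epoch)               # 937,500 steps for one epoch
print(1.5 * steps_per_epoch / 1e6)   # ≈ 1.4M steps at ~1.5 epochs
```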
10.6 Distributed setup#
- 24K H100 GPUs (e.g., 3072 nodes × 8 GPUs/node; hypothetical layout)
- Tensor parallelism: 8
- Pipeline parallelism: 16
- Data parallelism: 192 (24K / 128)
- ZeRO-3 sharding
10.7 Training schedule#
- Start: lr warmup, fp32 master weights, bf16 forward
- Day 1-7: stability checks, frequent checkpoints
- Day 8-60: stable training
- Day 61-70: lr decay phase
- Total: 1.4M steps, 70 days
10.8 Cost#
- 24K H100 × 70 days × 24 hours × $4/GPU-hour ≈ $161M list price
- Discounted enterprise rate: ~$60M
- Meta on-prem amortized: ~$30M (DC + GPU amortization)
Very rough estimate. Meta actual cost not public.
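The list-price figure reproduces directly (the $4/GPU-hour rate is the assumption above):

```python
gpus, days, rate = 24_000, 70, 4.0   # GPUs, training days, $/GPU-hour (assumed)
gpu_hours = gpus * days * 24
cost = gpu_hours * rate
print(f"{gpu_hours / 1e6:.1f}M GPU-hours → ${cost / 1e6:.0f}M")  # 40.3M GPU-hours → $161M
```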
10.9 Output#
- Llama-3-8B base model checkpoint
- 16 GB (bf16)
- Used downstream for instruct fine-tune, multiple variants
✅ Lesson 11.1 Summary: Pre-training Pipeline
The pre-training pipeline is a large operation spanning roughly 6 months. Corpus (Common Crawl, Wikipedia, code) → quality filter (heuristic + classifier) → dedup (exact + fuzzy MinHash) → PII removal → tokenize → sequence pack (max_seq_len 8192) → shuffle → train. Llama-3 recipe: 15T tokens, AdamW (peak lr 3e-4, cosine decay), 16M-token batches, 70 days on 24K H100 GPUs, $30M-60M cost. Quality > quantity: 85% of the raw data is dropped by the filters. Turkish sits in the 5% multilingual slice. In Lesson 11.2 we move on to the AdamW optimizer math.
Next Lesson: AdamW Optimizer Math#
Lesson 11.2: AdamW (Adam with decoupled weight decay, Loshchilov 2019), momentum and variance estimates, why β1=0.9 and β2=0.95, learning rate schedules (cosine, linear, warmup), gradient clipping.
Frequently Asked Questions
Is the exact Llama-3 training data composition public?
No. Meta has not made the exact training data composition public. The paper gives a general breakdown (60% web, 15% books, etc.), but the actual data files are closed. EU AI Act 2024 transparency requirements may change this.