
Pre-training Pipeline End-to-End: Corpus → Tokenize → Pack → Train — Llama-3 Production Recipe

All stages of the pre-training pipeline: corpus collection (Common Crawl, Wikipedia, code), data cleaning (deduplication, language filtering, quality scoring), tokenization and batching, sequence packing strategy, and document boundary handling. Llama-3 production recipe: 15T tokens, 24K H100 days of compute, 70 days of training.

Şükrü Yusuf KAYA
75 min read
Advanced
🏗️ Pre-training pipeline — the construction site of the modern LLM
You want to train Llama-3-8B from scratch. Compute: 24K H100 days (~$6M). Data: 15T tokens. Duration: 70 days. It is hard to convey how large this scale is in production AI. But before the compute comes the pipeline: corpus collection, cleaning, dedup, tokenize, pack, shuffle. Get any stage wrong and training either runs far too slowly or produces a bad model. 75 minutes from now: you will understand how 24K H100 days of compute is used efficiently, know the details of the Llama-3 production recipe, and be able to build your own mini pre-training pipeline.

Lesson Map (12 Sections)#

  1. Pre-training overview — high-level pipeline diagram
  2. Corpus collection — kaynaklar (Common Crawl, Wikipedia, GitHub)
  3. Quality filtering — heuristic + classifier-based
  4. Deduplication — exact + fuzzy (MinHash + LSH)
  5. PII removal — privacy compliance
  6. Tokenization — corpus → token IDs
  7. Sequence packing — multiple docs into max_seq_len
  8. Document boundary — separator tokens, attention mask
  9. Shuffle + epochs — data ordering strategy
  10. Llama-3 corpus mix — actual proportions
  11. Llama-3 training recipe — hyperparameters, schedule
  12. Cost economics — $M-level compute budget

1. Pre-training Pipeline#

1.1 High-level flow#

[Stage 1: Data Collection]
  ↓ Common Crawl (10B+ pages)
  ↓ Wikipedia (60+ languages)
  ↓ GitHub (code)
  ↓ Books (scientific, literature)
  ↓ ~100TB raw text
[Stage 2: Quality Filter]
  ↓ Language detection (filter target languages)
  ↓ Heuristic filters (length, structure, gibberish)
  ↓ Classifier-based quality scoring (FastText)
  ↓ ~50TB clean
[Stage 3: Deduplication]
  ↓ Exact dedup (hash-based)
  ↓ Fuzzy dedup (MinHash + LSH)
  ↓ ~30TB unique
[Stage 4: PII Removal]
  ↓ Email, phone, SSN scrubbing
  ↓ ~30TB sanitized
[Stage 5: Tokenization]
  ↓ HF tokenizer (Llama-3 128K vocab)
  ↓ 15T tokens (50% compression)
[Stage 6: Sequence Packing]
  ↓ Pack into 8192-token sequences
  ↓ ~2B sequences
[Stage 7: Shuffle + Epoch]
  ↓ Distributed shuffle
  ↓ 1.5 epochs typical (Llama-3)
[Stage 8: Training]
  ↓ 24K H100 GPU days
  ↓ Final Llama-3-8B model
Every stage needs optimization; no stage should become the bottleneck.

1.2 Time budget#

Production Llama-3 pipeline:
  • Corpus collection: 2-3 months (crawl, license review)
  • Cleaning + dedup: 2-3 weeks (massive parallel processing)
  • Tokenization: 1 week (parallel)
  • Training: 70 days
  • Total: ~6 months

1.3 Compute breakdown#

Llama-3-8B (Meta paper):
  • Total: 1.3M GPU hours H100
  • = 24K H100 days
  • = ~$6M cloud cost (spot rate $2.5/hour)
  • = ~$3M on-prem (depending on amortization)
Llama-3-70B: ~10x more expensive, ~$60M. Llama-3-405B: ~100x more expensive, ~$500M+.

2. Corpus Collection#

2.1 Common Crawl#

Web crawl archive — public dataset.
  • 250B+ web pages
  • Updated monthly
  • ~50 TB compressed per snapshot
  • ~100 TB extracted text
from datasets import load_dataset

cc = load_dataset("allenai/c4", "en", split="train", streaming=True)
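A quick sanity check on the stream (a minimal sketch; islice just peeks at the first few documents without downloading the full dataset):

from itertools import islice

# Inspect a few documents from the streaming dataset
for example in islice(cc, 3):
    print(len(example["text"].split()), example["text"][:80])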

2.2 Wikipedia#

  • 60+ languages
  • High quality text
  • ~30 GB compressed (all languages)
  • ~200 GB extracted text
wiki = load_dataset("wikimedia/wikipedia", "20231101.tr", split="train")  # prebuilt Turkish dump

2.3 GitHub (code)#

  • 100M+ repositories
  • Permissive licenses (MIT, Apache, BSD)
  • Code: Python, JavaScript, C++, Rust, Go
  • ~5 TB code text
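For code, a common starting point is a pre-filtered permissive-license dump such as The Stack (a sketch; the dataset is gated and requires accepting its terms on the Hugging Face Hub):

from datasets import load_dataset

# Python subset of The Stack (permissive licenses only, per the dataset card)
stack = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)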

2.4 Books#

  • Project Gutenberg (public domain)
  • Scientific papers (arXiv, PubMed)
  • ~500 GB

2.5 Turkish corpus#

  • Turkish Wikipedia (4 GB)
  • OSCAR Turkish (10-15 GB)
  • Turkish news (1-2 GB)
  • BounWebCorpus (academic)
  • ~30 GB total Turkish
Llama-3 multilingual: 5%+ non-English content.

2.6 License compliance#

Critical: training data must be legal to use:
  • Common Crawl: Robots.txt compliance
  • Wikipedia: CC-BY-SA (attribution required)
  • GitHub: permissive licenses only
  • Books: public domain or licensed
EU AI Act 2024: transparency requirements — training data composition disclosed.

3-4. Quality + Deduplication#

3.1 Heuristic filters#

  • Min line length: 5 words
  • Max line length: 1000 words
  • Min ratio of alphanumeric: 80%
  • No more than 30% punctuation
  • Detect gibberish: random character sequences
  • Detect lists/tables (not natural text)
def is_quality(text):
    # Example thresholds; production pipelines combine many more heuristics
    if not text:
        return False
    words = text.split()
    if len(words) < 10 or len(words) > 10000:
        return False
    alnum = sum(1 for c in text if c.isalnum())
    if alnum / len(text) < 0.7:
        return False
    return True

3.2 Classifier-based filtering#

FastText classifier: high-quality vs low-quality text. Training data: Wikipedia (high) vs raw Common Crawl (low). Documents scoring > 0.5 are kept; a sketch follows below.
Llama-3 paper: classifier-based filtering significantly improves model quality.
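A minimal fastText sketch, assuming a training file quality_train.txt whose lines are labeled __label__hq (Wikipedia) or __label__lq (raw Common Crawl); the file name, labels, and hyperparameters are illustrative:

import fasttext

# Supervised classifier: one labeled line per training document
model = fasttext.train_supervised(input="quality_train.txt", epoch=5, wordNgrams=2)

def keep(text, threshold=0.5):
    # predict() expects a single line, so strip newlines first
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__hq" and probs[0] > threshold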

3.3 Exact deduplication#

Hash each document, drop duplicates.
import hashlib

seen = set()

def dedupe_exact(docs):
    # Drop byte-identical documents via a content hash
    for doc in docs:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            yield doc
Roughly 30% of raw Common Crawl is exact-duplicate content, so this step alone is a dramatic reduction.

3.4 Fuzzy dedup (MinHash + LSH)#

Near-duplicates (paraphrases, copy-pasted articles) survive exact dedup. MinHash + Locality-Sensitive Hashing catches them:
from datasketch import MinHash, MinHashLSH

def get_shingles(text, k=5):
    # Overlapping k-word shingles of a document
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
for i, doc in enumerate(docs):
    m = MinHash(num_perm=128)
    for shingle in get_shingles(doc, k=5):
        m.update(shingle.encode())
    lsh.insert(i, m)
# lsh.query(m) returns keys of candidate near-duplicates for a MinHash

3.5 Llama-3 dedup outcome#

Meta reports:
  • Common Crawl raw: 100 TB
  • After heuristic filter: 50 TB
  • After classifier filter: 30 TB
  • After exact dedup: 20 TB
  • After fuzzy dedup: 15 TB
85% of the data is dropped. Quality > quantity.

7-9. Sequence Packing#

7.1 Naive: one doc per sequence#

seq_1 = doc_1 padded to max_seq_len
seq_2 = doc_2 padded to max_seq_len
...
Problem: padding wastes compute. With an average doc length of 500 tokens and max_seq = 8192, about 94% of every sequence is padding (1 − 500/8192 ≈ 0.94).

7.2 Sequence packing#

Multiple docs pack into one sequence:
seq_1 = doc_1 + <|sep|> + doc_2 + <|sep|> + doc_3 + ...
Fill max_seq_len fully. No padding waste.

7.3 Document boundary handling#

Key: attention shouldn't cross document boundaries.
Option A: special separator token + attention mask
[doc1_tokens] [<|endoftext|>] [doc2_tokens] [<|endoftext|>] ...
attention_mask: doc1 attends only to doc1, doc2 attends only to doc2.
Option B: position reset
position_ids = [0, 1, ..., len(doc1)-1, 0, 1, ..., len(doc2)-1, ...]
Llama-3's approach: position_ids reset (RoPE-friendly). A sketch of both options follows below.
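A minimal sketch of both mechanisms for a single packed sequence, assuming every document ends with an EOT token (the function name and tensor layout are illustrative):

import torch

def boundary_tensors(token_ids, eot_id):
    # Assign each position a document index (increments after every EOT)
    doc_ids, d = [], 0
    for t in token_ids:
        doc_ids.append(d)
        if t == eot_id:
            d += 1
    doc_ids = torch.tensor(doc_ids)
    n = len(token_ids)

    # Option B: position_ids reset to 0 at every document start
    position_ids = torch.zeros(n, dtype=torch.long)
    for doc in doc_ids.unique():
        idx = (doc_ids == doc).nonzero(as_tuple=True)[0]
        position_ids[idx] = torch.arange(len(idx))

    # Option A: causal mask that also blocks cross-document attention
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    attention_mask = causal & (doc_ids[:, None] == doc_ids[None, :])
    return position_ids, attention_mask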

7.4 Pack algorithm#

def pack_sequences(docs, max_seq_len, eot_token):
    # Greedy packing: append docs (each followed by EOT) until the next
    # doc would overflow, then flush and start a new sequence
    sequences = []
    current = []
    for doc in docs:
        doc = doc[: max_seq_len - 1]  # clip docs longer than one sequence
        if len(current) + len(doc) + 1 > max_seq_len:
            sequences.append(current)  # flush current sequence
            current = []
        current.extend(doc)
        current.append(eot_token)
    if current:
        sequences.append(current)
    return sequences

7.5 Greedy vs optimal packing#

Greedy: fill until nothing more fits, which reaches 95%+ efficiency. Optimal packing is a bin-packing problem (NP-hard), so exact 100% utilization is not worth chasing.

7.6 Shuffle strategy#

Global shuffle ideal:
import random

random.shuffle(sequences)
For a large corpus: chunk-level shuffle + within-chunk shuffle (memory-efficient); see the sketch below.
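A two-level shuffle sketch (load_chunk is a hypothetical loader; at corpus scale the chunks would live in sharded files on disk):

import random

def chunked_shuffle(chunk_paths, seed=42):
    rng = random.Random(seed)
    order = list(range(len(chunk_paths)))
    rng.shuffle(order)                      # level 1: shuffle chunk order
    for i in order:
        chunk = load_chunk(chunk_paths[i])  # hypothetical: one chunk fits in memory
        rng.shuffle(chunk)                  # level 2: shuffle within the chunk
        yield from chunk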

10-12. Llama-3 Training Recipe#

10.1 Corpus mix (Llama-3 paper)#

  • Web (filtered Common Crawl): 60%
  • Books: 15%
  • Code (GitHub): 15%
  • Wikipedia + scientific: 5%
  • Multilingual: 5%
Total: 15T tokens unique.
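With Hugging Face datasets, a mixture like this can be approximated by weighted interleaving (a sketch; the five dataset variables are placeholders for the sources above):

from datasets import interleave_datasets

# web, books, code, wiki, multi: streaming datasets for each source (placeholders)
mixed = interleave_datasets(
    [web, books, code, wiki, multi],
    probabilities=[0.60, 0.15, 0.15, 0.05, 0.05],
    seed=42,
)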

10.2 Tokenization config#

  • Tokenizer: tiktoken-style BPE, 128K vocab
  • Pre-train sequence length: 8192
  • Total sequences: ~1.8B (15T / 8192)
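To inspect this tokenizer yourself (gated repo; requires accepting the Llama-3 license on the Hugging Face Hub):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(len(tok))                            # vocab size, ~128K
print(tok("Merhaba dünya!")["input_ids"])  # text -> token IDs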

10.3 Optimizer#

AdamW:
  • lr: 3e-4 (peak)
  • β_1: 0.9, β_2: 0.95
  • weight_decay: 0.1
  • ε: 1e-5
  • gradient_clip: 1.0
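In PyTorch, this configuration looks like the following sketch (model is assumed to exist):

import torch

optimizer = torch.optim.AdamW(
    model.parameters(),   # `model` assumed to be defined
    lr=3e-4,              # peak LR; the schedule is applied separately
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

# Applied every step, just before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)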

10.4 Learning rate schedule#

Cosine decay with warmup:
  • Warmup: 2000 steps linear 0 → 3e-4
  • Cosine decay: 3e-4 → 3e-5
  • Total: 1.4M steps
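The schedule as a function of step, matching the numbers above (a minimal sketch):

import math

def lr_at(step, peak=3e-4, floor=3e-5, warmup=2000, total=1_400_000):
    if step < warmup:                       # linear warmup: 0 -> peak
        return peak * step / warmup
    t = (step - warmup) / (total - warmup)  # cosine decay: peak -> floor
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))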

10.5 Batch size#

  • Global batch: 16M tokens
  • = 16M / 8192 = ~2000 sequences
  • Per-GPU batch: 32 sequences (in a 64-way data-parallel example)
  • Gradient accumulation: ~1

10.6 Distributed setup#

  • 24K H100 GPUs (1024 nodes × 24 GPUs/node, hypothetical)
  • Tensor parallelism: 8
  • Pipeline parallelism: 16
  • Data parallelism: 192 (24K / 128)
  • ZeRO-3 sharding

10.7 Training schedule#

  • Start: lr warmup, fp32 master weights, bf16 forward
  • Day 1-7: stability checks, frequent checkpoints
  • Day 8-60: stable training
  • Day 61-70: lr decay phase
  • Total: 1.4M steps, 70 days

10.8 Cost#

  • 24K H100 × 70 days × 24 hours × $4/hour ≈ $161M
  • Discounted enterprise rate: ~$60M
  • Meta on-prem amortized: ~$30M (DC + GPU amortization)
Very rough estimate. Meta actual cost not public.

10.9 Output#

  • Llama-3-8B base model checkpoint
  • 16 GB (bf16)
  • Used downstream for instruct fine-tune, multiple variants
✅ Lesson 11.1 Summary — Pre-training Pipeline
The pre-training pipeline is a large operation spanning roughly 6 months. Corpus (Common Crawl, Wikipedia, code) → quality filter (heuristic + classifier) → dedup (exact + fuzzy MinHash) → PII removal → tokenize → sequence pack (max_seq 8192) → shuffle → train. Llama-3 recipe: 15T tokens, AdamW (lr 3e-4 peak, cosine decay), batch 16M tokens, 70 days on 24K H100 GPUs, $30M-60M cost. Quality > quantity: 85% of raw data is dropped by the filters. Turkish lives in the 5% multilingual slice. Lesson 11.2 moves on to the AdamW optimizer math.

Next Lesson: AdamW Optimizer Math#

Lesson 11.2: AdamW (Adam + decoupled weight decay, Loshchilov 2019), momentum + variance estimates, why β1=0.9 and β2=0.95, learning rate schedules (cosine, linear, warmup), gradient clipping.

Frequently Asked Questions

Is the exact Llama-3 training data composition public?
No. Meta has not released the exact training data composition. The paper gives a general breakdown (60% web, 15% books, etc.), but the actual data files are closed. EU AI Act 2024 transparency requirements may change this.
