
Long Context Extrapolation: NTK-Aware Scaling + YaRN + LongRoPE — Journey from 8K to 1M Tokens

Extending RoPE to long context: NTK-aware scaling intuition, YaRN (Peng 2023) — comprehensive solution + temperature scaling, LongRoPE (Microsoft 2024) — 2M token context. Llama-3-8B base 8K → 128K extension recipes, Gemini 1.5 1M token tricks, fine-tune protocol.

Şükrü Yusuf KAYA
70 min read
Advanced
🌌 From 8K to 1M: the context length frontier
Early 2023: most models had 4K-8K context. By 2026: Gemini 1.5 at 1M tokens, Claude at 200K, GPT-4o at 128K, Llama-3 at 128K. The leap was made possible by RoPE scaling techniques: NTK-aware, YaRN, LongRoPE. These techniques carry an existing RoPE base model to a longer context without training from scratch. Llama-3-8B has a native 8K context and was later extended to 128K with 100M+ tokens of training data in only 1000 steps. Seventy minutes from now you will have a deep grasp of the NTK-aware intuition, the mathematical anatomy of YaRN, LongRoPE's evolutionary search, and production extension recipes.

Lesson Map (10 Sections)#

  1. The context length problem: why naive RoPE breaks at long context
  2. NTK-aware scaling intuition: frequency rescaling
  3. Position interpolation: Chen 2023, the simpler approach
  4. YaRN (Peng 2023): the comprehensive solution
  5. YaRN math: frequency-dependent scaling + temperature
  6. LongRoPE (Microsoft 2024): evolutionary search
  7. Fine-tune protocol: how many steps, how much data
  8. Llama-3-8B 8K → 128K: production recipe
  9. Gemini 1.5 1M tokens: Google's approach
  10. Future: is 10M+ context in sight?

1-5. NTK-Aware + YaRN Deep Dive#

1.1 Naive RoPE long context#

RoPE was trained with seq_len = 8K. What happens if you run inference at seq_len = 32K?
  • Position indices beyond 8K → rotation angles the model has never seen
  • Attention patterns break down
  • Perplexity explodes (10x+ degradation), as the sketch below illustrates
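A minimal NumPy sketch of why this happens (head dimension and base are illustrative Llama-3-style values): the slowest RoPE dimensions never complete a full rotation within the 8K training window, so positions beyond 8K push them into angle ranges the model never encountered.

```python
import numpy as np

d_head, base = 128, 500_000.0          # illustrative Llama-3-style head dim and RoPE base
train_len, test_len = 8_192, 32_768    # trained vs. attempted context length

# Standard RoPE inverse frequencies: theta_i = base^(-2i/d_head)
inv_freq = base ** (-np.arange(0, d_head, 2) / d_head)

# Dimensions that never complete a full 2π rotation within the training window:
# beyond train_len, their rotation angles are genuinely out-of-distribution.
never_wrapped = train_len * inv_freq < 2 * np.pi
print(f"{never_wrapped.sum()} of {len(inv_freq)} dimensions never wrap within 8K")
print(f"at 32K they reach angles up to {(test_len * inv_freq[never_wrapped]).max():.2f} rad, "
      f"vs. at most {(train_len * inv_freq[never_wrapped]).max():.2f} rad seen in training")
```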

1.2 Position Interpolation (Chen 2023)#

Meta paper: 'Extending Context Window of Large Language Models via Positional Interpolation'.
Idea: compress the position values. To stretch a native 8K context 4x to 32K:
position_scaled = position / scale_factor (e.g., 4 for a 4x stretch)
Result: the model stays within the position range it saw during training. ~1000 fine-tune steps suffice.
Problem: short-range patterns degrade (positions 1, 2 → 0.25, 0.5; the model rarely saw spacing this fine).
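A tiny sketch of the mechanic (the helper name is illustrative, not library code): every position index is divided by the scale factor before RoPE angles are computed, so even position 32767 maps back inside the trained range, at the cost of the fractional short-range positions noted above.

```python
import numpy as np

def pi_positions(seq_len: int, scale_factor: float) -> np.ndarray:
    """Position Interpolation: compress position indices so the extended
    window maps back into the trained range [0, original_max)."""
    return np.arange(seq_len) / scale_factor

pos = pi_positions(32_768, scale_factor=4.0)  # run an 8K-trained model at 32K
print(pos[:4])   # [0.   0.25 0.5  0.75] -> fractional short-range positions
print(pos[-1])   # 8191.75 -> still inside the trained 0..8191 range
```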

1.3 NTK-aware scaling (LocalLlama community 2023)#

Neural Tangent Kernel insight: high-frequency components are critical for short-range (local) patterns, low-frequency components for long-range patterns.
Idea: keep the high frequencies as they are and rescale only the low frequencies.
base_new = base × scale_factor^(d/(d-2)), where d is the head dimension
Llama-2 base 10K, scale 8x: base_new ≈ 80K. High-freq dimensions barely change; low-freq dimensions stretch.
Result: short-range behavior preserved, long-range supported. Fewer fine-tune steps needed.
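A sketch of what the base change does per dimension (values illustrative): the fastest dimensions are left essentially untouched, while the slowest get stretched by roughly the full scale factor.

```python
import numpy as np

def ntk_base(base: float, scale: float, d_head: int) -> float:
    """NTK-aware scaling: rescale the RoPE base instead of the positions."""
    return base * scale ** (d_head / (d_head - 2))

base, scale, d_head = 10_000.0, 8.0, 128     # Llama-2-style settings
new_base = ntk_base(base, scale, d_head)     # ≈ 82.7K

dims = np.arange(0, d_head, 2)
ratio = (new_base ** (-dims / d_head)) / (base ** (-dims / d_head))
print(f"new base ≈ {new_base / 1e3:.1f}K")
print(f"fastest dim frequency scaled by {ratio[0]:.3f}, slowest by {ratio[-1]:.3f}")
```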

1.4 YaRN (Peng 2023)#

'YaRN: Efficient Context Window Extension of Large Language Models'.
Improves on NTK-aware scaling and adds extra tricks:
  1. Frequency-dependent scaling: low/mid/high frequencies farklı strategy
  2. Temperature scaling: attention logits scale (compensates for longer context softmax dilution)
  3. Ramp function: smoothing between scaling regions
Formula sketch:
  • λ_d = piecewise ramp over dimension d, interpolating between 1/s (slow dimensions) and 1 (fast dimensions)
  • θ_d_new = θ_d × λ_d
  • Attention logits rescaled by a temperature t, with √(1/t) ≈ 0.1·ln(s) + 1
Result: even fewer fine-tune steps, better perplexity, longer effective context.
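A condensed NumPy sketch of the three pieces (hyperparameter names are chosen to mirror the beta_fast/beta_slow knobs in the HF config shown later in this lesson; real implementations differ in details):

```python
import numpy as np

def yarn_freqs(d_head=128, base=10_000.0, scale=16.0,
               orig_len=4_096, beta_fast=32, beta_slow=1):
    """Sketch of YaRN: frequency-dependent interpolation plus a temperature term."""
    theta = base ** (-np.arange(0, d_head, 2) / d_head)   # original frequencies
    rotations = orig_len * theta / (2 * np.pi)            # full turns within orig_len
    # Ramp: 1 for fast dims (keep as-is), 0 for slow dims (fully interpolate), linear between
    ramp = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0.0, 1.0)
    theta_new = ramp * theta + (1.0 - ramp) * theta / scale
    # Temperature recommended by the YaRN paper: sqrt(1/t) = 0.1 * ln(s) + 1
    attn_scale = 0.1 * np.log(scale) + 1.0
    return theta_new, attn_scale
```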

1.5 YaRN empirical (Peng 2023)#

Llama-2 7B base 4K, extended to:
  • 16K: 200 fine-tune steps
  • 32K: 400 steps
  • 64K: 600 steps
  • 128K: 1000 steps
Perplexity on long passages: comparable to native 128K model trained from scratch (which would cost 100x compute).

1.6 Why YaRN works#

The combination of NTK-style frequency preservation, temperature scaling, and smooth ramping between regimes hits an empirical sweet spot. Several follow-up papers explore variations on each component.

1.7 LongRoPE (Microsoft 2024)#

'LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens'.
Key innovation: per-RoPE-frequency scaling — each frequency optimized via evolutionary search.
Evolutionary algorithm:
  • Population: candidate frequency rescalings
  • Fitness: validation perplexity on long context
  • Selection: top-K survives
  • Crossover/mutation: explore search space
1000-2000 evolutionary iterations → optimal frequency profile.
Result: Llama-2 7B extended to 2M tokens with 8K training data.
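A schematic sketch of the search loop (not the paper's implementation; perplexity_fn stands in for a long-context validation pass over a candidate's per-dimension rescale factors):

```python
import random

def evolutionary_search(init_scales, perplexity_fn, pop_size=64,
                        n_iters=1000, top_k=16, mutate_std=0.05):
    """LongRoPE-style search over per-frequency rescale factors (schematic)."""
    population = [list(init_scales) for _ in range(pop_size)]
    for _ in range(n_iters):
        scored = sorted(population, key=perplexity_fn)   # fitness: lower ppl is better
        parents = scored[:top_k]                         # selection: top-K survive
        children = []
        while len(children) < pop_size - top_k:
            a, b = random.sample(parents, 2)
            child = [random.choice(pair) for pair in zip(a, b)]             # crossover
            child = [s * (1 + random.gauss(0, mutate_std)) for s in child]  # mutation
            children.append(child)
        population = parents + children
    return min(population, key=perplexity_fn)
```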

1.8 Frequency profile insight#

LongRoPE found: low frequencies (long-range) need different scaling than mid frequencies (medium-range). Per-frequency tuning critical.

7-8. Production Extension Recipe#

7.1 Llama-3-8B 8K → 128K#

Meta's official Llama-3 Instruct 128K extension:
  1. Base model: Llama-3-8B trained on 8K context, RoPE base 500000
  2. Stage 1 (8K → 32K): 400 fine-tune steps, batch=4, lr=2e-5, ~100M tokens
  3. Stage 2 (32K → 128K): 600 fine-tune steps, batch=2 (memory limit), ~200M tokens
  4. Validation: needle-in-haystack benchmark
Total cost: ~$30K of AWS compute. Vs. training from scratch: $10M+.
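A hypothetical way to express the two stages as configs (values mirror the recipe above; this is not Meta's released training configuration):

```python
# Hypothetical two-stage schedule; dict keys are illustrative, not a specific trainer's API.
stages = [
    {   # Stage 1: 8K -> 32K
        "rope_scaling": {"type": "yarn", "factor": 4.0,
                         "original_max_position_embeddings": 8192},
        "max_seq_len": 32_768, "steps": 400, "batch_size": 4, "lr": 2e-5,
    },
    {   # Stage 2: 32K -> 128K
        "rope_scaling": {"type": "yarn", "factor": 16.0,
                         "original_max_position_embeddings": 8192},
        "max_seq_len": 131_072, "steps": 600, "batch_size": 2, "lr": 2e-5,
    },
]
```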

7.2 YaRN-based extension code#

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    rope_scaling={
        "type": "yarn",
        "factor": 16.0,  # 8K → 128K (16x)
        "original_max_position_embeddings": 8192,
        "beta_fast": 32,
        "beta_slow": 1,
    },
)
# Fine-tune with long-context data
# ...
```

7.3 Open-source extensions#

  • Yi-34B-200K: YaRN-based 200K
  • Llama-3-Long: community-driven 256K extensions
  • Mistral-Nemo-128K: native YaRN-compatible

7.4 Needle-in-haystack benchmark#

Long-context quality test (a minimal harness is sketched after the results below):
  • Plant a random piece of information in a long document ('Joe's favorite color is purple')
  • Ask a question that depends on that information
  • Measure recall accuracy across context lengths
Real Llama-3 results:
  • Native 8K: 100% recall at 8K
  • YaRN 32K extension: 99% at 32K
  • YaRN 128K extension: 92% at 64K, 80% at 128K (degradation)
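A minimal sketch of such a harness (generate_fn and filler_text are assumed inputs; real evals count tokens rather than characters and also sweep the needle's depth in the document):

```python
def needle_in_haystack(generate_fn, filler_text, context_lengths):
    """Toy needle-in-haystack eval: plant a fact mid-context, ask for it back."""
    needle = "Joe's favorite color is purple."
    question = "\n\nQuestion: What is Joe's favorite color? Answer:"
    results = {}
    for length in context_lengths:          # lengths in characters for simplicity
        haystack = (filler_text * (length // len(filler_text) + 1))[:length]
        mid = len(haystack) // 2            # insert the needle mid-document
        prompt = haystack[:mid] + " " + needle + " " + haystack[mid:] + question
        results[length] = "purple" in generate_fn(prompt).lower()
    return results
```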

7.5 Gemini 1.5 1M token#

Google DeepMind 2024: Gemini 1.5 with native 1M+ context. Details are closed-source. Likely techniques (speculation):
  • Mixture of Experts (sparse compute)
  • Custom attention pattern (sliding window + dense)
  • Massive long-context training data
  • TPU-specific optimization
Not pure RoPE extension — significant architecture changes.
✅ Lesson 9.4 Summary: Long Context Extrapolation
Extending RoPE to long context: NTK-aware scaling (frequency-aware), YaRN (Peng 2023, comprehensive + temperature scaling), LongRoPE (Microsoft 2024, evolutionary search, 2M tokens). Production: extending Llama-3-8B from 8K to 128K is feasible with ~1000 fine-tune steps. Cost: ~$30K compute vs. $10M+ for training from scratch. Needle-in-haystack benchmarks assess quality. Gemini 1.5's 1M-token context is native: not a pure RoPE extension but an architectural innovation. In Lesson 9.5, the capstone: implement Llama-3 RoPE in 50 lines + position visualization.

Next Lesson: Capstone (Llama-3 RoPE Implementation)#

Lesson 9.5 (Module 9 capstone): implement RoPE from scratch in 50 lines, compatible with Llama-3 weights, visualize position patterns, and demo an 8K → 32K extension with a fine-tune script.

Frequently Asked Questions

Does NTK-aware scaling work without fine-tuning? It works zero-shot but with quality loss; fine-tuning on 100M+ tokens of long-context data recovers it. In production, fine-tuning is required: zero-shot shows 50%+ degradation.

