RoPE in Depth: Mathematical Anatomy of Rotary Position Embedding — From Su 2021 to Llama-3
Mathematical anatomy of RoPE: the complex-number rotation interpretation, why it is applied to Q and K, and the derivation showing relative position is implicit. Llama-3's RoPE implementation line by line, the base frequency 10000, and pair-wise rotation. A PyTorch implementation, a RoPE vs sinusoidal/learned comparison, and the reasons for its widespread adoption in modern models.
Şükrü Yusuf KAYA
75 min read
Advanced · 🔄 RoPE — the positional revolution of modern LLMs
In 2021, Su et al. published 'RoFormer: Enhanced Transformer with Rotary Position Embedding'. The initial reaction: interesting, but its significance was unclear. Then Llama-2 shipped with RoPE in 2023, and after that every modern LLM (Llama-3, Mistral, Mixtral, GPT-4, Qwen, DeepSeek) adopted it. Why? Three reasons: (1) relative position information is implicit, (2) better long-context extrapolation, (3) it is injected in attention — nothing is added to the embeddings. It is also mathematically elegant: it uses 2D rotation matrices and has a complex-number multiplication interpretation. 75 minutes from now, you will have a deep grasp of RoPE's mathematical anatomy, the Llama-3 implementation, and why it became the modern de facto standard.
Lesson Map (12 Sections)
- RoPE intuition — rotate Q and K according to position
- 2D rotation math — Euler's formula, complex numbers
- Pair-wise rotation — group d_model dimensions into pairs
- Frequency hierarchy — like sinusoidal PE, but applied differently
- Relative position derivation — why 'relative' is implicit
- Attention computation — the Q_rot · K_rot mathematics
- Llama-3 implementation — line by line
- Base frequency 10000 — why it exists, what it does
- PyTorch from scratch — production-grade
- RoPE vs sinusoidal/learned — empirical comparison
- Long-context concerns — extending via the base
- Modern model preferences — why it is the de facto standard
1-3. RoPE Intuition + Math
1.1 Core idea
Standard attention:
attn_score(i, j) = q_i · k_j
RoPE: rotate Q and K according to their positions:
q_i' = R_{Θ_i} q_i
k_j' = R_{Θ_j} k_j
attn_score(i, j) = q_i' · k_j' = q_i^T R_{Θ_i}^T R_{Θ_j} k_j = q_i^T R_{Θ_j − Θ_i} k_j   [rotation composition]
Key result: the attention score depends only on the position difference (j − i) — i.e., on relative position!
During training the model learns this relative pattern, which makes extrapolation more natural.
1.2 2D rotation matrix
Basic 2D rotation:
R_θ = [[cos(θ), -sin(θ)], [sin(θ), cos(θ)]]
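A quick check of the two properties the later derivation relies on — rotations compose additively, and transposing inverts the angle. A minimal PyTorch sketch; the angle values are arbitrary:

```python
import math
import torch

def rot(theta):
    """2D rotation matrix R_theta."""
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s], [s, c]])

a, b = 0.3, 0.5
# Composition: rotating by a, then by b, equals rotating by a + b
print(torch.allclose(rot(b) @ rot(a), rot(a + b)))  # True
# Transpose inverts the rotation: R_theta^T = R_{-theta}
print(torch.allclose(rot(a).T, rot(-a)))            # True
```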
1.3 Complex number interpretation
A 2D vector (x, y) corresponds to the complex number x + iy.
Rotation by θ: multiply by e^{iθ}.
e^{iθ} = cos(θ) + i sin(θ)   [Euler's formula]
(x + iy) × e^{iθ} = (x cos(θ) − y sin(θ)) + i(x sin(θ) + y cos(θ))
Elegant!
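The same 90° rotation computed both ways — a small sketch; the input vector (1, 0) is an arbitrary choice:

```python
import math
import torch

theta = math.pi / 2  # rotate by 90 degrees

# (a) 2D rotation matrix acting on (x, y) = (1, 0)
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])
print(R @ torch.tensor([1.0, 0.0]))          # ~[0., 1.]

# (b) complex multiplication: (1 + 0i) * e^{i*theta}
z = torch.complex(torch.tensor(1.0), torch.tensor(0.0))
rot = torch.polar(torch.tensor(1.0), torch.tensor(theta))  # e^{i*theta}
print(z * rot)                               # ~(0 + 1j)
```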
1.4 Pair-wise rotation in d_model dim
d_model = 4096 is high-dimensional. Does every dimension get its own rotation? No — group d_model into pairs and apply one rotation per pair:
Q vector: [q_0, q_1, q_2, q_3, ..., q_{d-2}, q_{d-1}]
Pair 0: (q_0, q_1) → 2D rotation by θ_0
Pair 1: (q_2, q_3) → 2D rotation by θ_1
...
Pair d/2−1: (q_{d-2}, q_{d-1}) → 2D rotation by θ_{d/2-1}
That gives d_model / 2 distinct rotation angles.
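A toy version of this grouping — a sketch where d = 8 and position 3 are arbitrary, and the θ_i follow the frequency formula introduced in the next subsection:

```python
import torch

d, pos = 8, 3
q = torch.randn(d)
theta = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))  # one angle rate per pair
angles = pos * theta                                          # θ_i at this position

pairs = q.view(d // 2, 2)          # (q_0,q_1), (q_2,q_3), ...
cos, sin = angles.cos(), angles.sin()
rotated = torch.stack(
    [pairs[:, 0] * cos - pairs[:, 1] * sin,   # x' = x·cosθ − y·sinθ
     pairs[:, 0] * sin + pairs[:, 1] * cos],  # y' = x·sinθ + y·cosθ
    dim=-1,
).view(d)
print(torch.allclose(rotated.norm(), q.norm()))  # True: rotation preserves length
```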
1.5 Frequency hierarchy
Each pair gets a different frequency:
θ_i = pos × 10000^{-2i/d_model}
- i=0: frequency 1 (fastest oscillation)
- i=d/2−1: frequency ≈ 10000^{-1} (slowest)
This is the same frequency structure as sinusoidal PE, but it is applied differently:
- Sinusoidal: the PE is added to the embedding
- RoPE: Q and K are rotated inside attention
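To make the extremes concrete, a minimal sketch (dim = 128 is an assumed head size):

```python
import torch

dim = 128
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
print(inv_freq[0].item())   # 1.0       -> fastest pair: one full cycle every 2π positions
print(inv_freq[-1].item())  # ~1.15e-4  -> slowest pair: wavelength ≈ 54,000 positions
```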
```python
import torch
import torch.nn as nn


class RotaryEmbedding(nn.Module):
    def __init__(self, dim, base=10000, max_seq_len=8192):
        super().__init__()
        self.dim = dim
        # Frequency for each pair (d_model/2 frequencies)
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        # Precompute cos/sin for all positions
        self._build_cache(max_seq_len)

    def _build_cache(self, max_seq_len):
        t = torch.arange(max_seq_len, dtype=torch.float)
        freqs = torch.outer(t, self.inv_freq)    # [seq, dim/2]
        # Duplicate so cos/sin line up with the half-split pairing in rotate_half
        emb = torch.cat([freqs, freqs], dim=-1)  # [seq, dim]
        self.register_buffer('cos_cached', emb.cos())
        self.register_buffer('sin_cached', emb.sin())

    def forward(self, x, position_ids):
        # x: [batch, heads, seq, head_dim]; position_ids: [batch, seq]
        cos = self.cos_cached[position_ids]      # [batch, seq, head_dim]
        sin = self.sin_cached[position_ids]
        # Insert a heads axis so cos/sin broadcast over [batch, heads, seq, head_dim]
        return cos.unsqueeze(1), sin.unsqueeze(1)


def rotate_half(x):
    """Rotate pairs by 90 degrees: dim i is paired with dim i + d/2 (HF Llama layout)."""
    x1, x2 = x[..., :x.shape[-1] // 2], x[..., x.shape[-1] // 2:]
    return torch.cat([-x2, x1], dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin):
    """Apply RoPE to Q and K."""
    q_rot = (q * cos) + (rotate_half(q) * sin)
    k_rot = (k * cos) + (rotate_half(k) * sin)
    return q_rot, k_rot


# Usage in an attention layer
dim = 128  # head_dim
rope = RotaryEmbedding(dim, base=10000, max_seq_len=8192)

batch, n_heads, seq = 2, 32, 1024
Q = torch.randn(batch, n_heads, seq, dim)
K = torch.randn(batch, n_heads, seq, dim)
position_ids = torch.arange(seq).unsqueeze(0).expand(batch, -1)

cos, sin = rope(Q, position_ids)
Q_rot, K_rot = apply_rotary_pos_emb(Q, K, cos, sin)
print(f"Q_rot shape: {Q_rot.shape}")  # [2, 32, 1024, 128]
```
RoPE production-grade PyTorch implementation (Llama-3 style)
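One implementation detail worth noting: rotate_half pairs dimension i with dimension i + d/2 (the Hugging Face Llama convention) rather than with the adjacent dimension i + 1 as in the original RoFormer formulation. The two layouts are equivalent up to a fixed permutation of dimensions, as long as the same convention is applied to both Q and K.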
5-6. Relative Position + Attention
5.1 Relative position derivation
Let's prove:
(R_{Θ_m} q)^T (R_{Θ_n} k) = q^T R_{Θ_n - Θ_m} k
Expanding the matrix product:
(R_{Θ_m} q)^T (R_{Θ_n} k) = q^T R_{Θ_m}^T R_{Θ_n} k
                          = q^T R_{−Θ_m} R_{Θ_n} k   [R^T = R^{-1} = R_{−θ}]
                          = q^T R_{Θ_n − Θ_m} k
The result depends only on the position difference (n − m).
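A quick numerical check of this result, reusing RotaryEmbedding and rotate_half from the implementation above (head_dim 64 and the positions are arbitrary choices):

```python
import torch

torch.manual_seed(0)
rope64 = RotaryEmbedding(dim=64, base=10000, max_seq_len=512)
q = torch.randn(1, 1, 1, 64)  # a single query vector
k = torch.randn(1, 1, 1, 64)  # a single key vector

def score(m, n):
    cos_q, sin_q = rope64(q, torch.tensor([[m]]))
    cos_k, sin_k = rope64(k, torch.tensor([[n]]))
    q_rot = q * cos_q + rotate_half(q) * sin_q
    k_rot = k * cos_k + rotate_half(k) * sin_k
    return (q_rot * k_rot).sum().item()

# Same offset (n − m = 5) at different absolute positions -> same score
print(score(0, 5), score(100, 105), score(300, 305))
```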
5.2 Attention score interpretation
attn(m, n) = (R_{Θ_m} q_m)^T (R_{Θ_n} k_n) = q_m^T R_{Θ_n − Θ_m} k_n
The attention between the query at position m and the key at position n varies only with the distance (n − m), not with the absolute positions m and n.
This gives the transformer a property analogous to translation invariance.
5.3 Decay with distance (intuitive)
RoPE's rotation angles span different frequencies. Distant positions accumulate a larger rotation difference → the attention score naturally shrinks as distance grows.
Sketch of attention(0, k) as a function of k:
- k=0: max (same position)
- k=10: high
- k=100: medium
- k=1000: low, oscillating
- k=10000: very oscillatory
Long-distance attention weakens — which is fine in practice (long-range patterns carry a weaker signal).
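A minimal sketch of this decay, assuming q = k = all-ones vectors with head_dim 128, for which the score reduces to 2·Σ cos(Δ·θ_i):

```python
import torch

dim = 128
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
for delta in [0, 1, 10, 100, 1000, 10000]:
    score = 2 * torch.cos(delta * inv_freq).sum()    # q^T R_Δ k for q = k = ones
    print(f"delta={delta:>5}: score = {score:.1f}")  # 128.0 at delta=0, then decays with oscillation
```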
5.4 Why Q and K — and why not V
The attention score is QK^T. Only Q and K need position info — they are what shape the score.
V carries the value (content) information — it should stay position-invariant. If V were rotated, the attention output itself would be distorted.
5.5 Llama-3 RoPE config
```python
# Llama-3 model config
rope_theta = 500000.0  # base frequency (RoFormer default: 10000)
# Llama-3 increased the base for better long-context behavior
max_position_embeddings = 8192  # base context
# RoPE scaling for 128K context: NTK-aware (Lesson 9.4)
```
Llama-3 uses a base of 500K (up from the 10K that Llama-2 used) — for long-context support.
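A sketch of what raising the base does: it stretches the slowest pair's wavelength, so rotations stay unambiguous over a much longer context (dim = 128 assumed):

```python
import math
import torch

dim = 128
for base in (10000.0, 500000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    wavelength = 2 * math.pi / inv_freq[-1].item()  # slowest pair
    print(f"base={base:>8.0f}: slowest wavelength ≈ {wavelength:,.0f} positions")
```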
5.6 Llama-3 transformer integration
```python
class LlamaAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.q_proj = nn.Linear(...)
        self.k_proj = nn.Linear(...)
        self.v_proj = nn.Linear(...)
        self.rotary_emb = RotaryEmbedding(
            self.head_dim,
            base=config.rope_theta,
            max_seq_len=config.max_position_embeddings,
        )

    def forward(self, hidden_states, position_ids):
        Q = self.q_proj(hidden_states)
        K = self.k_proj(hidden_states)
        V = self.v_proj(hidden_states)

        cos, sin = self.rotary_emb(V, position_ids)
        Q, K = apply_rotary_pos_emb(Q, K, cos, sin)  # V unchanged!

        # Standard attention
        attn_output = scaled_dot_product_attention(Q, K, V, ...)
        return attn_output
```
No position is added at the embedding layer. Position info is injected inside attention.
10-12. Empirical + Modern Preferences
10.1 RoPE vs Sinusoidal/Learned
Metrics on long-context tasks (16K+ context):
| Method | Perplexity | Extrapolation |
|---|---|---|
| Sinusoidal | 5.2 | Poor (outside training distribution) |
| Learned absolute | 5.0 | None (fixed max_seq) |
| RoPE | 4.8 | Good (relative position) |
| RoPE + YaRN scaling | 4.5 | Excellent (Lesson 9.4) |
RoPE is consistently better: higher quality and better extrapolation.
10.2 Why RoPE is the de facto standard
- Relative position is implicit — the model learns this pattern easily
- No additional embedding parameters — just Q/K rotation
- Long-context friendly — extendable by adjusting the base frequency
- Mathematical elegance — the complex-number rotation interpretation
- Empirically strong — RoPE leads across benchmarks
10.3 Modern model preferences (2026)
- Llama-3, Llama-3.1, Llama-3.2: RoPE (base 500K)
- Mistral 7B, Mixtral: RoPE (base 10K)
- Mistral-Nemo: RoPE (base 1M)
- Qwen 2: RoPE
- DeepSeek-V3: RoPE
- GPT-4 (speculated): RoPE variant
- Claude (speculated): RoPE variant
RoPE is the de facto 2024-2026 standard.
10.4 The ALiBi alternative
Press et al. 2021 — an attention-bias-based positional alternative to RoPE. Details in Lesson 9.3. Simpler than RoPE, but empirically its perplexity is slightly worse.
10.5 RoPE for Turkish
There is no Turkish-specific advantage or disadvantage. RoPE works well for Turkish too — implicit relative position information fits the patterns of a morphologically rich language.
✅ Lesson 9.2 Summary — RoPE in Depth
RoPE (Su et al. 2021): rotate Q and K pair-wise in 2D according to position, with a complex-number rotation interpretation. Key property: the attention score for (m, n) depends only on the relative position (n − m). d_model is grouped into pairs, each pair rotating at a different frequency (a hierarchy similar to sinusoidal PE). Llama-3 uses a RoPE base of 500K (raised for long context). No embedding parameters; position is injected in attention. Empirically RoPE wins on perplexity, extrapolation, and training stability — the de facto standard of modern LLMs in 2024-2026. In Lesson 9.3 we move on to the ALiBi alternative.
Next Lesson: ALiBi — Attention-Bias Positional Encoding
Lesson 9.3: ALiBi (Press et al. 2021) injects position by adding a linear bias to attention scores. Simpler to implement than RoPE; used in models such as BLOOM and MPT; empirical comparison.
Frequently Asked Questions
Q: Why is it called 'Rotary' Position Embedding?
A: Q and K vectors are rotated by d_model/2 different angles (one rotation per pair). 'Rotary' = rotation-based. Mathematically, it uses 2D rotation matrices, equivalent to complex-number multiplication.