Capstone Module 9: Implement Llama-3 RoPE from Scratch in 50 Lines — Pure NumPy + Visualization
Module 9 capstone: implement Llama-3 compatible RoPE in 50 lines of pure NumPy, covering cos/sin cache precomputation, pair-wise rotation, and position visualizations (cos/sin heatmaps, the attention bias pattern). Includes a compatibility test against Llama-3's real hyperparameters (head_dim=128, base=500000) and Turkish example sentences for interpreting position patterns.
Şükrü Yusuf KAYA
65 min read
Advanced
🎓 Module 9 Capstone — Build RoPE with Your Own Hands
Over the last four lessons we covered: why position encoding is necessary, the classic sinusoidal/learned approaches, the mathematical anatomy of RoPE, the ALiBi alternative, and long-context extrapolation (YaRN, LongRoPE). Now write your own RoPE: 50 lines of pure NumPy, a compatibility test against Llama-3's actual hyperparameters, and position-pattern visualizations on Turkish example sentences. 65 minutes from now you will command every line of RoPE and head into Module 10 with transformer position encoding fully understood.
```python
import numpy as np
import matplotlib.pyplot as plt


class LlamaRoPE:
    """Llama-3 compatible RoPE implementation in pure NumPy."""

    def __init__(self, dim, base=500000.0, max_seq_len=8192):
        self.dim = dim
        self.base = base
        self.max_seq_len = max_seq_len
        # Precompute inv_freq for each rotation pair:
        # theta_i = 1 / base^(2i/d) for i = 0, 1, ..., d/2 - 1
        inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
        # Precompute cos/sin for all positions
        t = np.arange(max_seq_len, dtype=np.float32)
        freqs = np.outer(t, inv_freq)  # [seq, dim/2]
        # Duplicate the frequencies: HF Llama uses the half-split layout
        # (dimension j is paired with j + dim/2), not an interleaved one
        emb = np.concatenate([freqs, freqs], axis=-1)  # [seq, dim]
        self.cos_cache = np.cos(emb)
        self.sin_cache = np.sin(emb)

    def rotate_half(self, x):
        """Swap the two halves of the last dim, negating the second half."""
        d = x.shape[-1] // 2
        return np.concatenate([-x[..., d:], x[..., :d]], axis=-1)

    def apply(self, q, k, position_ids):
        """Apply the RoPE rotation to Q and K tensors."""
        cos = self.cos_cache[position_ids]
        sin = self.sin_cache[position_ids]
        # Broadcasting: cos/sin are [seq, dim]; q/k are [batch, heads, seq, dim]
        q_rot = q * cos + self.rotate_half(q) * sin
        k_rot = k * cos + self.rotate_half(k) * sin
        return q_rot, k_rot


# Test with Llama-3-8B hyperparameters
head_dim = 128
rope = LlamaRoPE(dim=head_dim, base=500000.0, max_seq_len=8192)

batch, n_heads, seq = 1, 32, 100
Q = np.random.randn(batch, n_heads, seq, head_dim).astype(np.float32)
K = np.random.randn(batch, n_heads, seq, head_dim).astype(np.float32)
position_ids = np.arange(seq)

Q_rot, K_rot = rope.apply(Q, K, position_ids)
print(f"Q_rot shape: {Q_rot.shape}")


# Verify: the attention score depends on relative position
def attn_score(q, k):
    return (q * k).sum(axis=-1)  # dot product over the last dim


q_pos5 = Q_rot[0, 0, 5]
k_pos10 = K_rot[0, 0, 10]
k_pos15 = K_rot[0, 0, 15]

print(f"\nAttn(pos=5 query, pos=10 key): {attn_score(q_pos5, k_pos10):.4f}")
print(f"Attn(pos=5 query, pos=15 key): {attn_score(q_pos5, k_pos15):.4f}")
# Both scores depend on the RELATIVE offset (10-5=5 vs 15-5=10)

# Visualize: cos pattern across positions and dimensions
plt.figure(figsize=(12, 6))
plt.imshow(rope.cos_cache[:200], aspect='auto', cmap='RdBu_r')
plt.xlabel('Dimension')
plt.ylabel('Position')
plt.title('RoPE cos pattern (Llama-3, head_dim=128, base=500000)')
plt.colorbar()
# plt.savefig('rope-cos.png')
```
Llama-3 RoPE in 50 lines of pure NumPy
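The two printed scores differ because the query-key offsets differ (5 vs 10). To confirm that the score depends only on the offset and not on absolute positions, here is a short sanity check; it is a sketch continuing from the listing above, and `rotate_to` is a helper defined only for this check:

```python
# Sanity check: RoPE scores are a function of the relative offset only.
# Reuses `rope`, `Q`, and `K` from the listing above.
def rotate_to(x, pos):
    """Rotate a single [dim] vector to absolute position `pos`."""
    return x * rope.cos_cache[pos] + rope.rotate_half(x) * rope.sin_cache[pos]

q = Q[0, 0, 5]   # one head's raw (unrotated) query vector
k = K[0, 0, 10]  # one head's raw (unrotated) key vector

s1 = (rotate_to(q, 5) * rotate_to(k, 10)).sum()   # offset 5
s2 = (rotate_to(q, 25) * rotate_to(k, 30)).sum()  # same content, same offset 5
s3 = (rotate_to(q, 5) * rotate_to(k, 15)).sum()   # offset 10

print(f"offset 5 at (5, 10):  {s1:.6f}")
print(f"offset 5 at (25, 30): {s2:.6f}")  # matches s1 up to float error
print(f"offset 10 at (5, 15): {s3:.6f}")  # different offset, different score
assert np.allclose(s1, s2, atol=1e-4)
```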
Position Visualization — Understanding the Patterns
Visualization 1: cos/sin heatmap
The RoPE cos_cache has shape [max_seq_len, dim]. Plot it as a heatmap (see the sketch after this list):
- X-axis: dimension (0 to dim-1, where dim is head_dim=128 for Llama-3)
- Y-axis: position (0 to seq_len-1)
- Color: cos value (-1 to 1)
Result: high-frequency dimensions (small i) oscillate quickly and low-frequency dimensions slowly, producing a wave pattern.
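A minimal sketch, continuing from the listing above (which already plots the cos cache on its own); this version puts the cos and sin caches side by side:

```python
# Sketch: cos and sin caches as side-by-side heatmaps.
# Reuses `rope` and the matplotlib import from the listing above.
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)
for ax, cache, name in [(axes[0], rope.cos_cache, 'cos'),
                        (axes[1], rope.sin_cache, 'sin')]:
    im = ax.imshow(cache[:200], aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
    ax.set_xlabel('Dimension')
    ax.set_title(f'RoPE {name} cache (first 200 positions)')
axes[0].set_ylabel('Position')
fig.colorbar(im, ax=axes.tolist(), shrink=0.8)
```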
Visualization 2: relative position vs. attention bias
For a fixed query at position 50, compute the attention score against keys at every position:
attn(50, j) for j = 0 to 100
The curve peaks at j=50 (zero offset) and falls off, with oscillations, as |j-50| grows. This is RoPE's natural distance decay: long-range attention weakens.
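A sketch of this curve, continuing from the listing above. To isolate the positional effect, one shared random vector is rotated to every position, so the score variation comes from the rotation alone:

```python
# Sketch: attention-bias curve for a query fixed at position 50.
# Reuses `rope` and `head_dim` from the listing above.
x = np.random.randn(head_dim).astype(np.float32)

# Rotate the same vector to position 50 (query) and positions 0..100 (keys)
q50 = x * rope.cos_cache[50] + rope.rotate_half(x) * rope.sin_cache[50]
scores = [
    float((q50 * (x * rope.cos_cache[j]
                  + rope.rotate_half(x) * rope.sin_cache[j])).sum())
    for j in range(101)
]

plt.figure(figsize=(10, 4))
plt.plot(range(101), scores)
plt.axvline(50, linestyle='--', color='gray')
plt.xlabel('Key position j')
plt.ylabel('score q(50) . k(j)')
plt.title('RoPE relative-position pattern (query fixed at position 50)')
```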
Visualization 3: frequency hierarchy
The rotation period of pair i (for i = 0, ..., d/2-1):
period_i = 2π × base^(2i/d)
- i=0: period 2π ≈ 6.28 tokens (fastest)
- i=32 (mid, 2i/d = 0.5): period 2π × √500000 ≈ 4,400 tokens
- i=63 (slowest, 2i/d ≈ 0.98): period ≈ 2.6M tokens
Different dimensions carry information at different temporal scales, as the sketch below computes.
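A self-contained sketch that computes these periods with the same inv_freq formula the class uses; the printed values back the numbers in the list above:

```python
import numpy as np

dim, base = 128, 500000.0          # Llama-3 head_dim and RoPE base
i = np.arange(dim // 2)            # rotation-pair index: 0 .. 63
inv_freq = base ** (-2 * i / dim)  # angular frequency of each pair
period = 2 * np.pi / inv_freq      # tokens per full rotation

for idx in [0, 16, 32, 48, 63]:
    print(f"pair {idx:2d}: period = {period[idx]:,.0f} tokens")
```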
🎉 Module 9 Complete — Position Encoding
Across 5 lessons: why position encoding is necessary (the permutation-invariance proof), the sinusoidal/learned classics (Vaswani/GPT-2), RoPE as the modern standard (Su 2021, Llama-3 base 500K), the ALiBi alternative (Press 2021, BLOOM), long-context extension (NTK/YaRN/LongRoPE, 8K → 1M tokens), and this capstone NumPy implementation. You have solved the transformer's 'order problem' end to end. Module 9 inventory: 5 lessons, 335 min. Overall curriculum: 10 modules, 68 lessons, ~61 hours. Up next: Module 10 — Transformer Block (RMSNorm, SwiGLU, residual connections), the final pieces of the transformer architecture.
Module 9 Inventory (Completed)
| # | Lesson | Duration |
|---|---|---|
| 9.1 | Position Encoding Fundamentals (Sinusoidal/Learned) | 65 min |
| 9.2 | RoPE in Depth (Su 2021, Llama-3) | 75 min |
| 9.3 | ALiBi (Press 2021, BLOOM/MPT) | 60 min |
| 9.4 | Long Context (NTK/YaRN/LongRoPE) | 70 min |
| 9.5 | Capstone — Llama-3 RoPE in 50 Lines | 65 min |
| Total | 5 lessons | 335 min (~5.6 hours) |
Frequently Asked Questions
Why NumPy instead of PyTorch? Pedagogically, NumPy is more transparent: no autograd magic, every step explicit. Use PyTorch in production, but build the understanding in NumPy first. A PyTorch version was already given in Lesson 9.2.