
Capstone Module 9: Implement Llama-3 RoPE from Scratch in 50 Lines — Pure NumPy + Visualization

Module 9 capstone: implement Llama-3 compatible RoPE in 50 lines of pure NumPy. cos/sin cache precomputation, pair-wise rotation, position visualization (cos/sin heatmap, attention bias pattern), and a compatibility test against actual Llama-3 weights. Turkish examples for interpreting position patterns.

Şükrü Yusuf KAYA
65 min read
Advanced
🎓 Module 9 Capstone — Build RoPE with your own hands
Across 4 lessons we covered why position encoding is necessary, the classic sinusoidal/learned approaches, the mathematical anatomy of RoPE, the ALiBi alternative, and long-context extrapolation (YaRN, LongRoPE). Now write your own RoPE: 50 lines of pure NumPy, a compatibility test against actual Llama-3 weights, and position patterns visualized on Turkish example sentences. 65 minutes from now you will know every line of RoPE, fully understand transformer position encoding, and be ready for Module 10.
```python
import numpy as np
import matplotlib.pyplot as plt


class LlamaRoPE:
    """Llama-3 compatible RoPE implementation in pure NumPy."""

    def __init__(self, dim, base=500000.0, max_seq_len=8192):
        self.dim = dim
        self.base = base
        self.max_seq_len = max_seq_len
        # Precompute inv_freq for each pair:
        # frequencies = 1 / base^(2i/d) for i = 0, 1, ..., d/2-1
        inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
        # Precompute cos/sin for all positions
        t = np.arange(max_seq_len, dtype=np.float32)
        freqs = np.outer(t, inv_freq)                  # [seq, dim/2]
        # Duplicate along the last dim (Hugging Face Llama uses this
        # half-split layout, not the paper's interleaved real/imag pairs)
        emb = np.concatenate([freqs, freqs], axis=-1)  # [seq, dim]
        self.cos_cache = np.cos(emb)
        self.sin_cache = np.sin(emb)

    def rotate_half(self, x):
        """Map the halves [x1, x2] of the last dim to [-x2, x1]."""
        d = x.shape[-1] // 2
        return np.concatenate([-x[..., d:], x[..., :d]], axis=-1)

    def apply(self, q, k, position_ids):
        """Apply RoPE rotation to Q and K vectors."""
        cos = self.cos_cache[position_ids]
        sin = self.sin_cache[position_ids]
        # Broadcasting: cos/sin shape [seq, dim], q/k shape [batch, heads, seq, dim]
        q_rot = q * cos + self.rotate_half(q) * sin
        k_rot = k * cos + self.rotate_half(k) * sin
        return q_rot, k_rot


# Test: Llama-3-8B parameters
head_dim = 128
rope = LlamaRoPE(dim=head_dim, base=500000.0, max_seq_len=8192)

batch, n_heads, seq = 1, 32, 100
Q = np.random.randn(batch, n_heads, seq, head_dim).astype(np.float32)
K = np.random.randn(batch, n_heads, seq, head_dim).astype(np.float32)
position_ids = np.arange(seq)

Q_rot, K_rot = rope.apply(Q, K, position_ids)
print(f"Q_rot shape: {Q_rot.shape}")


# Verify: attention score has the relative-position property
def attn_score(q, k):
    return (q * k).sum(axis=-1)  # dot product over the last dim

q_pos5 = Q_rot[0, 0, 5]
k_pos10 = K_rot[0, 0, 10]
k_pos15 = K_rot[0, 0, 15]

print(f"\nAttn(pos=5 query, pos=10 key): {attn_score(q_pos5, k_pos10):.4f}")
print(f"Attn(pos=5 query, pos=15 key): {attn_score(q_pos5, k_pos15):.4f}")
# Both scores depend only on the RELATIVE offset (10-5=5 vs. 15-5=10)

# Visualize: cos pattern across positions and dimensions
plt.figure(figsize=(12, 6))
plt.imshow(rope.cos_cache[:200], aspect='auto', cmap='RdBu_r')
plt.xlabel('Dimension')
plt.ylabel('Position')
plt.title('RoPE cos pattern (Llama-3, head_dim=128, base=500000)')
plt.colorbar()
# plt.savefig('rope-cos.png')
```
Llama-3 RoPE in 50 lines of pure NumPy
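The relative-position claim in the comments above can be verified numerically: shifting both the query and the key position by the same amount must leave the attention score unchanged, because only the offset enters the rotated dot product. A minimal standalone sketch (the shift of 1000 and the seed are arbitrary choices for illustration):

```python
import numpy as np

# Minimal RoPE rotation, same math as the LlamaRoPE class above
def make_cache(dim, base, max_seq_len):
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    freqs = np.outer(np.arange(max_seq_len), inv_freq)
    emb = np.concatenate([freqs, freqs], axis=-1)
    return np.cos(emb), np.sin(emb)

def rotate_half(x):
    d = x.shape[-1] // 2
    return np.concatenate([-x[..., d:], x[..., :d]], axis=-1)

def rope(x, pos, cos, sin):
    return x * cos[pos] + rotate_half(x) * sin[pos]

dim = 128
cos, sin = make_cache(dim, base=500000.0, max_seq_len=4096)

rng = np.random.default_rng(0)
q = rng.standard_normal(dim)
k = rng.standard_normal(dim)

# Score at (query pos 5, key pos 10) ...
s1 = rope(q, 5, cos, sin) @ rope(k, 10, cos, sin)
# ... equals the score at (1005, 1010): same relative offset of 5
s2 = rope(q, 1005, cos, sin) @ rope(k, 1010, cos, sin)
print(abs(s1 - s2))  # tiny (float rounding only): score depends on the offset alone
```

The same check fails for sinusoidal *added* embeddings, which is exactly the property RoPE was designed to provide.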

Position Visualization — Understanding the Patterns

Visualization 1: cos/sin heatmap

The RoPE cos_cache has shape [max_seq_len, dim]. Plot it as a heatmap:
  • X-axis: dimension (0 to head_dim-1)
  • Y-axis: position (0 to seq_len-1)
  • Color: cos value (-1 to 1)
Result: high-frequency dimensions (small i) oscillate rapidly while low-frequency dimensions change slowly, producing a wave pattern.

Visualization 2: relative position vs. attention bias

For a fixed Q at position 50, compute the attention score against K at every position:
attn(50, j) for j = 0 to 100
The graph shows attention peaking around j=50 (zero offset) and gradually decaying with |j-50|: RoPE has a natural distance decay, so long-range attention weakens.
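The decay curve described above can be reproduced with a few lines. This sketch fixes a random query at position 50 and scores it against the *same* vector placed at every key position 0..100, so only position drives the score (the seed and positions are illustrative choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Same cache construction as the capstone code (Llama-3 parameters)
dim, base = 128, 500000.0
inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
freqs = np.outer(np.arange(256), inv_freq)
emb = np.concatenate([freqs, freqs], axis=-1)
cos, sin = np.cos(emb), np.sin(emb)

def rotate_half(x):
    d = x.shape[-1] // 2
    return np.concatenate([-x[..., d:], x[..., :d]], axis=-1)

def rope(x, pos):
    return x * cos[pos] + rotate_half(x) * sin[pos]

rng = np.random.default_rng(0)
q = rng.standard_normal(dim)
k = q.copy()  # identical content, so only the position difference matters

scores = np.array([rope(q, 50) @ rope(k, j) for j in range(101)])

plt.figure(figsize=(8, 4))
plt.plot(range(101), scores)
plt.axvline(50, linestyle="--")
plt.xlabel("Key position j")
plt.ylabel("attn(50, j)")
plt.title("RoPE score vs. key position for a query fixed at 50")
# plt.savefig("rope-decay.png")

print(scores.argmax())  # 50: zero offset always scores highest
```

Because RoPE rotation is norm-preserving, the score at zero offset equals |q|^2 and is provably the maximum; away from j=50 the per-pair rotations dephase and the curve oscillates inside a decaying envelope.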

Visualization 3: frequency hierarchy

The period of pair i (for i = 0, ..., d/2-1):
period_i = 2π × base^{2i/d}
  • i=0: period 2π ≈ 6.28 positions (very fast)
  • i=32 (mid): period ≈ 4.4K positions
  • i=63 (slowest): period ≈ 2.6M positions
Different dimensions carry information at different temporal scales.
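These numbers follow directly from the inverse frequencies. A quick sketch computing the period of every pair for Llama-3's head_dim=128 and base=500000:

```python
import numpy as np

dim, base = 128, 500000.0
exponent = np.arange(0, dim, 2) / dim   # 2i/d for pair index i = 0..63
inv_freq = base ** (-exponent)
period = 2 * np.pi / inv_freq           # period_i = 2π · base^(2i/d)

print(f"fastest (i=0):  {period[0]:.2f} positions")     # 2π ≈ 6.28
print(f"middle  (i=32): {period[32]:,.0f} positions")   # ≈ 4,443
print(f"slowest (i=63): {period[63]:,.0f} positions")   # ≈ 2.6M, far beyond the 8K training context
```

The slowest pairs never complete even a fraction of a cycle within an 8K context, which is precisely why long-context methods (NTK/YaRN) rescale the low-frequency end.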
🎉 Module 9 Complete — Position Encoding
Across 5 lessons: why position encoding is necessary (permutation-invariance proof), the classic sinusoidal/learned approaches (Vaswani/GPT-2), RoPE as the modern standard (Su 2021, Llama-3 base 500K), the ALiBi alternative (Press 2021, BLOOM), long-context extension (NTK/YaRN/LongRoPE, 8K → 1M tokens), and this capstone NumPy implementation. You have solved the transformer's 'ordering problem' end to end. Module 9 inventory: 5 lessons, 335 min. Overall curriculum: 10 modules, 68 lessons, ~61 hours. Next up: Module 10 — Transformer Block (RMSNorm, SwiGLU, residual connections), the final pieces of the transformer architecture.

Module 9 Inventory (Complete)

| # | Lesson | Duration |
|---|--------|----------|
| 9.1 | Position Encoding Fundamentals (Sinusoidal/Learned) | 65 min |
| 9.2 | RoPE in Depth (Su 2021, Llama-3) | 75 min |
| 9.3 | ALiBi (Press 2021, BLOOM/MPT) | 60 min |
| 9.4 | Long Context (NTK/YaRN/LongRoPE) | 70 min |
| 9.5 | Capstone — Llama-3 RoPE in 50 Lines | 65 min |
| **Total** | 5 lessons | 335 min (~5.6 hours) |

Frequently Asked Questions

Why pure NumPy instead of PyTorch?
Pedagogically, NumPy is more transparent: no autograd magic, every step explicit. Use PyTorch in production, but build understanding with NumPy first. A PyTorch version was already given in Lesson 9.2.

