
Capstone Module 9: Implement Llama-3 RoPE from Scratch in 50 Lines — Pure NumPy + Visualization

Module 9 capstone: implement Llama-3 compatible RoPE in 50 lines of pure NumPy. cos/sin cache precomputation, pair-wise rotation, position visualization (cos/sin heatmap, attention bias pattern), and a compatibility test against actual Llama-3 weights. Turkish examples for interpreting position patterns.

Şükrü Yusuf KAYA
65 min read
Advanced
🎓 Module 9 Capstone — Build RoPE with your own hands
Across 4 lessons we covered why position encoding is necessary, the classic sinusoidal/learned approaches, the mathematical anatomy of RoPE, the ALiBi alternative, and long-context extrapolation (YaRN, LongRoPE). Now write your own RoPE: 50 lines of pure NumPy, a compatibility test against actual Llama-3 weights, and position patterns visualized on Turkish example sentences. 65 minutes from now you will know every line of RoPE, fully understand transformer position encoding, and be ready for Module 10.
```python
import numpy as np
import matplotlib.pyplot as plt


class LlamaRoPE:
    """Llama-3 compatible RoPE implementation in pure NumPy."""

    def __init__(self, dim, base=500000.0, max_seq_len=8192):
        self.dim = dim
        self.base = base
        self.max_seq_len = max_seq_len
        # Precompute inv_freq for each pair:
        # frequencies = 1 / base^(2i/d) for i = 0, 1, ..., d/2-1
        inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
        # Precompute cos/sin for all positions
        t = np.arange(max_seq_len, dtype=np.float32)
        freqs = np.outer(t, inv_freq)                  # [seq, dim/2]
        # Duplicate along the last dim (Hugging Face Llama uses this
        # half-split layout, not the paper's interleaved real/imag pairs)
        emb = np.concatenate([freqs, freqs], axis=-1)  # [seq, dim]
        self.cos_cache = np.cos(emb)
        self.sin_cache = np.sin(emb)

    def rotate_half(self, x):
        """Map the halves [x1, x2] of the last dim to [-x2, x1]."""
        d = x.shape[-1] // 2
        return np.concatenate([-x[..., d:], x[..., :d]], axis=-1)

    def apply(self, q, k, position_ids):
        """Apply RoPE rotation to Q and K vectors."""
        cos = self.cos_cache[position_ids]
        sin = self.sin_cache[position_ids]
        # Broadcasting: cos/sin shape [seq, dim], q/k shape [batch, heads, seq, dim]
        q_rot = q * cos + self.rotate_half(q) * sin
        k_rot = k * cos + self.rotate_half(k) * sin
        return q_rot, k_rot


# Test: Llama-3-8B parameters
head_dim = 128
rope = LlamaRoPE(dim=head_dim, base=500000.0, max_seq_len=8192)

batch, n_heads, seq = 1, 32, 100
Q = np.random.randn(batch, n_heads, seq, head_dim).astype(np.float32)
K = np.random.randn(batch, n_heads, seq, head_dim).astype(np.float32)
position_ids = np.arange(seq)

Q_rot, K_rot = rope.apply(Q, K, position_ids)
print(f"Q_rot shape: {Q_rot.shape}")


# Verify: attention score has the relative-position property
def attn_score(q, k):
    return (q * k).sum(axis=-1)  # dot product over the last dim

q_pos5 = Q_rot[0, 0, 5]
k_pos10 = K_rot[0, 0, 10]
k_pos15 = K_rot[0, 0, 15]

print(f"\nAttn(pos=5 query, pos=10 key): {attn_score(q_pos5, k_pos10):.4f}")
print(f"Attn(pos=5 query, pos=15 key): {attn_score(q_pos5, k_pos15):.4f}")
# Both scores depend only on the RELATIVE offset (10-5=5 vs. 15-5=10)

# Visualize: cos pattern across positions and dimensions
plt.figure(figsize=(12, 6))
plt.imshow(rope.cos_cache[:200], aspect='auto', cmap='RdBu_r')
plt.xlabel('Dimension')
plt.ylabel('Position')
plt.title('RoPE cos pattern (Llama-3, head_dim=128, base=500000)')
plt.colorbar()
# plt.savefig('rope-cos.png')
```
Llama-3 RoPE in 50 lines of pure NumPy
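The relative-position claim in the comments above can be verified numerically: shifting both the query and the key position by the same amount must leave the attention score unchanged, because only the offset enters the rotated dot product. A minimal standalone sketch (the shift of 1000 and the seed are arbitrary choices for illustration):

```python
import numpy as np

# Minimal RoPE rotation, same math as the LlamaRoPE class above
def make_cache(dim, base, max_seq_len):
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    freqs = np.outer(np.arange(max_seq_len), inv_freq)
    emb = np.concatenate([freqs, freqs], axis=-1)
    return np.cos(emb), np.sin(emb)

def rotate_half(x):
    d = x.shape[-1] // 2
    return np.concatenate([-x[..., d:], x[..., :d]], axis=-1)

def rope(x, pos, cos, sin):
    return x * cos[pos] + rotate_half(x) * sin[pos]

dim = 128
cos, sin = make_cache(dim, base=500000.0, max_seq_len=4096)

rng = np.random.default_rng(0)
q = rng.standard_normal(dim)
k = rng.standard_normal(dim)

# Score at (query pos 5, key pos 10) ...
s1 = rope(q, 5, cos, sin) @ rope(k, 10, cos, sin)
# ... equals the score at (1005, 1010): same relative offset of 5
s2 = rope(q, 1005, cos, sin) @ rope(k, 1010, cos, sin)
print(abs(s1 - s2))  # tiny (float rounding only): score depends on the offset alone
```

The same check fails for sinusoidal *added* embeddings, which is exactly the property RoPE was designed to provide.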

Position Visualization — Understanding the Patterns

Visualization 1: cos/sin heatmap

The RoPE cos_cache has shape [max_seq_len, dim]. Plot it as a heatmap:
  • X-axis: dimension (0 to head_dim-1)
  • Y-axis: position (0 to seq_len-1)
  • Color: cos value (-1 to 1)
Result: high-frequency dimensions (small i) oscillate rapidly while low-frequency dimensions change slowly, producing a wave pattern.

Visualization 2: relative position vs. attention bias

For a fixed Q at position 50, compute the attention score against K at every position:
attn(50, j) for j = 0 to 100
The graph shows attention peaking around j=50 (zero offset) and gradually decaying with |j-50|: RoPE has a natural distance decay, so long-range attention weakens.
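The decay curve described above can be reproduced with a few lines. This sketch fixes a random query at position 50 and scores it against the *same* vector placed at every key position 0..100, so only position drives the score (the seed and positions are illustrative choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Same cache construction as the capstone code (Llama-3 parameters)
dim, base = 128, 500000.0
inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
freqs = np.outer(np.arange(256), inv_freq)
emb = np.concatenate([freqs, freqs], axis=-1)
cos, sin = np.cos(emb), np.sin(emb)

def rotate_half(x):
    d = x.shape[-1] // 2
    return np.concatenate([-x[..., d:], x[..., :d]], axis=-1)

def rope(x, pos):
    return x * cos[pos] + rotate_half(x) * sin[pos]

rng = np.random.default_rng(0)
q = rng.standard_normal(dim)
k = q.copy()  # identical content, so only the position difference matters

scores = np.array([rope(q, 50) @ rope(k, j) for j in range(101)])

plt.figure(figsize=(8, 4))
plt.plot(range(101), scores)
plt.axvline(50, linestyle="--")
plt.xlabel("Key position j")
plt.ylabel("attn(50, j)")
plt.title("RoPE score vs. key position for a query fixed at 50")
# plt.savefig("rope-decay.png")

print(scores.argmax())  # 50: zero offset always scores highest
```

Because RoPE rotation is norm-preserving, the score at zero offset equals |q|^2 and is provably the maximum; away from j=50 the per-pair rotations dephase and the curve oscillates inside a decaying envelope.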

Visualization 3: frequency hierarchy

The period of pair i (for i = 0, ..., d/2-1):
period_i = 2π × base^{2i/d}
  • i=0: period 2π ≈ 6.28 positions (very fast)
  • i=32 (mid): period ≈ 4.4K positions
  • i=63 (slowest): period ≈ 2.6M positions
Different dimensions carry information at different temporal scales.
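These numbers follow directly from the inverse frequencies. A quick sketch computing the period of every pair for Llama-3's head_dim=128 and base=500000:

```python
import numpy as np

dim, base = 128, 500000.0
exponent = np.arange(0, dim, 2) / dim   # 2i/d for pair index i = 0..63
inv_freq = base ** (-exponent)
period = 2 * np.pi / inv_freq           # period_i = 2π · base^(2i/d)

print(f"fastest (i=0):  {period[0]:.2f} positions")     # 2π ≈ 6.28
print(f"middle  (i=32): {period[32]:,.0f} positions")   # ≈ 4,443
print(f"slowest (i=63): {period[63]:,.0f} positions")   # ≈ 2.6M, far beyond the 8K training context
```

The slowest pairs never complete even a fraction of a cycle within an 8K context, which is precisely why long-context methods (NTK/YaRN) rescale the low-frequency end.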
🎉 Module 9 Complete — Position Encoding
Across 5 lessons: why position encoding is necessary (permutation-invariance proof), the classic sinusoidal/learned approaches (Vaswani/GPT-2), RoPE as the modern standard (Su 2021, Llama-3 base 500K), the ALiBi alternative (Press 2021, BLOOM), long-context extension (NTK/YaRN/LongRoPE, 8K → 1M tokens), and this capstone NumPy implementation. You have solved the transformer's 'ordering problem' end to end. Module 9 inventory: 5 lessons, 335 min. Overall curriculum: 10 modules, 68 lessons, ~61 hours. Next up: Module 10 — Transformer Block (RMSNorm, SwiGLU, residual connections), the final pieces of the transformer architecture.

Module 9 Inventory (Complete)

| # | Lesson | Duration |
|---|--------|----------|
| 9.1 | Position Encoding Fundamentals (Sinusoidal/Learned) | 65 min |
| 9.2 | RoPE in Depth (Su 2021, Llama-3) | 75 min |
| 9.3 | ALiBi (Press 2021, BLOOM/MPT) | 60 min |
| 9.4 | Long Context (NTK/YaRN/LongRoPE) | 70 min |
| 9.5 | Capstone — Llama-3 RoPE in 50 Lines | 65 min |
| **Total** | 5 lessons | 335 min (~5.6 hours) |

Frequently Asked Questions

Why pure NumPy instead of PyTorch?
Pedagogically, NumPy is more transparent: no autograd magic, every step explicit. Use PyTorch in production, but build understanding with NumPy first. A PyTorch version was already given in Lesson 9.2.

