
ALiBi: Attention with Linear Biases — Press 2021's Simple Solution and Extrapolation Advantage

ALiBi (Press 2021): injects position information by adding a linear bias to the attention score, with no position embeddings. Math: attention[i,j] += m × (j-i). Per-head slope hierarchy (m_h = 2^{-8h/H}). Strengths: zero parameters, train-short/test-long extrapolation, simple implementation. Comparison with RoPE; BLOOM and MPT usage.

Şükrü Yusuf KAYA
60 min read
Advanced
📐 ALiBi — the power of simplicity, royalty in extrapolation
In 2021, Press, Smith, and Lewis published the paper 'Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation'. The radical claim: drop position embeddings entirely and add a simple linear bias to the attention score. Train on 1024 tokens; it should still work when tested on 2048. It appeared the same year as RoPE (Su et al., 2021). Empirically: better than RoPE at extrapolation (sometimes), slightly worse in perplexity. Production: BLOOM, MPT (MosaicML), Replit Code — all use ALiBi. In 60 minutes you will have covered ALiBi's simple math, its per-head slope hierarchy, and its comparison with RoPE — finishing with a practical understanding.

Lesson Map (8 Sections)#

  1. Core idea — no position embeddings, an attention bias instead
  2. Math — attention[i,j] += m × (j-i)
  3. Per-head slopes — geometric sequence (m_h = 2^{-8h/H})
  4. Why it works — position information injected via linear decay
  5. Extrapolation magic — train short, test long
  6. PyTorch implementation — simple
  7. RoPE vs ALiBi — empirical comparison
  8. Production usage — BLOOM, MPT

1-4. ALiBi Math#

1.1 Core formula#

Standard attention:
attn_score(i, j) = q_i · k_j / sqrt(d_k)
ALiBi attention:
attn_score(i, j) = q_i · k_j / sqrt(d_k) + m × (j - i)
Key: the m × (j-i) bias term. It is negative for j < i (past) and positive for j > i (future) — combined with the causal mask, only the non-positive values remain.
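A tiny numeric sketch of the bias term (a single head with an illustrative slope m = 0.5; the sequence length is chosen only for demonstration):

```python
import torch

# Bias term m * (j - i) for a 4-token sequence, illustrative slope m = 0.5
m = 0.5
seq = 4
i = torch.arange(seq).unsqueeze(1)  # query positions (rows)
j = torch.arange(seq).unsqueeze(0)  # key positions (columns)
bias = m * (j - i).float()
print(bias)
# The diagonal (j = i) is 0; each step into the past subtracts m
```

Under the causal mask only the lower triangle of this matrix is ever used, which is why the effective bias is non-positive.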

1.2 Causal context#

In a decoder-only model j ≤ i (causal), so the bias is always non-positive:
  • j = i (same token): bias 0
  • j = i-1 (previous token): bias -m × 1 = -m
  • j = i-2: bias -2m
  • ...
  • j = i-k: bias -k × m
The farther away a token, the smaller (more negative) its attention score. Distance decay.

1.3 Per-head slopes#

Not all heads use the same slope m. Geometric sequence:
m_h = 2^{-8h/H}, h = 1, 2, ..., H (head index)
For 8 heads:
m_1 = 2^{-1} = 0.5, m_2 = 2^{-2} = 0.25, m_3 = 2^{-3} = 0.125, ..., m_8 = 2^{-8} ≈ 0.0039
Different heads get different distance-decay rates — some are local (large m, fast decay), others global (small m, slow decay).
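The slope schedule can be checked directly (a minimal sketch; head indices run 1..H as in the formula above):

```python
H = 8  # number of heads
slopes = [2 ** (-8 * h / H) for h in range(1, H + 1)]
print(slopes)
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```

For H = 8 the exponent -8h/H reduces to -h, so each head halves the previous head's slope — a geometric sequence with ratio 1/2.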

1.4 Why geometric#

Empirical. Linear and exponential variants were tried; the geometric sequence 2^{-8h/H} was the sweet spot.

1.5 No parameters#

The slopes m are fixed — never trained. A heuristic, a structural property of the model.
Vs RoPE: RoPE also adds no extra parameters, but it rotates Q and K. ALiBi only adds a term to the attention score.

1.6 Intuition#

The linear bias depends only on distance, not on the content of Q and K — and the model quickly learns that different heads cover different 'attention ranges'.
Head 1 (m = 0.5): strong attention only over the last 4-5 tokens → local syntactic patterns. Head 8 (m ≈ 0.004): near-uniform attention over the whole sequence → global semantics.
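To see how a slope sets an effective attention range, compare the multiplicative factor exp(-m·k) that the bias applies (before normalization) to a token k positions back — a sketch with illustrative distances:

```python
import math

# Factor exp(-m * k) contributed by the ALiBi bias at distance k
for m in (0.5, 0.0039):  # fast-decay head vs slow-decay head
    factors = {k: math.exp(-m * k) for k in (1, 4, 16, 64)}
    print(m, {k: round(v, 4) for k, v in factors.items()})
# m = 0.5:    a token 16 positions back is suppressed to ~3e-4 -> effectively local
# m = 0.0039: even 64 positions back keeps ~78% weight     -> effectively global
```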

1.7 NO position embeddings#

Important: in ALiBi, no position information is added to the input embeddings. It is injected solely through the attention bias.
```python
import math

import torch


def get_alibi_slopes(n_heads):
    """Geometric sequence of slopes for ALiBi."""
    def get_slopes_power_of_2(n):
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        ratio = start
        return [start * ratio ** i for i in range(n)]

    if math.log2(n_heads).is_integer():
        return get_slopes_power_of_2(n_heads)
    # Handle non-power-of-2 head counts (BLOOM extension)
    closest_power = 2 ** math.floor(math.log2(n_heads))
    slopes = get_slopes_power_of_2(closest_power)
    slopes += get_slopes_power_of_2(2 * closest_power)[0::2][:n_heads - closest_power]
    return slopes


def get_alibi_bias(n_heads, seq_len, device='cpu'):
    """Build the ALiBi bias matrix, shape [n_heads, seq_len, seq_len]."""
    slopes = torch.tensor(get_alibi_slopes(n_heads), device=device)
    # Distance matrix: distances[i, j] = j - i (negative for past positions)
    positions = torch.arange(seq_len, device=device)
    distances = positions.unsqueeze(0) - positions.unsqueeze(1)
    bias = slopes.view(-1, 1, 1) * distances.unsqueeze(0).float()
    return bias


class ALiBiAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv_proj = torch.nn.Linear(d_model, 3 * d_model)
        self.out_proj = torch.nn.Linear(d_model, d_model)

    def forward(self, x, alibi_bias=None):
        batch, seq, _ = x.shape
        qkv = self.qkv_proj(x).view(batch, seq, 3, self.n_heads, self.d_head)
        Q, K, V = qkv.unbind(2)
        Q, K, V = Q.transpose(1, 2), K.transpose(1, 2), V.transpose(1, 2)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
        # Add the ALiBi bias (broadcasts over the batch dimension)
        if alibi_bias is not None:
            scores = scores + alibi_bias
        # Causal mask
        mask = torch.triu(torch.ones(seq, seq, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float('-inf'))
        weights = torch.softmax(scores, dim=-1)
        out = weights @ V
        out = out.transpose(1, 2).contiguous().view(batch, seq, -1)
        return self.out_proj(out)


# Usage
model = ALiBiAttention(d_model=4096, n_heads=32)
alibi_bias = get_alibi_bias(32, 2048)  # [32, 2048, 2048]
print(f"Slopes (first 4): {get_alibi_slopes(32)[:4]}")
print(f"ALiBi bias shape: {alibi_bias.shape}")
```

ALiBi production PyTorch implementation

5. Extrapolation Magic#

5.1 Train short, test long#

The paper's claim: train ALiBi on 1024 tokens and it keeps working at 2048+ tokens.
Perplexity comparison (trained at 1024, varying test length):

| Method | 1024 | 2048 | 4096 |
| --- | --- | --- | --- |
| Sinusoidal | 18.6 | 41.2 (overflow) | 87 |
| Learned | 18.5 | 42.8 | overflow |
| RoPE | 18.6 | 20.1 | 26.5 |
| ALiBi | 18.6 | 18.7 | 19.0 |

ALiBi is the clear winner at extrapolation.

5.2 Why it works#

The linear bias is scale-invariant: whatever the sequence length, the mathematical structure stays the same.
  • Vs sinusoidal: attention patterns break outside the training distribution — essentially random behavior.
  • Vs RoPE: better than sinusoidal, but the frequency hierarchy depends on the training length.
  • ALiBi: the linear bias is fixed, so it generalizes naturally.
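The scale-invariance point can be sketched directly: the bias matrix built for a longer sequence contains the shorter one unchanged (a minimal single-slope helper, written here only for illustration):

```python
import torch

def alibi_bias(seq_len, m=0.5):
    # bias[i, j] = m * (j - i): the same formula at any length
    pos = torch.arange(seq_len)
    return m * (pos.unsqueeze(0) - pos.unsqueeze(1)).float()

short, long_ = alibi_bias(8), alibi_bias(16)
print(torch.equal(long_[:8, :8], short))  # True: the "training-length" block is identical
```

Nothing about the bias was fit to the training length, so extending the sequence only appends new, more negative entries — attention over the original range is untouched.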

5.3 Empirical caveat#

ALiBi's extrapolation is strong, but its in-distribution perplexity is slightly worse than RoPE's (~0.1-0.3 PPL).
Trade-off: extrapolation vs in-distribution quality. RoPE wins in-distribution; ALiBi wins at extrapolation.

5.4 Modern era status#

2024-2026: RoPE + NTK/YaRN scaling (Lesson 9.4) closed ALiBi's extrapolation gap. Modern models prefer RoPE.
ALiBi retains historical importance and specific production use cases (BLOOM, MPT).

7-8. ALiBi vs RoPE + Production#

7.1 Comparison matrix#

| Property | RoPE | ALiBi |
| --- | --- | --- |
| Parameter count | 0 | 0 |
| Implementation complexity | Medium (rotation) | Simple (additive bias) |
| In-distribution PPL | Best | Slightly worse |
| Extrapolation | Good with scaling (YaRN) | Excellent |
| Long-context (128K) | Requires YaRN | Native |
| Standard adoption (2026) | Dominant | Niche |

7.2 Practical choice (2026)#

  • Modern training: RoPE + scaling (Llama-3, Mistral, GPT-4)
  • Train-short/test-long research: ALiBi
  • Legacy production: BLOOM, MPT-7B, MPT-30B continue with ALiBi
  • Future trend: consolidation on RoPE; ALiBi in marginal use

7.3 Hybrid approaches#

Some models combine the two: adding a slight ALiBi-like decay on top of RoPE. Empirical experimentation continues.

7.4 BLOOM usage#

BLOOM 176B (2022): ALiBi. Multilingual training plus an extrapolation claim — ALiBi was a natural fit. The aftermath: most BLOOM-derivative models have switched to RoPE.

7.5 MPT (MosaicML)#

MPT-7B, MPT-30B: ALiBi. Extrapolation was a prominent marketing feature. MosaicML's 2024-2025 transition: moving toward RoPE.

7.6 ALiBi for Turkish#

No Turkish-specific advantage. RoPE is the practical choice for Turkish models in 2026.
✅ Lesson 9.3 Summary — ALiBi
ALiBi (Press 2021): injects position by adding a linear bias to the attention score. Per-head geometric slopes (m_h = 2^{-8h/H}) give heads different distance-decay rates. Zero parameters, simple implementation. Strong extrapolation: train short, test long (1024 → 4096). Slightly worse in-distribution PPL than RoPE, but better extrapolation. Production: BLOOM, MPT — a niche for ALiBi. Modern era (2024-2026): RoPE + YaRN scaling has closed the gap. In Lesson 9.4 we turn to long-context extrapolation techniques (NTK-aware, YaRN, LongRoPE).

Next Lesson: Long-Context Extrapolation#

Lesson 9.4: NTK-aware scaling, YaRN (Peng 2023), LongRoPE (Microsoft 2024) — techniques for extending RoPE to 128K+ contexts.

Frequently Asked Questions

Q: Why do modern models prefer RoPE over ALiBi?
A: Slightly worse in-distribution perplexity. Modern models prioritize in-distribution quality, and the extrapolation problem is now solved with RoPE + scaling. ALiBi remains a niche choice for extrapolation-critical scenarios.
