
Capstone Module 8: Alternatives to Quadratic Attention: Linear Attention, RetNet, and Mamba (State Space Models)

Module 8 capstone: the alternatives to quadratic attention. Linear Attention (Katharopoulos 2020): kernel trick + recurrent form. RetNet (Sun 2023): Microsoft's retention mechanism. Mamba (Gu & Dao 2023): selective state space models. Which sub-quadratic architecture fits which scenario, a GPT-4 vs Mamba comparison, hybrid models (Jamba), and future trends.

Şükrü Yusuf KAYA
80-minute read
Advanced
🌌 Beyond quadratic: the future architectures of attention
FlashAttention solved quadratic memory, but compute is still quadratic: seq² × d. The pairwise score count alone reaches roughly 10^10 at 128K context and 10^12 at 1M. That is a fundamental limit. The fix? Change the architecture. Since 2020 the search for sub-quadratic architectures has produced Linear Attention (kernel trick), RetNet (Microsoft's retention), and Mamba (Gu & Dao's state space models). In 2024-2026 these architectures approach transformer quality at small scale and beat the transformer on long context. Hybrid models (Jamba) use both attention and Mamba. After 80 minutes you will have mapped the post-quadratic attention landscape and gained a practical sense of which alternative is optimal for which scenario. This closes Module 8.

Capstone Flow (8 Stages)#

  1. The quadratic problem: fundamental limit recap
  2. Linear Attention (Katharopoulos 2020): kernel trick
  3. RetNet (Sun 2023): Microsoft's retention mechanism
  4. State Space Models: S4 → Mamba (Gu & Dao 2023)
  5. Mamba in detail: selective scan, hardware-aware
  6. Empirical comparison: quality vs efficiency
  7. Hybrid models: Jamba (attention + Mamba)
  8. 2026 forecast: future architectures

1. Quadratic Problem Recap#

1.1 Standard attention complexity#

Memory: O(seq²)
Compute: O(seq² × d)
For seq = 128K, d = 4096:
  • Memory (without FlashAttention): ~32 GB per head
  • Compute: ~6.7 × 10^13 FLOP

1.2 What FlashAttention solved#

Memory: O(seq²) → O(seq). Compute: unchanged, still O(seq² × d).
For long context (1M+), compute remains prohibitive.

1.3 What sub-quadratic means#

  • Linear: O(seq × d)
  • Log-linear: O(seq × log(seq) × d)
  • Sparse: O(k × seq × d), where k is the number of positions each token attends to
In sub-quadratic architectures, compute drops dramatically at long context. The trade-off: quality is usually somewhat lower. A rough comparison is sketched below.
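To make the classes above concrete, here is a back-of-the-envelope comparison. This is a sketch: d = 4096 is assumed, constant factors are ignored, and "linear" stands in for linear-attention/SSM-style O(seq × d²) scaling.

```python
# Order-of-magnitude attention compute at different context lengths.
d = 4096
for seq in (8_192, 131_072, 1_048_576):          # 8K, 128K, 1M tokens
    quadratic = seq * seq * d                     # softmax attention: O(seq^2 * d)
    linear = seq * d * d                          # kernel/SSM style: O(seq * d^2)
    print(f"seq={seq:>9,}  quadratic={quadratic:.2e}  linear={linear:.2e}")
```

At 128K the quadratic term already lands near 7 × 10^13, matching the figure in section 1.1, while the linear term stays two orders of magnitude lower.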

2. Linear Attention — Kernel Trick#

2.1 Katharopoulos 2020#

'Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention'.
Key insight: softmax(QK^T) = exp(QK^T) / Z. Replace exp with a kernel function φ:
Attention(Q, K, V) = φ(Q) (φ(K)^T V) / (φ(Q) φ(K)^T 1)
Why this works (sketched in code below):
  • φ(K)^T V is computed first: a [d, d] matrix
  • It is then multiplied by φ(Q): [seq, d] @ [d, d] = [seq, d]
  • Total compute: O(seq × d²), linear in seq!
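A minimal PyTorch sketch of this reordering (non-causal case for brevity). The elu + 1 feature map follows the paper's default, but names and shapes here are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def phi(x):
    # elu(x) + 1: the default positive feature map from Katharopoulos et al. 2020
    return F.elu(x) + 1.0

def linear_attention(q, k, v, eps=1e-6):
    # q, k: [batch, seq, d_k], v: [batch, seq, d_v]  (non-causal for brevity)
    q, k = phi(q), phi(k)
    kv = torch.einsum("bsd,bse->bde", k, v)        # phi(K)^T V: [batch, d_k, d_v]
    z = torch.einsum("bsd,bd->bs", q, k.sum(1))    # phi(Q) (phi(K)^T 1): [batch, seq]
    return torch.einsum("bsd,bde->bse", q, kv) / (z.unsqueeze(-1) + eps)

out = linear_attention(torch.randn(2, 1024, 64), torch.randn(2, 1024, 64),
                       torch.randn(2, 1024, 64))
print(out.shape)  # torch.Size([2, 1024, 64])
```

Note that nothing of size [seq, seq] is ever materialized; the only intermediate that depends on sequence length is the output itself.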

2.2 Recurrent form#

Linear attention can also be written in recurrent form:
S_t = S_{t-1} + φ(k_t) v_t^T          (state update)
output_t = φ(q_t) S_t / (φ(q_t) · Σ_{i≤t} φ(k_i))
RNN-like at inference, yet trainable like a transformer. Best of both worlds.
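The same computation as an explicit left-to-right scan, which is what makes O(1)-per-token inference possible. A sketch assuming a single head and no batching:

```python
import torch
import torch.nn.functional as F

def linear_attention_recurrent(q, k, v, eps=1e-6):
    # q, k: [seq, d_k], v: [seq, d_v]; causal by construction
    phi = lambda x: F.elu(x) + 1.0
    q, k = phi(q), phi(k)
    S = torch.zeros(k.shape[-1], v.shape[-1])   # running sum of phi(k_i) v_i^T
    z = torch.zeros(k.shape[-1])                # running sum of phi(k_i)
    ys = []
    for q_t, k_t, v_t in zip(q, k, v):
        S = S + torch.outer(k_t, v_t)           # state update
        z = z + k_t                             # normalizer update
        ys.append((q_t @ S) / (q_t @ z + eps))  # output_t
    return torch.stack(ys)
```

The state (S, z) has fixed size [d_k, d_v] + [d_k], independent of how many tokens have been processed.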

2.3 φ kernel choices#

  • φ(x) = elu(x) + 1: the Katharopoulos default
  • φ(x) = exp(x): closer to softmax
  • Random positive feature maps (FAVOR+): Performer's softmax approximation
Quality: the choice of φ matters empirically; elu + 1 is a reasonable trade-off.

2.4 Quality concern#

Linear attention is transformer-comparable on small models. At large scale (10B+) the quality gap becomes noticeable.
The general pattern: sub-quadratic architectures are cheaper to train and cheaper at inference, but their scaling laws are less favorable.

4-5. Mamba — Selective State Space Models#

4.1 State Space Models (SSM) tarihçesi#

From control theory (1960s): system dynamics are modeled with ODEs. Adapted to NLP:
  • S4 (Gu 2021): efficient long-range dependencies
  • S5: simpler implementation
  • Mamba (Gu & Dao 2023): selective + hardware-aware

4.2 Mamba core idea#

The SSM equations:
h_t = A h_{t-1} + B x_t
y_t = C h_t + D x_t
A, B, C, D are learnable matrices. The state h_t has fixed dimension, so there is no quadratic blow-up.
Mamba is selective: B, C, and the discretization step (and hence the effective A) are input-dependent, different for each token. This is analogous to the transformer's attention-based selection. A schematic recurrence follows below.
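A heavily simplified sketch of that recurrence with input-dependent B and C. All names and shapes here are illustrative assumptions; the real Mamba block also learns a per-token step size Δ, discretizes A, and fuses everything into a hardware-aware kernel.

```python
import torch

def selective_ssm(x, A, W_B, W_C, D):
    # x: [seq, d] inputs; A: [d_state] diagonal decay (kept fixed here);
    # W_B, W_C: [d, d_state] projections that make B_t, C_t input-dependent.
    d_state = A.shape[0]
    h = torch.zeros(x.shape[1], d_state)     # fixed-size state: no KV-cache growth
    ys = []
    for x_t in x:                            # x_t: [d]
        B_t = x_t @ W_B                      # input-dependent B_t: [d_state]
        C_t = x_t @ W_C                      # input-dependent C_t: [d_state]
        h = A * h + x_t[:, None] * B_t       # h_t = A h_{t-1} + B_t x_t (per channel)
        y_t = h @ C_t + D * x_t              # y_t = C_t h_t + D x_t
        ys.append(y_t)
    return torch.stack(ys)                   # [seq, d]

d, d_state, seq = 16, 4, 32
y = selective_ssm(torch.randn(seq, d), torch.rand(d_state) * 0.9,
                  torch.randn(d, d_state), torch.randn(d, d_state), torch.randn(d))
print(y.shape)  # torch.Size([32, 16])
```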

4.3 Selective scan algorithm#

Key innovation: a parallel scan algorithm (Blelloch 1990) with an efficient, hardware-aware GPU implementation.
Complexity:
Memory: O(seq × d_state)
Compute: O(seq × d × d_state)
Linear in seq! Same scaling as linear attention, but empirically with better quality. A toy illustration of the scan trick follows below.
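Why a parallel scan applies at all: a linear recurrence h_t = a_t · h_{t-1} + b_t can be expressed through an associative combine operator, which a GPU can evaluate as a balanced tree in O(log seq) depth. A toy correctness check (the real kernel operates on full state vectors and is fused into CUDA):

```python
import torch

def combine(left, right):
    # Composing two steps of h_t = a_t * h_{t-1} + b_t:
    # (a1, b1) then (a2, b2) -> (a2*a1, a2*b1 + b2). This operator is associative.
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

torch.manual_seed(0)
a, b = torch.rand(8), torch.rand(8)

# Sequential reference
h, ref = torch.zeros(()), []
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    ref.append(h)

# Prefix scan with the associative operator (written as a loop here;
# the parallel version evaluates the same combines as a tree)
acc, scanned = (torch.ones(()), torch.zeros(())), []
for a_t, b_t in zip(a, b):
    acc = combine(acc, (a_t, b_t))
    scanned.append(acc[1])

print(torch.allclose(torch.stack(ref), torch.stack(scanned)))  # True
```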

4.4 Mamba vs Transformer scaling#

Benchmarks from the Gu & Dao 2023 paper:
  • Same params, same training tokens
  • Small models (~3B): comparable quality
  • Long context (16K, 64K): Mamba dramatically better
  • Mamba inference ~5x faster (no KV cache!)

4.5 No KV cache!#

Mamba's state h_t has fixed dimension: there is no per-token cache growth. Inference memory stays constant regardless of context length, as the rough arithmetic below shows.
Llama-3 at 128K context: 16 GB of KV cache. Mamba: ~256 MB (constant).
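The transformer side of that comparison, worked out. This is a sketch that assumes Llama-3-8B-like settings (32 layers, 8 KV heads with GQA, head_dim 128), fp16, batch size 1:

```python
# KV cache of a GQA transformer vs. a fixed-size SSM state.
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2      # Llama-3-8B-like (assumed)
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # K and V
print(per_token // 1024, "KB per token")                    # 128 KB
print(round(per_token * 128_000 / 1e9, 1), "GB at 128K tokens")  # ~16.8 GB
# A Mamba layer instead carries a fixed-size recurrent state, so its memory
# does not grow with context length at all.
```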

4.6 Limitations#

In-context learning is weaker; even when scaled up, Mamba has not reached GPT-4-level quality. On certain reasoning tasks the transformer remains superior.

4.7 Production status (2026)#

  • Falcon-Mamba 7B (Technology Innovation Institute)
  • Mistral-NeMo Mamba variants
  • Codestral Mamba (Mistral)
  • Hybrid: Jamba (AI21 Labs) — Mamba + Attention layers alternating

7-8. Hybrid Models + 2026 Forecast#

7.1 Jamba (AI21 Labs 2024)#

The first production-grade hybrid model: a pattern of 1 attention layer per 7 Mamba layers (a toy layer schedule is sketched after the results below).
[Mamba × 7] → [Attention × 1] → [Mamba × 7] → [Attention × 1] → ...
Results:
  • Memory: ~30% of a pure transformer
  • Quality: comparable to Llama-2-70B
  • Long context (256K): supported natively; only the sparse attention layers need a KV cache, so it stays small
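A toy illustration of the 1:7 interleaving described above. The ratio is as stated in the text; the real Jamba also interleaves MoE feed-forward layers, which are omitted here.

```python
# Build a Jamba-style layer schedule: 7 Mamba layers followed by 1 attention layer.
n_blocks = 4
schedule = (["mamba"] * 7 + ["attention"]) * n_blocks
print(len(schedule), schedule.count("attention") / len(schedule))  # 32 layers, 12.5% attention
```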

7.2 2026 trend: Mamba/SSM mainstreaming#

  • Falcon-Mamba production-ready
  • MoE + Mamba combo trials
  • Multi-modal Mamba (vision + text)

7.3 The transformer's defense#

  • ICL (in-context learning): still better than Mamba's
  • Mature ecosystem (PyTorch, vLLM, HF integration)
  • Pre-training data abundance
Most likely a long period of coexistence.

7.4 Pragmatic advice (2026)#

  • General-purpose LLM: Transformer (Llama-3, GPT-4)
  • Long context (100K+): Hybrid (Jamba) or pure Mamba
  • Real-time inference: Mamba (no KV cache)
  • Research: explore both

7.5 For Turkish#

No language-specific advantage here. Tokenizer and training data matter more; the Mamba/Transformer choice comes down mostly to architecture and efficiency considerations.
🎉 Module 8 Complete: Attention Mathematics
Across 5 lessons: scaled dot-product attention (the heart of Vaswani 2017), multi-head attention + GQA/MQA (modern Llama-3), FlashAttention (memory/IO efficient), KV cache + paged attention (vLLM production serving), and attention alternatives (Linear/RetNet/Mamba sub-quadratic). You have now covered the transformer's heart and circulatory system end to end. Module 8 inventory: 5 lessons, 370 min (~6 hours). Overall curriculum: 9 modules, 63 lessons, ~55 hours. Next up: Module 9, Position Encoding. Absolute vs relative, sinusoidal vs learned, RoPE (modern Llama-3), ALiBi (Mistral).

Module 8 Inventory (Complete)#

| # | Lesson | Duration |
|---|--------|----------|
| 8.1 | Scaled Dot-Product Attention | 75 min |
| 8.2 | Multi-Head Attention + GQA/MQA | 70 min |
| 8.3 | FlashAttention IO-Aware | 75 min |
| 8.4 | KV Cache + Paged Attention + vLLM | 70 min |
| 8.5 | Capstone: Linear/RetNet/Mamba Alternatives | 80 min |
| Total | 5 lessons | 370 min (~6 hours) |

Frequently Asked Questions

Will Mamba replace the transformer?
Generally no; coexistence is the likely outcome. The transformer dominates thanks to ICL and its mature ecosystem. Mamba wins in long-context and efficient-inference scenarios. Hybrid models (Jamba) combine the advantages of both.
