
SwiGLU Activation: SiLU + GLU = The Heart of the Modern FFN, from Shazeer 2020 to Llama-3

Anatomy of the SwiGLU activation function: a SiLU (Sigmoid-weighted Linear Unit) base plus the Gated Linear Unit mechanism. Shazeer 2020, 'GLU Variants Improve Transformer'. Comparison with ReLU/GeLU and why modern models prefer it. FFN dimensions (the Llama d_ff = 8/3 × d_model rule), parameter math, Llama-3 implementation.

Şükrü Yusuf KAYA
65-minute read
Advanced
🚪 SwiGLU: the hidden hero of the modern FFN
For the transformer's FFN layer: ReLU in 2017, GeLU (Gaussian Error Linear Unit) in 2018, SwiGLU in 2020. In his 2020 paper 'GLU Variants Improve Transformer', Noam Shazeer tried 9 variants. SwiGLU won. Why? SiLU (a smooth, sigmoid-weighted linear unit) combined with the Gated Linear Unit mechanism: the model decides which information gets through. Llama-3, Mistral, Mixtral, and reportedly GPT-4 all use SwiGLU. Performance: a SwiGLU FFN gives roughly a 1-2% perplexity improvement over GeLU. 65 minutes from now you will have a solid grasp of SwiGLU's mathematical anatomy, the FFN dimensions (the Llama d_ff = 8/3 × d_model rule), and a production-grade implementation.

Lesson Map (10 Sections)#

  1. The FFN's role: why an FFN after attention
  2. Activation function evolution: ReLU → GeLU → SwiGLU
  3. SiLU (Swish): formula + intuition
  4. GLU (Gated Linear Unit): Dauphin 2017
  5. SwiGLU formula: the SiLU + GLU combination
  6. Shazeer 2020: comparing 9 variants
  7. Llama-3 FFN architecture: d_ff = 8/3 × d_model
  8. Parameter count: 3 matrices vs 2 matrices
  9. Implementation: production-grade PyTorch
  10. Empirical performance: quality vs compute

1-5. FFN + Activation Function Evolution#

1.1 The FFN's role#

Transformer block: Attention + FFN. The FFN is an independent, per-token computation (no attention, no mixing across positions).
FFN(x) = activation(x W_1) W_2
The FFN's job:
  • Transform the attention-contextualized representation
  • Apply a non-linear transformation (model capacity)
  • Operate token-wise (so it parallelizes trivially)

1.2 ReLU (Vaswani 2017)#

Original transformer:
FFN(x) = max(0, xW_1 + b_1) W_2 + b_2
ReLU = max(0, x). Simple and fast.
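A minimal PyTorch sketch of this original FFN (the dimensions are the illustrative 2017 defaults, not tied to any specific checkpoint):

import torch
import torch.nn as nn

class OriginalFFN(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2 (Vaswani 2017 style)."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)    # with bias, as in the original
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

x = torch.randn(2, 16, 512)        # (batch, seq, d_model)
print(OriginalFFN()(x).shape)      # torch.Size([2, 16, 512]), applied independently per token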

1.3 GeLU (BERT 2018, GPT-2)#

Hendrycks 2016: Gaussian Error Linear Unit.
GeLU(x) = x × Φ(x), where Φ is the standard normal CDF. Tanh approximation: GeLU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
Smooth (no hard corner at zero like ReLU). The BERT and GPT-2 standard.
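PyTorch exposes both the exact erf form and the tanh approximation; a quick check (the approximate keyword is available in recent PyTorch releases):

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
exact = F.gelu(x)                          # x * Phi(x), computed via erf
approx = F.gelu(x, approximate='tanh')     # the 0.5x(1 + tanh(...)) form above
print((exact - approx).abs().max())        # the approximation error is tiny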

1.4 SiLU / Swish (Ramachandran 2017, Hendrycks 2016)#

Swish:
SiLU(x) = x × σ(x), where σ is the sigmoid
Smooth and very close in shape to GeLU. Called 'Swish' in Google's 2017 paper and 'SiLU' elsewhere (the name PyTorch standardized on).
GeLU-comparable quality, slightly cheaper to compute (a single sigmoid instead of an erf/tanh evaluation).
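A small check that SiLU really is x times sigmoid(x), and how it compares to GeLU at a few points (illustrative only):

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
manual = x * torch.sigmoid(x)
print(torch.allclose(manual, F.silu(x)))      # True: SiLU(x) = x * sigmoid(x)
print((F.silu(x) - F.gelu(x)).abs().max())    # small gap, largest for moderately negative x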

1.5 GLU (Gated Linear Unit, Dauphin 2017)#

From a language-modeling paper. The gating mechanism:
GLU(x) = σ(x W_v) ⊙ (x W_u)
Two linear projections:
  • W_v: the 'gate' (a sigmoid filter)
  • W_u: the 'value' (the information to pass)
  • Element-wise product: gate × value
The model learns which values get through.
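A minimal sketch of this two-projection GLU as a module (the W_v / W_u names follow the formula above; note that PyTorch's built-in F.glu instead splits a single projection in half):

import torch
import torch.nn as nn

class GLU(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_v = nn.Linear(d_in, d_out)   # gate branch, passed through sigmoid
        self.w_u = nn.Linear(d_in, d_out)   # value branch

    def forward(self, x):
        return torch.sigmoid(self.w_v(x)) * self.w_u(x)   # gate ⊙ value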

1.6 SwiGLU (Shazeer 2020)#

The SiLU + GLU combination:
SwiGLU(x) = SiLU(x W_g) ⊙ (x W_u) = (x W_g × σ(x W_g)) ⊙ (x W_u)
SiLU replaces the sigmoid in GLU: a smoother gate.
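Written out as a plain function (a sketch; W_g and W_u are illustrative weight matrices, not trained weights):

import torch
import torch.nn.functional as F

def swiglu(x, w_g, w_u):
    # SwiGLU(x) = SiLU(x W_g) ⊙ (x W_u)
    return F.silu(x @ w_g) * (x @ w_u)

d_model, d_ff = 8, 16
x = torch.randn(4, d_model)
w_g, w_u = torch.randn(d_model, d_ff), torch.randn(d_model, d_ff)
print(swiglu(x, w_g, w_u).shape)   # torch.Size([4, 16])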

1.7 Empirical: Shazeer 2020#

T5-style base model, 9 activation variants; perplexity improvement over the ReLU baseline (selected results):
  • ReLU: baseline
  • GeLU: +0.3 PPL
  • Swish: +0.4 PPL
  • GLU: +0.8 PPL
  • ReGLU: +1.0 PPL
  • GeGLU: +1.3 PPL
  • SwiGLU: +1.5 PPL (best)
SwiGLU is the clear winner.

7-9. Llama-3 FFN Architecture#

7.1 Llama-3 FFN formula#

FFN(x) = (SiLU(x W_gate) ⊙ (x W_up)) W_down
3 linear matrices:
  • W_gate: d_model → d_ff
  • W_up: d_model → d_ff
  • W_down: d_ff → d_model
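Shape trace with small illustrative sizes (Llama-3 uses d_model = 4096; the full module follows in 7.4):

import torch
import torch.nn.functional as F

batch, seq, d_model, d_ff = 2, 8, 64, 176          # toy sizes, same structure
x = torch.randn(batch, seq, d_model)
W_gate, W_up = torch.randn(d_model, d_ff), torch.randn(d_model, d_ff)
W_down = torch.randn(d_ff, d_model)

gate = F.silu(x @ W_gate)        # (2, 8, 176)
up = x @ W_up                    # (2, 8, 176)
out = (gate * up) @ W_down       # (2, 8, 64), back to d_model
print(out.shape)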

7.2 d_ff dimension choice#

Vaswani 2017: d_ff = 4 × d_model (a heuristic).
The SwiGLU twist: there are now 3 matrices, so d_ff is shrunk to keep total parameters equal. The Llama rule:
d_ff = 8/3 × d_model ≈ 2.67 × d_model
With d_model = 4096 this gives ≈10923, rounded up to a multiple of 256 → d_ff = 11008 (the Llama-2-7B value; Llama-3-8B additionally applies a 1.3× multiplier and lands at d_ff = 14336).
Parameter count at d_ff = 11008: 3 × 4096 × 11008 ≈ 135M per FFN block.
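A sketch of the dimension rule as the public Llama reference code computes it; the multiple_of and ffn_dim_multiplier values below are the commonly cited config values and should be treated as assumptions:

def llama_ffn_dim(d_model, multiple_of=256, ffn_dim_multiplier=None):
    # start from 4 * d_model, take 2/3 of it (the 8/3 rule), then round up to a multiple
    hidden = int(2 * (4 * d_model) / 3)
    if ffn_dim_multiplier is not None:
        hidden = int(ffn_dim_multiplier * hidden)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(llama_ffn_dim(4096))                                            # 11008 (Llama-2-7B)
print(llama_ffn_dim(4096, multiple_of=1024, ffn_dim_multiplier=1.3))  # 14336 (Llama-3-8B)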

7.3 Parameter accounting#

The 3-matrix vs 2-matrix trade-off:
  • ReLU/GeLU FFN: 2 × d_model × d_ff = 2 × 4096 × 16384 ≈ 134M (d_ff = 4 × d_model)
  • SwiGLU FFN: 3 × d_model × d_ff = 3 × 4096 × 11008 ≈ 135M (d_ff = 8/3 × d_model, rounded)
Equal parameter budget! Only the way the capacity is organized differs.
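The same arithmetic as a quick check (FFN weight matrices only, no biases, d_model = 4096):

d_model = 4096
gelu_params = 2 * d_model * (4 * d_model)   # W_1 + W_2 with d_ff = 4 * d_model
swiglu_params = 3 * d_model * 11008         # W_gate + W_up + W_down with d_ff = 11008
print(f"{gelu_params/1e6:.0f}M vs {swiglu_params/1e6:.0f}M")   # 134M vs 135M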

7.4 Llama-3 production implementation#

import torch
import torch.nn as nn
import torch.nn.functional as F

class LlamaFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        # three projections: gate and up (d_model -> d_ff), down (d_ff -> d_model)
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        gate = F.silu(self.gate_proj(x))    # SiLU(x W_gate)
        up = self.up_proj(x)                # x W_up
        return self.down_proj(gate * up)    # (gate ⊙ up) W_down
bias=False is the modern Llama convention: no bias terms in the linear layers.
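Quick shape and parameter check, assuming the LlamaFFN module above (d_ff = 11008 here; substitute 14336 for the actual Llama-3-8B size):

ffn = LlamaFFN(d_model=4096, d_ff=11008)
x = torch.randn(2, 128, 4096)                            # (batch, seq, d_model)
print(ffn(x).shape)                                      # torch.Size([2, 128, 4096])
print(sum(p.numel() for p in ffn.parameters()) / 1e6)    # ≈135 (million parameters)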

7.5 SwiGLU memory footprint#

FFN intermediate activations (forward + backward):
  • One d_ff-wide intermediate at d_ff = 11008, batch = 32, seq = 2048: 32 × 2048 × 11008 × 2 bytes (bf16) ≈ 1.4 GB per layer
  • Across 32 layers: ≈45 GB of intermediates (FFN alone)
Memory-intensive: activation checkpointing or fused kernels that avoid materializing the intermediates are needed.
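A minimal activation-checkpointing sketch with torch.utils.checkpoint, reusing the LlamaFFN module from 7.4 (sizes are illustrative):

import torch
from torch.utils.checkpoint import checkpoint

ffn = LlamaFFN(d_model=4096, d_ff=11008)
x = torch.randn(1, 2048, 4096, requires_grad=True)

# Forward keeps only x and the output; the d_ff-wide gate/up intermediates
# are recomputed during the backward pass instead of being stored.
y = checkpoint(ffn, x, use_reentrant=False)
y.sum().backward()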
✅ Lesson 10.2 Summary: SwiGLU
SwiGLU activation: the combination of SiLU (the smooth Swish) and GLU (gated linear unit). The winner among Shazeer's 2020 variants, with a 1.5-point perplexity improvement in that comparison. Standard in Llama-3, Mistral, and reportedly GPT-4. 3 matrices (gate, up, down) vs the 2 matrices of a ReLU/GeLU FFN; to keep parameters equal, d_ff = 8/3 × d_model (vs 4 × d_model). At d_ff = 11008 that is ≈135M FFN params per block (Llama-3-8B's actual d_ff = 14336 gives ≈176M). Production details: bias=False, fused kernels. In Lesson 10.3 we move on to residual connections and the transformer block as a whole.

Next Lesson: The Complete Transformer Block#

Lesson 10.3: Attention + FFN + RMSNorm + Pre-LN + Residual: assembling all the pieces of the modern transformer block. The Llama-3 architecture diagram, forward pass, and gradient flow.

Frequently Asked Questions

How much do I lose if I keep GeLU instead of SwiGLU?
Marginal: roughly 1-2 perplexity points. SwiGLU is preferred in production, but GeLU is fine for legacy models. You cannot swap the activation when fine-tuning (it is an architecture change).
