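Why is d_ff 8/3 × d_model rather than 4 × d_model?

SwiGLU uses three matrices (gate, up, down) instead of two, so d_ff is reduced to keep the parameter count equal. With d_ff = 4 × d_model, a two-matrix ReLU FFN has 2 × d_model × 4 × d_model = 8 × d_model² parameters, while a three-matrix SwiGLU FFN at the same width would have 3 × d_model × 4 × d_model = 12 × d_model². Matching the budget requires d_ff = 4 × d_model × 2/3 = 8/3 × d_model.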

SwiGLU Activation: SiLU + GLU = The Heart of the Modern FFN, from Shazeer 2020 to Llama-3

The anatomy of the SwiGLU activation function: a SiLU (Sigmoid-weighted Linear Unit) base combined with the Gated Linear Unit mechanism, as introduced in Shazeer 2020, 'GLU Variants Improve Transformer'. Covers the comparison with ReLU and GeLU, why modern models prefer it, FFN dimensions (the d_ff = 8/3 × d_model rule used by the Llama family), parameter math, and the Llama-3 implementation.

Şükrü Yusuf KAYA

65-minute read

13.05.2026

Advanced

🚪 SwiGLU: the hidden hero of the modern FFN

The Transformer FFN layer used ReLU in 2017, GeLU (Gaussian Error Linear Unit) from 2018, and SwiGLU from 2020. In the 2020 paper 'GLU Variants Improve Transformer', Noam Shazeer tried 9 variants, and SwiGLU came out at or near the top. Why? It combines SiLU (a smooth, sigmoid-weighted linear unit) with the Gated Linear Unit mechanism, so the model decides 'which information gets through'. Llama-3, Mistral, and Mixtral all use SwiGLU; GPT-4 is often assumed to as well, although its architecture has not been published. Performance: a SwiGLU FFN gives roughly 1-2% better perplexity than GeLU. After 65 minutes you will have a thorough grasp of SwiGLU's mathematical anatomy, FFN dimensions (the d_ff = 8/3 × d_model rule used by the Llama family), and the production implementation.

Lesson Map (10 Sections)#

The role of the FFN: why an FFN follows attention
Activation function evolution: ReLU → GeLU → SwiGLU
SiLU (Swish): formula and intuition
GLU (Gated Linear Unit): Dauphin 2017
SwiGLU formula: combining SiLU and GLU
Shazeer 2020: comparison of 9 variants
Llama-3 FFN architecture: the d_ff = 8/3 × d_model rule
Parameter count: 3 matrices vs 2 matrices
Implementation: production-grade PyTorch
Empirical performance: quality vs compute

1-6. FFN and the Evolution of Activation Functions#

1.1 The role of the FFN#

A Transformer block is attention followed by an FFN. The FFN runs an independent computation for each token, with no attention across positions.

FFN(x) = activation(x W_1) W_2

The FFN's job:

Transform the contextualized representation produced by attention
Apply a non-linear transformation (model capacity)
Operate token-wise, so all positions can run in parallel
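To make the token-wise property concrete, here is a minimal pure-Python sketch of FFN(x) = activation(x W_1) W_2 applied to two tokens. The helper names (`matvec`, `ffn`) and the toy weights are illustrative assumptions, not part of any library.

```python
def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def relu(v):
    return max(0.0, v)

def ffn(x, W1, W2, activation=relu):
    """FFN(x) = activation(x W_1) W_2 for a single token vector x."""
    hidden = [activation(h) for h in matvec(x, W1)]
    return matvec(hidden, W2)

# Toy weights (illustrative values): d_model=2, d_ff=3
W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]

tokens = [[1.0, 2.0], [-1.0, 0.5]]
outputs = [ffn(t, W1, W2) for t in tokens]
print(outputs)

# Reordering the tokens only reorders the outputs: no cross-token interaction
swapped = [ffn(t, W1, W2) for t in reversed(tokens)]
print(swapped == list(reversed(outputs)))
```

Each token is processed by the same weights and never sees its neighbours, which is why the whole sequence can be computed in one batched matrix multiply.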

1.2 ReLU (Vaswani 2017)#

The original Transformer:

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2

ReLU(x) = max(0, x). Simple and fast.

1.3 GeLU (BERT 2018, GPT-2)#

Hendrycks 2016: Gaussian Error Linear Unit.

GeLU(x) = x × Φ(x)        # Φ = standard normal CDF
        ≈ 0.5x(1 + tanh(sqrt(2/π)(x + 0.044715x³)))

Smooth: it has no hard corner like ReLU's kink at zero. Standard in BERT and GPT-2.
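As a quick check of the two GeLU forms, the sketch below evaluates the exact CDF-based definition and the tanh approximation side by side, next to ReLU. It uses only the Python standard library; the function names are illustrative.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def gelu_exact(x: float) -> float:
    """GeLU(x) = x * Phi(x), with Phi the standard normal CDF (via erf)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GeLU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  "
          f"gelu={gelu_exact(x):.4f}  tanh-approx={gelu_tanh(x):.4f}")
```

Unlike ReLU, GeLU lets small negative inputs through with a small negative output, and the two GeLU forms agree closely over this range.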

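1.4 SiLU / Swish (Hendrycks 2016, Ramachandran 2017)#

SiLU, also known as Swish:

SiLU(x) = x × σ(x)        # σ = sigmoid

Smooth and non-monotonic, much like GeLU. The name 'SiLU' comes from Hendrycks and Gimpel (2016); 'Swish' is the name Google's Ramachandran et al. (2017) gave the same function.

Quality is comparable to GeLU, and it is slightly cheaper to compute.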
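A minimal standard-library sketch of SiLU follows. The function names are illustrative, and the plain sigmoid used here is only meant for moderate input magnitudes.

```python
import math

def sigmoid(x: float) -> float:
    # Plain form; fine for moderate |x|, not hardened against overflow
    return 1.0 / (1.0 + math.exp(-x))

def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x), also known as Swish."""
    return x * sigmoid(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  silu={silu(x):+.4f}")
```

Note that silu(-1.0) is more negative than silu(-5.0): the function dips below zero and then returns toward it, which is the non-monotonic behaviour that distinguishes it from ReLU.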

1.5 GLU (Gated Linear Unit, Dauphin 2017)#

Introduced in a language modeling paper. The gating mechanism:

GLU(x) = σ(x W_g) ⊙ (x W_u)

Two linear projections:

W_g: the 'gate' (a sigmoid filter with values in (0, 1))
W_u: the 'value' (the information to pass)
Element-wise product: gate × value

The model learns which values get through.
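The gating idea can be seen on a toy example. In this sketch the gate projection is chosen (purely for illustration) so that one channel is almost fully open and the other almost fully closed; `W_gate`, `W_value`, and the helper names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def glu(x, W_gate, W_value):
    """GLU(x) = sigmoid(x W_gate) * (x W_value), element-wise."""
    gate = [sigmoid(g) for g in matvec(x, W_gate)]
    value = matvec(x, W_value)
    return [g * v for g, v in zip(gate, value)]

x = [1.0, -2.0]
W_gate = [[10.0, -10.0],   # gate logits become +10 and -10
          [0.0, 0.0]]
W_value = [[3.0, 3.0],     # both value channels equal 3.0
           [0.0, 0.0]]

out = glu(x, W_gate, W_value)
print(out)  # first channel passes almost unchanged, second is almost blocked
```

Both channels carry the same value, but the learned gate decides how much of each reaches the output.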

1.6 SwiGLU (Shazeer 2020)#

The combination of SiLU and GLU:

SwiGLU(x) = SiLU(x W_g) ⊙ (x W_u)
          = ((x W_g) ⊙ σ(x W_g)) ⊙ (x W_u)

SiLU replaces the sigmoid in GLU. The gate stays smooth but is no longer confined to (0, 1).
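A small pure-Python sketch of the SwiGLU gating step, written under the same toy setup one might use for GLU. The weights and helper names are illustrative assumptions; the only change from a sigmoid-gated GLU is that the gate goes through SiLU.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu(x, W_gate, W_up):
    """SwiGLU(x) = SiLU(x W_gate) * (x W_up), element-wise."""
    gate = [silu(g) for g in matvec(x, W_gate)]
    up = matvec(x, W_up)
    return [g * u for g, u in zip(gate, up)]

x = [1.0, -2.0]
W_gate = [[2.0, -2.0],   # gate pre-activations become +2 and -2
          [0.0, 0.0]]
W_up = [[3.0, 3.0],      # both "up" channels equal 3.0
        [0.0, 0.0]]

out = swiglu(x, W_gate, W_up)
print(out)
```

Here the first gate value exceeds 1 and the second is slightly negative, so the gate can amplify or flip the sign of a channel, something a sigmoid gate cannot do.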

1.7 Empirical: Shazeer 2020#

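T5 base model, 9 activation variants. Perplexity improvement over the ReLU baseline (approximate; higher is better):

ReLU: baseline
GeLU: +0.3 PPL
Swish: +0.4 PPL
GLU: +0.8 PPL
ReGLU: +1.0 PPL
GeGLU: +1.3 PPL
SwiGLU: +1.5 PPL (best)

The gated variants lead, with SwiGLU and GeGLU close together at the top.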

7-9. Llama-3 FFN Architecture#

7.1 Llama-3 FFN formula#

FFN(x) = (SiLU(x W_gate) ⊙ (x W_up)) W_down

3 linear matrices:

W_gate: d_model → d_ff
W_up: d_model → d_ff
W_down: d_ff → d_model

7.2 d_ff dimension choice#

Vaswani 2017: d_ff = 4 × d_model (heuristic).

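With SwiGLU there are 3 matrices, so d_ff is reduced to keep the total parameter count equal. The Llama family starts from:

d_ff = 8/3 × d_model ≈ 2.67 × d_model

For d_model=4096 this gives ≈10923, which Llama-2-7B rounds up to a multiple of 256: d_ff=11008. Llama-3-8B keeps d_model=4096 but widens the FFN further, to d_ff=14336 (3.5 × d_model).

Param count: 3 × 4096 × 11008 ≈ 135M per FFN block at d_ff=11008, and 3 × 4096 × 14336 ≈ 176M at Llama-3-8B's d_ff=14336.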
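The 11008 figure is not exactly 8/3 × 4096; it comes from rounding up to a hardware-friendly multiple. The sketch below reproduces that arithmetic. The function name is hypothetical, and the `multiple_of` and `ffn_dim_multiplier` values are assumptions based on the published Llama model configurations rather than anything stated in this lesson.

```python
def swiglu_hidden_dim(d_model, multiple_of, ffn_dim_multiplier=None):
    """Start from 2/3 of the classic 4 * d_model width, optionally scale it,
    then round up to the next multiple of `multiple_of`."""
    hidden = int(2 * (4 * d_model) / 3)            # 8/3 * d_model, truncated
    if ffn_dim_multiplier is not None:
        hidden = int(ffn_dim_multiplier * hidden)  # optional widening factor
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

# Assumed settings: multiple_of=256, no extra multiplier
print(swiglu_hidden_dim(4096, multiple_of=256))

# Assumed settings: multiple_of=1024, ffn_dim_multiplier=1.3
print(swiglu_hidden_dim(4096, multiple_of=1024, ffn_dim_multiplier=1.3))
```

The first call yields 11008, the plain 8/3 rule after rounding. The second yields 14336, the intermediate size listed in the published Llama-3-8B config, which shows how a width multiplier moves the result away from 8/3 × d_model.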

7.3 Parameter accounting#

3 matrix vs 2 matrix tradeoff:

ReLU/GeLU FFN: 2 × d_model × d_ff = 2 × 4096 × 16384 = 134M (d_ff=4×d_model)
SwiGLU FFN: 3 × d_model × d_ff = 3 × 4096 × 11008 = 135M (d_ff=8/3×d_model)

Equal parameter budget (within 1%)! Only the way the capacity is organized differs.
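The parameter comparison can be checked with a few lines of arithmetic. This is a minimal sketch with an illustrative helper name; it counts weight matrices only and ignores biases.

```python
def ffn_params(d_model, d_ff, n_matrices):
    """Weight count of an FFN made of n_matrices projections of size d_model x d_ff."""
    return n_matrices * d_model * d_ff

d_model = 4096
classic = ffn_params(d_model, 4 * d_model, n_matrices=2)   # ReLU/GeLU FFN
swiglu = ffn_params(d_model, 11008, n_matrices=3)          # SwiGLU FFN, rounded d_ff

print(f"classic: {classic:,}")
print(f"swiglu:  {swiglu:,}")
print(f"ratio:   {swiglu / classic:.4f}")
```

The small excess on the SwiGLU side comes from rounding d_ff up from about 10923 to 11008.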

7.4 Llama-3 production implementation#

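import torch.nn as nn
import torch.nn.functional as F

class LlamaFFN(nn.Module):
    """SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        # Three bias-free projections: two up to d_ff, one back to d_model
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        gate = F.silu(self.gate_proj(x))   # SiLU-activated gate, shape (..., d_ff)
        up = self.up_proj(x)               # linear "value" branch, shape (..., d_ff)
        return self.down_proj(gate * up)   # element-wise gating, then project to d_model

bias=False follows the Llama convention: no bias terms in the linear layers.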

7.5 SwiGLU memory footprint#

FFN intermediate activations (forward + backward):

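d_ff = 11008, batch=32, seq=2048: 32 × 2048 × 11008 × 2 bytes (bf16) ≈ 1.44 GB per layer for a single d_ff-wide tensor
32 layers: ≈ 46 GB of intermediates in total (FFN only)

Training keeps several such tensors per layer for the backward pass (gate, up, and their product), so the real figure is a multiple of this. This is memory-intensive: activation checkpointing or fused kernels (in the spirit of FlashAttention) are needed.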
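The memory arithmetic can be reproduced as follows. This is a rough sketch that counts one d_ff-wide tensor per layer in bf16; the helper name is illustrative, and real training runs hold more than one such tensor.

```python
def activation_bytes(batch, seq, width, bytes_per_elem=2):
    """Bytes needed for one activation tensor of shape (batch, seq, width)."""
    return batch * seq * width * bytes_per_elem

per_layer = activation_bytes(batch=32, seq=2048, width=11008)  # bf16 = 2 bytes
total = per_layer * 32                                         # 32 layers

print(f"per layer: {per_layer / 1e9:.2f} GB")
print(f"32 layers: {total / 1e9:.2f} GB")
```

Halving the batch size or the sequence length halves these numbers, which is why those two knobs are usually the first to be adjusted when the FFN activations do not fit.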

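✅ Lesson 10.2 Summary: SwiGLU

The SwiGLU activation combines SiLU (the smooth Swish function) with GLU (the gated linear unit). In Shazeer 2020 it was among the best of the 9 variants tested, with roughly a +1.5 PPL improvement over ReLU. It is standard in Llama-3 and Mistral, and GPT-4 is often assumed to use it, although its architecture has not been published. It uses 3 matrices (gate, up, down) versus 2 for a ReLU/GeLU FFN, so for equal parameters d_ff = 8/3 × d_model (versus 4 × d_model). FFN parameters at d_model=4096: about 135M per block with d_ff=11008 (the 8/3 rule), about 176M with Llama-3-8B's d_ff=14336. In production: bias=False and fused kernels. Lesson 10.3 moves on to residual connections and the Transformer block as a whole.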

Next Lesson: The Complete Transformer Block#

Lesson 10.3: Attention + FFN + RMSNorm + Pre-LN + Residual. It assembles all the parts of the modern Transformer block: the Llama-3 architecture diagram, the forward pass, and gradient flow.

Frequently Asked Questions

Marginal: a loss of roughly 1-2 perplexity points. Prefer SwiGLU in production, but GeLU is fine in legacy models. You cannot swap one for the other at fine-tuning time, since that is an architecture change.
