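Could a scaling factor other than sqrt(d_k) be used?

Vaswani et al. motivate sqrt(d_k) with a variance argument rather than a purely empirical sweep: if query and key components have zero mean and unit variance, their dot product has variance d_k, so dividing by sqrt(d_k) brings it back to 1. Alternatives such as log(d_k), d_k, or no scaling at all are conceivable, but sqrt(d_k) is the sweet spot: it prevents softmax saturation and preserves gradient flow. Modern models whose architectures are published (Llama-3, for example) still use sqrt(d_k).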

Can the attention weight matrix be interpreted?

Partly, yes. Row i shows which tokens the i-th token attends to. Turkish example: in the phrase 'İstanbul'un başkenti', the token 'başkenti' would typically give high attention to 'İstanbul'. But be careful: attention weights are not a causal explanation. The model does not use them to 'explain' anything; they are only part of the computation flow.

How much faster is FlashAttention than naive attention?

Roughly a 2-4x speedup on memory-bound workloads. The gain is more dramatic for long context (8K+), because it removes the seq² memory-bandwidth bottleneck of the naive implementation. On an H100 GPU, a Llama-3-8B forward pass is about 1.5x faster. Details in Lesson 8.4.

Why is a KV cache needed? Why isn't it enough to cache previous attention outputs?

The output is O = softmax(QK^T/sqrt(d_k)) V, and it is only needed for the current token. Caching the outputs of old tokens 'is not enough' because every step brings a new Q, and that new query has to be compared against every earlier key. K and V, on the other hand, are fixed for old tokens, so caching them makes sense. The new token only computes its own K[i+1] and V[i+1] and reuses the old K[0:i], V[0:i].

Scaled Dot-Product Attention: The Heart of Vaswani 2017, Line by Line. The Anatomy of the Query, Key, Value Trio

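Q: Why QK^T (transpose) and not QK?

Dimension compatibility for matrix multiplication. Q: [seq, d_k], K: [seq, d_k]. QK cannot be multiplied directly because the inner dimensions (d_k and seq) do not match. QK^T: [seq, d_k] @ [d_k, seq] = [seq, seq], one dot product per (query, key) pair. The transpose simply lines the vectors up so that each entry is the inner product of one query row with one key row.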

The cornerstone of the Transformer: the mathematical anatomy of scaled dot-product attention. Covers the Query/Key/Value trio, dot-product similarity, softmax normalization, the justification for sqrt(d_k) scaling, the causal mask (autoregressive), and the interpretation of attention weights, plus a PyTorch implementation, FLOP analysis, numerical stability concerns, and attention-pattern visualization with Turkish examples.

Şükrü Yusuf KAYA

75-minute read

13.05.2026

Advanced


💎 The heart of the Transformer: 7 lines of code, infinite depth

At the core of the 2017 paper 'Attention Is All You Need' by Vaswani et al. is a single formula that fits in about seven lines of code: `softmax(QK^T / sqrt(d_k)) V`. This formula underlies every LLM, from GPT-4, with a reported training cost above 100 million dollars, to the 671-billion-parameter DeepSeek-V3. Q (Query), K (Key), V (Value): three matrices and one scale factor. It is the mechanism by which each token decides 'what should I pay attention to right now'. After 75 minutes you will have a deep grasp of the mathematical anatomy of the Q/K/V trio, why we divide by sqrt(d_k), why the causal mask is essential for autoregressive generation, and how to interpret attention weights on Turkish sentences. This is the heart of the Transformer.

Lesson Map (13 Sections)#

Why attention: the limits of RNN/LSTM + the motivation of Vaswani 2017
Q, K, V intuition: the library analogy
Dot-product similarity: the linear algebra of vector similarity
Softmax normalization: why, how, temperature
sqrt(d_k) scaling: the mathematical justification as a variance correction
Causal mask: preventing information leakage in autoregressive models
Padding mask: variable-length sequences
PyTorch implementation: line by line
FLOP + memory analysis: quadratic complexity
Numerical stability: fp16/bf16 attention overflow
Attention weights interpretation: Turkish example
Edge cases: empty input, single token, max sequence length
Modern alternatives preview: a bridge to FlashAttention

1. Why Attention: The Limits of RNN/LSTM#

1.1 The historical role of RNN/LSTM#

2014-2017: the dominant architecture in NLP. Sequence-to-sequence (encoder-decoder): Bahdanau 2014, Sutskever 2014.

RNN: h_t = f(h_{t-1}, x_t)

Each step depends on the previous hidden state. Sequential.

1.2 Problem 1: Long-range dependency#

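Carrying information over long ranges is hard. h_100 depends on h_99, which depends on h_98, and so on back to h_1, so a signal from the first token has to survive 99 recurrent updates. Along that chain the information is lost to vanishing/exploding gradients.

LSTM/GRU mitigate this with gating mechanisms, but that is not a complete solution.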

1.3 Problem 2: Sequential computation#

Computing h_t requires waiting for h_{t-1} → no parallelization across time steps. The GPU is used inefficiently.

1.4 Problem 3: Information bottleneck#

In encoder-decoder models the entire sequence is squeezed into a single vector (the context vector). Information is lost on long-sentence tasks such as translation.

1.5 Bahdanau 2014: attention mechanism#

The first attention: let the decoder use a weighted sum of the encoder hidden states at every step.

context_t = Σ α_{t,i} × h_i^enc
α_{t,i} = softmax(score(h_t^dec, h_i^enc))

The decoder learns which encoder position to 'pay attention' to. Translation quality improved dramatically.

1.6 Vaswani 2017: 'Attention Is All You Need'#

The revolution: drop RNN/LSTM entirely and use only attention.

x_1, x_2, ..., x_n → attention is computed for all tokens in parallel

Key insights:

Parallelization: the whole sequence is processed at once
Direct long-range: token 1 ↔ token 1000 can attend to each other directly
No information bottleneck: the full sequence is available at every layer

Result: the big model trained in 3.5 days on 8 P100 GPUs, a fraction of the training cost of earlier state-of-the-art models, and set a new BLEU record on WMT 2014 English-to-German.

1.7 The Transformer's impact#

2018: BERT (encoder-only transformer)
2018: GPT (decoder-only transformer)
2019-2026: GPT-2, GPT-3, GPT-4, Llama, Mistral, Claude: all transformers
100B+ parameter models: made practical largely by the parallelism of attention

Attention is the Archimedean point of modern AI.

2. The Q, K, V Trio: The Library Analogy#

2.1 The most effective intuition: searching in a library#

Scenario: you are looking for information about 'the capital of Türkiye'.

Query (Q): your query. 'The capital of Türkiye?'
Key (K): the label of each book. ['country', 'capital', 'history', 'mathematics', ...]
Value (V): the content of each book. ['Türkiye 1923 republic', 'Ankara capital', 'Ottoman history', 'Fibonacci sequence', ...]

Procedure:

Compare Q with every K (similarity score). High score → relevant book.
Normalize the scores (softmax) to form a probability distribution.
Take the weighted sum of the V's using the normalized scores. Result: relevance-weighted information.
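The three steps can be traced with made-up numbers. The 2-dimensional vectors below are invented purely for illustration (real keys and values are learned and much higher-dimensional); the sketch only shows compare, normalize, and weighted sum.

```python
import math

# Hypothetical 2-d "label" vectors (keys) and "content" vectors (values)
# for four books. All numbers are invented for illustration.
keys = {
    "country":     [1.0, 0.2],
    "capital":     [0.9, 1.0],
    "history":     [0.3, 0.1],
    "mathematics": [-1.0, -0.5],
}
values = {
    "country":     [1.0, 0.0],
    "capital":     [0.0, 1.0],
    "history":     [0.5, 0.5],
    "mathematics": [-1.0, -1.0],
}
query = [1.0, 1.0]  # hypothetical query vector for "capital of Türkiye?"

# Step 1: similarity score = dot product of the query with each key
scores = {name: sum(q * k for q, k in zip(query, key)) for name, key in keys.items()}

# Step 2: softmax turns the scores into a probability distribution
max_score = max(scores.values())
exps = {name: math.exp(s - max_score) for name, s in scores.items()}
total = sum(exps.values())
weights = {name: e / total for name, e in exps.items()}

# Step 3: relevance-weighted sum of the values
output = [sum(weights[n] * values[n][d] for n in values) for d in range(2)]

best = max(weights, key=weights.get)
print(best, {n: round(w, 3) for n, w in weights.items()})
print([round(x, 3) for x in output])
```

With these toy numbers the 'capital' book receives the largest weight, so the output leans toward its content without ignoring the others entirely.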

2.2 Carrying it over to the Transformer#

Every token is a 'searcher' (query). At the same time it is a 'book to be found' (key + value).

In matrix form:

Q: [seq_len, d_k], the query vector of each token in the sequence
K: [seq_len, d_k], the key vector of each token
V: [seq_len, d_v], the value vector of each token

2.3 d_k vs d_v#

Typically d_k = d_v = d_model / n_heads.

Llama-3-8B: d_model = 4096, n_heads = 32 → d_k = d_v = 128.

2.4 Where Q, K, V come from#

Linear projection from input x:

Q = x @ W_Q    # W_Q: [d_model, d_k]
K = x @ W_K    # W_K: [d_model, d_k]
V = x @ W_V    # W_V: [d_model, d_v]

W_Q, W_K, W_V are learned parameters. Three different roles are created from the same input x through three different linear projections.
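As a shape check, the projections can be written out with random weights. This is a minimal sketch: the sizes mirror the Llama-3-8B numbers quoted above, `seq_len = 5` is an arbitrary choice, and the weight matrices are random stand-ins for learned parameters of a single head.

```python
import torch

torch.manual_seed(0)
d_model, n_heads = 4096, 32
d_k = d_v = d_model // n_heads          # 128 per head
seq_len = 5                             # arbitrary toy length

x = torch.randn(seq_len, d_model)       # one embedding per token

# Random stand-ins for the learned projection matrices of one head
W_Q = torch.randn(d_model, d_k) / d_model ** 0.5
W_K = torch.randn(d_model, d_k) / d_model ** 0.5
W_V = torch.randn(d_model, d_v) / d_model ** 0.5

Q, K, V = x @ W_Q, x @ W_K, x @ W_V     # same input, three different views
print(Q.shape, K.shape, V.shape)
```

Each of the three results has one row per token; only the projection matrix differs.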

2.5 Self-attention vs cross-attention#

Self-attention: Q, K, V all come from the same sequence. A token attends within its own sequence.
Cross-attention: Q comes from one sequence (decoder), K and V from another (encoder). Used in translation models.

GPT/Llama are decoder-only: self-attention only.
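The difference shows up in the shape of the weight matrix. Below is a minimal sketch with random tensors; the lengths `src_len` and `tgt_len` and the helper name `attend` are illustrative assumptions, and the learned Q/K/V projections are skipped to keep it short.

```python
import math
import torch
import torch.nn.functional as F

def attend(Q, K, V):
    """Plain scaled dot-product attention without masking."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

torch.manual_seed(0)
d_k, src_len, tgt_len = 16, 7, 4
enc = torch.randn(src_len, d_k)   # stand-in for encoder-side vectors
dec = torch.randn(tgt_len, d_k)   # stand-in for decoder-side vectors

# Self-attention: Q, K, V all come from the same sequence
_, w_self = attend(dec, dec, dec)             # weights: [tgt_len, tgt_len]

# Cross-attention: Q from the decoder, K and V from the encoder
out_cross, w_cross = attend(dec, enc, enc)    # weights: [tgt_len, src_len]

print(w_self.shape, w_cross.shape, out_cross.shape)
```

The cross-attention weights are rectangular: one row per decoder token, one column per encoder token.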

2.6 Why 3 separate projections#

'Same input from 3 angles':

W_Q learns: 'what is this token looking for'
W_K learns: 'what can this token offer' (as a label)
W_V learns: 'what is this token's content'

Different projections make it possible to learn different 'roles'. Three matrices are more flexible than a single one.

3-5. The Attention Formula, Line by Line#

3.1 The full formula#

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Step by step:

Step 1: scores = Q @ K^T        # [seq, seq]
Step 2: scaled = scores / sqrt(d_k)
Step 3: weights = softmax(scaled, dim=-1)  # row-wise softmax
Step 4: output = weights @ V    # [seq, d_v]

3.2 Step 1 in detail: QK^T#

Q shape: [seq, d_k]. K^T shape: [d_k, seq]. Result: a [seq, seq] matrix.

scores[i, j] = Q[i] · K[j] (dot product).

Meaning: the similarity of the i-th token's query with the j-th token's key.

3.3 Step 2 in detail: sqrt(d_k) scaling#

Why? Variance control.

If the components of Q and K are initialized in the standard way (independent, mean 0, variance 1):

Var(Q[i] · K[j]) = d_k (a sum of d_k independent products, each with variance 1)
Without scaling: scores have variance d_k, i.e. standard deviation sqrt(d_k)
Softmax saturation risk: large scores → softmax close to 0/1 → gradients vanish

Divide by sqrt(d_k): variance = 1, healthy gradient flow.
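The variance claim is easy to check empirically. The following is a small simulation sketch using only the standard library; `d_k = 64` and the sample count are arbitrary choices, and the components are drawn i.i.d. from N(0, 1) to match the assumption above.

```python
import random
import statistics

random.seed(0)
d_k = 64          # head dimension for the experiment (arbitrary choice)
n_samples = 5000  # number of random (query, key) pairs

def random_dot(d):
    """Dot product of two vectors with i.i.d. N(0, 1) components."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d))

raw = [random_dot(d_k) for _ in range(n_samples)]
scaled = [s / d_k ** 0.5 for s in raw]

var_raw = statistics.pvariance(raw)
var_scaled = statistics.pvariance(scaled)
print(f"unscaled variance ~ {var_raw:.1f} (theory: {d_k})")
print(f"scaled variance   ~ {var_scaled:.2f} (theory: 1)")
```

The measured values fluctuate from seed to seed but should sit close to d_k and 1 respectively.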

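3.4 Example with d_k = 128#

Without scaling: scores have standard deviation sqrt(128) ≈ 11.3, so a gap of about 10 between two scores is typical. A two-way softmax over a gap of 10 gives (0.99995, 0.00005), which is extreme. With scaling: the typical gap is about 1, and a two-way softmax over a gap of 1 gives (0.73, 0.27), so gradients still flow.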
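To see the saturation concretely, compare a two-way softmax before and after scaling. This is an illustrative sketch: the gap of 10 is an assumed 'typical' value for d_k = 128, and `p * (1 - p)` is the derivative of the winning weight with respect to its own logit.

```python
import math

def softmax(xs):
    m = max(xs)                                # shift for numerical safety
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 128
gap_unscaled = 10.0                            # assumed typical gap when std ~ 11.3
gap_scaled = gap_unscaled / math.sqrt(d_k)     # the same gap after scaling

p_unscaled = softmax([gap_unscaled, 0.0])
p_scaled = softmax([gap_scaled, 0.0])

# Local gradient of the larger weight with respect to its own logit
grad_unscaled = p_unscaled[0] * (1 - p_unscaled[0])
grad_scaled = p_scaled[0] * (1 - p_scaled[0])

print([round(p, 5) for p in p_unscaled], f"grad ~ {grad_unscaled:.1e}")
print([round(p, 3) for p in p_scaled], f"grad ~ {grad_scaled:.2f}")
```

The unscaled case puts almost all mass on one key and leaves a gradient several orders of magnitude smaller than the scaled case.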

3.5 Step 3 in detail: softmax#

softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

Row-wise (dim=-1): one softmax per query token. Each row sums to 1.

Result shape: [seq, seq]. weights[i, j] = the attention of token i on token j.
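Computed literally, exp(x_i) overflows for large scores, which is why implementations subtract the row maximum first. The sketch below shows the idea in plain Python floats; the function names and the deliberately huge logits are illustrative, and low-precision formats such as fp16 hit the same problem at much smaller values.

```python
import math

def naive_softmax(xs):
    exps = [math.exp(x) for x in xs]       # overflows for large x
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    m = max(xs)                            # softmax is shift-invariant
    exps = [math.exp(x - m) for x in xs]   # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

row = [1000.0, 999.0, 998.0]               # deliberately huge logits

try:
    naive_softmax(row)
    naive_ok = True
except OverflowError:
    naive_ok = False

weights = stable_softmax(row)
print(naive_ok, [round(w, 3) for w in weights])
```

Subtracting the maximum changes nothing mathematically, because the common factor cancels between numerator and denominator.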

3.6 Step 4 in detail: weights @ V#

weights shape: [seq, seq]. V shape: [seq, d_v]. Result: [seq, d_v].

For each i: output[i] = Σ_j weights[i, j] × V[j].

Weighted sum of V vectors, weights from query-key similarity.

python

import torch
import torch.nn.functional as F
import math
 
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K: shape [batch, seq, d_k]; V: shape [batch, seq, d_v]
    mask: optional bool tensor broadcastable to [batch, seq, seq]; True = blocked
    Returns: (output [batch, seq, d_v], weights [batch, seq, seq])
    """
    d_k = Q.size(-1)
    
    # Step 1: QK^T
    scores = Q @ K.transpose(-2, -1)   # [batch, seq, seq]
    
    # Step 2: scale
    scores = scores / math.sqrt(d_k)
    
    # Optional: causal mask
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))
    
    # Step 3: softmax
    weights = F.softmax(scores, dim=-1)
    
    # Step 4: weighted V
    output = weights @ V                # [batch, seq, d_v]
    return output, weights
 
 
# Test
batch, seq, d_k = 2, 10, 64
Q = torch.randn(batch, seq, d_k)
K = torch.randn(batch, seq, d_k)
V = torch.randn(batch, seq, d_k)
 
out, w = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {out.shape}")    # [2, 10, 64]
print(f"Weights shape: {w.shape}")     # [2, 10, 10]
print(f"Weights sum (row): {w[0, 0].sum().item():.4f}")  # ~1.0
 
# Causal mask
def causal_mask(seq_len):
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
 
mask = causal_mask(seq)
out_causal, w_causal = scaled_dot_product_attention(Q, K, V, mask=mask)
print(f"\nCausal weights[0, 5]: {w_causal[0, 5][:6]}")  # Only first 6 non-zero
print(f"Causal weights[0, 5][6:]: {w_causal[0, 5][6:]}")  # Zero after diagonal

Scaled dot-product attention — pure PyTorch implementation

6. Causal Mask: Vital for Autoregressive Models#

6.1 Why a causal mask#

Autoregressive generation: when predicting token i+1, the model must see only [1, 2, ..., i]. It must not be able to 'copy' from future tokens.

During training the whole sequence is processed in parallel, but the mask blocks the future.

6.2 The mask matrix#

Upper triangular (True above the diagonal):

For seq_len = 5:
mask = [[False True  True  True  True],
        [False False True  True  True],
        [False False False True  True],
        [False False False False True],
        [False False False False False]]

True = blocked. Position (i, j) is blocked when i < j, that is, when the key lies in the query's future.

6.3 -inf trick#

scaled_scores = scaled_scores.masked_fill(mask, float('-inf'))
weights = F.softmax(scaled_scores, dim=-1)

-inf → exp(-inf) = 0 → softmax output 0 for masked positions. Effectively zero attention to future.

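6.4 Why -inf and not 0#

Setting the score directly to 0 does not remove the position: exp(0) = 1, not 0, so the masked token still contributes to the softmax denominator and still receives non-zero attention weight. The future leaks through.

-inf → exp(-inf) = 0 → the term contributes 0 to the denominator, the masked weight is exactly 0, and the remaining weights still form a valid distribution.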
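A three-score toy row makes the difference visible. In this sketch the last position is treated as the 'future' token; the numbers are arbitrary.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 3.0]   # toy scores; pretend index 2 is a future token

# Wrong: overwrite the masked score with 0
masked_with_zero = softmax([scores[0], scores[1], 0.0])

# Right: overwrite the masked score with -inf
masked_with_neg_inf = softmax([scores[0], scores[1], float("-inf")])

print([round(w, 3) for w in masked_with_zero])
print([round(w, 3) for w in masked_with_neg_inf])
```

With 0 the future token keeps a small but real share of the attention; with -inf its weight is exactly zero and the other two weights renormalize between themselves.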

6.5 PyTorch optimized: scaled_dot_product_attention#

PyTorch 2.0+ native function:

out = F.scaled_dot_product_attention(
    Q, K, V,
    attn_mask=None,
    is_causal=True,    # auto-causal mask
    dropout_p=0.0,
)

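With is_causal=True the causal mask is applied internally, and PyTorch dispatches to a fused kernel (FlashAttention or a memory-efficient variant) when the inputs and hardware allow it, falling back to a plain math implementation otherwise.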
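A quick sanity check is to compare the built-in function against a manual causal implementation. This is a sketch on random CPU tensors with arbitrary toy sizes; it assumes PyTorch 2.0 or newer, and which backend actually runs depends on the installation.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq, d_k = 2, 4, 6, 16     # arbitrary toy sizes
Q = torch.randn(batch, heads, seq, d_k)
K = torch.randn(batch, heads, seq, d_k)
V = torch.randn(batch, heads, seq, d_k)

# Manual reference: scale, mask the future with -inf, softmax, weight V
future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
scores = scores.masked_fill(future, float("-inf"))
manual = F.softmax(scores, dim=-1) @ V

# Built-in version; the causal mask is generated internally
fused = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

max_diff = (manual - fused).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The two results should agree up to floating-point rounding; the built-in call simply avoids building the mask and the full weight matrix by hand.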

6.6 Bidirectional (BERT) vs unidirectional (GPT)#

BERT: no causal mask, bidirectional attention
GPT/Llama/Mistral: causal mask, unidirectional
T5: encoder bidirectional, decoder unidirectional

The presence or absence of the mask is a structural decision of the model.

6.7 Padding mask#

For variable-length sequences. If a batch has 5 sentences and max_len=20, some of them are padded:

padding_mask = (token_ids == PAD_ID)

Masked positions get -inf in attention. Padding becomes 'invisible'.

6.8 Causal + padding mask together#

final_mask = causal_mask | padding_mask.unsqueeze(1)

The OR of the two: block whatever either mask blocks.
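The broadcasting in that OR is worth seeing once on concrete shapes. The sketch below uses a made-up batch of two right-padded sequences; `PAD_ID = 0` and the token ids are assumptions for illustration.

```python
import torch

PAD_ID = 0                                   # assumed padding id for this sketch
token_ids = torch.tensor([
    [5, 8, 3, 9],                            # full-length sentence
    [7, 2, PAD_ID, PAD_ID],                  # two real tokens, then right-padding
])
batch, seq = token_ids.shape

padding_mask = token_ids == PAD_ID           # [batch, seq], True = padded key
causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)  # [seq, seq]

# Broadcast: [seq, seq] | [batch, 1, seq] -> [batch, seq, seq]
final_mask = causal | padding_mask.unsqueeze(1)

scores = torch.zeros(batch, seq, seq)        # uniform dummy scores
weights = torch.softmax(scores.masked_fill(final_mask, float("-inf")), dim=-1)
print(weights[1])
```

In the second sentence no query puts any weight on the two padded key positions. With left-padding, a row can end up fully masked and softmax then returns NaN for it, so such rows need separate handling.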

6.9 Sliding window mask (modern)#

Mistral, Longformer: each token attends only to a local window of W tokens; in the causal variant used by Mistral, that means the most recent W tokens before it.

For window_size W:
mask[i, j] = True if i - j > W or j > i

Memory savings for long context (100K+). Details in Module 12.
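The mask rule above translates directly into tensor code. A minimal sketch of the causal variant follows; the function name `sliding_window_mask` is hypothetical, and `True` means blocked, as elsewhere in this lesson.

```python
import torch

def sliding_window_mask(seq_len, window):
    """True = blocked: the future, and anything more than `window` steps back."""
    idx = torch.arange(seq_len)
    distance = idx[:, None] - idx[None, :]      # distance[i, j] = i - j
    return (distance > window) | (distance < 0)

mask = sliding_window_mask(seq_len=6, window=2)
print(mask.int())

allowed_per_row = (~mask).sum(dim=-1)           # visible keys per query
print(allowed_per_row)
```

With the rule `i - j > W`, each token sees itself plus at most W earlier tokens, so the number of visible keys per row stops growing once the window is full.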

9. FLOP + Memory Analysis#

9.1 Quadratic complexity#

The fundamental cost of attention: O(seq_len^2) memory + compute.

QK^T: [seq, d_k] × [d_k, seq] = O(seq² × d_k) FLOP.

seq=2048: 2048² × 128 ≈ 537M multiply-adds per head for QK^T alone (about 1.07 GFLOP if a multiply and an add are counted separately). seq=128000 (long context): 128000² × 128 ≈ 2.1T multiply-adds per head. Devastating.
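The arithmetic is simple enough to script. This sketch counts multiply-adds for one head's QK^T only (the `weights @ V` product costs the same again); the function name is illustrative.

```python
def qk_multiply_adds(seq_len, d_k):
    """Multiply-adds for one head's Q @ K^T: seq_len^2 dot products of length d_k."""
    return seq_len * seq_len * d_k

d_k = 128
for seq_len in (2048, 8192, 128_000):
    macs = qk_multiply_adds(seq_len, d_k)
    print(f"seq={seq_len:>7}: {macs / 1e9:10.2f} G multiply-adds per head "
          f"({2 * macs / 1e9:.2f} GFLOP)")
```

Doubling the sequence length quadruples the count, which is the quadratic growth in practice.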

9.2 Memory: attention matrix#

weights: [batch, n_heads, seq, seq]. In fp16, per layer:

batch=1, heads=32, seq=2048: 32 × 4M × 2 bytes = 256 MB
seq=8192: 4 GB!
seq=128K: about 1 TB (256x the 8K case, because memory grows with seq²). It does not fit on a single GPU.
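These sizes follow from multiplying out the tensor shape. A small sketch, taking '128K' as 131,072 tokens and reporting binary units (GiB); the function name and defaults are illustrative.

```python
def attention_matrix_bytes(seq_len, n_heads=32, batch=1, bytes_per_elem=2):
    """Size of the [batch, n_heads, seq, seq] weight tensor (fp16 = 2 bytes)."""
    return batch * n_heads * seq_len * seq_len * bytes_per_elem

GIB = 1024 ** 3
for seq_len in (2048, 8192, 131_072):
    size = attention_matrix_bytes(seq_len)
    print(f"seq={seq_len:>7}: {size / GIB:8.2f} GiB")
```

Going from 8K to 128K multiplies the sequence length by 16 and the matrix size by 256, which takes a single layer's weights from a few GiB to about a TiB.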

9.3 The FlashAttention solution#

Dao et al. 2022: never materialize the full attention matrix in GPU main memory (HBM). Online softmax + tile-based computation in fast on-chip SRAM.

Memory: O(seq) instead of O(seq²). Speed: 2-4x faster (because memory bandwidth is less of a bottleneck).
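The online-softmax idea can be shown for a single query row in plain Python. This is only a sketch of the numerical trick, with scalar values and hypothetical function names; the real FlashAttention applies it to tiles of Q, K, and V inside a fused GPU kernel.

```python
import math

def attention_row_reference(scores, values):
    """Standard softmax-weighted sum; needs the whole score row at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))

def attention_row_online(scores, values, tile=2):
    """Same result, but scores are consumed tile by tile with O(1) running state."""
    running_max = float("-inf")
    running_sum = 0.0      # softmax denominator, relative to running_max
    running_out = 0.0      # unnormalized weighted sum, relative to running_max
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        rescale = math.exp(running_max - new_max)   # shrink old state to the new max
        exps = [math.exp(s - new_max) for s in s_tile]
        running_sum = running_sum * rescale + sum(exps)
        running_out = running_out * rescale + sum(e * v for e, v in zip(exps, v_tile))
        running_max = new_max
    return running_out / running_sum

scores = [0.5, 2.0, -1.0, 3.5, 0.0]     # one query's scores against 5 keys (toy)
values = [1.0, 2.0, 3.0, 4.0, 5.0]      # scalar values for simplicity

ref = attention_row_reference(scores, values)
online = attention_row_online(scores, values, tile=2)
print(ref, online)
```

Because the running state is rescaled whenever a larger maximum appears, the full row of weights never has to exist at once, which is what removes the O(seq²) memory.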

Details in Lesson 8.4.

9.4 GPU memory bandwidth bottleneck#

A100 and H100 GPUs are FLOP-rich (312/989 TFLOPS dense BF16) but limited by memory bandwidth (about 2/3.35 TB/s).

Naive attention is memory-bound. FlashAttention moves it much closer to compute-bound, which is where the 2-4x speedup comes from.

9.5 KV cache (inference)#

Autoregressive generation: attention is recomputed at every step. Naive: for token i, recompute K and V for the whole sequence and redo the full Q@K^T.

Optimization: cache K and V across steps. The next token only computes its new K[i+1], V[i+1] and reuses the old ones.
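The equivalence between the cached and the naive path can be checked on a toy single-head example. Everything here is an assumption for illustration: random weights stand in for learned projections, the sizes are arbitrary, and `attend_last` is a hypothetical helper.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_k, steps = 8, 8, 5                       # arbitrary toy sizes
W_Q = torch.randn(d_model, d_k) / d_model ** 0.5    # random stand-ins for learned weights
W_K = torch.randn(d_model, d_k) / d_model ** 0.5
W_V = torch.randn(d_model, d_k) / d_model ** 0.5
x = torch.randn(steps, d_model)                     # one embedding per generated token

def attend_last(q, K, V):
    """Attention output for a single query row against all available keys."""
    w = F.softmax(q @ K.T / math.sqrt(K.size(-1)), dim=-1)
    return w @ V

K_cache = torch.empty(0, d_k)
V_cache = torch.empty(0, d_k)
for t in range(steps):
    x_t = x[t:t + 1]                                # only the newest token
    K_cache = torch.cat([K_cache, x_t @ W_K])       # append one new key row
    V_cache = torch.cat([V_cache, x_t @ W_V])       # append one new value row
    out_cached = attend_last(x_t @ W_Q, K_cache, V_cache)

    # Naive alternative: re-project every token seen so far at every step
    out_naive = attend_last(x_t @ W_Q, x[:t + 1] @ W_K, x[:t + 1] @ W_V)
    assert torch.allclose(out_cached, out_naive, atol=1e-4)

print(K_cache.shape, V_cache.shape)
```

Both paths give the same output at every step; the cached path projects one token per step instead of the whole prefix.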

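Memory cost: the KV cache grows with context. With Llama-3-8B dimensions at 128K context:

KV cache with 32 KV heads (plain multi-head attention): 2 × 32 layers × 32 heads × 128 × 128K × 2 bytes = 64 GiB (about 69 GB) per request. Llama-3-8B actually uses grouped-query attention with 8 KV heads, which cuts this to 16 GiB (about 17 GB).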
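The per-request size can be recomputed from the factors in that formula. In this sketch '128K' is taken as 131,072 tokens and results are in GiB; the 8-KV-head line reflects the grouped-query configuration Llama-3-8B is published with, and the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes for one request: K and V (factor 2) per layer, per KV head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

GIB = 1024 ** 3
seq_len = 128 * 1024                      # "128K" taken as 131,072 tokens

mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=seq_len)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)

print(f"32 KV heads: {mha / GIB:.0f} GiB")
print(f" 8 KV heads: {gqa / GIB:.0f} GiB")
```

The cache scales linearly with context length and with the number of KV heads, which is why sharing KV heads across query heads shrinks it so effectively.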

Multi-user serving needs further optimization (paged attention, vLLM).

✅ Lesson 8.1 Summary: Scaled Dot-Product Attention

The heart of the Vaswani 2017 Transformer: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V. The Q/K/V trio follows the library analogy: the query searches, the key is the label, the value is the content. Dot-product similarity → softmax normalization → weighted sum of V. The sqrt(d_k) scaling is a variance correction that prevents softmax saturation. The causal mask blocks future tokens in autoregressive generation via the -inf trick. Quadratic O(seq²) complexity is the fundamental limit; FlashAttention reduces the memory traffic and the KV cache avoids recomputation during inference. In PyTorch 2.0+, the native `F.scaled_dot_product_attention` is the optimized choice. In Lesson 8.2 we move on to multi-head attention: split a single attention layer into N parallel heads to capture different patterns.

Next Lesson: Multi-Head Attention#

Lesson 8.2: why we split a single attention into N parallel heads, what each head learns (syntactic, semantic, positional), concat + output projection, head pruning experiments, Llama-3 grouped-query attention (GQA), multi-query attention (MQA). Attention head visualization with Turkish examples.

Frequently Asked Questions