Modern LLM Embedding Layer + Embedding Tying: Input/Output Sharing and Scaling
Embedding layer in modern transformer architecture: nn.Embedding initialization (Llama-3 style), embedding tying (input/output sharing) — mathematical justification and memory savings, embedding scaling before pre-layernorm (sqrt(d_model) or not), no position addition before RoPE, multimodal embeddings (vision + audio tokens). Architectural differences between Llama-3, GPT-4o, Claude-3.
Şükrü Yusuf KAYA
70 min read
Advanced 🧬 Modern LLM embedding: the mathematics of the 'two-headed' layer
In a modern transformer the embedding layer is two-headed: at the input it maps token ID → vector, at the output it maps vector → token logits. Historically these were two separate matrices (E_input and E_output). In 2016, Press & Wolf and Inan et al. independently discovered that the two matrices can be shared ("weight tying"). The result: parameter savings, better generalization, 3-5% better perplexity. Since GPT-2, most modern models have used it (with subtle exceptions: GPT-3, and the larger Llama-3 checkpoints, keep separate matrices). Seventy minutes from now you will have a grasp of the full anatomy of a modern LLM embedding layer, from initialization, tying, and scaling to the pre-RoPE pipeline and multimodal extensions, through the lens of Llama-3 and GPT-4o.
Lesson Map (12 Sections)
- Where the embedding sits in a modern transformer: the input layer
- Llama-3 embedding initialization: the exact specification
- Embedding tying: the mathematical justification
- Tied vs untied: an empirical comparison
- The GPT-3 anomaly: why untied
- Embedding scaling: sqrt(d_model) or nothing
- Pre-LayerNorm + embedding: the modern architecture
- Position embedding: old (sinusoidal) vs RoPE (modern)
- GPT-4o multimodal embedding: vision + audio tokens
- Llama-3 vs GPT-4o vs Claude-3: embedding architecture differences
- Embedding finetuning: domain adaptation in practice
- Edge cases: out-of-vocab tokens, frozen embeddings
1. Where the Embedding Sits in a Modern Transformer
1.1 High-level architecture (Llama-3)
```
Input: token_ids [batch, seq]
  ↓ [1] Embedding lookup → hidden [batch, seq, d_model]
  ↓ [2] (position info injected via RoPE inside attention; nothing is added here)
  ↓ [3] Transformer blocks ×N (attention + FFN + RMSNorm)
  ↓ [4] Final RMSNorm → hidden
  ↓ [5] LM head (output projection) → logits [batch, seq, vocab_size]
  ↓ [6] Softmax → token probabilities
```
1.2 [1] Embedding lookup (current focus)
```python
emb = nn.Embedding(vocab_size, d_model)
hidden = emb(token_ids)
```
Dimensions:
- Llama-3 8B: vocab=128K, d_model=4096
- Llama-3 70B: vocab=128K, d_model=8192
- GPT-4o (estimated, not publicly disclosed): vocab≈200K, d_model≈12288
1.3 [5] LM head (output projection)
```python
logits = hidden @ output_weight.T   # [batch, seq, vocab_size]
```
output_weight has shape [vocab_size, d_model], the same shape as the input embedding matrix.
1.4 Crucial observation
The input embedding [V, d] and the output projection [V, d] have the same shape. This is the fundamental observation behind embedding tying.
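A tiny shape check of that observation (toy sizes and made-up variable names, not any real model's dimensions):

```python
import torch
import torch.nn as nn

V, d = 128, 16                           # toy vocab_size and d_model
emb = nn.Embedding(V, d)                 # input: token ID -> vector
lm_head = nn.Linear(d, V, bias=False)    # output: vector -> logits

print(emb.weight.shape)        # torch.Size([128, 16])
print(lm_head.weight.shape)    # torch.Size([128, 16])  <- same [V, d] shape

ids = torch.randint(0, V, (2, 5))        # [batch=2, seq=5]
hidden = emb(ids)                        # [2, 5, 16]
logits = hidden @ lm_head.weight.T       # [2, 5, 128]
print(logits.shape)
```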
1.5 Why we speak of a 'single' embedding layer
At first glance the embedding seems to live only at the input. But the LM head is really a transposed output embedding; with tying, a single matrix genuinely serves both roles.
1.6 Pre-LayerNorm: the modern pattern
Modern LLMs (Llama-3, GPT-4) use pre-LN: RMSNorm is applied at the entry of each block, so the embedding output is not normalized directly; the first operation inside the first transformer block is an RMSNorm.
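A minimal sketch of that ordering, with a from-scratch RMSNorm; `attn` and `ffn` stand for any shape-preserving sublayers and are assumptions here, not Llama-3's actual modules:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # normalize by the root-mean-square of the features, then rescale
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class PreLNBlock(nn.Module):
    """Pre-LN: normalize before each sublayer, add the residual after it."""
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.ffn_norm = RMSNorm(d_model)
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))   # norm first, then attention
        x = x + self.ffn(self.ffn_norm(x))     # norm first, then FFN
        return x

# The embedding output enters the first block un-normalized;
# the block's own RMSNorm is the first thing applied to it.
```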
3. Embedding Tying: the Mathematical Justification
3.1 Tied vs untied
Untied (older GPT variants, GPT-3):
```python
self.input_emb = nn.Embedding(V, d)             # E_in
self.output_emb = nn.Linear(d, V, bias=False)   # E_out
# E_in ≠ E_out (separate parameters)
```
Tied (modern):
```python
self.emb = nn.Embedding(V, d)
# Output: hidden @ self.emb.weight.T
# The same matrix serves both the input lookup and the output projection
```
3.2 The Press & Wolf 2016 paper
'Using the Output Embedding to Improve Language Models'.
Intuition: the input embedding and the output projection play the same semantic role:
- Input: 'what meaning does this token carry?' (ID → vector)
- Output: 'which token does this semantic vector correspond to?' (vector → ID)
Mathematically, both are vocabulary↔vector mappings. One lookup table suffices.
3.3 The Inan, Khosravi & Socher 2016 paper
'Tying Word Vectors and Word Classifiers'.
The same discovery, made independently. Their Bayesian justification: tied weights correspond to a MAP estimate under a prior that couples the input and output embeddings, i.e. a form of regularization.
3.4 Parameter savings
Llama-3 8B: V = 128K, d = 4096.
- Untied: 524M (input) + 524M (output) = 1.05B
- Tied: 524M
Savings: about 524M parameters, roughly 6.5% of the 8B total.
For the 70B model: about 1.05B parameters saved, roughly 1.5% of the 70B total.
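The arithmetic spelled out (using Llama-3's exact 128,256-token vocabulary; the rounded 128K above gives essentially the same numbers):

```python
# Parameter savings from tying, at Llama-3 sizes
vocab = 128_256

for name, d_model, total in [("Llama-3-8B", 4096, 8e9), ("Llama-3-70B", 8192, 70e9)]:
    one_matrix = vocab * d_model      # parameters in a single [V, d] matrix
    untied = 2 * one_matrix           # separate input + output matrices
    tied = one_matrix                 # one shared matrix
    saved = untied - tied
    print(f"{name}: untied {untied/1e9:.2f}B, tied {tied/1e9:.2f}B, "
          f"saved {saved/1e6:.0f}M ({saved/total:.1%} of total)")
```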
3.5 Effect on perplexity
Press & Wolf 2016: with an equal parameter budget, tied models reach roughly 3-5% better perplexity than untied ones.
Why:
- Fewer parameters → less overfitting
- Implicit regularization (input and output share one consistent vocabulary representation)
- Better generalization to rare words
3.6 Implementation in PyTorch
```python
import torch.nn as nn

class LLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # LM head: TIED
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embedding.weight   # SHARED

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        # ... transformer layers ...
        logits = self.lm_head(x)
        return logits
```
The critical line is `self.lm_head.weight = self.embedding.weight`: both attributes now reference the same tensor. In the backward pass, gradients from the input lookup and from the output projection accumulate into this single parameter.
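A quick standalone check (toy sizes) that the tie is real: the two modules share one storage, and a backward pass through both uses accumulates into a single gradient tensor.

```python
import torch
import torch.nn as nn

V, d = 50, 8
emb = nn.Embedding(V, d)
lm_head = nn.Linear(d, V, bias=False)
lm_head.weight = emb.weight                    # tie: same Parameter object

print(lm_head.weight.data_ptr() == emb.weight.data_ptr())   # True: same storage

ids = torch.tensor([[1, 2, 3]])
loss = lm_head(emb(ids)).sum()                 # forward through both uses of the matrix
loss.backward()
print(emb.weight.grad is lm_head.weight.grad)  # True: one gradient tensor for both paths
```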
3.7 Llama-3 implementation
Llama-3 source code (transformers library):
```python
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
if config.tie_word_embeddings:
    self.lm_head.weight = self.embed_tokens.weight
```
The flag exists for every Llama checkpoint, but note that the released Llama-3-8B and Llama-3-70B configs actually ship with `config.tie_word_embeddings = False` (untied); it is the smaller variants such as Llama-3.2-1B/3B that set it to True.
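You can check the flag straight from the published configs (assumes the transformers library, network access, and, for the gated meta-llama repos, an accepted license and a logged-in Hugging Face token):

```python
from transformers import AutoConfig

for repo in ["meta-llama/Meta-Llama-3-8B", "meta-llama/Llama-3.2-1B"]:
    cfg = AutoConfig.from_pretrained(repo)
    print(repo, "->", cfg.tie_word_embeddings)
```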
3.8 The GPT-3 anomaly
The GPT-3 paper did not tie weights. Why?
- At 175B scale the embedding matrix (≈50K vocab × 12288 dims ≈ 617M params) is a negligible fraction of the total, well under 1%
- Two independent matrices give slightly more flexibility
- Empirically, the gain from tying is marginal at that scale
Tying nevertheless remains the usual recommendation where the embedding is a meaningful fraction of the model (small and mid-size models); at the very largest scales the choice matters much less.
6. Embedding Scaling: sqrt(d_model) or Nothing
6.1 Original Transformer (Vaswani 2017) scaling
Original paper: 'We multiply those weights by sqrt(d_model)'.
```python
x = embedding(token_ids) * math.sqrt(d_model)
```
The reason: the freshly initialized embedding entries are small (std ≈ 1/sqrt(d_model) in the reference implementation), so without scaling the additive sinusoidal position encoding would dominate; multiplying by sqrt(d_model) brings the token and position signals to comparable magnitudes.
6.2 Modern models skip the scaling
Llama-3, GPT-3 and later, Mistral: NO embedding scaling:
```python
x = embedding(token_ids)   # NO scaling
```
Why it changed:
- Modern init uses a small std (0.02), so there is no longer anything to rescale against
- Pre-LN architecture: the first RMSNorm regularizes activation magnitudes anyway
- RoPE positional encoding: nothing is added to the embedding, so no extra scaling is needed
6.3 The scaling math
If the embedding entries are ~ N(0, σ²), then
E[||embedding||²] = d_model × σ²
so the expected vector norm is about σ × sqrt(d_model).
- Vaswani 2017: with small-entry init (std around 1/sqrt(d_model) in the reference code), the norm is about 1; multiplying by sqrt(d_model) lifts it to roughly sqrt(d_model), the same order as the sinusoidal position encoding.
- Modern (σ = 0.02, d_model = 4096): norm ≈ 0.02 × 64 = 1.28 with no scaling at all; already a reasonable magnitude.
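A quick numeric check of those magnitudes by simulation (random vectors only, no model involved):

```python
import torch

d_model = 4096
modern = torch.randn(1_000, d_model) * 0.02                 # modern init, std = 0.02
print(modern.norm(dim=-1).mean())                           # ≈ 0.02 * sqrt(4096) ≈ 1.3

xavier_like = torch.randn(1_000, d_model) / d_model**0.5    # std ≈ 1/sqrt(d_model)
scaled = xavier_like * d_model**0.5                         # Vaswani-style sqrt(d_model) scaling
print(xavier_like.norm(dim=-1).mean())                      # ≈ 1, small next to a pos. encoding norm of ~sqrt(d/2) ≈ 45
print(scaled.norm(dim=-1).mean())                           # ≈ 64, i.e. ~sqrt(d_model): comparable scale
```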
6.4 ALiBi, RoPE: why scaling is no longer required
ALiBi (Press et al., 2021): position information enters as an attention bias. No additive position embedding.
RoPE (Su et al., 2021): rotary position embedding applied inside attention. Again, nothing additive.
Both remove the additive position term from the embedding path, so there is no token-vs-position magnitude mismatch left for scaling to fix.
6.5 The scaling hazard with tied embeddings
If the input path is scaled (multiplied by sqrt(d_model)) while the output projection is not, the shared matrix receives gradients at two different scales from its two uses. Modern systems therefore either skip scaling entirely or apply it symmetrically.
6.6 In practice: which choice
- Original Transformer (Vaswani 2017): scaled
- GPT-2, GPT-3: no sqrt(d_model) scaling; learned absolute position embeddings are simply added to the token embeddings
- Llama-1, Llama-2, Llama-3: NOT scaled (RMSNorm + RoPE handle it)
- Mistral, Mixtral: NOT scaled
- GPT-4 (undisclosed, presumably): NOT scaled (modern best practice)
Modern best practice: skip scaling, rely on init (0.02) + pre-LN + RoPE.
8. Position Embedding: Old vs New
8.1 Original Transformer (Vaswani 2017): sinusoidal position encoding
Position information is added to the token embedding (additive):
```python
pos_enc = sinusoidal_positions(seq_len, d_model)
x = embedding(token_ids) + pos_enc
```
Formula:
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Deterministic, not learned. In principle it generalizes to sequences longer than those seen in training.
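One possible implementation of the sinusoidal_positions helper used above (a minimal sketch; assumes an even d_model):

```python
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # [seq, 1]
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dims: 0, 2, 4, ...
    angle = pos / (10000 ** (i / d_model))                          # [seq, d_model/2]
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# usage: x = embedding(token_ids) + sinusoidal_positions(seq_len, d_model)
```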
8.2 GPT-2 / BERT: learned absolute position embedding
```python
self.pos_emb = nn.Embedding(max_seq_len, d_model)   # learnable
x = self.tok_emb(token_ids) + self.pos_emb(position_ids)
```
Learned. The maximum sequence length is capped by the positions seen during training.
8.3 Modern: RoPE (Su 2021)
Position information is NOT added to the embedding; it is injected by rotating queries and keys inside the attention computation.
```python
# In the attention layer (not in the embedding!)
q_rot = apply_rope(q, position_ids)
k_rot = apply_rope(k, position_ids)
attn_logits = q_rot @ k_rot.T
```
RoPE injects position information as a rotation of the q and k vectors. The embedding layer itself is position-independent.
8.4 Llama-3 implementation
```python
class LlamaModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList(
            [LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)]
        )
        # NO pos_emb attribute!
```
Llama-3 has no pos_emb attribute. Position information is injected in attention.
8.5 RoPE in brief
Group the query and key vectors into consecutive pairs, then apply a position-dependent rotation to each pair (treated as a 2D vector):
[q_0, q_1] → [q_0 cos(mθ) - q_1 sin(mθ), q_0 sin(mθ) + q_1 cos(mθ)]
Here m is the position index and θ is the pair-specific frequency (θ_i = 10000^(-2i/d) for the i-th pair).
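A minimal sketch of that rotation for one head's query or key matrix, pairing adjacent dimensions (real implementations, e.g. Llama's, lay the pairs out differently and cache the cos/sin tables, but the math is the same):

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: [seq, d_head] (d_head even), positions: [seq]. Rotates each adjacent pair by m*theta_i."""
    seq, d = x.shape
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # [d/2] per-pair frequencies
    angles = positions.float().unsqueeze(1) * theta                     # [seq, d/2]: m * theta_i
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                     # the (q_2i, q_2i+1) pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```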
Details come in Module 9 (Position Encoding).
8.6 Important: the embedding layer is now position-agnostic
In modern LLMs the embedding layer does nothing but map token ID → vector; it carries no position information. This clean separation of concerns is a cornerstone of the modern architecture.
9. GPT-4o Multimodal Embedding: Vision + Audio Tokens
9.1 Text token embedding (classic)
token_id (text) → vector
9.2 Image patch embedding
GPT-4o image input (the generic vision-token pattern; exact details are not public):
- A 224 × 224 image is cut into 14 × 14 pixel patches, giving (224/14)² = 256 patches
- Each patch passes through a linear projection and becomes an 'image token' vector
- The image tokens are appended to the text token sequence
- The same transformer processes the mixed sequence
A patch-extraction sketch follows the list.
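A sketch of the patch-extraction step with a single linear projection (a generic vision-tokenizer pattern under the sizes listed above; GPT-4o's actual vision encoder is not public, and production systems typically use a full ViT-style encoder instead of one Linear):

```python
import torch
import torch.nn as nn

d_model, patch = 4096, 14
proj = nn.Linear(patch * patch * 3, d_model)        # flattened RGB patch -> "image token"

image = torch.rand(1, 3, 224, 224)                  # [batch, channels, H, W]
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)       # [1, 3, 16, 16, 14, 14]
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 16 * 16, -1)   # [1, 256, 588]
image_tokens = proj(patches)                        # [1, 256, 4096]: 256 tokens in text space
```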
9.3 Audio token embedding
GPT-4o audio:
- The audio waveform is converted to a MEL spectrogram
- The spectrogram is split into chunks (e.g., 25 ms each)
- Each chunk becomes an audio embedding
- The audio tokens are interleaved into the text sequence
A chunk-and-project sketch follows the list.
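The same pattern sketched for audio, with made-up sizes (128 mel bins and 4 spectrogram frames per ~25 ms chunk are illustrative assumptions; GPT-4o's real audio front end is not public):

```python
import torch
import torch.nn as nn

d_model, n_mels, frames_per_chunk = 4096, 128, 4
proj = nn.Linear(n_mels * frames_per_chunk, d_model)        # one chunk -> one "audio token"

mel = torch.rand(1, 1600, n_mels)                            # [batch, time_frames, n_mels]
chunks = mel.reshape(1, 1600 // frames_per_chunk, frames_per_chunk * n_mels)   # [1, 400, 512]
audio_tokens = proj(chunks)                                  # [1, 400, 4096]
```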
9.4 Unified embedding space
Key insight: text, image, and audio embeddings live in the same d_model-dimensional space, which is what makes cross-modal vector arithmetic possible:
embedding(image of cat) ≈ embedding(token "cat")
This is the magic of multimodal LLMs.
9.5 Implementation pattern
```python
import torch
import torch.nn as nn

class MultimodalLLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.text_embedding = nn.Embedding(config.vocab_size, config.d_model)
        self.image_projection = nn.Linear(config.image_patch_dim, config.d_model)
        self.audio_projection = nn.Linear(config.audio_feature_dim, config.d_model)

    def forward(self, text_ids, image_patches, audio_features):
        text_emb = self.text_embedding(text_ids)
        image_emb = self.image_projection(image_patches)
        audio_emb = self.audio_projection(audio_features)
        # Concatenate (with special tokens)
        x = torch.cat([text_emb, image_emb, audio_emb], dim=1)
        # The transformer processes the mixed sequence
        return self.transformer(x)
```
9.6 Special tokens (multimodal)
GPT-4o-style reserved tokens in the vocabulary (names illustrative; the exact set is not public):
<|image_start|>, <|image_end|>
<|audio_start|>, <|audio_end|>
<|video_start|>, <|video_end|>
These tokens map to learnable vectors in the embedding layer and mark the modality boundaries.
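In Hugging Face terms, giving such boundary tokens their own learnable embedding rows looks roughly like this (gpt2 is only a small stand-in model; the token names follow the list above, and the new rows start out randomly initialized):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

new_tokens = ["<|image_start|>", "<|image_end|>", "<|audio_start|>", "<|audio_end|>"]
tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})
model.resize_token_embeddings(len(tokenizer))   # adds learnable rows for the new token IDs
```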
9.7 For Turkish
GPT-4o handles Turkish text, spoken Turkish commands (audio), and visual input (images) in one unified embedding space; the multimodal richness of Turkish benefits directly from this architecture.
✅ Lesson 7.4 Summary: Modern LLM Embedding
The modern LLM embedding layer = nn.Embedding(V, d_model), often combined with embedding tying (input/output sharing: a 3-5% perplexity gain and ~500M parameters saved at 8B scale). Llama-3 init: std = 0.02, with NO sqrt(d_model) scaling (RMSNorm + RoPE handle it). Position information is not added to the embedding; it enters as RoPE rotations inside attention. GPT-4o multimodal: text + image patches + audio chunks all land in one d_model space, with reserved tokens (<|image_start|> etc.) marking modality boundaries. In Lesson 7.5 we move on to embedding geometry: cosine similarity, isotropy, BERTology findings.
Next Lesson: Embedding Geometry
Lesson 7.5: cosine similarity vs Euclidean distance vs dot product, and when to use which. The concept of isotropy (vectors spread evenly across all directions). BERTology: the topology of the embedding space. A Turkish semantic-search demo.
Frequently Asked Questions
Should input and output embeddings always be tied?
Yes for most modern LLMs (a 3-5% perplexity gain plus memory savings). Exceptions: (1) the effect can be marginal on very small models; (2) don't tie for tasks with asymmetric vocabularies (e.g., translation with separate source and target vocabs); (3) at GPT-3's 175B scale the gain is empirically marginal, so untied is fine for very large models.