Modern LLM Embedding Layer + Embedding Tying: Input/Output Sharing and Scaling
Embedding layer in modern transformer architecture: nn.Embedding initialization (Llama-3 style), embedding tying (input/output sharing) — mathematical justification and memory savings, embedding scaling before pre-layernorm (sqrt(d_model) or not), no position addition before RoPE, multimodal embeddings (vision + audio tokens). Architectural differences between Llama-3, GPT-4o, Claude-3.
Şükrü Yusuf KAYA
70 min read
Advanced 🧬 Modern LLM embedding: the mathematics of the 'two-headed' layer
In a modern transformer the embedding layer is two-headed: at the input it maps token ID → vector, at the output it maps vector → token logits. Historically these were two separate matrices (E_input and E_output). In 2016, Press & Wolf and Inan et al. independently discovered that the two matrices can be shared ("weight tying"). The result: parameter savings, better generalization, 3-5% better perplexity. Since GPT-2, most modern models have used it (with subtle exceptions: GPT-3, and the larger Llama-3 checkpoints, keep separate matrices). Seventy minutes from now you will have a grasp of the full anatomy of a modern LLM embedding layer, from initialization, tying, and scaling to the pre-RoPE pipeline and multimodal extensions, through the lens of Llama-3 and GPT-4o.
Lesson Map (12 Sections)
- Where the embedding sits in a modern transformer: the input layer
- Llama-3 embedding initialization: the exact specification
- Embedding tying: the mathematical justification
- Tied vs untied: an empirical comparison
- The GPT-3 anomaly: why untied
- Embedding scaling: sqrt(d_model) or nothing
- Pre-LayerNorm + embedding: the modern architecture
- Position embedding: old (sinusoidal) vs RoPE (modern)
- GPT-4o multimodal embedding: vision + audio tokens
- Llama-3 vs GPT-4o vs Claude-3: embedding architecture differences
- Embedding finetuning: domain adaptation in practice
- Edge cases: out-of-vocab tokens, frozen embeddings
1. Where the Embedding Sits in a Modern Transformer
1.1 High-level architecture (Llama-3)
```
Input: token_ids [batch, seq]
  ↓ [1] Embedding lookup → hidden [batch, seq, d_model]
  ↓ [2] (position info injected via RoPE inside attention; nothing is added here)
  ↓ [3] Transformer blocks ×N (attention + FFN + RMSNorm)
  ↓ [4] Final RMSNorm → hidden
  ↓ [5] LM head (output projection) → logits [batch, seq, vocab_size]
  ↓ [6] Softmax → token probabilities
```
1.2 [1] Embedding lookup (current focus)
```python
emb = nn.Embedding(vocab_size, d_model)
hidden = emb(token_ids)
```
Dimensions:
- Llama-3 8B: vocab=128K, d_model=4096
- Llama-3 70B: vocab=128K, d_model=8192
- GPT-4o (estimated, not publicly disclosed): vocab≈200K, d_model≈12288
1.3 [5] LM head (output projection)
```python
logits = hidden @ output_weight.T   # [batch, seq, vocab_size]
```
output_weight has shape [vocab_size, d_model], the same shape as the input embedding matrix.
1.4 Crucial observation
The input embedding [V, d] and the output projection [V, d] have the same shape. This is the fundamental observation behind embedding tying.
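A tiny shape check of that observation (toy sizes and made-up variable names, not any real model's dimensions):

```python
import torch
import torch.nn as nn

V, d = 128, 16                           # toy vocab_size and d_model
emb = nn.Embedding(V, d)                 # input: token ID -> vector
lm_head = nn.Linear(d, V, bias=False)    # output: vector -> logits

print(emb.weight.shape)        # torch.Size([128, 16])
print(lm_head.weight.shape)    # torch.Size([128, 16])  <- same [V, d] shape

ids = torch.randint(0, V, (2, 5))        # [batch=2, seq=5]
hidden = emb(ids)                        # [2, 5, 16]
logits = hidden @ lm_head.weight.T       # [2, 5, 128]
print(logits.shape)
```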
1.5 Why we speak of a 'single' embedding layer
At first glance the embedding seems to live only at the input. But the LM head is really a transposed output embedding; with tying, a single matrix genuinely serves both roles.
1.6 Pre-LayerNorm: the modern pattern
Modern LLMs (Llama-3, GPT-4) use pre-LN: RMSNorm is applied at the entry of each block, so the embedding output is not normalized directly; the first operation inside the first transformer block is an RMSNorm.
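A minimal sketch of that ordering, with a from-scratch RMSNorm; `attn` and `ffn` stand for any shape-preserving sublayers and are assumptions here, not Llama-3's actual modules:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # normalize by the root-mean-square of the features, then rescale
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class PreLNBlock(nn.Module):
    """Pre-LN: normalize before each sublayer, add the residual after it."""
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.ffn_norm = RMSNorm(d_model)
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))   # norm first, then attention
        x = x + self.ffn(self.ffn_norm(x))     # norm first, then FFN
        return x

# The embedding output enters the first block un-normalized;
# the block's own RMSNorm is the first thing applied to it.
```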
3. Embedding Tying: the Mathematical Justification
3.1 Tied vs untied
Untied (older GPT variants, GPT-3):
```python
self.input_emb = nn.Embedding(V, d)             # E_in
self.output_emb = nn.Linear(d, V, bias=False)   # E_out
# E_in ≠ E_out (separate parameters)
```
Tied (modern):
```python
self.emb = nn.Embedding(V, d)
# Output: hidden @ self.emb.weight.T
# The same matrix serves both the input lookup and the output projection
```
3.2 The Press & Wolf 2016 paper
'Using the Output Embedding to Improve Language Models'.
Intuition: the input embedding and the output projection play the same semantic role:
- Input: 'what meaning does this token carry?' (ID → vector)
- Output: 'which token does this semantic vector correspond to?' (vector → ID)
Mathematically, both are vocabulary↔vector mappings. One lookup table suffices.
3.3 The Inan, Khosravi & Socher 2016 paper
'Tying Word Vectors and Word Classifiers'.
The same discovery, made independently. Their Bayesian justification: tied weights correspond to a MAP estimate under a prior that couples the input and output embeddings, i.e. a form of regularization.
3.4 Parameter savings
Llama-3 8B: V = 128K, d = 4096.
- Untied: 524M (input) + 524M (output) = 1.05B
- Tied: 524M
Savings: about 524M parameters, roughly 6.5% of the 8B total.
For the 70B model: about 1.05B parameters saved, roughly 1.5% of the 70B total.
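The arithmetic spelled out (using Llama-3's exact 128,256-token vocabulary; the rounded 128K above gives essentially the same numbers):

```python
# Parameter savings from tying, at Llama-3 sizes
vocab = 128_256

for name, d_model, total in [("Llama-3-8B", 4096, 8e9), ("Llama-3-70B", 8192, 70e9)]:
    one_matrix = vocab * d_model      # parameters in a single [V, d] matrix
    untied = 2 * one_matrix           # separate input + output matrices
    tied = one_matrix                 # one shared matrix
    saved = untied - tied
    print(f"{name}: untied {untied/1e9:.2f}B, tied {tied/1e9:.2f}B, "
          f"saved {saved/1e6:.0f}M ({saved/total:.1%} of total)")
```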
3.5 Effect on perplexity
Press & Wolf 2016: with an equal parameter budget, tied models reach roughly 3-5% better perplexity than untied ones.
Why:
- Fewer parameters → less overfitting
- Implicit regularization (input and output share one consistent vocabulary representation)
- Better generalization to rare words
3.6 Implementation in PyTorch
```python
import torch.nn as nn

class LLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # LM head: TIED
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embedding.weight   # SHARED

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        # ... transformer layers ...
        logits = self.lm_head(x)
        return logits
```
The critical line is `self.lm_head.weight = self.embedding.weight`: both attributes now reference the same tensor. In the backward pass, gradients from the input lookup and from the output projection accumulate into this single parameter.
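A quick standalone check (toy sizes) that the tie is real: the two modules share one storage, and a backward pass through both uses accumulates into a single gradient tensor.

```python
import torch
import torch.nn as nn

V, d = 50, 8
emb = nn.Embedding(V, d)
lm_head = nn.Linear(d, V, bias=False)
lm_head.weight = emb.weight                    # tie: same Parameter object

print(lm_head.weight.data_ptr() == emb.weight.data_ptr())   # True: same storage

ids = torch.tensor([[1, 2, 3]])
loss = lm_head(emb(ids)).sum()                 # forward through both uses of the matrix
loss.backward()
print(emb.weight.grad is lm_head.weight.grad)  # True: one gradient tensor for both paths
```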
3.7 Llama-3 implementation
Llama-3 source code (transformers library):
```python
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
if config.tie_word_embeddings:
    self.lm_head.weight = self.embed_tokens.weight
```
The flag exists for every Llama checkpoint, but note that the released Llama-3-8B and Llama-3-70B configs actually ship with `config.tie_word_embeddings = False` (untied); it is the smaller variants such as Llama-3.2-1B/3B that set it to True.
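You can check the flag straight from the published configs (assumes the transformers library, network access, and, for the gated meta-llama repos, an accepted license and a logged-in Hugging Face token):

```python
from transformers import AutoConfig

for repo in ["meta-llama/Meta-Llama-3-8B", "meta-llama/Llama-3.2-1B"]:
    cfg = AutoConfig.from_pretrained(repo)
    print(repo, "->", cfg.tie_word_embeddings)
```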
3.8 The GPT-3 anomaly
The GPT-3 paper did not tie weights. Why?
- At 175B scale the embedding matrix (≈50K vocab × 12288 dims ≈ 617M params) is a negligible fraction of the total, well under 1%
- Two independent matrices give slightly more flexibility
- Empirically, the gain from tying is marginal at that scale
Tying nevertheless remains the usual recommendation where the embedding is a meaningful fraction of the model (small and mid-size models); at the very largest scales the choice matters much less.
6. Embedding Scaling: sqrt(d_model) or Nothing
6.1 Original Transformer (Vaswani 2017) scaling
Original paper: 'We multiply those weights by sqrt(d_model)'.
```python
x = embedding(token_ids) * math.sqrt(d_model)
```
The reason: the freshly initialized embedding entries are small (std ≈ 1/sqrt(d_model) in the reference implementation), so without scaling the additive sinusoidal position encoding would dominate; multiplying by sqrt(d_model) brings the token and position signals to comparable magnitudes.
6.2 Modern models skip the scaling
Llama-3, GPT-3 and later, Mistral: NO embedding scaling:
```python
x = embedding(token_ids)   # NO scaling
```
Why it changed:
- Modern init uses a small std (0.02), so there is no longer anything to rescale against
- Pre-LN architecture: the first RMSNorm regularizes activation magnitudes anyway
- RoPE positional encoding: nothing is added to the embedding, so no extra scaling is needed
6.3 The scaling math
If the embedding entries are ~ N(0, σ²), then
E[||embedding||²] = d_model × σ²
so the expected vector norm is about σ × sqrt(d_model).
- Vaswani 2017: with small-entry init (std around 1/sqrt(d_model) in the reference code), the norm is about 1; multiplying by sqrt(d_model) lifts it to roughly sqrt(d_model), the same order as the sinusoidal position encoding.
- Modern (σ = 0.02, d_model = 4096): norm ≈ 0.02 × 64 = 1.28 with no scaling at all; already a reasonable magnitude.
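A quick numeric check of those magnitudes by simulation (random vectors only, no model involved):

```python
import torch

d_model = 4096
modern = torch.randn(1_000, d_model) * 0.02                 # modern init, std = 0.02
print(modern.norm(dim=-1).mean())                           # ≈ 0.02 * sqrt(4096) ≈ 1.3

xavier_like = torch.randn(1_000, d_model) / d_model**0.5    # std ≈ 1/sqrt(d_model)
scaled = xavier_like * d_model**0.5                         # Vaswani-style sqrt(d_model) scaling
print(xavier_like.norm(dim=-1).mean())                      # ≈ 1, small next to a pos. encoding norm of ~sqrt(d/2) ≈ 45
print(scaled.norm(dim=-1).mean())                           # ≈ 64, i.e. ~sqrt(d_model): comparable scale
```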
6.4 ALiBi, RoPE: why scaling is no longer required
ALiBi (Press et al., 2021): position information enters as an attention bias. No additive position embedding.
RoPE (Su et al., 2021): rotary position embedding applied inside attention. Again, nothing additive.
Both remove the additive position term from the embedding path, so there is no token-vs-position magnitude mismatch left for scaling to fix.
6.5 The scaling hazard with tied embeddings
If the input path is scaled (multiplied by sqrt(d_model)) while the output projection is not, the shared matrix receives gradients at two different scales from its two uses. Modern systems therefore either skip scaling entirely or apply it symmetrically.
6.6 In practice: which choice
- Original Transformer (Vaswani 2017): scaled
- GPT-2, GPT-3: no sqrt(d_model) scaling; learned absolute position embeddings are simply added to the token embeddings
- Llama-1, Llama-2, Llama-3: NOT scaled (RMSNorm + RoPE handle it)
- Mistral, Mixtral: NOT scaled
- GPT-4 (undisclosed, presumably): NOT scaled (modern best practice)
Modern best practice: skip scaling, rely on init (0.02) + pre-LN + RoPE.
8. Position Embedding: Old vs New
8.1 Original Transformer (Vaswani 2017): sinusoidal position encoding
Position information is added to the token embedding (additive):
```python
pos_enc = sinusoidal_positions(seq_len, d_model)
x = embedding(token_ids) + pos_enc
```
Formula:
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Deterministic, not learned. In principle it generalizes to sequences longer than those seen in training.
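One possible implementation of the sinusoidal_positions helper used above (a minimal sketch; assumes an even d_model):

```python
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # [seq, 1]
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dims: 0, 2, 4, ...
    angle = pos / (10000 ** (i / d_model))                          # [seq, d_model/2]
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# usage: x = embedding(token_ids) + sinusoidal_positions(seq_len, d_model)
```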
8.2 GPT-2 / BERT: learned absolute position embedding
```python
self.pos_emb = nn.Embedding(max_seq_len, d_model)   # learnable
x = self.tok_emb(token_ids) + self.pos_emb(position_ids)
```
Learned. The maximum sequence length is capped by the positions seen during training.
8.3 Modern: RoPE (Su 2021)
Position information is NOT added to the embedding; it is injected by rotating queries and keys inside the attention computation.
```python
# In the attention layer (not in the embedding!)
q_rot = apply_rope(q, position_ids)
k_rot = apply_rope(k, position_ids)
attn_logits = q_rot @ k_rot.T
```
RoPE injects position information as a rotation of the q and k vectors. The embedding layer itself is position-independent.
8.4 Llama-3 implementation
```python
class LlamaModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList(
            [LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)]
        )
        # NO pos_emb attribute!
```
Llama-3 has no pos_emb attribute. Position information is injected in attention.
8.5 RoPE in brief
Group the query and key vectors into consecutive pairs, then apply a position-dependent rotation to each pair (treated as a 2D vector):
[q_0, q_1] → [q_0 cos(mθ) - q_1 sin(mθ), q_0 sin(mθ) + q_1 cos(mθ)]
Here m is the position index and θ is the pair-specific frequency (θ_i = 10000^(-2i/d) for the i-th pair).
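A minimal sketch of that rotation for one head's query or key matrix, pairing adjacent dimensions (real implementations, e.g. Llama's, lay the pairs out differently and cache the cos/sin tables, but the math is the same):

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: [seq, d_head] (d_head even), positions: [seq]. Rotates each adjacent pair by m*theta_i."""
    seq, d = x.shape
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # [d/2] per-pair frequencies
    angles = positions.float().unsqueeze(1) * theta                     # [seq, d/2]: m * theta_i
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                     # the (q_2i, q_2i+1) pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```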
Details come in Module 9 (Position Encoding).
8.6 Important: the embedding layer is now position-agnostic
In modern LLMs the embedding layer does nothing but map token ID → vector; it carries no position information. This clean separation of concerns is a cornerstone of the modern architecture.
9. GPT-4o Multimodal Embedding: Vision + Audio Tokens
9.1 Text token embedding (classic)
token_id (text) → vector
9.2 Image patch embedding
GPT-4o image input (the generic vision-token pattern; exact details are not public):
- A 224 × 224 image is cut into 14 × 14 pixel patches, giving (224/14)² = 256 patches
- Each patch passes through a linear projection and becomes an 'image token' vector
- The image tokens are appended to the text token sequence
- The same transformer processes the mixed sequence
A patch-extraction sketch follows the list.
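A sketch of the patch-extraction step with a single linear projection (a generic vision-tokenizer pattern under the sizes listed above; GPT-4o's actual vision encoder is not public, and production systems typically use a full ViT-style encoder instead of one Linear):

```python
import torch
import torch.nn as nn

d_model, patch = 4096, 14
proj = nn.Linear(patch * patch * 3, d_model)        # flattened RGB patch -> "image token"

image = torch.rand(1, 3, 224, 224)                  # [batch, channels, H, W]
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)       # [1, 3, 16, 16, 14, 14]
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 16 * 16, -1)   # [1, 256, 588]
image_tokens = proj(patches)                        # [1, 256, 4096]: 256 tokens in text space
```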
9.3 Audio token embedding
GPT-4o audio:
- The audio waveform is converted to a MEL spectrogram
- The spectrogram is split into chunks (e.g., 25 ms each)
- Each chunk becomes an audio embedding
- The audio tokens are interleaved into the text sequence
A chunk-and-project sketch follows the list.
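The same pattern sketched for audio, with made-up sizes (128 mel bins and 4 spectrogram frames per ~25 ms chunk are illustrative assumptions; GPT-4o's real audio front end is not public):

```python
import torch
import torch.nn as nn

d_model, n_mels, frames_per_chunk = 4096, 128, 4
proj = nn.Linear(n_mels * frames_per_chunk, d_model)        # one chunk -> one "audio token"

mel = torch.rand(1, 1600, n_mels)                            # [batch, time_frames, n_mels]
chunks = mel.reshape(1, 1600 // frames_per_chunk, frames_per_chunk * n_mels)   # [1, 400, 512]
audio_tokens = proj(chunks)                                  # [1, 400, 4096]
```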
9.4 Unified embedding space
Key insight: text, image, and audio embeddings live in the same d_model-dimensional space, which is what makes cross-modal vector arithmetic possible:
embedding(image of cat) ≈ embedding(token "cat")
This is the magic of multimodal LLMs.
9.5 Implementation pattern
```python
import torch
import torch.nn as nn

class MultimodalLLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.text_embedding = nn.Embedding(config.vocab_size, config.d_model)
        self.image_projection = nn.Linear(config.image_patch_dim, config.d_model)
        self.audio_projection = nn.Linear(config.audio_feature_dim, config.d_model)

    def forward(self, text_ids, image_patches, audio_features):
        text_emb = self.text_embedding(text_ids)
        image_emb = self.image_projection(image_patches)
        audio_emb = self.audio_projection(audio_features)
        # Concatenate (with special tokens)
        x = torch.cat([text_emb, image_emb, audio_emb], dim=1)
        # The transformer processes the mixed sequence
        return self.transformer(x)
```
9.6 Special tokens (multimodal)
GPT-4o-style reserved tokens in the vocabulary (names illustrative; the exact set is not public):
<|image_start|>, <|image_end|>
<|audio_start|>, <|audio_end|>
<|video_start|>, <|video_end|>
These tokens map to learnable vectors in the embedding layer and mark the modality boundaries.
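In Hugging Face terms, giving such boundary tokens their own learnable embedding rows looks roughly like this (gpt2 is only a small stand-in model; the token names follow the list above, and the new rows start out randomly initialized):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

new_tokens = ["<|image_start|>", "<|image_end|>", "<|audio_start|>", "<|audio_end|>"]
tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})
model.resize_token_embeddings(len(tokenizer))   # adds learnable rows for the new token IDs
```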
9.7 For Turkish
GPT-4o handles Turkish text, spoken Turkish commands (audio), and visual input (images) in one unified embedding space; the multimodal richness of Turkish benefits directly from this architecture.
✅ Lesson 7.4 Summary: Modern LLM Embedding
The modern LLM embedding layer = nn.Embedding(V, d_model), often combined with embedding tying (input/output sharing: a 3-5% perplexity gain and ~500M parameters saved at 8B scale). Llama-3 init: std = 0.02, with NO sqrt(d_model) scaling (RMSNorm + RoPE handle it). Position information is not added to the embedding; it enters as RoPE rotations inside attention. GPT-4o multimodal: text + image patches + audio chunks all land in one d_model space, with reserved tokens (<|image_start|> etc.) marking modality boundaries. In Lesson 7.5 we move on to embedding geometry: cosine similarity, isotropy, BERTology findings.
Next Lesson: Embedding Geometry
Lesson 7.5: cosine similarity vs Euclidean distance vs dot product, and when to use which. The concept of isotropy (vectors spread evenly across all directions). BERTology: the topology of the embedding space. A Turkish semantic-search demo.
Frequently Asked Questions
Should input and output embeddings always be tied?
Yes for most modern LLMs (a 3-5% perplexity gain plus memory savings). Exceptions: (1) the effect can be marginal on very small models; (2) don't tie for tasks with asymmetric vocabularies (e.g., translation with separate source and target vocabs); (3) at GPT-3's 175B scale the gain is empirically marginal, so untied is fine for very large models.