Difference between embedding and encoding?

Encoding is a discrete representation (tokenization, one-hot); embedding is a continuous, dense representation in a vector space. The tokenizer encodes text into IDs, and the embedding layer maps those IDs to vectors. The two are applied in sequence.

When to fine-tune embedding layer in production?

Mostly no. The embedding layer is one of the most stable parts of a pre-trained model and can usually be frozen during fine-tuning. Exceptions: (1) the target domain is very different (medical, legal: specialized terms); (2) new tokens are added to the vocabulary; (3) multilingual transfer, where a new language is added.

Can I visualize embedding vectors?

Yes. Use t-SNE or UMAP to project from 4096 dimensions down to 2D; TensorBoard's Embedding Projector and Weights & Biases both offer built-in views. A well-trained Turkish embedding shows professions clustering together, cities clustering together, and numbers forming a separate cluster.

Relationship between embedding and RAG (vector search)?

RAG uses separate embedding models (text-embedding-3-small, sentence-transformers). Different from LLM's input embedding layer. RAG embedding: sentence/paragraph-level, for semantic search. LLM embedding: token-level, model's internal representation. Different purposes.

What's the ideal d_model for Turkish?

It depends on the total parameter count of the model. For a 7B model, d_model = 4096 is standard. There is no Turkish-specific advantage: d_model is language-independent. What matters for Turkish is a Turkish-aware tokenizer (vocabulary).

What is Embedding? Bridge from Token ID to Meaning Vector — The Discrete-to-Continuous Revolution

Mathematical anatomy of embedding: integer token ID to d-dimensional dense vector mapping. Vocab × d_model matrix. Degenerate case of one-hot encoding. Why semantic vector space works (distributional hypothesis, Firth 1957). 'Meaning emerges from co-occurrence' philosophy. Pre-NN era (LSA, LSI) vs neural era (word2vec → BERT → LLM). Practical meaning for Turkish.

Şükrü Yusuf KAYA

65 min read

5/13/2026

Intermediate

🧭 From Discrete to Continuous: the LLM's deepest conceptual leap

What does token ID 4523 mean? On its own, nothing. The word 'merhaba' (hello) should be close to 'selam' (hi) and far from 'kuantum' (quantum), but how would you express that closeness between the integer IDs 4523 and 8912? The answer is the embedding layer. It maps each token ID to a d-dimensional vector: 4523 → [0.12, -0.87, 0.34, ..., 0.05] (4096 dimensions). These vectors live in a semantic space: the vectors for 'merhaba' and 'selam' are close, 'kuantum' is far away. This is the modern application of the distributional hypothesis (Firth 1957): 'You shall know a word by the company it keeps'. Over the next 65 minutes you will learn the mathematical anatomy of embeddings, their historical development (LSA → word2vec → BERT → LLM), the modern `nn.Embedding` implementation, and what all of this means in practice for Turkish. This is the transformer's meaning layer.

Lesson Map (12 Sections)

The problem: why token IDs are not enough and why we need vectors
One-hot encoding: the naive solution and why it is degenerate
Embedding matrix: anatomy of the `V × d_model` lookup table
`nn.Embedding` PyTorch implementation, line by line
Distributional hypothesis: Firth 1957 → modern LLMs
Semantic vector space: what it learns and how it behaves
History: LSA → word2vec → contextual (BERT) → LLM
Dimensionality: how d_model is chosen (4096, 8192, 12288)
Initialization: training from scratch, normal vs uniform vs scaled
Embedding ↔ output projection: why the layer is 'two-headed'
Turkish practice: a semantic search demo with embedding vectors
Edge cases: out-of-vocab, rare tokens, multilingual

1. The Problem: Why Token IDs Are Not Enough

1.1 Scenario

The tokenizer has mapped 'merhaba' (hello) → 4523, 'selam' (hi) → 8912, 'kuantum' (quantum) → 22789.

The model we want to build should:

See 'merhaba' and 'selam' as semantically close
See 'kuantum' as far from both
Move in a meaningful direction in vector space when meanings combine, e.g. 'merhaba' + 'gün' (day) heading toward 'iyi gün' (good day)

But IDs are integers. There is no semantic relationship between 4523 and 8912. ID 4524 could be 'kahve' (coffee) and ID 4525 'kıvılcım' (spark): the assignment is arbitrary.

1.2 The first wrong idea: use the IDs as floats

# WRONG: do NOT do this
hidden = transformer(token_id)   # token_id is an integer
# The model cannot treat this integer as a continuous variable and do meaningful linear algebra on it

4523 and 4524 differ by one unit, yet 'merhaba' and 'kahve' are not close in meaning. The numerical value of an ID carries no meaning.

1.3 The birth of the embedding

Requirement: turn an integer ID into a continuous vector, and make that conversion learnable.

Solution:

E: integer (token_id) → vector of d floats

This function E is:

Deterministic: the same ID always yields the same vector
Learnable: the vector values are optimized during pre-training
Differentiable: gradients must be able to flow into the vectors
Efficient: O(1) lookup

1.4 The solution: a lookup table (embedding matrix)

E = V × d_model matrix (parameter)

Vector(token_id) = E[token_id, :]    # take row token_id

The matrix has V (vocab size) rows and d_model (model dimension) columns. Each row is the embedding of one token.

For Llama-3-8B:

V = 128,000 (rounded; the actual vocabulary is 128,256)
d_model = 4096
E shape: 128000 × 4096
E parameter count: 128K × 4K ≈ 524M parameters

This is the model's input layer. All learning starts with adjusting where these vectors sit.

2. One-Hot Encoding: the Naive Solution and Why It Is Degenerate

2.1 Definition of one-hot

'merhaba' (ID 4523) → a V-dimensional vector with a 1 at position 4523 and 0 everywhere else:

[0, 0, 0, ..., 0, 1, 0, ..., 0]    # V dimensions, only position 4523 = 1

2.2 Linear algebra with one-hot

# One-hot input
x_onehot = torch.zeros(V)
x_onehot[token_id] = 1

# Linear projection: W @ x
W = torch.randn(d_model, V)
hidden = W @ x_onehot   # shape: d_model

The math:

hidden = W @ x_onehot = W[:, token_id]   # take column token_id

So multiplying a matrix by a one-hot vector is just a column lookup. Technically, one-hot multiplication is a degenerate (wasteful) form of lookup.

2.3 Embedding lookup = sparse matrix multiplication

An embedding lookup is a specially optimized one-hot multiplication:

# Naive (V × d_model matrix multiply)
hidden = W @ one_hot(token_id)   # O(V × d_model) FLOP

# Efficient (table lookup)
hidden = E[token_id]              # O(d_model) memory read

In Llama-3, V = 128K and d_model = 4096. The naive method costs about 524M multiply-adds per token, while the lookup reads 4096 values: roughly 128,000x less work.
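The equivalence between the two paths can be checked numerically. The following is a minimal sketch with tiny, made-up sizes; it stores the table as `E` with shape `[V, d_model]` (the transpose of `W` above), so the one-hot product selects a row rather than a column.

```python
import torch

torch.manual_seed(0)
V, d_model = 10, 4           # tiny sizes for illustration only
E = torch.randn(V, d_model)  # embedding table, one row per token
token_id = 7

# Naive path: build a one-hot vector and multiply (touches all V rows).
one_hot = torch.zeros(V)
one_hot[token_id] = 1.0
via_matmul = one_hot @ E     # shape: [d_model]

# Efficient path: read a single row.
via_lookup = E[token_id]     # shape: [d_model]

same = torch.allclose(via_matmul, via_lookup)

# Operation counts at the scale quoted in the text (V = 128,000, d_model = 4096).
naive_multiply_adds = 128_000 * 4096          # 524,288,000
lookup_reads = 4096
ratio = naive_multiply_adds // lookup_reads   # 128,000
```

Both paths return the same vector; only the amount of work differs.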

2.4 The real problem with one-hot: orthogonality

All one-hot vectors are orthogonal to each other (cosine sim = 0):

onehot("merhaba") · onehot("selam") = 0
onehot("merhaba") · onehot("kuantum") = 0

Every word is equally distant from every other word. There is no semantic information at all. This is the core problem the embedding has to solve.

2.5 How the embedding solves this

After the embedding matrix E has been trained (illustrative values):

cos(E["merhaba"], E["selam"]) ≈ 0.85    # high cosine sim
cos(E["merhaba"], E["kuantum"]) ≈ 0.10  # low cosine sim

The vectors are positioned during training according to co-occurrence patterns. Details are in Section 5 (distributional hypothesis).
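The contrast between orthogonal one-hot vectors and dense vectors can be made concrete with a few lines of plain Python. This is only a sketch: the 2-D dense vectors below are hand-picked for illustration and do not come from any trained model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# One-hot vectors over a 3-word toy vocabulary.
onehot = {"merhaba": [1, 0, 0], "selam": [0, 1, 0], "kuantum": [0, 0, 1]}

# Hypothetical dense vectors, hand-picked for illustration (not from a trained model).
dense = {"merhaba": [0.9, 0.1], "selam": [0.8, 0.3], "kuantum": [-0.1, 1.0]}

onehot_sim = cosine(onehot["merhaba"], onehot["selam"])   # always 0.0
close_sim = cosine(dense["merhaba"], dense["selam"])      # high
far_sim = cosine(dense["merhaba"], dense["kuantum"])      # near zero
```

With one-hot vectors every pair scores exactly zero, whereas dense vectors leave room for graded similarity.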

2.6 Where one-hot is still useful

Classification output: a one-hot label for the single correct class
The target in cross-entropy loss is one-hot
Pedagogical explanations (teaching material)

But modern neural networks do not use one-hot vectors as model input.

3. Embedding Matrix: the `V × d_model` Lookup Table

3.1 Matrix structure

E = | e_0       |   # token 0 embedding (V-dim one-hot → d_model-dim row)
    | e_1       |   # token 1
    | e_2       |
    | ...       |
    | e_{V-1}   |   # token V-1

shape: V × d_model
E[i] ∈ R^{d_model}

3.2 Lookup operation

def embedding_lookup(E, token_ids):
    # token_ids: shape [batch, seq_len]
    # E: shape [V, d_model]
    # returns: shape [batch, seq_len, d_model]
    return E[token_ids]

PyTorch native: `F.embedding(token_ids, E)` or the `nn.Embedding` module.

3.3 Parameter count

Llama-3-8B in detail:

V = 128,000
d_model = 4096
E params = 524M

GPT-4 (estimated, not publicly confirmed):

V = ~200K
d_model = ~12,288
E params = ~2.5B

The embedding layer is one of the model's largest parameter consumers. In some models it accounts for 10-20% of total parameters.

3.4 Memory footprint

Llama-3-8B embedding in fp16:

524M × 2 bytes ≈ 1 GB for the embedding alone

fp32 (training time): about 2 GB. bf16 (compromise): about 1 GB.
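The footprint figures follow directly from parameters × bytes per parameter. A small helper makes the arithmetic reproducible; the function name `embedding_footprint` is made up for this sketch, and 1 GB is taken as 1e9 bytes.

```python
def embedding_footprint(vocab_size, d_model, bytes_per_param):
    """Return (parameter count, size in GB using 1 GB = 1e9 bytes)."""
    params = vocab_size * d_model
    return params, params * bytes_per_param / 1e9

params, fp16_gb = embedding_footprint(128_000, 4096, 2)   # fp16 and bf16 both use 2 bytes
_, fp32_gb = embedding_footprint(128_000, 4096, 4)
print(f"{params:,} params, fp16/bf16: {fp16_gb:.2f} GB, fp32: {fp32_gb:.2f} GB")
# prints "524,288,000 params, fp16/bf16: 1.05 GB, fp32: 2.10 GB"
```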

3.5 Vocab vs model size trade-off

For an equal parameter budget:

Large vocab + shallow model: good at memorizing words, shallow reasoning
Small vocab + deep model: high fertility (Module 6.9), deeper reasoning

Rule of thumb: vocabulary size grows slowly (roughly logarithmically) with corpus size, V ∝ log(corpus_size). Llama-3 uses 128K for a 15T-token corpus.

3.6 Storage format

Embedding tensor in production:

Disk: bfloat16 (.safetensors)
RAM: bfloat16 (load)
Compute: bfloat16 forward, fp32 gradient backward

Quantization is possible (int8, 4-bit); for embeddings, a quality loss of about 1-3% is typically tolerable.
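To give a feel for what int8 quantization of an embedding row involves, here is a minimal sketch of symmetric per-row quantization in plain Python. It is one simple scheme among many, not the method of any particular library, and the row values are made up.

```python
def quantize_row_int8(row):
    """Symmetric per-row int8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

row = [0.12, -0.87, 0.34, 0.05]          # made-up embedding values
codes, scale = quantize_row_int8(row)
restored = dequantize_row(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Each value is stored as one byte plus a shared per-row scale, and the rounding error per entry is at most half a quantization step.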

import torch
import torch.nn as nn
import torch.nn.functional as F

# 1. Create an nn.Embedding manually
vocab_size = 128_000
d_model = 4096

embedding = nn.Embedding(vocab_size, d_model)
# Default initialization: normal(0, 1)

print(f"Embedding params: {embedding.weight.numel():,}")  # 524,288,000
print(f"Embedding shape: {embedding.weight.shape}")        # [128000, 4096]

# 2. Lookup
token_ids = torch.tensor([[4523, 8912, 22789]])  # batch=1, seq_len=3
vectors = embedding(token_ids)
print(f"Output shape: {vectors.shape}")             # [1, 3, 4096]

# 3. Inspect a single token vector
vec_merhaba = embedding.weight[4523]   # row 4523
print(f"Vec shape: {vec_merhaba.shape}")   # [4096]
print(f"First 10 dims: {vec_merhaba[:10]}")

# 4. Cosine similarity
vec_selam = embedding.weight[8912]
cos_sim = F.cosine_similarity(vec_merhaba.unsqueeze(0), vec_selam.unsqueeze(0))
print(f"Cosine sim (merhaba, selam): {cos_sim.item():.4f}")
# Untrained: ~0.0 (random init)
# Trained model: high similarity expected (e.g. ~0.85)

# 5. Custom initialization (alternatives: each call overwrites the previous one, so pick one)
nn.init.normal_(embedding.weight, mean=0.0, std=0.02)   # GPT-2 default
# or
nn.init.uniform_(embedding.weight, a=-0.1, b=0.1)         # uniform
# or
nn.init.xavier_uniform_(embedding.weight)                 # Xavier/Glorot
# or
nn.init.kaiming_normal_(embedding.weight, mode='fan_out')  # He init

nn.Embedding: production-grade lookup

4. `nn.Embedding` PyTorch Implementation

4.1 PyTorch source code (essence)

class Embedding(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, padding_idx=None,
                 max_norm=None, norm_type=2.0, scale_grad_by_freq=False,
                 sparse=False, _weight=None):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.padding_idx = padding_idx
        self.weight = nn.Parameter(torch.empty(num_embeddings, embedding_dim))
        self.reset_parameters()

    def forward(self, input):
        return F.embedding(input, self.weight, self.padding_idx, ...)

4.2 `F.embedding` internals (CUDA kernel)

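// Simplified illustrative kernel, not PyTorch's actual implementation.
// Launch with grid (B, L); threads stride over D so that D may exceed the block size.
__global__ void embedding_kernel(
    const long* input,      // [B, L]
    const float* weight,    // [V, D]
    float* output,          // [B, L, D]
    int B, int L, int D, int V
) {
    int batch_idx = blockIdx.x;
    int seq_idx = blockIdx.y;

    long token_id = input[batch_idx * L + seq_idx];
    for (int dim_idx = threadIdx.x; dim_idx < D; dim_idx += blockDim.x) {
        output[(batch_idx * L + seq_idx) * D + dim_idx] =
            weight[token_id * D + dim_idx];
    }
}

Ideal access pattern: coalesced memory reads (adjacent threads read adjacent dimensions).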

4.3 Key parameters

num_embeddings (V): vocab size
embedding_dim (d_model): vector dimension
padding_idx: the vector for this ID is excluded from learning (often 0)
max_norm: rows whose norm exceeds this value are renormalized to it
norm_type: which p-norm max_norm uses (default 2.0, i.e. L2)
scale_grad_by_freq: scales gradients by the inverse frequency of each token in the mini-batch, so tokens repeated within a batch do not dominate the update
sparse: makes the gradient of the weight a sparse tensor (memory-efficient for large vocabularies)

4.4 padding_idx behavior

emb = nn.Embedding(1000, 128, padding_idx=0)
# The vector for token ID 0 is initialized to zeros and stays zero (it receives no gradient)

In production this is used for batch padding. When padding variable-length sequences, the `[PAD]` token is often ID 0.
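The effect of `padding_idx` can be observed directly on a small module. The sketch below uses tiny, arbitrary sizes and a made-up padded batch; it checks that the padding row is zero and that it receives a zero gradient after a backward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(10, 4, padding_idx=0)

# A padded batch: the second sequence is shorter and padded with ID 0.
token_ids = torch.tensor([[5, 3, 7], [2, 0, 0]])
emb(token_ids).sum().backward()

pad_row_is_zero = bool((emb.weight[0] == 0).all())        # initialized to zeros
pad_grad_is_zero = bool((emb.weight.grad[0] == 0).all())  # no gradient for the pad row
used_grad_nonzero = bool((emb.weight.grad[5] != 0).all()) # real tokens do get gradients
```

Because the pad row never receives a gradient, an optimizer step leaves it at zero (assuming no weight decay or other update acts on it).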

4.5 max_norm constraint

emb = nn.Embedding(1000, 128, max_norm=1.0)
# Rows looked up in forward are renormalized so that their L2 norm is <= 1

Rare in production; mainly a remedy for unstable models. Modern LLMs leave max_norm at None.

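4.6 Sparse gradient

With a large vocabulary (V = 128K), a sparse gradient is memory-friendly in backprop:

emb = nn.Embedding(128_000, 4096, sparse=True)
optimizer = torch.optim.SparseAdam(emb.parameters())  # optimizer that supports sparse gradients

Only tokens seen in the batch receive a gradient, so gradient memory for the embedding can shrink by 95% or more when a batch touches only a small fraction of the vocabulary.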

4.7 Multi-GPU embedding (tensor parallel)

Sharding the Llama-3 embedding across four GPUs by vocabulary range:

GPU 0: E[0:32000, :]
GPU 1: E[32000:64000, :]
GPU 2: E[64000:96000, :]
GPU 3: E[96000:128000, :]

Each GPU looks up only the IDs it owns and outputs zeros for the rest; an all-reduce then sums the partial results. The Megatron-LM library handles this natively.
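The sharded lookup can be simulated without any GPUs. The sketch below is a pure-Python illustration of the idea (mask, local lookup, sum), not Megatron-LM's API; the helper name `sharded_lookup` and the tiny sizes are made up, and the element-wise sum stands in for the all-reduce.

```python
def sharded_lookup(shards, token_id, d_model):
    """Simulate a vocab-parallel lookup: each shard answers for its own row range."""
    partials = []
    for start, rows in shards:
        local = token_id - start
        if 0 <= local < len(rows):
            partials.append(rows[local])          # this shard owns the token
        else:
            partials.append([0.0] * d_model)      # masked out: contributes zeros
    # Element-wise sum across shards plays the role of the all-reduce.
    return [sum(vals) for vals in zip(*partials)]

V, d_model, n_shards = 8, 3, 4
full = [[float(10 * i + j) for j in range(d_model)] for i in range(V)]
rows_per_shard = V // n_shards
shards = [(s * rows_per_shard, full[s * rows_per_shard:(s + 1) * rows_per_shard])
          for s in range(n_shards)]

matches_full_table = all(sharded_lookup(shards, t, d_model) == full[t] for t in range(V))
```

Exactly one shard contributes a non-zero row per token, so the sum reproduces the unsharded lookup.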

4.8 Performance

Llama-3-8B embedding forward (batch=1, seq=2048):

FLOP: 0 (lookup, no multiply)
Memory bandwidth: 2048 × 4096 × 2 bytes = 16 MB read
Time: ~0.5 ms on an H100 (rough estimate)

This is very fast compared with the other transformer layers; it is not a bottleneck.

5. Distributional Hypothesis: from Firth 1957 to Modern LLMs

5.1 Firth (1957): 'You shall know a word by the company it keeps'

The British linguist J.R. Firth: the meaning of a word is defined by the words it co-occurs with.

What does 'kahve' (coffee) mean? It appears next to words like:

içmek, demlemek, fincan, sabah, kafein, kahvaltı, sıcak, türk, espresso, latte
(to drink, to brew, cup, morning, caffeine, breakfast, hot, Turkish, espresso, latte)

What does 'çay' (tea) mean? A similar co-occurrence profile:

içmek, demlemek, fincan, sabah, kafein, kahvaltı, sıcak, türk, demli, ince belli bardak
(to drink, to brew, cup, morning, caffeine, breakfast, hot, Turkish, well-steeped, tulip-shaped tea glass)

'Kahve' and 'çay' appear in similar contexts → similar meaning → similar embedding vectors.
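The overlap between the two context lists can be turned into a number. The sketch below treats each list as a binary context-indicator vector and computes cosine similarity; the 'kuantum' context set is hypothetical and added only for contrast.

```python
import math

kahve = {"içmek", "demlemek", "fincan", "sabah", "kafein",
         "kahvaltı", "sıcak", "türk", "espresso", "latte"}
cay = {"içmek", "demlemek", "fincan", "sabah", "kafein",
       "kahvaltı", "sıcak", "türk", "demli", "ince belli bardak"}
kuantum = {"fizik", "parçacık", "dalga", "enerji"}  # hypothetical context set

def context_cosine(a, b):
    """Cosine similarity between binary context-indicator vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b))

sim_kahve_cay = context_cosine(kahve, cay)          # 8 shared contexts out of 10 each
sim_kahve_kuantum = context_cosine(kahve, kuantum)  # no shared contexts
```

Eight of the ten contexts are shared, so 'kahve' and 'çay' score 0.8 while 'kahve' and 'kuantum' score 0. Learned embeddings capture the same signal, only from corpus statistics rather than hand-written lists.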

5.2 Harris (1954): 'distributional structure'

Zellig Harris gave the idea a formal, distributional footing, often paraphrased as: words that occur in similar contexts tend to have similar meanings. This is a cornerstone of modern NLP.

5.3 The 'magic' of modern embeddings

Word2vec, BERT, LLM embeddings: all are numerical applications of the distributional hypothesis:

1. Extract token co-occurrence statistics from the corpus
2. Assign a vector to each token
3. Pull the vectors of co-occurring tokens closer together
4. Push the vectors of non-co-occurring tokens apart

This process is pre-training. The result: semantic clustering in the vector space.

5.4 Word2Vec's objective function

Skip-gram: given a token, predict its context tokens.

P(w_{t-k}, ..., w_{t+k} | w_t) = ∏_{-k ≤ j ≤ k, j ≠ 0} P(w_{t+j} | w_t)

CBOW (Continuous Bag of Words): given the context, predict the center token.

P(w_t | w_{t-k}, ..., w_{t+k})
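The training examples behind the skip-gram objective are simply (center, context) pairs taken from a sliding window. A minimal sketch of the pair extraction, with a made-up four-token sentence and window size k = 1:

```python
def skipgram_pairs(tokens, k):
    """Return (center, context) pairs for a symmetric window of size k."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-k, k + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

tokens = ["sabah", "sıcak", "kahve", "içmek"]
pairs = skipgram_pairs(tokens, k=1)
# [('sabah', 'sıcak'), ('sıcak', 'sabah'), ('sıcak', 'kahve'),
#  ('kahve', 'sıcak'), ('kahve', 'içmek'), ('içmek', 'kahve')]
```

Skip-gram trains on each pair as "center predicts context"; CBOW groups the same window the other way around, using all context tokens together to predict the center.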

5.5 The distributional hypothesis lives on in LLMs

Llama-3 pre-training: next-token prediction.

P(w_t | w_1, ..., w_{t-1})

This is a conditional form of the distributional hypothesis: predicting co-occurrence given the context. The embedding layer learns its semantic vectors in the process.

5.6 How it shows up in Turkish

'mavi' (blue) and 'lacivert' (navy blue) appear in similar contexts in a Turkish corpus:

gömlek, ceket, deniz, gökyüzü, takım elbise, etiket, lacivert palto, mavi gömlek
(shirt, jacket, sea, sky, suit, label, navy coat, blue shirt)

In a well-trained Turkish embedding, cos(mavi, lacivert) is high (around 0.78 as an illustrative figure); at random init it is near zero (about 0.02).

5.7 Limitations of the distributional hypothesis

Polysemy (one word, several meanings): 'bank' (financial institution) vs 'bank' (river bank) are squeezed into a single vector
Antonyms (opposite meanings): 'sıcak' (hot) and 'soğuk' (cold) occur in similar contexts, so their vectors can also end up close (cosine often around 0.6-0.7)
Compositionality: 'açık deniz' (open sea) ≠ 'açık' (open) + 'deniz' (sea), i.e. not the sum of its parts

Solution: contextual embeddings (BERT, LLM), where each token's embedding depends on the context in which it is used.

7. History of Embeddings: from LSA to LLMs (about 70 years since Harris and Firth)

7.1 LSA (Latent Semantic Analysis, 1990)

Deerwester et al. 1990: 'Indexing by Latent Semantic Analysis'.

Method:

Build a document-term matrix (rows: docs, cols: words)
Normalize with TF-IDF
Take a rank-k approximation via SVD → each word becomes a k-dim vector

A ≈ U_k Σ_k V_k^T   # SVD truncation
word_vector(w) = V_k[w, :]   # k-dim

Limitations of LSA:

Static (not context-aware)
Linear (SVD = linear algebra)
Batch (the SVD must be recomputed when new documents arrive)
Cannot capture polysemy
7.2 word2vec (Mikolov 2013): the revolution

Mikolov, Chen, Corrado, Dean (Google): 'Efficient Estimation of Word Representations in Vector Space'.

Key insight: use a shallow neural network instead of SVD.

Two architectures:

Skip-gram: center → context
CBOW: context → center

Mathematical objective: a softmax over the vocabulary, made tractable with hierarchical softmax or negative sampling.

The magic of word2vec: vector arithmetic!

vec("king") - vec("man") + vec("woman") ≈ vec("queen")

For Turkish:

vec("Ankara") - vec("Türkiye") + vec("Almanya") ≈ vec("Berlin")

This is the analogical reasoning ability of embeddings.
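The geometry behind the analogy is easiest to see in a toy space. In the sketch below the 2-D vectors are hand-made (axis 0 stands for "royalty", axis 1 for "gender") rather than learned, and `analogy` is a hypothetical helper that returns the nearest word by cosine similarity, excluding the three input words.

```python
import math

# Toy 2-D vectors: axis 0 = "royalty", axis 1 = "gender". Hand-made, not trained.
vecs = {
    "king": (1.0, 1.0), "queen": (1.0, -1.0),
    "man": (0.0, 1.0), "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def analogy(a, b, c):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c), excluding inputs."""
    target = tuple(x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c]))

    def cos(u, v):
        return sum(p * q for p, q in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(vecs[w], target))

result = analogy("king", "man", "woman")   # "queen"
```

Subtracting "man" removes the gender component, adding "woman" puts the opposite one back, and the royalty component is untouched. Real embeddings have no such clean axes, so the relation only holds approximately.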

7.3 GloVe (Pennington 2014): global statistics

Pennington et al. at Stanford: an SVD-like factorization fitted to global co-occurrence counts.

Objective:

J = Σ_{i,j} f(X_{ij}) (u_i^T v_j + b_i + b_j - log X_{ij})^2

X_{ij} = the number of times i and j co-occur; f is a weighting function that down-weights rare pairs and caps very frequent ones.

Word2vec works on local windows, whereas GloVe fits global statistics; the difference is mainly one of interpretation.
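One summand of the objective can be computed by hand. The sketch below uses the weighting function from the GloVe paper, f(x) = (x / x_max)^α for x < x_max and 1 otherwise, with the paper's values x_max = 100 and α = 0.75; the vectors, biases, and counts passed in are made-up inputs.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f from the GloVe paper: damps rare pairs, caps frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(u_i, v_j, b_i, b_j, x_ij):
    """One summand of J for a pair with co-occurrence count x_ij > 0."""
    dot = sum(p * q for p, q in zip(u_i, v_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# A pair whose dot product already equals log(X_ij) contributes zero loss.
perfect_fit = glove_term([1.0], [math.log(50)], 0.0, 0.0, 50)
# The same vectors with a different count are penalized.
misfit = glove_term([1.0], [math.log(50)], 0.0, 0.0, 5)
```

Training pushes every dot product (plus biases) toward the log of the corresponding co-occurrence count, with f deciding how much each pair matters.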

7.4 FastText (Bojanowski 2017): subword features

Facebook AI: subword n-gram embeddings.

The word 'merhaba':

char 3-grams: 'mer', 'erh', 'rha', 'hab', 'aba' (FastText also wraps the word in the boundary markers '<' and '>', which adds '<me' and 'ba>')
Each n-gram has its own vector
Word vector = sum of n-gram vectors

Advantages:

The OOV problem shrinks (an unseen word is composed from its subwords)
Well suited to morphologically rich languages such as Turkish ('anlaşamadık', "we could not agree" = 'anlaş' + 'ama' + 'dık')
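Extracting character n-grams takes only a few lines. The helper below, `char_ngrams`, is a sketch written for this lesson, not FastText's own code; it covers a single n, whereas FastText by default combines several lengths (3 to 6) plus the whole word.

```python
def char_ngrams(word, n=3, boundaries=False):
    """Character n-grams of a word; boundaries=True wraps it in '<' and '>' first."""
    s = f"<{word}>" if boundaries else word
    return [s[i:i + n] for i in range(len(s) - n + 1)]

plain = char_ngrams("merhaba")
# ['mer', 'erh', 'rha', 'hab', 'aba']
marked = char_ngrams("merhaba", boundaries=True)
# ['<me', 'mer', 'erh', 'rha', 'hab', 'aba', 'ba>']
```

The boundary markers let the model tell a prefix such as '<me' apart from the same letters occurring in the middle of a word.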

7.5 ELMo (Peters 2018): contextual

'Embeddings from Language Models': with a bidirectional LSTM, each token's embedding depends on its context.

One of the first widely adopted contextual embeddings. The word 'bank' gets one vector in sentence 1 and a different vector in sentence 2.

7.6 BERT (Devlin 2018): transformer + contextual

Bidirectional transformer + MLM pre-training. Each token's embedding is computed in the context of the whole sentence.

BERT's 'last hidden state' is in fact a contextual embedding.

7.7 GPT family (2018-2026): generative + contextual

GPT-1, GPT-2, GPT-3, GPT-4, GPT-4o: autoregressive transformers.

The input embedding (token ID → vector) and the output projection (vector → logits) are often tied (weight sharing), as in GPT-2; some later models keep them separate.
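Weight tying is a one-line operation in PyTorch because the embedding table and a bias-free output projection have the same `[V, d_model]` weight shape. A minimal sketch with tiny, arbitrary sizes (the names `emb` and `lm_head` are illustrative):

```python
import torch
import torch.nn as nn

V, d_model = 100, 16   # tiny sizes for illustration
emb = nn.Embedding(V, d_model)
lm_head = nn.Linear(d_model, V, bias=False)   # weight shape: [V, d_model]
lm_head.weight = emb.weight                   # share one Parameter

hidden = emb(torch.tensor([[1, 2, 3]]))       # [1, 3, d_model]
logits = lm_head(hidden)                      # [1, 3, V]

tied = lm_head.weight is emb.weight
```

With tying, the same matrix is used to read tokens in and to score them on the way out, which saves V × d_model parameters.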

7.8 Llama, Mistral, Claude (2023-2026)

The embedding layer in modern LLMs:

Llama-3 8B: 128K vocab × 4096 dim = 524M params
Mistral 7B: 32K × 4096 = 131M
Claude (estimated, not publicly confirmed): ~200K × 8192 = ~1.6B
GPT-4 (estimated, not publicly confirmed): ~200K × 12288 = ~2.5B

7.9 2026 trend: multimodal embeddings

Text + image + audio + video → unified embedding space:

CLIP (OpenAI 2021): text-image joint
GPT-4o: text-image-audio-video unified
Gemini Ultra: all modalities

Embeddings are evolving from the notion of a 'token' toward the notion of a 'concept'.

8. Dimensionality: How d_model Is Chosen

8.1 Modern model dimensions

Model	V	d_model	Embedding params
GPT-2 small	50K	768	38M
BERT-base	30K	768	23M
GPT-3	50K	12288	614M
Llama-2 7B	32K	4096	131M
Llama-3 8B	128K	4096	524M
Llama-3 70B	128K	8192	1.05B
Mistral 7B	32K	4096	131M
GPT-4 (estimated)	200K	12288	2.46B

8.2 d_model selection heuristic

A rough empirical rule of thumb (when depth grows alongside width, total parameters scale roughly with the cube of d_model):

d_model ∝ N^{1/3}   # N = total params

7B model: d_model ~4096. 70B model: d_model ~8192. 175B model (GPT-3): d_model ~12288.
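A quick sanity check on how width and depth translate into parameter count uses the standard approximation N ≈ 12 · n_layers · d_model² for non-embedding parameters (from Kaplan et al. 2020, which assumes d_ff = 4 · d_model). The layer counts below are the commonly cited ones for a 7B-class model and for GPT-3; treat the result as an order-of-magnitude estimate only.

```python
def approx_params(n_layers, d_model):
    """Non-embedding parameter estimate: N ≈ 12 · n_layers · d_model²."""
    return 12 * n_layers * d_model ** 2

small = approx_params(32, 4096)     # 32 layers, d_model 4096 (7B-class shape)
large = approx_params(96, 12288)    # 96 layers, d_model 12288 (GPT-3 shape)
print(f"{small / 1e9:.1f}B, {large / 1e9:.1f}B")
# prints "6.4B, 173.9B"
```

Since the layer count tends to grow with width, N grows roughly with the cube of d_model, which is where the one-third exponent comes from.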

8.3 Why 4096 / 8192 / 12288

Multiples of large powers of two (12288 = 3 × 4096): GPU tile-friendly (CUDA tensor cores favor multiples of 64)
Attention head size: d_head = 64 or 128 is typical. d_model = n_heads × d_head.
- 4096 = 32 × 128 (Llama-3 8B)
- 8192 = 64 × 128 (Llama-3 70B)
- 12288 = 96 × 128 (GPT-3)
FFN expansion: d_ff = 4 × d_model in the classic transformer (4096 → 16384); SwiGLU-based models such as Llama use a different ratio.

8.4 Trade-off: capacity vs cost

Larger d_model: more semantic capacity, but every layer is more expensive
Smaller d_model: fast, but limited in capacity

8.5 d_model vs vocab trade-off (equal budget)

With an embedding budget of about 50M parameters:

V=128K, d=400: 51M
V=64K, d=800: 51M
V=32K, d=1600: 51M

Which is better? Empirically, the middle ground. A very large vocab with a very small d_model gives a 'shallow' embedding with insufficient semantic capacity. The opposite extreme has a small vocab and therefore high fertility (Module 6.9).

8.6 What is optimal for Turkish

7B model, Turkish-only:

V = 32K (TR-tuned)
d_model = 4096 (standard)
Embedding = 131M params

Multilingual 7B:

V = 128K (Llama-3 default)
d_model = 4096
Embedding = 524M (about 7% of total params, or about 15% if the output projection is untied)

In multilingual models the embedding share is much larger: about four times the Turkish-only case here.
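The embedding share is easy to compute for any configuration. The sketch below assumes a nominal total of 7e9 parameters for both cases; `embedding_share` is a hypothetical helper, and the `tied` flag doubles the count when the output projection has its own V × d_model matrix.

```python
def embedding_share(vocab_size, d_model, total_params, tied=True):
    """Fraction of total parameters held by the embedding (doubled if the output head is untied)."""
    emb = vocab_size * d_model
    if not tied:
        emb *= 2
    return emb / total_params

total = 7e9  # nominal 7B model (assumption for this sketch)
tr_only = embedding_share(32_000, 4096, total)
multilingual = embedding_share(128_000, 4096, total)
multilingual_untied = embedding_share(128_000, 4096, total, tied=False)
print(f"{tr_only:.1%}, {multilingual:.1%}, {multilingual_untied:.1%}")
```

At the 7B scale the share stays in the single digits to low teens; it becomes dominant only when the rest of the model is small relative to V × d_model.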

9. Initialization: Training from Scratch

9.1 Default PyTorch init: normal(0, 1)

nn.Embedding(V, d_model)
# weight ~ N(0, 1)

Problem: the variance is very high, so activations can blow up in the first forward pass.

9.2 GPT-2 standard: normal(0, 0.02)

nn.init.normal_(embedding.weight, mean=0, std=0.02)

Small standard deviation → small vectors → a stable first forward pass. A common default in modern LLMs.

9.3 Llama style: scaled

std = (2 / d_model) ** 0.5   # ≈ 0.022 for d_model=4096
nn.init.normal_(embedding.weight, mean=0, std=std)

Inspired by He (Kaiming) initialization: the scale depends on d_model.

9.4 Xavier/Glorot

nn.init.xavier_uniform_(embedding.weight)
# uniform(-a, a), a = sqrt(6 / (fan_in + fan_out))

Ideal for linear layers. Somewhat suboptimal for embeddings: for a [V, d_model] weight, fan_in + fan_out is dominated by the very large V, so the resulting range is tiny.
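The schemes in 9.1 to 9.4 can be compared by the per-entry standard deviation they produce and by the typical length of one embedding row, which is roughly std × sqrt(d_model). The sketch below is pure arithmetic with V = 128,000 and d_model = 4096; it uses the fact that Uniform(-a, a) has standard deviation a / sqrt(3).

```python
import math

V, d_model = 128_000, 4096

# Standard deviation of each entry under the schemes discussed above.
stds = {
    "normal(0, 1)": 1.0,
    "normal(0, 0.02)": 0.02,
    "scaled sqrt(2/d_model)": math.sqrt(2 / d_model),
    # Uniform(-a, a) has std a / sqrt(3); a = sqrt(6 / (fan_in + fan_out)).
    "xavier_uniform": math.sqrt(6 / (d_model + V)) / math.sqrt(3),
}

# Expected L2 norm of one embedding row is roughly std * sqrt(d_model).
row_norms = {name: std * math.sqrt(d_model) for name, std in stds.items()}
for name, std in stds.items():
    print(f"{name:24s} std={std:.4f}  row norm≈{row_norms[name]:.2f}")
```

The default N(0, 1) yields rows of length about 64, the 0.02 and scaled variants yield lengths a little above 1, and Xavier on a [V, d_model] table yields rows several times smaller still.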

9.5 Pretrained init (transfer learning)

Copy the embedding from an existing model:

from transformers import AutoModel

pretrained = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-8B")
# new_model is assumed to be a Hugging Face model with the same vocab size and d_model
new_model.get_input_embeddings().weight.data.copy_(pretrained.get_input_embeddings().weight.data)

Common in fine-tuning scenarios.

9.6 Layer-wise scaling (Llama-3 style)

In this scheme (attributed here to Llama-3; details vary between implementations), each layer gets a different init scale:

std_layer_i = base_std / sqrt(i)   # i = 1-based layer index

A smaller init in deeper layers helps stabilize a deep transformer.

9.7 Downstream effect of initialization

std too large: exploding activations in the first forward pass → the loss can be NaN at the first step
std too small: vanishing gradients → very slow learning
Typical working range: std of about 0.02-0.05

9.8 Reinitialization order

Init order when loading a model in production:

Create the model architecture (random init)
Load the checkpoint (overwrite weights)
Resume training or run inference

If there is no checkpoint, start pre-training from the random init.

Exercises

Exercise 1

Vocab size 128K, d_model 4096, fp16 storage. What is the disk and RAM footprint of the embedding layer?

Exercise 2

What is the FLOP difference between a one-hot matrix multiply and a dense embedding lookup? Give a numerical answer for Llama-3 (V=128K, d=4096).

Exercise 3

What is the core claim of the distributional hypothesis? How is it expressed mathematically (skip-gram objective)?

Exercise 4

In word2vec, vec(king) - vec(man) + vec(woman) ≈ vec(queen). Why does this 'analogical reasoning' work? What is the geometric intuition?

Exercise 5

Why might FastText's subword n-gram embeddings be more advantageous for Turkish than Llama-3's? Give a concrete example.

Exercise 6

In Llama-3 70B, d_model = 8192 and V = 128K. How many embedding params is that? What percentage of the 70B total?

Exercise 7

GPT-2 uses std=0.02 init. Why not 1.0? Explain in terms of stability.

Exercise 8

In a multilingual 7B model with V = 128K, the embedding layer takes about 7% of total params (about 15% with an untied output projection), versus about 2% for a 32K vocab. Is this trade-off reasonable? What alternative designs are there?

Exercise 9

Why should the padding token (ID 0) embedding receive no gradient? What does `padding_idx=0` do?

Exercise 10

In production, loading the embedding takes 3 seconds. What could you optimize?

✅ Lesson 7.1 Summary: the Birth of the Embedding

The embedding layer turns an integer token ID into a d-dimensional dense vector. It is a V × d_model lookup table. One-hot is the degenerate case; the embedding creates a semantic vector space. The distributional hypothesis (Firth 1957) is a cornerstone of modern LLMs: a word is known by its context. Historical evolution: LSA → word2vec → GloVe → FastText → ELMo → BERT → LLM. Modern d_model: the 4096-12288 range (Llama-3, GPT-4). Standard init: normal(0, 0.02). In small multilingual models the embedding layer can hold 30-50% of total params; in a 7B model with a 128K vocab it is closer to 7%. In Lesson 7.2 we will walk through the Word2Vec algorithm line by line.

Next Lesson: Word2Vec in Depth

Lesson 7.2: a line-by-line anatomy of the Mikolov 2013 paper. Skip-gram vs CBOW objectives, hierarchical softmax, negative sampling, subword extensions. A word2vec-tr training demo on a Turkish corpus (gensim library).

Frequently Asked Questions

First operation in transformer architecture mapping token IDs → vectors. All other layers (attention, FFN, layernorm) use embedding output. 'Input embedding' is the same concept.
