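Skip-Gram vs CBOW: which is better?

According to Mikolov, Skip-Gram represents rare words better, while CBOW trains several times faster and is slightly more accurate on frequent words. The usual modern preference is Skip-Gram, since quality generally matters more than training speed.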

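Why does negative sampling work without computing the full softmax?

The insight is that the full softmax is not needed to get a useful **gradient direction**: it is enough to reinforce the positive example and push down a few sampled negatives. Negative sampling is a simplified form of noise-contrastive estimation, so it optimizes a related binary-classification objective rather than giving an unbiased estimate of the softmax gradient. Empirically, K = 5-20 negatives yield vectors close in quality to full softmax at a small fraction of the cost, because each step touches K + 1 output vectors instead of all V.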

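Why was the subsampling parameter set to 1e-4?

It is an empirical choice from Mikolov et al.: the threshold t controls how aggressively very frequent words such as 'the', 'a', and 'is' are dropped. Each occurrence of a word w is kept with probability P(keep w) = (sqrt(f / t) + 1) × (t / f), capped at 1, where f is the word's relative frequency in the corpus. Values of t between 1e-5 and 1e-4 are the typical sweet spot. For Turkish, this dampens very frequent words such as 'bir', 've', 'de', and 'da'.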
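A minimal sketch of the keep-probability formula above, assuming f is a word's relative frequency and t the threshold. The function name `keep_probability` and the example frequencies are hypothetical and chosen only for illustration.

```python
import math


def keep_probability(freq: float, t: float = 1e-4) -> float:
    """Probability of keeping one occurrence of a word with relative frequency `freq`."""
    if freq <= 0:
        return 1.0
    p = (math.sqrt(freq / t) + 1.0) * (t / freq)
    return min(p, 1.0)  # the formula exceeds 1 for rare words, so cap it


# Hypothetical relative frequencies, not measured on any corpus
examples = [("ve", 0.02), ("istanbul", 1e-4), ("boğazı", 1e-6)]
for word, freq in examples:
    print(f"{word:10s} f={freq:.0e}  P(keep)={keep_probability(freq):.3f}")
```

Under these assumed frequencies only the very frequent word is dropped most of the time (it is kept with probability of about 0.076), while words at or below the threshold are always kept.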

How do I get a sentence or document embedding from word2vec?

The naive approach is to average or sum the word vectors. A TF-IDF weighted average is better. Modern BERT-based models such as SentenceTransformer perform much better still. Word2vec is primarily a token-level method, so modern alternatives are recommended for sentence-level tasks.
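The two averaging strategies can be sketched as follows. This is an illustration only: the random 8-dimensional `vectors` stand in for trained word2vec vectors, and `idf_weights` / `doc_embedding` are hypothetical helper names. Term frequency enters implicitly because repeated tokens are averaged repeatedly.

```python
import math
from collections import Counter

import numpy as np


def idf_weights(docs):
    """Smoothed inverse document frequency for every token in tokenized docs."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + c)) + 1.0 for tok, c in df.items()}


def doc_embedding(tokens, vectors, idf=None):
    """Weighted average of word vectors; out-of-vocabulary tokens are skipped."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return np.zeros(len(next(iter(vectors.values()))))
    weights = np.array([idf.get(t, 1.0) if idf else 1.0 for t in known])
    mat = np.stack([vectors[t] for t in known])
    return weights @ mat / weights.sum()


docs = [
    ["istanbul", "boğazı", "çok", "güzel"],
    ["ankara", "çok", "soğuk"],
    ["istanbul", "çok", "kalabalık"],
]
rng = np.random.default_rng(0)
# Stand-in vectors; in practice these would come from a trained model
vectors = {w: rng.normal(size=8) for w in sorted({t for d in docs for t in d})}

idf = idf_weights(docs)
plain = doc_embedding(docs[0], vectors)          # simple mean
weighted = doc_embedding(docs[0], vectors, idf)  # IDF-weighted mean
print(plain.shape, weighted.shape)
```

The weighted variant down-weights tokens that occur in every document (here "çok"), which is the main reason it tends to beat the plain mean.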

Word2Vec Line by Line: Anatomy of Mikolov 2013's Skip-Gram + CBOW + Negative Sampling

A line-by-line anatomy of the Mikolov 2013 papers: architectural differences between Skip-Gram and CBOW, the softmax computational bottleneck, hierarchical softmax (Huffman tree), negative sampling (Mikolov 2013b), subsampling, and the dynamic window. A pure Python implementation in about 100 lines. A demo of training Turkish word2vec with Gensim. A comparison with modern LLM embeddings.

Şükrü Yusuf KAYA

70-minute read

26.06.2026

Advanced

📜 Mikolov 2013: the paper that started the evolution of embeddings

In January 2013, Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean (Google) published 'Efficient Estimation of Word Representations in Vector Space'. This short paper reshaped NLP by combining a shallow neural network, the distributional hypothesis, and careful engineering (negative sampling, hierarchical softmax, subsampling). The follow-up paper appeared in October 2013, the code was released as the `word2vec` C library, and a Python port followed in gensim. Within a few years it was in use in NLP labs worldwide. The learned embedding layers of BERT, GPT, and Llama build on ideas this work popularized. After these 70 minutes you will have understood the papers line by line, written your own 100-line Python implementation, and trained word2vec-tr on a Turkish corpus. This is the algorithmic cornerstone of embeddings.

Lesson Map (13 Sections)

1. Historical context: neural language models before 2013
2. Mikolov 2013a: introduction to Skip-Gram vs CBOW
3. Skip-Gram math in depth: objective, softmax
4. CBOW math in depth: averaging the context window
5. Computational bottleneck: the softmax over V
6. Hierarchical softmax: the Huffman tree solution
7. Negative sampling (Mikolov 2013b): the modern choice
8. Subsampling: dampening frequent words
9. Dynamic window: variable context size
10. Pure Python implementation: skip-gram in 100 lines
11. Gensim Turkish demo: practical word2vec-tr training
12. Vector arithmetic: the magic of analogy (king - man + woman = queen)
13. Comparison with modern LLM embeddings

1. Historical Context: Before 2013

1.1 Neural LMs before Mikolov

Bengio 2003: 'A Neural Probabilistic Language Model', the first NN-based word embedding
Collobert & Weston 2008: 'A Unified Architecture for Natural Language Processing', task-agnostic embeddings (extended in 2011 as 'Natural Language Processing (Almost) from Scratch')
Mnih & Hinton 2009: Hierarchical Log-Bilinear Model
Turian, Ratinov, Bengio 2010: 'Word representations: A simple and general method for semi-supervised learning'

These papers introduced embeddings but did not reach practical scale: training was slow, so corpora and vocabularies stayed comparatively small.

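1.2 The Mikolov 2013 leap

Corpus: 1.6 billion words (Google News)
Vocabulary: 692K
Training time: about 1 day (16-core CPU)
Quality: supports analogical reasoning through vector arithmetic

By comparison, earlier setups ran for hours on corpora and vocabularies (about 30K words) that were orders of magnitude smaller. Mikolov's models delivered roughly a 1000x increase in scale together with a clear gain in quality.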

1.3 Why this became possible

Simple architecture: no hidden layer (only projection + softmax)
Distributed training: can be optimized in parallel
Hierarchical softmax: O(V) → O(log V)
Negative sampling: stochastic approximation of the loss
Subsampling: skip frequent words with little or no loss in quality

Each is a mid-level engineering trick on its own, but together they amount to a paradigm shift.

2. Skip-Gram vs CBOW: Two Directions

2.1 Skip-Gram: center → context

Given w_t, predict w_{t-k}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k}

Example (k=2):

"Bugün İstanbul Boğazı çok güzel."  ("The Bosphorus in Istanbul is very beautiful today.")
  w_t = "Boğazı"
  Predict: "Bugün", "İstanbul", "çok", "güzel"

Objective:

J = -1/T Σ_{t=1}^T Σ_{j∈[-k,k], j≠0} log P(w_{t+j} | w_t)
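The double sum in the objective runs over every (center, context) pair inside the window. A small sketch of how those pairs can be enumerated for the example sentence; `skipgram_pairs` is a hypothetical helper name, and the window is clipped at sentence boundaries as an assumption.

```python
def skipgram_pairs(tokens, k=2):
    """Return all (center, context) pairs for a symmetric window of radius k."""
    pairs = []
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - k), min(len(tokens), t + k + 1)
        for j in range(lo, hi):
            if j != t:  # j = t would pair the center word with itself
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["Bugün", "İstanbul", "Boğazı", "çok", "güzel"]
pairs = skipgram_pairs(tokens, k=2)
bogazi_contexts = [ctx for center, ctx in pairs if center == "Boğazı"]
print(len(pairs))        # 14
print(bogazi_contexts)   # ['Bugün', 'İstanbul', 'çok', 'güzel']
```

Each pair contributes one log P(w_{t+j} | w_t) term, so a five-token sentence with k = 2 already produces 14 training examples.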

2.2 CBOW: context → center

Given w_{t-k}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k}, predict w_t

Example:

"Bugün İstanbul ___ çok güzel."
  Context: ["Bugün", "İstanbul", "çok", "güzel"]
  Predict: "Boğazı"

Objective:

J = -1/T Σ_{t=1}^T log P(w_t | w_{t-k}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k})
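A minimal sketch of one CBOW term of this objective, assuming the context is summarized by averaging its input vectors and scored against every output vector with a full softmax. The tiny vocabulary, the dimension d = 8, and the random weights are illustrative; both matrices are stored row-wise as [V, d] here for convenient indexing.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Bugün", "İstanbul", "Boğazı", "çok", "güzel"]
idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8
W_in = rng.normal(scale=0.1, size=(V, d))   # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(V, d))  # output (center) embeddings


def cbow_probs(context_words):
    """P(w | context) for every w, using the averaged context vector."""
    h = W_in[[idx[w] for w in context_words]].mean(axis=0)  # [d]
    logits = W_out @ h                                       # [V]
    logits = logits - logits.max()                           # numerical stability
    p = np.exp(logits)
    return p / p.sum()


probs = cbow_probs(["Bugün", "İstanbul", "çok", "güzel"])
loss = -np.log(probs[idx["Boğazı"]])  # one term of J, before the 1/T average
print(probs.round(3), float(loss))
```

With untrained weights the distribution is close to uniform, so the loss starts near log(V).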

2.3 Which is better

According to Mikolov:

Skip-Gram: better for rare words (every (center, context) pair is its own prediction target, so rare words receive more updates)
CBOW: faster (each training step maps one averaged context to one prediction) and sufficient for frequent words

Modern preference: Skip-Gram (rare-word quality matters).

2.4 Pictorial overview

         SKIP-GRAM                                CBOW

                  +--> w_{t-2}        w_{t-2} --+
  +-------+       +--> w_{t-1}        w_{t-1} --+      +-------+
  |  w_t  | ------+                             +----> |  w_t  |
  +-------+       +--> w_{t+1}        w_{t+1} --+      +-------+
                  +--> w_{t+2}        w_{t+2} --+

   (input)          (predicted)        (input)         (predicted)

2.5 Architecture (Skip-Gram)

Input:  one-hot(w_t)              [V]
  ↓ multiply by W_in (V × d)
Hidden: vector(w_t)               [d]
  ↓ multiply by W_out (d × V)
Output: logits over vocab         [V]
  ↓ softmax
Probs:  P(* | w_t)                [V]

Two matrices: W_in (input embeddings) and W_out (output embeddings). W_in is the one usually used as the 'word vectors'.
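Because the input is one-hot, the first multiplication is just a row lookup in W_in. A minimal NumPy sketch of the forward pass above, with toy sizes (V = 6, d = 4) and random weights chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4
W_in = rng.normal(size=(V, d))    # input embeddings,  V × d
W_out = rng.normal(size=(d, V))   # output embeddings, d × V

center = 2                         # index of w_t (arbitrary choice)
one_hot = np.zeros(V)
one_hot[center] = 1.0

hidden = one_hot @ W_in            # [d], identical to W_in[center]
logits = hidden @ W_out            # [V]
probs = np.exp(logits - logits.max())
probs /= probs.sum()               # softmax: P(* | w_t)

print(np.allclose(hidden, W_in[center]))  # True
```

Real implementations skip the one-hot vector entirely and index the row directly, which is why the "projection" step is essentially free.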

2.6 Why there is 'NO hidden layer'

In a classical NN, a hidden layer means a nonlinear activation. Word2vec has only a linear projection: the 'hidden' layer is the identity (no tanh/ReLU). This keeps the model shallow and speeds up training.

5. Computational Bottleneck: Softmax over V

5.1 Softmax formula

P(w_O | w_I) = exp(u_O^T v_I) / Σ_{w∈V} exp(u_w^T v_I)

The denominator sums over every word in the vocabulary. With V = 692K, every training step requires 692K dot products.

5.2 FLOP count

Per training step:

Forward: V × d operations (output projection)
Backward: V × d gradient computations
Total: O(V × d) per token

692K × 300 ≈ 207M FLOP per token. A 1.6B-token corpus → about 3.3 × 10^17 FLOP in total, which is impractical on CPUs (on the order of years).
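The arithmetic can be checked in a few lines. The sustained throughput of 10 GFLOP/s is an assumption made only to turn FLOPs into wall-clock time; it is not a figure from the paper, and the count covers a single output projection per token.

```python
V = 692_000               # vocabulary size
d = 300                   # embedding dimension
tokens = 1_600_000_000    # corpus size in tokens

flops_per_token = V * d                    # one output projection
total_flops = flops_per_token * tokens     # one pass over the corpus

assumed_flops_per_second = 1e10            # hypothetical sustained CPU throughput
seconds = total_flops / assumed_flops_per_second
years = seconds / (365 * 24 * 3600)

print(f"{flops_per_token:,} FLOP per token")
print(f"{total_flops:.2e} FLOP in total")
print(f"{years:.2f} years at the assumed throughput")
```

Under this assumption a single forward projection per token already costs about a year; the backward pass, several context words per center word, and multiple epochs multiply that further.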

5.3 Two solutions

Hierarchical softmax: replace the softmax with a tree-based factorization, O(log V)
Negative sampling: turn the softmax into binary classification, O(K) (K=5-20)

The second is the modern choice.

5.4 Hierarchical softmax (brief)

Organize the vocabulary as a Huffman tree in which every word is a leaf. P(w | w_I) is the product of the branch probabilities along the path from the root to that leaf.

Tree depth is about log2(V). Every internal node is a binary decision (left/right).
O(log V) operations per softmax instead of O(V).

692K vocab → about 20 binary decisions instead of 692K, roughly a 35,000x reduction in output-layer work.
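To make the path-length argument concrete, here is a small sketch that builds a Huffman tree with `heapq` and reports each word's code length, which is the number of binary decisions needed for that word. The word counts are made up for illustration, and the list concatenation keeps the code short rather than fast.

```python
import heapq
import itertools
import math


def huffman_code_lengths(counts):
    """Return {word: code length} for a Huffman tree built from word counts."""
    tie = itertools.count()  # tie-breaker so the heap never compares lists
    heap = [(c, next(tie), [w]) for w, c in counts.items()]
    heapq.heapify(heap)
    lengths = {w: 0 for w in counts}
    while len(heap) > 1:
        c1, _, words1 = heapq.heappop(heap)
        c2, _, words2 = heapq.heappop(heap)
        for w in words1 + words2:  # every leaf under the merge moves one level down
            lengths[w] += 1
        heapq.heappush(heap, (c1 + c2, next(tie), words1 + words2))
    return lengths


# Made-up counts: two very frequent words and three rare ones
counts = {"ve": 50, "bir": 40, "istanbul": 5, "boğazı": 3, "manzarası": 2}
lengths = huffman_code_lengths(counts)
print(lengths)

# Depth of a balanced binary tree over a 692K-word vocabulary
balanced_depth = math.ceil(math.log2(692_000))
print(balanced_depth)  # 20
```

Frequent words end up near the root (one decision for "ve", four for "manzarası"), so the average number of decisions per training example is lower than the balanced-tree depth.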

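5.5 Negative sampling (the actual modern choice)

Insight: computing the full softmax is NOT necessary; just reinforce the positive examples and push down a few negative examples.

Objective (to maximize; the training loss is -J):

J = log σ(u_O^T v_I) + Σ_{w_N ∼ noise} log σ(-u_{w_N}^T v_I)

First term: the positive (actual context word); σ is the sigmoid. Second term: a sum over K negative examples (randomly sampled noise words), whose scores are pushed down through the sigmoid of the negated dot product.

K = 5-20 is typical. For a 692K vocabulary and K = 10, that is 692K / 10 ≈ 69,200x less output-layer work (about 63,000x when the positive example is counted as well).

5.6 Noise distribution

How are negative examples chosen? Mikolov 2013b: the unigram distribution raised to the 0.75 power:

P_noise(w) ∝ count(w)^0.75

Why the 0.75 power? It is an empirical sweet spot: it lowers the sampling share of very frequent words ('the', 'a') and raises that of rare words, a balance between the raw unigram and the uniform distribution.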
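The effect of the exponent is easy to see on a toy count vector. In this sketch the three counts are invented and `noise_distribution` is a hypothetical helper name; it compares the raw unigram distribution with the 0.75-smoothed one and draws a few negatives.

```python
import numpy as np


def noise_distribution(counts, power=0.75):
    """Normalize count**power into a sampling distribution."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()


counts = np.array([10_000, 100, 1])          # very frequent, medium, rare
unigram = noise_distribution(counts, power=1.0)
smoothed = noise_distribution(counts, power=0.75)

print(unigram.round(4))    # raw unigram shares
print(smoothed.round(4))   # frequent word loses share, rare word gains

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=10, p=smoothed)  # K = 10 noise indices
print(negatives)
```

The rare word's sampling probability rises by roughly a factor of ten, while the most frequent word still dominates, which is the balance the exponent is meant to strike.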

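5.7 Pseudocode

import numpy as np

def log_sigmoid(x):
    # log σ(x) = -log(1 + exp(-x)), computed in a numerically stable way
    return -np.logaddexp(0.0, -x)

def neg_sampling_loss(center, context, negatives, W_in, W_out):
    """Loss for one (center, context) pair; W_in and W_out are [V, d] arrays."""
    v_c = W_in[center]             # [d]    center (input) vector
    u_o = W_out[context]           # [d]    true context (output) vector
    u_n = W_out[negatives]         # [K, d] sampled noise vectors

    pos = -log_sigmoid(u_o @ v_c)            # pull the true pair together
    neg = -log_sigmoid(-(u_n @ v_c)).sum()   # push the K noise pairs apart

    return pos + neg

FLOP per step: d for the positive + K × d for the negatives = O(K × d), independent of V.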
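The parameter updates follow from differentiating this loss. A short derivation in the pseudocode's notation (v_c, u_o, u_n), using the identities d/dx log σ(x) = 1 - σ(x) and d/dx log σ(-x) = -σ(x); the symbol L for the per-pair loss is introduced here for convenience:

```latex
\begin{aligned}
L &= -\log\sigma(u_o^\top v_c) \;-\; \sum_{n=1}^{K} \log\sigma(-u_n^\top v_c) \\[4pt]
\frac{\partial L}{\partial u_o} &= \bigl(\sigma(u_o^\top v_c) - 1\bigr)\, v_c \\[4pt]
\frac{\partial L}{\partial u_n} &= \sigma(u_n^\top v_c)\, v_c \\[4pt]
\frac{\partial L}{\partial v_c} &= \bigl(\sigma(u_o^\top v_c) - 1\bigr)\, u_o \;+\; \sum_{n=1}^{K} \sigma(u_n^\top v_c)\, u_n
\end{aligned}
```

Every term has the form (σ(score) - label) times a vector, with label 1 for the true context word and 0 for a noise word, so each update touches only K + 1 output vectors and one input vector.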

python

# Skip-gram with negative sampling in plain Python + NumPy, ~100 lines
import random
from collections import Counter

import numpy as np


def sigmoid(x):
    # Clip to avoid overflow in exp for extreme scores
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))


def build_vocab(corpus, min_count=5):
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence)
    # Filter first, then enumerate, so indices are contiguous in [0, V)
    kept = [w for w, c in counts.items() if c >= min_count]
    if not kept:
        raise ValueError("empty vocabulary: lower min_count or use a larger corpus")
    vocab = {w: i for i, w in enumerate(kept)}
    # Noise distribution: unigram counts raised to the 0.75 power
    word_freqs = np.array([counts[w] ** 0.75 for w in kept])
    word_freqs /= word_freqs.sum()
    return vocab, word_freqs


def get_negatives(K, word_freqs):
    return np.random.choice(len(word_freqs), size=K, p=word_freqs)


def train_skipgram(corpus, d=100, window=5, K=10, epochs=5, lr=0.025, min_count=5):
    vocab, word_freqs = build_vocab(corpus, min_count)
    V = len(vocab)
    corpus = list(corpus)  # shuffle a copy, not the caller's list

    # Init embeddings: small random input vectors, zero output vectors
    W_in = np.random.uniform(-0.5 / d, 0.5 / d, (V, d))
    W_out = np.zeros((V, d))

    for epoch in range(epochs):
        random.shuffle(corpus)
        total_loss = 0.0
        steps = 0
        for sentence in corpus:
            tokens = [vocab[w] for w in sentence if w in vocab]
            for i, center in enumerate(tokens):
                # Dynamic window: sample the radius uniformly from 1..window
                w = random.randint(1, window)
                ctx_indices = list(range(max(0, i - w), i)) + \
                    list(range(i + 1, min(len(tokens), i + w + 1)))
                for j in ctx_indices:
                    context = tokens[j]
                    v_c = W_in[center].copy()  # frozen copy for this step
                    grad_v = np.zeros(d)       # accumulated gradient w.r.t. v_c
                    # Positive pair (label 1)
                    score = sigmoid(W_out[context] @ v_c)
                    loss = -np.log(score + 1e-9)
                    grad_v += (score - 1.0) * W_out[context]
                    W_out[context] -= lr * (score - 1.0) * v_c
                    # Negative pairs (label 0)
                    for neg in get_negatives(K, word_freqs):
                        if neg == context:
                            continue  # do not push the true context away
                        score = sigmoid(W_out[neg] @ v_c)
                        loss += -np.log(1.0 - score + 1e-9)
                        grad_v += score * W_out[neg]
                        W_out[neg] -= lr * score * v_c
                    # Apply the accumulated update to the center vector once
                    W_in[center] -= lr * grad_v
                    total_loss += loss
                    steps += 1
        print(f"Epoch {epoch + 1}: avg loss = {total_loss / max(steps, 1):.4f}")

    return W_in, vocab


# Usage on a toy Turkish corpus
turkish_corpus = [
    ["bugün", "İstanbul", "çok", "güzel"],
    ["İstanbul", "boğazı", "manzarası", "muhteşem"],
    # ... (add many more sentences for a real corpus)
]

# min_count=1 so that the toy corpus yields a non-empty vocabulary
W_in, vocab = train_skipgram(turkish_corpus, d=100, epochs=10, min_count=1)
print(f"Vocab size: {len(vocab)}, embedding dim: {W_in.shape[1]}")


# Cosine similarity
def cosine_sim(v1, v2):
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))


if "İstanbul" in vocab and "boğazı" in vocab:
    sim = cosine_sim(W_in[vocab["İstanbul"]], W_in[vocab["boğazı"]])
    print(f"Sim(İstanbul, boğazı) = {sim:.3f}")

Word2Vec Skip-Gram + Negative Sampling: plain Python/NumPy implementation

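11. Turkish word2vec Demo with Gensim

11.1 Setup

pip install gensim datasets

11.2 Preparing the corpus

import re

from datasets import load_dataset
from gensim.models import Word2Vec


def tokenize_tr(text):
    # str.lower() maps "I" to "i" and "İ" to "i" plus a combining dot,
    # so apply the Turkish casing rules for these two letters first
    text = text.replace("İ", "i").replace("I", "ı").lower()
    # In Python 3, \w already matches Unicode letters such as ç, ğ, ı, ö, ş, ü
    return re.findall(r"\w+", text)


# Assumption: the "wikimedia/wikipedia" dataset with the "20231101.tr" config;
# substitute whichever Turkish Wikipedia dump is available to you
wiki_tr = load_dataset("wikimedia/wikipedia", "20231101.tr", split="train[:10000]")

# Each training "sentence" here is actually one tokenized paragraph
sentences = []
for row in wiki_tr:
    for paragraph in row["text"].split("\n"):
        if len(paragraph) > 50:  # skip headings and very short lines
            sentences.append(tokenize_tr(paragraph))

print(f"Total sentences: {len(sentences):,}")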

11.3 Training

model = Word2Vec(
    sentences=sentences,
    vector_size=100,       # embedding dimension d
    window=5,              # maximum context window radius
    min_count=5,           # drop words seen fewer than 5 times
    workers=8,             # CPU thread count
    sg=1,                  # 1 = skip-gram, 0 = CBOW
    hs=0,                  # 0 = negative sampling, 1 = hierarchical softmax
    negative=10,           # K negative samples
    sample=1e-4,           # subsampling threshold
    epochs=10,             # iterations over corpus
)
model.save("word2vec-tr-100d.model")
print(f"Vocab size: {len(model.wv):,}")

Training time: roughly 5 minutes for the paragraphs of the first 10K Wikipedia articles, depending on hardware.

11.4 Usage

model = Word2Vec.load("word2vec-tr-100d.model")

# Most similar words (outputs are illustrative; neighbors and scores vary by run)
print(model.wv.most_similar("istanbul"))
# e.g. [('boğazı', 0.78), ('şehir', 0.74), ('ankara', 0.69), ...]

print(model.wv.most_similar("kahve"))
# e.g. [('çay', 0.68), ('süt', 0.56), ('demlemek', 0.54), ...]

# Analogy: istanbul - türkiye + almanya
result = model.wv.most_similar(
    positive=["istanbul", "almanya"],
    negative=["türkiye"],
)
print(result[0])  # e.g. ('berlin', 0.74): success!

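11.5 Vector arithmetic: the magic

# vec(kral) - vec(erkek) + vec(kadın) ≈ vec(kraliçe)
king_vec = model.wv["kral"]
man_vec = model.wv["erkek"]
woman_vec = model.wv["kadın"]
queen_vec = king_vec - man_vec + woman_vec
# A raw-vector query does not exclude the input words, so filter them out
inputs = {"kral", "erkek", "kadın"}
sim = [(w, s) for w, s in model.wv.similar_by_vector(queen_vec, topn=10) if w not in inputs]
print(sim[0])  # e.g. ('kraliçe', 0.71); illustrative, varies by run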
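What the analogy query does under the hood can be sketched without Gensim. The function below is an illustrative reimplementation of the usual "add and subtract unit vectors, then rank by cosine" recipe, not Gensim's actual code, and the 2-dimensional `toy` vectors are hand-made (axis 0: royalty, axis 1: gender) rather than trained.

```python
import numpy as np


def analogy(a, b, c, vectors, topn=1):
    """Solve a : b :: c : ? by cosine similarity to vec(b) - vec(a) + vec(c)."""
    words = list(vectors)
    mat = np.stack([vectors[w] for w in words])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # unit-normalize rows
    unit = dict(zip(words, mat))
    target = unit[b] - unit[a] + unit[c]
    scores = mat @ (target / np.linalg.norm(target))
    ranked = [
        (words[i], float(scores[i]))
        for i in np.argsort(-scores)
        if words[i] not in {a, b, c}  # exclude the query words themselves
    ]
    return ranked[:topn]


# Hand-made toy vectors, not the output of any training run
toy = {
    "kral": np.array([1.0, 1.0]),
    "kraliçe": np.array([1.0, -1.0]),
    "erkek": np.array([0.0, 1.0]),
    "kadın": np.array([0.0, -1.0]),
    "çay": np.array([-1.0, 0.0]),
}

best = analogy("erkek", "kral", "kadın", toy)
print(best[0][0])  # kraliçe
```

Excluding the query words matters in practice: with trained vectors the nearest neighbor of the target is often one of the inputs.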

11.6 Typical quality for Turkish

With a 10M-token Wikipedia corpus, 100 dimensions, and 10 epochs:

Analogy accuracy: about 30-40% (limited corpus)
Most-similar quality: subjectively good
Geographical analogies work
Profession-gender analogies work

✅ Lesson 7.2 Summary: The Word2Vec Algorithm

Mikolov 2013: shallow neural network + clever engineering = the embedding revolution. Skip-Gram (center → context) vs CBOW (context → center). The softmax bottleneck (O(V)) is resolved by negative sampling (O(K)), with K = 5-20 typical. Subsampling dampens frequent words. The dynamic window varies the context size. A pure Python implementation fits in about 100 lines. Training Turkish word2vec-tr with Gensim yields meaningful embeddings in about 5 minutes. Vector arithmetic enables analogical reasoning: vec(istanbul) - vec(türkiye) + vec(almanya) ≈ vec(berlin). Lesson 7.3 moves on to GloVe + FastText.

Next Lesson: GloVe + FastText

Lesson 7.3 covers GloVe, which is based on global co-occurrence statistics, and subword-aware FastText, with a practical demo of FastText's advantage for Turkish as a morphologically rich language.

Frequently Asked Questions

Is Word2Vec still used in modern LLMs?

Not directly. Modern LLMs (Llama-3, GPT-4) are pre-trained end to end, so the embedding layer is learned jointly with the transformer. However: (1) word2vec embeddings are sometimes used for initialization; (2) they remain practical for lightweight semantic search; (3) they are a historical and educational anchor.
