Word2Vec Line by Line: Anatomy of Mikolov 2013's Skip-Gram + CBOW + Negative Sampling
Line-by-line anatomy of the Mikolov 2013 paper: Skip-Gram vs CBOW architecture differences, the softmax computational bottleneck, hierarchical softmax (Huffman tree), negative sampling (Mikolov 2013b), subsampling, dynamic window. A pure Python implementation in 100 lines. A Gensim Turkish word2vec training demo. Comparison with modern LLM embeddings.
Şükrü Yusuf KAYA
70 min read
Advanced 📜 Mikolov 2013 — the paper that kicked off the evolution of embeddings
In January 2013, Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean (Google) published the paper 'Efficient Estimation of Word Representations in Vector Space'. Those dozen-or-so pages changed NLP forever: a shallow neural network + the distributional hypothesis + engineering refinements (negative sampling, hierarchical softmax, subsampling). A follow-up paper appeared later in 2013, the code was released (the `word2vec` C library), and then came the gensim Python port. Within 2.5 years it was in use in every NLP lab in the world. BERT, GPT, Llama — they all owe a debt to this paper. Seventy minutes from now you will have understood the paper line by line, written your own 100-line Python implementation, and trained word2vec-tr on a Turkish corpus. This is the algorithmic cornerstone of embeddings.
Lesson Map (13 Sections)#
- Historical context — neural language models before 2013
- Mikolov 2013a — Skip-Gram vs CBOW intro
- Skip-Gram deep math — objective, softmax
- CBOW deep math — context window averaging
- Computational bottleneck — the softmax-over-V problem
- Hierarchical softmax — the Huffman tree solution
- Negative sampling (Mikolov 2013b) — the modern choice
- Subsampling — dampening frequent words
- Dynamic window — variable context size
- Pure Python implementation — skip-gram in 100 lines
- Gensim Turkish demo — practical word2vec-tr training
- Vector arithmetic — the magic of analogy (king - man + woman = queen)
- Comparison with modern LLM embeddings
1. Historical Context — Before 2013#
1.1 Neural LMs before Mikolov#
- Bengio 2003: 'A Neural Probabilistic Language Model' — first NN-based word embedding
- Collobert & Weston 2008: NLP from Scratch — task-agnostic embedding
- Mnih & Hinton 2009: Hierarchical Log-Bilinear Model
- Turian, Ratinov, Bengio 2010: 'Word representations: A simple and general method for semi-supervised learning'
These papers introduced embeddings, but not at practical scale: corpora of ~100K words, small vocabularies.
1.2 The Mikolov 2013 leap#
- Corpus: 1.6 Billion words (Google News)
- Vocab: 692K
- Training time: 1 day (16-core CPU)
- Quality: analogical reasoning via vector arithmetic
The previous state of the art: roughly 4 hours for a 100K-word corpus with a ~30K vocabulary.
Mikolov delivered roughly 1000x the scale and a step change in quality.
1.3 Why this became possible#
- Simple architecture: no hidden layer (just projection + softmax)
- Distributed training: can be optimized in parallel
- Hierarchical softmax: O(V) → O(log V)
- Negative sampling: stochastic loss approximation
- Subsampling: skip frequent words with minimal information loss
Each one is a modest engineering trick, but together they amounted to a paradigm shift.
2. Skip-Gram vs CBOW — Two Directions#
2.1 Skip-Gram: center → context#
Given w_t, predict w_{t-k}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k}
Example (k=2):
"Bugün İstanbul Boğazı çok güzel." w_t = "Boğazı" Predict: "Bugün", "İstanbul", "çok", "güzel"
Objective:
J = -1/T Σ_{t=1}^T Σ_{j∈[-k,k], j≠0} log P(w_{t+j} | w_t)
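To make the double sum in J concrete, here is a minimal sketch (the function name `skipgram_pairs` is illustrative, not from the paper) that enumerates the (center, context) pairs the objective iterates over for one tokenized sentence:
```python
def skipgram_pairs(tokens, k=2):
    """Enumerate the (w_t, w_{t+j}) training pairs for a window of size k."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-k, k + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center itself and out-of-sentence positions
            pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs(["bugün", "istanbul", "boğazı", "çok", "güzel"], k=2))
# [('bugün', 'istanbul'), ('bugün', 'boğazı'), ('istanbul', 'bugün'), ...]
```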
2.2 CBOW: context → center#
Given w_{t-k}, ..., w_{t+k}, predict w_t
Example:
"Bugün İstanbul ___ çok güzel." Context: ["Bugün", "İstanbul", "çok", "güzel"] Predict: "Boğazı"
Objective:
J = -1/T Σ_{t=1}^T log P(w_t | w_{t-k}, ..., w_{t+k})
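The corresponding forward computation, sketched below under the assumption that W_in and W_out are both stored as (V × d) matrices (as in the pure Python implementation later in this lesson); `cbow_probs` is an illustrative name:
```python
import numpy as np

def cbow_probs(context_ids, W_in, W_out):
    """P(w | context): average the context input vectors, score against every output vector."""
    h = W_in[context_ids].mean(axis=0)    # average of context vectors      [d]
    logits = W_out @ h                    # one score per vocabulary word   [V]
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Per-position loss for the true center word w_t:
#   loss = -np.log(cbow_probs(context_ids, W_in, W_out)[center_id])
```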
2.3 Which one is better#
From the Mikolov paper:
- Skip-Gram: better for rare words (every token becomes its own prediction target)
- CBOW: faster (one context → one prediction per training step), good enough for frequent words
Modern preference: Skip-Gram (rare-word quality matters).
2.4 Pictorial overview#
```
SKIP-GRAM                          CBOW

+--------+                         +----------+
|  w_t   | ---> Predict            | w_{t-2}  | ---+
+--------+    +---> w_{t-2}        +----------+    |
              +---> w_{t-1}        | w_{t-1}  | ---+
              +---> w_{t+1}        +----------+    +---> Predict w_t
              +---> w_{t+2}        | w_{t+1}  | ---+
                                   +----------+    |
                                   | w_{t+2}  | ---+
                                   +----------+
```
2.5 Architecture (Skip-Gram)#
```
Input:   one-hot(w_t)          [V]
           ↓  multiply by W_in (V × d)
Hidden:  vector(w_t)           [d]
           ↓  multiply by W_out (d × V)
Output:  logits over vocab     [V]
           ↓  softmax
Probs:   P(* | w_t)            [V]
```
Two matrices: W_in (input embeddings) and W_out (output embeddings). By convention, W_in is what gets used as the 'word vectors'.
2.6 Why there is 'NO hidden layer'#
In a classical NN, a hidden layer implies a nonlinear activation. Word2vec has only a linear projection — the 'hidden layer' is the identity (no tanh/relu). This keeps the model shallow and makes training fast.
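A tiny sketch of that point, with made-up dimensions: multiplying a one-hot vector by W_in is just a row lookup, and nothing nonlinear happens between the two matrix multiplications (here W_out is stored as (V × d), matching the implementation below).
```python
import numpy as np

V, d = 10, 4                              # toy sizes, for illustration only
W_in = np.random.randn(V, d)              # input embeddings
W_out = np.random.randn(V, d)             # output embeddings

w_t = 3                                   # vocab index of the center word
one_hot = np.zeros(V)
one_hot[w_t] = 1.0

h = one_hot @ W_in                        # "projection layer" = plain row lookup
assert np.allclose(h, W_in[w_t])          # no activation function anywhere

logits = W_out @ h                        # one score per vocabulary word [V]
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax → P(* | w_t)
```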
5. Computational Bottleneck — Softmax over V#
5.1 Softmax formula#
P(w_O | w_I) = exp(u_O^T v_I) / Σ_{w∈V} exp(u_w^T v_I)
The denominator is a sum over all V words. With V = 692K, that is 692K terms for every single training step.
5.2 FLOP count#
Per training step:
- Forward: V × d operations (output projection)
- Backward: V × d gradient computations
- Total: O(V × d) per token
692K × 300 ≈ 207M FLOP per token. A 1.6B-token corpus → ~3.3 × 10^17 FLOP in total — on the order of years on a CPU.
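A quick back-of-the-envelope check of those numbers (forward projection only, mirroring the estimate above):
```python
V, d, tokens = 692_000, 300, 1_600_000_000

flop_per_token = V * d                    # ≈ 2.08e8 (the "207M FLOP per token")
total_flop = flop_per_token * tokens      # ≈ 3.3e17
print(f"{flop_per_token:.2e} FLOP/token, {total_flop:.2e} FLOP total")
```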
5.3 Two solutions#
- Hierarchical softmax — replace the softmax with a tree-based computation, O(log V)
- Negative sampling — turn the softmax into binary classification, O(K) (K = 5-20)
The second is the modern choice.
5.4 Hierarchical softmax (briefly)#
Organize the vocabulary as a Huffman tree, with every word as a leaf. P(w | w_I) = the product of probabilities along the path to that leaf.
Tree depth is log(V); each node is a binary decision (left/right). O(log V) operations per softmax instead of O(V).
A 692K vocab → ~20 binary decisions instead of 692K terms: a ~35,000x speedup.
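A minimal sketch of the path-product idea (not a full Huffman-tree builder; the function and variable names are illustrative): each inner node n on the path to word w has an output vector u_n and a left/right direction d_n ∈ {+1, -1}, and P(w | w_I) is a product of one sigmoid per node.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_prob(v_in, path_vectors, path_directions):
    """Probability of reaching one leaf: a product of ~log2(V) binary decisions."""
    p = 1.0
    for u_n, d_n in zip(path_vectors, path_directions):
        p *= sigmoid(d_n * (u_n @ v_in))   # one left/right decision per inner node
    return p

# Toy example: a depth-3 path with random vectors.
d = 8
v_in = np.random.randn(d)
path = [np.random.randn(d) for _ in range(3)]
dirs = [+1, -1, +1]
print(hs_prob(v_in, path, dirs))
```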
5.5 Negative sampling (the actual modern choice)#
Insight: the full softmax does NOT need to be computed — just push up the positive example and push down a handful of negative samples.
Objective:
J = log σ(u_O^T v_I) + Σ_{w_N ∼ noise} log σ(-u_{w_N}^T v_I)
First term: the positive (the actual context word). σ = sigmoid.
Second sum: K negative samples (randomly drawn noise words), pushed down via σ.
K = 5-20 is typical. 692K vocab vs K = 10 → a 69,200x speedup.
5.6 Noise distribution#
How are the negatives chosen? Mikolov 2013b: the unigram distribution raised to the 0.75 power:
P_noise(w) ∝ count(w)^0.75
Why the 0.75 power? An empirical sweet spot: it balances the very frequent words ('the', 'a') against the very rare ones.
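A small sketch of what the exponent does, using invented counts: relative to the raw unigram distribution, frequent words lose a bit of probability mass and rare words gain several-fold.
```python
from collections import Counter

counts = Counter({"the": 1_000_000, "coffee": 10_000, "istanbul": 500})  # made-up counts
total = sum(counts.values())
pow_total = sum(c ** 0.75 for c in counts.values())

for w, c in counts.items():
    raw = c / total                  # plain unigram probability
    noise = c ** 0.75 / pow_total    # P_noise(w) ∝ count(w)^0.75
    print(f"{w:10s} raw={raw:.4f}  noise={noise:.4f}")
# 'the' drops from ~0.99 to ~0.97, while 'istanbul' gains roughly 6x.
```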
5.7 Pseudocode#
```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)   # numerically stable log(sigmoid(x))

def neg_sampling_loss(center, context, negatives, W_in, W_out):
    v_c = W_in[center]       # center word input vector    [d]
    u_o = W_out[context]     # context word output vector  [d]
    u_n = W_out[negatives]   # K negative output vectors   [K, d]
    pos = -log_sigmoid(u_o @ v_c)            # pull the true pair together
    neg = -log_sigmoid(-(u_n @ v_c)).sum()   # push the noise pairs apart
    return pos + neg
```
FLOPs per step: d for the positive + K × d for the negatives = O(K × d) — independent of V.
10. Pure Python Implementation — Skip-Gram in 100 Lines#
```python
# Skip-gram with negative sampling — pure Python, ~100 lines
import numpy as np
import random
from collections import Counter

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))

def build_vocab(corpus, min_count=5):
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence)
    vocab = {w: i for i, (w, c) in enumerate(counts.items()) if c >= min_count}
    # Noise distribution: unigram counts raised to the 0.75 power
    word_freqs = np.array([counts[w] ** 0.75 for w in vocab])
    word_freqs /= word_freqs.sum()
    return vocab, word_freqs

def get_negatives(K, word_freqs):
    return np.random.choice(len(word_freqs), size=K, p=word_freqs)

def train_skipgram(corpus, d=100, window=5, K=10, epochs=5, lr=0.025, min_count=5):
    vocab, word_freqs = build_vocab(corpus, min_count)
    V = len(vocab)
    # Init embeddings
    W_in = np.random.uniform(-0.5 / d, 0.5 / d, (V, d))
    W_out = np.zeros((V, d))
    for epoch in range(epochs):
        random.shuffle(corpus)
        total_loss = 0
        steps = 0
        for sentence in corpus:
            tokens = [vocab[w] for w in sentence if w in vocab]
            for i, center in enumerate(tokens):
                # Dynamic window: sample an effective window size in [1, window]
                w = random.randint(1, window)
                ctx_indices = (list(range(max(0, i - w), i))
                               + list(range(i + 1, min(len(tokens), i + w + 1))))
                for j in ctx_indices:
                    context = tokens[j]
                    # Positive pair: gradient of -log σ(u_o · v_c)
                    v_c = W_in[center]
                    u_o = W_out[context]
                    score = sigmoid(u_o @ v_c)
                    grad = score - 1
                    W_in[center] -= lr * grad * u_o
                    W_out[context] -= lr * grad * v_c
                    # Negative pairs: gradient of -log σ(-u_n · v_c), label = 0
                    negs = get_negatives(K, word_freqs)
                    for neg in negs:
                        u_n = W_out[neg]
                        score = sigmoid(u_n @ v_c)
                        grad = score - 0
                        W_in[center] -= lr * grad * u_n
                        W_out[neg] -= lr * grad * v_c
                    total_loss += -np.log(sigmoid(u_o @ W_in[center]) + 1e-9)
                    steps += 1
        print(f"Epoch {epoch + 1}: avg loss = {total_loss / max(steps, 1):.4f}")
    return W_in, vocab

# Usage on a Turkish corpus
turkish_corpus = [
    ["bugün", "İstanbul", "çok", "güzel"],
    ["İstanbul", "boğazı", "manzarası", "muhteşem"],
    # ... (more sentences for a real corpus)
]
# min_count=1 only because this toy corpus is tiny; keep 5 for a real corpus
W_in, vocab = train_skipgram(turkish_corpus, d=100, epochs=10, min_count=1)
print(f"Vocab size: {len(vocab)}, embedding dim: {W_in.shape[1]}")

# Cosine similarity
def cosine_sim(v1, v2):
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

if "İstanbul" in vocab and "boğazı" in vocab:
    sim = cosine_sim(W_in[vocab["İstanbul"]], W_in[vocab["boğazı"]])
    print(f"Sim(İstanbul, boğazı) = {sim:.3f}")
```
Word2Vec Skip-Gram + Negative Sampling — pure Python implementation
11. Turkish word2vec Demo with Gensim#
11.1 Setup#
pip install gensim
11.2 Corpus preparation#
```python
import re
import datasets
from gensim.models import Word2Vec

def tokenize_tr(text):
    text = text.lower()
    return re.findall(r"\b[\wçğıöşüâî]+\b", text, re.UNICODE)

wiki_tr = datasets.load_dataset("wikipedia", "20240401.tr", split="train[:10000]")

sentences = []
for row in wiki_tr:
    for paragraph in row["text"].split("\n"):
        if len(paragraph) > 50:
            sentences.append(tokenize_tr(paragraph))

print(f"Total sentences: {len(sentences):,}")
```
11.3 Training#
```python
model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # d_model
    window=5,          # context window
    min_count=5,       # min word frequency
    workers=8,         # CPU thread count
    sg=1,              # 1 = skip-gram, 0 = CBOW
    hs=0,              # 0 = negative sampling, 1 = hierarchical softmax
    negative=10,       # K negative samples
    sample=1e-4,       # subsampling threshold
    epochs=10,         # iterations over corpus
)
model.save("word2vec-tr-100d.model")
print(f"Vocab size: {len(model.wv):,}")
```
Training time: roughly 5 minutes for the 10K-article Wikipedia sample above.
11.4 Usage#
```python
model = Word2Vec.load("word2vec-tr-100d.model")

# Most similar words
print(model.wv.most_similar("istanbul"))
# [('boğazı', 0.78), ('şehir', 0.74), ('ankara', 0.69), ...]
print(model.wv.most_similar("kahve"))
# [('çay', 0.68), ('süt', 0.56), ('demlemek', 0.54), ...]

# Analogy
result = model.wv.most_similar(
    positive=["istanbul", "almanya"],
    negative=["türkiye"],
)
print(result[0])  # ('berlin', 0.74) — it works!
```
11.5 Vector arithmetic — magic#
```python
# Vec(kral) - Vec(erkek) + Vec(kadın) ≈ Vec(kraliçe)
king_vec = model.wv["kral"]
man_vec = model.wv["erkek"]
woman_vec = model.wv["kadın"]
queen_vec = king_vec - man_vec + woman_vec

sim = model.wv.most_similar(positive=[queen_vec])
print(sim[0])  # ('kraliçe', 0.71)
```
11.6 Typical quality for Turkish#
With a ~10M-token Wikipedia corpus, 100d vectors, 10 epochs:
- Analogy accuracy: 30-40% (limited corpus)
- Most-similar quality: subjectively good
- Geographical analogies work
- Profession-gender analogies work
✅ Lesson 7.2 Summary — The Word2Vec Algorithm
Mikolov 2013: a shallow neural network + clever engineering = the embedding revolution. Skip-Gram (center → context) vs CBOW (context → center). The softmax bottleneck (O(V)) was solved with negative sampling (O(K)); K = 5-20 is typical. Subsampling dampens frequent words; the dynamic window varies the context size. A pure Python implementation fits in 100 lines. Training Turkish word2vec-tr with Gensim yields meaningful embeddings in about 5 minutes. Analogical reasoning via vector arithmetic: vec(istanbul) - vec(türkiye) + vec(almanya) ≈ vec(berlin). In Lesson 7.3 we move on to GloVe + FastText.
Next Lesson: GloVe + FastText#
Lesson 7.3 covers global co-occurrence-based GloVe and subword-aware FastText, with a practical demo of FastText's advantage for a morphologically rich language like Turkish.
Frequently Asked Questions
Are Word2Vec embeddings still used in modern LLMs? Directly, no. Modern LLMs (Llama-3, GPT-4) are pre-trained end-to-end — the embedding layer is learned together with the transformer. But: (1) Word2Vec embeddings are sometimes used for initialization; (2) lightweight semantic search with them is still practical; (3) they remain a historical and educational anchor.