Word2Vec Line by Line: Anatomy of Mikolov 2013's Skip-Gram + CBOW + Negative Sampling
Line-by-line anatomy of the Mikolov 2013 paper: Skip-Gram vs CBOW architecture differences, the softmax computational bottleneck, hierarchical softmax (Huffman tree), negative sampling (Mikolov 2013b), subsampling, dynamic window. A pure Python implementation in 100 lines. A Gensim Turkish word2vec training demo. Comparison with modern LLM embeddings.
Şükrü Yusuf KAYA
70 min read
Advanced 📜 Mikolov 2013 — the paper that kicked off the evolution of embeddings
In January 2013, Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean (Google) published the paper 'Efficient Estimation of Word Representations in Vector Space'. Those dozen-or-so pages changed NLP forever: a shallow neural network + the distributional hypothesis + engineering refinements (negative sampling, hierarchical softmax, subsampling). A follow-up paper appeared later in 2013, the code was released (the `word2vec` C library), and then came the gensim Python port. Within 2.5 years it was in use in every NLP lab in the world. BERT, GPT, Llama — they all owe a debt to this paper. Seventy minutes from now you will have understood the paper line by line, written your own 100-line Python implementation, and trained word2vec-tr on a Turkish corpus. This is the algorithmic cornerstone of embeddings.
Lesson Map (13 Sections)#
- Historical context — neural language models before 2013
- Mikolov 2013a — Skip-Gram vs CBOW intro
- Skip-Gram deep math — objective, softmax
- CBOW deep math — context window averaging
- Computational bottleneck — the softmax-over-V problem
- Hierarchical softmax — the Huffman tree solution
- Negative sampling (Mikolov 2013b) — the modern choice
- Subsampling — dampening frequent words
- Dynamic window — variable context size
- Pure Python implementation — skip-gram in 100 lines
- Gensim Turkish demo — practical word2vec-tr training
- Vector arithmetic — the magic of analogy (king - man + woman = queen)
- Comparison with modern LLM embeddings
1. Historical Context — Before 2013#
1.1 Neural LMs before Mikolov#
- Bengio 2003: 'A Neural Probabilistic Language Model' — first NN-based word embedding
- Collobert & Weston 2008: NLP from Scratch — task-agnostic embedding
- Mnih & Hinton 2009: Hierarchical Log-Bilinear Model
- Turian, Ratinov, Bengio 2010: 'Word representations: A simple and general method for semi-supervised learning'
These papers introduced embeddings, but not at practical scale: corpora of ~100K words, small vocabularies.
1.2 The Mikolov 2013 leap#
- Corpus: 1.6 Billion words (Google News)
- Vocab: 692K
- Training time: 1 day (16-core CPU)
- Quality: analogical reasoning via vector arithmetic
The previous state of the art: roughly 4 hours for a 100K-word corpus with a ~30K vocabulary.
Mikolov delivered roughly 1000x the scale and a step change in quality.
1.3 Why this became possible#
- Simple architecture: no hidden layer (just projection + softmax)
- Distributed training: can be optimized in parallel
- Hierarchical softmax: O(V) → O(log V)
- Negative sampling: stochastic loss approximation
- Subsampling: skip frequent words with minimal information loss
Each one is a modest engineering trick, but together they amounted to a paradigm shift.
2. Skip-Gram vs CBOW — Two Directions#
2.1 Skip-Gram: center → context#
Given w_t, predict w_{t-k}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k}
Example (k=2):
"Bugün İstanbul Boğazı çok güzel." w_t = "Boğazı" Predict: "Bugün", "İstanbul", "çok", "güzel"
Objective:
J = -1/T Σ_{t=1}^T Σ_{j∈[-k,k], j≠0} log P(w_{t+j} | w_t)
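To make the double sum in J concrete, here is a minimal sketch (the function name `skipgram_pairs` is illustrative, not from the paper) that enumerates the (center, context) pairs the objective iterates over for one tokenized sentence:
```python
def skipgram_pairs(tokens, k=2):
    """Enumerate the (w_t, w_{t+j}) training pairs for a window of size k."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-k, k + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center itself and out-of-sentence positions
            pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs(["bugün", "istanbul", "boğazı", "çok", "güzel"], k=2))
# [('bugün', 'istanbul'), ('bugün', 'boğazı'), ('istanbul', 'bugün'), ...]
```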
2.2 CBOW: context → center#
Given w_{t-k}, ..., w_{t+k}, predict w_t
Example:
"Bugün İstanbul ___ çok güzel." Context: ["Bugün", "İstanbul", "çok", "güzel"] Predict: "Boğazı"
Objective:
J = -1/T Σ_{t=1}^T log P(w_t | w_{t-k}, ..., w_{t+k})
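The corresponding forward computation, sketched below under the assumption that W_in and W_out are both stored as (V × d) matrices (as in the pure Python implementation later in this lesson); `cbow_probs` is an illustrative name:
```python
import numpy as np

def cbow_probs(context_ids, W_in, W_out):
    """P(w | context): average the context input vectors, score against every output vector."""
    h = W_in[context_ids].mean(axis=0)    # average of context vectors      [d]
    logits = W_out @ h                    # one score per vocabulary word   [V]
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Per-position loss for the true center word w_t:
#   loss = -np.log(cbow_probs(context_ids, W_in, W_out)[center_id])
```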
2.3 Which one is better#
From the Mikolov paper:
- Skip-Gram: better for rare words (every token becomes its own prediction target)
- CBOW: faster (one context → one prediction per training step), good enough for frequent words
Modern preference: Skip-Gram (rare-word quality matters).
2.4 Pictorial overview#
```
SKIP-GRAM                          CBOW

+--------+                         +----------+
|  w_t   | ---> Predict            | w_{t-2}  | ---+
+--------+    +---> w_{t-2}        +----------+    |
              +---> w_{t-1}        | w_{t-1}  | ---+
              +---> w_{t+1}        +----------+    +---> Predict w_t
              +---> w_{t+2}        | w_{t+1}  | ---+
                                   +----------+    |
                                   | w_{t+2}  | ---+
                                   +----------+
```
2.5 Architecture (Skip-Gram)#
```
Input:   one-hot(w_t)          [V]
           ↓  multiply by W_in (V × d)
Hidden:  vector(w_t)           [d]
           ↓  multiply by W_out (d × V)
Output:  logits over vocab     [V]
           ↓  softmax
Probs:   P(* | w_t)            [V]
```
Two matrices: W_in (input embeddings) and W_out (output embeddings). By convention, W_in is what gets used as the 'word vectors'.
2.6 Why there is 'NO hidden layer'#
In a classical NN, a hidden layer implies a nonlinear activation. Word2vec has only a linear projection — the 'hidden layer' is the identity (no tanh/relu). This keeps the model shallow and makes training fast.
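A tiny sketch of that point, with made-up dimensions: multiplying a one-hot vector by W_in is just a row lookup, and nothing nonlinear happens between the two matrix multiplications (here W_out is stored as (V × d), matching the implementation below).
```python
import numpy as np

V, d = 10, 4                              # toy sizes, for illustration only
W_in = np.random.randn(V, d)              # input embeddings
W_out = np.random.randn(V, d)             # output embeddings

w_t = 3                                   # vocab index of the center word
one_hot = np.zeros(V)
one_hot[w_t] = 1.0

h = one_hot @ W_in                        # "projection layer" = plain row lookup
assert np.allclose(h, W_in[w_t])          # no activation function anywhere

logits = W_out @ h                        # one score per vocabulary word [V]
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax → P(* | w_t)
```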
5. Computational Bottleneck — Softmax over V#
5.1 Softmax formula#
P(w_O | w_I) = exp(u_O^T v_I) / Σ_{w∈V} exp(u_w^T v_I)
The denominator is a sum over all V words. With V = 692K, that is 692K terms for every single training step.
5.2 FLOP count#
Per training step:
- Forward: V × d operations (output projection)
- Backward: V × d gradient computations
- Total: O(V × d) per token
692K × 300 ≈ 207M FLOP per token. A 1.6B-token corpus → ~3.3 × 10^17 FLOP in total — on the order of years on a CPU.
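A quick back-of-the-envelope check of those numbers (forward projection only, mirroring the estimate above):
```python
V, d, tokens = 692_000, 300, 1_600_000_000

flop_per_token = V * d                    # ≈ 2.08e8 (the "207M FLOP per token")
total_flop = flop_per_token * tokens      # ≈ 3.3e17
print(f"{flop_per_token:.2e} FLOP/token, {total_flop:.2e} FLOP total")
```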
5.3 Two solutions#
- Hierarchical softmax — replace the softmax with a tree-based computation, O(log V)
- Negative sampling — turn the softmax into binary classification, O(K) (K = 5-20)
The second is the modern choice.
5.4 Hierarchical softmax (briefly)#
Organize the vocabulary as a Huffman tree, with every word as a leaf. P(w | w_I) = the product of probabilities along the path to that leaf.
Tree depth is log(V); each node is a binary decision (left/right). O(log V) operations per softmax instead of O(V).
A 692K vocab → ~20 binary decisions instead of 692K terms: a ~35,000x speedup.
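A minimal sketch of the path-product idea (not a full Huffman-tree builder; the function and variable names are illustrative): each inner node n on the path to word w has an output vector u_n and a left/right direction d_n ∈ {+1, -1}, and P(w | w_I) is a product of one sigmoid per node.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_prob(v_in, path_vectors, path_directions):
    """Probability of reaching one leaf: a product of ~log2(V) binary decisions."""
    p = 1.0
    for u_n, d_n in zip(path_vectors, path_directions):
        p *= sigmoid(d_n * (u_n @ v_in))   # one left/right decision per inner node
    return p

# Toy example: a depth-3 path with random vectors.
d = 8
v_in = np.random.randn(d)
path = [np.random.randn(d) for _ in range(3)]
dirs = [+1, -1, +1]
print(hs_prob(v_in, path, dirs))
```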
5.5 Negative sampling (the actual modern choice)#
Insight: the full softmax does NOT need to be computed — just push up the positive example and push down a handful of negative samples.
Objective:
J = log σ(u_O^T v_I) + Σ_{w_N ∼ noise} log σ(-u_{w_N}^T v_I)
First term: the positive (the actual context word). σ = sigmoid.
Second sum: K negative samples (randomly drawn noise words), pushed down via σ.
K = 5-20 is typical. 692K vocab vs K = 10 → a 69,200x speedup.
5.6 Noise distribution#
How are the negatives chosen? Mikolov 2013b: the unigram distribution raised to the 0.75 power:
P_noise(w) ∝ count(w)^0.75
Why the 0.75 power? An empirical sweet spot: it balances the very frequent words ('the', 'a') against the very rare ones.
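A small sketch of what the exponent does, using invented counts: relative to the raw unigram distribution, frequent words lose a bit of probability mass and rare words gain several-fold.
```python
from collections import Counter

counts = Counter({"the": 1_000_000, "coffee": 10_000, "istanbul": 500})  # made-up counts
total = sum(counts.values())
pow_total = sum(c ** 0.75 for c in counts.values())

for w, c in counts.items():
    raw = c / total                  # plain unigram probability
    noise = c ** 0.75 / pow_total    # P_noise(w) ∝ count(w)^0.75
    print(f"{w:10s} raw={raw:.4f}  noise={noise:.4f}")
# 'the' drops from ~0.99 to ~0.97, while 'istanbul' gains roughly 6x.
```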
5.7 Pseudocode#
```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)   # numerically stable log(sigmoid(x))

def neg_sampling_loss(center, context, negatives, W_in, W_out):
    v_c = W_in[center]       # center word input vector    [d]
    u_o = W_out[context]     # context word output vector  [d]
    u_n = W_out[negatives]   # K negative output vectors   [K, d]
    pos = -log_sigmoid(u_o @ v_c)            # pull the true pair together
    neg = -log_sigmoid(-(u_n @ v_c)).sum()   # push the noise pairs apart
    return pos + neg
```
FLOPs per step: d for the positive + K × d for the negatives = O(K × d) — independent of V.
10. Pure Python Implementation — Skip-Gram in 100 Lines#
```python
# Skip-gram with negative sampling — pure Python, ~100 lines
import numpy as np
import random
from collections import Counter

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))

def build_vocab(corpus, min_count=5):
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence)
    vocab = {w: i for i, (w, c) in enumerate(counts.items()) if c >= min_count}
    # Noise distribution: unigram counts raised to the 0.75 power
    word_freqs = np.array([counts[w] ** 0.75 for w in vocab])
    word_freqs /= word_freqs.sum()
    return vocab, word_freqs

def get_negatives(K, word_freqs):
    return np.random.choice(len(word_freqs), size=K, p=word_freqs)

def train_skipgram(corpus, d=100, window=5, K=10, epochs=5, lr=0.025, min_count=5):
    vocab, word_freqs = build_vocab(corpus, min_count)
    V = len(vocab)
    # Init embeddings
    W_in = np.random.uniform(-0.5 / d, 0.5 / d, (V, d))
    W_out = np.zeros((V, d))
    for epoch in range(epochs):
        random.shuffle(corpus)
        total_loss = 0
        steps = 0
        for sentence in corpus:
            tokens = [vocab[w] for w in sentence if w in vocab]
            for i, center in enumerate(tokens):
                # Dynamic window: sample an effective window size in [1, window]
                w = random.randint(1, window)
                ctx_indices = (list(range(max(0, i - w), i))
                               + list(range(i + 1, min(len(tokens), i + w + 1))))
                for j in ctx_indices:
                    context = tokens[j]
                    # Positive pair: gradient of -log σ(u_o · v_c)
                    v_c = W_in[center]
                    u_o = W_out[context]
                    score = sigmoid(u_o @ v_c)
                    grad = score - 1
                    W_in[center] -= lr * grad * u_o
                    W_out[context] -= lr * grad * v_c
                    # Negative pairs: gradient of -log σ(-u_n · v_c), label = 0
                    negs = get_negatives(K, word_freqs)
                    for neg in negs:
                        u_n = W_out[neg]
                        score = sigmoid(u_n @ v_c)
                        grad = score - 0
                        W_in[center] -= lr * grad * u_n
                        W_out[neg] -= lr * grad * v_c
                    total_loss += -np.log(sigmoid(u_o @ W_in[center]) + 1e-9)
                    steps += 1
        print(f"Epoch {epoch + 1}: avg loss = {total_loss / max(steps, 1):.4f}")
    return W_in, vocab

# Usage on a Turkish corpus
turkish_corpus = [
    ["bugün", "İstanbul", "çok", "güzel"],
    ["İstanbul", "boğazı", "manzarası", "muhteşem"],
    # ... (more sentences for a real corpus)
]
# min_count=1 only because this toy corpus is tiny; keep 5 for a real corpus
W_in, vocab = train_skipgram(turkish_corpus, d=100, epochs=10, min_count=1)
print(f"Vocab size: {len(vocab)}, embedding dim: {W_in.shape[1]}")

# Cosine similarity
def cosine_sim(v1, v2):
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

if "İstanbul" in vocab and "boğazı" in vocab:
    sim = cosine_sim(W_in[vocab["İstanbul"]], W_in[vocab["boğazı"]])
    print(f"Sim(İstanbul, boğazı) = {sim:.3f}")
```
Word2Vec Skip-Gram + Negative Sampling — pure Python implementation
11. Turkish word2vec Demo with Gensim#
11.1 Setup#
pip install gensim
11.2 Corpus preparation#
```python
import re
import datasets
from gensim.models import Word2Vec

def tokenize_tr(text):
    text = text.lower()
    return re.findall(r"\b[\wçğıöşüâî]+\b", text, re.UNICODE)

wiki_tr = datasets.load_dataset("wikipedia", "20240401.tr", split="train[:10000]")

sentences = []
for row in wiki_tr:
    for paragraph in row["text"].split("\n"):
        if len(paragraph) > 50:
            sentences.append(tokenize_tr(paragraph))

print(f"Total sentences: {len(sentences):,}")
```
11.3 Training#
```python
model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # d_model
    window=5,          # context window
    min_count=5,       # min word frequency
    workers=8,         # CPU thread count
    sg=1,              # 1 = skip-gram, 0 = CBOW
    hs=0,              # 0 = negative sampling, 1 = hierarchical softmax
    negative=10,       # K negative samples
    sample=1e-4,       # subsampling threshold
    epochs=10,         # iterations over corpus
)
model.save("word2vec-tr-100d.model")
print(f"Vocab size: {len(model.wv):,}")
```
Training time: roughly 5 minutes for the 10K-article Wikipedia sample above.
11.4 Usage#
```python
model = Word2Vec.load("word2vec-tr-100d.model")

# Most similar words
print(model.wv.most_similar("istanbul"))
# [('boğazı', 0.78), ('şehir', 0.74), ('ankara', 0.69), ...]
print(model.wv.most_similar("kahve"))
# [('çay', 0.68), ('süt', 0.56), ('demlemek', 0.54), ...]

# Analogy
result = model.wv.most_similar(
    positive=["istanbul", "almanya"],
    negative=["türkiye"],
)
print(result[0])  # ('berlin', 0.74) — it works!
```
11.5 Vector arithmetic — magic#
```python
# Vec(kral) - Vec(erkek) + Vec(kadın) ≈ Vec(kraliçe)
king_vec = model.wv["kral"]
man_vec = model.wv["erkek"]
woman_vec = model.wv["kadın"]
queen_vec = king_vec - man_vec + woman_vec

sim = model.wv.most_similar(positive=[queen_vec])
print(sim[0])  # ('kraliçe', 0.71)
```
11.6 Typical quality for Turkish#
With a ~10M-token Wikipedia corpus, 100d vectors, 10 epochs:
- Analogy accuracy: 30-40% (limited corpus)
- Most-similar quality: subjectively good
- Geographical analogies work
- Profession-gender analogies work
✅ Lesson 7.2 Summary — The Word2Vec Algorithm
Mikolov 2013: a shallow neural network + clever engineering = the embedding revolution. Skip-Gram (center → context) vs CBOW (context → center). The softmax bottleneck (O(V)) was solved with negative sampling (O(K)); K = 5-20 is typical. Subsampling dampens frequent words; the dynamic window varies the context size. A pure Python implementation fits in 100 lines. Training Turkish word2vec-tr with Gensim yields meaningful embeddings in about 5 minutes. Analogical reasoning via vector arithmetic: vec(istanbul) - vec(türkiye) + vec(almanya) ≈ vec(berlin). In Lesson 7.3 we move on to GloVe + FastText.
Next Lesson: GloVe + FastText#
Lesson 7.3 covers global co-occurrence-based GloVe and subword-aware FastText, with a practical demo of FastText's advantage for a morphologically rich language like Turkish.
Frequently Asked Questions
Are Word2Vec embeddings still used in modern LLMs? Directly, no. Modern LLMs (Llama-3, GPT-4) are pre-trained end-to-end — the embedding layer is learned together with the transformer. But: (1) Word2Vec embeddings are sometimes used for initialization; (2) lightweight semantic search with them is still practical; (3) they remain a historical and educational anchor.