Why does FastText perform better in Turkish?

Turkish is agglutinative: a word is a sequence of morphemes ('anlaşamadık' = anlaş + ama + dık). Subword n-grams implicitly capture these morphemes, because frequent morphemes recur as character n-grams across many word forms. Word2Vec learns a separate vector for each morphological variant, which is data-inefficient. FastText composes shared subword vectors, which yields better rare-word quality, vectors for OOV words, and better generalization.

On what corpus is the pretrained Turkish FastText model trained?

Facebook's official Turkish vectors are trained on Common Crawl plus Wikipedia. The 157-language Common Crawl set is described in Grave et al. (2018); earlier Wikipedia-only vectors date from 2017. The file cc.tr.300.bin (~6 GB) holds 300d vectors with a 2M+ vocabulary, including n-grams.

GloVe co-occurrence matrix doesn't fit in RAM, what should I do?

Use sparse storage (CSR/CSC). Compute co-occurrence counts in chunks over an mmap'd corpus and merge the partial results. Use Hadoop/Spark for very large corpora. Alternatively, rely on an implementation feature: the official GloVe code writes co-occurrence records to a disk-backed file (binary records) and streams them during training.

Is FastText init used in modern LLM training?

Rarely. Modern LLMs are pre-trained end to end: the embedding layer starts from random initialization and is learned together with the transformer. Production LLMs are not known to use Word2Vec/FastText initialization. Educational settings and specialized small models still use static embedding initialization.

GloVe + FastText: Global Co-Occurrence Matrix + Subword N-Gram Extension

GloVe (Pennington 2014): the global co-occurrence matrix approach vs Word2Vec's local window, with the mathematical formulation, the weighted least squares objective, and the interpretation of X_ij. FastText (Bojanowski 2017): subword n-gram embeddings, 'merhaba' = 'mer' + 'erh' + ..., a solution to the OOV problem that is well suited to morphologically rich languages such as Turkish. Performance comparison, and which method fits which scenario.

Şükrü Yusuf KAYA

65 min read

5/13/2026

Advanced

🌐 Two major advances after Word2Vec: global + subword

Word2Vec works with local context windows. The Stanford NLP group asked: what if we used global co-occurrence statistics directly? The answer: GloVe (Global Vectors, Pennington 2014). Facebook AI asked a different question: what if we used subword information and composed 'merhaba' from n-grams such as 'mer', 'erh', 'rha'? The answer: FastText (Bojanowski 2017). On morphologically rich languages such as Turkish, FastText generally beats Word2Vec. After these 65 minutes you will have a deep grasp of GloVe's weighted least squares objective, FastText's subword compositionality, and its practical advantages for Turkish.

Lesson Map (11 Sections)

Word2Vec's shortcoming: the local vs global statistics debate
Co-occurrence matrix: interpreting X_ij
GloVe objective: weighted least squares
GloVe vs Word2Vec: empirical comparison
FastText motivation: OOV and morphology
Subword n-gram model: character n-gram embeddings
FastText training: an extension of Skip-Gram
FastText for Turkish: a morphological language test
Practical usage: gensim FastText demo
Modern era: why old static embeddings still matter
Selection guide: when to use Word2Vec, GloVe, or FastText

1. Word2Vec's Shortcoming: Local vs Global

1.1 Word2Vec local window

For each sentence in corpus:
    For each token w_t:
        Look at window [w_{t-k}, w_{t+k}]
        Update embedding based on local context

Local: the window spans 5-10 tokens. Corpus-level co-occurrence statistics are never computed explicitly; the model only absorbs them implicitly through many local updates.
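
The loop above can be made concrete with a short sketch. The function name `window_pairs` and the toy sentence are illustrative assumptions, not part of any library; the function only enumerates the (center, context) pairs that a symmetric window of size k would feed to the model.

```python
def window_pairs(tokens, k=2):
    """Return (center, context) pairs for a symmetric window of k tokens."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - k)
        hi = min(len(tokens), t + k + 1)
        for c in range(lo, hi):
            if c != t:                      # skip the center token itself
                pairs.append((center, tokens[c]))
    return pairs

# Hypothetical toy sentence, already tokenized
sentence = ["ankara", "türkiye'nin", "başkentidir"]
print(window_pairs(sentence, k=1))
```

Each pair triggers one embedding update, so words that never fall inside the same window never influence each other directly.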

1.2 Disadvantages of the local view

Distant relationships are lost: in a long sentence where 'Türkiye' and 'Ankara' sit farther apart than the window size, the pair never produces an update
Statistical efficiency is low: the same co-occurrence information is re-processed every time the pair shows up in a sentence
Long-tail patterns are missed because they are seen so rarely

1.3 GloVe's claim

Compute the full co-occurrence statistics at corpus level first, then optimize against them. This is more statistically efficient.

1.4 X_ij: co-occurrence count

X_{ij} = the number of times words i and j co-occur (appear together within a window).

Example (Turkish corpus):

X["İstanbul", "Boğazı"] = 12,543      # very often together
X["İstanbul", "Ankara"]  = 3,210       # sometimes
X["İstanbul", "kuantum"] = 4            # rarely

Matrix shape: V × V. A 692K vocabulary gives 692K × 692K ≈ 479B entries. Most of them are zero (sparse).

1.5 Sparse matrix

The real X matrix is sparse: most word pairs never co-occur. In practice it is stored in CSR (Compressed Sparse Row) format.
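
As a minimal sketch of how such counts could be collected, the snippet below stores only the nonzero pairs in a dictionary. The function name `cooccurrence_counts` and the toy corpus are assumptions made for illustration; a real pipeline would map words to integer ids and convert the result to a CSR matrix.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, k=5):
    """Count how often word j appears within k tokens of word i.

    Only observed pairs are stored, so memory grows with the number of
    nonzero entries rather than with V * V.
    """
    X = defaultdict(int)
    for tokens in sentences:
        for t, w_i in enumerate(tokens):
            lo = max(0, t - k)
            hi = min(len(tokens), t + k + 1)
            for c in range(lo, hi):
                if c != t:
                    X[(w_i, tokens[c])] += 1
    return X

# Hypothetical toy corpus (tokenized sentences)
corpus = [
    ["istanbul", "boğazı", "çok", "güzel"],
    ["istanbul", "boğazı", "köprüsü"],
]
X = cooccurrence_counts(corpus, k=1)
print(X[("istanbul", "boğazı")])
```

This sketch uses plain counts, matching the definition of X_ij given here. The GloVe paper additionally weights each co-occurrence by the inverse of the distance between the two words, which is a small change to the increment.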

3. GloVe Objective: Weighted Least Squares

3.1 GloVe insight

Vector dot product ↔ logarithm of the co-occurrence count:

u_i^T v_j + b_i + b_j ≈ log X_{ij}

u_i = input word vector, v_j = context word vector (GloVe's counterparts of Word2Vec's W_in and W_out). b_i is the bias of word i and b_j is the bias of context word j.

3.2 Loss function

J = Σ_{i,j} f(X_{ij}) (u_i^T v_j + b_i + b_j - log X_{ij})^2

Key components:

Squared error: a weighted least squares regression
f(X_{ij}): weighting function that down-weights rare, noisy pairs and caps the influence of very frequent ones
Bias terms (b_i, b_j): one scalar bias per word vector and per context vector

3.3 Weighting function f

f(x) = (x / x_max)^α      if x < x_max
        1                  if x >= x_max

Typical values: x_max = 100, α = 3/4.

Why:

x = 0: f = 0 (pairs that never co-occur contribute nothing to the loss, which also avoids log 0)
0 < x < 100: the weight grows with x, but sublinearly
x >= 100: plateau at 1 (so that very frequent pairs, such as those involving 'the' or 'a', do not dominate)
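
To see the three regimes numerically, here is a small sketch of f with the typical values quoted above. The function name `glove_weight` is an illustrative choice, not an API from the GloVe code.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x): sublinear ramp up to x_max, then 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

for count in (0, 1, 10, 100, 10_000):
    print(count, round(glove_weight(count), 4))
```

A pair seen 10 times gets a weight of roughly 0.18, while a pair seen 10,000 times is treated no differently from one seen 100 times.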

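3.4 Training procedure

import numpy as np

def train_glove(X, V, d, epochs=50, lr=0.05, x_max=100.0, alpha=0.75):
    # X is assumed to be a scipy.sparse matrix (V x V) of co-occurrence counts
    rng = np.random.default_rng(0)
    u = rng.uniform(-0.5 / d, 0.5 / d, (V, d))   # word vectors
    v = rng.uniform(-0.5 / d, 0.5 / d, (V, d))   # context vectors
    b_u = np.zeros(V)
    b_v = np.zeros(V)
    # AdaGrad accumulators; starting at 1 makes the first step use plain lr
    g_u, g_v = np.ones((V, d)), np.ones((V, d))
    g_bu, g_bv = np.ones(V), np.ones(V)

    X = X.tocoo()                                # iterate over nonzero entries only
    for epoch in range(epochs):
        for i, j, x_ij in zip(X.row, X.col, X.data):
            weight = min(1.0, (x_ij / x_max) ** alpha)
            diff = u[i] @ v[j] + b_u[i] + b_v[j] - np.log(x_ij)
            loss_grad = weight * diff
            # Compute both gradients before touching either vector
            grad_u, grad_v = loss_grad * v[j], loss_grad * u[i]
            # AdaGrad updates: per-parameter step size lr / sqrt(accumulator)
            u[i] -= lr * grad_u / np.sqrt(g_u[i])
            v[j] -= lr * grad_v / np.sqrt(g_v[j])
            b_u[i] -= lr * loss_grad / np.sqrt(g_bu[i])
            b_v[j] -= lr * loss_grad / np.sqrt(g_bv[j])
            g_u[i] += grad_u ** 2
            g_v[j] += grad_v ** 2
            g_bu[i] += loss_grad ** 2
            g_bv[j] += loss_grad ** 2
    return u + v  # final embedding: input + context vectors sum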
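
For reference, the update rules in the loop follow from differentiating a single term of J. Writing the residual as δ_ij (a symbol introduced here only for brevity), a sketch of the per-pair gradients is:

```latex
\delta_{ij} = u_i^\top v_j + b_i + b_j - \log X_{ij}

\frac{\partial J}{\partial u_i} = 2 f(X_{ij})\, \delta_{ij}\, v_j, \qquad
\frac{\partial J}{\partial v_j} = 2 f(X_{ij})\, \delta_{ij}\, u_i, \qquad
\frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial b_j} = 2 f(X_{ij})\, \delta_{ij}
```

The constant factor 2 is usually absorbed into the learning rate, which is why the code multiplies by weight * diff alone.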

3.5 Final embedding

GloVe: W = u + v (sum the two vector matrices). Word2Vec: typically only u (the input vectors).

3.6 Advantages

Statistical efficiency: the corpus is scanned once to build X, and training then iterates only over the nonzero entries
Global context is captured
Stable convergence with the AdaGrad optimizer on a fixed, precomputed objective

3.7 Disadvantages

The co-occurrence matrix is RAM-heavy (even when sparse)
Empirically, quality is similar to Word2Vec (no clear winner)
Implementation complexity is high

5. FastText Motivation: OOV and Morphology

5.1 The Word2Vec problem

Word2Vec learns one vector per whole word.

Problems:

OOV (out-of-vocabulary): a word not seen in training has no vector
Morphological variants: 'anlaşmak', 'anlaştık', 'anlaşamadık' each get a separate vector. The model does not know they are related.

5.2 An extreme Turkish example

'anlaşamadıklarımızdan' occurs only 5 times in the Wikipedia corpus (rare). Word2Vec may drop such a word for having too few examples (the min_count cutoff), or learn a poor vector for it. Morphologically, however, the word decomposes as:

anlaş + ama + dık + larımız + dan

Each morpheme is common, so composing them recovers the meaning of the word.

5.3 The FastText idea

Split the word into character n-grams:

'merhaba' = '<me', 'mer', 'erh', 'rha', 'hab', 'aba', 'ba>'

('<' and '>' are word boundary markers.)

Each n-gram gets its own vector:

v('merhaba') = v('<me') + v('mer') + v('erh') + v('rha') + v('hab') + v('aba') + v('ba>')

That is, word vector = sum of subword n-gram vectors. Only the 3-grams are shown here; in practice n ranges over several lengths and the whole word is included as well.

5.4 OOV advantage

For the unseen word 'anlaşamayanların':

v('anlaşamayanların') = v('<an') + v('anl') + v('nla') + ... + v('rın') + v('ın>')

If the n-grams were seen during training, the vector can be composed. There is no OOV failure.
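
A quick way to see why this works is to measure how many n-grams an unseen word shares with a word that was in the training data. The sketch below is illustrative only: `char_ngrams` is a hypothetical helper restricted to 3-grams, and which of the two words counts as seen or unseen is an assumption.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = "<" + word + ">"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

seen = char_ngrams("anlaşamadık")          # assumed to be in the training data
unseen = char_ngrams("anlaşamayanların")   # assumed to be OOV

shared = seen & unseen
coverage = len(shared) / len(unseen)
print(sorted(shared))
print(f"{coverage:.0%} of the unseen word's 3-grams were already trained")
```

The shared n-grams cover the stem, so the composed vector of the unseen word starts from parameters that already encode 'anlaş-'.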

5.5 Ideal for Turkish

Agglutinative morphology means subword patterns recur often. For Turkish, FastText generally gives better quality than Word2Vec.

6. Subword N-Gram Model

6.1 N-gram extraction

def get_ngrams(word, n_min=3, n_max=6):
    word = '<' + word + '>'   # boundary markers
    ngrams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            ngrams.append(word[i:i+n])
    ngrams.append(word)   # whole word (with markers) as its own unit
    return list(dict.fromkeys(ngrams))   # drop duplicates, keep extraction order

print(get_ngrams('merhaba'))
# ['<me', 'mer', 'erh', 'rha', 'hab', 'aba', 'ba>',
#  '<mer', 'merh', 'erha', 'rhab', 'haba', 'aba>',
#  '<merh', ..., '<merhaba>']   (23 entries in total)

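6.2 N-gram count for a typical word

5-character word, n in [3, 6]. With the two boundary markers the string has 5 + 2 = 7 characters, so each n yields 7 - n + 1 n-grams:

3-grams: 7 - 3 + 1 = 5
4-grams: 7 - 4 + 1 = 4
5-grams: 7 - 5 + 1 = 3
6-grams: 7 - 6 + 1 = 2
Word itself: 1
Total: 15 n-grams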
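
Counts like these can be checked mechanically. The sketch below applies the rule that a word of length L has L + 2 characters once the boundary markers are added, and each n contributes (L + 2) - n + 1 n-grams. The helper name `ngram_count` is an assumption for illustration.

```python
def ngram_count(word_len, n_min=3, n_max=6):
    """Number of n-grams for a word of word_len characters, plus the whole word.

    Assumes word_len + 2 > n_max; for shorter words the whole marked word
    coincides with one of the n-grams and would be counted twice here.
    """
    marked_len = word_len + 2                      # '<' and '>' markers
    per_n = [max(0, marked_len - n + 1) for n in range(n_min, n_max + 1)]
    return sum(per_n) + 1                          # +1 for the whole-word entry

print(ngram_count(5))
print(ngram_count(len("merhaba")))
```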

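6.3 N-gram vocab size

The number of n-gram rows is capped, typically at 2M (fastText hashes n-grams into a fixed number of buckets). The subword matrix is V_ngram × d. Memory:

V_ngram = 2M
d = 300
Params: 600M

Word2Vec with a 692K vocabulary: 692K × 300 ≈ 207M. FastText's subword matrix alone is about 3x larger. The trade-off: OOV elimination and morphology awareness in exchange for memory cost.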
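
The fixed row count comes from the hashing trick: each n-gram is hashed to a bucket index, so the table never grows past the chosen size. The sketch below uses the standard 32-bit FNV-1a hash, the family named in the FastText paper. It is not claimed to be bit-identical to the library's implementation, so the bucket ids it prints should not be expected to match a real model. The names `fnv1a_32` and `bucket_index` are illustrative.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash (standard variant)."""
    h = 2166136261
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def bucket_index(ngram: str, num_buckets: int = 2_000_000) -> int:
    """Map an n-gram to a row of the subword matrix via hashing."""
    return fnv1a_32(ngram.encode("utf-8")) % num_buckets

num_buckets, d = 2_000_000, 300
params = num_buckets * d                 # 600M parameters
size_gb = params * 4 / 1e9               # float32 storage, about 2.4 GB

print(bucket_index("<me"), bucket_index("mer"))
print(params, size_gb)
```

Different n-grams can collide in the same bucket and then share a vector; that is the price paid for a bounded table.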

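6.4 FastText training (Skip-Gram extension)

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_fasttext_loss(center_word, context_word, W_subword, W_out):
    # W_subword: mapping n-gram -> vector; W_out: mapping word -> vector
    # Center word's vector = sum of its n-grams
    center_ngrams = get_ngrams(center_word)
    v_center = sum(W_subword[ng] for ng in center_ngrams)

    # Context: standard word vector (not subword-decomposed)
    u_context = W_out[context_word]

    # Same as Word2Vec from here: probability that this is a true pair
    score = sigmoid(u_context @ v_center)
    # Positive-pair term only; negative-sample terms follow the same pattern
    # ...
    return -np.log(score)

Key: the center word is subword-decomposed, while the context word uses a whole-word vector. The two sides follow different paths during training.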

6.5 Inference (final word vector)

def word_vector(word, W_subword):
    ngrams = get_ngrams(word)
    return sum(W_subword[ng] for ng in ngrams) / len(ngrams)

If word in training: stable, high-quality vector. If OOV: still computable from n-grams (lower quality but non-zero).

6.6 Pretrained FastText for Turkish

The official Facebook FastText release provides pretrained models for 157 languages.

https://fasttext.cc/docs/en/crawl-vectors.html

Download:

wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.tr.300.bin.gz
gunzip cc.tr.300.bin.gz

File size: ~6 GB. Vocabulary: 2M+ entries, including n-gram subwords.
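
Once the file is unpacked, it can be loaded with the official `fasttext` Python package. The sketch below assumes the package is installed (`pip install fasttext`) and that cc.tr.300.bin sits in the working directory; the wrapper names `load_turkish_vectors` and `demo` are hypothetical. Loading needs several GB of RAM.

```python
def load_turkish_vectors(path="cc.tr.300.bin"):
    """Load the pretrained model from disk and return it."""
    import fasttext  # imported lazily so the sketch can be read without the package

    return fasttext.load_model(path)

def demo(model):
    # Works even if the word never appeared in the training vocabulary
    vec = model.get_word_vector("anlaşamayanların")
    print(vec.shape)
    print(model.get_nearest_neighbors("merhaba", k=5))

# Usage (assumes the file from the download step is present):
# demo(load_turkish_vectors())
```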

6.7 Gensim FastText

from gensim.models import FastText

# turkish_corpus is assumed to be an iterable of tokenized sentences (lists of str)
model = FastText(
    sentences=turkish_corpus,
    vector_size=300,
    window=5,
    min_count=5,
    workers=8,
    sg=1,           # Skip-gram
    min_n=3,        # min n-gram
    max_n=6,        # max n-gram
    epochs=10,
)
model.save('fasttext-tr.model')

8. FastText for Turkish: A Morphological Language Test

8.1 Test 1: OOV handling

# Word2Vec
from gensim.models import Word2Vec
w2v = Word2Vec.load('word2vec-tr-100d.model')
try:
    vec = w2v.wv['anlaşamayanların']   # OOV
except KeyError:
    print('OOV in word2vec')

# FastText
from gensim.models import FastText
ft = FastText.load('fasttext-tr.model')
vec = ft.wv['anlaşamayanların']        # works!
print(f'Vector shape: {vec.shape}')    # (300,)

Word2Vec raises KeyError for an OOV word. FastText still produces a vector.

8.2 Test 2: Morphological similarity

sim_w2v = w2v.wv.similarity('anlaşmak', 'anlaştık')
sim_ft = ft.wv.similarity('anlaşmak', 'anlaştık')
print(f'Word2Vec: {sim_w2v:.3f}')   # 0.55 (decent)
print(f'FastText: {sim_ft:.3f}')    # 0.87 (excellent)

Because the two forms share subwords, FastText has built-in morphological awareness.

8.3 Test 3: Rare word quality

# 'kahvecioğlu' is a surname that occurs 3 times in the corpus.
# Assumes the Word2Vec model was trained with min_count <= 3;
# otherwise the word is OOV and similarity() raises KeyError.
sim_w2v = w2v.wv.similarity('kahveci', 'kahvecioğlu')
sim_ft = ft.wv.similarity('kahveci', 'kahvecioğlu')
print(f'Word2Vec: {sim_w2v:.3f}')   # 0.42 (low, rare word)
print(f'FastText: {sim_ft:.3f}')    # 0.79 (high, subword shared)

8.4 Empirical: Turkish word similarity

WordSim353-TR benchmark (Turkish):

Word2Vec (10M token): 0.61 Spearman correlation
GloVe: 0.60
FastText: 0.71 (+10 points)

For a morphologically rich language, FastText is the clear winner.
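
For readers unfamiliar with the metric: a word-similarity benchmark compares human similarity ratings for word pairs with the model's cosine similarities, and reports the Spearman rank correlation between the two lists. The sketch below implements the no-ties formula in plain Python. The ratings and similarities are made-up numbers for illustration, not benchmark data.

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(xs, ys):
    """Spearman correlation via the rank-difference formula (no ties)."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical data: human ratings vs. model cosine similarities for 5 word pairs
human = [9.1, 7.5, 3.2, 1.0, 5.4]
model_sims = [0.82, 0.70, 0.35, 0.40, 0.51]
print(round(spearman(human, model_sims), 3))
```

Only the ordering matters: the model is rewarded for ranking pairs the way humans do, regardless of the absolute similarity values.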

10. The Modern Era: Why Old Static Embeddings Still Matter

10.1 The modern alternative: contextual embeddings

In BERT, GPT, and Llama, each token's vector is context-dependent. 'banka' (the financial institution) ≠ 'banka' ('to the bench', the dative of 'bank', as in sitting down): different contexts give different vectors.

10.2 Remaining advantages of static embeddings

Lightweight compute: the 300d FastText file is larger on disk (~6 GB) than BERT-base (~440 MB, 110M params), but a lookup needs no forward pass, so inference is much cheaper
Deterministic: the same word always maps to the same vector, so caching is easy
Lookup speed: O(1) constant time vs O(seq_len^2) attention in BERT
Mobile/edge: runnable on small devices
No large-scale pre-training required: train quickly on your own corpus

10.3 Practical usage scenarios (2026)

Semantic search (small scale): FastText + cosine sim
Text classification (TF-IDF alternative): word2vec avg pool + LinearSVC
OOV handling in toolkits: FastText fallback
Educational: linguistics + NLP intro courses
Specialized domains: legal/medical with custom corpus
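
The first two scenarios share one building block: average the word vectors of a text, then compare texts by cosine similarity. The sketch below shows that building block with tiny hand-written 3-dimensional vectors standing in for real FastText vectors. All names and numbers are hypothetical.

```python
import math

def average_vector(tokens, vectors):
    """Mean of the word vectors of the tokens that have a vector."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical toy vectors and documents
vectors = {
    "kahve": [0.9, 0.1, 0.0],
    "çay":   [0.8, 0.2, 0.1],
    "banka": [0.0, 0.9, 0.3],
    "kredi": [0.1, 0.8, 0.4],
}
docs = {"d1": ["kahve", "çay"], "d2": ["banka", "kredi"]}

def search(query_tokens):
    """Return the id of the document closest to the query."""
    q = average_vector(query_tokens, vectors)
    scored = [(cosine(q, average_vector(toks, vectors)), doc_id)
              for doc_id, toks in docs.items()]
    return max(scored)[1]

print(search(["kredi"]))
```

With a real FastText model the lookup never fails on unknown words, which is why the toy `average_vector` guard for missing tokens matters less there. For classification, the same averaged vector would be fed to a linear classifier instead of a similarity search.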

10.4 Hybrid approach

Some systems: BERT contextual + FastText static fallback for unknown tokens. Best of both.

11. Selection Guide

Scenario 1: General-purpose NLP (modern)

→ BERT/transformer embeddings (sentence-transformers, OpenAI text-embedding-3, Cohere). Forget static embeddings.

Scenario 2: Lightweight semantic search, mobile

→ FastText (optimal for Turkish; Facebook provides pretrained models for 157 languages)

Scenario 3: Educational/research

→ Word2Vec + GloVe comparison (the Mikolov 2013 and Pennington 2014 papers)

Scenario 4: Custom domain (legal/medical Turkish)

→ Train FastText on your own corpus. The subword morphology advantage applies.

Scenario 5: Multilingual transfer

→ MUSE (Facebook), fastText.cc aligned multilingual embeddings

Scenario 6: LLM pre-training input

→ Modern LLM embedding layer (end-to-end trained, Module 7.1). Static embeddings can serve as an initialization, though this is rare in practice.

Decision matrix

Need	Best Choice
Production semantic search	sentence-transformers / OpenAI
Mobile / edge	FastText (ideal for Turkish)
Educational	Word2Vec gensim demo
Turkish rare word handling	FastText (morphology)
Multilingual alignment	MUSE / fastText.cc
Modern LLM init	nn.Embedding from scratch

✅ Lesson 7.3 Summary: GloVe + FastText

GloVe (Pennington 2014): global co-occurrence matrix + weighted least squares. The empirical comparison with Word2Vec has no clear winner. FastText (Bojanowski 2017): subword character n-gram embeddings. 'merhaba' = sum of the '<me', 'mer', ..., 'ba>' vectors. It is well suited to morphologically rich languages such as Turkish: no OOV problem, high rare-word quality, and a gain of about 10 points of Spearman correlation on the similarity benchmark in section 8.4. In the modern era contextual embeddings (BERT, LLMs) dominate, but static embeddings remain practical for lightweight, educational, and mobile use. Facebook's fastText.cc provides pretrained models for 157 languages. In Lesson 7.4 we return to the modern LLM embedding layer.

Next Lesson: The Modern LLM Embedding Layer

In Lesson 7.4: end-to-end trained embeddings (Llama-3, GPT-4), embedding tying (sharing input/output weights), how positional embeddings are added, and embedding scaling (before RMSNorm).

Frequently Asked Questions

Empirical tie. GloVe has a marginal advantage in statistical efficiency (a single corpus pass to build the counts), Word2Vec in implementation simplicity. In the modern era both are educational/legacy tools; use transformer embeddings for production.
