Embedding Geometry: Cosine Similarity, Euclidean Distance, Isotropy and BERTology Findings
Topology of the embedding vector space: cosine similarity vs Euclidean distance vs dot product (which to use when, and the mathematical relationships between them), isotropy (vectors balanced across directions, Gao 2019 'representation degeneration'), the anisotropy problem in BERT/GPT embeddings, mitigation (whitening, normalization). BERTology findings: which information lives in which layer (Rogers 2020). Practical analysis for Turkish.
Şükrü Yusuf KAYA
70 min read
Advanced 📐 The topology of the embedding space — the geometry of the universe your vectors live in
Embedding vectors live in a 4096-dimensional space. What is the geometry of this space? Which distance metrics are meaningful? Is it isotropic (vectors balanced across directions) or anisotropic (vectors clustered in a narrow cone)? In 2019 Gao et al. published a striking finding: pre-trained transformer embeddings are highly anisotropic — nearly all vectors bunch up in a narrow cone. This distorts similarity computations. Mitigation: whitening and normalization. BERTology findings (Rogers 2020): different layers carry different linguistic information — the embedding layer captures 'surface form', the final layers capture 'semantics'. Seventy minutes from now you will have a deep grasp of both the mathematics and the empirical findings behind embedding geometry.
Lesson Map (12 Sections)#
- Distance metrics — cosine vs Euclidean vs dot product
- Cosine similarity math — geometric intuition
- Euclidean distance — when to prefer it
- Dot product — carries magnitude information
- Which metric when — a practical decision matrix
- Isotropy concept — uniform distribution across directions
- Anisotropy problem — Gao 2019, BERT embedding cone
- Anisotropy quantification — measurement protocols
- Whitening — anisotropy mitigation
- BERTology — which information lives in which layer (Rogers 2020)
- Turkish embedding analysis — a hands-on demo
- Edge cases — high-dim curse, low-dim collapse
1. Distance Metrics — The Classic Three#
1.1 Cosine similarity#
cos(u, v) = (u · v) / (||u|| × ||v||)
Range: [-1, 1]. 1 = same direction, 0 = orthogonal, -1 = opposite.
Independent of magnitude; direction only.
1.2 Euclidean distance (L2)#
d(u, v) = ||u - v|| = sqrt(Σ (u_i - v_i)^2)
Range: [0, ∞). 0 = identical. Magnitude matters.
1.3 Dot product#
u · v = Σ u_i × v_i = ||u|| × ||v|| × cos(u, v)
Range: (-∞, ∞). Magnitude × direction.
1.4 Relationships#
For normalized vectors (||u|| = ||v|| = 1):
- cos(u, v) = u · v
- d(u, v)^2 = 2 - 2 × cos(u, v) = 2 - 2 × (u · v)
So for normalized vectors, cosine similarity, dot product, and Euclidean distance are monotonically equivalent.
For unnormalized vectors, each carries different information.
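A minimal sketch (plain PyTorch, vector values are arbitrary) that checks these relationships numerically: for unnormalized vectors the three metrics disagree, and after normalization cos(u, v) = u · v and d(u, v)^2 = 2 - 2 × cos(u, v).
```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
u, v = torch.randn(768), torch.randn(768)

# Unnormalized: the three metrics carry different information
cos = F.cosine_similarity(u.unsqueeze(0), v.unsqueeze(0)).item()
dot = torch.dot(u, v).item()
euc = torch.dist(u, v).item()
print(f"raw:  cos={cos:.3f}  dot={dot:.2f}  L2={euc:.2f}")

# Normalized: cos == dot, and squared distance == 2 - 2*cos
u_n, v_n = F.normalize(u, dim=0), F.normalize(v, dim=0)
cos_n = torch.dot(u_n, v_n).item()       # equals cosine similarity
d2 = torch.dist(u_n, v_n).item() ** 2    # squared Euclidean distance
print(f"norm: cos={cos_n:.4f}  2-2cos={2 - 2 * cos_n:.4f}  d^2={d2:.4f}")
```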
1.5 Which metric when#
| Scenario | Metric |
|---|---|
| Semantic search (text embeddings) | Cosine (magnitude measures how 'important' a token is, which is misleading for semantics) |
| Recommendation (user/item) | Dot product (both sides carry meaningful magnitude) |
| Image similarity | Cosine or Euclidean (depending on normalization) |
| Clustering | Euclidean (natural for k-means) |
| Word2Vec analogies | Cosine (vector arithmetic) |
| LLM attention scoring | Dot product (for efficiency) |
6-7. Isotropy + Anisotropy Problem#
6.1 Definition of isotropy#
If the embedding space is isotropic, vectors are uniformly distributed across all directions: the centroid of the vocabulary vectors is close to zero and the variance is equal in every direction.
Mathematically: the covariance matrix is ~ I (identity) and its eigenvalues are balanced.
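One way to check this on a concrete matrix is to look at the eigenvalue spectrum of the covariance matrix. The sketch below is an illustration on random Gaussian data; `eigenvalue_spread` is a helper name introduced here, and it computes the same 'top eigenvalue / sum of eigenvalues' ratio quoted in the empirical numbers later in this section.
```python
import torch

def eigenvalue_spread(E: torch.Tensor) -> float:
    """Fraction of total variance captured by the top principal direction.

    Roughly 1/d for a perfectly isotropic space, close to 1.0 when one
    direction dominates (severe anisotropy).
    """
    centered = E - E.mean(dim=0, keepdim=True)
    cov = (centered.T @ centered) / (len(E) - 1)
    eigvals = torch.linalg.eigvalsh(cov)  # ascending, real (cov is symmetric)
    return (eigvals[-1] / eigvals.sum()).item()

# Example: 10k random Gaussian vectors in 768d are nearly isotropic
E = torch.randn(10_000, 768)
print(f"top-eigenvalue share: {eigenvalue_spread(E):.4f}")  # ~1/768 ≈ 0.0013
```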
6.2 Why isotropy matters#
- Fair similarity: in an anisotropic space all vectors look close to each other
- Semantic clarity: 'cat' and 'dog' should be close, 'cat' and 'mathematics' far apart; in an anisotropic space this separation shrinks
- Downstream tasks: stable features for classifier heads
6.3 Gao 2019: representation degeneration#
The paper 'Representation Degeneration Problem in Training Natural Language Generation Models'.
Findings:
- Pre-trained transformer embeddings are highly anisotropic
- Vectors are clustered inside a narrow cone
- Average cosine similarity for arbitrary pairs is 0.6-0.8 (for a uniform random distribution it should be ~0)
6.4 Empirical numbers#
GPT-2 medium embedding (50K vocab, 1024d):
- Random pair cosine sim mean: 0.73
- Top eigenvalue / sum of eigenvalues: 0.85 (one direction dominates)
- Vectors all point in a similar direction
BERT-base (30K vocab, 768d):
- Random pair cosine sim mean: 0.55
- Less anisotropic than GPT-2 but still significant
Llama-3-8B (128K vocab, 4096d):
- Random pair cosine sim mean: 0.42
- Improving — modern training mitigates anisotropy
6.5 Causes of anisotropy#
- Softmax bias toward frequent words: frequent words' vectors end up in central positions, rare words at the periphery
- Optimization geometry: SGD naturally settles into anisotropic local minima
- Layer norm effects: each layer applies a slight rotation, and the cumulative effect is an anisotropic shift
6.6 Anisotropy quantification#
```python
import torch

def isotropy_score(E):
    # E: [V, d] embedding matrix
    # Mean pairwise similarity between centered, unit-normalized vectors
    normalized = E / E.norm(dim=1, keepdim=True)
    centered = normalized - normalized.mean(dim=0, keepdim=True)
    n = min(1000, len(centered))
    sample = centered[torch.randperm(len(centered))[:n]]
    sims = (sample @ sample.T).abs()
    # keep only distinct pairs (upper triangle, no self-similarity)
    rows, cols = torch.triu_indices(n, n, offset=1)
    sims = sims[rows, cols]
    return 1.0 - sims.mean().item()  # higher = more isotropic
```
An isotropy score of 1.0 = perfectly uniform; below 0.5 = severely anisotropic.
6.7 Mitigation: whitening#
```python
import torch

def whiten_embeddings(E):
    # E: [V, d] embedding matrix
    mu = E.mean(dim=0, keepdim=True)
    centered = E - mu
    cov = (centered.T @ centered) / (len(E) - 1)
    U, S, V = torch.svd(cov)
    # Whitening matrix: rotate to principal axes, rescale each direction to unit variance
    W = U @ torch.diag(1.0 / (S.sqrt() + 1e-6)) @ V.T
    return centered @ W  # whitened embeddings
```
Whitening: a PCA rotation plus per-direction scaling. The result is isotropic vectors with (approximately) identity covariance.
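A quick sanity check, assuming the `whiten_embeddings` function above is in scope: after whitening, the covariance of the vectors should be close to the identity matrix (ones on the diagonal, near-zero elsewhere). The toy data below is deliberately made anisotropic.
```python
import torch

E = torch.randn(5_000, 64) @ torch.randn(64, 64)  # deliberately anisotropic toy data
E_white = whiten_embeddings(E)                    # function defined above

centered = E_white - E_white.mean(dim=0, keepdim=True)
cov = (centered.T @ centered) / (len(E_white) - 1)
off_diag = cov - torch.diag(torch.diag(cov))
print(f"diagonal mean (should be ~1):   {torch.diag(cov).mean():.3f}")
print(f"max off-diagonal (should be ~0): {off_diag.abs().max():.4f}")
```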
6.8 Empirical effect of whitening#
GPT-2 embeddings after whitening:
- Cosine sim mean: 0.73 → 0.03 (near-zero, isotropic)
- Downstream task (STS-B): F1 0.58 → 0.71 (+13 points)
For semantic search, whitening is a dramatic improvement.
6.9 BERT-flow (Li et al., 2020)#
BERT-flow maps BERT embeddings to an isotropic Gaussian via an invertible normalizing-flow transformation; it is a trainable counterpart of whitening (cf. Su et al., 2021). It gives a 5%+ improvement on STS benchmarks.
6.10 SBERT (sentence-BERT)#
Reimers & Gurevych 2019 — sentence embeddings fine-tuned with a siamese network. It includes an implicit isotropy-regularization effect. Ideal for production semantic search.
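A minimal usage sketch with the sentence-transformers library; the multilingual checkpoint name below is an example choice for Turkish, not a recommendation made in this lesson.
```python
from sentence_transformers import SentenceTransformer, util

# Multilingual model that covers Turkish (example checkpoint)
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "Kargo ne zaman gelir?",
    "Siparişim ne zaman teslim edilir?",
    "Bugün hava çok güzel.",
]
emb = model.encode(sentences, normalize_embeddings=True)

# With normalized embeddings, cosine similarity equals the dot product
print(util.cos_sim(emb, emb))
```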
10. BERTology — Which Information Lives in Which Layer#
10.1 Rogers, Kovaleva, Rumshisky 2020#
'A Primer in BERTology: What We Know About How BERT Works'.
A meta-paper: a synthesis of 100+ BERTology studies.
10.2 Layer-wise findings#
BERT-base has 12 layers plus the embedding layer. Different layers carry different linguistic information:
| Layer | Information |
|---|---|
| Embedding (0) | Surface form (lexical) |
| 1-3 | Syntactic features (POS tagging) |
| 4-7 | Deeper syntax (dependency parsing) |
| 8-11 | Semantic features (semantic roles, coreference) |
| 12 (last) | Task-specific (the MLM pre-training objective) |
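To look at these layers yourself, a small sketch with Hugging Face transformers (using the Turkish BERT checkpoint analyzed later in this lesson) that returns the hidden states of every layer:
```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "dbmdz/bert-base-turkish-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)
model.eval()

inputs = tok("Kedi bahçede uyuyor.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# hidden_states: tuple of 13 tensors (embedding layer + 12 transformer layers)
for i, h in enumerate(out.hidden_states):
    print(f"layer {i:2d}: shape {tuple(h.shape)}")  # [1, seq_len, 768]
```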
10.3 Probe experiments#
Methodology: train a linear probe (classifier) on the output of BERT layer i for various NLP tasks, and see which layer yields the best accuracy for which task (a bare-bones probe sketch follows the findings list below).
Findings:
- POS tagging: peaks at layer 3-4
- Dependency parsing: peaks at layer 6-7
- Semantic role labeling: peaks at layer 9-10
- NER: peaks at layer 11
- Coreference: peaks at layer 11-12
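The bare-bones probe sketch mentioned above, under assumptions: `layer_reprs` and `pos_labels` are placeholders standing in for frozen token representations from one layer and their gold POS tags; a real probing study would train and evaluate on separate splits.
```python
import torch
import torch.nn as nn

def train_linear_probe(layer_reprs, labels, n_classes, epochs=20, lr=1e-2):
    """Fit a linear classifier on frozen layer representations.

    layer_reprs: [N, hidden_dim] token vectors from one BERT layer (frozen)
    labels:      [N] integer task labels (e.g. POS tag ids)
    Returns accuracy on the same data (for illustration only).
    """
    probe = nn.Linear(layer_reprs.size(1), n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(layer_reprs), labels)
        loss.backward()
        opt.step()
    return (probe(layer_reprs).argmax(dim=-1) == labels).float().mean().item()

# Hypothetical data: 2,000 token vectors from one layer, 17 POS classes
layer_reprs = torch.randn(2_000, 768)
pos_labels = torch.randint(0, 17, (2_000,))
print(f"probe accuracy: {train_linear_probe(layer_reprs, pos_labels, 17):.3f}")
```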
10.4 Why the embedding layer is 'shallow'#
The embedding layer (layer 0) carries only surface form — 'cat' means the word 'cat', with no semantic context.
Semantic depth emerges in the transformer layers; the embedding layer is only the starting point.
10.5 Modern GPT layer-wise#
GPT follows a pattern similar to BERT's:
- Embedding: surface form
- Early layers: syntax
- Middle layers: semantics
- Late layers: task-specific (next-token prediction)
10.6 Implications for fine-tuning#
- Fine-tuning with frozen embeddings: fine for most tasks (few tasks depend on surface-level information)
- Fine-tuning only the last layer: enough for many tasks (the semantic information lives there)
- Selective unfreezing: whether the middle layers are needed depends on the task (a freezing sketch follows this list)
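A sketch of how the freezing options above look with Hugging Face transformers; the choice of which layers to unfreeze is illustrative, not prescribed by the lesson.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")

# Option 1: freeze the embedding layer, fine-tune the rest
for p in model.embeddings.parameters():
    p.requires_grad = False

# Option 2: freeze everything, then unfreeze only the last two encoder layers
for p in model.parameters():
    p.requires_grad = False
for layer in model.encoder.layer[-2:]:
    for p in layer.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```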
10.7 BERTology for Turkish#
The same pattern was found for BERT-base-Turkish (Yıldız et al. 2021):
- Morphological features in layers 2-4
- Syntactic dependencies in layers 6-8
- Semantic features in layers 9-11
Turkish morphology is learned in the early layers — the multi-layer transformer hierarchy holds for Turkish as well.
```python
# Turkish embedding geometry analysis
import random
import statistics

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "dbmdz/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# 1. Extract the embedding matrix
E = model.embeddings.word_embeddings.weight.detach()
print(f"Embedding shape: {E.shape}")  # [32000, 768]

# 2. Isotropy score
def isotropy_score(E, n_pairs=10000):
    normalized = F.normalize(E, dim=1)
    centered = normalized - normalized.mean(dim=0, keepdim=True)
    indices = torch.randint(0, len(centered), (n_pairs, 2))
    sims = (centered[indices[:, 0]] * centered[indices[:, 1]]).sum(dim=1)
    return 1.0 - sims.abs().mean().item()

print(f"Isotropy (raw): {isotropy_score(E):.4f}")

# 3. Turkish synonym/related pair similarity
pairs = [
    ("merhaba", "selam"),
    ("araba", "otomobil"),
    ("köpek", "kedi"),
    ("mavi", "lacivert"),
    ("İstanbul", "Ankara"),
    ("kahve", "çay"),
]

for w1, w2 in pairs:
    t1 = tokenizer.tokenize(w1)
    t2 = tokenizer.tokenize(w2)
    if len(t1) == 1 and len(t2) == 1:
        id1 = tokenizer.convert_tokens_to_ids(t1[0])
        id2 = tokenizer.convert_tokens_to_ids(t2[0])
        sim = F.cosine_similarity(E[id1].unsqueeze(0), E[id2].unsqueeze(0)).item()
        print(f"{w1} <-> {w2}: cos = {sim:.3f}")

# 4. Random pair baseline
random_pairs = [random.sample(range(len(E)), 2) for _ in range(100)]
random_sims = [
    F.cosine_similarity(E[i].unsqueeze(0), E[j].unsqueeze(0)).item()
    for i, j in random_pairs
]
print(f"\nRandom pair cos avg: {statistics.mean(random_sims):.3f}")  # ~0 if isotropic, higher otherwise

# 5. Whitening
def whiten(E):
    mu = E.mean(dim=0, keepdim=True)
    centered = E - mu
    cov = (centered.T @ centered) / (len(E) - 1)
    U, S, V = torch.svd(cov)
    W = U @ torch.diag(1.0 / (S.sqrt() + 1e-6)) @ V.T
    return centered @ W

E_white = whiten(E)
print(f"\nIsotropy after whitening: {isotropy_score(E_white):.4f}")
```
Turkish BERT embedding geometry analysis
✅ Lesson 7.5 Summary — Embedding Geometry
Distance metrics: cosine (direction only, ideal for semantics), Euclidean (magnitude-aware, for clustering), dot product (magnitude × direction, for attention). For normalized vectors the three are monotonically equivalent. Isotropy = vectors uniform across directions. The anisotropy problem (Gao 2019): pre-trained transformer embeddings are clustered in a narrow cone. Mitigation: whitening, BERT-flow, SBERT. Whitening yields a dramatic semantic-search improvement (+13 F1 points on STS-B). BERTology: layer-wise specialization — embedding=surface, early=syntax, mid=semantic, late=task-specific. The same pattern holds for Turkish BERT. In Lesson 7.6, the final lesson of Module 7, we move on to the Turkish semantic search capstone project.
Next Lesson: Turkish Semantic Search Capstone#
Lesson 7.6 (the Module 7 capstone): a Turkish semantic search demo built from scratch with sentence-transformers. Cosine similarity, a FAISS vector index, production deployment. A mini-RAG system.
Frequently Asked Questions
Why does attention use the raw dot product instead of cosine similarity? Magnitude information matters in attention scoring: magnitude reflects how 'important' frequent tokens are, and cosine's normalization throws that away. There is also an efficiency argument: cosine = normalization + dot product, which means extra operations. The raw dot product scaled by 1/sqrt(d_k) is the standard in attention.
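A tiny sketch of the scaled dot-product score the answer refers to (shapes are illustrative):
```python
import math
import torch

d_k = 64
Q = torch.randn(1, 8, d_k)  # [batch, seq, d_k]
K = torch.randn(1, 8, d_k)

scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)  # raw dot product + 1/sqrt(d_k) scale
attn = scores.softmax(dim=-1)
print(attn.shape)  # [1, 8, 8]
```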