TR Embedding FT: BGE-M3, jina-v3, nomic-embed TR Adaptation + MTEB-TR Eval
Fine-tuning TR embedding models for RAG: BGE-M3 (multilingual, strong TR baseline), jina-embeddings-v3, and nomic-embed-text. Covers TR-specific query/document pair generation, contrastive learning (InfoNCE), and evaluation on the MTEB-TR benchmark. A BGE-M3 TR fine-tune takes about 6h on an RTX 4090.
Şükrü Yusuf KAYA
28 min read
Advanced

## 1. TR Embedding Baseline Table (MTEB-TR 2026)
| Model | Size | TR-MTEB Avg | License |
|---|---|---|---|
| BGE-M3 | 568M | 62.1 | MIT |
| jina-embeddings-v3 | 570M | 60.4 | CC-BY-NC |
| nomic-embed-text-v2-multilingual | 137M | 55.8 | Apache 2.0 |
| multilingual-e5-large | 559M | 58.2 | MIT |
| TR-specific FT (BGE-M3 base + 50K TR pairs) | 568M | 66.8 (+4.7) | Apache 2.0 |
Decision: BGE-M3 as the baseline. In production, fine-tuning for a custom domain yields a 5-8% boost.
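Before the training script, it helps to see what the contrastive objective actually computes. `MultipleNegativesRankingLoss` is in-batch InfoNCE: for each (query, positive) pair, every other positive in the batch serves as a negative, and the loss is cross-entropy over scaled cosine similarities (the library uses a similarity scale of 20, i.e. a temperature of 0.05). A minimal numpy sketch (the function name and toy data are illustrative, not from the lesson):

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """In-batch InfoNCE: row i's positive is p[i]; all other rows'
    positives act as negatives. This is what
    MultipleNegativesRankingLoss computes, up to the scale factor."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                      # (B, B) cosine sims
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))          # diagonal = true pairs

# Toy batch: matched query/positive rows give near-zero loss,
# mismatched (shuffled) positives give a large loss.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss_matched = info_nce_loss(q, q.copy())
loss_shuffled = info_nce_loss(q, q[::-1].copy())
```

With a batch size of 8 and 7 hard negatives per query (as in the script below), each query is effectively ranked against dozens of candidates per step, which is why larger batches tend to help contrastive fine-tuning.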
```python
# === BGE-M3 TR Fine-Tuning ===
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-m3", device="cuda")

# Dataset: (query, positive_doc, negative_doc) triplets
train_examples = []
for query, pos_doc, neg_docs in tr_dataset:
    for neg in neg_docs[:7]:  # 1 pos + 7 hard negatives
        train_examples.append(InputExample(texts=[query, pos_doc, neg]))

train_dataloader = DataLoader(train_examples, batch_size=8, shuffle=True)

# Loss: MultipleNegativesRankingLoss (InfoNCE variant)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, loss)],
    epochs=3,
    warmup_steps=500,
    optimizer_params={"lr": 2e-5},
    output_path="bge-m3-tr-finetuned",
)

# Eval: MTEB-TR
from mteb import MTEB

benchmark = MTEB(tasks=["mteb_tr_*"])
results = benchmark.run(model)
```

BGE-M3 TR contrastive FT
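The script above assumes a `tr_dataset` of (query, positive doc, hard-negative list) triplets. A common way to build the hard-negative lists is to embed the corpus with the base model and take the top-k most query-similar documents that are not the positive. A minimal sketch under that assumption (the function name and toy vectors are illustrative, not from the lesson):

```python
import numpy as np

def mine_hard_negatives(query_vec, doc_vecs, pos_idx, k=7):
    """Return indices of the k docs most similar to the query,
    excluding the known positive. These are 'hard' negatives
    because the base model already confuses them with the answer.
    Assumes embeddings are L2-normalized, so dot product = cosine."""
    sims = doc_vecs @ query_vec
    order = np.argsort(-sims)                  # most similar first
    return [int(i) for i in order if i != pos_idx][:k]

# Toy corpus: 4 unit-vector doc embeddings, doc 0 is the positive
docs = np.eye(4)
query = np.array([0.9, 0.4, 0.1, 0.0])
query = query / np.linalg.norm(query)
negs = mine_hard_negatives(query, docs, pos_idx=0, k=2)
# docs 1 and 2 are the most query-similar non-positives
```

In practice, random negatives are too easy and teach the model little; mining negatives from the base model's own top results is what drives most of the 5-8% boost mentioned above.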
✅ Deliverables
1. Generate 5K TR query-doc pairs.
2. Fine-tune BGE-M3.
3. Compare against the baseline on MTEB-TR.
4. Next lesson: 9.8, TR Reranker FT.