Data Mixing Math: Sampling Temperature, DoReMi, Domain Reweighting

How to mix multiple datasets? Naïve concatenation = the large dataset dominates. Sampling temperature, proportional mixing, DoReMi (Xie et al. 2023) algorithm for dynamic reweighting. Turkish SFT mix example: 40% TR-Alpaca + 25% OASST + 20% ShareGPT-TR + 15% custom — why these percentages?

Şükrü Yusuf KAYA

30 min read

5/14/2026

Advanced

🎯 In this lesson

The 'concat dataset, train, finish' tutorials on social media are misleading. If you mix 50K Alpaca + 200K OASST + 5K custom samples by plain concatenation, about 78% of the training data is OASST, so the model mostly produces OASST-style conversational chitchat and does not learn the custom use case. You need the math of mixing.
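
As a quick check of how lopsided plain concatenation is, the following minimal sketch computes each dataset's share of the concatenated data. The dictionary keys are just labels for the three datasets in the example above.

```python
# Dataset sizes from the example above (number of samples).
sizes = {"Alpaca": 50_000, "OASST": 200_000, "custom": 5_000}

# Under plain concatenation, a dataset's share is simply N_i / sum_j N_j.
total = sum(sizes.values())
shares = {name: n / total for name, n in sizes.items()}

for name, share in shares.items():
    print(f"{name:>7}: {share:6.1%}")
```

The custom dataset ends up as roughly one sample in fifty, which is why its behavior is barely learned.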

1. Sampling Temperature

Given a set of domains {D_1, D_2, ...} with sizes {N_1, N_2, ...}, the probability of sampling from domain i is:

p_i = N_i^τ / Σ_j N_j^τ

τ = 1.0 → proportional (the large dataset dominates)
τ = 0.5 → square-root smoothing (the classic multilingual approach, as in XLM; mBERT used a similar smoothing with exponent 0.7)
τ = 0.0 → uniform (every dataset gets equal weight)

TR example: TR-Alpaca 52K + OASST-TR 8K + custom-TR 5K + ShareGPT-TR 30K

τ	TR-Alpaca	OASST	Custom	ShareGPT
1.0	54.7%	8.4%	5.3%	31.6%
0.5	40.6%	15.9%	12.6%	30.9%
0.0	25%	25%	25%	25%
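
The formula p_i = N_i^τ / Σ_j N_j^τ is easy to evaluate directly. The sketch below applies it to the four dataset sizes from the TR example; `temperature_weights` is a hypothetical helper name, not part of any library.

```python
def temperature_weights(sizes, tau):
    """Return p_i = N_i^tau / sum_j N_j^tau for a list of dataset sizes."""
    scaled = [n ** tau for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


names = ["TR-Alpaca", "OASST", "Custom", "ShareGPT"]
sizes = [52_000, 8_000, 5_000, 30_000]  # nominal sizes from the TR example

for tau in (1.0, 0.5, 0.0):
    weights = temperature_weights(sizes, tau)
    row = "  ".join(f"{name}={w:.1%}" for name, w in zip(names, weights))
    print(f"tau={tau}: {row}")
```

Lowering τ moves weight from the largest dataset to the smaller ones, and τ = 0 ignores the sizes entirely.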

Cookbook rule of thumb: τ ≈ 0.3-0.5 is the sweet spot. For very scarce domains, add repetition (a 2x sampling rate in each epoch).
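
To see how much repetition a given τ already implies, one can compute the expected number of passes over each dataset. This is a minimal sketch under one assumption: an "epoch" of the mix draws as many samples as the datasets hold in total. `effective_epochs` is a hypothetical helper name.

```python
def effective_epochs(sizes, weights, num_draws):
    """Expected number of passes over each dataset after num_draws samples."""
    return [w * num_draws / n for n, w in zip(sizes, weights)]


names = ["TR-Alpaca", "OASST", "Custom", "ShareGPT"]
sizes = [52_000, 8_000, 5_000, 30_000]  # nominal sizes from the TR example

tau = 0.5
scaled = [n ** tau for n in sizes]
weights = [s / sum(scaled) for s in scaled]

# Assumption: one mixed "epoch" draws sum(sizes) samples in total.
num_draws = sum(sizes)
epochs = effective_epochs(sizes, weights, num_draws)

for name, e in zip(names, epochs):
    print(f"{name:>10}: {e:.2f} passes")
```

With τ = 0.5 the smallest dataset is seen a little over twice per mixed epoch, while the largest is not fully covered even once.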

2. DoReMi: Dynamic Reweighting (Xie et al. 2023)

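Instead of picking τ by hand, learn the optimal weights. DoReMi trains a small proxy model with Group DRO (distributionally robust optimization) and finds the mix that minimizes the worst-case excess loss over domains, measured against a reference model.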

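Algorithm:

1. Start from initial reference weights, e.g. uniform α = [1/k, ..., 1/k] over k domains
2. Train a small reference model R on mix(α)
3. Train a small proxy model with Group DRO: at each step, compute each domain's excess loss (proxy loss minus R's loss, clipped at zero)
4. Update α with exponentiated gradient ascent on the excess losses, so domains where the proxy lags R are upweighted; the final weights are the average of α over the training steps
5. Optionally go back to step 2 with the new weights (iterated DoReMi)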
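
The weight update at the core of the algorithm can be sketched in a few lines. This is an illustration under stated assumptions, not the reference implementation: `doremi_weight_update` is a hypothetical name, the loss values are made up, and the defaults `lr=1.0` and `smoothing=1e-3` reflect my reading of the paper and should be treated as assumptions.

```python
import math


def doremi_weight_update(alpha, proxy_losses, reference_losses, lr=1.0, smoothing=1e-3):
    """One exponentiated-gradient step on the domain weights alpha."""
    # Excess loss per domain: how far the proxy lags the reference, clipped at zero.
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, reference_losses)]
    # Multiplicative update: domains with larger excess loss grow faster.
    unnormalized = [a * math.exp(lr * e) for a, e in zip(alpha, excess)]
    total = sum(unnormalized)
    k = len(alpha)
    # Renormalize and mix in a small uniform component so no weight collapses to zero.
    return [(1.0 - smoothing) * u / total + smoothing / k for u in unnormalized]


alpha = [0.25, 0.25, 0.25, 0.25]          # uniform start over 4 domains
proxy_losses = [2.9, 3.4, 3.1, 3.0]       # made-up per-domain proxy losses
reference_losses = [2.8, 2.9, 3.2, 2.9]   # made-up per-domain reference losses

new_alpha = doremi_weight_update(alpha, proxy_losses, reference_losses)
print([round(a, 3) for a in new_alpha])
```

In a real run this step is interleaved with proxy-model updates, and the weights handed to the large model are the average of α over all steps.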

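Reported results (Xie et al. 2023): on The Pile, compared with a model trained on the default domain weights, DoReMi improved perplexity on every domain, raised average few-shot downstream accuracy by 6.5 percentage points, and reached the baseline accuracy with 2.6x fewer training steps.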

In practice: a small proxy model (160M-450M parameters) is enough to run DoReMi on an RTX 4090; the cookbook recommends it as Phase 1 (when preparing a large dataset).

python

# === Practical mixed dataset (cookbook default) ===
import numpy as np
from datasets import load_dataset, interleave_datasets

# Four separate TR datasets (the sizes in the comments are nominal)
alpaca_tr = load_dataset("malhajar/alpaca-gpt4-tr", split="train")       # 52K
oasst_tr  = load_dataset("OpenAssistant/oasst1", split="train")            # 8K after the TR filter
oasst_tr  = oasst_tr.filter(lambda x: x["lang"] == "tr")
custom_tr = load_dataset("user/custom-tr-sft", split="train")              # 5K
sharegpt_tr = load_dataset("user/sharegpt-tr", split="train")              # 30K

# NOTE: for a usable SFT mix, map each dataset to a common schema
# (same column names and types) before interleaving.

# Mix with τ=0.4
sizes = np.array([len(alpaca_tr), len(oasst_tr), len(custom_tr), len(sharegpt_tr)])
tau = 0.4
weights = (sizes ** tau) / (sizes ** tau).sum()
print(f"Mix weights: {weights}")
# With the nominal sizes above: approx. [0.37, 0.18, 0.15, 0.30]

mixed = interleave_datasets(
    [alpaca_tr, oasst_tr, custom_tr, sharegpt_tr],
    probabilities=weights.tolist(),
    seed=42,
    stopping_strategy="all_exhausted",   # oversample: stop once every dataset has been fully seen
)
print(f"Mixed size: {len(mixed)}")

Mixing in practice with τ=0.4
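
The size of the interleaved dataset depends on the stopping strategy. As a rough estimate (an expectation, not the library's exact output, which varies with the seed), dataset i is exhausted after about N_i / p_i draws, so "all_exhausted" stops near the largest of these values and "first_exhausted" near the smallest. `estimate_interleaved_size` is a hypothetical helper, and the sizes are the nominal ones from the comments above.

```python
def estimate_interleaved_size(sizes, probabilities, stopping_strategy="all_exhausted"):
    """Rough expected length of a probability-weighted interleave."""
    # Dataset i runs out after about N_i / p_i total draws.
    draws_to_exhaust = [n / p for n, p in zip(sizes, probabilities)]
    if stopping_strategy == "all_exhausted":
        return round(max(draws_to_exhaust))
    if stopping_strategy == "first_exhausted":
        return round(min(draws_to_exhaust))
    raise ValueError(f"unknown stopping_strategy: {stopping_strategy!r}")


sizes = [52_000, 8_000, 5_000, 30_000]  # nominal sizes, not measured
tau = 0.4
scaled = [n ** tau for n in sizes]
probabilities = [s / sum(scaled) for s in scaled]

oversampled = estimate_interleaved_size(sizes, probabilities, "all_exhausted")
undersampled = estimate_interleaved_size(sizes, probabilities, "first_exhausted")
print(f"all_exhausted   ~ {oversampled} samples")
print(f"first_exhausted ~ {undersampled} samples")
```

With τ < 1 the largest dataset is the last to run out, so the oversampled mix is larger than the plain concatenation and the small datasets are repeated along the way.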

✅ Deliverables

1) Try several τ values for your own mix (0.0, 0.3, 0.5, 1.0). 2) Run SFT on the same model with the 4 different mixes and compare them on MT-Bench-TR. 3) Next lesson: 2.9, Sequence Packing & Variable-Length Attention.
