ORPO: Odds Ratio Preference Optimization — Single-Stage SFT+Alignment
ORPO (Hong et al. 2024) — a DPO alternative with no SFT-base requirement: it combines the SFT loss with an odds-ratio preference loss in a single stage. No reference model → memory savings. Covers reference-free training, the λ hyperparameter, and an ORPO lab on an RTX 4090.
Şükrü Yusuf KAYA
26 min read
1. ORPO Loss Formula
```
L_ORPO = L_SFT(y_w) + λ · L_OR(y_w, y_l)

L_SFT(y)        = -log π_θ(y | x)                             # standard SFT loss
L_OR(y_w, y_l)  = -log σ(log(odds(y_w | x) / odds(y_l | x)))  # odds-ratio loss
odds(y | x)     = π_θ(y | x) / (1 - π_θ(y | x))
```
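To make the formula concrete, here is a minimal PyTorch sketch of the loss, assuming you already have length-averaged log-probabilities log π_θ(y|x) for the chosen and rejected completions (the function and argument names are illustrative, not TRL internals):

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """Sketch of L_ORPO for a batch; logp_* are length-averaged log π_θ(y|x), shape [B]."""
    # log odds(y|x) = log p - log(1 - p), computed in log space
    log_odds_w = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_l = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    # L_OR = -log σ(log-odds ratio of chosen vs. rejected)
    l_or = -F.logsigmoid(log_odds_w - log_odds_l)
    # L_SFT = -log π_θ(y_w | x), i.e. plain NLL on the chosen completion
    l_sft = -logp_chosen
    return (l_sft + lam * l_or).mean()

# Toy usage: chosen completion slightly more likely than rejected
loss = orpo_loss(torch.tensor([-1.2]), torch.tensor([-2.5]))
print(loss)  # scalar tensor
```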
Advantages:
- No reference model (DPO requires one) → ~40% memory savings
- No SFT base needed — base model + ORPO in a single stage
- Comparable quality (literature reports results on par with DPO)
Disadvantage: highly sensitive to λ; the sweet spot is narrow, around λ = 0.1-0.2 (see the sweep sketch below).
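Because of that sensitivity, it is worth sweeping λ rather than trusting a single value. A minimal sketch with TRL's ORPOConfig, where the `beta` field plays the role of λ (the sweep values and output paths are illustrative):

```python
from trl import ORPOConfig

# Candidate λ values around the reported sweet spot (illustrative choices).
for lam in (0.05, 0.1, 0.2):
    cfg = ORPOConfig(
        output_dir=f"llama-3.1-8b-orpo-lam{lam}",
        beta=lam,            # TRL exposes λ as `beta`
        learning_rate=8e-6,
        num_train_epochs=1,  # short runs are enough to rank λ values
    )
    # Train with ORPOTrainer as in the lab below, then compare eval win rates.
```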
```python
# === ORPO Lab — TRL ORPOTrainer ===
from datasets import load_dataset
from trl import ORPOConfig, ORPOTrainer
from unsloth import FastLanguageModel

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    dtype="bfloat16",
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Preference pairs: prompt + chosen/rejected completions
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

cfg = ORPOConfig(
    output_dir="llama-3.1-8b-orpo",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=8e-6,          # 5e-6 to 1e-5 for ORPO
    bf16=True,
    optim="paged_adamw_8bit",
    max_length=4096,
    max_prompt_length=2048,
    beta=0.1,                    # OR loss weight (≡ λ)
    logging_steps=5,
    save_steps=100,
    report_to="wandb",
)

trainer = ORPOTrainer(model=model, args=cfg, train_dataset=dataset, tokenizer=tok)
trainer.train()
# ~60-70 minutes, peak 14 GB VRAM
```
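Once training finishes, a quick generation smoke test catches obvious regressions before the full DPO comparison. A minimal sketch (adapter path and prompt are illustrative):

```python
# Save the LoRA adapter and run one quick generation (illustrative path/prompt).
model.save_pretrained("llama-3.1-8b-orpo-adapter")
tok.save_pretrained("llama-3.1-8b-orpo-adapter")

FastLanguageModel.for_inference(model)  # switch unsloth model to inference mode
inputs = tok("Explain the odds ratio in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```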
✅ Deliverables
1. Run the Llama 3.1 8B base model through ORPO alignment directly (no SFT stage).
2. Compare the result against the two-stage DPO pipeline (SFT + DPO).
3. Next lesson: 11.5 — KTO (Kahneman-Tversky Optimization).