# LLM Fine-Tuning: A Comprehensive 2026 Guide to LoRA, QLoRA, DPO, and Modern Alignment

> Source: https://sukruyusufkaya.com/en/blog/llm-fine-tuning-lora-qlora-dpo
> Updated: 2026-05-13T19:58:09.245Z
> Type: blog
> Category: yapay-zeka
**TLDR:** The most current, detailed 2026 Turkish guide to adapting an LLM to your domain. Covers when fine-tuning is necessary, the math behind LoRA, 4-bit training with QLoRA, why DPO beats PPO, modern alternatives (ORPO/KTO/IPO), Turkish dataset sources, GPU/cloud cost modeling, production pipelines, 3 anonymized Turkish enterprise case studies, and KVKK-compliant training. For developers, MLOps engineers, and AI architects.

<tldr data-summary="[&#34;Fine-tuning is the additional training that locks specific dimensions of an LLM\&#39;s behavior — style, format, behavior, domain knowledge — without changing its core capabilities. It is the right answer for ~5% of needs.&#34;,&#34;LoRA (Low-Rank Adaptation) trains small adapter matrices instead of full weights; with 0.1-1% of parameters updated, it delivers 90-95% of full fine-tuning quality.&#34;,&#34;QLoRA pairs LoRA with 4-bit quantization, making a 70B model fine-tunable on a single A100 GPU — the engine behind the post-2023 personal/small-team fine-tuning boom.&#34;,&#34;DPO (Direct Preference Optimization) replaces classic RLHF\&#39;s PPO + reward-model loop with a simple supervised loss on preference pairs; the 2024-2026 modern alignment standard.&#34;,&#34;For Turkish enterprises, fine-tuning typically costs $200-$5,000; data preparation determines 70% of cost and quality — training is only the last step.&#34;]" data-one-line="Fine-tuning is the advanced AI-engineering discipline that, in the right situations — when RAG and prompt engineering fall short — permanently bends an LLM's behavior toward your organization's DNA."></tldr>

## 1. What is Fine-Tuning and When is it Necessary?

Three main strategies adapt LLMs to your use case: **prompt engineering**, **RAG**, and **fine-tuning**. The first two leave the model unchanged; fine-tuning **updates model weights through additional training**. In the right situations, it produces enormous value; in the wrong ones, it is a waste of money.

<definition-box data-term="Fine-Tuning" data-definition="The process of updating a pretrained language model's (foundation model's) weights via additional training on a custom dataset and task. Aligns the model to a specific domain, style, format, or behavior while preserving the existing knowledge base. Covers methods like full fine-tuning, LoRA, QLoRA, DPO, and ORPO." data-also="Model Adaptation"></definition-box>

### When to Fine-Tune?

A practical decision framework:

<comparison-table data-caption="Fine-Tuning vs Other Adaptation Methods" data-headers="[&#34;Need&#34;,&#34;Prompt Eng&#34;,&#34;RAG&#34;,&#34;Fine-tuning&#34;]" data-rows="[{&#34;feature&#34;:&#34;Lock in style/format&#34;,&#34;values&#34;:[&#34;Partial&#34;,&#34;-&#34;,&#34;Ideal&#34;]},{&#34;feature&#34;:&#34;Add domain knowledge&#34;,&#34;values&#34;:[&#34;-&#34;,&#34;Ideal&#34;,&#34;Limited&#34;]},{&#34;feature&#34;:&#34;Access fresh data&#34;,&#34;values&#34;:[&#34;-&#34;,&#34;Ideal&#34;,&#34;-&#34;]},{&#34;feature&#34;:&#34;Teach new behavior&#34;,&#34;values&#34;:[&#34;Partial&#34;,&#34;-&#34;,&#34;Ideal&#34;]},{&#34;feature&#34;:&#34;Reduce latency&#34;,&#34;values&#34;:[&#34;-&#34;,&#34;-&#34;,&#34;Yes (small model)&#34;]},{&#34;feature&#34;:&#34;Save tokens&#34;,&#34;values&#34;:[&#34;-&#34;,&#34;-&#34;,&#34;Ideal&#34;]},{&#34;feature&#34;:&#34;Setup time&#34;,&#34;values&#34;:[&#34;Hours&#34;,&#34;Weeks&#34;,&#34;Weeks-months&#34;]},{&#34;feature&#34;:&#34;Cost&#34;,&#34;values&#34;:[&#34;Very low&#34;,&#34;Medium&#34;,&#34;High (one-time)&#34;]}]"></comparison-table>

**Practical rule.** 70% of needs are solved by prompt engineering, 25% more by prompt + RAG. The remaining **5%** is where fine-tuning produces real value: locking in style/format, guaranteed structured output, lowering latency/cost (distillation), domain-specific language (Turkish law, medicine), and new behavior (agent tasks, tool use).

<stat-callout data-value="5%" data-context="The actual rate of production LLM applications that truly require fine-tuning —" data-outcome="the other 95% are solved by prompt engineering + RAG. Exhaust those two layers before reaching for fine-tuning." data-source="{&#34;label&#34;:&#34;OpenAI Cookbook + Anthropic Best Practices&#34;,&#34;url&#34;:&#34;https://platform.openai.com/docs/guides/fine-tuning&#34;,&#34;date&#34;:&#34;2025&#34;}"></stat-callout>

### Why Try Prompt and RAG First?

Fine-tuning has five side effects: high upfront cost (GPU hours, data, evals); model "freezing" (the work must be redone for each new base model); catastrophic-forgetting risk; data-management complexity (KVKK + IP + quality); and harder evaluation. That is why OpenAI, Anthropic, and Google all recommend **prompt + RAG first, fine-tuning later**.

## 2. The Full LLM Training Pipeline

A modern LLM goes through four training stages, each with a distinct purpose, dataset type, and cost.

<comparison-table data-caption="LLM Training Stages (Full Picture)" data-headers="[&#34;Stage&#34;,&#34;Purpose&#34;,&#34;Data Type&#34;,&#34;Time/Cost&#34;]" data-rows='[{"feature":"1. Pretraining","values":["General language","Trillions of tokens (internet, books, code)","Months, millions $"]},{"feature":"2. Supervised Fine-Tuning (SFT)","values":["Instruction following","Thousands of high-quality Q&A pairs","Days, thousands $"]},{"feature":"3. Preference Optimization (RLHF/DPO/ORPO)","values":["Human preference","Preference pairs (A > B)","Days, thousands $"]},{"feature":"4. Continued Fine-tuning (yours)","values":["Domain/style alignment","Hundreds-thousands of examples","Hours-days, $50-5,000"]}]'></comparison-table>

Enterprise fine-tuning usually happens at **Stage 4**.

### Supervised Fine-Tuning (SFT)

The most basic form — standard next-token prediction training on instruction-response pairs. Most enterprise fine-tunes are SFT (style, format, domain knowledge).

### Preference Optimization

Human evaluators see two responses (A, B) for the same prompt and mark the better one. The model is then pushed toward "good" responses via:

- **RLHF (PPO)** — classic; trains a reward model and applies PPO. Complex and resource-heavy.
- **DPO** — skips the reward model; supervised loss directly on preference pairs. Simple, effective, the standard since 2024.
- **ORPO / KTO / IPO** — derivatives and alternatives detailed below.

## 3. PEFT — Parameter-Efficient Fine-Tuning

Fully fine-tuning a 70B-parameter model means updating all 70B weights, which requires 800GB+ of VRAM; only large labs operate at that scale. **PEFT** solves this by updating only a **small subset of parameters**.

<definition-box data-term="PEFT (Parameter-Efficient Fine-Tuning)" data-definition="A family of techniques that fine-tune a small subset of parameters rather than the entire weights of pretrained large models. Includes LoRA, QLoRA, AdaLoRA, IA-3, Prefix Tuning, Prompt Tuning. Reduces compute by 10-100x with typically only 5-10% quality drop." data-also="Parameter-Efficient Fine-Tuning"></definition-box>

PEFT members: **LoRA**, **QLoRA**, **AdaLoRA**, **IA-3**, **Prefix Tuning**, **Prompt Tuning**, **DoRA** (2024), **MoRA** (2024).

## 4. LoRA — Low-Rank Adaptation

Published in 2021 by Microsoft researchers (Hu et al.), LoRA has become **the gold standard of modern fine-tuning**.

### 4.1. Math (Brief)

In full fine-tuning, a weight matrix <code>W</code> (e.g., 4096×4096) is updated directly: <code>W_new = W + ΔW</code>. LoRA's core assumption: <code>ΔW</code> is approximately **low-rank**.

LoRA expresses <code>ΔW</code> as the product of two small matrices:

<pre><code>ΔW ≈ B × A
B: 4096 × r
A: r × 4096
r &lt;&lt; 4096 (usually 4, 8, 16, 32, 64)</code></pre>

Only **A and B are updated** during training; original <code>W</code> is frozen. At inference, <code>W + B × A</code> is computed (or merged).
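A minimal PyTorch sketch of the idea (illustrative only; production code uses the <code>peft</code> library's implementation):

<pre><code>import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze W (and bias)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # d_out x r, zero init: dW starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # W x + (alpha/r) * B A x  -- only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)</code></pre>

At inference the update can be folded in once (<code>W_merged = W + (α/r) × B × A</code>), so a merged adapter adds zero latency.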

### 4.2. LoRA Hyperparameters

**Rank (r)** — the inner dimension of the adapter matrices A and B. Common: 8 (default), 16, 32, 64. Higher rank means more capacity but also more overfitting risk.

**Alpha (α)** — scaling factor. <code>ΔW_effective = (α/r) × B × A</code>. Practical: <code>α = 2r</code>.

**Target modules** — which layers get LoRA?

- <code>q_proj, v_proj</code> — attention query/value only (minimal)
- <code>q_proj, k_proj, v_proj, o_proj</code> — all attention
- <code>q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj</code> — attention + MLP (most thorough)

**Tip.** Targeting all linear layers gives the best results; attention-only typically loses 5-10% quality on most tasks.
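In Hugging Face's <code>peft</code> library, these three hyperparameters map directly onto <code>LoraConfig</code>. A minimal sketch (values are illustrative; module names assume a Llama-style architecture):

<pre><code>from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=16,                 # rank of the adapter matrices
    lora_alpha=32,        # alpha = 2r, per the rule of thumb above
    target_modules=[      # attention + MLP: the "most thorough" option
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically ~0.1-1% of total parameters</code></pre>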

### 4.3. Full Fine-Tuning vs LoRA

<comparison-table data-caption="Full Fine-Tuning vs LoRA (Llama 3 70B Example)" data-headers="[&#34;Dimension&#34;,&#34;Full FT&#34;,&#34;LoRA&#34;]" data-rows="[{&#34;feature&#34;:&#34;Trained params&#34;,&#34;values&#34;:[&#34;70B (full)&#34;,&#34;~0.5B (0.7%)&#34;]},{&#34;feature&#34;:&#34;VRAM need&#34;,&#34;values&#34;:[&#34;800GB+&#34;,&#34;48-80GB&#34;]},{&#34;feature&#34;:&#34;Training time&#34;,&#34;values&#34;:[&#34;1x&#34;,&#34;0.5-0.7x&#34;]},{&#34;feature&#34;:&#34;Quality&#34;,&#34;values&#34;:[&#34;100% (baseline)&#34;,&#34;90-95%&#34;]},{&#34;feature&#34;:&#34;Data need&#34;,&#34;values&#34;:[&#34;More&#34;,&#34;Less (1K-10K samples)&#34;]},{&#34;feature&#34;:&#34;Output size&#34;,&#34;values&#34;:[&#34;~140GB&#34;,&#34;~50MB-1GB (adapter only)&#34;]},{&#34;feature&#34;:&#34;Multi-task&#34;,&#34;values&#34;:[&#34;Hard&#34;,&#34;Multi-adapter swap&#34;]}]"></comparison-table>

LoRA's **small output** (50MB-1GB) is especially valuable — you can run 10 different LoRA adapters on the same model, switching at runtime.

## 5. QLoRA — 4-bit Quantization + LoRA

Published in 2023 by Dettmers et al., QLoRA pairs LoRA with **quantization** to make 70B models trainable on **a single A100 GPU**. The engine of the personal/small-team fine-tuning explosion.

### 5.1. Three Main Components

**4-bit NF4 (NormalFloat 4) quantization.** Model weights are stored at 4-bit instead of 16-bit precision. NF4 is more accurate than standard 4-bit formats because it is optimized for normally distributed values, which trained weights approximately are.

**Double Quantization (DQ).** Even the quantization constants are quantized for additional memory savings.

**Paged Optimizers.** Optimizer state is paged between CPU RAM and GPU memory so that transient memory spikes do not cause out-of-memory (OOM) errors.
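All three components are exposed through the standard QLoRA loading recipe in <code>transformers</code> + <code>bitsandbytes</code> (a minimal sketch):

<pre><code>import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weight storage
    bnb_4bit_quant_type="nf4",              # NormalFloat4 instead of plain int4
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still runs in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters are then attached on top (Section 4). Paged optimizers come
# from bitsandbytes, e.g. optim="paged_adamw_8bit" in TrainingArguments.</code></pre>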

### 5.2. Practical QLoRA Cost (2026)

<comparison-table data-caption="QLoRA Cost Estimates (2026)" data-headers="[&#34;Model&#34;,&#34;GPU&#34;,&#34;Time (10K samples)&#34;,&#34;Est. Cost&#34;]" data-rows="[{&#34;feature&#34;:&#34;Llama 3 8B&#34;,&#34;values&#34;:[&#34;1x RTX 4090 (24GB)&#34;,&#34;2-4 hours&#34;,&#34;$5-15 (RunPod)&#34;]},{&#34;feature&#34;:&#34;Llama 3 70B&#34;,&#34;values&#34;:[&#34;1x A100 80GB&#34;,&#34;8-12 hours&#34;,&#34;$50-150 (Modal/RunPod)&#34;]},{&#34;feature&#34;:&#34;Llama 4 70B&#34;,&#34;values&#34;:[&#34;1x H100 80GB&#34;,&#34;6-10 hours&#34;,&#34;$80-200&#34;]},{&#34;feature&#34;:&#34;Mixtral 8x7B&#34;,&#34;values&#34;:[&#34;1x A100 80GB&#34;,&#34;10-15 hours&#34;,&#34;$80-200&#34;]},{&#34;feature&#34;:&#34;Qwen 2.5 72B&#34;,&#34;values&#34;:[&#34;1x H100 80GB&#34;,&#34;8-12 hours&#34;,&#34;$120-250&#34;]}]"></comparison-table>

**Costs are training only.** Data prep, eval, and iteration usually add 2-5x to total.

## 6. DPO — Direct Preference Optimization

Published in 2023 by Rafailov et al., DPO offers a **much simpler mathematical formulation** than classic RLHF/PPO. The 2024-2026 modern alignment standard.

<definition-box data-term="DPO (Direct Preference Optimization)" data-definition="A method that, on a human preference dataset (chosen/rejected pairs), skips reward-model training and PPO steps and uses a supervised-style loss directly. Published in 2023 by Stanford and CMU researchers; dramatically reduces the operational complexity of classic RLHF. Has been the standard in the open-model ecosystem since 2024." data-also="Direct Preference Optimization"></definition-box>

### 6.1. PPO (Classic RLHF) vs DPO

<comparison-table data-caption="RLHF (PPO) vs DPO" data-headers="[&#34;Dimension&#34;,&#34;RLHF (PPO)&#34;,&#34;DPO&#34;]" data-rows="[{&#34;feature&#34;:&#34;Reward Model&#34;,&#34;values&#34;:[&#34;Required (separate training)&#34;,&#34;Not needed&#34;]},{&#34;feature&#34;:&#34;Pipeline stages&#34;,&#34;values&#34;:[&#34;3 (SFT + RM + PPO)&#34;,&#34;2 (SFT + DPO)&#34;]},{&#34;feature&#34;:&#34;Training stability&#34;,&#34;values&#34;:[&#34;Low (hyperparam sensitive)&#34;,&#34;High&#34;]},{&#34;feature&#34;:&#34;Compute cost&#34;,&#34;values&#34;:[&#34;~5x SFT&#34;,&#34;~1.5x SFT&#34;]},{&#34;feature&#34;:&#34;Code complexity&#34;,&#34;values&#34;:[&#34;High&#34;,&#34;Low&#34;]},{&#34;feature&#34;:&#34;Quality (frontier)&#34;,&#34;values&#34;:[&#34;Historically best&#34;,&#34;Equal or superior (recent research)&#34;]}]"></comparison-table>
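For reference, DPO's loss from Rafailov et al., in the same plain notation as the LoRA block above: a simple sigmoid cross-entropy on log-probability ratios against a frozen reference model.

<pre><code>L_DPO = −E_(x, y_w, y_l) [ log σ( β·log(π_θ(y_w|x)/π_ref(y_w|x))
                                − β·log(π_θ(y_l|x)/π_ref(y_l|x)) ) ]

y_w / y_l : chosen / rejected response
π_θ       : model being trained
π_ref     : frozen reference model (usually the SFT checkpoint)
β         : KL-control strength (typically 0.1-0.5)</code></pre>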

### 6.2. DPO Dataset Structure

You need **chosen/rejected** pairs.

<pre><code>{
  "prompt": "How would you respond to a customer complaint?",
  "chosen": "An empathetic, solution-focused, short, clear response...",
  "rejected": "A defensive, generic, overly long response..."
}</code></pre>

Usually 500-5,000 preference pairs suffice; quality matters more than quantity.
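With data in that shape, a DPO run in Hugging Face TRL takes only a few lines. A minimal sketch (argument names shift between TRL versions, and the checkpoint id is hypothetical):

<pre><code>from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("my-org/sft-checkpoint")  # hypothetical SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained("my-org/sft-checkpoint")
dataset = load_dataset("json", data_files="preference_pairs.jsonl")["train"]

args = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,             # KL-control strength
    learning_rate=5e-7,   # DPO runs at a much lower LR than SFT
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,         # None: TRL clones the model as the frozen reference
    args=args,
    train_dataset=dataset,  # columns: prompt / chosen / rejected
    processing_class=tokenizer,
)
trainer.train()</code></pre>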

### 6.3. DPO Derivatives (2024-2026)

After DPO, many derivatives appeared:

- **ORPO (Odds Ratio Preference Optimization)** — Combines SFT and preference optimization in one step. Hong et al. (2024).
- **KTO (Kahneman-Tversky Optimization)** — Uses **single-answer reward/penalty** signals instead of preference pairs. Ethayarajh et al. (2024).
- **IPO (Identity Preference Optimization)** — Regularization against DPO over-fitting. Azar et al. (2023).
- **CPO (Contrastive Preference Optimization)** — Stronger reject signal. Xu et al. (2024).
- **SimPO (Simple Preference Optimization)** — Drops the reference model entirely. Meng et al. (2024).

<callout-box data-variant="tip" data-title="Practical Selection Guide">

For **standard enterprise fine-tuning**: **SFT + DPO** is the most stable 2026 choice.

For **combining SFT and DPO in one stage**: **ORPO**.

If **producing dual responses is expensive** (preference pairs hard to make): **KTO** (single-answer + binary feedback).

PPO is valuable only for academic research or frontier-model training — not worth the complexity for enterprise products.

</callout-box>

## 7. Practical Fine-Tuning Pipeline

A 7-stage pipeline from zero to production:

<howto-steps data-name="Production Fine-Tuning Pipeline — 7 Stages" data-description="A step-by-step path from zero to production-quality fine-tuning." data-time="P30D" data-steps="[{&#34;name&#34;:&#34;1. Use-Case Definition + Baseline&#34;,&#34;text&#34;:&#34;Why fine-tuning? How well does prompt + RAG work? Define baseline metrics.&#34;},{&#34;name&#34;:&#34;2. Data Collection&#34;,&#34;text&#34;:&#34;500-10,000 high-quality samples. Manual labeling, cleaned from existing data, or synthetic (a large model teaching a smaller one).&#34;},{&#34;name&#34;:&#34;3. Data Cleaning + QA&#34;,&#34;text&#34;:&#34;Dedupe, fix labels, strip PII (KVKK). Split train/val/test (usually 80/10/10).&#34;},{&#34;name&#34;:&#34;4. Format + Tokenization&#34;,&#34;text&#34;:&#34;Chat template (Llama, Mistral, ChatML), system prompt structure, sequence length, tokenizer checks.&#34;},{&#34;name&#34;:&#34;5. Training&#34;,&#34;text&#34;:&#34;Framework choice (Unsloth, Axolotl, LLaMA Factory). Hyperparams: learning rate (1e-4 LoRA, 5e-5 SFT), batch size, epochs (1-3), LoRA r/alpha. Cloud GPU or local.&#34;},{&#34;name&#34;:&#34;6. Evaluation&#34;,&#34;text&#34;:&#34;Automated metrics (perplexity, BLEU, custom) + LLM-as-judge + human eval. Pre-production eval set is mandatory.&#34;},{&#34;name&#34;:&#34;7. Deployment&#34;,&#34;text&#34;:&#34;Serve via vLLM, TGI, or Ollama. A/B test (existing vs fine-tune). Monitor performance + cost.&#34;}]"></howto-steps>

### 7.1. Training Frameworks

<comparison-table data-caption="2026 Fine-Tuning Framework Comparison" data-headers="[&#34;Framework&#34;,&#34;Speed&#34;,&#34;Ease&#34;,&#34;Scope&#34;]" data-rows="[{&#34;feature&#34;:&#34;Unsloth&#34;,&#34;values&#34;:[&#34;2-5x fast (Triton optimization)&#34;,&#34;High (simple Python)&#34;,&#34;LoRA, QLoRA, SFT, DPO&#34;]},{&#34;feature&#34;:&#34;Axolotl&#34;,&#34;values&#34;:[&#34;Standard&#34;,&#34;Medium (YAML config)&#34;,&#34;Full spectrum, including full FT&#34;]},{&#34;feature&#34;:&#34;LLaMA Factory&#34;,&#34;values&#34;:[&#34;Standard&#34;,&#34;High (CLI + UI)&#34;,&#34;LoRA, QLoRA, RLHF, DPO, ORPO, KTO&#34;]},{&#34;feature&#34;:&#34;Hugging Face TRL&#34;,&#34;values&#34;:[&#34;Standard&#34;,&#34;Medium (Python library)&#34;,&#34;Full spectrum, latest techniques&#34;]},{&#34;feature&#34;:&#34;Together / Replicate / Modal&#34;,&#34;values&#34;:[&#34;Cloud&#34;,&#34;Very high (managed)&#34;,&#34;LoRA, limited control&#34;]},{&#34;feature&#34;:&#34;OpenAI Fine-tuning API&#34;,&#34;values&#34;:[&#34;Cloud&#34;,&#34;Very high&#34;,&#34;SFT + limited DPO, closed-source&#34;]}]"></comparison-table>

**Practical pick.** **Unsloth** for developers/researchers (speed + ease). **LLaMA Factory** for production teams (broad scope). **Together** or **Modal** for cloud ease. **Axolotl + self-hosted GPU** for compliance-critical enterprises.
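As a concrete reference for stage 5 of the pipeline, a QLoRA SFT run with Unsloth looks roughly like this (a sketch following Unsloth's documented notebook pattern; argument layout shifts between TRL versions):

<pre><code>from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit base
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("json", data_files="train.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,       # expects a "text" column with chat-templated samples
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="sft-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,      # the LoRA LR from the pipeline above
        num_train_epochs=2,
        optim="paged_adamw_8bit",
        logging_steps=10,
    ),
)
trainer.train()</code></pre>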

### 7.2. Data Preparation — The Invisible Success Factor

**Data quality determines 70% of the fine-tune outcome.** Training is only the last step. Practical advice:

- Manual labeling beats synthetic data on quality but costs 10-50x more.
- Use modern data-prep tooling: Self-Instruct, DataDreamer, Distilabel, Lilac.
- Keep the eval set strictly isolated from the training set.
- Ensure class balance.

## 8. Turkish Fine-Tuning — Practical Notes

5 key nuances absent from global guides:

### 8.1. Tokenizer Efficiency

Turkish's agglutinative morphology means a single word often becomes 2-5 tokens in typical English-centric tokenizers. In fine-tuning that means roughly 2x the required sequence length, 30-50% higher training cost, and less content fitting in the context window.

**Fix:** a Turkish-specific tokenizer (BERTurk) or vocabulary extension. Adding 3K-5K Turkish tokens to a Llama/Mistral BPE vocabulary improves Turkish token efficiency by 30-50%.
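Measuring the gap is cheap; before committing to a base model, count tokens per word with the candidate tokenizers (a minimal sketch; the sentence is illustrative):

<pre><code>from transformers import AutoTokenizer

sentence = "Kişisel verilerinizin korunması bizim için önceliklidir."
words = len(sentence.split())

for name in ["Qwen/Qwen2.5-7B", "dbmdz/bert-base-turkish-cased"]:  # BERTurk
    tok = AutoTokenizer.from_pretrained(name)
    n = len(tok.tokenize(sentence))
    print(f"{name}: {n} tokens ({n / words:.1f} per word)")</code></pre>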

### 8.2. Turkish Dataset Sources

Belebele Turkish, Cosmos QA TR, xCOPA Turkish, WMT translation pairs, Wikipedia Turkish, MultiWOZ TR, Hugging Face Turkish datasets (100+), Cezeri instruction-tuning data, plus your enterprise data (most valuable).

### 8.3. Base Model Selection (For Turkish)

<comparison-table data-caption="Base Models for Turkish Fine-Tuning" data-headers="[&#34;Model&#34;,&#34;Turkish Score&#34;,&#34;Size&#34;,&#34;License&#34;,&#34;Fine-tune Friendly&#34;]" data-rows="[{&#34;feature&#34;:&#34;Llama 4 8B&#34;,&#34;values&#34;:[&#34;Medium-good&#34;,&#34;8B&#34;,&#34;Meta open&#34;,&#34;High&#34;]},{&#34;feature&#34;:&#34;Llama 4 70B&#34;,&#34;values&#34;:[&#34;Good&#34;,&#34;70B&#34;,&#34;Meta open&#34;,&#34;High&#34;]},{&#34;feature&#34;:&#34;Mistral Small 3&#34;,&#34;values&#34;:[&#34;Good&#34;,&#34;22B&#34;,&#34;Apache 2.0&#34;,&#34;High&#34;]},{&#34;feature&#34;:&#34;Qwen 2.5 14B&#34;,&#34;values&#34;:[&#34;High (multilingual)&#34;,&#34;14B&#34;,&#34;Apache 2.0&#34;,&#34;High&#34;]},{&#34;feature&#34;:&#34;Qwen 2.5 72B&#34;,&#34;values&#34;:[&#34;Very high&#34;,&#34;72B&#34;,&#34;Apache 2.0&#34;,&#34;High&#34;]},{&#34;feature&#34;:&#34;DeepSeek V3&#34;,&#34;values&#34;:[&#34;High&#34;,&#34;671B (MoE)&#34;,&#34;MIT&#34;,&#34;Medium (large)&#34;]},{&#34;feature&#34;:&#34;BERTurk&#34;,&#34;values&#34;:[&#34;Excellent (NLP)&#34;,&#34;Base&#34;,&#34;MIT&#34;,&#34;For NLP tasks&#34;]}]"></comparison-table>

**Practical pick.** General Turkish instruction-tune: **Qwen 2.5 14B** or **Llama 4 8B/70B**. NLP-specific: **BERTurk**.

### 8.4. Turkish Style Locking

"siz" vs "sen", tone (formal/informal), regional dialects, sentence-order preferences — must be controlled in fine-tuning. Editor-level quality QA is mandatory.

### 8.5. Domain-Specific Turkish Examples

Turkish law (TBK, TMK, KVKK + case law), tax (VUK, VAT, GVK), health (anonymized medical reports), e-commerce (Trendyol/Hepsiburada catalogs), banking (BDDK + customer interactions).

## 9. Hardware, Cloud, Cost

### 9.1. GPU Choice (2026)

<comparison-table data-caption="GPU Options for Fine-Tuning (2026)" data-headers="[&#34;GPU&#34;,&#34;VRAM&#34;,&#34;Typical Cloud Price (USD/hr)&#34;,&#34;Max Model with QLoRA&#34;]" data-rows="[{&#34;feature&#34;:&#34;RTX 4090&#34;,&#34;values&#34;:[&#34;24GB&#34;,&#34;$0.40-0.80&#34;,&#34;7B-13B&#34;]},{&#34;feature&#34;:&#34;RTX 5090&#34;,&#34;values&#34;:[&#34;32GB&#34;,&#34;$0.60-1.20&#34;,&#34;13B-22B&#34;]},{&#34;feature&#34;:&#34;A100 40GB&#34;,&#34;values&#34;:[&#34;40GB&#34;,&#34;$1.20-2.00&#34;,&#34;13B-34B&#34;]},{&#34;feature&#34;:&#34;A100 80GB&#34;,&#34;values&#34;:[&#34;80GB&#34;,&#34;$1.80-3.50&#34;,&#34;34B-70B&#34;]},{&#34;feature&#34;:&#34;H100 80GB&#34;,&#34;values&#34;:[&#34;80GB&#34;,&#34;$3.50-6.00&#34;,&#34;34B-70B (fast)&#34;]},{&#34;feature&#34;:&#34;H200&#34;,&#34;values&#34;:[&#34;141GB&#34;,&#34;$5-9&#34;,&#34;70B+ (comfortable)&#34;]},{&#34;feature&#34;:&#34;GB200/B200 (Blackwell)&#34;,&#34;values&#34;:[&#34;192GB&#34;,&#34;$8-15&#34;,&#34;100B+ MoE&#34;]}]"></comparison-table>

### 9.2. Cloud Platforms

**Modal** (Python-native, pay-as-you-go), **RunPod** (cheapest spot), **Together AI** (managed FT + inference), **Replicate** (ready templates), **AWS SageMaker / GCP Vertex AI / Azure ML** (enterprise), **Lambda Cloud** (on-demand H100/H200).

### 9.3. Typical Cost Scenarios

- **Turkish style alignment, Llama 4 8B QLoRA, 5K samples:** ~$15-40 training + ~$50-100 data + ~$30 eval = **~$100-200 total**
- **Domain-specific Mistral Small 3 fine-tune, 20K samples:** ~$80-200 training + ~$300-800 data + ~$100 eval = **~$500-1,200**
- **Llama 4 70B QLoRA + DPO, 50K samples:** ~$300-600 training (2 phases) + $1,000-3,000 data + $200-500 eval = **~$2,000-5,000**

**Reminder:** data prep + eval is 60-70% of cost. GPU hours are the smallest line item.
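The same arithmetic as a reusable back-of-envelope helper (all inputs are rough estimates drawn from the scenarios above, not quotes):

<pre><code>def finetune_budget(gpu_hours: float, gpu_rate_usd: float,
                    data_usd: float, eval_usd: float) -> dict:
    """Back-of-envelope total cost; training is usually the smallest line item."""
    training = gpu_hours * gpu_rate_usd
    total = training + data_usd + eval_usd
    return {"training": training, "total": total,
            "training_share": round(training / total, 2)}

# Scenario 1 above: Llama 4 8B QLoRA, 5K samples, several runs on an RTX 4090
print(finetune_budget(gpu_hours=30, gpu_rate_usd=0.8, data_usd=75, eval_usd=30))
# {'training': 24.0, 'total': 129.0, 'training_share': 0.19}</code></pre>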

## 10. Case Studies (Anonymized Turkish Enterprises)

### Case 1 — Turkish Bank: Turkish Legal Document Assistant

**Problem.** Contract analysis on GPT-5 missed Turkish legal jargon (TBK, TMK references, court vocabulary).

**Solution.** Llama 4 70B QLoRA fine-tune:

- **Data:** 8,000 anonymized contracts + 3,000 Turkish Supreme Court decisions + 2,000 legal Q&A pairs
- **Method:** SFT + DPO (lawyers ranked 1,500 response pairs)
- **Duration:** 6 weeks (4 weeks data, 2 weeks training + eval)
- **Cost:** ~$8,000 (with labeling)

**Result.** Turkish legal accuracy 72% → 91%. Contract analysis time per lawyer 14 hours/week → 5 hours.

### Case 2 — E-Commerce: Category Classification + Description

**Problem.** Manual category selection + Turkish description writing took hours per new product. Prompt engineering on GPT-4o-mini was insufficient (12,000 sub-categories).

**Solution.** Qwen 2.5 14B QLoRA fine-tune:

- **Data:** 250,000 existing products (name + description → category + tags + SEO description)
- **Method:** SFT (DPO not needed)
- **Training:** 2x A100 80GB, 18 hours
- **Cost:** ~$1,200

**Result.** Category classification accuracy 78% → 96%. Average human-intervention time per product 15 min → 1 min. 80K products processed monthly at 90% lower cost than the ChatGPT API (self-hosted Qwen + LoRA).

### Case 3 — Healthcare: Medical-Report Structuring

**Problem.** Converting clinical notes to structured format (ICD-10 codes, diagnosis + treatment + medication) was 80% accurate on GPT-5; healthcare needs 95%+.

**Solution.** Mistral Small 3 ORPO fine-tune:

- **Data:** 15,000 anonymized clinical notes + expert-physician-approved structured outputs
- **Method:** ORPO (SFT + DPO in one stage)
- **KVKK safeguards:** all patient data anonymized; on-prem training; audit-logged eval
- **Cost:** ~$3,500 (with physician labeling)

**Result.** Medical-structuring accuracy 97%. KVKK + health regulation compliance. Enabled B2B integration with Turkish insurers.

## 11. Common Mistakes and Anti-Patterns

### 11.1. "Fine-Tune First, Ask Questions Later"

The most common mistake. Always **eval prompt + RAG first**; know how well those two layers do before reaching for fine-tuning.

### 11.2. Training with Too Little Data

Trying a style fine-tune with under 500 samples usually fails. Minimum: 1,000 high-quality samples; ideal: 5,000-10,000.

### 11.3. Catastrophic Forgetting

Wrong learning rate (too high) or too many epochs (3+) breaks the model's base capabilities. Track general benchmark performance during training.

### 11.4. Test Set Leakage

If part of the training data leaks into eval, the fine-tune score is artificially inflated but fails in production. Split at cleanup; never mix during training.

### 11.5. KVKK-Non-Compliant Data

Fine-tuning with prompts that contain customer/employee personal data. **KVKK breach + the learned personal data becomes embedded in model weights.** Always anonymize.

### 11.6. No Versioning

Not versioning fine-tune adapters and datasets. Use **HF Hub, W&B, MLflow** to track every experiment.

### 11.7. Shipping Without Eval

"Loss went down — it works" before going live. Loss is not eval; measure actual task success with an eval set.

### 11.8. Wrong Base Model Choice

Fine-tuning an English-only model for Turkish tasks. The base model **should already know Turkish**; fine-tuning adapts it to your domain, not teaches it Turkish from scratch.

## 12. Fine-Tuning vs Distillation

**Distillation** — training a small model (student) on the outputs of a large model (teacher). The most practical fine-tune pattern of 2025-2026 (a minimal sketch follows the steps):

1. Generate synthetic data with a large model (Claude Opus 4.7)
2. SFT the small model (Llama 4 8B) on that data
3. Small model = cheap + fast + 85-90% of the large model's quality
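
A minimal sketch of the data side of steps 1-2 (the <code>teacher_generate</code> wrapper is hypothetical; wire it to whichever frontier-model API you use, and the prompts/file names are illustrative):

<pre><code>import json

def teacher_generate(prompt: str) -> str:
    """Hypothetical wrapper around your frontier-model API (OpenAI, Anthropic, ...)."""
    raise NotImplementedError

seed_tasks = ["Summarize this contract clause...", "Classify this support ticket..."]

with open("distill_train.jsonl", "w", encoding="utf-8") as f:
    for task in seed_tasks:
        answer = teacher_generate(task)  # teacher produces the gold response
        f.write(json.dumps({"instruction": task, "output": answer},
                           ensure_ascii=False) + "\n")
# Step 2: run this JSONL through the SFT pipeline in Section 7 on the student model.</code></pre>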

## 13. Modern Fine-Tuning Trends (2026)

- **Synthetic-data dominance** — generation with GPT-5/Claude/Gemini instead of human labeling
- **Distillation everywhere** — knowledge transfer from frontier to small models
- **Self-Reward models** — the model rates its own outputs to create training data
- **Verifier models** — automatic quality control on fine-tune outputs
- **RLAIF (RL from AI Feedback)** — another AI's preferences instead of humans
- **Continual learning** — keeping the model updated without catastrophic forgetting
- **PEFT advances** — DoRA, MoRA, LoftQ; 2024-2025 improvements over LoRA

## 14. KVKK-Compliant Fine-Tuning

### 14.1. Risks

- **Data embeds in the model** — practically impossible to "delete" after fine-tuning
- **Membership inference attacks** — training-set membership can be inferred from outputs
- **Data leakage** — the model sometimes regurgitates training data almost verbatim

### 14.2. Mitigations

1. **Anonymization** — strip PII (national ID, name, phone, email); see the sketch after this list
2. **Differential privacy** — add noise during training (quality vs privacy trade-off)
3. **Federated learning** — train without centralizing data (advanced)
4. **Data residency** — train on Turkey or EU GPUs
5. **Audit logs** — which data was used in which training
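
A minimal sketch of regex-based anonymization for Turkish training data (illustrative patterns only; production pipelines add NER-based detection on top):

<pre><code>import re

PATTERNS = {
    "[TCKN]":  re.compile(r"\b[1-9]\d{10}\b"),  # 11-digit Turkish national ID
    "[PHONE]": re.compile(r"\b(?:\+90|0)?\s?5\d{2}[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}\b"),
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def anonymize(text: str) -> str:
    # Apply in insertion order so the ID pattern runs before the phone pattern
    for token, pattern in PATTERNS.items():
        text = pattern.sub(token, text)
    return text

print(anonymize("Müşteri 12345678901, tel 0532 123 45 67, ali@example.com"))
# -> "Müşteri [TCKN], tel [PHONE], [EMAIL]"</code></pre>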

### 14.3. Under the EU AI Act

If the fine-tuned model is **high-risk** (credit scoring, HR selection, etc.):

- Technical documentation (Annex IV)
- Training-data governance
- Risk assessment
- Human oversight
- Conformity assessment

See our compliance guide on this site for details.

## 15. Frequently Asked Questions

<callout-box data-variant="answer" data-title="Fine-tune or RAG?">

**Try RAG first.** Fine-tune only for: (a) style/format/behavior locking, (b) teaching a small model a large model's behavior for low latency, (c) Turkish domain language (law, medicine), (d) guaranteed structured output. For knowledge base + fresh data, RAG is always faster/cheaper.

</callout-box>

<callout-box data-variant="answer" data-title="LoRA, QLoRA, or full FT?">

**QLoRA in 95% of cases.** Only full FT if: (a) working on a frontier model with large GPUs, (b) you genuinely need every bit of quality. LoRA (without quantization) when 16-bit GPU suffices and speed matters.

</callout-box>

<callout-box data-variant="answer" data-title="DPO, ORPO, or KTO?">

**DPO** is the standard enterprise pick. **ORPO** combines SFT + DPO into one stage. **KTO** when producing dual responses is expensive. In 2026, DPO or ORPO covers most needs.

</callout-box>

<callout-box data-variant="answer" data-title="Which base model should I start with?">

For Turkish: **Qwen 2.5 14B** or **Llama 4 8B/70B**. Prefer Apache 2.0/MIT licenses for commercial use. Do not pick without an eval set.

</callout-box>

<callout-box data-variant="answer" data-title="How much data is enough?">

Style alignment: 1,000-3,000 high-quality samples; domain knowledge: 5,000-15,000; behavior change: 10,000+. Quality > quantity, always.

</callout-box>

<callout-box data-variant="answer" data-title="What does fine-tuning cost?">

Typical Turkish range: **$200-$5,000** (model size + data labeling + eval). Synthetic data can cut cost by 60%. Data labeling is usually the most expensive line item.

</callout-box>

<callout-box data-variant="answer" data-title="How do I deploy a fine-tuned model?">

**vLLM** (fastest, production-grade), **TGI** (Hugging Face), **Ollama** (easy self-hosted), **LMDeploy** (TensorRT-LLM-based). LoRA adapters can be merged into the base model or loaded at runtime.

</callout-box>
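
For the runtime-loading path, vLLM's Python API looks roughly like this (a sketch; the adapter name and path are hypothetical):

<pre><code>from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True)

out = llm.generate(
    ["Müşteri şikayetine nasıl yanıt verirsiniz?"],
    SamplingParams(max_tokens=256),
    lora_request=LoRARequest("support-style", 1, "/adapters/support-style"),
)
print(out[0].outputs[0].text)</code></pre>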

<callout-box data-variant="answer" data-title="How do I prevent catastrophic forgetting?">

Low learning rate (1e-4 LoRA, 5e-5 SFT), few epochs (1-3), general-benchmark eval during training (MMLU, HumanEval), prefer LoRA (less forgetting than full FT). Mixed batches (new + general data) help.

</callout-box>

<callout-box data-variant="answer" data-title="Should I use OpenAI or Anthropic fine-tuning APIs?">

OpenAI offers SFT + limited DPO via API; easy, but the weights stay on OpenAI's servers and it is expensive. Anthropic has no public fine-tuning API (limited Enterprise availability). Self-hosted is usually better for KVKK compliance + cost control.

</callout-box>

<callout-box data-variant="answer" data-title="How do I evaluate a fine-tuned model?">

3 layers: **(1)** automated metrics (perplexity, exact match, BLEU/ROUGE for translation/summarization), **(2)** LLM-as-judge (pairwise compare with GPT-5 / Claude Opus 4.7), **(3)** human evaluation (50-200 samples). Combined they give reliable signal.

</callout-box>

<callout-box data-variant="answer" data-title="How safe is synthetic data?">

Synthetic data is widespread and effective in 2026. Risks: **(a)** teacher-model biases transfer, **(b)** diversity may shrink (model collapse). Hybrid recommended: 70% synthetic + 30% human-labeled.

</callout-box>

<callout-box data-variant="answer" data-title="Does the model size grow after fine-tuning?">

LoRA/QLoRA: no. The adapter alone is ~50MB-1GB; merged into the base model, the total stays at the base size. Full FT: the output is the same size as the base (~140GB for Llama 70B).

</callout-box>

<callout-box data-variant="answer" data-title="How do I manage LoRA adapters?">

Version with **Hugging Face Hub** (private repo), **MLflow Model Registry**, **W&B Artifacts**. vLLM and TGI support multi-adapter loading at runtime — swap 10 different LoRAs on one model quickly.

</callout-box>

<callout-box data-variant="answer" data-title="For Turkish, BERTurk or fine-tune an LLM?">

Depends on task: **classic NLP** (classification, NER, sentiment) → BERTurk (small + fast + enough). **Generative tasks** (writing, translation, Q&A) → fine-tune an LLM (Qwen, Llama, Mistral).

</callout-box>

<callout-box data-variant="answer" data-title="Can I automate fine-tuning?">

Yes. **Continuous fine-tuning** pipeline: collect user feedback → monitor eval scores → retrain automatically when below threshold → A/B test → rollout. MLflow + Argo Workflows + Modal/Together is a practical combo.

</callout-box>

## 16. Next Steps

To shape LLM fine-tuning strategy in your company or move an existing fine-tune to production quality:

1. **Fine-Tune Use-Case Assessment.** Is fine-tuning really needed? Is RAG/prompt enough? Investment math + 4-hour workshop.
2. **Data + Pipeline Setup.** Turkish data collection, labeling strategy, training-platform choice, eval harness — end-to-end pipeline design.
3. **Production Fine-Tune Audit.** For existing fine-tunes: 360° audit on quality, KVKK compliance, cost, observability.

Reach out via the contact form.

<references-list data-items="[{&#34;title&#34;:&#34;LoRA: Low-Rank Adaptation of Large Language Models&#34;,&#34;url&#34;:&#34;https://arxiv.org/abs/2106.09685&#34;,&#34;author&#34;:&#34;Hu et al.&#34;,&#34;publishedAt&#34;:&#34;2021-06&#34;,&#34;publisher&#34;:&#34;Microsoft Research&#34;},{&#34;title&#34;:&#34;QLoRA: Efficient Finetuning of Quantized LLMs&#34;,&#34;url&#34;:&#34;https://arxiv.org/abs/2305.14314&#34;,&#34;author&#34;:&#34;Dettmers et al.&#34;,&#34;publishedAt&#34;:&#34;2023-05&#34;,&#34;publisher&#34;:&#34;University of Washington&#34;},{&#34;title&#34;:&#34;DPO: Your Language Model is Secretly a Reward Model&#34;,&#34;url&#34;:&#34;https://arxiv.org/abs/2305.18290&#34;,&#34;author&#34;:&#34;Rafailov et al.&#34;,&#34;publishedAt&#34;:&#34;2023-05&#34;,&#34;publisher&#34;:&#34;Stanford&#34;},{&#34;title&#34;:&#34;ORPO: Monolithic Preference Optimization without Reference Model&#34;,&#34;url&#34;:&#34;https://arxiv.org/abs/2403.07691&#34;,&#34;author&#34;:&#34;Hong et al.&#34;,&#34;publishedAt&#34;:&#34;2024-03&#34;,&#34;publisher&#34;:&#34;KAIST&#34;},{&#34;title&#34;:&#34;KTO: Model Alignment as Prospect Theoretic Optimization&#34;,&#34;url&#34;:&#34;https://arxiv.org/abs/2402.01306&#34;,&#34;author&#34;:&#34;Ethayarajh et al.&#34;,&#34;publishedAt&#34;:&#34;2024-02&#34;,&#34;publisher&#34;:&#34;Stanford&#34;},{&#34;title&#34;:&#34;IPO: A General Theoretical Paradigm&#34;,&#34;url&#34;:&#34;https://arxiv.org/abs/2310.12036&#34;,&#34;author&#34;:&#34;Azar et al.&#34;,&#34;publishedAt&#34;:&#34;2023-10&#34;,&#34;publisher&#34;:&#34;Google DeepMind&#34;},{&#34;title&#34;:&#34;InstructGPT: Training language models with human feedback&#34;,&#34;url&#34;:&#34;https://arxiv.org/abs/2203.02155&#34;,&#34;author&#34;:&#34;Ouyang et al.&#34;,&#34;publishedAt&#34;:&#34;2022-03&#34;,&#34;publisher&#34;:&#34;OpenAI&#34;},{&#34;title&#34;:&#34;DoRA: Weight-Decomposed Low-Rank Adaptation&#34;,&#34;url&#34;:&#34;https://arxiv.org/abs/2402.09353&#34;,&#34;author&#34;:&#34;Liu et al.&#34;,&#34;publishedAt&#34;:&#34;2024-02&#34;,&#34;publisher&#34;:&#34;NVIDIA&#34;},{&#34;title&#34;:&#34;Constitutional AI: Harmlessness from AI Feedback&#34;,&#34;url&#34;:&#34;https://arxiv.org/abs/2212.08073&#34;,&#34;author&#34;:&#34;Bai et al.&#34;,&#34;publishedAt&#34;:&#34;2022-12&#34;,&#34;publisher&#34;:&#34;Anthropic&#34;},{&#34;title&#34;:&#34;Self-Instruct: Aligning Language Models with Self-Generated Instructions&#34;,&#34;url&#34;:&#34;https://arxiv.org/abs/2212.10560&#34;,&#34;author&#34;:&#34;Wang et al.&#34;,&#34;publishedAt&#34;:&#34;2022-12&#34;,&#34;publisher&#34;:&#34;University of Washington&#34;},{&#34;title&#34;:&#34;Unsloth Documentation&#34;,&#34;url&#34;:&#34;https://unsloth.ai/&#34;,&#34;author&#34;:&#34;Unsloth AI&#34;,&#34;publishedAt&#34;:&#34;2025&#34;,&#34;publisher&#34;:&#34;Unsloth&#34;},{&#34;title&#34;:&#34;Hugging Face TRL&#34;,&#34;url&#34;:&#34;https://huggingface.co/docs/trl/&#34;,&#34;author&#34;:&#34;Hugging Face&#34;,&#34;publishedAt&#34;:&#34;2025&#34;,&#34;publisher&#34;:&#34;Hugging Face&#34;},{&#34;title&#34;:&#34;Axolotl&#34;,&#34;url&#34;:&#34;https://github.com/axolotl-ai-cloud/axolotl&#34;,&#34;author&#34;:&#34;Axolotl&#34;,&#34;publishedAt&#34;:&#34;2025&#34;,&#34;publisher&#34;:&#34;Axolotl&#34;},{&#34;title&#34;:&#34;LLaMA Factory&#34;,&#34;url&#34;:&#34;https://github.com/hiyouga/LLaMA-Factory&#34;,&#34;author&#34;:&#34;LLaMA Factory&#34;,&#34;publishedAt&#34;:&#34;2025&#34;,&#34;publisher&#34;:&#34;GitHub&#34;},{&#34;title&#34;:&#34;KVKK - Law No. 6698&#34;,&#34;url&#34;:&#34;https://www.kvkk.gov.tr/&#34;,&#34;author&#34;:&#34;Republic of Turkiye - KVKK&#34;,&#34;publishedAt&#34;:&#34;2016&#34;,&#34;publisher&#34;:&#34;Republic of Turkiye&#34;},{&#34;title&#34;:&#34;EU AI Act&#34;,&#34;url&#34;:&#34;https://artificialintelligenceact.eu/&#34;,&#34;author&#34;:&#34;European Commission&#34;,&#34;publishedAt&#34;:&#34;2024-03&#34;,&#34;publisher&#34;:&#34;EU&#34;}]"></references-list>

---

This is a living document; the fine-tuning ecosystem (new methods, frameworks, base models) shifts every quarter, so it is **updated quarterly**.