LLM Fine-Tuning: A Comprehensive 2026 Guide to LoRA, QLoRA, DPO, and Modern Alignment
The most current and detailed 2026 guide, written for Turkish teams, to adapting an LLM to your domain. Covers when fine-tuning is necessary, the math behind LoRA, 4-bit training with QLoRA, why DPO beats PPO, modern alternatives (ORPO/KTO/IPO), Turkish dataset sources, GPU/cloud cost modeling, production pipelines, three anonymized Turkish enterprise case studies, and KVKK-compliant training. For developers, MLOps engineers, and AI architects.
One-line answer: Fine-tuning is the advanced AI-engineering discipline that, in the right situations — when RAG and prompt engineering fall short — permanently bends an LLM's behavior toward your organization's DNA.
- Fine-tuning is the additional training that locks specific dimensions of an LLM — style, format, behavior, domain knowledge — without changing its core capabilities. It is the right answer for ~5% of needs.
- LoRA (Low-Rank Adaptation) trains small adapter matrices instead of full weights; with 0.1-1% of parameters updated, it delivers 90-95% of full fine-tuning quality.
- QLoRA pairs LoRA with 4-bit quantization, making a 70B model fine-tunable on a single A100 GPU — the engine behind the post-2023 personal/small-team fine-tuning boom.
- DPO (Direct Preference Optimization) replaces classic RLHF's PPO + reward-model loop with a simple supervised loss on preference pairs; the 2024-2026 modern alignment standard.
- For Turkish enterprises, fine-tuning typically costs $200-$5,000; data preparation determines 70% of cost and quality — training is only the last step.
1. What is Fine-Tuning and When is it Necessary?
Three main strategies adapt LLMs to your use case: prompt engineering, RAG, and fine-tuning. The first two leave the model unchanged; fine-tuning updates model weights through additional training. In the right situations, it produces enormous value; in the wrong ones, it is a waste of money.
- Fine-Tuning
- The process of updating a pretrained language model's (foundation model's) weights via additional training on a custom dataset and task. Aligns the model to a specific domain, style, format, or behavior while preserving the existing knowledge base. Covers methods like full fine-tuning, LoRA, QLoRA, DPO, and ORPO.
- Also known as: Model Adaptation
When to Fine-Tune?
A practical decision framework:
| Need | Prompt Eng | RAG | Fine-tuning |
|---|---|---|---|
| Lock in style/format | Partial | - | Ideal |
| Add domain knowledge | - | Ideal | Limited |
| Access fresh data | - | Ideal | - |
| Teach new behavior | Partial | - | Ideal |
| Reduce latency | - | - | Yes (small model) |
| Save tokens | - | - | Ideal |
| Setup time | Hours | Weeks | Weeks-months |
| Cost | Very low | Medium | High (one-time) |
Practical rule. 70% of needs are solved by prompt engineering, 25% more by prompt + RAG. The remaining 5% is where fine-tuning produces real value: locking in style/format, guaranteed structured output, lowering latency/cost (distillation), domain-specific language (Turkish law, medicine), and new behavior (agent tasks, tool use).
Why Try Prompt and RAG First?
Fine-tuning has five side effects: high upfront cost (GPU hours, data, evals), model "freezing" (re-do work on each new base model), catastrophic forgetting risk, data-management complexity (KVKK + IP + quality), and harder evaluation. That is why OpenAI, Anthropic, and Google all officially recommend prompt + RAG first, fine-tuning later.
2. The Full LLM Training Pipeline
A modern LLM goes through four training stages, each with a distinct purpose, dataset type, and cost.
| Stage | Purpose | Data Type | Time/Cost |
|---|---|---|---|
| 1. Pretraining | General language | Trillions of tokens (internet, books, code) | Months, millions $ |
| 2. Supervised Fine-Tuning (SFT) | Instruction following | Thousands of high-quality Q&A pairs | Days, thousands $ |
| 3. Preference Optimization (RLHF/DPO/ORPO) | Human preference | Preference pairs (A > B) | Days, thousands $ |
| 4. Continued Fine-tuning (yours) | Domain/style alignment | Hundreds-thousands of examples | Hours-days, $50-5,000 |
Enterprise fine-tuning usually happens at Stage 4.
Supervised Fine-Tuning (SFT)
The most basic form — standard next-token prediction training on instruction-response pairs. Most enterprise fine-tunes are SFT (style, format, domain knowledge).
Preference Optimization
Human evaluators see two responses (A, B) for the same prompt and mark the better one. The model is then pushed toward "good" responses via:
- RLHF (PPO) — classic; trains a reward model and applies PPO. Complex and resource-heavy.
- DPO — skips the reward model; supervised loss directly on preference pairs. Simple, effective, the standard since 2024.
- ORPO / KTO / IPO — derivatives and alternatives detailed below.
3. PEFT — Parameter-Efficient Fine-Tuning
Fully fine-tuning a 70B-parameter model means updating all 70B weights, which requires 800GB+ of VRAM; only large labs operate at that scale. PEFT solves this by updating only a small subset of parameters.
- PEFT (Parameter-Efficient Fine-Tuning)
- A family of techniques that fine-tune a small subset of parameters rather than the entire weights of pretrained large models. Includes LoRA, QLoRA, AdaLoRA, IA-3, Prefix Tuning, Prompt Tuning. Reduces compute by 10-100x with typically only 5-10% quality drop.
- Also known as: Parameter-Efficient Fine-Tuning
PEFT members: LoRA, QLoRA, AdaLoRA, IA-3, Prefix Tuning, Prompt Tuning, DoRA (2024), MoRA (2024).
4. LoRA — Low-Rank Adaptation
Published in 2021 by Microsoft researchers (Hu et al.), LoRA has become the gold standard of modern fine-tuning.
4.1. Math (Brief)
In full fine-tuning, a weight matrix W (e.g., 4096×4096) is updated directly: W_new = W + ΔW. LoRA's assumption: ΔW can be low-rank.
LoRA expresses ΔW as the product of two small matrices:
ΔW ≈ B × A, where B is 4096 × r, A is r × 4096, and r << 4096 (usually 4, 8, 16, 32, or 64).
Only A and B are updated during training; the original W stays frozen. At inference, W + B × A is computed (or merged into W).
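To make the shapes concrete, here is a minimal, illustrative PyTorch sketch of the idea (not the peft library's actual implementation): the original weight is frozen and only the two small matrices receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with trainable low-rank matrices A and B."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze original W
        self.scaling = alpha / r                          # the (α/r) factor
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # ΔW starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W·x + (α/r)·(B·A)·x: only lora_A and lora_B receive gradients
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
out = layer(torch.randn(2, 4096))                         # shape (2, 4096)
```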
4.2. LoRA Hyperparameters
Rank (r) — size of LoRA matrices. Common: 8 (default), 16, 32, 64. Higher rank = more capacity but overfitting risk.
Alpha (α) — scaling factor. ΔW_effective = (α/r) × B × A. Practical: α = 2r.
Target modules — which layers get LoRA?
- q_proj, v_proj — attention query/value only (minimal)
- q_proj, k_proj, v_proj, o_proj — all attention
- q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj — attention + MLP (most thorough)
Tip. Targeting all linear layers gives the best results; attention-only setups lose 5-10% quality on most tasks.
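In practice these hyperparameters map directly onto Hugging Face peft's LoraConfig. The sketch below is illustrative: the base-model name is an assumption and the module names follow Llama-style conventions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # illustrative

config = LoraConfig(
    r=16,                      # rank of the adapter matrices
    lora_alpha=32,             # follows the α = 2r rule of thumb above
    lora_dropout=0.05,
    target_modules=[           # attention + MLP, the "most thorough" option
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()     # typically well under 1% of all parameters
```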
4.3. Full Fine-Tuning vs LoRA
| Dimension | Full FT | LoRA |
|---|---|---|
| Trained params | 70B (full) | ~0.5B (0.7%) |
| VRAM need | 800GB+ | 48-80GB |
| Training time | 1x | 0.5-0.7x |
| Quality | 100% (baseline) | 90-95% |
| Data need | More | Less (1K-10K samples) |
| Output size | ~140GB | ~50MB-1GB (adapter only) |
| Multi-task | Hard | Multi-adapter swap |
LoRA's small output (50MB-1GB) is especially valuable — you can run 10 different LoRA adapters on the same model, switching at runtime.
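A rough sketch of that multi-adapter pattern with peft follows; the adapter paths and names are hypothetical.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # shared frozen base

# Load two hypothetical adapters side by side on the same base model
model = PeftModel.from_pretrained(base, "adapters/customer-support", adapter_name="support")
model.load_adapter("adapters/legal-tr", adapter_name="legal")

model.set_adapter("legal")     # route requests through the legal-domain adapter
# ... generate ...
model.set_adapter("support")   # switch back without reloading the base model
```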
5. QLoRA — 4-bit Quantization + LoRA
Published in 2023 by Dettmers et al., QLoRA pairs LoRA with quantization to make 70B models trainable on a single A100 GPU. The engine of the personal/small-team fine-tuning explosion.
5.1. Three Main Components
4-bit NF4 (Normal Float 4) quantization. Model weights stored at 4-bit instead of 16-bit. NF4 is more accurate than standard 4-bit — optimized for normal-distributed data.
Double Quantization (DQ). Even the quantization constants are quantized for additional memory savings.
Paged Optimizers. Optimizer state is paged between CPU RAM and GPU memory to reduce out-of-memory (OOM) errors.
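A minimal loading sketch showing how these pieces appear in code with transformers + bitsandbytes; the model name is illustrative and flag names can shift slightly between library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4: 4-bit Normal Float
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",       # illustrative 70B base
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters (Section 4) are then attached on top of the 4-bit base; a paged
# optimizer (e.g. optim="paged_adamw_8bit" in the training arguments) reduces OOMs.
```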
5.2. Practical QLoRA Cost (2026)
| Model | GPU | Time (10K samples) | Est. Cost |
|---|---|---|---|
| Llama 3 8B | 1x RTX 4090 (24GB) | 2-4 hours | $5-15 (RunPod) |
| Llama 3 70B | 1x A100 80GB | 8-12 hours | $50-150 (Modal/RunPod) |
| Llama 4 70B | 1x H100 80GB | 6-10 hours | $80-200 |
| Mixtral 8x7B | 1x A100 80GB | 10-15 hours | $80-200 |
| Qwen 2.5 72B | 1x H100 80GB | 8-12 hours | $120-250 |
Costs are training only. Data prep, eval, and iteration usually add 2-5x to total.
6. DPO — Direct Preference Optimization
Published in 2023 by Rafailov et al., DPO offers a much simpler mathematical formulation than classic RLHF/PPO. The 2024-2026 modern alignment standard.
- DPO (Direct Preference Optimization)
- A method that, on a human preference dataset (chosen/rejected pairs), skips reward-model training and PPO steps and uses a supervised-style loss directly. Published in 2023 by Stanford researchers; dramatically reduces the operational complexity of classic RLHF. It has been the standard in the open-model ecosystem since 2024.
- Also known as: Direct Preference Optimization
6.1. PPO (Classic RLHF) vs DPO
| Dimension | RLHF (PPO) | DPO |
|---|---|---|
| Reward Model | Required (separate training) | Not needed |
| Pipeline stages | 3 (SFT + RM + PPO) | 2 (SFT + DPO) |
| Training stability | Low (hyperparam sensitive) | High |
| Compute cost | ~5x SFT | ~1.5x SFT |
| Code complexity | High | Low |
| Quality (frontier) | Historically best | Equal or superior (recent research) |
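For reference, the loss from the DPO paper (Rafailov et al., 2023) that replaces the reward model + PPO loop; π_ref is the frozen SFT model, σ the sigmoid, and β controls how far the policy may drift from the reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$

Widening the margin between the chosen response y_w and the rejected response y_l through this single supervised objective is what removes the separate reward model.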
6.2. DPO Dataset Structure
You need chosen/rejected pairs.
{
"prompt": "How would you respond to a customer complaint?",
"chosen": "An empathetic, solution-focused, short, clear response...",
"rejected": "A defensive, generic, overly long response..."
}
Usually 500-5,000 preference pairs suffice; quality matters more than quantity.
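A small sketch of turning such pairs into a Hugging Face Dataset with the prompt/chosen/rejected columns that TRL's DPOTrainer conventionally expects (verify the exact column names for your TRL version):

```python
from datasets import Dataset

pairs = [
    {
        "prompt": "How would you respond to a customer complaint?",
        "chosen": "An empathetic, solution-focused, short, clear response...",
        "rejected": "A defensive, generic, overly long response...",
    },
    # ... 500-5,000 such pairs; quality over quantity
]

preference_ds = Dataset.from_list(pairs)
print(preference_ds.column_names)   # ['prompt', 'chosen', 'rejected']
```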
6.3. DPO Derivatives (2024-2026)
After DPO, many derivatives appeared:
- ORPO (Odds Ratio Preference Optimization) — Combines SFT and preference optimization in one step. Hong et al. (2024).
- KTO (Kahneman-Tversky Optimization) — Uses single-answer reward/penalty signals instead of preference pairs. Ethayarajh et al. (2024).
- IPO (Identity Preference Optimization) — Regularization against DPO over-fitting. Azar et al. (2023).
- CPO (Contrastive Preference Optimization) — Stronger reject signal. Xu et al. (2024).
- SimPO (Simple Preference Optimization) — Skips the reference model. Meng et al. (2024).
7. Practical Fine-Tuning Pipeline
A 7-stage pipeline from zero to production:
1. Use-Case Definition + Baseline. Why fine-tuning? How well does prompt + RAG work? Define baseline metrics.
2. Data Collection. 500-10,000 high-quality samples: manual labeling, cleaning of existing data, or synthetic generation (a large model teaching a smaller one).
3. Data Cleaning + QA. Dedupe, fix labels, strip PII (KVKK). Split train/val/test (usually 80/10/10).
4. Format + Tokenization. Chat template (Llama, Mistral, ChatML), system prompt structure, sequence length, tokenizer checks.
5. Training. Framework choice (Unsloth, Axolotl, LLaMA Factory). Hyperparams: learning rate (1e-4 LoRA, 5e-5 SFT), batch size, epochs (1-3), LoRA r/alpha. Cloud GPU or local. A minimal training sketch follows this list.
6. Evaluation. Automated metrics (perplexity, BLEU, custom) + LLM-as-judge + human eval. A pre-production eval set is mandatory.
7. Deployment. Serve via vLLM, TGI, or Ollama. A/B test (existing vs fine-tune). Monitor performance + cost.
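As a hedged sketch of stages 4-5, here is what a LoRA-based SFT run can look like with Hugging Face TRL; the base model, file paths, and hyperparameters are illustrative, and SFTConfig argument names vary between TRL versions.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Instruction data in chat format, e.g. {"messages": [{"role": "user", ...}, ...]}
train_ds = load_dataset("json", data_files="data/train.jsonl", split="train")

args = SFTConfig(
    output_dir="out/llama3-8b-sft",
    num_train_epochs=2,                  # 1-3 epochs to limit catastrophic forgetting
    learning_rate=1e-4,                  # typical LoRA range noted in stage 5
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # illustrative base model
    args=args,
    train_dataset=train_ds,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # LoRA, not full FT
)
trainer.train()
trainer.save_model()                     # writes only the small adapter when LoRA is used
```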
7.1. Training Frameworks
| Framework | Speed | Ease | Scope |
|---|---|---|---|
| Unsloth | 2-5x faster (Triton optimization) | High (simple Python) | LoRA, QLoRA, SFT, DPO |
| Axolotl | Standard | Medium (YAML config) | Full spectrum, including full FT |
| LLaMA Factory | Standard | High (CLI + UI) | LoRA, QLoRA, RLHF, DPO, ORPO, KTO |
| Hugging Face TRL | Standard | Medium (Python library) | Full spectrum, latest techniques |
| Together / Replicate / Modal | Cloud | Very high (managed) | LoRA, limited control |
| OpenAI Fine-tuning API | Cloud | Very high | SFT + limited DPO, closed-source |
Practical pick. Unsloth for developers/researchers (speed + ease). LLaMA Factory for production teams (broad scope). Together or Modal for cloud ease. Axolotl + self-hosted GPU for compliance-critical enterprises.
7.2. Data Preparation — The Invisible Success Factor
Data quality determines 70% of the fine-tune outcome; training is only the last step. Practical advice: manually labeled data beats synthetic on quality but costs 10-50x more; use tools like Self-Instruct, DataDreamer, Distilabel, and Lilac for modern data preparation; isolate the eval set from the training set; and check class balance.
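A small illustration of two of those steps, exact-duplicate removal and a leakage-free split; field names and file paths are assumptions.

```python
import hashlib
import json
import random

with open("data/raw.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

seen, deduped = set(), []
for row in rows:
    key = hashlib.sha256((row["instruction"] + row["response"]).encode("utf-8")).hexdigest()
    if key not in seen:                       # drop exact duplicates
        seen.add(key)
        deduped.append(row)

random.seed(42)
random.shuffle(deduped)
n = len(deduped)
splits = {
    "train": deduped[: int(0.8 * n)],
    "val":   deduped[int(0.8 * n): int(0.9 * n)],
    "test":  deduped[int(0.9 * n):],          # written once, never touched during training
}
for name, items in splits.items():
    with open(f"data/{name}.jsonl", "w", encoding="utf-8") as f:
        f.writelines(json.dumps(item, ensure_ascii=False) + "\n" for item in items)
```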
8. Turkish Fine-Tuning — Practical Notes
5 key nuances absent from global guides:
8.1. Tokenizer Efficiency
Turkish morphology means a single word often becomes 2-5 tokens in typical tokenizers. For fine-tuning this implies roughly 2x the sequence length, 30-50% higher training cost, and less content fitting into the context window.
Fix: Turkish-specific tokenizer (BERTurk) or vocabulary extension. Adding 3K-5K Turkish tokens to Llama/Mistral BPE vocab improves Turkish efficiency 30-50%.
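A quick way to see the overhead for yourself is to count tokens for the same Turkish sentence under different tokenizers; the model names below are illustrative and some require gated access.

```python
from transformers import AutoTokenizer

sentence = "Kullanıcılarımızın kişisel verilerinin korunması önceliğimizdir."

for name in ["meta-llama/Meta-Llama-3-8B", "dbmdz/bert-base-turkish-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(sentence)["input_ids"])
    print(f"{name}: {n_tokens} tokens for {len(sentence.split())} words")
```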
8.2. Turkish Dataset Sources
Belebele Turkish, Cosmos QA TR, xCOPA Turkish, WMT translation pairs, Wikipedia Turkish, MultiWOZ TR, Hugging Face Turkish datasets (100+), Cezeri instruction-tuning data, plus your enterprise data (most valuable).
8.3. Base Model Selection (For Turkish)
| Model | Turkish Score | Size | License | Fine-tune Friendly |
|---|---|---|---|---|
| Llama 4 8B | Medium-good | 8B | Meta open | High |
| Llama 4 70B | Good | 70B | Meta open | High |
| Mistral Small 3 | Good | 22B | Apache 2.0 | High |
| Qwen 2.5 14B | High (multilingual) | 14B | Apache 2.0 | High |
| Qwen 2.5 72B | Very high | 72B | Apache 2.0 | High |
| DeepSeek V3 | High | 671B (MoE) | MIT | Medium (large) |
| BERTurk | Excellent (NLP) | Base | MIT | For NLP tasks |
Practical pick. General Turkish instruction-tune: Qwen 2.5 14B or Llama 4 8B/70B. NLP-specific: BERTurk.
8.4. Turkish Style Locking
"siz" vs "sen", tone (formal/informal), regional dialects, sentence-order preferences — must be controlled in fine-tuning. Editor-level quality QA is mandatory.
8.5. Domain-Specific Turkish Examples
Turkish law (TBK, TMK, KVKK + case law), tax (VUK, VAT, GVK), health (anonymized medical reports), e-commerce (Trendyol/Hepsiburada catalogs), banking (BDDK + customer interactions).
9. Hardware, Cloud, Cost
9.1. GPU Choice (2026)
| GPU | VRAM | Typical Cloud Price (USD/hr) | Max Model with QLoRA |
|---|---|---|---|
| RTX 4090 | 24GB | $0.40-0.80 | 7B-13B |
| RTX 5090 | 32GB | $0.60-1.20 | 13B-22B |
| A100 40GB | 40GB | $1.20-2.00 | 13B-34B |
| A100 80GB | 80GB | $1.80-3.50 | 34B-70B |
| H100 80GB | 80GB | $3.50-6.00 | 34B-70B (fast) |
| H200 | 141GB | $5-9 | 70B+ (comfortable) |
| GB200/B200 (Blackwell) | 192GB | $8-15 | 100B+ MoE |
9.2. Cloud Platforms
Modal (Python-native, pay-as-you-go), RunPod (cheapest spot), Together AI (managed FT + inference), Replicate (ready templates), AWS SageMaker / GCP Vertex AI / Azure ML (enterprise), Lambda Cloud (on-demand H100/H200).
9.3. Typical Cost Scenarios
- Turkish style alignment, Llama 4 8B QLoRA, 5K samples: ~$15-40 training + ~$50-100 data + ~$30 eval = ~$100-200 total
- Domain-specific Mistral Small 3 fine-tune, 20K samples: ~$80-200 training + ~$300-800 data + ~$100 eval = ~$500-1,200
- Llama 4 70B QLoRA + DPO, 50K samples: ~$300-600 training (2 phases) + $1,000-3,000 data + $200-500 eval = ~$2,000-5,000
Reminder: data prep + eval is 60-70% of cost. GPU hours are the smallest line item.
10. Case Studies (Anonymized Turkish Enterprises)
Case 1 — Turkish Bank: Turkish Legal Document Assistant
Problem. Contract analysis on GPT-5 missed Turkish legal jargon (TBK, TMK references, court vocabulary).
Solution. Llama 4 70B QLoRA fine-tune:
- Data: 8,000 anonymized contracts + 3,000 Turkish Supreme Court decisions + 2,000 legal Q&A pairs
- Method: SFT + DPO (lawyers ranked 1,500 response pairs)
- Duration: 6 weeks (4 weeks data, 2 weeks training + eval)
- Cost: ~$8,000 (with labeling)
Result. Turkish legal accuracy 72% → 91%. Contract analysis time per lawyer 14 hours/week → 5 hours.
Case 2 — E-Commerce: Category Classification + Description
Problem. Manual category selection + Turkish description writing took hours per new product. Prompt engineering on GPT-4o-mini was insufficient (12,000 sub-categories).
Solution. Qwen 2.5 14B QLoRA fine-tune:
- Data: 250,000 existing products (name + description → category + tags + SEO description)
- Method: SFT (DPO not needed)
- Training: 2x A100 80GB, 18 hours
- Cost: ~$1,200
Result. Category classification accuracy 78% → 96%. Average human-intervention time per product 15 min → 1 min. Monthly 80K products processed at 90% lower cost than ChatGPT API (self-hosted Qwen + LoRA).
Case 3 — Healthcare: Medical-Report Structuring
Problem. Converting clinical notes to structured format (ICD-10 codes, diagnosis + treatment + medication) was 80% accurate on GPT-5; healthcare needs 95%+.
Solution. Mistral Small 3 ORPO fine-tune:
- Data: 15,000 anonymized clinical notes + expert-physician-approved structured outputs
- Method: ORPO (SFT + DPO in one stage)
- KVKK safeguards: all patient data anonymized; on-prem training; audit-logged eval
- Cost: ~$3,500 (with physician labeling)
Result. Medical-structuring accuracy 97%. KVKK + health regulation compliance. Enabled B2B integration with Turkish insurers.
11. Common Mistakes and Anti-Patterns
11.1. "Fine-Tune First, Ask Questions Later"
The most common mistake. Always eval prompt + RAG first; know how well those two layers do before reaching for fine-tuning.
11.2. Training with Too Little Data
Trying to fine-tune for style with under 500 samples usually fails. Use a minimum of 1,000 high-quality samples; 5,000-10,000 is ideal.
11.3. Catastrophic Forgetting
Wrong learning rate (too high) or too many epochs (3+) breaks the model's base capabilities. Track general benchmark performance during training.
11.4. Test Set Leakage
If part of the training data leaks into eval, the fine-tune score is artificially inflated but fails in production. Split at cleanup; never mix during training.
11.5. KVKK-Non-Compliant Data
Fine-tuning with prompts that contain customer/employee personal data. KVKK breach + the learned personal data becomes embedded in model weights. Always anonymize.
11.6. No Versioning
Not versioning fine-tune adapters and datasets. Use HF Hub, W&B, MLflow to track every experiment.
11.7. Shipping Without Eval
"Loss went down — it works" before going live. Loss is not eval; measure actual task success with an eval set.
11.8. Wrong Base Model Choice
Fine-tuning an English-only model for Turkish tasks. The base model should already know Turkish; fine-tuning adapts it to your domain, not teaches it Turkish from scratch.
12. Fine-Tuning vs Distillation
Distillation is training a small model (student) on the outputs of a large model (teacher). It is the most practical fine-tuning pattern of 2025-2026; a minimal data-generation sketch follows the list:
- Generate synthetic data with a large model (Claude Opus 4.7)
- SFT the small model (Llama 4 8B) on that data
- Small model = cheap + fast + 85-90% of the large model's quality
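A hedged sketch of step 1, generating synthetic instruction-response pairs through an OpenAI-compatible client; the teacher model name, prompts, and output schema are placeholders rather than any specific vendor's recommended setup.

```python
import json
from openai import OpenAI

client = OpenAI()   # assumes an API key / OpenAI-compatible endpoint is configured

seed_prompts = [
    "Explain the cancellation terms of this subscription contract in plain Turkish.",
    "Summarize the customer's complaint and propose a resolution.",
]

with open("data/distilled.jsonl", "w", encoding="utf-8") as f:
    for prompt in seed_prompts:
        resp = client.chat.completions.create(
            model="teacher-model-placeholder",          # a frontier teacher of your choice
            messages=[{"role": "user", "content": prompt}],
        )
        pair = {"instruction": prompt, "response": resp.choices[0].message.content}
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
# The resulting JSONL feeds the student's SFT run (step 2) via the Section 7 pipeline.
```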
13. Modern Fine-Tuning Trends (2026)
- Synthetic-data dominance — generation with GPT-5/Claude/Gemini instead of human labeling
- Distillation everywhere — knowledge transfer from frontier to small models
- Self-Reward models — the model rates its own outputs to create training data
- Verifier models — automatic quality control on fine-tune outputs
- RLAIF (RL from AI Feedback) — another AI's preferences instead of humans
- Continual learning — keeping the model updated without catastrophic forgetting
- PEFT advances — DoRA, MoRA, LoftQ; 2024-2025 improvements over LoRA
14. KVKK-Compliant Fine-Tuning
14.1. Risks
- Data embeds in the model — practically impossible to "delete" after fine-tuning
- Membership inference attacks — training-set membership can be inferred from outputs
- Data leakage — the model sometimes regurgitates training data almost verbatim
14.2. Mitigations
- Anonymization — strip PII (national ID, name, phone, email); see the sketch after this list
- Differential privacy — add noise during training (quality vs privacy trade-off)
- Federated learning — train without centralizing data (advanced)
- Data residency — train on Turkey or EU GPUs
- Audit logs — which data was used in which training
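A minimal, rule-based masking sketch for the anonymization step; the regex patterns (Turkish national ID, phone, e-mail) are rough assumptions, and production pipelines layer NER-based PII detection on top of rules like these.

```python
import re

PATTERNS = {
    "[TCKN]":  re.compile(r"\b\d{11}\b"),             # Turkish national ID: 11 digits
    "[PHONE]": re.compile(r"\+?\d[\d\s\-]{9,14}\d"),  # loose phone-number match
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def anonymize(text: str) -> str:
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(anonymize("Müşteri 12345678901, ornek@firma.com, +90 532 123 45 67 numarasından aradı."))
```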
14.3. Under the EU AI Act
If the fine-tuned model is high-risk (credit scoring, HR selection, etc.):
- Technical documentation (Annex IV)
- Training-data governance
- Risk assessment
- Human oversight
- Conformity assessment
See our compliance guide on this site for details.
15. Frequently Asked Questions
16. Next Steps
To shape LLM fine-tuning strategy in your company or move an existing fine-tune to production quality:
- Fine-Tune Use-Case Assessment. Is fine-tuning really needed? Is RAG/prompt enough? Investment math + 4-hour workshop.
- Data + Pipeline Setup. Turkish data collection, labeling strategy, training-platform choice, eval harness — end-to-end pipeline design.
- Production Fine-Tune Audit. For existing fine-tunes: 360° audit on quality, KVKK compliance, cost, observability.
Reach out via the contact form.
References
- LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., Microsoft Research
- QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al., University of Washington
- DPO: Your Language Model is Secretly a Reward Model — Rafailov et al., Stanford
- ORPO: Monolithic Preference Optimization without Reference Model — Hong et al., KAIST
- KTO: Model Alignment as Prospect Theoretic Optimization — Ethayarajh et al., Stanford
- IPO: A General Theoretical Paradigm — Azar et al., Google DeepMind
- InstructGPT: Training language models with human feedback — Ouyang et al., OpenAI
- DoRA: Weight-Decomposed Low-Rank Adaptation — Liu et al., NVIDIA
- Constitutional AI: Harmlessness from AI Feedback — Bai et al., Anthropic
- Self-Instruct: Aligning Language Models with Self-Generated Instructions — Wang et al., University of Washington
- Unsloth Documentation — Unsloth AI
- Hugging Face TRL — Hugging Face
- Axolotl — GitHub
- LLaMA Factory — GitHub
- KVKK - Law No. 6698 — Republic of Türkiye
- EU AI Act — European Commission
This is a living document; the fine-tuning ecosystem (new methods, frameworks, base models) shifts every quarter, so it is updated quarterly.