Artificial Intelligence · 35 min read · May 12, 2026

LLM Fine-Tuning: A Comprehensive 2026 Guide to LoRA, QLoRA, DPO, and Modern Alignment

A detailed, up-to-date 2026 guide to adapting an LLM to your domain, written with Turkish enterprises in mind. Covers when fine-tuning is necessary, the math behind LoRA, 4-bit training with QLoRA, why DPO beats PPO, modern alternatives (ORPO/KTO/IPO), Turkish dataset sources, GPU/cloud cost modeling, production pipelines, 3 anonymized Turkish enterprise case studies, and KVKK-compliant training. For developers, MLOps engineers, and AI architects.

Şükrü Yusuf KAYA
AI Expert · Enterprise AI Consultant
TL;DR

One-line answer: Fine-tuning is the advanced AI-engineering discipline that, in the right situations — when RAG and prompt engineering fall short — permanently bends an LLM's behavior toward your organization's DNA.

  • Fine-tuning is additional training that locks in specific dimensions of an LLM's output — style, format, behavior, domain knowledge — without changing its core capabilities. It is the right answer for ~5% of needs.
  • LoRA (Low-Rank Adaptation) trains small adapter matrices instead of full weights; with 0.1-1% of parameters updated, it delivers 90-95% of full fine-tuning quality.
  • QLoRA pairs LoRA with 4-bit quantization, making a 70B model fine-tunable on a single A100 GPU — the engine behind the post-2023 personal/small-team fine-tuning boom.
  • DPO (Direct Preference Optimization) replaces classic RLHF's PPO + reward-model loop with a simple supervised loss on preference pairs; the 2024-2026 modern alignment standard.
  • For Turkish enterprises, fine-tuning typically costs $200-$5,000; data preparation determines 70% of cost and quality — training is only the last step.

1. What is Fine-Tuning and When is it Necessary?

Three main strategies adapt LLMs to your use case: prompt engineering, RAG, and fine-tuning. The first two leave the model unchanged; fine-tuning updates model weights through additional training. In the right situations, it produces enormous value; in the wrong ones, it is a waste of money.

Definition
Fine-Tuning
The process of updating a pretrained language model's (foundation model's) weights via additional training on a custom dataset and task. Aligns the model to a specific domain, style, format, or behavior while preserving the existing knowledge base. Covers methods like full fine-tuning, LoRA, QLoRA, DPO, and ORPO.
Also known as: Model Adaptation

When to Fine-Tune?

A practical decision framework:

Fine-Tuning vs Other Adaptation Methods

| Need | Prompt Eng | RAG | Fine-tuning |
|---|---|---|---|
| Lock in style/format | Partial | - | Ideal |
| Add domain knowledge | - | Ideal | Limited |
| Access fresh data | - | Ideal | - |
| Teach new behavior | Partial | - | Ideal |
| Reduce latency | - | - | Yes (small model) |
| Save tokens | - | - | Ideal |
| Setup time | Hours | Weeks | Weeks-months |
| Cost | Very low | Medium | High (one-time) |

Practical rule. 70% of needs are solved by prompt engineering, 25% more by prompt + RAG. The remaining 5% is where fine-tuning produces real value: locking in style/format, guaranteed structured output, lowering latency/cost (distillation), domain-specific language (Turkish law, medicine), and new behavior (agent tasks, tool use).

Why Try Prompt and RAG First?

Fine-tuning has five side effects: high upfront cost (GPU hours, data, evals), model "freezing" (re-do work on each new base model), catastrophic forgetting risk, data-management complexity (KVKK + IP + quality), and harder evaluation. That is why OpenAI, Anthropic, and Google all officially recommend prompt + RAG first, fine-tuning later.

2. The Full LLM Training Pipeline

A modern LLM goes through four training stages, each with a distinct purpose, dataset type, and cost.

LLM Training Stages (Full Picture)

| Stage | Purpose | Data Type | Time/Cost |
|---|---|---|---|
| 1. Pretraining | General language | Trillions of tokens (internet, books, code) | Months, millions $ |
| 2. Supervised Fine-Tuning (SFT) | Instruction following | Thousands of high-quality Q&A pairs | Days, thousands $ |
| 3. Preference Optimization (RLHF/DPO/ORPO) | Human preference | Preference pairs (A > B) | Days, thousands $ |
| 4. Continued Fine-tuning (yours) | Domain/style alignment | Hundreds-thousands of examples | Hours-days, $50-5,000 |

Enterprise fine-tuning usually happens at Stage 4.

Supervised Fine-Tuning (SFT)

The most basic form — standard next-token prediction training on instruction-response pairs. Most enterprise fine-tunes are SFT (style, format, domain knowledge).

Preference Optimization

Human evaluators see two responses (A, B) for the same prompt and mark the better one. The model is then pushed toward "good" responses via:

  • RLHF (PPO) — classic; trains a reward model and applies PPO. Complex and resource-heavy.
  • DPO — skips the reward model; supervised loss directly on preference pairs. Simple, effective, the standard since 2024.
  • ORPO / KTO / IPO — derivatives and alternatives detailed below.

3. PEFT — Parameter-Efficient Fine-Tuning

Fully fine-tuning a 70B-parameter model means updating all 70B weights — a job that needs 800GB+ of VRAM and is within reach only for large labs. PEFT solves this by updating only a small subset of parameters.

Definition
PEFT (Parameter-Efficient Fine-Tuning)
A family of techniques that fine-tune a small subset of parameters rather than the entire weights of pretrained large models. Includes LoRA, QLoRA, AdaLoRA, IA-3, Prefix Tuning, Prompt Tuning. Reduces compute by 10-100x with typically only 5-10% quality drop.
Also known as: Parameter-Efficient Fine-Tuning

PEFT members: LoRA, QLoRA, AdaLoRA, IA-3, Prefix Tuning, Prompt Tuning, DoRA (2024), MoRA (2024).

4. LoRA — Low-Rank Adaptation

Published in 2021 by Microsoft researchers (Hu et al.), LoRA has become the gold standard of modern fine-tuning.

4.1. Math (Brief)

In full fine-tuning, a weight matrix W (e.g., 4096×4096) is updated directly: W_new = W + ΔW. LoRA's assumption: ΔW can be low-rank.

LoRA expresses ΔW as the product of two small matrices:

Code Snippet
ΔW ≈ B × A
B: 4096 × r
A: r × 4096
r << 4096 (usually 4, 8, 16, 32, 64)

Only A and B are updated during training; original W is frozen. At inference, W + B × A is computed (or merged).
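The parameter savings are easy to verify with plain arithmetic. A minimal sketch in pure Python, using the 4096 hidden size from the example above and r = 16:

```python
# Parameter count of full fine-tuning vs. a LoRA update for one
# weight matrix W of shape d x d, factored as delta_W ~ B @ A.
d = 4096   # hidden size from the example above
r = 16     # LoRA rank

full_params = d * d                # every entry of delta_W is trained
lora_params = (d * r) + (r * d)    # B: d x r, A: r x d

print(f"full:  {full_params:,}")   # 16,777,216
print(f"lora:  {lora_params:,}")   # 131,072
print(f"ratio: {lora_params / full_params:.2%}")  # 0.78%
```

Summed over every targeted matrix in the model, this is where the "0.1-1% of parameters" figure comes from.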

4.2. LoRA Hyperparameters

Rank (r) — size of LoRA matrices. Common: 8 (default), 16, 32, 64. Higher rank = more capacity but overfitting risk.

Alpha (α) — scaling factor. ΔW_effective = (α/r) × B × A. Practical: α = 2r.

Target modules — which layers get LoRA?

  • q_proj, v_proj — attention query/value only (minimal)
  • q_proj, k_proj, v_proj, o_proj — all attention
  • q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj — attention + MLP (most thorough)

Tip. Targeting all linear layers gives the best results; attention-only LoRA loses 5-10% quality on most tasks.
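These hyperparameters map directly onto a PEFT adapter config. A hedged sketch using Hugging Face's `peft` library — the dropout value and task type are illustrative choices, not recommendations from this article:

```python
from peft import LoraConfig

# r = 16, alpha = 2r, and all linear projections targeted (the "most
# thorough" option above). lora_dropout is an illustrative default.
config = LoraConfig(
    r=16,
    lora_alpha=32,  # alpha = 2r rule of thumb
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```

The config is then applied with `get_peft_model(model, config)`, which freezes the base weights and injects the A/B adapter pairs.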

4.3. Full Fine-Tuning vs LoRA

Full Fine-Tuning vs LoRA (Llama 3 70B Example)

| Dimension | Full FT | LoRA |
|---|---|---|
| Trained params | 70B (full) | ~0.5B (0.7%) |
| VRAM need | 800GB+ | 48-80GB |
| Training time | 1x | 0.5-0.7x |
| Quality | 100% (baseline) | 90-95% |
| Data need | More | Less (1K-10K samples) |
| Output size | ~140GB | ~50MB-1GB (adapter only) |
| Multi-task | Hard | Multi-adapter swap |

LoRA's small output (50MB-1GB) is especially valuable — you can run 10 different LoRA adapters on the same model, switching at runtime.

5. QLoRA — 4-bit Quantization + LoRA

Published in 2023 by Dettmers et al., QLoRA pairs LoRA with quantization to make 70B models trainable on a single A100 GPU. The engine of the personal/small-team fine-tuning explosion.

5.1. Three Main Components

4-bit NF4 (Normal Float 4) quantization. Model weights stored at 4-bit instead of 16-bit. NF4 is more accurate than standard 4-bit — optimized for normal-distributed data.

Double Quantization (DQ). Even the quantization constants are quantized for additional memory savings.

Paged Optimizers. Move optimizer state between RAM and GPU in pages to reduce OOM errors.

5.2. Practical QLoRA Cost (2026)

QLoRA Cost Estimates (2026)

| Model | GPU | Time (10K samples) | Est. Cost |
|---|---|---|---|
| Llama 3 8B | 1x RTX 4090 (24GB) | 2-4 hours | $5-15 (RunPod) |
| Llama 3 70B | 1x A100 80GB | 8-12 hours | $50-150 (Modal/RunPod) |
| Llama 4 70B | 1x H100 80GB | 6-10 hours | $80-200 |
| Mixtral 8x7B | 1x A100 80GB | 10-15 hours | $80-200 |
| Qwen 2.5 72B | 1x H100 80GB | 8-12 hours | $120-250 |

Costs are training only. Data prep, eval, and iteration usually add 2-5x to total.
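These estimates reduce to simple arithmetic: GPU-hours × hourly rate for a run, times an overhead multiplier for data prep, eval, and iteration. A toy estimator — the rate and hours are mid-range numbers from the tables in this article, and the 3x multiplier is one reading of the 2-5x rule of thumb:

```python
def finetune_cost(gpu_rate_usd_hr, train_hours, overhead_mult=3.0):
    """Rough total cost: training compute plus data-prep/eval overhead.

    overhead_mult follows the article's 2-5x rule of thumb.
    """
    training = gpu_rate_usd_hr * train_hours
    return {"training": training, "total": training * overhead_mult}

# One clean run of Llama 3 70B on a single A100 80GB; the table's
# $50-150 range also covers restarts and hyperparameter retries.
est = finetune_cost(gpu_rate_usd_hr=2.5, train_hours=10)
print(est)  # {'training': 25.0, 'total': 75.0}
```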

6. DPO — Direct Preference Optimization

Published in 2023 by Rafailov et al., DPO offers a much simpler mathematical formulation than classic RLHF/PPO. The 2024-2026 modern alignment standard.

Definition
DPO (Direct Preference Optimization)
A method that, on a human preference dataset (chosen/rejected pairs), skips reward-model training and PPO steps and uses a supervised-style loss directly. Published in 2023 by Stanford and CMU researchers; dramatically reduces the operational complexity of classic RLHF. Has been the standard in the open-model ecosystem since 2024.
Also known as: Direct Preference Optimization

6.1. PPO (Classic RLHF) vs DPO

RLHF (PPO) vs DPO

| Dimension | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required (separate training) | Not needed |
| Pipeline stages | 3 (SFT + RM + PPO) | 2 (SFT + DPO) |
| Training stability | Low (hyperparam-sensitive) | High |
| Compute cost | ~5x SFT | ~1.5x SFT |
| Code complexity | High | Low |
| Quality (frontier) | Historically best | Equal or superior (recent research) |

6.2. DPO Dataset Structure

You need chosen/rejected pairs.

Code Snippet
{
  "prompt": "How would you respond to a customer complaint?",
  "chosen": "An empathetic, solution-focused, short, clear response...",
  "rejected": "A defensive, generic, overly long response..."
}

Usually 500-5,000 preference pairs suffice; quality matters more than quantity.
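Under the hood, DPO's per-pair loss is just a log-sigmoid of the scaled difference in log-probability margins between the policy and the frozen reference model. A toy computation with made-up log-probabilities (β = 0.1 is a common setting from the original paper):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probs under
    the policy (pi_*) and the frozen reference model (ref_*)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log sigmoid

# Made-up numbers: the policy already prefers the chosen answer a bit
loss = dpo_loss(pi_chosen=-12.0, pi_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0)
print(round(loss, 4))  # 0.5981
```

Minimizing this pushes the chosen answer's probability up relative to the rejected one — no reward model, no RL loop, just gradient descent on a supervised-style loss.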

6.3. DPO Derivatives (2024-2026)

After DPO, many derivatives appeared:

  • ORPO (Odds Ratio Preference Optimization) — Combines SFT and preference optimization in one step. Hong et al. (2024).
  • KTO (Kahneman-Tversky Optimization) — Uses single-answer reward/penalty signals instead of preference pairs. Ethayarajh et al. (2024).
  • IPO (Identity Preference Optimization) — Regularization against DPO over-fitting. Azar et al. (2023).
  • CPO (Contrastive Preference Optimization) — Stronger reject signal. Xu et al. (2024).
  • simPO (Simple Preference Optimization) — Skips reference model. Meng et al. (2024).

7. Practical Fine-Tuning Pipeline

A 7-stage pipeline from zero to production:

How to

Production Fine-Tuning Pipeline — 7 Stages

A step-by-step path from zero to production-quality fine-tuning.

  1. Use-Case Definition + Baseline — Why fine-tuning? How well does prompt + RAG work? Define baseline metrics.
  2. Data Collection — 500-10,000 high-quality samples: manual labeling, cleaned existing data, or synthetic (a large model teaching a smaller one).
  3. Data Cleaning + QA — Dedupe, fix labels, strip PII (KVKK). Split train/val/test (usually 80/10/10).
  4. Format + Tokenization — Chat template (Llama, Mistral, ChatML), system prompt structure, sequence length, tokenizer checks.
  5. Training — Framework choice (Unsloth, Axolotl, LLaMA Factory). Hyperparams: learning rate (1e-4 LoRA, 5e-5 SFT), batch size, epochs (1-3), LoRA r/alpha. Cloud GPU or local.
  6. Evaluation — Automated metrics (perplexity, BLEU, custom) + LLM-as-judge + human eval. A pre-production eval set is mandatory.
  7. Deployment — Serve via vLLM, TGI, or Ollama. A/B test (existing vs fine-tune). Monitor performance + cost.
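Step 4 (format + tokenization) is mostly mechanical string work. A minimal sketch of rendering one SFT sample in the ChatML template mentioned above — the sample content is invented for illustration:

```python
def to_chatml(system, user, assistant):
    """Render one SFT example in ChatML, the chat template used by
    Qwen and several other model families."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    )

sample = to_chatml(
    system="You are a polite Turkish customer-support assistant.",
    user="My order arrived damaged.",
    assistant="I'm sorry to hear that. Let's arrange a replacement...",
)
print(sample)
```

In practice, prefer the tokenizer's built-in `apply_chat_template` over hand-rolled strings, so the template always matches the base model exactly.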

7.1. Training Frameworks

2026 Fine-Tuning Framework Comparison

| Framework | Speed | Ease | Scope |
|---|---|---|---|
| Unsloth | 2-5x faster (Triton optimization) | High (simple Python) | LoRA, QLoRA, SFT, DPO |
| Axolotl | Standard | Medium (YAML config) | Full spectrum, including full FT |
| LLaMA Factory | Standard | High (CLI + UI) | LoRA, QLoRA, RLHF, DPO, ORPO, KTO |
| Hugging Face TRL | Standard | Medium (Python library) | Full spectrum, latest techniques |
| Together / Replicate / Modal | Cloud | Very high (managed) | LoRA, limited control |
| OpenAI Fine-tuning API | Cloud | Very high | SFT + limited DPO, closed-source |

Practical pick. Unsloth for developers/researchers (speed + ease). LLaMA Factory for production teams (broad scope). Together or Modal for cloud ease. Axolotl + self-hosted GPU for compliance-critical enterprises.

7.2. Data Preparation — The Invisible Success Factor

Data quality determines 70% of the fine-tune outcome; training is only the last step. Practical advice:

  • Manual data beats synthetic on quality, but costs 10-50x more
  • Use modern data-prep tooling: Self-Instruct, DataDreamer, Distilabel, Lilac
  • Isolate the eval set from the training set
  • Ensure class balance
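Exact-duplicate removal, one of the first cleaning steps, can be sketched in a few lines (near-duplicate detection needs MinHash or embeddings and is out of scope here):

```python
import hashlib
import json

def dedupe(samples):
    """Drop exact duplicates by hashing the normalized JSON of each sample."""
    seen, unique = set(), []
    for s in samples:
        key = hashlib.sha256(
            json.dumps(s, sort_keys=True, ensure_ascii=False).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

data = [{"prompt": "a", "response": "b"},
        {"prompt": "a", "response": "b"},   # exact duplicate
        {"prompt": "a", "response": "c"}]
print(len(dedupe(data)))  # 2
```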

8. Turkish Fine-Tuning — Practical Notes

5 key nuances absent from global guides:

8.1. Tokenizer Efficiency

Turkish morphology makes a word 2-5 tokens in typical tokenizers. In fine-tuning: 2x sequence length needed; 30-50% higher training cost; less content fits the context.

Fix: Turkish-specific tokenizer (BERTurk) or vocabulary extension. Adding 3K-5K Turkish tokens to Llama/Mistral BPE vocab improves Turkish efficiency 30-50%.
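The cost impact of tokenizer fertility is straightforward to model. A toy comparison using the 2-5 tokens-per-word figure quoted above; the ~1.3 tokens/word English baseline is an assumption for illustration:

```python
def token_cost_ratio(tr_tokens_per_word, en_tokens_per_word=1.3):
    """How much more a Turkish corpus costs to train on than an English
    corpus with the same word count, all else being equal."""
    return tr_tokens_per_word / en_tokens_per_word

print(f"{token_cost_ratio(2.0):.2f}x")  # mild case    -> 1.54x
print(f"{token_cost_ratio(3.0):.2f}x")  # typical case -> 2.31x
```

Since attention cost grows with sequence length, the real training-time penalty can be even larger than this linear token ratio.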

8.2. Turkish Dataset Sources

Belebele Turkish, Cosmos QA TR, xCOPA Turkish, WMT translation pairs, Wikipedia Turkish, MultiWOZ TR, Hugging Face Turkish datasets (100+), Cezeri instruction-tuning data, plus your enterprise data (most valuable).

8.3. Base Model Selection (For Turkish)

Base Models for Turkish Fine-Tuning

| Model | Turkish Score | Size | License | Fine-tune Friendly |
|---|---|---|---|---|
| Llama 4 8B | Medium-good | 8B | Meta open | High |
| Llama 4 70B | Good | 70B | Meta open | High |
| Mistral Small 3 | Good | 22B | Apache 2.0 | High |
| Qwen 2.5 14B | High (multilingual) | 14B | Apache 2.0 | High |
| Qwen 2.5 72B | Very high | 72B | Apache 2.0 | High |
| DeepSeek V3 | High | 671B (MoE) | MIT | Medium (large) |
| BERTurk | Excellent (NLP) | Base | MIT | For NLP tasks |

Practical pick. General Turkish instruction-tune: Qwen 2.5 14B or Llama 4 8B/70B. NLP-specific: BERTurk.

8.4. Turkish Style Locking

"siz" vs "sen", tone (formal/informal), regional dialects, sentence-order preferences — must be controlled in fine-tuning. Editor-level quality QA is mandatory.

8.5. Domain-Specific Turkish Examples

Turkish law (TBK, TMK, KVKK + case law), tax (VUK, VAT, GVK), health (anonymized medical reports), e-commerce (Trendyol/Hepsiburada catalogs), banking (BDDK + customer interactions).

9. Hardware, Cloud, Cost

9.1. GPU Choice (2026)

GPU Options for Fine-Tuning (2026)

| GPU | VRAM | Typical Cloud Price (USD/hr) | Max Model with QLoRA |
|---|---|---|---|
| RTX 4090 | 24GB | $0.40-0.80 | 7B-13B |
| RTX 5090 | 32GB | $0.60-1.20 | 13B-22B |
| A100 40GB | 40GB | $1.20-2.00 | 13B-34B |
| A100 80GB | 80GB | $1.80-3.50 | 34B-70B |
| H100 80GB | 80GB | $3.50-6.00 | 34B-70B (fast) |
| H200 | 141GB | $5-9 | 70B+ (comfortable) |
| GB200/B200 (Blackwell) | 192GB | $8-15 | 100B+ MoE |

9.2. Cloud Platforms

Modal (Python-native, pay-as-you-go), RunPod (cheapest spot), Together AI (managed FT + inference), Replicate (ready templates), AWS SageMaker / GCP Vertex AI / Azure ML (enterprise), Lambda Cloud (on-demand H100/H200).

9.3. Typical Cost Scenarios

  • Turkish style alignment, Llama 4 8B QLoRA, 5K samples: ~$15-40 training + ~$50-100 data + ~$30 eval = ~$100-200 total
  • Domain-specific Mistral Small 3 fine-tune, 20K samples: ~$80-200 training + ~$300-800 data + ~$100 eval = ~$500-1,200
  • Llama 4 70B QLoRA + DPO, 50K samples: ~$300-600 training (2 phases) + $1,000-3,000 data + $200-500 eval = ~$2,000-5,000

Reminder: data prep + eval is 60-70% of cost. GPU hours are the smallest line item.

10. Case Studies (Anonymized Turkish Enterprises)

Case 1 — Legal: Contract Analysis

Problem. Contract analysis on GPT-5 missed Turkish legal jargon (TBK, TMK references, court vocabulary).

Solution. Llama 4 70B QLoRA fine-tune:

  • Data: 8,000 anonymized contracts + 3,000 Turkish Supreme Court decisions + 2,000 legal Q&A pairs
  • Method: SFT + DPO (lawyers ranked 1,500 response pairs)
  • Duration: 6 weeks (4 weeks data, 2 weeks training + eval)
  • Cost: ~$8,000 (with labeling)

Result. Turkish legal accuracy 72% → 91%. Contract analysis time per lawyer 14 hours/week → 5 hours.

Case 2 — E-Commerce: Category Classification + Description

Problem. Manual category selection + Turkish description writing took hours per new product. Prompt engineering on GPT-4o-mini was insufficient (12,000 sub-categories).

Solution. Qwen 2.5 14B QLoRA fine-tune:

  • Data: 250,000 existing products (name + description → category + tags + SEO description)
  • Method: SFT (DPO not needed)
  • Training: 2x A100 80GB, 18 hours
  • Cost: ~$1,200

Result. Category classification accuracy 78% → 96%. Average human-intervention time per product 15 min → 1 min. 80K products per month processed at 90% lower cost than the ChatGPT API (self-hosted Qwen + LoRA).

Case 3 — Healthcare: Medical-Report Structuring

Problem. Converting clinical notes to structured format (ICD-10 codes, diagnosis + treatment + medication) was 80% accurate on GPT-5; healthcare needs 95%+.

Solution. Mistral Small 3 ORPO fine-tune:

  • Data: 15,000 anonymized clinical notes + expert-physician-approved structured outputs
  • Method: ORPO (SFT + DPO in one stage)
  • KVKK safeguards: all patient data anonymized; on-prem training; audit-logged eval
  • Cost: ~$3,500 (with physician labeling)

Result. Medical-structuring accuracy 97%. KVKK + health regulation compliance. Enabled B2B integration with Turkish insurers.

11. Common Mistakes and Anti-Patterns

11.1. "Fine-Tune First, Ask Questions Later"

The most common mistake. Always eval prompt + RAG first; know how well those two layers do before reaching for fine-tuning.

11.2. Training with Too Little Data

Trying to style fine-tune with under 500 samples. Usually fails. Minimum 1,000 high-quality; ideal 5,000-10,000.

11.3. Catastrophic Forgetting

Wrong learning rate (too high) or too many epochs (3+) breaks the model's base capabilities. Track general benchmark performance during training.

11.4. Test Set Leakage

If part of the training data leaks into eval, the fine-tune score is artificially inflated but fails in production. Split at cleanup; never mix during training.

11.5. KVKK-Non-Compliant Data

Fine-tuning with prompts that contain customer/employee personal data. KVKK breach + the learned personal data becomes embedded in model weights. Always anonymize.

11.6. No Versioning

Not versioning fine-tune adapters and datasets. Use HF Hub, W&B, MLflow to track every experiment.

11.7. Shipping Without Eval

"Loss went down — it works" before going live. Loss is not eval; measure actual task success with an eval set.

11.8. Wrong Base Model Choice

Fine-tuning an English-only model for Turkish tasks. The base model should already know Turkish; fine-tuning adapts it to your domain, not teaches it Turkish from scratch.

12. Fine-Tuning vs Distillation

Distillation — training a small model (student) on the outputs of a large model (teacher). The 2025-2026 most practical fine-tune pattern:

  1. Generate synthetic data with a large model (Claude Opus 4.7)
  2. SFT the small model (Llama 4 8B) on that data
  3. Small model = cheap + fast + 85-90% of the large model's quality
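The teacher-student loop above is mostly plumbing. A sketch of the quality gate between steps 1 and 2 — filtering teacher outputs before they become SFT data. The thresholds, record layout, and refusal heuristic are invented for illustration; the teacher API call itself is stubbed out:

```python
def keep(sample, min_len=20, max_len=2000):
    """Quality gate for synthetic SFT data distilled from a teacher model.

    Thresholds are illustrative; real pipelines add dedup, language ID,
    and often a verifier-model score on top.
    """
    resp = sample["response"].strip()
    return (
        min_len <= len(resp) <= max_len
        and not resp.lower().startswith("as an ai")   # refusal boilerplate
        and sample["prompt"].strip() != resp          # echoed the prompt
    )

raw = [
    {"prompt": "Summarize TBK article 12", "response": "Article 12 states..."},
    {"prompt": "Summarize TBK article 12", "response": "As an AI, I cannot..."},
    {"prompt": "hi", "response": "hi"},
]
sft_data = [s for s in raw if keep(s)]
print(len(sft_data))  # 1
```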
13. Fine-Tuning Trends (2025-2026)

  • Synthetic-data dominance — generation with GPT-5/Claude/Gemini instead of human labeling
  • Distillation everywhere — knowledge transfer from frontier to small models
  • Self-Reward models — the model rates its own outputs to create training data
  • Verifier models — automatic quality control on fine-tune outputs
  • RLAIF (RL from AI Feedback) — another AI's preferences instead of humans
  • Continual learning — keeping the model updated without catastrophic forgetting
  • PEFT advances — DoRA, MoRA, LoftQ; 2024-2025 improvements over LoRA

14. KVKK-Compliant Fine-Tuning

14.1. Risks

  • Data embeds in the model — practically impossible to "delete" after fine-tuning
  • Membership inference attacks — training-set membership can be inferred from outputs
  • Data leakage — the model sometimes regurgitates training data almost verbatim

14.2. Mitigations

  1. Anonymization — strip PII (national ID, name, phone, email)
  2. Differential privacy — add noise during training (quality vs privacy trade-off)
  3. Federated learning — train without centralizing data (advanced)
  4. Data residency — train on Turkey or EU GPUs
  5. Audit logs — which data was used in which training
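Anonymization (step 1) usually starts with regex passes before any manual review. A minimal sketch for two common PII patterns — 11-digit Turkish national IDs and e-mail addresses; the patterns are simplified for illustration, and a real pipeline needs NER-based name detection on top:

```python
import re

PATTERNS = [
    (re.compile(r"\b\d{11}\b"), "[TCKN]"),                 # Turkish national ID
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def scrub(text):
    """Replace simple PII patterns with placeholder tags."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

print(scrub("Customer 12345678901 wrote from ali.veli@example.com"))
# Customer [TCKN] wrote from [EMAIL]
```

Keeping the placeholder tags (rather than deleting spans) preserves sentence structure, which matters for training quality.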

14.3. Under the EU AI Act

If the fine-tuned model is high-risk (credit scoring, HR selection, etc.):

  • Technical documentation (Annex IV)
  • Training-data governance
  • Risk assessment
  • Human oversight
  • Conformity assessment

See our compliance guide on this site for details.

16. Next Steps

To shape LLM fine-tuning strategy in your company or move an existing fine-tune to production quality:

  1. Fine-Tune Use-Case Assessment. Is fine-tuning really needed? Is RAG/prompt enough? Investment math + 4-hour workshop.
  2. Data + Pipeline Setup. Turkish data collection, labeling strategy, training-platform choice, eval harness — end-to-end pipeline design.
  3. Production Fine-Tune Audit. For existing fine-tunes: 360° audit on quality, KVKK compliance, cost, observability.

Reach out via the contact form.


This is a living document; the fine-tuning ecosystem (new methods, frameworks, base models) shifts every quarter, so it is updated quarterly.
