Fine-Tuning Cookbook (Model-by-Model)
Table of Contents
Part 0 — Engineering Foundations
- 1
Welcome to the Fine-Tuning Cookbook: System, Stage Taxonomy, and the Reproducibility Contract
User manual for this cookbook: 5-component lesson anatomy (Theory/Math/Lab/Debug/Bench), Stage taxonomy (Spike → Reference → Production → Research), reproducibility contract (bit-exact runs), why the RTX 4090 baseline, GPU budgeting math.
- 2
Reproducibility Stack: Seeds, cuDNN Flags, and Deterministic CUDA — End the 'Works on My Machine' Problem
ML's most expensive time sink: irreproducible results. This lesson: seed management, cuDNN/cuBLAS deterministic flags, ATen non-deterministic op detection, dataloader worker seeding, cost of deterministic scatter/gather — all with practical code and real logs.
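A minimal sketch of the setup this lesson builds — the stock PyTorch determinism knobs; the lesson layers dataloader worker seeding and the performance-cost discussion on top:

```python
import os, random
import numpy as np
import torch

def make_deterministic(seed: int = 42) -> None:
    # Seed every RNG that touches training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Fail loudly on non-deterministic ATen ops instead of silently diverging.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Required for deterministic cuBLAS matmuls on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```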
- 3
Environment Pinning: uv + pyproject.toml, CUDA Version Matrix, and Container Recipes
Second half of reproducibility: pin lib versions, understand the CUDA matrix, write Docker/Apptainer recipes. Where uv beats pip+poetry by 10-100x, the CUDA 12.4 / PyTorch 2.5 stack for RTX 4090, compatibility matrix for FT frameworks (TRL, Unsloth, Axolotl).
- 4
Container & Slurm Recipes: Bridging Single 4090 to Cloud Multi-Node
How to take a recipe you prepared on a single 4090 to an 8×H100 cluster: Slurm sbatch template, multi-node NCCL setup, EFA/InfiniBand sanity check, real hourly prices for Lambda/RunPod/CoreWeave/Vast, preemption-tolerant training, checkpoint manifest, fault-tolerance principles.
- 5
Experiment Tracking Architecture: W&B + Hydra + DVC — The Engineering of Sweeps
Disciplining ML experiments: config-driven runs with Hydra, sweep + system metrics + offline mode with W&B, dataset/checkpoint versioning with DVC, alias/lineage tracking. The cookbook's 'reportable Lab' standard.
Part I — Hardware & Memory Engineering
- 1
The Anatomy of GPU Memory Budgeting: W + G + O + A + B — Managing the 24GB on RTX 4090 at the Atom Level
The most common phrase in fine-tuning: 'OOM'. This lesson ends random OOMs forever. Break down the Weights/Grads/Optimizer/Activations/Buffers budget; understand mathematically why AdamW needs 8 bytes/param, Lion 4, and NF4 fits at 0.5. Fit Llama 3.1 8B into 24GB with 4 different methods.
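The W + G + O arithmetic in miniature — a rough estimator using the lesson's headline bytes-per-param numbers (activations and buffers, the A and B terms, are budgeted separately):

```python
def static_memory_gb(n_params: float, regime: str) -> float:
    """Weights + grads + optimizer states, in GB (1e9 bytes)."""
    bytes_per_param = {
        "full_ft_adamw": 2 + 2 + 8,  # bf16 W, bf16 G, two fp32 AdamW moments
        "full_ft_lion":  2 + 2 + 4,  # Lion keeps a single moment
        "qlora_nf4":     0.5,        # frozen NF4 base; small adapter cost extra
    }[regime]
    return n_params * bytes_per_param / 1e9

print(static_memory_gb(8.03e9, "full_ft_adamw"))  # ~96 GB: hopeless on 24 GB
print(static_memory_gb(8.03e9, "qlora_nf4"))      # ~4 GB frozen base: fits
```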
- 2
Anatomy of Activation Memory: Why O(L·s·h) and the Real Savings of FlashAttention
Activation memory: forward pass's most misleading memory consumer. Layer-by-layer breakdown (attn intermediates, FFN, norm, residual), FlashAttention's saved-memory math (O(s²)→O(s)), the 'sqrt(L) savings' myth of grad-checkpoint, packing + variable-length attention.
- 3
Gradient Checkpointing Trade-off Lab: Compressing Memory by Crediting Compute
Decision tree for gradient checkpointing: per-layer, segment-based, custom selective? Re-entrant vs non-re-entrant difference, torch.utils.checkpoint vs HF Trainer kwargs, selective checkpointing. 5-strategy bench on RTX 4090 + Llama 3.1 8B.
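A sketch of per-layer checkpointing with torch.utils.checkpoint — a toy block stands in for a transformer layer; the lab compares this against the HF Trainer kwarg and selective variants:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim: int = 4096):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

blocks = torch.nn.ModuleList(Block() for _ in range(4))

def forward_with_ckpt(x):
    for blk in blocks:
        # Activations inside blk are dropped and recomputed during backward;
        # use_reentrant=False selects the non-reentrant path the lesson compares.
        x = checkpoint(blk, x, use_reentrant=False)
    return x
```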
- 4
Mixed Precision Architecture: bf16 vs fp16 vs fp8 — Why Pure bf16 on RTX 4090?
fp16's loss scaling complexity, bf16's 'master fp32' pattern, fp8 (Ada supports but H100 is native), TF32 matmul precision flag, autocast nuances — cookbook's clear choice of pure bf16 for RTX 4090. NaN cost and training stability math.
- 5
PCIe vs NVLink vs InfiniBand: The Invisible Impact of Bandwidth on Training
Bandwidth is invisible on a single 4090 but at scale-out it alone can slow training. PCIe 4.0/5.0 lane math, NVLink (and why 4090 doesn't have it), NVSwitch topology, InfiniBand 400G, threshold where NCCL all-reduce becomes network-bound, p2p_access detection, GPU-direct.
- 6
Storage I/O Engineering: How Datasets Quietly Throttle Training (and How to Prevent It)
Dataset bottleneck: GPU is 30% idle waiting for disk. NVMe Gen3/Gen4/Gen5 throughput, dataset format choice (parquet vs arrow vs webdataset), HuggingFace datasets caching, num_workers tuning, prefetch_factor, persistent_workers, pinned memory, FSx vs S3 vs local — recipe to run RTX 4090 + 50K Turkish dataset with 0 idle.
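The knobs this lesson tunes, in one DataLoader call (`dataset` is an assumed, already-tokenized map-style dataset; the right num_workers comes out of profiling, not folklore):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # assumed: tokenized map-style dataset
    batch_size=8,
    num_workers=8,            # start near physical cores / 2, then profile
    prefetch_factor=4,        # batches each worker keeps pre-loaded
    persistent_workers=True,  # don't re-fork workers every epoch
    pin_memory=True,          # page-locked host memory -> faster H2D copies
)
```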
- 7
Profiling Stack: torch.profiler + Nsight Systems + Nsight Compute + MFU Calculation
Optimization without profiling is hot air. Python-level timing with torch.profiler, kernel-level timeline with Nsight Systems (nsys), kernel-internal metrics with Nsight Compute (ncu), MFU (Model FLOPs Utilization) calculation. Cookbook certification: each Lab MFU > 35%.
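A back-of-envelope MFU calculator under the standard 6·N·D training-FLOPs approximation (the 165 TFLOP/s figure is an assumed RTX 4090 dense bf16 tensor-core peak — substitute your card's datasheet value):

```python
def mfu(n_params: float, tokens_per_sec: float, peak_tflops: float = 165.0) -> float:
    """Model FLOPs Utilization: achieved training FLOPs / hardware peak."""
    achieved_tflops = 6 * n_params * tokens_per_sec / 1e12  # ~6*N FLOPs per token
    return achieved_tflops / peak_tflops

print(f"{mfu(8.03e9, 1500):.1%}")  # 8B model at 1500 tok/s -> ~43.8% MFU
```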
- 8
Cost Engineering: Local 4090 vs Cloud H100? — Breakeven, Spot, and TCO Math
The silent question every FT engineer asks: 'Should I do this on local 4090 or send it to cloud?' Cookbook's decisive math: RTX 4090 amortization, electricity, cloud hourly tables, spot risk calculation, breakeven duration, hybrid strategy (4090 dev + cloud production).
Part II — Tokenizer & Data Engineering
- 1
BPE / SentencePiece / Unigram: The Math of Tokenizer Algorithms and Training a TR-Aware Tokenizer from Scratch
BPE's merge table, SentencePiece's language-agnostic byte/char model, Unigram's EM training; why each results in different token efficiency. Training a 50K-vocab BPE on 1.5GB Turkish corpus on RTX 4090 (~12 min). Mathematical proof of why TR-aware tokenizer beats Llama-3's default by 1.6x.
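A minimal byte-level BPE training sketch with the HF tokenizers library (file path and special token are illustrative; the lab pins the real corpus and normalizers):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=50_000, special_tokens=["<|endoftext|>"])
tokenizer.train(files=["tr_corpus.txt"], trainer=trainer)  # assumed corpus path
tokenizer.save("tr_bpe_50k.json")
```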
- 2
Vocabulary Extension: Add 8K TR Tokens to Llama-3 Tokenizer (Embedding Init Strategies)
Llama-3 default tokenizer is 128K — multilingual but inefficient for TR. The 'extension' approach: add 8K TR-specific tokens to Llama-3's vocab, expand embedding from 128K→136K, intelligently init new rows (mean-init, SVD-init, byte-decomp). Practical lab + perplexity delta on RTX 4090.
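Mean-init, the simplest of the three strategies, in a few lines (a sketch against the standard transformers API; SVD-init and byte-decomp replace the final assignment):

```python
import torch

def extend_embeddings_mean_init(model, tokenizer, new_tokens):
    n_added = tokenizer.add_tokens(new_tokens)
    old_emb = model.get_input_embeddings().weight.data.clone()
    model.resize_token_embeddings(len(tokenizer))
    with torch.no_grad():
        # New rows start at the mean of the old embedding matrix.
        model.get_input_embeddings().weight.data[-n_added:] = old_emb.mean(dim=0)
    # Note: an untied lm_head needs the same treatment on its new output rows.
    return n_added
```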
- 3
Tokenizer Distillation: Cross-Model Token Mapping and TR Token Efficiency Measurement
When distilling, teacher and student tokenizers differ → label mismatch. Building cross-tokenizer mapping table for token-level distillation, GPT-4 → Llama-3 distill example, comparison of TR token efficiency (Llama-3 vs Qwen 2.5 vs Gemma 3 vs Mistral vs Phi-4).
- 4
Chat Template Anatomy: Jinja, Special Tokens, and Token-by-Token Breakdown
Chat template = the format LLM understands as 'conversation'. Token-by-token anatomy of Llama-3, Qwen 2.5, Gemma 3, Mistral, Phi-4 chat templates. What apply_chat_template does under the hood, token IDs of system/user/assistant roles, tool-calling extensions, multimodal turn formats.
- 5
Loss Masking: The Real Implementation of 'Loss Only on Response'
Loss masking is the cornerstone of SFT. How IGNORE_INDEX=-100 interacts with PyTorch CrossEntropyLoss, how instruction tokens are masked while response is kept, source-code reading of Unsloth's train_on_responses_only, turn-by-turn masking in multi-turn conversations, edge cases.
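The core mechanic in isolation — prompt positions set to -100 so CrossEntropyLoss drops them (a toy example; real pipelines do this per turn using the chat template's offsets):

```python
IGNORE_INDEX = -100  # the default ignore_index of torch.nn.CrossEntropyLoss

def mask_prompt_tokens(input_ids, prompt_len):
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len  # no loss on instruction
    return labels

# 5 instruction tokens, 3 response tokens:
print(mask_prompt_tokens([101, 7592, 2088, 2003, 102, 318, 419, 2], 5))
# -> [-100, -100, -100, -100, -100, 318, 419, 2]
```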
- 6
Dataset Quality Pipeline: MinHash Dedupe + Perplexity Filter + Toxicity + Educational-Value
Garbage in, garbage out. SFT dataset quality pipeline: MinHash LSH for near-duplicates (~30-40% are duplicates), KenLM 5-gram perplexity filter, HateBERT-TR toxicity, FineWeb-style educational-value scorer. Clean 1M-row TR dataset in 25 min on RTX 4090.
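The MinHash LSH stage in miniature with the datasketch library (`corpus` is an assumed iterable of strings; threshold and num_perm are typical values, not the pipeline's pinned ones):

```python
from datasketch import MinHash, MinHashLSH

def shingles(text: str, k: int = 5):
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard similarity cutoff
kept = []
for idx, doc in enumerate(corpus):  # corpus: assumed iterable of documents
    mh = MinHash(num_perm=128)
    for s in shingles(doc):
        mh.update(s.encode("utf-8"))
    if not lsh.query(mh):           # no near-duplicate kept so far
        lsh.insert(str(idx), mh)
        kept.append(doc)
```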
- 7
Synthetic Data: Self-Instruct, Evol-Instruct, OSS-Instruct, MAGPIE (TR Adaptation)
Instruction data is scarce for TR. Solution: synthetic generation. TR adaptation of Self-Instruct (Stanford 2022), Evol-Instruct (WizardLM), OSS-Instruct (Magicoder), MAGPIE (2024). Teacher model selection ethics (GPT-4 ToS), prompt engineering, automated quality control.
- 8
Data Mixing Math: Sampling Temperature, DoReMi, Domain Reweighting
How to mix multiple datasets? Naïve concatenation = the large dataset dominates. Sampling temperature, proportional mixing, DoReMi (Xie et al. 2023) algorithm for dynamic reweighting. Turkish SFT mix example: 40% TR-Alpaca + 25% OASST + 20% ShareGPT-TR + 15% custom — why these percentages?
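Temperature-scaled sampling in two lines — T=1 reproduces proportional mixing, T→∞ approaches uniform (sizes below roughly mirror the TR mix above and are illustrative):

```python
import numpy as np

def mixing_weights(sizes, temperature: float = 2.0):
    # p_i = n_i^(1/T) / sum_j n_j^(1/T)
    p = np.asarray(sizes, dtype=float) ** (1.0 / temperature)
    return p / p.sum()

print(mixing_weights([400_000, 250_000, 200_000, 150_000]))
# T=2 flattens 40/25/20/15 toward ~32/25/23/20
```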
- 9
Sequence Packing & Variable-Length Attention: The Trick That Boosts Throughput by 40%
Padding tokens are wasted compute. Packing: concat multiple short examples into one sequence. Variable-length attention (flash_attn_varlen_func) with block-diagonal mask. TRL SFTTrainer packing=True internals, cu_seqlens tensor anatomy, throughput bench.
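The cu_seqlens tensor in isolation — the cumulative boundaries flash_attn_varlen_func uses to keep packed examples from attending to each other (a sketch; real collators build it straight from the batch):

```python
import torch

def build_cu_seqlens(seq_lens):
    cu = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
    cu[1:] = torch.cumsum(torch.tensor(seq_lens), dim=0)  # example boundaries
    return cu

# Three examples of length 5, 3, 8 packed into one 16-token sequence:
print(build_cu_seqlens([5, 3, 8]))  # tensor([ 0,  5,  8, 16], dtype=torch.int32)
```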
- 10
Streaming & Sharded Datasets: Training on 500GB+ Without Disk
A 1TB dataset fits on 4090's 2TB NVMe, but tokenizing and caching needs 5TB. Solution: streaming. HF datasets.IterableDataset, WebDataset .tar shards, MosaicML Streaming (MDS), S3 streaming, resumable streaming, multi-worker collator pattern.
- 11
Long-Context Dataset Engineering: NIAH, RULER, and Data for 128K Context FT
Actually using Llama 3.1's 128K context: how to produce long-context SFT data? NIAH synthetic, RULER benchmark recipes, long-form QA datasets, code-repo concatenation, repository-level context. Long-context QLoRA (128K seq) on RTX 4090 — 22GB peak with packing.
- 12
DPO / KTO Dataset Engineering: The Engineering of Chosen/Rejected Triplet Generation
DPO and KTO need 'chosen' (good) and 'rejected' (bad) response pairs. Generation methods: AI Feedback Loop (RLAIF), regex-graded pairs (math/code), human-in-the-loop, hard-negative mining, length-controlled pairs. UltraFeedback analysis, TR DPO dataset build, KTO's unpaired advantage.
Part III — Small Open Models (1B–8B)
- 1
Llama 3.1 / 3.2 / 3.3 8B — The Workhorse of RTX 4090: GQA + 128K Context + Turkish Recipe
Anatomy of Llama 3.1/3.2/3.3 8B-Instruct: 32-layer × 4096-hidden, GQA (8 KV-head), RoPE θ=500K, SwiGLU, RMSNorm, 128K context. QLoRA NF4 + Unsloth on RTX 4090 with 50K Turkish Alpaca for 1 epoch ~50 min. TR-MMLU baseline 32.4 → fine-tune 39.8 (+23%). Full recipe.
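The load-time skeleton of that recipe, shown with plain transformers + peft (the lab drives it through Unsloth; r/alpha/target values here are typical placeholders, not the pinned hyperparameters):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # QLoRA's NormalFloat4
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", quantization_config=bnb, device_map="auto")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters: a fraction of a percent of 8B
```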
- 2
Llama 3.2 1B / 3B — Edge & Mobile FT: Tied Embeddings + Distillation + GGUF Q4
Llama 3.2 1B/3B — distilled from Llama 3.1 8B. Tied embeddings, edge inference. Full FT possible on RTX 4090 (1B=2GB, 3B=6GB W). 8-15 tok/s on iPhone/Pixel with GGUF Q4_K_M. TR-MMLU numbers and dataset strategies.
- 3
Qwen 2.5 / Qwen3 1.5B/3B/7B — Multilingual Champion (Turkish Token Efficiency)
Qwen 2.5 / Qwen3 — Alibaba's open-weight family. 151K vocab (TR-friendly), Apache 2.0, easier than Llama for FT. Qwen2.5-7B QLoRA on RTX 4090: 1 epoch ~40 min. TR-MMLU baseline 38.1 → fine-tune 44.2 (+16%). Qwen3 14B + YaRN.
- 4
Qwen3 14B / 32B Base + YaRN: Long-Context FT (32K → 128K) Marginally Feasible on RTX 4090
QLoRA FT of Qwen3 14B with 32K context on RTX 4090 — peak 21 GB, marginal fit. YaRN rope-scaling math, long-context SFT dataset (NIAH + RULER), where 32B is impossible on 4090. Cloud 1×H100 80GB alternative.
- 5
Mistral 7B v0.3 + Mistral Small 3 (24B): Sliding Window Deprecation + Tool-Calling
Mistral 7B v0.3 (Apache 2.0, 32K), Mistral Small 3 (24B, Apache 2.0). v0.3 sliding window deprecation, function-calling, tool-token training. Mistral 7B QLoRA on RTX 4090: ~45 min. Mistral Small 3 marginal fit.
- 6
Gemma 3 1B / 4B / 12B / 27B: Google's 256K Vocab + Multimodal (4B+)
Gemma 3 — Google's 2025 open models. 256K vocab, 4B+ multimodal (SigLIP vision tower), GeGLU, RMSNorm, 128K context, ShieldGemma. Gemma 3 4B/12B QLoRA on RTX 4090. No system role (prepend to user), Gemma 3 ToS attention.
- 7
Phi-4 + Phi-4-mini: Microsoft's Synthetic-Curriculum Model — Why Fragile in TR?
Phi-4 14B + Phi-4-mini 3.8B — Microsoft's 'textbook quality' synthetic-data-trained models. Strong in math + code, weak in general TR. Why? Synthetic data heavily English. Phi-4 QLoRA Lab on RTX 4090 + where it shines (math reasoning, code completion).
- 8
SmolLM3 1.7B: Tiny Tier — Production Model Running on 8GB RAM Devices
SmolLM3 (HuggingFace, Mar 2025) — 1.7B params, hybrid GQA, 256K context (YaRN), 100% open (data, training pipeline, weights). Edge target: 8GB RAM phone / RPi 5 / IoT. Full FT on RTX 4090 in 25 min. Q4_K_M GGUF → 1.0 GB.
- 9
DeepSeek-R1-Distill (Llama-8B / Qwen-7B): Reasoning Trace Distillation — Learning 'Think Tokens'
DeepSeek-R1-Distill — Llama/Qwen bases distilled from R1 (671B) traces. <think>...</think> format, CoT trace dataset, compressing R1's reasoning into 7-8B. Your own reasoning FT on RTX 4090: 1000 R1 traces suffice.
- 10
Yi-1.5 / InternLM2.5 / Aya Expanse: Underdog Comparative TR-MMLU
Llama / Qwen / Gemma are popular but not the only options. Yi-1.5 (01.AI), InternLM2.5 (Shanghai AI Lab), Aya Expanse (Cohere) — which shines in TR? Same recipe comparison on RTX 4090.
- 11
Comparative Lab: Same Recipe + Same Data on 10 Models — Let the Table Decide
Part III capstone: FT 10 models (Llama 3.x, Qwen 2.5/3, Mistral, Gemma 3, Phi-4, SmolLM3, R1-Distill, Aya Expanse) on the same 50K TR Alpaca with same hyperparams. Loss curve overlay, TR-MMLU + MT-Bench table, GPU hours, electricity, quality/cost ratio.
Part IV — Mid-Large Models (13B-70B+) + Distributed Internals
- 1
PyTorch FSDP Anatomy: FULL_SHARD vs SHARD_GRAD_OP vs HYBRID_SHARD + Mixed Precision Policy
FSDP — modern PyTorch's distributed training weapon. 3 sharding strategies, MixedPrecision policy, BackwardPrefetch, auto_wrap_policy. Llama 3.3 70B QLoRA recipe on 8×H100 SXM.
- 2
FSDP2 (fully_shard): Per-Parameter Sharding + DTensor + 2024+ PyTorch Innovation
FSDP2 (PyTorch 2.4+) — evolution of FSDP. Per-parameter sharding (FlatParameter pattern dropped), DTensor backbone, FQN-based resumable checkpointing, easier mixed precision. Llama 3.3 70B + FSDP2 + DCP recipe.
- 3
DeepSpeed ZeRO Stage 1/2/3 + ZeRO-Infinity: NVMe Offload + 70B on Single GPU?
ZeRO (Microsoft) — father of sharding, predates FSDP. Stage 1 (optimizer shard), 2 (+ gradient), 3 (+ param, FULL_SHARD equivalent). ZeRO-Infinity NVMe spillover → 70B single GPU theoretically possible (slow but possible). Decision matrix: ZeRO vs FSDP.
- 4
Tensor Parallelism (Megatron): Column-Parallel + Row-Parallel Linear — Splitting the Matrix
Megatron-LM (NVIDIA) Tensor Parallel: matrix split *within itself* across GPUs. Column-parallel linear (output channels split), row-parallel (input channels), all-reduce/gather pattern. TP=2 vs TP=4 on 8×H100. FSDP+TP = 2D parallelism.
- 5
Pipeline Parallelism: GPipe + 1F1B + Interleaved — Bubble Overhead Math
Pipeline Parallel: model layers distributed across GPUs. Forward+Backward streamed. GPipe (simple + bubble overhead), 1F1B (memory efficient), Interleaved 1F1B (Megatron, halves bubble). 70B + 4-node × 8 GPU scenario.
- 6
Sequence Parallel + Context Parallel: Ulysses + Ring Attention + 1M Context
Breaking long-context FT's physics limit: split sequence/context across GPUs. DeepSpeed-Ulysses (sequence parallel — head-wise), Ring Attention (Berkeley), Megatron Sequence Parallel. Enable 1M context. Technical foundation of Kimi-1.5's (Moonshot) 2M context recipe.
- 7
Llama 3.3 70B QLoRA + FSDP: 8×H100 SXM Recipe (5.6h 1 Epoch)
Full Lab recipe for Llama 3.3 70B-Instruct: 8×H100 SXM cloud (Lambda $24/h), QLoRA NF4 + FSDP FULL_SHARD, bitsandbytes 4-bit, gradient checkpointing, paged AdamW. 50K TR Alpaca 1 epoch in 5.6h. TR-MMLU base 55.4 → 60.8.
- 8
Qwen 2.5 32B / 72B Math + Code Mastery: GSM8K + MATH-500 + HumanEval FT Recipe
Qwen 2.5 32B/72B — math + code baseline beating Llama 70B. Math-heavy dataset mix (GSM8K + MATH + AIME + MetaMathQA), step-by-step CoT format, code execution loop, hyperparameter differences. 4×H100 80GB QLoRA 32B recipe (~3h).
- 9
Command-R / Command-R+ + Granite 3: RAG-Native + Citation FT + Enterprise Tier
Cohere Command-R (35B) / Command-R+ (104B) — RAG-tuned baseline, native citation token training. IBM Granite 3 — Apache 2.0 enterprise tier, governance-focused. RAG-FT dataset format, citation accuracy measurement, tool-calling, Command-R+ QLoRA recipe on 4×H100 80GB.
- 10
Hybrid SSM Models: Falcon-Mamba + Zamba2 — Long Context Without KV-Cache
State Space Model (SSM, Mamba) — alternative architecture to Transformer. No KV-cache, inference O(N) (Transformer O(N²)). Falcon-Mamba 7B, Zamba2 (Mamba + transformer hybrid). FT pattern differs from Transformer: state reset, gradient flow, learning rate sensitivity. RTX 4090 recipe.
- 11
Multi-Node Run + Fault-Tolerant Training: 2 Node × 8 H100 NCCL Cluster
Reality of cluster training: nodes fail, NCCL hangs, checkpoints get corrupted. Cookbook's fault-tolerant recipe: NCCL_TIMEOUT, watchdog, signal handling (SIGUSR1), elastic launcher, graceful preemption resume. Survival kit for 70B model 2-day training.
Part V — MoE Internals & Fine-Tuning
- 1
MoE Mathematics: Top-K Router + Softmax + Noise + Auxiliary Load-Balancing Loss
Router is the heart of MoE. Top-K routing math derivation (Shazeer 2017, Switch Transformer 2021), token-to-expert assignment, expert capacity factor (overflow vs underutilization), load balancing loss, softmax temperature, top-K=2 vs top-K=1. Mixtral 8×7B's actual router config.
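A toy top-K router with the Switch-style auxiliary loss, to make the f·P term concrete (a sketch, not Mixtral's actual implementation; f is computed from the top-1 choice as in Switch):

```python
import torch
import torch.nn.functional as F

def topk_route(x, router_w, k: int = 2):
    """x: (tokens, hidden); router_w: (hidden, n_experts)."""
    logits = x @ router_w
    probs = F.softmax(logits, dim=-1)
    topk_p, topk_idx = probs.topk(k, dim=-1)     # token-to-expert assignment
    n_experts = router_w.shape[1]
    f = F.one_hot(topk_idx[:, 0], n_experts).float().mean(0)  # routed fraction
    P = probs.mean(0)                                         # mean router prob
    aux_loss = n_experts * (f * P).sum()  # pushes both toward uniform 1/N
    return topk_p, topk_idx, aux_loss
```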
- 2
Mixtral 8×7B / 8×22B FT: Router Collapse Problem + Aux Loss Weight Calibration
Most common Mixtral FT bug: **router collapse** — one expert dominates, others dead as training progresses. Capacity overflow, dynamic aux loss adaptation, expert balance metrics, FSDP + MoE compatibility (expert parallelism). Mixtral 8×7B QLoRA recipe on 4×H100 80GB (~4h).
- 3
DeepSeek-V3 / R1 (671B, 37B Active): Shared Expert + Fine-Grained Routing — Where to LoRA?
DeepSeek-V3 (671B params, 37B active) — best open example of modern MoE. Shared expert (common knowledge for every token) + 256 routed experts (fine-grained). DeepSeek-R1 same arch + RL for reasoning. Impossible on RTX 4090; cookbook's cloud recipe 16×H100 NDR IB + ZeRO-Infinity + expert parallelism.
- 4
Qwen3-MoE + Llama-4-MoE Pattern: Generic MoE FT Recipe (8×H100 Baseline)
Qwen3-MoE (30B-A3B, 235B-A22B) and Llama-4-MoE (Behemoth, Maverick, Scout) — 2025's new MoE generation. 'Generic MoE FT pattern' — apply same discipline to any MoE. Common chat template, router-aware LoRA, expert-targeted SFT. 8×H100 baseline recipe.
- 5
Sparse Upcycling: Converting Dense Model to MoE — Qwen2-MoE Technique Reconstruction
Sparse Upcycling (Komatsuzaki et al. 2022) — convert dense pre-trained model to MoE then continue pre-training to specialize. Copy existing FFN N times, add router, continue training. Cheaper than scratch pre-train. Qwen 2.5 7B → 7B-MoE (8 expert) conversion lab on RTX 4090.
- 6
Expert Specialization Probe: Token Routing Statistics + Language/Domain Specialization
MoE's secret: some experts 'specialize' in math, code, Turkish, formal writing. Probe to measure specialization: feed domain-specific test prompts (math, code, TR-chat), quantify which experts activate. TR specialization map of Mixtral 8×7B.
- 7
MoE Quantization & Inference: Expert Offload + Dynamic Routing Under Quant
MoE inference differs from dense: some experts 'cold' (rarely used) → CPU/disk offload. Dynamic routing × quantization interaction (router's quant tolerance), MoE-specific vLLM tuning, Mixtral AWQ + sparse expert loading. Mixtral 8×7B serving on RTX 4090 (~140 tok/s).
Part VI — Vision-Language Multimodal FT
- 1
VLM Architecture Anatomy: Vision Encoder + Projector + LLM Backbone — Detailed Dissection
VLM's 3 main components: Vision encoder (SigLIP-400M, ViT-G/14, EVA-CLIP), Projector (MLP / Q-former / Resampler / Cross-attention), LLM backbone. Token interleave format, image token allocation, position encoding harmony, 2D/M-RoPE patches. Architecture table for each popular VLM family.
- 2
LLaVA-1.5 / 1.6 / OneVision: 2-Stage Training + Projector Pretrain + Instruction Tune
LLaVA's classic 2-stage training recipe: (1) Projector-only pretrain on 558K image-caption pairs, (2) end-to-end instruction tune. Freeze strategy ablation (vision frozen vs unfrozen, LLM frozen vs unfrozen). LLaVA-1.6 Mistral 7B FT on RTX 4090.
- 3
Llama 3.2 Vision 11B / 90B: Cross-Attention Adapter + Multi-Image FT
Llama 3.2 Vision — Meta's cross-attention adapter approach (different from LLaVA's MLP). Vision encoder ViT-H/14 joins LLM via **interleaved cross-attention layers**. Multi-image FT, image+text interleave format. 11B QLoRA marginal on RTX 4090 (~22 GB), 90B cloud only.
- 4
Qwen 2.5-VL: Dynamic Resolution + M-RoPE + Turkish OCR FT (Invoice/Petition)
Qwen 2.5-VL (3B/7B/72B) — modern multimodal champion. **Dynamic resolution** (no 224×224 fixed), **M-RoPE** (temporal + height + width), document understanding, video, multilingual. End-to-end Turkish invoice/petition OCR FT: dataset prep, vision tower freeze, LoRA target, accuracy measurement.
- 5
Pixtral 12B + Pixtral Large: Mistral Multimodal — Resolution-Free + Apache 2.0
Pixtral 12B (Mistral Nemo 12B + 400M ViT) + Pixtral Large (124B) — Mistral's open multimodal. Apache 2.0, resolution-free, EU AI Act-compliance friendly. 7-32 images per context, 128K context. Pixtral 12B QLoRA marginal on RTX 4090 (~22 GB).
- 6
InternVL2.5 + Idefics3 + Phi-4-Multimodal: Comparative Architecture Tour
Less popular but important VLMs: InternVL2.5 (Shanghai AI Lab, 8B-78B), Idefics3 (HuggingFace), Phi-4-Multimodal (Microsoft, 5.4B vision+text). Architecture + FT pattern comparison. Which shines for niche use-cases (medical/document/scientific).
- 7
When to Freeze the Vision Tower? — Probing Lab + Downstream Eval
VLM FT's most debated decision: freeze the vision encoder or not? Frozen → vision capability preserved, training fast, less risk. Unfrozen → +2-5% quality but 3-5x slower training + overfit risk. Ablation: 5 freeze strategies comparison, RTX 4090 + Qwen 2.5-VL 7B.
- 8
Document VLM FT: DocVQA + ChartQA + TableVQA + Turkish Invoice/Petition Dataset
Document AI use-cases: DocVQA, ChartQA, TableVQA. TR-specific dataset generation: synthetic invoice + petition + contract images, structured field extraction. Qwen 2.5-VL 7B baseline → FT → field accuracy 76% → 94%.
- 9
Grounding FT: Bounding-Box Token Format + RefCOCO-Style Task
VLM's 'pointing' capability: 'point to the dog' → [0.32, 0.45, 0.58, 0.71]. Bbox token format: <bbox>x1,y1,x2,y2</bbox> or normalized 0-1000 coordinates. RefCOCO dataset, grounding evaluation (IoU), Qwen 2.5-VL's native grounding support.
- 10
Video LLM FT: LLaVA-NeXT-Video + VideoLLaMA3 + Frame Sampling Strategy
Video LLM — image's temporal extension. LLaVA-NeXT-Video, VideoLLaMA3, Qwen 2.5-VL native video. Frame sampling (uniform vs adaptive), temporal token compression, long-video Q&A (>1h). Video LLM FT on RTX 4090 — practical with short clips (10-30s).
Part VII — Speech & Audio Fine-Tuning
- 1
Whisper Architecture: Log-Mel Spectrogram + Encoder-Decoder + Language Tokens
Whisper (OpenAI 2022) — speech recognition's gold standard. Anatomy: log-mel spectrogram input (80 bins; 128 in large-v3), 4-32 layer encoder + decoder transformer, BPE tokenizer (50K + multilingual + tasks), language tokens, task tokens, timestamp tokens. Model variants: tiny (39M) → large-v3 (1.5B) → turbo (809M).
- 2
Whisper Large-v3 / Turbo TR FT: Common Voice + Bilkent + Mozilla TR + Custom Corpus
Turkish Whisper FT — comfortable on RTX 4090 (large-v3 ~6 GB, turbo ~3 GB). Common Voice TR (180h), Bilkent TR corpus, Mozilla TR. WER (Word Error Rate) measurement, TR-specific tokenize fixes. Baseline WER 12% → FT WER 6% (~2× improvement).
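WER measurement itself is a one-liner with the jiwer library (toy strings; the lab scores full Common Voice TR splits):

```python
import jiwer

refs = ["bugün hava çok güzel", "yarın toplantı saat onda"]
hyps = ["bugün hava cok güzel", "yarın toplantı saat onda"]

# WER = (substitutions + deletions + insertions) / reference word count
print(jiwer.wer(refs, hyps))  # 1 error over 8 words -> 0.125
```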
- 3
Turkish Dialect FT: Karadeniz / Aegean / Eastern Anatolian Pronunciation + Dataset Collection
Standard Turkish baseline Whisper is good but struggles with dialects (Black Sea 'cik' suffix, Eastern Anatolian, Aegean). Dialect speech recording protocol (consent), 50-100h regional corpus, FT + per-dialect WER. Production: customer service, healthcare (village services).
- 4
Streaming ASR: faster-whisper + distil-whisper — Real-Time Latency Budget < 200ms
Whisper is fast offline (batch) but not optimized for streaming. Solution: **faster-whisper** (CTranslate2 + INT8), **distil-whisper** (50% layers reduced student). Latency budget < 200 ms first-token, 70× real-time. Turkish streaming setup on RTX 4090: chunking, VAD, partial hypotheses.
- 5
Audio LLM: Qwen2-Audio + Phi-4-Multimodal Audio Branch — Audio Understanding + Reply
Audio LLM = beyond Whisper. Not just transcribe, but **understands** audio content and replies. Qwen2-Audio (Alibaba, 7B), Phi-4-Multimodal audio branch. Audio-specific tasks: emotion recognition, music understanding, environmental audio Q&A. Qwen2-Audio FT recipe on RTX 4090.
- 6
TTS FT: XTTS-v2 + F5-TTS + Kokoro + Parler-TTS — Turkish Voice Cloning (Consent + KVKK)
Text-to-Speech FT — insufficient TR baselines. XTTS-v2 (Coqui), F5-TTS (zero-shot voice cloning), Kokoro (StyleTTS2-based), Parler-TTS (description-controlled). Personal voice clone with 5-10 min reference audio. 1-3h FT on RTX 4090. **Ethics: consent + KVKK + deepfake risk**.
- 7
Speaker ID + Diarization FT: pyannote.audio + WavLM — Multi-Speaker Separation
Meeting/call center transcripts: 'who's speaking + what'. pyannote.audio (HF), WavLM speaker embeddings, diarization pipeline (VAD → embedding → clustering). Call center case: customer vs operator separation, FT on RTX 4090 + 100h TR call dataset.
Part VIII — Code Models & Repo-Level FT
- 1
FIM (Fill-in-the-Middle) Format: Prefix + Suffix → Middle Token Logic
Spine of code completion: FIM. Classic LLM next-token prediction is insufficient for code — in a real IDE the cursor is in the middle; both prefix and suffix exist. FIM training format. Dataset prep: random split + transform of existing code. Foundation: the Bavarian et al. 2022 paper.
- 2
Qwen2.5-Coder 7B/14B/32B: Repo-Level Context + FIM Native FT
Qwen2.5-Coder family — 2025's strongest open code LLM. FIM native, 128K context, optimized for repo-level. 32B HumanEval 92.7%, SWE-Bench-Lite 31.6%. 7B QLoRA on RTX 4090 in 40 min; 32B on cloud H100 80GB single-GPU.
- 3
DeepSeek-Coder-V2 16B / 236B: MoE Code Model + Multi-File Context
DeepSeek-Coder-V2 (DeepSeek 2024) — MoE arch (16B / 236B), one of the strongest open code LLMs. 338 programming languages, 128K context, multi-file repo understanding. 16B (2.4B active) QLoRA possible on RTX 4090; 236B cloud only.
- 4
StarCoder 2 + CodeLlama: BigCode RAIL License Labyrinth + 600+ Programming Languages
StarCoder 2 (BigCode + ServiceNow + HF, 2024) — 600+ programming languages, BigCode RAIL license. CodeLlama (Meta, 2023) — Llama 2 base, older. License nuances. Cookbook recommendation: Qwen2.5-Coder > DeepSeek-Coder-V2 > StarCoder 2 > CodeLlama.
- 5
Codestral + Codestral Mamba: Mistral's Code Stack — Only the Mamba Is Apache 2.0
Codestral 22B (Mistral 2024, non-commercial) + **Codestral Mamba 7B** (Apache 2.0, Mamba SSM arch). Codestral Mamba — only Apache 2.0 Mistral code model. SSM arch applied to code, long-context advantages.
- 6
Custom Stack FT Lab: Repo-Tuned Model on Mid-Size Repo (~50K LoC)
FT for company internal codebase: 50K LoC Python+TypeScript repo. File hierarchy preservation, internal symbol awareness, test file pairing, commit history mining (good/bad code), 7B model 4-6h FT on RTX 4090.
- 7
Code Eval: HumanEval + MBPP + BigCodeBench + LiveCodeBench + SWE-Bench-Lite
Code LLM standard benchmark suite: HumanEval (164 Python), MBPP (974 Python), BigCodeBench (1140 calls, 139 libs), LiveCodeBench (data-leak resistant), SWE-Bench-Lite (300 real GitHub issues). Pass@1 vs pass@10, code execution sandbox. Running bench on RTX 4090.
- 8
Code-LLM Safety: Secret Leak Memorization Probe + License-Tainted Code Filter
Code LLMs can memorize API keys, passwords, SSH private keys from training data → leak in production. Detection: memorization probe (random snippets from training set → does model continue?), license-tainted code (GPL viral) filtering. BigCode StarCoder leak incident lessons.
Part IX — Turkish-First & Localization Engineering
- 1
TR Corpus Building: mC4-TR + OSCAR-TR + KAPAR + Wikipedia + Common Crawl + Library Scraping
Collecting 100GB+ Turkish corpus: mC4-TR (35GB), OSCAR-TR (45GB), KAPAR (parliamentary transcripts), Wikipedia TR (2GB), Common Crawl filter (50-200GB potential), library scraping (TR State Library, open works). License and KVKK attention. Practical download/tokenize pipeline.
- 2
TR Quality Pipeline: KenLM Perplexity + Slur/PII Filter + Educational-Value
From raw TR corpus to quality FT data: KenLM 5-gram TR perplexity (gibberish/MT artifact filter), TR slur filter, TR PII detection (TC ID, phone, email), educational-value scorer (FineWeb adaptation). Clean 100GB TR corpus in 4h on RTX 4090.
- 3
Tokenizer Extension Lab: Llama-3 → +8K TR Tokens + Embedding Init
Part II Lesson 2.2's TR-specific full Lab. Add 8K most-frequent TR tokens to Llama 3.1 tokenizer, try byte-decomposition + SVD init, measure perplexity delta, downstream SFT after 500M token continual pre-train: tokens/word 3.2 → 2.1.
- 4
Continual Pre-training TR: Catastrophic Forgetting Mitigation + Replay Buffer
Main risk of continual pre-train: forgetting English while learning TR. Replay buffer (10-15% EN per batch), LR warmup, why LR should be 1/10-1/50 of original pre-train. 2B token TR continual PT on Llama 8B feasible in 24h on RTX 4090.
- 5
TR SFT: Quality > Quantity — 5K Curated TR Data > 100K Noisy
Main insight of TR SFT: less but high-quality data beats more but noisy. 5K human-curated TR > 100K MT-translated bad Alpaca. How to mix TR-Alpaca, OASST-TR, Mukayese, custom domain TR data. Curated 5K dataset: 1 epoch in 12 min on RTX 4090.
- 6
TR Models Reverse Engineering: Trendyol-LLM + Cosmos-LLaMA + KanaryaTR
Turkey's open TR LLMs: Trendyol-LLM (Trendyol e-commerce-focused), Cosmos-LLaMA (Cosmos AI Lab), KanaryaTR (Boğaziçi NLP), TURNA, AnatoliaLLM. Reverse-engineering each: model card, training pipeline, base + data + technique. What can you take for yourself.
- 7
TR Embedding FT: BGE-M3, jina-v3, nomic-embed TR Adaptation + MTEB-TR Eval
TR embedding model FT for RAG: BGE-M3 (multilingual, good TR baseline), jina-embeddings-v3, nomic-embed-text. TR-specific query/document pair generation, contrastive learning (InfoNCE), MTEB-TR benchmark. BGE-M3 TR FT 6h on RTX 4090.
- 8
TR Reranker FT: bge-reranker + jina-reranker — Pair Generation Recipe
Second stage of RAG pipeline: reranker. bge-reranker-v2-m3 (TR baseline) + jina-reranker-v2 + custom TR FT. Query-doc relevance score, cross-encoder architecture, hard-negative mining, 50K TR pairs in 4h on RTX 4090.
- 9
TR Agglutination Pitfalls: Suffix Tokenization + İ/I/ı/i Casefold Bug
Turkish is agglutinative — suffixes attached. Tokenizers often err on 'evlerimizdekiler'. İ/I/ı/i casefold bug, apostrophe normalize (TR vs ASCII), UTF-8 NFC vs NFD inconsistency. Cookbook's 'silent killer' bug list for TR engineers.
- 10
TR Benchmarking Suite: TR-MMLU + Mukayese + TruthfulQA-TR + BBQ-TR + Custom
Standard suite for evaluating FT models in TR: TR-MMLU (general knowledge, Boğaziçi), Mukayese (TR NLP tasks), TruthfulQA-TR (hallucination), BBQ-TR (bias). Automated with lm-eval-harness. CI integration, regression alarms.
Part X — Quantization Engineering
- 1
Quantization Mathematics: Symmetric/Asymmetric, Per-Tensor/Per-Channel/Per-Group, QAT vs PTQ
Mathematical foundations of quantization: float→int mapping formula, symmetric vs asymmetric, per-tensor vs per-channel vs per-group granularity, QAT vs PTQ, bit-width choice. Quantization characteristic of every tensor in Llama 8B's 32 layers on RTX 4090.
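The float→int mapping made concrete — symmetric, per-channel, int4 (a sketch of the formula; production quantizers add zero-points, groups, and calibration on top):

```python
import torch

def quantize_symmetric_int4(w: torch.Tensor):
    """scale = max|w| / (2^(b-1) - 1) per output channel; q = round(w / scale)."""
    qmax = 2 ** (4 - 1) - 1                           # 7 for int4
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # per-channel granularity
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q, scale

w = torch.randn(4096, 4096)
q, s = quantize_symmetric_int4(w)
print((w - q * s).abs().mean())  # mean dequantization error
```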
- 2
GPTQ Algorithm: Optimal Brain Quantization + Hessian Update — Llama 8B in 12 Min on RTX 4090
GPTQ (Frantar et al. 2022) — LLM weight quantization standard. Optimal Brain Quantization theory (descended from LeCun's 1990 Optimal Brain Damage), Hessian inverse update, error compensation, group quantization. Quantize Llama 3.1 8B in 12 min on RTX 4090. WikiText-2 perplexity delta < 2%.
- 3
AWQ Algorithm: Activation-Aware Salient Channel Scaling — Respecting Outliers
AWQ (Lin et al. 2023) — activation-aware alternative to GPTQ. 'Salient channel scaling' technique that protects activation outliers. Quantize Llama 3.1 8B in 8 min on RTX 4090 via autoawq, slightly better WikiText-2 PPL than GPTQ + easier vLLM serving.
- 4
GGUF K-Quants Block Structure: Q2_K → Q8_K + llama-quantize Perplexity Table
GGUF — llama.cpp's native format, common for CPU/edge inference. K-quants block structure (Q2_K → Q8_K), separate struct per bit-width, llama-quantize for conversion, perplexity-vs-size curve. bf16 → Q4_K_M conversion 5 min on RTX 4090, Q4 GGUF 4.6 GB → CPU/Pi/iPhone deploy.
- 5
EXL2 (ExLlamaV2): Variable Bitrate Quantization — Which Layer at Which Bit?
EXL2 — ExLlamaV2's native format. Different bit-width per layer; sensitive layers get more bits. Measure layer sensitivity via calibration, optimal allocation within budget. Fastest LLM inference for single-user on RTX 4090 (1.5-2x vs vLLM at batch=1).
- 6
FP8 Training: H100 Native, Premature on RTX 4090 — Transformer Engine Internals
FP8 = the future of AI compute. H100 native (FP8 Tensor Cores + WGMMA + Transformer Engine). RTX 4090 (Ada) supports FP8 GEMM but the ecosystem is immature — fallbacks are common, training pipelines buggy. Cookbook rule: bf16 training on 4090, FP8 inference (vLLM). FP8 training on H100 detailed in Part XIII.
- 7
Int4 QLoRA NF4 Internals: Double Quantization + Paged Optimizer + Bitsandbytes Source Tour
NF4 (4-bit NormalFloat) — the core of QLoRA. Optimal 4-bit quantization for normally-distributed weights. Double-quantization (also quantize the scale tensor) for additional 0.4 bit/param savings. Paged AdamW (overflow to CPU RAM). Bitsandbytes source-code tour.
- 8
FP8 Inference: vLLM SmoothQuant + TensorRT-LLM — Production-Ready on RTX 4090
FP8 training premature but FP8 inference production-grade in 2026. vLLM native FP8 (Llama 3.1+/Qwen 2.5+ support), TensorRT-LLM SmoothQuant, AWQ-marlin INT4 vs FP8 comparison. Llama 3.1 8B FP8 conversion + serving on RTX 4090 (~120 tok/s vs bf16 95).
- 9
Calibration Dataset Engineering: Domain-Aware Quantization — Ideal Set for Your Domain
GPTQ/AWQ quality heavily depends on calibration data. WikiText-2 default but varies by production use-case. TR calibration in TR production → 30% better TR-MMLU post-quant. Code domain GitHub Python. Math domain GSM8K. Calibration size sweet spot (128-512).
- 10
Round-trip Eval: Pre/Post Quant Table — TR-MMLU + MT-Bench + Niche Benchmark
Part X capstone: Quantize the same model in bf16, AWQ int4, GPTQ int4, EXL2 4.5bpw, GGUF Q4_K_M, FP8 and compare. TR-MMLU, MT-Bench-TR, niche custom benchmark (Turkish call center sample). Decision matrix: which quant for your use-case?
Part XI — Alignment & Preference Optimization
- 1
Classical RLHF: Reward Model + PPO + KL Constraint — Why Industry Abandoned It
RLHF (Christiano et al. 2017, InstructGPT 2022) — foundation of modern alignment. 3 stages: SFT base + reward model train + PPO with KL constraint. Why it largely vanished from industry? PPO instability, value head maintenance burden, DPO's practical superiority. Mini-RLHF demo with TRL on RTX 4090.
- 2
DPO Math: Bradley-Terry → Loss Function Derivation — Why No Reward Model?
DPO (Rafailov et al. 2023) — mathematical equivalent of RLHF, but SINGLE-stage. Bradley-Terry preference model → KL-constrained RL objective → closed-form optimal policy → SFT-like loss. β hyperparameter's effect on the gradient, DPO TRL DPOTrainer Lab on RTX 4090.
- 3
DPO Implementation From Scratch: One Page of Code, No TRL
Without using TRL DPOTrainer, write your own DPO loss: log-probabilities computation, reference model handling, loss formula, gradient backprop. ~80 lines of PyTorch. To understand where you can go wrong.
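The heart of those ~80 lines — the loss itself, given per-response summed log-probs (a sketch straight from the Rafailov et al. 2023 formula; the lesson adds the log-prob gathering and reference-model plumbing around it):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))."""
    chosen_reward = beta * (pi_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (pi_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    return loss, chosen_reward.detach(), rejected_reward.detach()
```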
- 4
ORPO: Odds Ratio Preference Optimization — Single-Stage SFT+Alignment
ORPO (Hong et al. 2024) — DPO alternative without SFT base requirement. SFT loss + odds-ratio preference loss in one stage. No ref model → memory savings. Reference-free training, λ hyperparameter, ORPO Lab on RTX 4090.
- 5
KTO (Kahneman-Tversky Optimization): Alignment from One-Sided (Unpaired) Feedback
KTO (Ethayarajh et al. 2024) — feedback you actually get in production: 'thumbs up' / 'thumbs down'. Not pairs. Classical DPO can't use this data. KTO fills the gap: utility function from prospect theory (Kahneman-Tversky). Continuous learning loop in production.
- 6
DPO Family: SimPO + IPO + CPO + RPO + APO — Decision Matrix of 5 Variants
DPO family expanded in 2023-2024: SimPO (Meng et al.) — length-normalized, IPO (Azar et al.) — overfit fix, CPO (Xu et al.) — KL ratio fix, RPO (Iterative) — online iterative, APO (anchored). Loss formula for each, when to use which, quick RTX 4090 comparison.
- 7
GRPO (Group Relative Policy Optimization): DeepSeek-R1's Verifiable Reward Recipe
GRPO (DeepSeek 2024) — simplified variant of PPO. No critic/value head. Sample G different responses per batch, normalize **relative rewards** within group. Verifiable rewards (math correctness, code execution) enable reasoning RL. Qwen-7B + GRPO + GSM8K accuracy +5-8% on RTX 4090.
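The critic-free trick in isolation — advantages are rewards normalized within each group of G samples (a sketch; the full GRPO update adds the clipped ratio and KL term):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (n_prompts, G) — G responses sampled per prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# 2 prompts × 4 samples, verifiable 0/1 math-correctness rewards:
r = torch.tensor([[1., 0., 0., 1.], [0., 0., 0., 1.]])
print(group_relative_advantages(r))
```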
- 8
Reward Function Engineering: Verifiable, Math, Code, Format, Length, Diversity
Reward function = definition of success for GRPO/PPO. Math (regex/SymPy), code (exec + test), format (chat template adherence), length (anti-rambling), diversity (n-gram penalty), composability. Cookbook's reward function design guide.
- 9
Process Reward Models (PRM): Step-Level Supervision — PRM800K Dataset
PRM = reward per reasoning step. Instead of outcome-only (final answer), teach quality of each intermediate step. OpenAI PRM800K dataset, Math-Shepherd auto PRM generation, Step-DPO. Foundation for test-time tree search (Best-of-N, MCTS). PRM train + use on RTX 4090.
- 10
Constitutional AI + RLAIF: Open Replication of Anthropic's Recipe
Anthropic Constitutional AI (Bai et al. 2022): AI critiquing and improving its own responses by 'principles'. RLAIF: alignment with AI feedback (LLM judge instead of human). Open replication: principle list, self-critique loop, revised dataset, small-scale CAI Lab on RTX 4090.
- 11
Reward Hacking Diagnostics: Gaming Detection, Length Bias, Sycophancy Probe
Models 'hack' reward functions — gain reward via wrong path. Length bias (long answers = high reward), sycophancy (overly agreeable), format gaming, repetition. Detection: ablation, holdout probe, qualitative review. Lessons from Anthropic's 'reward over-optimization' report.
Part XII — Reasoning Model FT (R1-style)
- 1
Reasoning Architecture: <think> Token + Segregated vs Interleaved CoT Decision Matrix
Reasoning models split: (1) **Segregated** — reasoning in <think>...</think> block (DeepSeek-R1, o-series), then final answer; (2) **Interleaved** — reasoning + answer mixed (classic CoT, GPT-4-1106). Each's advantages, FT challenges, user UX. Token budget management.
- 2
Reasoning Trace Dataset Generation: Teacher Distillation + Self-Bootstrapping
Trace data generation for reasoning SFT: (a) Teacher distillation — DeepSeek-R1 (MIT license!), Gemini-thinking, o3 API; (b) Self-bootstrapping — small model generates traces + verifiable filter keeps correct; (c) Hybrid. Llama 3.1 70B teacher local serve + 10K trace generation on RTX 4090 (~24h).
- 3
SFT on Reasoning Traces: Llama-8B + R1-Distilled Traces (8K → 32K Context)
If reasoning trace dataset ready, SFT technically simple but details matter: add <think> tokens to vocab, embedding init, context length 32K (R1 traces 5-15K tokens), loss masking (do think tokens contribute to loss?), epoch count. Llama 3.1 8B + 1000 R1 traces 1 epoch on RTX 4090 ~50 min.
- 4
GRPO RL Stage: Math + Code Reward — Convergence Numbers (Qwen-7B + GSM8K +5-8%)
Reasoning model's last stage: GRPO with RL. GRPO with math correctness + code execution rewards on top of SFT base. Reward shaping (correctness 1.0, format 0.2, length penalty 0.001), advantage normalization, KL constraint. Qwen 2.5 7B-Instruct + GSM8K on RTX 4090: 6-8h, accuracy +5-8%.
- 5
Long-CoT Stability: Repetition Collapse + Think-Loop Mitigation
Reasoning model's most common bug: **think-loop** — model keeps thinking same thing. Repetition collapse, length explosion (8K → 30K). Mitigation: entropy bonus, repetition penalty during training, max_think_tokens enforcement, reward shaping (length penalty), early-stopping heuristics.
- 6
Reasoning Eval: AIME 2024/2025 + MATH-500 + GPQA-Diamond + LiveCodeBench
Reasoning model standard eval suite: AIME 2024 (30 problems, American Invitational Mathematics Examination), AIME 2025 (new), MATH-500 (500 high-school competition problems), GPQA-Diamond (graduate-level science Q&A), LiveCodeBench (monthly-refreshed). pass@1 vs majority voting (pass@64) difference. Cookbook standard eval pipeline.
Part XIII — Custom Kernels & Performance Surgery
- 1
FlashAttention v2/v3 Internals: Tile + Online Softmax + Hopper WGMMA
FlashAttention's mathematical heart: tile-by-tile attention compute, **online softmax** (incremental running max + sum), backward recomputation strategy. v2 → v3 difference: Hopper WGMMA, async memory, FP8 attention. Head-size constraint, deterministic mode, varlen variant.
- 2
Triton Crash Course: Block Pointer + Autotune + Masks — GPU Kernel in 50 Lines
Triton (OpenAI, 2021) — GPU kernel framework as fast as CUDA, easy as Python. `@triton.jit`, `tl.program_id`, `tl.arange`, block pointer arithmetic, autotune decorator, mask-based load/store, shared memory abstraction. Write vector add → matmul → softmax kernels from scratch on RTX 4090.
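The canonical first kernel, roughly as the lesson builds it (this mirrors Triton's own vector-add tutorial):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements                # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```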
- 3
Custom Triton Kernel Lab: Cross-Entropy + Ignore-Index — Unsloth-Style Speedup
PyTorch's native `F.cross_entropy(ignore_index=-100)` is one of LLM training's most-called kernels, and a fused Triton rewrite beats the naïve path by ~30%. Cookbook Lab: fused logits + softmax + CE + grad → single kernel. The pattern Unsloth uses. 8B model FT throughput +15% on RTX 4090.
- 4
Liger Kernel Tour: RMSNorm + SwiGLU + GeGLU + Fused Linear+CE — Source Reading
Liger Kernel (LinkedIn, 2024) — production-grade Triton kernel suite. Fused RMSNorm + dropout, SwiGLU + GeGLU + GeLU, RoPE rotary, fused linear+CE (memory savings), CrossEntropy chunked. Llama 3.1 8B FT throughput +20%, memory -30% on RTX 4090. Source reading: production Triton patterns.
- 5
PagedAttention (vLLM): Block Table + Copy-on-Write + KV-Cache Fragmentation
Deep anatomy of vLLM's killer feature PagedAttention: split KV-cache into 16-token blocks, logical→physical block table, copy-on-write (prefix sharing), 0% fragmentation. CUDA implementation snippets, vLLM source reading. Prefix cache hit-rate 50%+ → throughput +60% on RTX 4090.
- 6
torch.compile + Inductor: Reduce-Overhead + Dynamic Shapes + Recompile Watcher
PyTorch 2.x's flagship feature: torch.compile. Inductor backend (Triton kernel generation), 3 modes (default, reduce-overhead, max-autotune), dynamic shapes (recompile watcher), CUDA graphs, integration into FT training pipeline. Llama 3.1 8B FT throughput +15% on RTX 4090.
- 7
CUDA Graph Capture: Static-Shape Inference Graph + Eliminating Latency Tail
CUDA Graph — technique to eliminate kernel launch overhead. 'Capture' a compute graph once, then 'replay' — each replay 5-10 µs (vs 30-50 µs kernel launch). Critical for inference latency (especially decode fast-path). vLLM uses it. Requires static shapes.
- 8
Speculative Decoding FT: Draft Model + EAGLE-2 + MEDUSA Head Training
FT version of speculative decoding: pair draft model with target, maximize accept rate. EAGLE-2 head training (Li et al. 2024, +94% throughput), MEDUSA multi-head training, training extra heads while target frozen. Llama 8B target + MEDUSA 4-head ~2-3h training on RTX 4090.
Part XIV — Closed-Source API Fine-Tuning
- 1
OpenAI GPT-4o-mini / GPT-4o / GPT-4.1 Fine-Tuning API: JSONL Schema + Cost + Dashboard
Full practice of the OpenAI fine-tuning API: JSONL format (chat messages), validation set, hyperparameter override, upload/monitor/download flow. Cost telemetry: training tokens × per-model rate ($3/M for GPT-4o-mini, $25/M for GPT-4o), fine-tuned inference billed above base. Your 1000 TR examples fine-tune GPT-4o-mini in 30 min.
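The upload → job → monitor flow in the official Python SDK (the model snapshot name is an assumption — check the current dashboard list):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# train.jsonl: one {"messages": [...]} chat sample per line
f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=f.id,
    model="gpt-4o-mini-2024-07-18",  # assumed snapshot; names rotate
)
print(client.fine_tuning.jobs.retrieve(job.id).status)
```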
- 2
OpenAI o-series Reinforcement Fine-Tuning (RFT): Grader Function Design
OpenAI announced RFT in late 2024: fine-tune o-series models (o1, o3, o4-mini) with reasoning RL. **Grader function** — function you write that gives numerical score to model output (math correctness, code execution, custom rule). Ideal for verifiable domains. JSON-based grader spec.
- 3
OpenAI GPT-5/5.1 Distillation Pipeline: Stored Completions + FT API Hybrid
OpenAI 'Stored Completions' feature (2024+): after GPT-5/5.1 inference, save completions → free dataset for distill. FT GPT-4o-mini on these completions → small-model-big-knowledge transfer. License matters (only completions you generated with your API key).
- 4
Anthropic Claude FT: AWS Bedrock Custom + Prompt-Caching Alternative
Anthropic doesn't provide direct FT API (not in Anthropic Console). Two workarounds: (1) **AWS Bedrock Custom** for Claude FT, (2) **Prompt caching** + few-shot prompting (no FT). Cookbook decision: for most use-cases prompt-caching + system prompt refinement suffices; for real FT use Bedrock route.
- 5
Google Gemini 1.5/2.0/2.5 Tuning (Vertex AI): TR Data Upload + Evaluation Pipeline
Google Gemini 1.5/2.0/2.5 — FT via Vertex AI. TR data upload (GCS), JSONL format (similar to OpenAI), training job submission, native evaluation pipeline. Gemini Flash 1.5/2.0 cost-effective TR FT alternative.
- 6
AWS Bedrock Customization: Nova / Claude / Llama / Mistral / Titan FT
5 model families via AWS Bedrock FT: Amazon Nova (Lite/Micro/Pro), Anthropic Claude (Bedrock-only route), Meta Llama, Mistral, Amazon Titan. Provisioned throughput cost math, S3 dataset upload, IAM policy. Turkey access (Frankfurt region).
- 7
Mistral La Plateforme Fine-Tuning: Mistral-Large 2 + Multi-Locale
FT on Mistral's own cloud platform La Plateforme: Mistral-7B-Instruct, Mistral-Small 3 24B, Mistral-Large 2 123B. JSONL Mistral-specific chat template, multilingual (EU + TR). EU data residency (GDPR compliant). Mid-range cost.
- 8
Cohere Command Custom Model: RAG-Tuned Foundation
Cohere Command R/R+ — RAG-native baseline. Custom Model FT via Cohere console, JSONL format, native citation token training. Production deploy Cohere endpoint or enterprise self-host.
- 9
Third-Party FT: Together AI + Fireworks + OpenPipe + Predibase + Replicate
5 important third-party FT services: Together AI (Llama/Qwen/Mistral, multi-tenant LoRA), Fireworks AI (low-latency serving + FT), OpenPipe (production logging → auto FT), Predibase (enterprise + Ludwig), Replicate (community). Decision matrix: cost / features / lock-in.
- 10
Closed-FT vs Self-Hosted FT Decision Matrix: TCO + Latency + Data Residency + KVKK
Cookbook's Part XIV summary decision: closed API FT vs self-hosted open FT. 6-dim comparison: TCO (1-yr estimate), latency (P50/P95), data residency (TR/EU/US), KVKK compliance, model freedom (versioning, license, deploy), quality. Typical decisions for 4 use-cases.
Part XV — Serving Engineering
- 1
vLLM Internals: Continuous Batching + PagedAttention + Prefix Cache
vLLM (Kwon et al. 2023) — gold standard of production LLM serving. Continuous batching: requests added/removed dynamically → GPU idle ends. PagedAttention: KV-cache managed in fixed blocks → 0% fragmentation. Prefix cache: common system prompts not recomputed. Llama 3.1 8B serving on RTX 4090 (175 tok/s batch=1, 920 tok/s batch=16).
- 2
LoRA Hot-Swap Lab: Single Base + N Adapters — 50 Customers Served on a Single 4090
vLLM 0.3+'s killer feature: single base + N LoRA adapters, runtime hot-swap. Separate LoRA per customer, all on same 24GB. Llama 3.1 8B base (~5 GB AWQ) + 30+ adapters (~40 MB each) → 50 customers on single 4090. QPS-vs-latency curve.
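The offline-inference shape of the pattern (adapter path and max_loras are illustrative; production serving does the same through the vLLM server's LoRA flags):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          enable_lora=True, max_loras=8)   # adapters resident simultaneously

out = llm.generate(
    ["Merhaba, siparişim nerede?"],
    SamplingParams(max_tokens=128),
    # (name, unique int id, local path) — one adapter per customer
    lora_request=LoRARequest("customer_42", 42, "/adapters/customer_42"),
)
```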
- 3
SGLang RadixAttention: Structured Output + JSON-Mode + Multi-Branch Caching
SGLang (Zheng et al. 2024) — alternative competitor to vLLM. RadixAttention: prefix cache organized in Trie/Radix tree → multi-branch sharing. Constrained decoding (regex, JSON schema), native structured output, optimized for agent workflows. Llama 3.1 8B SGLang serving + JSON-only response on RTX 4090.
- 4
TGI (HuggingFace Text Generation Inference): Production HF Endpoint Internals
TGI — HuggingFace's production inference server, powers hf.co/inference-endpoints. Rust + Python hybrid, prometheus metrics, multi-GPU support. More aggressive batching than vLLM, with FlashAttention-2 hard-wired in. Llama 3.1 8B serving via TGI docker on RTX 4090.
- 5
TensorRT-LLM: NVIDIA Native Engine — INT8 SmoothQuant + FP8 + In-Flight Batching
TensorRT-LLM — NVIDIA's LLM-specific TensorRT engine. CUDA kernels Hopper/Ada native, fastest inference (+15-30% throughput vs vLLM). Engine build process, INT8 SmoothQuant, FP8 quantization, multi-LoRA. Llama 3.1 8B TRT-LLM engine build (1h) + inference on RTX 4090.
- 6
llama.cpp + Ollama: GGUF Serving + Modelfile + System Prompt Versioning
llama.cpp + Ollama — gold standard for CPU/Apple Silicon/edge. GGUF format, Ollama's Modelfile (system prompt + tools versioning), Ollama API, OpenAI-compatible endpoint. Q4_K_M Llama 8B in Ollama on RTX 4090: 95 tok/s.
- 7
MLX-LM Apple Silicon: FT + Serve on M-Series Mac + Distributed MLX
Apple MLX (2023+) — unified memory ML framework for Apple Silicon. MLX-LM for Llama/Qwen/Gemma FT + inference. 70B inference on M3 Max 128GB, 8B FT on M2 Pro 32GB. Cookbook supplement for Mac users.
- 8
Speculative Decoding Production: Draft + Target Pairing + Accept Rate Measurement
Speculative decoding (Leviathan et al. 2023, Chen et al. 2023) — small draft model predicts 4-8 tokens, target model **verifies**. High accept rate → 2-3x throughput. EAGLE-2 (Li et al. 2024), MEDUSA head training. Llama 3.1 8B target + Llama 3.2 1B draft on RTX 4090: 175 → 290 tok/s.
- 9
Disaggregated Serving: Prefill/Decode Separation — Mooncake + DistServe
Latest trend in modern LLM serving (2024-2026): prefill (input encoding) and decode (generation) on different GPUs. Prefill compute-bound, decode memory-bound — separation gives 30-50% throughput gain. Mooncake (Kimi), DistServe (UCB) recipes. Conceptual in RTX 4090 multi-GPU.
- 10
Edge Inference: ONNX + Jetson + MediaTek NPU + Qualcomm AI Engine
Edge LLM inference is real in 2026: NVIDIA Jetson Orin, MediaTek NPU (Pixel), Qualcomm AI Engine (Snapdragon 8 Gen 3+), Apple Neural Engine. ONNX format for cross-platform, edge-specific quantization (INT8/INT4/W4A8 mixed), latency budget < 200 ms first-token. SmolLM3 1.7B + Pixel 8 Pro deploy recipe.
Part XVI — Production Operations
- 1
Model Registry: HuggingFace Hub Private Repo + MLflow + S3 Layout + Versioning
How to manage 50+ FT model versions in production? HuggingFace Hub private repo + MLflow Model Registry + S3 (chunked artifacts) hybrid. Versioning convention (semver + lineage), tags (production/canary/archive), retention policy. Cookbook's model card template (LoRA adapter + base + recipe).
- 2
A/B + Shadow Traffic: Feature Flag + Canary 1%→5%→25% + Automated Rollback
Safe way to put new FT model in production: shadow traffic (old + new in parallel, compare responses), canary deployment (gradual rampup 1%→5%→25%→100%), feature flag (LaunchDarkly / GrowthBook / Unleash), automated rollback (P95 latency or error rate threshold).
- 3
Online Eval: Judge LLM + Win-Rate Dashboard + Regression Alarms
Real-time model quality measurement in production: Judge LLM (GPT-4o-mini / Llama 3.3 70B) scores every Nth response, win-rate v2 vs v1 dashboard, regression alarms. Open eval kits: PromptFoo, DeepEval, RAGAS. Cookbook's eval suite: daily snapshot + weekly aggregate + alarm on regression > 3 points.
- 4
Drift Detection: Output Distribution Shift + Embedding-Cluster Anomaly
Models 'drift' over time in production: input distribution shifts, output style changes. Detection: response length histogram shift, embedding distance baseline → mean cluster drift, thumbs-down rate trend. Cookbook's weekly drift report — alarm + auto-retrain trigger.
- 5
Continual FT Loop: Weekly Retraining + Replay Buffer + Forgetting Mitigation
Model doesn't stay static in production — new data, new feedback, drift mitigation via **weekly retraining** loop. Replay buffer (30% of old training set) for catastrophic forgetting mitigation, weekly model vs current canary A/B, mandatory cert eval suite.
- 6
Memorization & Membership Inference: Training Data Extraction Probe
FT models may have memorized PII, secrets, copyrighted text from training data. Membership Inference Attack (MIA) test: feed random training snippets, does model continue? Detection thresholds. Mandatory pre-deploy check for KVKK + GDPR compliance.
- 7
Cost Observability: Token-Level Cost + FinOps Tagging + Idle GPU Detector
Bring production LLM TCO under control: per-request token cost tracking, customer-level FinOps tagging, idle GPU detector (alarm if vLLM utilization < 50%), cost-per-query trend, alarm thresholds.
- 8
Incident Drill: 'Model X Hallucinated Yesterday' — Root-Cause Matrix
Most-feared sentence in production: 'Model is returning garbage'. Cookbook's systematic root-cause matrix: model version change, base model update, API provider deprecation, dataset poisoning, prompt injection, sampling temp config drift. Incident response playbook, blameless postmortem template.
Part XVII — Turkey Use-Case Labs
- 1
E-commerce Customer Support Bot: Trendyol/Hepsiburada-Style SLA + Entity Extraction
TR e-commerce-specific customer support bot: 50K real tickets (anonymized) + Trendyol-style SLA (P95 < 3s), entity extraction (order number, product, shipping, return), intent classification (40+ intents), tool-calling (order status API). Llama 3.1 8B + Qwen 2.5 7B comparison, vLLM + LoRA hot-swap deploy.
- 2
TR Code Assistant: Repo with Turkish Comments + Continue.dev IDE Integration
Code assistant specific to Turkish dev ecosystem: FT on TR-commented repos (camelCase awareness, TR jargon), Continue.dev VS Code/JetBrains plugin integration, FIM completion + chat. Qwen2.5-Coder 7B + LoRA, self-host on RTX 4090. Internal company codebase + TR comment format.
- 3
Legal Q&A: TCK + TMK + Constitution + Legislation — RAG + FT Hybrid
TR legal LLM's most critical feature: hallucination KPI < 2% target. Constitution, TCK, TMK, Bankruptcy Law + Supreme Court rulings corpus (~5GB). Retrieval-augmented (BGE-M3 TR FT) + LLM (Qwen 2.5 14B QLoRA) hybrid. Citation token training (mandatory article ref in every answer). Integrated into lawyer workflow.
- 4
Medical Triage TR: Symptom → Preliminary Diagnosis + On-Prem Inference + KVKK + Audit-Log
Hardest parts of health LLM: regulatory (KVKK + health data special category), liability (wrong diagnosis = death), audit-log mandatory, on-prem required. Use case: family physician triage assistant — symptom list → possible preliminary diagnosis + specialist referral. Mistral Small 3 24B + on-prem + LoRA.
- 5
BIST Financial Sentiment + Balance Sheet PDF: Multimodal FT (Qwen2.5-VL)
FT for Turkish stock market (BIST): TR financial news sentiment classification (KAP filings + Bloomberg HT + economy media), reading balance sheet PDFs and extracting financial ratios (Qwen2.5-VL doc understanding), trade signal generation. Trade signals below 75% confidence are withheld (pass).
- 6
MEB Curriculum Tutor: High School Math / Physics PRM-Augmented Reasoning
MEB curriculum-compliant tutor: 9-12th grade math + physics, **PRM-augmented reasoning** (step-level correctness), adaptive difficulty, student misconception detection. Qwen 2.5 7B + reasoning SFT + PRM. RTX 4090 inference, web app frontend.
- 7
e-Government Citizen Assistant: Intent Classification + Tool-Calling (80+ Intents)
e-Government portal integrated LLM: 80+ intents (tax, insurance, driver's license, passport, property, etc.), tool-calling e-Government APIs, personal data (TC ID) PII handling. KVKK-compliant logging, audit trail, citizen consent. Llama 3.1 8B + custom SFT, on-prem.
- 8
Call Center Speech-to-Action: Whisper TR FT + LLM Intent + Real-Time Pipeline
End-to-end call center pipeline: Whisper Large-v3-Turbo TR FT (faster-whisper streaming) → real-time transcription → LLM intent classification (Qwen 2.5 7B) → action (CRM ticket open, order status, escalation). pyannote diarization (customer vs agent). P95 latency < 1.5s.
- 9
Banking Internal Copilot: On-Prem + KVKK Audit-Log + Prompt Injection Red-Team
Internal copilot for Turkish banking (customer rep + ops team): on-prem (BDDK + KVKK mandatory), audit log (every query + response 7-year retention), prompt injection red-team (attacker tries to access customer data), Mistral Small 3 24B + air-gapped deploy.
- 10
Municipality / Public Sector Doc-QA: Official Documents + E-Signature PDF Parse + FT
Doc-QA for municipality/public sector: zoning plan, property record, council decision, tender file, etc. official documents. E-signed PDF parse (PAdES + CAdES), table + form extraction, structured field QA. Qwen 2.5-VL doc understanding + LoRA, intent routing for citizen applications.
Part XVIII — Compliance, Governance & Red-Teaming
- 1
EU AI Act Classification: General-Purpose vs High-Risk + Annex IV Technical Documentation
EU AI Act (in force 2024): classifies LLMs into 4 categories — prohibited, high-risk, limited risk, minimal. Which category your FT model falls in = compliance budget. If high-risk: Annex IV, CE marking, conformity assessment. Mandatory if selling to EU market from Turkey.
- 2
KVKK Compliance: Anonymization + Right to Erasure + Machine Unlearning
KVKK Article 7: 'Right to Erasure'. Citizen says 'delete me from dataset': re-train expensive (millions). **Machine Unlearning** alternative: SISA approach or gradient ascent method. KVKK Board decisions, practical example (TR banking citizen erasure request).
- 3
Model License Labyrinth: Llama vs Gemma vs Qwen vs Mistral — 'Derivative Work' Debate
Which license when publishing an FT model? How does the base model's license affect the **derivative work**? Llama 3 Community License (>700M MAU restriction), Gemma ToS (responsible use), Qwen2 Apache 2.0 (most flexible), Mistral Research vs Apache (model-specific), OpenAI ToS (output restriction). Cookbook decision matrix.
- 4
Data License Chain: CC-BY-SA Viral Effect + Common Crawl ToS + GitHub Permissive Filter
How does training dataset's license affect FT model? CC-BY-SA viral (derivative must be same license), Common Crawl ToS (research only), GitHub permissive filter (MIT/Apache/BSD only — no GPL). Can model trained on Wikipedia (CC-BY-SA) be CC-BY-SA? Legal gray area.
- 5
Model Card + Datasheet: HuggingFace Template + Google Datasheet + Bias Section
Mandatory for modern open-source LLM publication: **Model Card** (HF) — model properties, training process, evaluation, intended use, limitations, bias. **Datasheet for Datasets** (Gebru 2021) — training data details. Bias section MANDATORY (EU AI Act requirement). Cookbook's TR template.
- 6
Bias Eval TR: BBQ-TR — Gender / Ethnicity / Sect / Age / Socioeconomic Probe + Mitigation
BBQ (Bias Benchmark for QA, Parrish 2022) TR adaptation: gender, ethnicity (Turkish/Kurdish/Arab/Armenian), sect (Sunni/Alevi), age, socioeconomic status, physical appearance — a 9-category bias probe. 1200 ambiguous question pairs. Cookbook's mitigation recipe: balanced SFT data + DPO bias-rejection examples.
- 7
Red-Teaming Lab: GCG + PAIR + AutoDAN + Prompt Injection Robustness
Mandatory before production deploy: red-team probe. GCG (Greedy Coordinate Gradient — adversarial suffix attack), PAIR (Prompt Automatic Iterative Refinement — LLM attacks LLM), AutoDAN (jailbreak auto-generation), prompt injection (malicious instruction in RAG context). Cookbook's open red-team corpus + scoring method.
- 8
Watermarking & Provenance: C2PA + SynthID + Model Fingerprinting
Making AI-generated content detectable: SynthID (Google, statistical watermark in the token distribution), C2PA (Coalition for Content Provenance and Authenticity — metadata-based), model fingerprinting (training-time backdoor as ownership proof). Mandatory for EU AI Act + emerging regulations.
- 9
DP-SGD (Differential Privacy SGD) + Federated FT: Opacus + Flower
Privacy guarantees for FT with sensitive data: DP-SGD (Opacus library) — add controlled noise to gradients, (ε, δ)-differential privacy guarantee. Federated FT (Flower) — data never reaches server, only gradients. Ideal for KVKK + health + finance. Privacy budget vs accuracy trade-off.
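The Opacus entry point in miniature (`model`, `optimizer`, `train_loader` are assumed to be built as usual; noise_multiplier and max_grad_norm set the privacy/accuracy trade-off):

```python
from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,  # sigma: more noise -> tighter epsilon, lower accuracy
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)
# ... train as usual, then read the spent budget:
print(privacy_engine.get_epsilon(delta=1e-5))
```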
- 10
ROOTS-Style Data Transparency: Reproducibility + Open Science Standards
ROOTS (BigScience BLOOM) — standard for full transparency of the training corpus. For the cookbook's FT models: dataset card (source, license, processing), data composition table, exclusion criteria. Projects that adopt this standard earn long-term trust in open science.
Capstone — Build Your Own LLM
- 1
Capstone Brief: End-to-End FT Project in Your Niche Domain — 12-Step Roadmap
Cookbook's final project: 4-6 week end-to-end FT project. Pick niche domain (health/legal/ecommerce/public/education/finance/literature/sports/games/history/etc.), collect data, extend tokenizer, continual PT, SFT + DPO, quantize, deploy via vLLM, eval, model card, public release. Practically integrates all 19 Parts.
- 2
Final Run Telemetry Report: MFU + Throughput + Loss + Cost Decomposition
Capstone's final deliverable: detailed telemetry report. MFU%, tokens/s, peak GPU memory, loss curve overlay (SFT + DPO), eval table (TR-MMLU + custom), cost decomposition (cloud hours × $ + electricity ₺ + storage), git_sha + data_sha256 + wandb_run_id triple. Cookbook standard: mandatory for certification.
- 3
Peer Review Rubric: Reproducibility + Eval Rigour + Engineering + TR-Domain Fit
Cookbook's peer review system: capstone projects reviewed by community members. 4 categories × 25 points: Reproducibility (lineage triple, env pinning, open repo), Eval rigour (TR-MMLU + domain bench + bias eval), Engineering quality (MFU >35%, code organization), TR-domain fit (real usage potential). 100 total, 70+ → certification.
- 4
Public Release Package: HF Hub + Model Card + Dataset Card + Eval Results + License Attestation
Releasing capstone model to world: public push to HuggingFace Hub, full model card, dataset card, eval_results.csv, Modelfile (Ollama compat), license attestation (base model + dataset chain), badges ('Apache 2.0', 'BBQ-TR tested', 'KVKK compliant'). Twitter/LinkedIn launch template.
- 5
Certification Path: 'FT Engineer Level III' — Cookbook's Official Recognition
Cookbook's closing certification: deliver at least 85% of lessons across all 19 Parts + capstone peer-review score ≥ 70/100 → **'FT Engineer Level III'** certificate. Certificate added to LinkedIn, recorded at sukruyusufkaya.com/certificates. Turkey's only independent FT engineer certification.