Can you explain the clear difference between CPT, SFT, and RAG with a concrete example?

Take the Turkish Code of Obligations as an example. (1) CPT: You have the model read the entire text of the Code of Obligations plus Yargıtay decisions as raw text. The content becomes part of the model's internal knowledge, and concepts like 'contract', 'performance', and 'default' are now natively recognized. If a new article is enacted, CPT is needed again. (2) SFT: You train the model in 'question → answer' format: 'Question: What is Article 118 of the Code of Obligations? Answer: Article 118 regulates the assignment of receivables…'. The model learns the response format but not new knowledge. (3) RAG: The Code of Obligations text is loaded into a vector DB; for each question, the 5 most relevant articles are retrieved and added to the context. The model itself does not learn anything, but it references the correct information. Rule of thumb: static + high-volume knowledge (Code + case law) → CPT; behavior format (Q&A style) → SFT; dynamic (new laws) or one-off reference → RAG. Module 1 covers this decision in detail.
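
The rule at the end of this answer can be written down as a small decision helper. This is only a minimal sketch: the function name `choose_method` and its three boolean inputs are illustrative assumptions, not part of the course material.

```python
def choose_method(changes_often: bool, high_volume: bool, behavior_only: bool) -> str:
    """Map the rule of thumb above onto CPT / SFT / RAG.

    All three flags are hypothetical inputs chosen for illustration.
    """
    if behavior_only:
        # Only the response format or style has to change.
        return "SFT"
    if changes_often or not high_volume:
        # Dynamic knowledge (new laws) or a one-off reference.
        return "RAG"
    # Static and high-volume knowledge (Code + case law).
    return "CPT"


if __name__ == "__main__":
    print(choose_method(changes_often=False, high_volume=True, behavior_only=False))
    print(choose_method(changes_often=True, high_volume=True, behavior_only=False))
    print(choose_method(changes_often=False, high_volume=False, behavior_only=True))
```

A real decision usually mixes these cases (for example CPT plus RAG), so the single return value is a simplification.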

How much data and compute is needed to train a Turkish LLM?

It depends on the goal. Minimum for a 'satisfactory Turkish CPT': 8B base + 20-50B Turkish tokens + 8x H100, 1-2 weeks (~$10K-30K). For a 'top-tier production-grade Turkish LLM' (Cosmos / Trendyol-like): 70B base + 200-500B tokens + 16-32x H100 cluster, 1-2 months (~$200K-500K). Most economical: Llama 3.1 8B + 30B Turkish tokens + Aya Expanse-style vocabulary expansion + a single H100 + 1 week (~$2K-5K), which yields a small but functional Turkish LLM. In Module 12's capstone, you do realistic planning based on your own compute budget.
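
A back-of-the-envelope estimator can help sanity-check budgets like these. The sketch below uses the common approximation of 6 × parameters × tokens training FLOPs. The peak throughput, the utilization (`mfu`), and the hourly price are assumptions to replace with your own measurements.

```python
def cpt_gpu_hours(params: float, tokens: float,
                  peak_flops: float = 989e12, mfu: float = 0.40) -> float:
    """Estimate GPU-hours with the 6 * N * D training-FLOPs approximation.

    peak_flops: assumed dense BF16 peak of one H100 SXM (vendor figure).
    mfu: assumed model FLOPs utilization; real runs vary widely.
    """
    total_flops = 6.0 * params * tokens
    seconds = total_flops / (peak_flops * mfu)
    return seconds / 3600.0


def cpt_cost_usd(gpu_hours: float, usd_per_gpu_hour: float = 3.0) -> float:
    # usd_per_gpu_hour is a placeholder; cloud prices differ by provider.
    return gpu_hours * usd_per_gpu_hour


if __name__ == "__main__":
    hours = cpt_gpu_hours(params=8e9, tokens=20e9)
    print(f"GPU-hours: {hours:,.0f}")
    print(f"Wall-clock days on 8 GPUs: {hours / 8 / 24:.1f}")
    print(f"Cost at the assumed rate: ${cpt_cost_usd(hours):,.0f}")
```

The estimate covers a single clean training pass only. Data processing, evaluation, and restarted runs are not included.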

Is vocabulary expansion really necessary? Can a good Turkish LLM be made without it?

It depends on the context. The Llama 3 tokenizer has a fertility of ~1.8-2.5 for Turkish (a Turkish word splits into ~2 tokens on average), compared with 1.0-1.3 for English; this roughly doubles inference cost and latency. If only quality matters, good results can be achieved without vocabulary expansion (Trendyol AI took this route). However, if cost is critical in production, expansion is mandatory: Aya Expanse's 32B model uses 40% fewer tokens than the Llama 3 tokenizer for Turkish. Module 4 clarifies this decision with concrete numbers via fertility analysis + Pareto frontier.
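
Fertility as used here is simply tokens per word, so it can be measured with a few lines of code. The sketch below takes any tokenizer as a callable. `toy_tokenize` is a hypothetical stand-in so the example runs without downloading a model. Tokenizing word by word is also a simplification, since real subword tokenizers see whitespace context.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words to measure")
    return n_tokens / n_words


def toy_tokenize(word: str, piece_len: int = 4) -> List[str]:
    # Hypothetical stand-in tokenizer: fixed-length character chunks.
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


if __name__ == "__main__":
    sample = ["sözleşmenin ifası gecikti", "borçlu temerrüde düştü"]
    print(round(fertility(sample, toy_tokenize), 2))
```

To compare real tokenizers, the `tokenize` method of a loaded Hugging Face tokenizer could be passed in place of `toy_tokenize`, using the same sample corpus for every tokenizer.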

How do we prevent catastrophic forgetting? Is 5% pre-training mix sufficient?

Not on its own. A 5% pre-training data mix (replay buffer) reduces forgetting by 30-50% in most domain CPTs but does not eliminate it entirely. Stronger mitigation: 10-20% pre-training mix + LoRA (adapter-based protection) + model souping (weight averaging of the pre-CPT and post-CPT checkpoints). Most reliable: LoRA + final-stage model merging (50/50 weight average of CPT model + base model). Module 3 shows how each mitigation reduces forgetting with concrete MMLU regression numbers; for example, in Anthropic's 2024 experiment, 10% replay + LoRA → general MMLU dropped only 0.5 points.
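
The 50/50 merge mentioned above is a linear interpolation between two checkpoints with identical parameter names. The sketch below works on plain Python lists so it runs without a deep-learning framework. The `Weights` type and the function name are assumptions for illustration; a real checkpoint would hold tensors.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def interpolate_weights(base: Weights, cpt: Weights, alpha: float = 0.5) -> Weights:
    """Return alpha * cpt + (1 - alpha) * base for every parameter."""
    if base.keys() != cpt.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name, base_values in base.items():
        cpt_values = cpt[name]
        if len(base_values) != len(cpt_values):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [alpha * c + (1.0 - alpha) * b
                        for b, c in zip(base_values, cpt_values)]
    return merged


if __name__ == "__main__":
    base = {"layer.weight": [0.0, 2.0]}
    cpt = {"layer.weight": [1.0, 4.0]}
    print(interpolate_weights(base, cpt, alpha=0.5))
```

Sweeping `alpha` and evaluating each merged model on both domain and general benchmarks is one way to pick the trade-off point.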

Can I do domain CPT with LoRA, or is full fine-tuning mandatory?

It depends on the volume of knowledge injection. LoRA capacity is limited by rank; a small rank (r=8-16) suffices for general behavior tuning but is insufficient for deep domain knowledge (thousands of legal articles, millions of medical cases). Practical rule: <5B token CPT + single domain → LoRA rank 64-128 is sufficient; 10-50B tokens + multi-topic → LoRA rank 256+ or DoRA; 100B+ tokens + new language → full FT is mandatory (Aya Expanse and BloombergGPT took this route). Module 7 shows the concrete limits of LoRA capacity with evidence.
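
To see why rank bounds capacity, it helps to count what LoRA actually trains. For one weight matrix of shape d_out × d_in, LoRA adds two factors with rank × (d_in + d_out) parameters in total. The sketch below compares that with the full matrix. The 4096 hidden size in the demo is an assumed example value.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """LoRA parameters as a fraction of the full matrix."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    hidden = 4096  # assumed hidden size, for illustration only
    for rank in (16, 64, 128, 256):
        share = lora_fraction(hidden, hidden, rank)
        print(f"rank {rank:>3}: {lora_params(hidden, hidden, rank):>9,} params "
              f"({share:.2%} of the full matrix)")
```

The count says nothing about how much knowledge a given rank can absorb; it only shows how the trainable budget grows linearly with rank.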

How is domain-data collection done? Can it be KVKK-compliant?

Three main sources: (1) Public domain: Official Gazette, Yargıtay decisions, TCMB reports, KAP disclosures, university thesis archives (low KVKK risk); (2) Web scraping (news, blogs, forums): copyright + terms of use + KVKK PII filtering are mandatory; (3) Corporate internal data (patient records, customer interactions): anonymization + explicit consent + KVKK Article 6 compliance are critical. Module 2 covers the KVKK-compliant PII-detection pipeline (Turkish TC ID, IBAN, name detection) in detail, and Module 6 covers sectoral compliance (legal + healthcare + finance KVKK guide).
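
As one concrete piece of such a PII pipeline, TC ID numbers can be found with a regular expression and then confirmed with the commonly documented checksum rule for the 10th and 11th digits. The sketch below is a minimal illustration, not a complete KVKK filter. The function names and the `[TC_ID]` placeholder are assumptions.

```python
import re

# 11 digits, first digit non-zero, not embedded in a longer digit run.
TC_CANDIDATE = re.compile(r"\b[1-9]\d{10}\b")


def is_valid_tc_kimlik(number: str) -> bool:
    """Check the two checksum digits of a candidate TC ID number."""
    if not TC_CANDIDATE.fullmatch(number):
        return False
    digits = [int(ch) for ch in number]
    odd_sum = sum(digits[0:9:2])   # 1st, 3rd, 5th, 7th, 9th digits
    even_sum = sum(digits[1:8:2])  # 2nd, 4th, 6th, 8th digits
    tenth = (odd_sum * 7 - even_sum) % 10
    eleventh = sum(digits[:10]) % 10
    return digits[9] == tenth and digits[10] == eleventh


def mask_tc_kimlik(text: str, placeholder: str = "[TC_ID]") -> str:
    """Replace checksum-valid TC ID numbers; leave other 11-digit runs alone."""
    return TC_CANDIDATE.sub(
        lambda m: placeholder if is_valid_tc_kimlik(m.group()) else m.group(),
        text,
    )


if __name__ == "__main__":
    print(mask_tc_kimlik("Kimlik no: 10000000146, sipariş no: 12345678901"))
```

Validating the checksum before masking reduces false positives on order numbers and other 11-digit strings. IBAN and name detection need separate rules or models.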

Can I train a sectoral finance LLM for Turkey using the BloombergGPT recipe?

Yes. Module 6.2 teaches exactly this. BloombergGPT is a 50B-parameter model trained on a mix of financial and general-purpose text; a Turkish adaptation of that recipe can draw on TCMB EVDS (economic data), KAP company disclosures, BIST IPO/finance reports, a Turkish balance-sheet corpus (~10-30B tokens achievable), and Reuters Turkish + Anadolu Agency finance news. With Llama 3.3 70B + 30-50B Turkish finance token CPT, a production-grade Turkish finance LLM is possible, evaluated on FinanceBench-tr and MMLU-Finance-tr. This scenario is offered as an option in the capstone.

How do I extend 128K context to 1M with YaRN?

Module 8 recipe: (1) Increase the base RoPE freq (10000 → 1M+); (2) Apply YaRN attention scaling correction; (3) Curriculum CPT: 4K → 16K → 64K → 256K → 1M progressive extension; 100M-1B long-context tokens at each stage; (4) Eval at each stage with RULER + needle-in-a-haystack. Llama 3.1 recipe (Meta 2024): 8K base + 800B token long-context CPT → 128K context. For 1M+: 128K base + 100B tokens + LongRoPE → 1M-10M context (Microsoft 2024 approach). Compute: 8x H100, 1-2 weeks, ~$15K-30K.
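
The curriculum in step (3) can be tabulated together with the attention scaling from step (2). The YaRN paper recommends a temperature defined by sqrt(1/t) = 0.1 · ln(s) + 1, where s is the context extension factor. The sketch below assumes that s is measured against the original training length at every stage, and that this length is 4096; both are assumptions for illustration.

```python
import math
from typing import List, Tuple


def yarn_attention_factor(scale: float) -> float:
    """sqrt(1/t) = 0.1 * ln(s) + 1; returns 1.0 when there is no extension."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


def curriculum(original_len: int, stages: List[int]) -> List[Tuple[int, float, float]]:
    """Return (target length, scale factor, attention factor) per stage."""
    rows = []
    for target_len in stages:
        scale = target_len / original_len
        rows.append((target_len, scale, yarn_attention_factor(scale)))
    return rows


if __name__ == "__main__":
    stages = [16_384, 65_536, 262_144, 1_048_576]  # 16K, 64K, 256K, 1M
    for target_len, scale, factor in curriculum(4096, stages):
        print(f"{target_len:>9} tokens  s={scale:>5.0f}  attention factor={factor:.3f}")
```

The factor grows only logarithmically with the extension ratio, so the progression from 16K to 1M changes it gradually rather than in large jumps.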

Are DoReMi and RegMix really used? Isn't manual mix ratio sufficient?

In production, it has become standard for large models. Manual mix (e.g., 40% web + 20% code + 15% math + 10% academic + 15% multilingual) is a heuristic decision; DoReMi and RegMix estimate optimal mix for large models via small-scale proxies and yield 5-15% MMLU/HellaSwag improvements. These methods were used in the Tülu 3 (AllenAI 2025), DCLM (DataComp-LM 2024), Llama 3.1 / Qwen3 production pipelines. Module 9 shows DoReMi and RegMix implementations practically.

RLHF + Reasoning + Mech Interp + CPT — in what order should these four trainings be taken?

Optimal order depends on the scenario, but the recommended general path: (1) First CPT — understanding how the base LLM learns knowledge; (2) Then RLHF/DPO/GRPO — how the model is aligned; (3) Then Reasoning Models — how reasoning capability is produced; (4) Finally Mech Interp — how all these processes are represented inside the model. This order follows technical dependencies. For fast production: first RLHF + Reasoning (most practical), then CPT (if data is ready), Mech Interp last (if research/safety focused). The capstones transfer knowledge across these trainings.

What concrete artifacts will I have at the end of the training?

The following artifacts are produced in the capstone project: (1) an end-to-end CPT pipeline tailored to your domain (Python codebase + Axolotl/OpenRLHF YAML config); (2) a Turkish or domain pre-training corpus (cleaned + dedup + quality-filtered + KVKK PII removed); (3) vocabulary expansion and a custom tokenizer (if applicable); (4) a CPT checkpoint (8B base + Turkish or domain CPT); (5) a 4-dimensional eval report (domain gain + forgetting + long-context + production); (6) a cost analysis (compute hours + dollars + alternative scenarios); (7) a catastrophic-forgetting mitigation decision matrix; (8) a 90-day production deployment roadmap (post-CPT SFT + DPO + RAG integration).

Can the training be customized for our enterprise team?

Yes. Beyond the standard 3-day program, we offer customized private-classroom versions for enterprise clients. Module weights and capstone scenarios are tailored to your team's target domain (legal, healthcare, finance, code, public sector, education), existing LLM stack (Cosmos / Trendyol AI / your own base), compute infrastructure (cloud / on-premise H100/B200 cluster), data volume + quality, compliance requirements (KVKK, EU AI Act, HIPAA, ISO/IEC 42001), and language target (Turkish only vs Turkish + Arabic + others).

About this training

A 3-day advanced Turkish training that covers end to end the Continued Pretraining + Domain Adaptation discipline for those wishing to train a Turkish LLM (Cosmos Llama, Trendyol AI, KUIS-AI, Aya Expanse) or produce a custom LLM for legal/healthcare/finance/code domains. Includes catastrophic-forgetting mitigation, vocabulary expansion, YaRN long-context extension, DoReMi/RegMix data mixing, LoRA/DoRA/QLoRA/GaLore efficient CPT, and domain-benchmark production.

This training is designed for:
ML Engineers and AI Researchers who want to train a Turkish LLM (Cosmos / Trendyol AI / KUIS-AI / Aya Expanse style)
Enterprise AI teams who want to produce a custom LLM for legal / healthcare / finance / code domains
Startup technical leaders who want to build domain-specific LLM architectures like BloombergGPT, Med-PaLM, Harvey AI
ML Platform engineers who want to adapt Llama 3.3 / Qwen3 / Gemma 3 / Mistral base to their sector
University research groups that want to lead on Turkish LLM benchmarks
Data Engineers who need to perform long-context (128K-1M) extension + KVKK-compliant Turkish deployment

Why this course matters:
The only program in Turkey that covers the CPT + Domain Adaptation discipline end to end with math + data + mitigation + eval.
Comparatively analyzes Cosmos Llama, Trendyol AI, KUIS-AI, Aya Expanse from the CPT methodology perspective.
Makes BloombergGPT, Med-PaLM, Harvey AI-style domain-specific LLM recipes Turkish + KVKK-compliant.
Covers 2024-2026 frontier techniques like vocabulary expansion + YaRN long-context + DoReMi data mixing.
Instills compute-optimal CPT-selection discipline via LoRA / DoRA / QLoRA / GaLore comparison.
Deeply covers catastrophic forgetting with Fisher Information Matrix + EWC + replay buffer.
Through the capstone project, equips the participant with a CPT pipeline + cost analysis + roadmap applicable in their own domain.
Together with RLHF + Reasoning Models + Mech Interp + CPT, completes a four-training frontier set covering the alignment + reasoning + interpretability + knowledge injection ecosystem.

Learning outcomes by the end of the programme:
Apply the CPT vs SFT vs RAG decision matrix at enterprise scale.
Build a FineWeb-style data pipeline for Turkish + domain.
Select catastrophic-forgetting mitigation recipes based on evidence.
Double Turkish efficiency via vocabulary expansion + tokenizer adaptation.
Train a Turkish LLM at the Cosmos / Trendyol AI / KUIS-AI / Aya Expanse level.
Build a CPT pipeline in the legal, healthcare, finance, or code domain.
Make compute-optimal choices among LoRA, DoRA, QLoRA, GaLore.
Perform 128K-1M long-context extension with YaRN.
Estimate optimal data mix with DoReMi/RegMix.
Build a 4-dimensional (domain gain + forgetting + long-context + production) post-CPT eval framework.

Prerequisites and recommended background:
Active Python experience (intermediate to advanced), basic use of PyTorch and HuggingFace Transformers
Basic experience with LLM fine-tuning (at least conceptual familiarity with SFT, LoRA)
Foundational ML math: linear algebra, probability, gradient descent
Basic knowledge of transformer architecture (attention, residual stream, RoPE)
GPU access: H100 (80GB) or 2-4x A100 recommended for the capstone
HuggingFace + Weights & Biases account before the training

The only advanced program in Turkey that covers Turkish LLM (Cosmos, Trendyol AI, KUIS-AI, Aya Expanse) and domain-specific LLM CPT end to end
Production-grade data engineering with FineWeb pipeline + Turkish corpus + KVKK-compliant PII detection
Mathematical construction of catastrophic forgetting + EWC + replay buffer + LoRA-CPT + model souping mitigation
2x token efficiency for Turkish via vocabulary expansion + tokenizer adaptation (FOCUS, Aya Expanse approach)
CPT recipes for legal (Yargıtay/Danıştay), healthcare (DSM-5-TR), finance (TCMB/BIST), code (DeepSeek-Coder) domains
128K-1M long-context extension + RoPE scaling techniques with YaRN
DoReMi + RegMix data mixing + Llama 3.1/Qwen3 cooldown/annealing recipe
Compute-optimal-selection discipline via Full FT, LoRA, DoRA, QLoRA, GaLore comparison

Advanced Level | 3 Days

LLM Continued Pretraining and Domain Adaptation Engineering Training (Turkish LLM + Legal/Healthcare/Finance Domain)

About This Course

This training is a 3-day advanced Continued Pretraining (CPT) program designed end to end for ML Engineers, AI Researchers, Data Engineers, and ML Platform engineers who want to adapt open-source base LLMs (Llama 3.3, Qwen3, Gemma 3, Mistral) to the Turkish language or to domains like legal, healthcare, finance, and code. In Turkey, projects that train Turkish LLMs, such as Cosmos / Trendyol AI / KUIS-AI, are growing rapidly. Similarly, law firms need Harvey AI-style case-law reasoning, healthcare institutions need Med-PaLM-style medical expertise, and finance companies need BloombergGPT-style sectoral intelligence in custom LLMs. However, a Turkish-language training that covers this discipline end to end with math + data pipeline + mitigation + eval is virtually nonexistent: existing content either stays at the level of academic-paper summaries or does not go beyond copy-paste example scripts. This program is designed to fill that gap as Turkey's most comprehensive production-grade CPT reference training.

The strategic backbone of the program is the first module, which clearly frames the place of the Continued Pretraining discipline in the pre-training → CPT → SFT → DPO/RLHF → deployment flow and its difference from SFT / RLHF / RAG. An evidence-based decision matrix is provided: knowledge injection (teaching new knowledge, learning a new language, static domain knowledge) → CPT optimal; behavior shaping (response style, formatting, instruction following) → SFT sufficient; dynamic / frequently updated knowledge → RAG mandatory; very high-volume + static domain knowledge → CPT + RAG hybrid. Production case studies are analyzed from a strategic perspective: BloombergGPT (50B-parameter finance LLM), Med-PaLM (healthcare), Code Llama (code), Cosmos Llama / Trendyol AI / KUIS-AI / Aya Expanse (Turkish), and DeepSeek-Math / Qwen3-Math / Llemma (math).

The second module is dedicated to the data-engineering discipline that determines 70% of CPT success. HuggingFace FineWeb (15T tokens) and FineWeb-Edu methodology; Common Crawl WARC processing and HTML cleaning with Trafilatura; comparison of Cosmopedia, RefinedWeb, RedPajama, DOLMA datasets; for Turkish: Turkish FineWeb, mC4-tr, OSCAR-tr, Wikipedia-tr, Boğaziçi/İTÜ/KUIS open corpus sources; deduplication strategies (exact hash, MinHash LSH fuzzy dedup, embedding-based semantic dedup); quality filtering (Gopher rules + Cosmopedia + fastText classifier); KVKK-compliant PII detection (Turkish TC kimlik, IBAN, phone-number detection); toxicity and contamination detection — every stage is hands-on. The practical recipe for producing 100B-500B tokens from Turkish raw data is provided.
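
The fuzzy-dedup step can be illustrated with a small MinHash sketch. It estimates the Jaccard similarity of two documents from short signatures instead of comparing full shingle sets. The version below uses only the standard library and leaves out the LSH banding step that makes the comparison scale. The shingle size and the number of hash functions are assumed example values.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 5) -> Set[str]:
    """Character n-grams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + n] for i in range(max(1, len(norm) - n + 1))}


def _hash(item: str, seed: int) -> int:
    # A salted 64-bit hash; each seed plays the role of one permutation.
    salt = seed.to_bytes(4, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """Keep the minimum hash value of the set under each seeded hash."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions where the two minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature(shingles("Resmî Gazete'de yayımlanan yönetmelik yürürlüğe girdi."))
    b = minhash_signature(shingles("Resmi Gazete'de yayımlanan yönetmelik yürürlüğe girdi!"))
    print(estimated_jaccard(a, b))
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates; the threshold itself has to be tuned on the corpus.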

The third module analyzes mathematically the fundamental challenge of CPT: the catastrophic-forgetting problem. From the loss-landscape perspective, it covers in detail the drift from the pre-training minimum to the domain minimum, the identification of important parameters with the Fisher Information Matrix, and the plasticity-stability dilemma. Classical mitigation: replay buffer (mixing pre-training data into the domain data; practical recommendation 5-20% pre-training mix), EWC (Kirkpatrick 2017, Fisher-weighted L2 regularization), layer-wise learning rate, embedding freeze. Modern approaches: LoRA-based CPT (a small adapter mitigates catastrophic forgetting, but its capacity is limited), model souping / weight averaging (Wortsman 2022), Branch-Train-Merge (BTM, Li 2022), and domain-expert routing. Each is compared on its trade-offs with evidence.
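
The EWC term mentioned above adds a penalty of (λ/2) · Σ F_i · (θ_i − θ*_i)² to the domain loss, where θ* are the pre-CPT parameters and F_i is the diagonal Fisher information. The sketch below computes the penalty and a diagonal Fisher estimate on plain lists. Estimating the Fisher as the mean of squared per-sample gradients is the usual approximation, and all names are illustrative assumptions.

```python
from typing import List, Sequence


def diagonal_fisher(per_sample_grads: Sequence[Sequence[float]]) -> List[float]:
    """Mean of squared per-sample gradients, one value per parameter."""
    n_samples = len(per_sample_grads)
    n_params = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n_samples
            for i in range(n_params)]


def ewc_penalty(params: Sequence[float], anchor: Sequence[float],
                fisher: Sequence[float], lam: float) -> float:
    """(lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(f * (p - a) ** 2
                           for p, a, f in zip(params, anchor, fisher))


if __name__ == "__main__":
    fisher = diagonal_fisher([[1.0, 2.0], [3.0, 0.0]])
    # The first parameter moved away from its anchor, the second did not.
    print(ewc_penalty(params=[1.0, 2.0], anchor=[0.0, 2.0], fisher=fisher, lam=2.0))
```

Parameters with a large Fisher value are pulled back strongly toward their pre-training values, while unimportant ones stay free to adapt to the domain.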

The fourth module addresses vocabulary expansion and tokenizer adaptation techniques, which are especially critical for Turkish. It starts with a Turkish fertility analysis of the Llama 3, Qwen3, and Gemma 3 tokenizers (how many tokens a word splits into on average: 1.0-1.3 in English versus 1.8-2.5 in Turkish, which roughly doubles cost and latency). Mean initialization (new token embedding as the average of existing tokens), FOCUS (Dobler 2023, semantic-aware initialization), and the Aya Expanse 2024 approach (23-language multilingual expansion + frozen base) are covered in detail. Training Turkish + domain tokenizers with SentencePiece, merging and extending with the Hugging Face Tokenizers library, and the impact of a tokenizer change on embedding + lm_head are shown in practice. The vocabulary expansion vs no-expansion CPT trade-off is decided based on evidence.
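
Mean initialization can be sketched in a few lines. One common variant, assumed here, averages the embeddings of the pieces the old tokenizer would have produced for the new token. The toy vocabulary and the tokenizer callable are hypothetical, and FOCUS is not implemented in this sketch.

```python
from typing import Callable, Dict, List


def mean_init_embedding(new_token: str,
                        old_tokenize: Callable[[str], List[str]],
                        old_embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the old embeddings of the pieces that made up new_token."""
    pieces = old_tokenize(new_token)
    if not pieces:
        raise ValueError("old tokenizer produced no pieces")
    vectors = [old_embeddings[piece] for piece in pieces]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


if __name__ == "__main__":
    # Hypothetical two-dimensional embeddings for two old subword pieces.
    old_embeddings = {"söz": [1.0, 0.0], "leşme": [3.0, 2.0]}
    split = {"sözleşme": ["söz", "leşme"]}
    print(mean_init_embedding("sözleşme", lambda w: split[w], old_embeddings))
```

The same vector would typically be used to initialize the matching row of the output projection (lm_head), so that the new token starts with a sensible logit.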

The fifth module comparatively analyzes Turkey's four prominent open-source Turkish LLM projects from the CPT methodology perspective. Cosmos Llama 3.3 / 3.1 series (base, CPT data, SFT, instruct variants); Trendyol AI Llama 3 8B / 70B (Trendyol dataset + domain adaptation); KUIS-AI Turkish-Llama (Koç University contributions); Cohere Aya Expanse 8B / 32B (23-language multilingual CPT approach). For each, the base-model selection, CPT data strategy, vocabulary-expansion decision, training compute, and eval results are analyzed in detail. Comparison on Turkish MMLU, MMLU-Pro-tr, Belebele-tr, TruthfulQA-tr, Hellaswag-tr, ARC-tr benchmarks and Open LLM Leaderboard Turkish ranking analysis is performed. Boğaziçi, METU, İTÜ Turkish LLM research is also examined.

The sixth module provides CPT recipes for the four most-demanded domains in Turkey. Legal domain: Turkish case-law (Yargıtay, Danıştay, Constitutional Court decisions), legislation (laws, regulations), Official Gazette archive CPT pipeline; Harvey AI approach (legal exam + risk assessment); KVKK-compliant data collection. Healthcare domain: DSM-5-TR + medical guidelines + patient records (anonymized) CPT; Med-PaLM (Google 2023) and Med-PaLM 2 approach; HIPAA + KVKK biomedical compliance. Finance domain: adaptation of the BloombergGPT recipe (a 50B-parameter finance LLM); Turkish finance CPT with TCMB reports + KAP disclosures + BIST data + Turkish balance-sheet corpus. Code domain: comparison of Code Llama, DeepSeek-Coder V3, Qwen2.5-Coder recipes. For each domain, benchmark production (bar-exam simulation, USMLE-tr, FinanceBench-tr, HumanEval-tr, MBPP-tr, BigCodeBench-tr) and sector-regulation-compliant deployment discipline are provided.

The seventh module takes a deep look at the parameter-efficient and memory-efficient approaches that determine compute efficiency in production CPT. Full fine-tuning, LoRA (Hu 2021, low-rank decomposition W = W_0 + B·A), DoRA (Liu 2024, magnitude + direction separation), QLoRA (Dettmers 2023, 4-bit NF4 quantization + LoRA), ReFT (representation fine-tuning), and GaLore (Zhao 2024, memory-efficient full-parameter training via gradient low-rank projection) are compared with evidence. LoRA capacity limitations for CPT (at what rank LoRA is sufficient for knowledge injection, and in which scenarios full FT is mandatory) are taught with a practical cookbook. 30B+ model FT on a single H100 with DeepSpeed ZeRO-3 + offload, and CPT with FSDP2 (PyTorch 2.x) + activation checkpointing, are shown in practice.

The eighth module addresses techniques for extending a base model's context window via CPT. RoPE (Rotary Position Embeddings) is derived at the mathematical level (one rotation matrix per pair of dimensions); linear interpolation, NTK-aware scaling, Dynamic NTK, YaRN (Yet another RoPE extensioN, Peng 2023, with its attention scaling correction), Position Interpolation (Chen 2023), and LongRoPE (Microsoft 2024) are compared. The Llama 3.1 128K extension recipe (Meta 2024), the Gemini 2.5 Pro 1M-10M context production approach, and the Mistral interleaved sliding-window attention are analyzed with practical examples. Curriculum: 4K → 16K → 64K → 1M token progressive extension strategy; needle-in-a-haystack and multi-needle eval; NVIDIA RULER benchmark (retrieval + reasoning long-context); LongBench, InfiniteBench for real-world long-context eval are taught.
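
Two of the scaling schemes listed above differ only in how they change the RoPE frequencies. Position Interpolation divides every frequency by the scale factor. NTK-aware scaling instead raises the base to base · s^(d/(d−2)), which leaves the highest frequency untouched and slows the lowest one by exactly s. The sketch below assumes a head dimension of 128 and a scale of 8 as example values.

```python
from typing import List


def rope_inv_freq(dim: int, base: float = 10000.0) -> List[float]:
    """Inverse frequencies base^(-2i/dim), one per pair of dimensions."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def position_interpolation(inv_freq: List[float], scale: float) -> List[float]:
    # Equivalent to dividing every position index by `scale`.
    return [f / scale for f in inv_freq]


def ntk_aware_base(base: float, scale: float, dim: int) -> float:
    """Enlarged base used by NTK-aware scaling."""
    return base * scale ** (dim / (dim - 2))


if __name__ == "__main__":
    dim, scale = 128, 8.0  # assumed head dimension and extension factor
    original = rope_inv_freq(dim)
    pi = position_interpolation(original, scale)
    ntk = rope_inv_freq(dim, ntk_aware_base(10000.0, scale, dim))
    print("highest frequency:", original[0], pi[0], ntk[0])
    print("lowest frequency: ", original[-1], pi[-1], ntk[-1])
```

The printed pairs show the design difference: interpolation compresses all frequencies, including the high ones that encode local order, while the NTK-aware variant spreads the change unevenly across dimensions.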

The ninth module is dedicated to the discipline of how much data to use from which domain in CPT (domain mixing ratios) — a first-order determinant of final model quality. DoReMi (Xie 2023 — domain reweighting via worst-domain minimax optimization), RegMix (Liu 2024 — regression-based mix prediction with small-scale proxy), and DataMix approaches are covered at the mathematical level. In Turkish CPT, the Turkish vs English ratio decision matrix (recommended starting at 70/30 → cooldown at 50/50), the recipe for preventing catastrophic forgetting via a domain + general data mix, and the DeepSeek-Coder recipe in the code + math + general triangle are shown practically. Curriculum learning (easy → hard data ordering), and final-stage high-quality data injection in the Llama 3.1 and Qwen3 cooldown/annealing stages for MMLU boost strategies are addressed.
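
The core of DoReMi's reweighting is an exponentiated-gradient step on the domain weights. Domains where the small proxy model lags the reference model most (the largest excess loss) get more weight, and the result is smoothed toward the uniform distribution. The sketch below shows one such update on plain lists. The step size and smoothing constant are assumed values, and the losses in the demo are placeholders rather than measured numbers.

```python
import math
from typing import List, Sequence


def doremi_update(weights: Sequence[float], excess_losses: Sequence[float],
                  step_size: float = 1.0, smoothing: float = 1e-3) -> List[float]:
    """One domain-weight update: exponentiate, normalize, smooth.

    excess_losses[i] is proxy loss minus reference loss on domain i;
    negative values are clipped to zero before the update.
    """
    clipped = [max(x, 0.0) for x in excess_losses]
    unnormalized = [w * math.exp(step_size * x) for w, x in zip(weights, clipped)]
    total = sum(unnormalized)
    k = len(weights)
    return [(1.0 - smoothing) * u / total + smoothing / k for u in unnormalized]


if __name__ == "__main__":
    domains = ["turkish_web", "english_web", "code"]
    weights = [1 / 3, 1 / 3, 1 / 3]
    placeholder_excess = [0.30, 0.05, -0.10]  # illustrative values only
    for name, w in zip(domains, doremi_update(weights, placeholder_excess)):
        print(f"{name}: {w:.3f}")
```

In the full method this update runs throughout proxy training and the weights are averaged over steps; RegMix replaces the procedure with a regression fitted on many small runs.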

The tenth module is dedicated to the engineering side of CPT. Learning-rate selection (basic principle: 1/10 → 1/100 of pre-training LR); warmup steps; comparison of cosine decay vs constant LR vs WSD (Warmup-Stable-Decay) schedules; max LR, min LR tuning cookbook; batch-size scaling (global batch size 1M-4M tokens), gradient accumulation, mixed precision (bf16, fp8 — Blackwell B200/GB200); the DeepSpeed ZeRO-3 vs FSDP2 vs Megatron-LM distributed-setup decision matrix; mix of TP (tensor parallel) + PP (pipeline parallel) + DP; training-run monitoring (loss curves, gradient norm, weight stats); loss spikes and divergence-recovery strategies; checkpoint frequency, async checkpointing, and eval-on-checkpoint pipeline are covered in detail.
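
Of the schedules compared above, WSD is the easiest to write down: a linear warmup, a flat phase at the maximum LR, then a decay to the minimum LR. The sketch below uses a linear decay, which is one of several variants. The step counts and learning rates in the demo are assumed example values.

```python
def wsd_lr(step: int, max_lr: float, min_lr: float,
           warmup_steps: int, stable_steps: int, decay_steps: int) -> float:
    """Warmup-Stable-Decay learning rate for a zero-based step index."""
    if step < warmup_steps:
        # Linear warmup that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return max_lr
    if decay_steps <= 0:
        return min_lr
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return max_lr + (min_lr - max_lr) * progress


if __name__ == "__main__":
    for step in (0, 9, 50, 110, 135, 160):
        lr = wsd_lr(step, max_lr=1e-4, min_lr=1e-5,
                    warmup_steps=10, stable_steps=100, decay_steps=50)
        print(step, f"{lr:.2e}")
```

Because the stable phase has no fixed end, a WSD run can be extended with more data and decayed later, which is harder to do with a cosine schedule tied to a total step count.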

The eleventh module addresses the four-dimensional post-CPT evaluation discipline. (1) Domain gain: Turkish MMLU, MMLU-Pro-tr, Belebele-tr, ARC-tr; producing domain-specific benchmarks (Turkish bar-exam simulation, FinanceBench-tr, USMLE-tr); chat-ability eval with MT-Bench Turkish and AlpacaEval Turkish. (2) Catastrophic forgetting: regression tests on general MMLU, HellaSwag, ARC, TruthfulQA; regression on code benchmarks (HumanEval, MBPP). (3) Long-context regression: RULER, needle-in-a-haystack, LongBench. (4) Production eval: production comparison of base model vs CPT model via A/B testing, online eval with user feedback (thumbs up/down), business metrics (conversion, satisfaction, task completion rate). All reporting formats are tied to enterprise compliance discipline.
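
The first two dimensions reduce to comparing benchmark scores of the base and CPT checkpoints. The sketch below builds a small report of per-benchmark deltas and flags regressions beyond a tolerance. The benchmark names, the scores in the demo, and the one-point tolerance are placeholders chosen for illustration, not measured results.

```python
from typing import Dict


def regression_report(base_scores: Dict[str, float], cpt_scores: Dict[str, float],
                      tolerance: float = 1.0) -> Dict[str, Dict[str, object]]:
    """Per-benchmark delta (CPT minus base) and a regression flag."""
    report: Dict[str, Dict[str, object]] = {}
    for name, base in base_scores.items():
        delta = cpt_scores[name] - base
        report[name] = {"delta": delta, "regressed": delta < -tolerance}
    return report


if __name__ == "__main__":
    # Placeholder scores, for illustration only.
    base = {"general_mmlu": 65.0, "turkish_mmlu": 50.0}
    cpt = {"general_mmlu": 63.0, "turkish_mmlu": 58.0}
    for name, row in regression_report(base, cpt).items():
        print(name, row)
```

Running this on every saved checkpoint gives an early signal when domain gain starts to cost general capability.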

In the capstone module, each participant designs an end-to-end CPT pipeline tailored to their own scenario: scenario selection (Turkish LLM, legal, healthcare, finance, code, or the participant's own domain), base-model selection (Llama 3.3, Qwen3, Gemma 3, Mistral, DeepSeek base), Turkish and/or domain data collection (50B-200B tokens), vocabulary-expansion decision, mitigation strategy (replay ratio + LoRA / full FT / hybrid), training stack (TRL + Axolotl or OpenRLHF + DeepSpeed), compute budget (single H100, 8x H100, multi-node planning), eval framework (4 dimensions), and a 90-day production deployment roadmap (including post-CPT SFT + DPO + RAG integration). By the end of the training, participants have the technical competence to apply the CPT vs SFT vs RAG decision matrix at enterprise scale; build a FineWeb-style data pipeline for Turkish + domain; select catastrophic-forgetting mitigation recipes based on evidence; double Turkish efficiency via vocabulary expansion + tokenizer adaptation; perform 128K-1M long-context extension with YaRN; estimate optimal data mix with DoReMi/RegMix; make compute-optimal choices among LoRA/DoRA/QLoRA/GaLore; and build production-grade CPT pipelines at Cosmos / Trendyol AI / Aya Expanse / BloombergGPT level. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.

Course Curriculum

105 Lessons

Module 1: Strategic Introduction to the Continued Pretraining Discipline — CPT vs SFT vs RAG (10 Lessons)

Module 2: Data Engineering for Continued Pretraining — FineWeb, Turkish Corpora, and Quality Filtering (9 Lessons)

Module 3: The Catastrophic Forgetting Problem and Mitigation Strategies (9 Lessons)

Module 4: Vocabulary Expansion and Tokenizer Adaptation — Token Efficiency for Turkish (9 Lessons)

Module 5: Turkish LLM CPT Case Studies — Cosmos, Trendyol AI, KUIS-AI, Aya Expanse (9 Lessons)

Module 6: Domain-Specific CPT — Legal, Healthcare, Finance, and Code Domains (9 Lessons)

Module 7: Efficient CPT — Full FT, LoRA, DoRA, and QLoRA Comparison (9 Lessons)

Module 8: Long-Context Extension — RoPE Scaling, YaRN, and 1M-10M Token CPT (9 Lessons)

Module 9: Data Mixing Strategies — DoReMi, RegMix, and Optimal Domain Mix (9 Lessons)

Module 10: Training Engineering — LR Schedule, Warmup, Hyperparams, and Distributed Setup (9 Lessons)

Module 11: Post-CPT Evaluation — Domain Benchmark, Forgetting Tests, and Production Eval (9 Lessons)

Module 12: Capstone — Producing a Turkish or Domain-Specific LLM (5 Lessons)

Instructor

Şükrü Yusuf KAYA

AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant

Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.

Live & Interactive Sessions

Project-Based Learning

Industry-Focused Curriculum

Professional Networking

1-on-1 Mentorship
