# LLM Continued Pretraining and Domain Adaptation Engineering Training (Turkish LLM + Legal/Healthcare/Finance Domain)

> Source: https://sukruyusufkaya.com/en/training/llm-continued-pretraining-domain-adaptation-muhendisligi-egitimi
> Updated: 2026-05-19T15:06:05.978Z
> Level: advanced
> Topics: continued pretraining, cpt, domain adaptation, turkish llm, cosmos llama, trendyol ai, kuis-ai, aya expanse, catastrophic forgetting, vocabulary expansion, yarn rope scaling, long-context extension, doremi regmix, lora dora qlora, galore, fineweb dataset, data mixing, domain-specific llm, bloomberggpt, med-palm

**TLDR:** A 3-day advanced Turkish-language training that covers Continued Pretraining and Domain Adaptation end to end, for teams that want to train a Turkish LLM (Cosmos Llama, Trendyol AI, KUIS-AI, Aya Expanse) or build a custom LLM for the legal, healthcare, finance, or code domains. It includes catastrophic-forgetting mitigation, vocabulary expansion, YaRN long-context extension, DoReMi/RegMix data mixing, efficient CPT with LoRA/DoRA/QLoRA/GaLore, and domain-benchmark production.

## Description

The LLM Continued Pretraining and Domain Adaptation Engineering Training is a 3-day advanced program designed for ML Engineers, AI Researchers, Data Engineers, and ML Platform engineers who want to adapt a base LLM to Turkish or to legal/healthcare/finance/code domains.

## Learning Outcomes

- Apply the CPT vs SFT vs RAG decision matrix at enterprise scale.
- Build a FineWeb-style data pipeline for Turkish + domain.
- Select catastrophic-forgetting mitigation recipes based on evidence.
- Roughly double Turkish token efficiency via vocabulary expansion + tokenizer adaptation.
- Train a Turkish LLM at the Cosmos / Trendyol AI / KUIS-AI / Aya Expanse level.
- Build a CPT pipeline in the legal, healthcare, finance, or code domain.
- Make compute-optimal choices among LoRA, DoRA, QLoRA, GaLore.
- Perform 128K-1M long-context extension with YaRN.
- Estimate optimal data mix with DoReMi/RegMix.
- Build a 4-dimensional (domain gain + forgetting + long-context + production) post-CPT eval framework.

<p>This training is a 3-day advanced Continued Pretraining (CPT) program for ML Engineers, AI Researchers, Data Engineers, and ML Platform engineers who want to adapt open-source base LLMs (Llama 3.3, Qwen3, Gemma 3, Mistral) to the Turkish language or to domains such as legal, healthcare, finance, and code. In Turkey, the number of projects training Turkish LLMs, such as Cosmos, Trendyol AI, and KUIS-AI, is growing rapidly. Similarly, law firms need Harvey AI-style case-law reasoning, healthcare institutions need Med-PaLM-style medical expertise, and finance companies need BloombergGPT-style sectoral intelligence in custom LLMs. However, Turkish-language training that covers this discipline end to end (math, data pipeline, mitigation, and eval) is virtually nonexistent: existing content either stays at the level of academic-paper summaries or stops at copy-and-paste example scripts. This program is designed to fill that gap as Turkey's most comprehensive production-grade CPT reference training.</p>

<p>The strategic backbone of the program is the first module, which frames where Continued Pretraining sits in the pre-training → CPT → SFT → DPO/RLHF → deployment flow and how it differs from SFT, RLHF, and RAG. It provides an evidence-based decision matrix: knowledge injection (teaching new knowledge, learning a new language, static domain knowledge) → CPT is the best fit; behavior shaping (response style, formatting, instruction following) → SFT is sufficient; dynamic or frequently updated knowledge → RAG is mandatory; very high-volume, static domain knowledge → CPT + RAG hybrid. Production case studies are analyzed from a strategic perspective: BloombergGPT (a 50B-parameter finance model), Med-PaLM (healthcare), Code Llama (code), Cosmos Llama / Trendyol AI / KUIS-AI / Aya Expanse (Turkish), and DeepSeek-Math / Qwen3-Math / Llemma (math).</p>

<p>The second module is dedicated to data engineering, which the program treats as the largest single factor in CPT success (roughly 70%). It covers the HuggingFace FineWeb (15T tokens) and FineWeb-Edu methodology; Common Crawl WARC processing and HTML cleaning with Trafilatura; a comparison of the Cosmopedia, RefinedWeb, RedPajama, and DOLMA datasets; Turkish sources (Turkish FineWeb, mC4-tr, OSCAR-tr, Wikipedia-tr, and Boğaziçi/İTÜ/KUIS open corpora); deduplication strategies (exact hash, MinHash LSH fuzzy dedup, embedding-based semantic dedup); quality filtering (Gopher rules + Cosmopedia + fastText classifier); KVKK-compliant PII detection (Turkish TC kimlik, IBAN, and phone-number detection); and toxicity and contamination detection. Every stage is hands-on. The module also provides a practical recipe for producing 100B-500B tokens from Turkish raw data.</p>
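
<p>As a minimal sketch of the TC kimlik part of the PII step, the snippet below masks 11-digit candidates that pass the national ID checksum. The function names and the <code>[TC_KIMLIK]</code> placeholder are illustrative assumptions, not part of the course material, and a production filter would also need IBAN and phone-number rules.</p>

```python
import re

# Candidate: exactly 11 digits, not starting with 0, not embedded in a longer digit run.
TC_CANDIDATE = re.compile(r"(?<!\d)[1-9]\d{10}(?!\d)")


def is_valid_tc_kimlik(candidate: str) -> bool:
    """Return True if the string passes the TC Kimlik No checksum rules."""
    if not re.fullmatch(r"[1-9]\d{10}", candidate):
        return False
    d = [int(ch) for ch in candidate]
    odd_sum = sum(d[0:9:2])    # 1st, 3rd, 5th, 7th, 9th digits
    even_sum = sum(d[1:8:2])   # 2nd, 4th, 6th, 8th digits
    tenth = (odd_sum * 7 - even_sum) % 10
    eleventh = sum(d[:10]) % 10
    return d[9] == tenth and d[10] == eleventh


def mask_tc_kimlik(text: str, placeholder: str = "[TC_KIMLIK]") -> str:
    """Replace checksum-valid IDs with a placeholder; leave other numbers untouched."""
    def _substitute(match: re.Match) -> str:
        return placeholder if is_valid_tc_kimlik(match.group()) else match.group()

    return TC_CANDIDATE.sub(_substitute, text)


if __name__ == "__main__":
    # 10000000146 is a synthetic number that satisfies the checksum.
    print(mask_tc_kimlik("Kimlik no: 10000000146, siparis no: 12345678901"))
```

<p>Checking the checksum before masking keeps order numbers and other 11-digit strings in the corpus, which reduces the amount of useful text lost to over-eager redaction.</p>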

<p>The third module analyzes mathematically the fundamental challenge of CPT: catastrophic forgetting. From the loss-landscape perspective, it covers in detail the drift from the pre-training minimum to the domain minimum, the identification of important parameters with the Fisher Information Matrix, and the plasticity-stability dilemma. Classical mitigation: replay buffer (mixing pre-training data into the domain data; the practical recommendation is a 5-20% pre-training share), EWC (Kirkpatrick 2017, Fisher-weighted L2 regularization), layer-wise learning rates, and embedding freezing. Modern approaches: LoRA-based CPT (a small adapter limits catastrophic forgetting, but its capacity is limited), model souping / weight averaging (Wortsman 2022), Branch-Train-Merge (BTM, Li 2022), and domain-expert routing. The trade-offs of each approach are compared on the basis of evidence.</p>
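
<p>The EWC idea can be sketched in a few lines of plain Python, assuming flat parameter lists and a diagonal Fisher approximation estimated from per-example gradients. The names <code>fisher_diagonal</code>, <code>ewc_penalty</code>, and <code>lam</code> are illustrative; a real run would apply the same formula to framework tensors.</p>

```python
def fisher_diagonal(per_example_grads):
    """Diagonal Fisher estimate: mean of squared per-example gradients."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[k] ** 2 for g in per_example_grads) / n for k in range(dim)]


def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def total_loss(domain_loss, params, anchor_params, fisher_diag, lam):
    """Domain loss plus the Fisher-weighted pull toward the pre-trained weights."""
    return domain_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)


if __name__ == "__main__":
    fisher = fisher_diagonal([[1.0, 2.0], [3.0, 0.0]])
    print(fisher)
    print(total_loss(1.5, [1.0, 2.0], [0.0, 2.0], fisher, lam=2.0))
```

<p>Parameters with a large Fisher value are expensive to move, so the domain update is pushed toward directions that the pre-trained model considered unimportant.</p>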

<p>The fourth module addresses vocabulary expansion and tokenizer adaptation, which are especially critical for Turkish. It begins with a Turkish fertility analysis of the Llama 3, Qwen3, and Gemma 3 tokenizers, where fertility is the average number of tokens a word splits into: 1.0-1.3 in English versus 1.8-2.5 in Turkish, which roughly doubles cost and latency. Mean initialization (a new token's embedding is the average of existing token embeddings), FOCUS (Dobler 2023, semantic-aware initialization), and the Aya Expanse 2024 approach (23-language multilingual expansion + frozen base) are covered in detail. Training Turkish and domain tokenizers with SentencePiece, merging and extending them with the Hugging Face Tokenizers library, and the impact of a tokenizer change on the embedding and lm_head layers are shown in practice. The choice between CPT with and without vocabulary expansion is made on the basis of evidence.</p>
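
<p>Fertility and mean initialization are both simple to express in code. The sketch below assumes a tokenizer exposed as a plain callable and an embedding table stored as a list of rows; <code>toy_tokenize</code> is a hypothetical stand-in that cuts words into 3-character pieces, so its numbers say nothing about any real tokenizer.</p>

```python
def fertility(words, tokenize):
    """Average number of tokens per word for a given tokenizer callable."""
    return sum(len(tokenize(w)) for w in words) / len(words)


def mean_init(subtoken_ids, embedding_table):
    """Embedding for a new token: average of the rows of the tokens it replaces."""
    rows = [embedding_table[i] for i in subtoken_ids]
    dim = len(rows[0])
    return [sum(row[k] for row in rows) / len(rows) for k in range(dim)]


def toy_tokenize(word, piece_len=3):
    """Hypothetical tokenizer: fixed-length character pieces (illustration only)."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


if __name__ == "__main__":
    print(fertility(["evlerimizden", "kitap"], toy_tokenize))
    old_embeddings = [[1.0, 2.0], [9.0, 9.0], [3.0, 4.0]]
    # A new token that previously split into token ids 0 and 2.
    print(mean_init([0, 2], old_embeddings))
```

<p>With a real tokenizer, the same <code>fertility</code> function can be run before and after vocabulary expansion on a held-out Turkish word list to quantify the gain.</p>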

<p>The fifth module compares Turkey's four prominent open-source Turkish LLM projects from the perspective of CPT methodology: the Cosmos Llama 3.3 / 3.1 series (base, CPT data, SFT, instruct variants); Trendyol AI Llama 3 8B / 70B (Trendyol dataset + domain adaptation); KUIS-AI Turkish-Llama (Koç University contributions); and Cohere Aya Expanse 8B / 32B (23-language multilingual CPT approach). For each, the base-model selection, CPT data strategy, vocabulary-expansion decision, training compute, and eval results are analyzed in detail. The models are compared on the Turkish MMLU, MMLU-Pro-tr, Belebele-tr, TruthfulQA-tr, HellaSwag-tr, and ARC-tr benchmarks, and their Open LLM Leaderboard Turkish rankings are analyzed. Turkish LLM research from Boğaziçi, METU, and İTÜ is also examined.</p>

<p>The sixth module provides CPT recipes for the four most in-demand domains in Turkey. Legal domain: a CPT pipeline over Turkish case law (Yargıtay, Danıştay, and Constitutional Court decisions), legislation (laws, regulations), and the Official Gazette archive; the Harvey AI approach (legal exam + risk assessment); KVKK-compliant data collection. Healthcare domain: CPT on DSM-5-TR, medical guidelines, and anonymized patient records; the Med-PaLM (Google 2023) and Med-PaLM 2 approach; HIPAA + KVKK biomedical compliance. Finance domain: replication of the BloombergGPT recipe (a 50B-parameter finance model); Turkish finance CPT with TCMB reports, KAP disclosures, BIST data, and a Turkish balance-sheet corpus. Code domain: a comparison of the Code Llama, DeepSeek-Coder V3, and Qwen2.5-Coder recipes. For each domain, the module covers benchmark production (bar-exam simulation, USMLE-tr, FinanceBench-tr, HumanEval-tr, MBPP-tr, BigCodeBench-tr) and deployment practices that comply with sector regulation.</p>

<p>The seventh module covers in depth the parameter-efficient and memory-efficient approaches that determine compute efficiency in production CPT. Full fine-tuning, LoRA (Hu 2021, low-rank decomposition W = W_0 + B·A), DoRA (Liu 2024, magnitude + direction separation), QLoRA (Dettmers 2023, 4-bit NF4 quantization + LoRA), ReFT (representation fine-tuning), and GaLore (Zhao 2024, memory-efficient full-parameter training via gradient low-rank projection) are compared on the basis of evidence. LoRA's capacity limits for CPT (at what rank LoRA is sufficient for knowledge injection, and in which scenarios full FT is mandatory) are taught with a practical cookbook. Fine-tuning a 30B+ model on a single H100 with DeepSpeed ZeRO-3 + offload, and CPT with FSDP2 (PyTorch 2.x) + activation checkpointing, are shown in practice.</p>
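
<p>The LoRA formulation W = W_0 + B·A can be illustrated with nested lists, without any deep-learning framework. This is a sketch under stated assumptions: <code>lora_merge</code> and <code>lora_param_count</code> are hypothetical helper names, and the optional <code>alpha</code> argument applies the alpha / r scaling from the LoRA paper (omitting it reproduces the unscaled formula above).</p>

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w0, b, a, alpha=None):
    """Merge a LoRA update into the frozen weight: W = W_0 + scale * (B @ A)."""
    r = len(a)  # rank = number of rows of A = number of columns of B
    scale = 1.0 if alpha is None else alpha / r
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)


if __name__ == "__main__":
    w0 = [[1.0, 0.0], [0.0, 1.0]]
    b = [[1.0], [2.0]]   # d_out x r with r = 1
    a = [[3.0, 4.0]]     # r x d_in
    print(lora_merge(w0, b, a))
    full = 4096 * 4096
    print(full // lora_param_count(4096, 4096, 16))  # reduction factor for one layer
```

<p>The parameter count makes the capacity trade-off concrete: at rank 16 a 4096 x 4096 projection trains 128 times fewer parameters than full fine-tuning, which is the budget that knowledge injection has to fit into.</p>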

<p>The eighth module addresses techniques for extending a base model's context window via CPT. RoPE (Rotary Position Embeddings) is built up at the mathematical level (a rotation matrix per pair of dimensions); linear Position Interpolation (Chen 2023), NTK-aware scaling, Dynamic NTK, YaRN (Yet another RoPE extensioN, Peng 2023, which adds an attention scaling correction), and LongRoPE (Microsoft 2024) are compared. The Llama 3.1 128K extension recipe (Meta 2024), the Gemini 2.5 Pro 1M-10M context production approach, and Mistral's interleaved sliding-window attention are analyzed with practical examples. Curriculum: a progressive 4K → 16K → 64K → 1M token extension strategy; needle-in-a-haystack and multi-needle eval; the NVIDIA RULER benchmark (long-context retrieval + reasoning); and LongBench and InfiniteBench for real-world long-context eval.</p>
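
<p>The difference between Position Interpolation and NTK-aware scaling shows up directly in the RoPE frequencies. The sketch below assumes the standard RoPE base of 10000 and a hypothetical head dimension of 128; it covers only these two methods, not YaRN or LongRoPE, and the function names are illustrative.</p>

```python
def rope_inv_freq(head_dim, base=10000.0):
    """Inverse frequencies, one per pair of dimensions: base^(-2i / d)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rope_angles(position, inv_freq, scale=1.0):
    """Rotation angles at a position. scale > 1 is Position Interpolation:
    positions are compressed back into the trained range."""
    return [(position / scale) * f for f in inv_freq]


def ntk_aware_base(base, scale, head_dim):
    """NTK-aware scaling: enlarge the base instead of compressing positions."""
    return base * scale ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    d, scale = 128, 4.0
    original = rope_inv_freq(d)
    ntk = rope_inv_freq(d, ntk_aware_base(10000.0, scale, d))
    # Highest frequency is untouched, lowest frequency is divided by the scale.
    print(original[0], ntk[0])
    print(original[-1] / ntk[-1])
    # Position Interpolation maps position 8192 onto the angles of position 4096.
    print(rope_angles(8192, original, scale=2.0) == rope_angles(4096, original))
```

<p>Position Interpolation slows every frequency by the same factor, while the NTK-aware variant leaves high-frequency (local) dimensions intact and stretches only the low-frequency ones, which is the behavior that YaRN refines further.</p>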

<p>The ninth module is dedicated to domain mixing ratios, that is, how much data to use from which domain in CPT, a first-order determinant of final model quality. DoReMi (Xie 2023, domain reweighting via worst-domain minimax optimization), RegMix (Liu 2024, regression-based mix prediction with small-scale proxy models), and DataMix approaches are covered at the mathematical level. For Turkish CPT, the module shows in practice the Turkish vs English ratio decision matrix (the recommendation is to start at 70/30 and move to 50/50 in the cooldown), the recipe for preventing catastrophic forgetting via a domain + general data mix, and the DeepSeek-Coder recipe in the code + math + general triangle. It also addresses curriculum learning (easy → hard data ordering) and the injection of high-quality data in the final cooldown/annealing stages, as in Llama 3.1 and Qwen3, to boost MMLU.</p>
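
<p>A phase-based mix such as 70/30 followed by a 50/50 cooldown can be written down as a small schedule and turned into per-source token budgets. In this sketch the 90B/10B split between the main phase and the cooldown is an assumed example, and <code>PHASES</code>, <code>tokens_per_source</code>, and <code>sample_source</code> are hypothetical names. It does not implement DoReMi or RegMix, which learn the weights rather than fixing them by hand.</p>

```python
import random

# (tokens in phase, sampling weights per source). The phase sizes are an assumption.
PHASES = [
    (90_000_000_000, {"tr": 0.7, "en": 0.3}),   # main phase
    (10_000_000_000, {"tr": 0.5, "en": 0.5}),   # cooldown
]


def tokens_per_source(phases):
    """Expected number of tokens drawn from each source over the whole run."""
    totals = {}
    for num_tokens, weights in phases:
        if abs(sum(weights.values()) - 1.0) > 1e-9:
            raise ValueError("mixture weights must sum to 1")
        for source, share in weights.items():
            totals[source] = totals.get(source, 0.0) + num_tokens * share
    return totals


def sample_source(weights, rng):
    """Pick the source of the next training document according to the mix."""
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=1)[0]


if __name__ == "__main__":
    print(tokens_per_source(PHASES))
    rng = random.Random(0)
    print([sample_source(PHASES[0][1], rng) for _ in range(5)])
```

<p>Computing the budgets up front shows whether the deduplicated Turkish corpus is large enough for the planned ratio, or whether some data would have to be repeated across epochs.</p>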

<p>The tenth module is dedicated to the engineering side of CPT. It covers in detail learning-rate selection (basic principle: 1/10 to 1/100 of the pre-training LR); warmup steps; a comparison of cosine decay, constant LR, and WSD (Warmup-Stable-Decay) schedules; a max LR and min LR tuning cookbook; batch-size scaling (global batch size of 1M-4M tokens), gradient accumulation, and mixed precision (bf16, and fp8 on Blackwell B200/GB200); the DeepSpeed ZeRO-3 vs FSDP2 vs Megatron-LM distributed-setup decision matrix; mixing TP (tensor parallel), PP (pipeline parallel), and DP (data parallel); training-run monitoring (loss curves, gradient norm, weight stats); loss spikes and divergence-recovery strategies; and checkpoint frequency, async checkpointing, and the eval-on-checkpoint pipeline.</p>
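
<p>The cosine and WSD schedules differ mainly in when the decay happens, which a short sketch makes visible. The values below assume a hypothetical pre-training peak LR of 3e-4, so the CPT peak is set to 3e-5 (1/10); the linear shape of the WSD decay and the function names are also assumptions, since other decay shapes are in use.</p>

```python
import math


def cosine_lr(step, total_steps, max_lr, min_lr, warmup_steps):
    """Linear warmup, then cosine decay from max_lr to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def wsd_lr(step, total_steps, max_lr, min_lr, warmup_steps, decay_steps):
    """Warmup-Stable-Decay: warmup, constant plateau, then linear decay at the end."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return max_lr
    progress = (step - decay_start) / decay_steps
    return max_lr + (min_lr - max_lr) * progress


if __name__ == "__main__":
    for step in (0, 100, 500, 900, 1000):
        print(step,
              cosine_lr(step, 1000, 3e-5, 3e-6, 100),
              wsd_lr(step, 1000, 3e-5, 3e-6, 100, 200))
```

<p>Because WSD holds the peak LR until the final decay window, a run can be extended or branched from any plateau checkpoint, whereas a cosine schedule is tied to the total step count chosen at the start.</p>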

<p>The eleventh module addresses the four-dimensional post-CPT evaluation discipline. (1) Domain gain: Turkish MMLU, MMLU-Pro-tr, Belebele-tr, ARC-tr; producing domain-specific benchmarks (Turkish bar-exam simulation, FinanceBench-tr, USMLE-tr); chat-ability eval with MT-Bench Turkish and AlpacaEval Turkish. (2) Catastrophic forgetting: regression tests on general MMLU, HellaSwag, ARC, TruthfulQA; regression on code benchmarks (HumanEval, MBPP). (3) Long-context regression: RULER, needle-in-a-haystack, LongBench. (4) Production eval: production comparison of base model vs CPT model via A/B testing, online eval with user feedback (thumbs up/down), business metrics (conversion, satisfaction, task completion rate). All reporting formats are tied to enterprise compliance discipline.</p>
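
<p>The forgetting dimension of the evaluation reduces to a regression gate between base and CPT scores. The following is a minimal sketch with hypothetical names (<code>forgetting_report</code>, <code>failed_benchmarks</code>) and an assumed tolerance of one accuracy point; the scores in the usage example are placeholders, not measured results.</p>

```python
def forgetting_report(base_scores, cpt_scores, max_drop=0.01):
    """Per-benchmark delta (CPT minus base) and whether it stays within tolerance."""
    report = {}
    for benchmark, base_score in base_scores.items():
        delta = cpt_scores[benchmark] - base_score
        report[benchmark] = {"delta": delta, "ok": delta >= -max_drop}
    return report


def failed_benchmarks(report):
    """Benchmarks whose regression exceeds the allowed drop, sorted by name."""
    return sorted(name for name, row in report.items() if not row["ok"])


if __name__ == "__main__":
    # Placeholder accuracies for illustration only.
    base = {"mmlu": 0.65, "hellaswag": 0.705}
    cpt = {"mmlu": 0.62, "hellaswag": 0.700}
    report = forgetting_report(base, cpt)
    print(failed_benchmarks(report))
```

<p>Running the same gate on every checkpoint turns forgetting from a post-hoc finding into a signal that can stop or adjust a run early.</p>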

<p>In the capstone module, each participant designs an end-to-end CPT pipeline tailored to their own scenario: scenario selection (Turkish LLM, legal, healthcare, finance, code, or the participant's own domain), base-model selection (Llama 3.3, Qwen3, Gemma 3, Mistral, DeepSeek base), Turkish and/or domain data collection (50B-200B tokens), the vocabulary-expansion decision, mitigation strategy (replay ratio + LoRA / full FT / hybrid), training stack (TRL + Axolotl or OpenRLHF + DeepSpeed), compute budget (single H100, 8x H100, multi-node planning), eval framework (4 dimensions), and a 90-day production deployment roadmap (including post-CPT SFT + DPO + RAG integration). By the end of the training, participants have the technical competence to apply the CPT vs SFT vs RAG decision matrix at enterprise scale; build a FineWeb-style data pipeline for Turkish + domain; select catastrophic-forgetting mitigation recipes based on evidence; roughly double Turkish token efficiency via vocabulary expansion + tokenizer adaptation; perform 128K-1M long-context extension with YaRN; estimate the optimal data mix with DoReMi/RegMix; make compute-optimal choices among LoRA/DoRA/QLoRA/GaLore; and build production-grade CPT pipelines at the Cosmos / Trendyol AI / Aya Expanse / BloombergGPT level. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.</p>