About this training
A 3-day advanced Turkish LLM alignment training that covers the RLHF (PPO), DPO, KTO, IPO, SimPO, ORPO, and DeepSeek R1 GRPO algorithms at both the math and code level, and teaches reward modeling, Constitutional AI, RLAIF, reasoning-model alignment, and the TRL/Axolotl/LLaMA-Factory/OpenRLHF/verl toolchain at production grade.
This training is designed for:
- ML Engineers who want to align enterprise LLM products with human preferences and safety constraints
- AI Researchers who want to train reasoning models in the DeepSeek R1, OpenAI o3/o4 paradigm
- Senior backend developers who want to reduce sycophancy, jailbreaks, and hallucination in production agent / chat / RAG products
- Startup technical leaders who want to align their own open-source LLM (Turkish or domain-specific)
- ML Platform and MLOps engineers who want to move the RLHF discipline from academic to production grade
- Enterprise AI / governance leaders who need to build a KVKK + EU AI Act-compliant LLM-alignment pipeline
Why this course matters:
- The only advanced program in Turkey that addresses RLHF, DPO, and GRPO end to end with math + code + production.
- Teaches the DeepSeek R1 GRPO and reasoning-model paradigm as of 2026 in its current form.
- Provides an evidence-based comparative analysis of the DPO, KTO, IPO, SimPO, ORPO, cDPO family.
- Imparts an Anthropic-Claude-style human-label-independent alignment discipline with Constitutional AI and RLAIF.
- Provides a scale-based right-choice matrix among five frameworks: TRL, Axolotl, LLaMA-Factory, OpenRLHF, verl.
- Teaches a production evaluation discipline with RewardBench, AlpacaEval 2.0 LC, MT-Bench, and Arena Hard.
- Ties production failure modes such as reward hacking, length collapse, and KL drift to detection and mitigation.
- Establishes a compliance audit framework with the EU AI Act, NIST AI RMF, ISO/IEC 42001, and KVKK.
Learning outcomes by the end of the programme:
- Train a production-grade reward model starting from the Bradley-Terry preference model.
- Manage the PPO clipping objective and KL-penalty tuning with confidence.
- Understand the mathematical derivation of DPO and select the β temperature on the basis of evidence.
- Choose correctly among the KTO, IPO, SimPO, ORPO, and cDPO variants.
- Build an R1-scale reasoning-model alignment pipeline with GRPO.
- Produce AI-labeled preference datasets with Constitutional AI and RLAIF.
- Select the right tool among TRL, Axolotl, LLaMA-Factory, OpenRLHF, and verl based on scale.
- Validate alignment quality with RewardBench, AlpacaEval 2.0 LC, and MT-Bench.
- Detect and prevent reward hacking, length collapse, sycophancy, and KL drift.
- Produce an EU AI Act + KVKK compliance audit report and tie alignment processes to compliance discipline.
Prerequisites and recommended background:
- Active Python experience (intermediate to advanced), basic use of PyTorch and HuggingFace Transformers
- Basic experience with LLM fine-tuning (at least conceptual familiarity with SFT, LoRA / QLoRA)
- Foundational ML math such as linear algebra, probability, and gradient descent
- Basic familiarity with reinforcement-learning concepts (advantage, policy, reward; depth is built in the training)
- GPU access before the training (RunPod, Lambda Labs, Modal, or your own setup); H100/A100 recommended
- HuggingFace + Weights & Biases account before the training
LLM Alignment Engineering with RLHF, DPO, and GRPO Training
About This Course
This training teaches, end to end, the mathematical foundations, practical implementation, and production deployment of the modern algorithms that align large language models (LLMs) with human preferences and safety constraints. LLM alignment is one of the central topics of modern AI engineering: the discipline began in 2022 with OpenAI's InstructGPT and Anthropic's Constitutional AI, went through a major paradigm shift in 2023 with DPO, was enriched in 2024 with the IPO, KTO, SimPO, and ORPO variants, and entered the reasoning-model era in 2025 when DeepSeek R1 brought the GRPO algorithm to prominence. Comprehensive training that unites math, code, and production for this discipline is virtually nonexistent in Turkey; existing content is either purely theoretical or limited to copying examples. This program is designed to fill that gap as Turkey's most comprehensive production-grade LLM-alignment reference training.
The theoretical backbone of the training rests on three mathematical pillars. The first is the Bradley-Terry preference model that underpins classic RLHF, together with the reward-model (RM) training built on it: the sigmoid pairwise loss, the K-wise Plackett-Luce generalization, classification-head versus generative reward models (Nemotron-4 Reward, Skywork-Critic), and the Tülu 3 PRM (process reward) implementation are covered in detail. The second is the adaptation of the PPO algorithm to LLMs: the derivation from policy gradient to importance sampling to the clipped surrogate objective, GAE (Generalized Advantage Estimation), control of the distance from the reference model via the KL penalty, a fixed β versus an adaptive KL controller, high-throughput response sampling with vLLM, and memory-efficient computation of reference-model log-probs. The third is the closed-form derivation of DPO, a formulation that eliminates the RL loop entirely by reparameterizing the RLHF reward-maximization problem with the implicit reward r(x,y) = β log π_θ(y|x)/π_ref(y|x), defined up to a term that depends only on the prompt x. The training performs this derivation step by step, clarifies how the β temperature relates to the KL constraint, and provides a cookbook for selecting the β value.
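The three objectives named above can be written down in a few lines. The following is a minimal sketch in plain Python, operating on scalar rewards and sequence-level log-probabilities rather than tensors; the function names and the default values `eps=0.2` and `beta=0.1` are illustrative assumptions, not values prescribed by the training.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Sigmoid pairwise reward-model loss: -log sigmoid(r_w - r_l)."""
    return -log_sigmoid(r_chosen - r_rejected)


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate objective for one sample (to be maximized).

    ratio is the importance weight pi_new / pi_old; eps is an assumed default.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """DPO loss built from the implicit reward beta * log(pi_theta / pi_ref)."""
    implicit_reward_w = beta * (policy_logp_w - ref_logp_w)
    implicit_reward_l = beta * (policy_logp_l - ref_logp_l)
    # Same Bradley-Terry form as the reward-model loss, with implicit rewards.
    return bradley_terry_loss(implicit_reward_w, implicit_reward_l)
```

Writing `dpo_loss` in terms of `bradley_terry_loss` makes the link between the first and third pillars explicit: DPO is the pairwise reward-model loss applied to the implicit reward, so no separate reward model or RL loop is needed.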
The modern preference-optimization family published after DPO in the 2024-2026 period is covered comparatively: IPO (Azar 2024), an identity-preference loss designed to counter overfitting; cDPO (Conservative DPO), a variant robust to noisy preference labels; KTO (Kahneman-Tversky Optimization), a prospect-theory-based method that works with binary feedback (thumbs up/down) instead of pairwise data; SimPO, a length-normalized loss that does not require a reference model; ORPO, a reference-model-free approach that combines SFT and preference optimization in a single stage; and DPO-SDP (self-discovery preference). Each algorithm's mathematical formulation is derived step by step, the data type it works with (pairwise, binary, scalar) is explained, and a practical implementation is built with TRL DPOTrainer, Axolotl, and OpenRLHF. This structured comparison lets your team choose the right technique for its scenario on the basis of evidence.
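To make the differences between three of these variants concrete, here is a minimal sketch of the per-pair IPO, cDPO, and SimPO losses on scalar sequence log-probabilities, following the formulations in their respective papers. The function names and the default hyperparameters (`tau`, `beta`, `label_noise`, `gamma`) are illustrative assumptions; KTO and ORPO are omitted to keep the sketch short.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_ratio_margin(policy_logp_w: float, policy_logp_l: float,
                     ref_logp_w: float, ref_logp_l: float) -> float:
    """h = log(pi/pi_ref)(chosen) - log(pi/pi_ref)(rejected)."""
    return (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)


def ipo_loss(h: float, tau: float = 0.1) -> float:
    """IPO: squared distance of the margin h from the target 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


def cdpo_loss(h: float, beta: float = 0.1, label_noise: float = 0.1) -> float:
    """cDPO: DPO with label smoothing; label_noise is the assumed flip rate."""
    return (-(1.0 - label_noise) * log_sigmoid(beta * h)
            - label_noise * log_sigmoid(-beta * h))


def simpo_loss(policy_logp_w: float, policy_logp_l: float,
               len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: length-normalized log-probs, target margin gamma, no reference model."""
    margin = beta * (policy_logp_w / len_w - policy_logp_l / len_l) - gamma
    return -log_sigmoid(margin)
```

Note which inputs each function needs: IPO and cDPO consume the reference-model log-ratios through `h`, whereas SimPO takes only policy log-probabilities and response lengths, which is why it removes the reference-model forward pass.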
The most up-to-date section of the program is dedicated to GRPO (Group Relative Policy Optimization), the algorithm introduced in DeepSeekMath (2024) and brought to prominence by DeepSeek R1 in 2025. GRPO eliminates the value (critic) model required in PPO and computes advantages by normalizing rewards within a group of responses sampled for the same prompt: A_i = (r_i - mean(r)) / std(r). This approach offers roughly half the cost of PPO in both memory and compute and improves stability. In the training, GRPO is derived mathematically; the R1-Zero pipeline (reasoning emergence with pure RL, without cold-start SFT) and the R1 pipeline (SFT cold-start → reasoning RL → general RL) are analyzed separately; and rule-based reward design (math accuracy, code execution, format compliance) is covered in detail. For production implementation, the ByteDance verl framework (highest-scale GRPO), OpenRLHF (Ray + DeepSpeed multi-node), and TRL GRPOTrainer (single-node prototype) are compared, and the hybrid-engine architecture that pairs vLLM rollout with FSDP training is shown with practical examples.
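The group-relative advantage formula above can be sketched in a few lines of plain Python, together with a toy rule-based reward. The `<answer>...</answer>` tag convention, the 0.1 format bonus, the use of the population standard deviation, and the `eps` guard against a zero-variance group are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev
from typing import List

# Hypothetical output convention: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(completion: str, expected_answer: str) -> float:
    """Toy reward: 0.1 for format compliance plus 1.0 for an exact-match answer."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # no parsable answer: neither format nor accuracy reward
    accuracy = 1.0 if match.group(1).strip() == expected_answer.strip() else 0.0
    return 0.1 + accuracy


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """A_i = (r_i - mean(r)) / (std(r) + eps) over one prompt's sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, a group of four sampled completions.
group = [
    "<answer>4</answer>",
    "<answer>5</answer>",
    "The result is 4",
    "<answer> 4 </answer>",
]
rewards = [rule_based_reward(c, "4") for c in group]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no critic network is needed; a group in which every completion earns the same reward yields all-zero advantages and therefore contributes no gradient signal.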
Constitutional AI, pioneered by Anthropic in 2022, and its generalized form, RLAIF (Reinforcement Learning from AI Feedback), are addressed in a separate module. The SL-CAI stage (critique → revision → training on the revised responses) and the RL-CAI stage (a reward model plus PPO/DPO on AI-labeled preference data) are covered in detail; the principle-set structure Anthropic uses in the Claude 4.x family is examined; and the module shows how to build a hybrid RLAIF pipeline that uses Claude Opus 4.7, GPT-5, or Gemini 2.5 Pro as a strong-model judge. Participants design principle sets that are written in Turkish and comply with KVKK and Turkish law. This matters for the Turkish market because the principle sets of existing AI assistants are predominantly calibrated to English and to Western legal systems.
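The two stages described above can be sketched as plain functions around a model call. This is only an outline under stated assumptions: `generate` and `judge` are hypothetical callables that wrap whatever LLM client is in use, and the prompt wording and the "A"/"B" verdict format are invented for illustration.

```python
from typing import Callable, Dict, List


def critique_and_revise(prompt: str, draft: str, principles: List[str],
                        generate: Callable[[str], str]) -> Dict[str, str]:
    """SL-CAI style loop: critique the response against each principle, then revise."""
    revised = draft
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle.\n"
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {revised}"
        )
        revised = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nPrompt: {prompt}\nResponse: {revised}"
        )
    # The (prompt, revised response) pair becomes supervised training data.
    return {"prompt": prompt, "response": revised}


def ai_preference_pair(prompt: str, response_a: str, response_b: str,
                       principle: str, judge: Callable[[str], str]) -> Dict[str, str]:
    """RL-CAI / RLAIF style labeling: an AI judge picks the better response."""
    verdict = judge(
        f"Which response better follows the principle? Answer A or B.\n"
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"A: {response_a}\nB: {response_b}"
    ).strip().upper()
    if verdict.startswith("A"):
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

The dictionary returned by `ai_preference_pair` uses `prompt` / `chosen` / `rejected` keys, a layout commonly used for pairwise preference datasets, so the AI-labeled pairs can feed either reward-model training or a DPO-style trainer.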
The alignment discipline of the 2025-2026 reasoning-model era is addressed in a separate module. Alignment of reasoning models such as OpenAI o3/o4, DeepSeek R1, Gemini 2.5 Deep Think, and Claude Extended Thinking can use a process reward model (PRM) instead of, or alongside, an outcome reward (rule-based, exact match): the quality of each intermediate step is scored, not only the correctness of the final answer. The AllenAI Tülu 3 (2025) PRM approach, the OpenThoughts dataset, the OLMo 2 reasoning pipeline, and the mixed-mode (thinking on/off) approach of Qwen3 are covered in detail. The Snell scaling laws are used to analyze the marginal gains of test-time compute relative to pre-training compute, and reasoning distillation is applied in practice to transfer capability from R1 to compact 7B/14B/32B models.
The five main open-source frameworks used to build production preference-optimization pipelines are compared: HuggingFace TRL (reference implementation with SFTTrainer, RewardTrainer, DPOTrainer, PPOTrainer, and GRPOTrainer); Axolotl (config-driven YAML pipeline); LLaMA-Factory (UI + multi-model preference optimization); OpenRLHF (Ray + DeepSpeed multi-node distributed RL); and ByteDance verl (hybrid-engine architecture for the highest-scale GRPO). For each framework, the dataset format, custom-reward integration, scaling characteristics, and compute requirements are laid out in a detailed table. A framework-selection matrix then gives participants a concrete decision path: TRL for an 8B model on a single GPU, Axolotl or OpenRLHF for 8B-70B models on 8 GPUs with production CI, and verl for multi-node, 70B+, R1-scale GRPO.
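The decision path above can be encoded as a small lookup function. This sketch covers only the three cases the matrix names; the function name, its parameters, and the choice to return an empty list for configurations the matrix does not mention are assumptions.

```python
from typing import List


def suggest_framework(model_params_b: float, num_gpus: int,
                      multi_node: bool = False) -> List[str]:
    """Map (model size in billions of parameters, GPU count, multi-node) to candidates."""
    if multi_node and model_params_b >= 70:
        return ["verl"]                    # multi-node, 70B+, R1-scale GRPO
    if model_params_b <= 8 and num_gpus == 1:
        return ["TRL"]                     # up to 8B on a single GPU
    if 8 <= model_params_b <= 70 and num_gpus >= 8:
        return ["Axolotl", "OpenRLHF"]     # 8B-70B on 8 GPUs with production CI
    return []                              # not covered by the stated matrix


print(suggest_framework(8, 1))
print(suggest_framework(32, 8))
print(suggest_framework(70, 16, multi_node=True))
```

Returning an empty list for uncovered configurations keeps the sketch honest about what the matrix states; a real selection would also weigh dataset format, custom-reward needs, and team tooling.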
The verification discipline of the alignment pipeline is addressed in a separate module. Reward models are evaluated with RewardBench (Chat, Chat-Hard, Safety, Reasoning), JudgeBench, and RM-Bench; policies are evaluated with AlpacaEval 2.0 LC (length-controlled win rate), MT-Bench, Arena Hard (with a Claude Opus 4.7 or GPT-5 judge), and Chatbot Arena Elo. For reward-hacking detection, typical failure modes such as length collapse, sycophancy, EOS spam, format hacking, and KL drift are demonstrated with practical examples, and mitigation strategies (length-control reward, KL-penalty tuning, early-stopping criteria) are provided. A compliance audit checklist covering the EU AI Act, NIST AI RMF, ISO/IEC 42001, and KVKK ties the alignment process to enterprise compliance discipline.
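Two of the failure modes listed above, KL drift and length collapse, lend themselves to simple numeric monitors. The sketch below assumes per-token log-probabilities of policy-sampled tokens are available from both the policy and the reference model; the function names and the thresholds `kl_limit=0.5` and `length_band=(0.5, 1.5)` are hypothetical and would need tuning per project.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def estimate_kl(policy_logps: Sequence[float], ref_logps: Sequence[float]) -> float:
    """Monte Carlo estimate of KL(policy || ref) from tokens sampled by the policy."""
    return mean([p - r for p, r in zip(policy_logps, ref_logps)])


def length_ratio(baseline_lengths: Sequence[int],
                 current_lengths: Sequence[int]) -> float:
    """Mean response length now, relative to the pre-alignment baseline."""
    return mean(current_lengths) / mean(baseline_lengths)


def alignment_alerts(policy_logps: Sequence[float], ref_logps: Sequence[float],
                     baseline_lengths: Sequence[int],
                     current_lengths: Sequence[int],
                     kl_limit: float = 0.5,
                     length_band: Tuple[float, float] = (0.5, 1.5)) -> List[str]:
    """Return a list of triggered alerts; thresholds are illustrative only."""
    alerts = []
    if estimate_kl(policy_logps, ref_logps) > kl_limit:
        alerts.append("kl_drift")
    ratio = length_ratio(baseline_lengths, current_lengths)
    if ratio < length_band[0]:
        alerts.append("length_collapse")
    elif ratio > length_band[1]:
        alerts.append("length_inflation")
    return alerts
```

A check like `alignment_alerts(...)` could run on a held-out prompt set at every evaluation step and serve as one input to an early-stopping criterion.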
In the capstone module, each participant designs an end-to-end Turkish LLM alignment pipeline tailored to their own use case: base-model selection (Llama 3.3, Qwen3, Gemma 3, Mistral); a Turkish SFT mix (Cosmos, Turkish UltraChat, their own data); reward-model training on a Turkish UltraFeedback preference dataset; an evidence-based choice among DPO, KTO, SimPO, and GRPO; pipeline implementation with TRL, Axolotl, or OpenRLHF; evaluation with RewardBench, AlpacaEval 2.0 LC, and Turkish MT-Bench; production deployment with vLLM; and a 90-day operational roadmap (cost, KL-drift monitoring, online RLAIF feedback loop). By the end of the training, participants have the technical competence to build a production-grade reward model starting from the Bradley-Terry preference loss; manage PPO's clipping objective and KL-penalty tuning; choose among DPO, KTO, SimPO, ORPO, IPO, and cDPO on the basis of evidence; align an R1-scale reasoning model with GRPO; set up Constitutional AI and RLAIF pipelines; operate the TRL/Axolotl/LLaMA-Factory/OpenRLHF/verl toolchain in production; and manage alignment processes with EU AI Act + KVKK compliance discipline. The training consists of 3 days, 12 modules, and more than 90 hands-on lessons.
Training Methodology
- The only comprehensive advanced program in Turkey that addresses RLHF, DPO, and GRPO algorithms end to end with math + code + production
- Full mathematical construction from Bradley-Terry preference loss to the DPO implicit-reward derivation, and from the PPO clipping objective to GRPO group-relative advantage computation
- Comparative evidence-based analysis of the modern preference-optimization family: KTO, IPO, SimPO, ORPO, cDPO
- Internal structure of the reasoning-model alignment pipelines: DeepSeek R1, R1-Zero, Qwen3 Reasoning, and Tülu 3
- Alignment without human labels via Constitutional AI and RLAIF; Turkish + KVKK-compliant principle-set design
- A production toolchain comparison matrix of five frameworks: TRL, Axolotl, LLaMA-Factory, OpenRLHF, verl
- End-to-end evaluation discipline with RewardBench, AlpacaEval 2.0 LC, MT-Bench, Arena Hard and reward-hacking mitigation
- Compliance integration with an EU AI Act, NIST AI RMF, ISO/IEC 42001, and KVKK audit framework
Course Curriculum
104 Lessons
Instructor
Şükrü Yusuf KAYA
AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant
Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.
Related programs
Professional Software Development with Claude Code Training
A comprehensive, advanced 4-day training program for software professionals seeking enterprise-level mastery of Anthropic's agentic coding platform, Claude Code. Production-grade agent architecture with MCP integrations, Hooks, Sub-agents, Skills, and the Claude Agent SDK.
4 days, advanced
Building AI Agents with the Claude Agent SDK Training
A comprehensive, advanced 4-day program for software engineers who want to develop production-grade AI agents with Anthropic's Claude Agent SDK. Tool-use orchestration, MCP server development, multi-agent patterns, prompt caching, and evaluation engineering.
4 days, advanced
Introduction to Artificial Intelligence and Enterprise Prompt Engineering Training
This enterprise-focused training teaches AI foundations, large language models, prompt engineering, secure usage, and real business scenarios to help teams generate higher-quality and better-controlled AI outputs.
2 days