LLM Engineering
Table of Contents
Module 0: Course Framework & Workshop Setup
- 1
Who Is an LLM Engineer? The AI Engineering Career Ladder from Junior to Staff
LLM Engineering is a new discipline: position among ML engineer, data scientist, AI researcher, and MLOps; skill matrix, seniority levels, global and Turkey salary ranges, daily workflow, career pivots.
- 2
Course Philosophy: Why This Path, Why This Order — The Skeleton of an 8-Month Curriculum
The 8 pedagogical principles behind this course, the 12-part / 76-module architecture, prerequisite graph, comparison with Karpathy & Stanford CS336 & Hamel Husain, 4 study modes, and 3 certificate levels.
- 3
Workshop Setup: uv, PyTorch 2.5+, CUDA, WSL2, Mac MPS, Triton, FlashAttention, Nsight
Modern Python + PyTorch + CUDA + Triton + FlashAttention + Nsight setup from scratch. Step-by-step for Linux, WSL2, macOS (Apple Silicon), Windows native. Why uv package manager beats pip/conda, sanity test.
- 4
Cloud Account Atlas: HuggingFace, OpenAI, Anthropic, Together, Modal, Runpod, Lambda — Which One and Why?
An LLM engineer juggles 12+ cloud accounts over 8 months. Which one and why, pricing models, API key security, free credit hunting, multi-provider strategy, Türkiye-specific payment and tax practices.
- 5
Cost & Ethics Contract: 8-Month Budget, Token Economics, The AI Engineer's Code
Total estimated course cost in three scenarios, token economics 101, budget alarm setup, the AI engineer's ethics contract: copyright, KVKK, EU AI Act, academic integrity, open-source contribution, environmental impact.
Module 1: The AI Engineer's Mathematical Arsenal
- 1
Linear Algebra Refresher: Vectors, Matrices, Broadcasting, Einsum — The LLM Engineer's Mathematical Language
Vector/matrix/tensor intuition, broadcasting rules, dot product, three perspectives on matrix multiplication, einsum notation, norm families, mathematical anatomy of Q@K^T multiplication in attention.
- 2
Matrix Decompositions: Eigendecomposition, SVD, PCA, and the Secret of LoRA
The art of decomposing a matrix into its 'DNA'. Eigendecomposition and SVD, building PCA from SVD from scratch, the mathematical foundation of LoRA — why low-rank updates suffice. Embedding compression practice.
- 3
Derivatives, Gradients, and Matrix Calculus: The Math of Backprop from Scratch
Derivatives from scalar to vector to matrix. Jacobian, Hessian, chain rule, numerator vs denominator layout. Why the derivative of softmax + cross-entropy is so elegant. Manual backprop computation compared with PyTorch autograd.
- 4
Chain Rule and Backpropagation: Build a Mini-Autograd from Scratch (Karpathy micrograd in Turkish)
Build Karpathy's micrograd from scratch in Turkish — a 200-line PyTorch-like automatic differentiation engine. Computational graph, topological sort, operator overloading, _backward closures, gradient accumulation. Train an MLP at the end.
- 5
Probability Foundations: Joint, Marginal, Conditional, and Bayes — The Language of How LLMs Think
LLMs are fundamentally conditional probability machines. The math of P(x_t | x_<t), joint/marginal/conditional relationship, independence, the power of Bayes theorem, distribution families (Bernoulli, Categorical, Gaussian), expectation, variance — sampling (temperature, top-k, top-p) starts here.
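A minimal sketch of the "conditional probability machine" view, with hypothetical toy logits standing in for a real model's output; it shows how temperature reshapes P(x_t | x_<t):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab = ["kedi", "köpek", "ev", "araba", "kitap"]   # hypothetical 5-token vocab
logits = np.array([2.0, 1.5, 0.3, -0.5, -1.0])      # made-up next-token logits

for T in (0.5, 1.0, 2.0):        # temperature reshapes P(x_t | x_<t)
    p = softmax(logits / T)
    print(f"T={T}: " + ", ".join(f"{w}={q:.2f}" for w, q in zip(vocab, p)))
```

Low temperature sharpens the distribution toward the top token; high temperature flattens it, which is exactly the knob top-k and top-p then build on.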
- 6
MLE, MAP, Posterior: The Grammar of Modeling — The Mathematical Root of Pretrain Loss
See that LLM pretrain loss is a Maximum Likelihood Estimation (MLE) objective, that fine-tuning is mathematically Bayesian updating, and that regularization corresponds to MAP. Cross-entropy = NLL relation, prior choice, conjugate priors.
- 7
Entropy, Cross-Entropy, KL Divergence, and Mutual Information: Information Theory's Life in LLMs
Shannon entropy, the true meaning of cross-entropy as LLM loss, KL divergence asymmetry and forward vs reverse KL (mode covering vs mode seeking), the role of KL constraint in RLHF/DPO, JS and Wasserstein, mutual information, knowledge distillation math.
- 8
Optimization: From SGD to AdamW, Lion, Muon — All Modern LLM Optimizers
Past and future of the gradient descent family: GD, SGD, Momentum (Heavy ball, Nesterov), AdaGrad, RMSProp, Adam, AdamW, Lion, Muon. Learning rate schedules: linear warmup + cosine decay. Loss landscape: sharp vs flat minima.
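A minimal sketch of a single AdamW update in the usual decoupled-weight-decay formulation (NumPy only; the toy loss sum(w²) is an assumption for illustration):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g            # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g**2         # second-moment (variance) estimate
    m_hat = m / (1 - b1**t)              # bias correction for early steps
    v_hat = v / (1 - b2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)  # decoupled decay
    return w, m, v

w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 4):
    g = 2 * w                            # gradient of the toy loss sum(w^2)
    w, m, v = adamw_step(w, g, m, v, t)
print(w)
```

The key AdamW detail is that weight decay is applied directly to w rather than folded into the gradient, which is what "decoupling" means.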
- 9
Numerical Stability: Log-Sum-Exp, FP16 Pitfalls, NaN Hunting — Hidden Hours of LLM Training
Floating point representations (FP32, FP16, BF16, FP8), overflow/underflow/NaN hunting, log-sum-exp trick, softmax numerical stability, mixed precision training (autocast + GradScaler), numerical roots of pretrain loss spikes.
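The log-sum-exp trick in isolation, as a sketch: subtracting the max before exponentiating keeps the result finite where the naive form overflows.

```python
import numpy as np

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())   # identical value, no overflow

z = np.array([1000.0, 1001.0, 1002.0])  # naive np.log(np.exp(z).sum()) -> inf
print(logsumexp(z))                      # ~1002.41, finite

def stable_log_softmax(z):
    return z - logsumexp(z)              # log P(x) without ever forming exp(z)

print(stable_log_softmax(z))
```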
- 10
Information Geometry and Manifold Intuition: Why Embeddings Are Meaningful
Geometric anatomy of embedding space: manifold hypothesis, t-SNE/UMAP visualization, cosine vs Euclidean metric, Riemannian geometry intuition, Fisher information, natural gradient, embedding rotation invariance. This lesson completes Module 1.
Module 2: Before PyTorch — NumPy and Autodiff from Scratch
- 1
NumPy Tensor Engineering: Strides, View, Broadcasting, and the Anatomy of Memory Layout
Memory anatomy of a tensor: row-major C vs column-major F, strides, view vs copy, contiguous, fancy indexing, advanced broadcasting rules, BLAS backend intuition, einsum vs einops. Foundation of performance-critical code.
- 2
Computational Graph Deep Dive: DAG Structure, Topological Sort, Eager vs Static Paradigm
Deep analysis of the graph structure behind autograd: DAG anatomy, in-degree/out-degree, topological sort algorithms (DFS post-order, Kahn's), eager (PyTorch) vs static (TF1, JAX, XLA) paradigms, graph optimization (fusion, dead code elimination).
- 3
Reverse-mode vs Forward-mode Autodiff: JVP, VJP, Dual Numbers, and Which to Use When in LLMs
The two fundamental modes of automatic differentiation: forward-mode (Jacobian-vector product, dual numbers) and reverse-mode (vector-Jacobian product, backprop). Mathematical comparison, computational complexity, JAX's jvp/vjp/grad/hessian, which scenario requires which mode in LLMs.
- 4
Tensor Autograd from Scratch in NumPy: Building Broadcasting-Aware Mini-Tinygrad
Lift scalar micrograd from Lesson 1.4 to tensor level: NumPy-based Tensor class, broadcasting-aware backward (sum-along-broadcast-dims trick), matmul/conv/softmax operators, gradient flow through transpose and views, ~500-line PyTorch-like training engine.
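A minimal sketch of the sum-along-broadcast-dims trick named above: when the forward op broadcast a tensor up to a larger shape, the backward pass must sum the gradient back down to the original shape.

```python
import numpy as np

def unbroadcast(grad, shape):
    # Sum over the leading axes NumPy prepended during broadcasting.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum over axes that were size-1 in the original shape.
    for ax, size in enumerate(shape):
        if size == 1:
            grad = grad.sum(axis=ax, keepdims=True)
    return grad

a = np.random.randn(4, 3)
b = np.random.randn(1, 3)      # broadcast along axis 0 inside a + b
grad_out = np.ones((4, 3))     # upstream gradient of the elementwise sum
print(unbroadcast(grad_out, b.shape).shape)   # (1, 3)
```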
- 5
PyTorch vs JAX vs torch.compile: Practical Comparison of Eager, Static, and Hybrid
Theoretical difference from 2.2 → practical benchmark. Implement the same transformer block in PyTorch eager, JAX jit, torch.compile (reduce-overhead, max-autotune) modes. Compile time, throughput, memory, debug experience side-by-side. Which framework for which scenario in 2026?
- 6
Custom autograd.Function and PyTorch Internals: Write Your Own Gradients
Extending PyTorch autograd: torch.autograd.Function subclasses, custom forward/backward, state saving via ctx, gradcheck validation, custom CUDA/Triton kernel wrap (preview), FlashAttention block matmul mini-implementation, second-order gradients and gradgradcheck.
Module 3: The Philosophical History of Deep Learning
- 1
The 80-Year Journey of Artificial Neural Networks: From McCulloch-Pitts to GPT-5
The history of deep learning: 1943 McCulloch-Pitts neurons, 1958 Perceptron, 1986 backprop popularization, 1989 LeCun's ZIP-code CNN, 1997 LSTM, 2006 Hinton's DBN paper, 2012 AlexNet, 2017 Transformer, 2022 ChatGPT, 2026 GPT-5. Technical and social context of each milestone.
- 2
Connectionism vs Symbolic: 60 Years of an Unending Debate and Where LLMs Fit
The 60-year philosophical battle between Symbolic AI (LISP, expert systems, logic programming) and connectionism (neural networks). Bitter Lesson (Sutton 2019), neuro-symbolic hybrids, whether chain-of-thought and tool use are symbolic manipulation, the future of LLM reasoning.
- 3
Big Bang in Vision: AlexNet, VGG, Inception, ResNet, BatchNorm — Birth of Modern Architectural Components
The 2012-2017 vision revolution: AlexNet's 5 innovations, VGG's uniformity principle, Inception's multi-scale approach, ResNet's skip connection revolution, BatchNorm's response to internal covariate shift. Detailed analysis of the architectural legacy that led to Transformers.
- 4
Sequence Modeling: From RNN, LSTM, GRU to Encoder-Decoder and Attention
NLP evolution 1990-2017: vanilla RNN's vanishing gradient, the LSTM (Hochreiter 1997) and GRU solutions, Seq2Seq (Sutskever 2014), Bahdanau and Luong attention mechanisms, the birth of contextual embeddings with ELMo. This journey set the stage for the 2017 Transformer.
- 5
8 Years After Transformer: From 'Attention Is All You Need' to GPT-5 Anatomy
Detailed 8-year evolution map of transformer from Vaswani 2017 to 2026 GPT-5: BERT, GPT series, T5, BART, Llama, Claude, DeepSeek, Mistral, Qwen. Pre-training paradigm settlement, scaling laws, RLHF, multimodal capabilities, reasoning models.
Module 4: The Mental Model of LLMs
- 1
An LLM Is a Conditional Probability Machine: P(x_t | x_<t) and Its Consequences
Clarify what an LLM is at its core: a machine producing conditional probability distributions. Autoregressive generation, chain rule decomposition of joint probability, real meaning of perplexity measurement, why 'hallucination' is inevitable, calibration concept, relationship between logits and probabilities.
- 2
Tokenization as Part of the Mental Model: Token Economics, Turkish Pitfalls, and Glitch Tokens
How token boundaries shape predictions, token economics in morphologically rich languages like Turkish, 'glitch tokens' like SolidGoldMagikarp, leading whitespace problem, token-level detail of prompt engineering. Practical foundation for Module 6 (Tokenization Microsurgery).
- 3
Sampling Art Deep Dive: Greedy, Beam, Top-K, Top-P, Min-P, DRY, Tail-Free — All in Production
Production-level sampling strategies: temperature/top-k/top-p/min-p/typical-p/tail-free/DRY repetition penalty, beam search and diverse beam, contrastive decoding, speculative sampling, sampling in reasoning models, sampling with structured output, multi-sample self-consistency.
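A minimal top-p (nucleus) cutoff sketch over hypothetical logits, NumPy only; production samplers layer top-k, min-p, and repetition penalties on top of this same idea.

```python
import numpy as np

def top_p_sample(logits, p=0.9, temperature=1.0, rng=np.random.default_rng(0)):
    probs = np.exp(logits / temperature - np.max(logits / temperature))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                 # tokens by descending prob
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]     # smallest nucleus >= p
    nucleus = probs[keep] / probs[keep].sum()       # renormalize inside nucleus
    return rng.choice(keep, p=nucleus)

logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0, -2.0])
print([top_p_sample(logits) for _ in range(5)])     # token ids from the nucleus
```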
- 4
Logit Observability: Reading the Model's Mind with logprobs — Production Diagnostics
Production-grade use of logprobs API: confidence-based filtering, hallucination detection, prompt diagnostics, model probing, MCQ scoring, semantic confidence, anomaly detection. logits/probability/log-probability conversions, token-level entropy, extraction techniques.
- 5
The Mathematics of In-Context Learning: Implicit Bayesian Inference and Induction Heads
Mathematical explanations of GPT-3's few-shot learning: implicit Bayesian inference (Xie 2022), induction heads mechanism (Olsson 2022), task identification and learning algorithm emergence. Why prompting with examples works, why only in sufficiently large models, why it fails OOD.
- 6
Scaling Laws Intuition: Kaplan, Chinchilla, Post-Chinchilla — Mathematical Planning of LLM Training
Complete analysis of the mathematical foundations of LLM training: Kaplan 2020 power laws, Chinchilla 2022 compute-optimal theorem, post-Chinchilla over-training (Llama 3 approach), inference-aware scaling (Sardana 2023), μP hyperparameter transfer, FLOP calculation, MFU optimization.
- 7
Emergent Capabilities: Are 'Sudden' Abilities Real or Measurement Artifacts?
GPT-3 paper's 'emergent abilities' claim, Wei 2022's systematic study, Schaeffer 2023's 'Are Emergent Abilities a Mirage?' challenge. Threshold effects, metric design, smooth vs discontinuous capabilities. Which abilities are truly emergent, which are measurement artifacts?
- 8
Memorization vs Generalization: Paraphrase Tests and LLM's True Understanding
Does LLM training memorize the corpus or generalize? Exact match tests, paraphrase resistance, contamination detection, membership inference. Memorization detection in eval, training data extraction risks, privacy implications.
Module 5: PyTorch Engineering — Engineer-Grade
- 1
torch.compile and torch.fx: Graph Capture, JIT Compilation, and Production Optimization
PyTorch 2.0+ game-changer feature torch.compile in depth: TorchDynamo + TorchInductor + Triton pipeline, FX graph manipulation, compile modes (default/reduce-overhead/max-autotune), graph breaks debugging, dynamic shapes, production trade-offs. Production extension of Module 2.5.
- 2
Mixed Precision Training Deep Dive: BF16, FP16, FP8, autocast, GradScaler — Production Patterns
Module 1.9 covered numerical stability fundamentals. This lesson dives into production mixed precision: autocast region design, GradScaler dynamics, FP8 H100/B200 native training (DeepSeek-V3 method), gradient norm monitoring, loss spike investigation, BF16 vs FP16 production decision matrix.
- 3
Memory Profiling: torch.profiler, Nsight Systems, OOM Debugging — Production GPU Memory Management
The hidden life of GPU memory: activation vs gradient vs optimizer state breakdown, memory snapshot with torch.profiler, Nsight Systems timeline analysis, OOM root cause analysis, activation checkpointing, gradient accumulation, fragmentation solutions.
- 4
CUDA Streams, Events, and NCCL Fundamentals: The Foundation Layer of Multi-GPU Communication
Concurrency on GPU: parallel kernel execution with streams, fine-grained synchronization with events, NCCL collective operations (allreduce, broadcast, all-gather, reduce-scatter). The infrastructure layer of distributed training. Preparation for Module 13 (Distributed Training).
- 5
Custom GPU Kernels with Triton: Softmax, Matmul, FlashAttention Mini from Scratch
Triton's secret of GPU programming with Python syntax: programming model (program_id, block_size, autotune), softmax kernel from scratch, matmul tiling, FlashAttention's block-wise mini implementation, performance tuning. Practical foundation for Module 37 (CUDA/Triton deep dive).
- 6
torch.distributed In Depth: DDP, FSDP, ZeRO Stages — Production Distributed Training
We covered NCCL fundamentals in 5.4. Now the production distributed training stack: DDP gradient bucketing + overlap, FSDP shard strategies (FULL_SHARD, SHARD_GRAD_OP, HYBRID_SHARD), DeepSpeed ZeRO Stage 1/2/3 comparison, hybrid 3D parallelism. Final bridge to Module 13.
- 7
Debug Arsenal: register_hook, Anomaly Mode, torch.utils.benchmark — Production Debugging Toolkit
Toolkit for when things break in production PyTorch: forward/backward hooks, anomaly detection mode, deterministic training, torch.utils.benchmark precise timing, repro patterns, systematic NaN hunting, gradient inspection, model debugging strategies.
- 8
Production Engineering: Reproducibility, CI/CD for ML, Versioning, and Deployment Patterns
Final lesson of PyTorch engineering — production workflow patterns: ML CI/CD pipelines, eval harness CI integration, model + prompt + data versioning (DVC, MLflow, HF Hub), canary deployment, A/B testing, rollback strategies, drift monitoring, KVKK-compliant deployment. Closing of Part I.
Module 6: Tokenization Microsurgery
- 1
Character, Word, Subword: Tokenization Design Constraints and Decision Matrix
Tokenization design space: character-level (UTF-8, byte), word-level (whitespace, morphology), subword (BPE, WordPiece, Unigram). Mathematical and pragmatic trade-offs of each choice, OOV problem, vocabulary size decision matrix, multilingual challenges, Turkish characteristics.
- 2
BPE Algorithm: Sennrich 2016 Line by Line — Pseudocode, Complexity, Edge Cases
BPE mathematical anatomy. Sennrich 2016 paper line by line: pre-tokenization, byte-pair merge counting, greedy merge selection, vocabulary construction, encoding logic, complexity analysis (O(N·V)), edge cases (Unicode, whitespace, special tokens). Full understanding before implementation in Module 6.3.
- 3
Write BPE from Scratch in 200 Lines: Training + Encoding + Decoding + Turkish Corpus
Karpathy minbpe-style from-scratch implementation: pure Python BPE training (Sennrich algorithm), encoding/decoding, regex pre-tokenization, byte-level extension, train on Turkish corpus + compare with Trendyol-LLM. Practical understanding of modern LLM tokenizers.
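One BPE training iteration in the Sennrich 2016 style, as a sketch: count adjacent symbol pairs, merge the most frequent one. The toy Turkish words and frequencies are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):   # adjacent symbol pairs
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:         # fuse the winning pair
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus as {word-as-symbol-tuple: frequency}.
words = {tuple("merhaba"): 5, tuple("merak"): 3, tuple("aba"): 2}
for _ in range(3):
    pair = most_frequent_pair(words)
    print("merge:", pair)
    words = merge(words, pair)
```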
- 4
WordPiece (BERT): Likelihood-Based Merges and Quiet Differences from BPE
WordPiece algorithm: from Schuster & Nakajima 2012 to BERT 2018. Likelihood-based merge score instead of frequency, the ## continuation-prefix convention, [UNK]/[CLS]/[SEP] special tokens, quiet but critical differences from BPE. Practical training with HuggingFace Tokenizers, BERT-base-Turkish-cased example, vocab design.
- 5
SentencePiece + Unigram LM (Kudo 2018): Probabilistic Tokenization and Subword Regularization
SentencePiece framework + Unigram language model algorithm. Kudo 2018's probabilistic approach: start with a large vocab, prune with EM. Viterbi segmentation for encoding, subword regularization, ▁ whitespace-as-character. The framework behind Llama, T5, and Mistral. Turkish and multilingual advantages.
- 6
GPT-2/GPT-4 Byte-Level BPE + tiktoken Regex: Anatomy of the Modern Standard
Birth of GPT-2 byte-level BPE (Radford 2019), the secret of the regex pre-tokenizer, GPT-3.5 cl100k, GPT-4o o200k, Llama-3's return to tiktoken. tiktoken Rust performance, token counting for prompt engineering, Turkish cost economics, comparison of encoding regimes.
- 7
Special Tokens + ChatML + Chat Templates: Tokenization Anatomy of the Conversational LLM
Birth of chat formats (ChatGPT API, March 2023), the ChatML spec, anatomy of <|im_start|>/<|im_end|>/<|im_sep|> tokens, Llama-3 Instruct + Mistral [INST] + Claude Messages API + Gemini formats, HuggingFace chat_template Jinja2, system prompt placement, tool use tokens, prompt injection security, multi-turn token economy, Turkish chat practice.
- 8
HuggingFace Tokenizers Rust + Production Pipeline: Training a Production-Quality Tokenizer from Scratch
HuggingFace tokenizers crate Rust architecture, 6-layer pipeline (Normalizer → PreTokenizer → Model → PostProcessor → Decoder → Trainer), tokenizer.json format anatomy, Turkish production-grade end-to-end training, Rust internals (parallel processing, SIMD, ahash, mmap), tiktoken/SentencePiece conversion, threading + caching + FFI overhead, benchmarks.
- 9
Tokenizer Evaluation: Fertility, Compression Ratio, Downstream Impact, and Information-Theoretic Metrics
Deep anatomy of all metrics that measure tokenizer quality: fertility (tokens/word), compression ratio (bytes/token), OOV rate, bits-per-character (BPC), impact on perplexity, cross-lingual fertility, downstream task impact, vocab coverage, A/B testing protocols, Turkish-specific metrics, cost 'tax' analysis, capstone evaluation framework.
- 10
Capstone TurkTokenizer-tr: Train, Evaluate, and Publish a Production-Grade Turkish Tokenizer to HuggingFace Hub
The work of Module 6: train TurkTokenizer-tr (32K vocab Turkish BPE) from scratch, evaluate with the Lesson 6.9 framework, write a model card, choose a license, publish to HuggingFace Hub. Corpus curation (Wikipedia + OSCAR + news + literature + code), cleaning pipeline, chat template, production integration, maintenance roadmap. Synthesis of Lessons 6.1-6.9, a real-world artifact.
Module 7: Embedding Layer — The Vector Space of Meaning
- 1
What is Embedding? Bridge from Token ID to Meaning Vector — The Discrete-to-Continuous Revolution
Mathematical anatomy of embedding: integer token ID to d-dimensional dense vector mapping. Vocab × d_model matrix. Degenerate case of one-hot encoding. Why semantic vector space works (distributional hypothesis, Firth 1957). 'Meaning emerges from co-occurrence' philosophy. Pre-NN era (LSA, LSI) vs neural era (word2vec → BERT → LLM). Practical meaning for Turkish.
- 2
Word2Vec Line by Line: Anatomy of Mikolov 2013's Skip-Gram + CBOW + Negative Sampling
Line-by-line anatomy of Mikolov 2013 paper: Skip-Gram vs CBOW architecture differences, softmax computational bottleneck, hierarchical softmax (Huffman tree), negative sampling (Mikolov 2013b), subsampling, dynamic window. Pure Python implementation in 100 lines. Gensim Turkish word2vec training demo. Comparison with modern LLM embeddings.
- 3
GloVe + FastText: Global Co-Occurrence Matrix + Subword N-Gram Extension
GloVe (Pennington 2014) global co-occurrence matrix approach vs Word2Vec local window: mathematical formulation, weighted least squares objective, X_ij interpretation. FastText (Bojanowski 2017) subword n-gram embedding: 'merhaba' = 'mer' + 'erh' + ... OOV problem solution, ideal for Turkish morphological languages. Performance comparison, which scenario for each.
- 4
Modern LLM Embedding Layer + Embedding Tying: Input/Output Sharing and Scaling
Embedding layer in modern transformer architecture: nn.Embedding initialization (Llama-3 style), embedding tying (input/output sharing) — mathematical justification and memory savings, embedding scaling before pre-layernorm (sqrt(d_model) or not), no position addition before RoPE, multimodal embeddings (vision + audio tokens). Architectural differences between Llama-3, GPT-4o, Claude-3.
- 5
Embedding Geometry: Cosine Similarity, Euclidean Distance, Isotropy and BERTology Findings
Topology of embedding vector space: cosine similarity vs Euclidean distance vs dot product (which when, mathematical relationships), isotropy (vectors balanced across directions, Gao 2019 'representation degeneration'), anisotropy problem in BERT/GPT embeddings, mitigation (whitening, normalization). BERTology findings: which information in which layer (Rogers 2020). Practical analysis for Turkish.
- 6
Capstone Module 7: Turkish Semantic Search System — sentence-transformers + FAISS + Mini-RAG
Module 7 capstone project: Turkish semantic search system from scratch. sentence-transformers Turkish model selection, FAISS vector index, production-grade query pipeline, mini-RAG architecture (retriever + generator), benchmark + deployment. Practical application of embedding theory.
Module 8: Attention Mathematics — The Heart of Transformer
- 1
Scaled Dot-Product Attention: The Heart of Vaswani 2017 Line by Line — Anatomy of Query, Key, Value Trio
The cornerstone of transformer — mathematical anatomy of scaled dot-product attention: Query/Key/Value trio, dot product similarity, softmax normalize, sqrt(d_k) scaling justification, causal mask (autoregressive), attention weights interpretation. PyTorch implementation, FLOP analysis, numerical stability concerns, attention pattern visualization with Turkish examples.
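A minimal causal scaled dot-product attention sketch in PyTorch, mirroring softmax(QK^T / √d_k)·V with an autoregressive mask (single head, no batching, for clarity):

```python
import math
import torch

def causal_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (T, T) similarities
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))    # hide future tokens
    return torch.softmax(scores, dim=-1) @ v            # weighted value mix

T, d = 4, 8
q, k, v = (torch.randn(T, d) for _ in range(3))
print(causal_attention(q, k, v).shape)   # torch.Size([4, 8])
```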
- 2
Multi-Head Attention: N Parallel Heads, Concat + Projection, Grouped-Query Attention (GQA), Multi-Query Attention (MQA)
Why we split single attention into N parallel heads: each head's capacity to learn different patterns (syntactic, semantic, positional). Concat + output projection architecture, head pruning empirical findings, grouped-query attention (GQA) in Llama-3 and Mistral, multi-query attention (MQA) in Falcon, head visualization with Turkish examples.
- 3
FlashAttention: IO-Aware Attention — Dao 2022 Algorithm and Modern Implementations
Mathematical and systems anatomy of FlashAttention: why standard attention is memory-bound, GPU memory hierarchy (HBM vs SRAM), tile-based computation, online softmax, recomputation in the backward pass. Evolution of FlashAttention-1 (Dao 2022), FlashAttention-2, FlashAttention-3. The flash-attn library, performance benchmarks, long context enablement.
- 4
KV Cache + Paged Attention: Inference Serving Optimization — vLLM Paged Attention and Continuous Batching
LLM inference serving optimization: KV cache anatomy (prefill vs decode phases), memory fragmentation problem, paged attention (vLLM 2023 Kwon), continuous batching, dynamic memory allocation. Llama-3 production serving math: throughput, latency trade-offs, multi-tenancy.
- 5
Capstone Module 8: Alternatives to Quadratic Attention — Linear Attention, RetNet, Mamba (State Space Models)
Module 8 capstone: alternatives to quadratic attention. Linear Attention (Katharopoulos 2020) — kernel trick + recurrent form. RetNet (Sun 2023) — Microsoft's retention mechanism. Mamba (Gu & Dao 2023) — selective state space models. Which sub-quadratic architecture for which scenario, GPT-4 vs Mamba comparison, hybrid models (Jamba), future trends.
Module 9: Position Encoding — Order-Embedded Meaning
- 1
Why is Position Encoding Mandatory? Sinusoidal vs Learned Absolute Position — Classical Approaches from Vaswani 2017 to GPT-2
Attention's permutation-invariance problem: 'Dog bit cat' and 'Cat bit dog' are equivalent! Necessity of position encoding. Vaswani 2017 sinusoidal formula (sin/cos at different frequencies), generalization argument (longer sequences). GPT-2 learned absolute position embedding, max_position_embeddings limit. Trade-offs, practical meaning for Turkish syntax.
- 2
RoPE in Depth: Mathematical Anatomy of Rotary Position Embedding — From Su 2021 to Llama-3
Mathematical anatomy of RoPE: complex number rotation interpretation, why applied to Q and K, relative position implicit derivation. Llama-3 RoPE implementation line by line, base frequency 10000, pair-wise rotation. PyTorch implementation, RoPE vs sinusoidal/learned comparison, reason for widespread adoption in modern models.
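A minimal RoPE sketch in NumPy, using the half-split channel pairing found in common Llama implementations (base frequency 10000); a production version would precompute the cos/sin cache and apply this to both q and k:

```python
import numpy as np

def rope(x, base=10000.0):
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)     # theta_i = base^(-2i/d)
    angles = np.outer(np.arange(T), freqs)        # (T, d/2): m * theta_i
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]             # half-split channel pairs
    return np.concatenate([x1 * cos - x2 * sin,   # 2D rotation per pair
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(6, 8)    # 6 positions, head_dim 8
print(rope(q).shape)          # (6, 8)
```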
- 3
ALiBi: Attention with Linear Biases — Press 2021's Simple Solution and Extrapolation Advantage
ALiBi (Press 2021): inject position information by adding a linear bias to the attention score, with no position embedding at all. Math: attention[i,j] += m × (j-i). Per-head slope hierarchy (m_h = 2^{-8h/H}). Strengths: zero parameters, train-short eval-long extrapolation, simple implementation. Comparison with RoPE; usage in BLOOM and MPT.
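A minimal sketch of the ALiBi bias described above, building the per-head (H, T, T) bias tensor from m_h = 2^{-8h/H} and (j − i), with future positions masked to −inf:

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)  # m_h per head
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]           # (j - i), negative in the past
    bias = slopes[:, None, None] * rel[None]    # (H, T, T) additive bias
    return np.where(rel[None] <= 0, bias, -np.inf)   # causal: mask the future

print(alibi_bias(seq_len=4, n_heads=2)[0])      # first head's bias matrix
```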
- 4
Long Context Extrapolation: NTK-Aware Scaling + YaRN + LongRoPE — Journey from 8K to 1M Tokens
Extending RoPE to long context: NTK-aware scaling intuition, YaRN (Peng 2023) — comprehensive solution + temperature scaling, LongRoPE (Microsoft 2024) — 2M token context. Llama-3-8B base 8K → 128K extension recipes, Gemini 1.5 1M token tricks, fine-tune protocol.
- 5
Capstone Module 9: Implement Llama-3 RoPE from Scratch in 50 Lines — Pure NumPy + Visualization
Module 9 capstone: implement Llama-3 compatible RoPE in 50 lines pure NumPy. cos/sin cache precomputation, pair-wise rotation, position visualization (cos/sin heatmap, attention bias pattern). Compatibility test with actual Llama-3 weights. Turkish examples for position pattern interpretation.
Module 10: Transformer Block — Anatomy of the Block
- 1
Normalization Revolution: LayerNorm, RMSNorm and Pre-LN vs Post-LN — Cornerstone of Training Stability
Mathematical and systems anatomy of transformer training stability: the classical LayerNorm formula (Ba 2016), RMSNorm (Zhang 2019) — Llama-3's choice, why it keeps only a gain parameter, the computational savings. Pre-LN (modern) vs Post-LN (original Vaswani) trade-off, gradient flow, deep transformer stability. Normalization concerns in Turkish model fine-tuning.
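A minimal RMSNorm sketch matching the description: normalize by the root mean square and scale by a learnable gain, with no bias and no mean subtraction:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # gain only, no bias

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight                  # x / RMS(x) * gain

x = torch.randn(2, 5, 16)
print(RMSNorm(16)(x).shape)   # torch.Size([2, 5, 16])
```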
- 2
SwiGLU Activation: SiLU + GLU = Heart of Modern FFN — From Shazeer 2020 to Llama-3
Anatomy of the SwiGLU activation function: SiLU (Sigmoid-weighted Linear Unit) base + Gated Linear Unit mechanism. Shazeer 2020 'GLU Variants Improve Transformer'. ReLU/GeLU comparison, why it is the modern models' choice. FFN dimensions (d_ff ≈ 8/3 × d_model, Llama-3's choice), parameter math, Llama-3 implementation.
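A minimal SwiGLU FFN sketch in the Llama style: silu(x·W_gate) ⊙ (x·W_up) followed by W_down; d_ff ≈ 8/3 × d_model keeps the parameter count comparable to a classic 4× ReLU FFN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # silu(gate) acts as a soft, data-dependent gate on the up projection
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

d_model = 512
ffn = SwiGLUFFN(d_model, d_ff=int(8 / 3 * d_model))
print(ffn(torch.randn(2, 10, d_model)).shape)   # (2, 10, 512)
```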
- 3
Capstone Module 10: Llama-3 Transformer Block in 200 Lines from Scratch — RMSNorm + RoPE + GQA + SwiGLU
Module 10 capstone: implement Llama-3 architecture transformer block in 200 lines. RMSNorm + Pre-LN + GQA (Grouped-Query Attention) + RoPE + SwiGLU FFN + residual connections. Synthesis of Modules 6-10. Turkish example forward pass, gradient flow analysis, Llama-3 actual weights load test.
Module 11: Pre-training Dynamics + Optimizer Math
- 1
Pre-training Pipeline End-to-End: Corpus → Tokenize → Pack → Train — Llama-3 Production Recipe
All stages of the pre-training pipeline: corpus collection (Common Crawl, Wikipedia, code), data cleaning (deduplication, language filtering, quality scoring), tokenization batching, sequence packing strategy, document boundary handling. Llama-3 production recipe: 15T tokens, a 24K-GPU H100 cluster, ~70 days of training.
- 2
AdamW + Learning Rate Schedule: Mathematical Anatomy of Modern LLM Optimization
Modern LLM optimization: the evolution from SGD to Adam to AdamW. Loshchilov 2019 weight decay decoupling. Momentum (β1=0.9) + variance estimate (β2=0.95) intuition. Learning rate schedules: cosine decay, linear decay, why warmup is needed. Gradient clipping, mixed precision training, hyperparameter pitfalls.
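A minimal linear-warmup + cosine-decay schedule sketch; the peak/min learning rates and warmup length here are hypothetical placeholders, not a specific model's recipe:

```python
import math

def lr_at(step, max_steps, peak_lr=3e-4, warmup=2000, min_lr=3e-5):
    if step < warmup:
        return peak_lr * step / warmup                 # linear warmup 0 -> peak
    progress = (step - warmup) / max(1, max_steps - warmup)
    cos = 0.5 * (1 + math.cos(math.pi * progress))     # cosine factor 1 -> 0
    return min_lr + (peak_lr - min_lr) * cos           # decay peak -> min

for s in (0, 1000, 2000, 50_000, 100_000):
    print(s, f"{lr_at(s, max_steps=100_000):.2e}")
```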
- 3
Capstone Module 11: Mini Llama-3 100M Param Pre-training — Single H100, 1 Week
Module 11 capstone: pre-train your own Llama-3 architecture mini model (100M params) from scratch. All Module 6-10 components (Llama tokenizer + RMSNorm + GQA + RoPE + SwiGLU) + Module 11 pre-training pipeline + AdamW. 5GB Turkish corpus, single H100, 1 week. Validation loss tracking, checkpoint, sampling demo.
Module 12: Scaling Laws — Growth Laws of LLMs
- 1
Kaplan Scaling Laws (2020): Power Law Anatomy of LLM Performance — Compute, Data, Param Triangle
Anatomy of Kaplan et al. 2020 paper: LLM loss follows power law for compute (C), parameters (N), data (D). Why log-log plot is linear, optimum allocation formula, 'bigger is better' claim, GPT-3 (175B) was built on this. Limitations and subsequent Chinchilla refutation.
- 2
Chinchilla Scaling Laws (2022): Hoffmann et al. — 1:1 Param:Data Revolution
Hoffmann et al. 2022 'Training Compute-Optimal Large Language Models' — the paper that corrected Kaplan. Kaplan's bias toward undertrained models. Chinchilla recipe: scale N and D in equal proportion (≈20 tokens per parameter). The 70B Chinchilla outperforming the 280B Gopher. Llama-3 is Chinchilla-aware. The new compute-optimal formula, the post-Chinchilla overtraining trend.
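A minimal Chinchilla-style planning sketch using the common rules of thumb C ≈ 6·N·D training FLOPs and D ≈ 20·N tokens; both are approximations, not the paper's fitted constants:

```python
def chinchilla_plan(n_params):
    d_tokens = 20 * n_params          # compute-optimal data budget (~20 tok/param)
    flops = 6 * n_params * d_tokens   # total training FLOPs, C ~= 6*N*D
    return d_tokens, flops

for n in (1e9, 8e9, 70e9):
    d, c = chinchilla_plan(n)
    print(f"N={n/1e9:.0f}B -> D~{d/1e9:.0f}B tokens, C~{c:.2e} FLOPs")
```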
- 3
Capstone Module 12: Plan Your Own LLM Training Compute Budget — Chinchilla-Aware Calculator
Module 12 capstone: plan your own LLM training budget. Target model size (1B-70B), available compute (single GPU / cluster), available data — Chinchilla-aware optimal allocation calculation. Cost estimator ($/training), time estimator, quality projection.
Module 13: Distributed Training — Multi-GPU/Multi-Node
- 1
Data Parallelism (DDP): Foundation of Multi-GPU LLM Training — AllReduce and NCCL Anatomy
Distributed Data Parallel (DDP) anatomy: model replication across GPUs, mini-batch split, forward/backward independent per GPU, gradient AllReduce synchronization. NCCL (NVIDIA Collective Communication Library), ring-allreduce algorithm, bandwidth math. PyTorch DDP API, launch scripts, common pitfalls (uneven batches, batch norm sync).
- 2
FSDP + ZeRO: Sharded Training — Memory Revolution from Rajbhandari 2020 to Llama-3
ZeRO (Zero Redundancy Optimizer, Rajbhandari 2020) — DeepSpeed library: optimizer state, gradients, parameters sharding stages 1/2/3. FSDP (Fully Sharded Data Parallel, PyTorch native) — ZeRO-3 implementation. Llama-3 production: FSDP + activation checkpointing. Memory math: 8B model trainable on 1 H100.
- 3
3D Parallelism: Tensor + Pipeline + Data Parallel — Training Llama-3 70B and 405B
Frontier LLM training: Megatron-LM's 3D parallelism. Tensor Parallelism (Shoeybi 2019) — matrix splits across GPUs. Pipeline Parallelism (Huang 2018) — layer splits + bubble optimization. Combined 3D: DP × TP × PP. Llama-3 70B (DP=192, TP=8, PP=16). Communication patterns, optimization, capstone implementation outline.
Module 14: Fine-tuning — SFT, LoRA, QLoRA
- 1
Supervised Fine-Tuning (SFT): Transforming Pre-trained Base Model into Instruct — Llama-3-Instruct Anatomy
Supervised Fine-Tuning (SFT) anatomy: pre-trained base model → instruction-following model. Instruction dataset (Alpaca, OASST, Dolly), chat template application, loss masking (loss only on response), hyperparameter differences (lr 1/10 of pre-train), Llama-3-Instruct production recipe, practical Turkish fine-tune.
- 2
LoRA + QLoRA: Parameter-Efficient Fine-Tuning Revolution — From Hu 2021 to Dettmers 2023
LoRA (Hu 2021): low-rank decomposition fine-tuning — base weights frozen, train only a small adapter. ~1% of the parameters, 95%+ quality preservation. QLoRA (Dettmers 2023): 4-bit base + LoRA, fine-tune a 70B model on a consumer GPU. NF4 quantization, paged optimizer. Turkish practice: a production Turkish Llama-3 70B for ~$5K.
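A minimal LoRA layer sketch: the base weight is frozen and only the rank-r factors A and B train, so the effective weight is W + (α/r)·B·A:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)          # frozen base weights
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))    # zero init: no-op start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, "trainable params vs", layer.base.weight.numel(), "frozen")
```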
- 3
Capstone Module 14: Production Turkish Llama-3 8B Fine-Tune — QLoRA + SFT End-to-End
Module 14 capstone: Llama-3-8B base + Turkish SFT + QLoRA = production-quality Turkish Llama-3-Instruct. Dataset curation (50K Turkish instructions), QLoRA training (single H100 8 hours), evaluation (MT-Bench-TR), HuggingFace Hub publish, vLLM inference deployment.
Module 15: Preference Alignment — RLHF, PPO, DPO, GRPO
- 1
Birth of RLHF: A Five-Year Journey from Christiano 2017 to ChatGPT — Historical and Philosophical Anatomy of Human Preference Alignment
Historical and philosophical foundations of RLHF: a five-year transformation starting from Christiano et al. 2017 'Deep RL from Human Preferences', through Stiennon 2020 summarization work, Ouyang 2022 InstructGPT, to the November 2022 ChatGPT launch. Why SFT alone is insufficient, the tension of the 'helpful-harmless-honest' triangle, Goodhart's Law and the reward hacking problem. What alignment means in a Turkish cultural context — the sen/siz distinction, social sensitivities, KVKK boundaries. The most conceptually critical lesson of the curriculum.
- 2
Mathematics of the Reward Model: From Bradley-Terry 1952 to Modern LLM Reward Architecture
Mathematical anatomy of the reward model — the heart of RLHF: derivation of Bradley-Terry 1952 logistic preference model, probabilistic interpretation of sigmoid, derivative of ranking loss, RM architectural choices (separate from SFT vs shared trunk + value head), calibration and overconfidence problems, Plackett-Luce extension for multiple comparisons, practical pitfalls of RM training for Turkish.
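A minimal Bradley-Terry ranking-loss sketch with hypothetical reward-model scores: the RM is trained so that P(chosen ≻ rejected) = σ(r_chosen − r_rejected):

```python
import torch
import torch.nn.functional as F

r_chosen = torch.tensor([1.2, 0.3, 2.0])     # RM scores for preferred answers
r_rejected = torch.tensor([0.1, 0.8, -0.5])  # RM scores for rejected answers

# Bradley-Terry / ranking loss: -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss)   # lower when chosen consistently outscores rejected
```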
- 3
PPO Algorithm Line by Line: From Schulman 2017 to InstructGPT — Adapting RL to LLM
Adaptation of Proximal Policy Optimization (Schulman 2017) to LLM RLHF: policy gradient foundation, advantage estimation (GAE), clipped surrogate loss derivation and why 'clip', KL penalty mathematics, value function loss, entropy bonus. InstructGPT's full PPO setup, hyperparameter choices, training stability, debugging strategies.
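A minimal clipped-surrogate sketch: PPO caps the policy ratio at 1±ε so a single update cannot move the policy too far from the old one (toy tensors, no GAE or KL penalty):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)              # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()        # pessimistic bound

logp_old = torch.tensor([-1.0, -2.0, -0.5])             # made-up token log-probs
logp_new = torch.tensor([-0.8, -2.4, -0.4])
adv = torch.tensor([1.0, -0.5, 2.0])                    # made-up advantages
print(ppo_clip_loss(logp_new, logp_old, adv))
```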
- 4
DPO Revolution: Rafailov 2023's Mathematical Discovery — Compressing RLHF into a Single Loss Function
Direct Preference Optimization (Rafailov et al. 2023): full derivation of the mathematical discovery that compresses RLHF's 3 stages into a single supervised loss. Reward model's 'hidden reformulation', optimum solution of Bradley-Terry + KL constraint, why DPO says 'every LLM is already a reward model', mathematical meaning of closed-form solution. Numerical comparison with PPO, modern DPO variants (IPO, KTO, SimPO), Turkish DPO production pipeline.
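A minimal DPO loss sketch over hypothetical summed sequence log-probs: the implicit reward is β·log(π/π_ref), and the loss is Bradley-Terry on the reward difference:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    # Implicit rewards relative to the frozen reference policy.
    reward_c = beta * (logp_c - ref_logp_c)
    reward_r = beta * (logp_r - ref_logp_r)
    return -F.logsigmoid(reward_c - reward_r).mean()

# Hypothetical summed sequence log-probs for 3 preference pairs.
logp_c = torch.tensor([-12.0, -30.5, -8.1])    # policy, chosen
logp_r = torch.tensor([-14.2, -29.0, -9.9])    # policy, rejected
ref_c = torch.tensor([-13.0, -30.0, -8.5])     # reference, chosen
ref_r = torch.tensor([-13.5, -29.5, -9.0])     # reference, rejected
print(dpo_loss(logp_c, logp_r, ref_c, ref_r))
```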
- 5
GRPO and Reasoning RL: Inside DeepSeek-R1 — From Group-Based Advantage Estimation to Process Reward
GRPO (Group Relative Policy Optimization): DeepSeek's elegant simplification of PPO. Advantage estimation without a value function, group comparison, computational efficiency. Anatomy of the DeepSeek-R1 paper (Jan 2025), the RL-stage ordering of reasoning training, the 'aha moment' phenomenon, the role of process reward models, o1 vs R1 architecture comparison, practical notes for a Turkish reasoning model.
- 6
Capstone Module 15: Turkish DPO Model from Scratch to Production — Data, Training, Evaluation, Publishing
Module 15 capstone project: producing a production-grade model with Turkish DPO on Llama-3-8B-Instruct. How to collect 5K Turkish preference pairs (manual + synthetic), DPO training (QLoRA, single H100, $50), MT-Bench-TR evaluation, win-rate measurement, publishing on HuggingFace Hub with a model card. The curriculum's sixth production artifact.
Module 16: Production Engineering — Self-Host, Quantization, Serving, Monitoring
- 1
Self-Host Decision Framework: OpenAI API vs Your Own GPU — Cost, Privacy, Performance, Independence
The first critical decision in LLM production: API or self-host? This lesson builds a solid decision framework. Cost mathematics (per-token economics, fixed vs variable costs), privacy (KVKK, sectoral restrictions), performance (latency, throughput), independence (lock-in risk). Five different scenarios for Turkish SaaS: chatbot, RAG, content generation, legal, health. The right decision is different in each.
- 2
vLLM Production Engineering: From Paged Attention to SLAs — Anatomy of Modern LLM Serving
Mathematical and systems anatomy of vLLM: why paged attention (Kwon et al. 2023) makes KV-cache memory use roughly 5× more efficient, continuous batching math, the internal structure of the KV cache, the OpenAI-compatible API, a Turkish Llama-3 deployment from start to finish. Hardware selection (H100 vs A100 vs RTX 4090), Kubernetes setup, autoscaling, SLA guarantees.
- 3
Quantization in Depth: From INT4 to FP8 — Shrinking Your Model 4×, Speeding 2×
Mathematical and engineering anatomy of LLM quantization: INT8, INT4, FP8 formats, GPTQ (Frantar 2022) vs AWQ (Lin 2023) vs GGUF (Gerganov) algorithms, quality-size-speed trade-offs. Quantizing Llama-3-8B Turkish DPO model with 4-bit AWQ, measuring quality loss, running Llama-3-70B on RTX 4090, mobile device deployment.
- 4
Monitoring, Observability and Alerting: Watch Your Production LLM — From Metrics to Action
Monitoring and observability layer of production LLM serving: Prometheus metrics (vLLM native), Grafana dashboard design, OpenTelemetry tracing, log aggregation (Loki/Elastic), alerting rules (Slack/PagerDuty), error tracking with Sentry. Turkish-specific anomalies: hallucination detection, tokenizer errors, prompt injection alert. An LLM engineer's 'what to monitor' guide.
- 5
Capstone Module 16: Turkish ChatGPT Clone Live — Integration of Module 16
Module 16 capstone: synthesizing 4 lessons (decision, vLLM, quantization, monitoring) into a real product. Turkish DPO model from Module 15.6 → 4-bit AWQ quantize → vLLM serve → Next.js frontend + streaming → Vercel deploy → Sentry + Grafana monitoring → live at **chat.sukruyusufkaya.com**. Curriculum's 7th production artifact. Backend ($60/month cost), frontend (Vercel free tier), monitoring (Grafana Cloud free) full stack.
Module 17: Reasoning Models — Test-Time Compute Revolution
- 1
History of the Reasoning Revolution: From Wei 2022 Chain-of-Thought to o1 — The Birth of 'Thinking Models'
Historical and conceptual anatomy of reasoning models: the road from Wei et al. 2022 'Chain-of-Thought Prompting' to the September 12, 2024 OpenAI o1 launch. Self-consistency (Wang 2022), Tree of Thoughts (Yao 2023), Reflexion (Shinn 2023) — the rise and limits of prompting-based reasoning. Why there were no 'reasoning models' until 2024, why o1 was different, the emergence of test-time compute as a new scaling dimension. What it means for models solving Turkish math problems.
- 2
Test-Time Compute Scaling Mathematics: Snell 2024 Paper — The New Science of Spending Compute on 'Thinking'
The mathematics of the new scaling dimension: Snell et al. 2024 'Scaling LLM Test-Time Compute Optimally'. Multi-sample (best-of-N, self-consistency) vs deep thinking (long reasoning chain) trade-offs. Optimal compute allocation: how is the same budget best distributed? The paradox against pre-training compute: 20% less pre-training + 50% more test-time compute can yield the same quality. Planning a 'thinking budget' for Turkish.
- 3
o1 Architecture Speculative Analysis: Behind Closed Doors — Public Observations + Reverse Engineering
A speculative analysis of the o1 architecture (not disclosed by OpenAI), combining public observations, academic papers, and community reverse engineering. A combination of PRM (Process Reward Model) + MCTS (Monte Carlo Tree Search) + RL? Hints from the pricing model. The AI-safety and commercial rationale for hidden reasoning tokens. Reflections from the R1 paper: what did the open alternative teach us?
- 4
DeepSeek-R1 GRPO in Depth: Mathematics of Open Reasoning RL — Group Relative Policy Optimization
GRPO (Group Relative Policy Optimization), the main training algorithm of DeepSeek-R1 (January 2025). Line-by-line derivation of its differences from PPO. Value-function-free advantage estimation (group comparison). Detailed walk-through of the 4-stage training (R1-Zero → Cold Start → Reasoning RL → Distill). The empirical 'aha moment' phenomenon, with the examples and statistical analysis given in the paper. Turkish R1 fine-tune strategies.
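A minimal sketch of GRPO's value-function-free advantage: sample a group of completions per prompt and standardize each reward against its own group (toy 0/1 correctness rewards assumed):

```python
import torch

def group_relative_advantage(rewards):          # rewards: (groups, group_size)
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)      # z-score inside each group

# 2 prompts, 4 sampled completions each (1 = correct answer, 0 = wrong).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantage(rewards))
```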
- 5
Capstone Module 17: Turkish Reasoning Model to Production — R1-Distill-32B Turkish Math Fine-Tune
Module 17 capstone: Turkish math DPO fine-tune on R1-Distill-Qwen-32B. Creating 5K Turkish reasoning chain dataset from YKS/TYT/TÜBİTAK math problems, DPO training (1 H100, 1 week, $200-500), evaluation (AIME-TR, YKS math), HuggingFace Hub publish. Curriculum's 8th production artifact: sukruyusufkaya/r1-distill-tr-math-32b.
Module 18: Mixture of Experts (MoE) — Sparse Activation Revolution
- 1
MoE History: From Jacobs 1991 to DeepSeek-V3 2024 — 33-Year Sparse Activation Revolution
The 33-year intellectual journey of Mixture of Experts: the Jacobs et al. 1991 original paper ('Adaptive Mixtures of Local Experts'), Shazeer et al. 2017 'Outrageously Large Neural Networks' — the beginning of modern MoE, GShard 2020 at Google scale, Switch Transformer 2021, Mixtral 8x7B (January 2024) open-source revolution, DeepSeek-V3 (December 2024) with 671B total / 37B active parameters. Why did MoE sit on the sidelines for 33 years, and why is it back now?
- 2
MoE Mathematical Anatomy: Gating Network, Top-k Routing, Load Balancing — Sparse Activation from Scratch
Internal mathematics of MoE: derivation of gating network, top-k routing implementation, expert collapse problem and load balancing loss (Shazeer 2017), auxiliary loss math, capacity factor, drop tokens, FLOP analysis. PyTorch MoE FFN layer implementation from scratch. Expert utilization observations on Turkish data.
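A minimal top-k gating sketch of the MoE FFN described above (a dense expert loop for clarity; real implementations use scatter/gather kernels and add the load-balancing auxiliary loss):

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model, n_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model)
                                      for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                         # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)     # routing probabilities
        weight, idx = gate.topk(self.k, dim=-1)   # sparse: only k experts run
        weight = weight / weight.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):                # gate-weighted expert mix
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weight[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(6, 32)
print(TopKMoE(32)(x).shape)   # torch.Size([6, 32])
```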
- 3
DeepSeek-V3 Innovations: MLA, Auxiliary-Loss-Free, Multi-Token Prediction — 3 Keys to 2024 Frontier
3 critical innovations of DeepSeek-V3 in depth: (1) Multi-head Latent Attention (MLA) — an attention variant reducing the KV cache by 93%, (2) Auxiliary-loss-free load balancing — clean gating via a bias-update trick, (3) Multi-token prediction (MTP) — parallel prediction of 2-3 tokens during training. The mathematical anatomy of each, why they work, how they contributed to V3's $5.6M training cost. Practical use for Turkish.
- 4
Capstone Module 18: Turkish Mixtral DPO — Adapting an Open MoE to Turkish
Module 18 capstone: Turkish DPO fine-tune on Mixtral-8x7B-Instruct. 5K Turkish preference pairs + QLoRA-DPO + 2× H100 (FSDP) + vLLM deployment. Expert utilization optimized for Turkish. Cost: $200-500. The curriculum's 9th production artifact: sukruyusufkaya/mixtral-8x7b-tr-dpo.
Module 19: Multimodal LLMs — Vision + Audio + Video
- 1
Multimodal LLM History: From Radford 2021 CLIP to GPT-4o — Birth of 'Seeing' Language Models
Historical and conceptual anatomy of multimodal LLMs: Radford et al. 2021 CLIP paper — birth of image-text alignment via contrastive learning, ViT (Dosovitskiy 2020) image transformer, BLIP (Li 2022), Flamingo (Alayrac 2022), LLaVA (Liu 2023) open-source breakthrough, GPT-4V (Sept 2023), GPT-4o (May 2024) unified omni-modal, Llama-3.2 Vision (Sept 2024) open-source. 5-year 'language + image' fusion journey and what multimodal means for Turkish (Turkish document OCR, cultural visual understanding).
- 2
Multimodal Architecture Mathematics: Vision Encoder → Projection → LLM — 3 Connection Strategies
Internal architectural mathematics of multimodal LLMs: 3 strategies for Vision encoder (ViT/CLIP/SigLIP) → projection → LLM binding. (1) Linear projection (LLaVA style, simple), (2) Q-Former (BLIP-2 style, learnable queries), (3) Cross-attention (Flamingo/Llama-3.2 style, deep integration). Image token budget management, resolution problem, vision-text alignment. LLaVA-style multimodal architecture in PyTorch from scratch. Image-text alignment for Turkish.
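A minimal sketch of connection strategy (1), LLaVA-style linear projection, with hypothetical dimensions: frozen ViT patch features are mapped into the LLM width and prepended to the text embeddings:

```python
import torch
import torch.nn as nn

d_vision, d_model = 1024, 4096   # e.g., CLIP ViT feature dim -> LLM width
n_patches, n_text = 576, 32      # hypothetical image patch / text token counts

projector = nn.Linear(d_vision, d_model)           # the entire "bridge"
image_feats = torch.randn(1, n_patches, d_vision)  # from a frozen vision encoder
text_embeds = torch.randn(1, n_text, d_model)      # from the LLM embedding layer

image_tokens = projector(image_feats)              # (1, 576, 4096)
llm_input = torch.cat([image_tokens, text_embeds], dim=1)
print(llm_input.shape)   # torch.Size([1, 608, 4096]) fed into the LLM as-is
```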
- 3
Turkish Multimodal Practice: From ID OCR to Traffic Signs — 5 Production Use Cases
Production use cases of Turkish multimodal LLMs: (1) ID card + license OCR + field extraction (banking, telco), (2) E-invoice + receipt processing (accounting), (3) Turkish traffic sign recognition (automotive), (4) Turkish exam paper digitization (education), (5) Ottoman document analysis (academic). For each use case: GPT-4o vs Llama-3.2-Vision comparison, KVKK-compliant pipeline, Python production code. Multimodal prompting best practices for Turkish.
- 4
Capstone Module 19: Turkish Multimodal Document Processing System — Production SaaS
Module 19 capstone: Turkish multimodal document processing production SaaS. Next.js drag-drop frontend + FastAPI backend + selectable Llama-3.2-Vision or GPT-4o + KVKK-compliant encrypted storage + Stripe payment. ID OCR, e-invoice, exam paper, free tier + premium. Curriculum's 10th production artifact: docproc.sukruyusufkaya.com.
Module 20: AI Agents — Tool Use, Function Calling, MCP, Multi-Agent
- 1
Tool Use History: From Yao 2022 ReAct to Anthropic MCP — 3-Year Birth of LLM Agents
Historical and conceptual anatomy of LLM agents: the Yao et al. 2022 ReAct paper ('Reasoning + Action' fusion), OpenAI function calling (June 2023, first standardization), Anthropic MCP (November 2024, open standard). The rise of the LangChain, AutoGen, and CrewAI frameworks. Why aren't LLMs sufficient on their own, and why do they need tools? The practical face of the AGI debate. Turkish agent use cases.
- 2
Tool Use Mathematics and Implementation: From JSON Schema to Pydantic AI — Production Agent Engineering
Internal mathematics of tool use and production implementation: JSON schema standard detail, full anatomy of OpenAI function calling, ReAct prompt engineering techniques, MCP protocol implementation (Python stdio + SSE). Turkish tool calling examples (TC ID validation, e-invoice query). Clean, type-safe agent with Pydantic AI. Modern approach as LangChain alternative. Error handling, retry logic, tool timeout management.
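A minimal function-calling sketch: a JSON-schema tool definition in the OpenAI style plus a local dispatcher. The TC ID validator below implements only one of the official checksum rules and is purely illustrative; no external API is called:

```python
import json

tool_schema = {
    "name": "validate_tc_id",
    "description": "Validate a Turkish national ID number (TC kimlik no).",
    "parameters": {
        "type": "object",
        "properties": {"tc_id": {"type": "string", "description": "11 digits"}},
        "required": ["tc_id"],
    },
}

def validate_tc_id(tc_id: str) -> bool:
    # Simplified check: 11 digits, no leading zero, and the 11th digit must
    # equal the sum of the first 10 digits mod 10 (one of the real rules).
    if len(tc_id) != 11 or not tc_id.isdigit() or tc_id[0] == "0":
        return False
    return sum(int(d) for d in tc_id[:10]) % 10 == int(tc_id[10])

# Simulate the model emitting a tool call as JSON arguments, then dispatch.
model_call = {"name": "validate_tc_id",
              "arguments": json.dumps({"tc_id": "12345678950"})}
args = json.loads(model_call["arguments"])
print(validate_tc_id(**args))
```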
- 3
Capstone Module 20: Turkish E-Commerce Multi-Agent System — Production Agent with CrewAI
Module 20 capstone: a Turkish e-commerce multi-agent system. 3 agents: (1) Research Agent — product search on Trendyol/Hepsiburada, (2) Price Compare Agent — price and shipping comparison, (3) Recommendation Agent — personalized recommendations. CrewAI framework, Pydantic AI tools, FastAPI backend, Next.js frontend, Stripe API. Turkish natural conversation → automated shopping research. KVKK compliant. The curriculum's 11th production artifact.
Module 21: LLM Evaluation — Benchmarks and Production Eval
- 1
Benchmark Anatomy: From MMLU to LMSys Arena — Science and Art of Measuring LLM Quality
Mathematical and epistemic anatomy of LLM benchmarks: MMLU (Hendrycks 2020 — 57 tasks), HumanEval (Chen 2021 — code), MT-Bench (Zheng 2023 — chat), LMSys Chatbot Arena (community ELO ranking), GPQA (Rein 2023 — graduate-level reasoning). Why isn't a single benchmark enough? For Turkish: TR-MMLU, MUKAYESE, Boğaziçi NLP. A serious analysis of the **benchmark contamination** problem. The holistic evaluation approach.
- 2
Production Evaluation Framework: From Test Set Design to LLM-as-Judge — Build Your Turkish Eval System
Building production-grade LLM evaluation framework: test set design (sampling strategy, edge cases, adversarial), automated eval pipeline (pytest-like setup), LLM-as-a-judge strategies (GPT-4o vs Claude vs ensemble, bias detection), error analysis (clustering, root cause), A/B testing protocols (statistical significance, sample size). Objective comparison of 7 production artifacts from Modules 15-20. Clean evaluation code with Python + Pydantic.
- 3
Capstone Module 21: TR-LLMArena — Turkish LMSys-Style Community Leaderboard
Module 21 capstone: Turkish LMSys-style community-driven leaderboard. Double-blind A/B vote system, ELO ranking, monthly leaderboard. HuggingFace Spaces deploy, GPT-4o/Claude/Llama-3 vs Turkish models (Modules 14-20 capstones). Concrete scientific contribution to Turkish AI ecosystem. Curriculum's 12th production artifact.
Module 22: AI Safety and Regulation — Jailbreak, KVKK, EU AI Act
- 1
Jailbreak and Red-Teaming: From 'DAN' to Constitutional AI — Art of LLM Attack and Defense
Attack + defense side of LLM security: prompt injection, jailbreak techniques (DAN, roleplay, encoding attacks), token smuggling, indirect injection (leakage from RAG). Bai et al. 2022 Constitutional AI approach — Anthropic's defense strategy. Red-teaming protocols (OpenAI, Anthropic best practices). Turkish-specific jailbreak examples (Islamic sensitivity bypass, KVKK bypass attempts). Production-grade defense layers: input filter + output filter + monitoring.
- 2
KVKK + EU AI Act Regulation: Turkish LLM Engineer's Legal Guide — Building Compliance Pipeline
A regulation guide for the Turkish LLM engineer: all relevant articles of KVKK (Law 6698), **EU AI Act** (June 2024) risk categories (prohibited, high-risk, limited, minimal), the dual-compliance dilemma of a Turkish company serving the EU (both KVKK and AI Act). Production compliance pipeline: VERBİS registration, data inventory, GDPR-compliant logging, KVKK board audits, AI Act high-risk documentation. Real cases and fines (KVKK fines exceeding $50K).
- 3
Capstone Module 22: Turkish LLM Compliance Stack — Curriculum's Closing Ribbon
Module 22 capstone: making the curriculum's 12 production artifacts (Modules 6-21) KVKK + EU AI Act compliant. Audit log infrastructure + encryption + deletion endpoint + breach response plan + EU representative + AI Act risk assessment documentation. The curriculum's **13th and final production artifact**, and its **official closing**: the end of a 200+ hour expert-level journey from scratch to AI engineering.