About this training
A 3-day advanced Turkish-language training that covers the 2022-2026 mechanistic-interpretability research of Anthropic, OpenAI, DeepMind, and Goodfire AI end to end: the superposition hypothesis, Sparse Autoencoders (Vanilla, Top-K, Gated, JumpReLU), Anthropic Scaling Monosemanticity, Crosscoders, refusal direction, persona vectors, circuit analysis, activation patching, and production AI-safety applications. The hands-on stack is TransformerLens, SAELens, nnsight, Gemma Scope, Goodfire AI, and Neuronpedia.
This training is designed for:
- AI Researchers who want to do Anthropic / OpenAI / DeepMind-style mech-interp research
- AI Safety Engineers who want to build production AI-safety pipelines by understanding LLM internals
- Senior AI Engineers developing products that require jailbreak prevention, hallucination detection, and adversarial robustness
- Compliance + risk managers who must perform alignment audits in enterprise LLM usage
- Red Team engineers and adversarial AI-security experts
- Startup technical leaders who want to build the interpretability infrastructure for their own open-source LLM (Turkish or domain-specific)
Why this course matters:
- The first advanced program in Turkey that addresses mechanistic-interpretability + Sparse Autoencoder discipline at production grade.
- Covers 2024-2026 frontier research, including Anthropic Scaling Monosemanticity, Crosscoders, OpenAI Top-K SAE, DeepMind JumpReLU, and Gemma Scope.
- Teaches safety-critical activation-steering techniques like refusal direction (Arditi 2024) and persona vectors.
- Covers the TransformerLens, SAELens, nnsight, Goodfire, Neuronpedia stack end to end and hands-on.
- Ties mech interp to production AI-safety problems like jailbreak prevention, hallucination detection, and alignment audit.
- Instills the discipline of producing interpretability reports for EU AI Act Article 13 and KVKK compliance.
- Through the capstone project, equips the participant with a custom feature catalog + steering pipeline applicable in their own domain.
- Offers Anthropic / DeepMind / Goodfire-level coverage for teams wishing to contribute to AI-safety research.
Learning outcomes by the end of the programme:
- Dissect the theoretical foundations of mech interp (superposition, polysemanticity, linear representation).
- Make evidence-based choices among the Vanilla, Top-K, Gated, and JumpReLU SAE variants.
- Train production-grade SAEs with TransformerLens + SAELens.
- Extract millions of features by applying the Anthropic Scaling Monosemanticity methodology.
- Automatically label features with an auto-interpretation pipeline (GPT-5 / Claude Opus 4.7).
- Perform circuit identification with activation patching + ACDC.
- Establish inference-time behavior control with refusal direction + persona vectors.
- Apply mech interp to jailbreak prevention, hallucination detection, and alignment audit.
- Skillfully use the Gemma Scope, Goodfire AI, and Neuronpedia public banks.
- Produce EU AI Act + KVKK-compliant interpretability reports.
Prerequisites and recommended background:
- Active Python experience (intermediate to advanced), basic use of PyTorch and HuggingFace Transformers
- Foundations in linear algebra (matrix operations, eigenvalue decomposition), probability, and gradient descent
- Basic knowledge of the transformer architecture (attention, residual stream, layer norm)
- A habit of reading ML/DL research papers (following Anthropic / DeepMind / OpenAI papers is recommended)
- GPU access before the training (RunPod, Lambda Labs, Modal); H100 (80GB) or 2x A100 recommended
- Hugging Face + Weights & Biases + Neuronpedia accounts before the training
- The only production-grade advanced program in Turkey that addresses Anthropic, OpenAI, DeepMind, and Goodfire AI's 2022-2026 mech-interp research
- Full mathematical construction of the Sparse Autoencoder architecture family: comparison of Vanilla, Top-K (OpenAI), Gated + JumpReLU (DeepMind), BatchTopK
- Hands-on analysis of the Anthropic Scaling Monosemanticity and Crosscoders methodology
- End-to-end learning of the TransformerLens + SAELens + nnsight + Gemma Scope + Goodfire + Neuronpedia open-source stack
- Circuit-analysis engineering with activation patching, ACDC, and attribution patching
- Inference-time behavior control with refusal direction (Arditi 2024), persona vectors, ITI, CAA
- Production AI-safety applications: jailbreak prevention, hallucination detection, deception audit
- The discipline of producing EU AI Act Article 13 and KVKK-compliant interpretability reports
Sparse Autoencoders and Mechanistic Interpretability Engineering Training (Anthropic Approach)
About This Course
This training is designed to be the first in Turkey to cover mechanistic interpretability (mech interp) end to end: the discipline that reverse-engineers neural networks and dissects the internal computational flow of LLMs at the mathematical level. The field began with Chris Olah's 2020 Distill 'Circuits Thread', gained its theoretical framework with Anthropic's 2022 Toy Models of Superposition paper, and reached production LLMs through Sparse Autoencoders (SAEs) with Cunningham et al. 2023, Bricken et al. 2023 (Towards Monosemanticity), and Templeton et al. 2024 (Scaling Monosemanticity). Throughout 2024-2026 it became one of the AI ecosystem's central research areas, with developments like Anthropic Scaling Monosemanticity (millions of interpretable features on Claude 3 Sonnet), Crosscoders, refusal direction (Arditi 2024), and persona vectors. Even so, the discipline has barely been addressed in Turkey, even at the academic level. This program is designed to close that gap.
The program's theoretical backbone consists of mech interp's three foundational concepts: feature (the model's 'unit of thought'), circuit (the computational flow between features), and superposition (the phenomenon of a model representing more features than it has neurons, so that a single neuron takes part in encoding multiple features). The mathematical formulation of Elhage 2022's Toy Models of Superposition is derived step by step, including why N features can be encoded in n neurons (N > n) via the Johnson-Lindenstrauss bound on almost-orthogonal vectors. The polysemantic vs monosemantic neuron distinction, why the 'one neuron = one feature' assumption is wrong, and Park 2023's linear-representation hypothesis (LLM features are encoded as linear directions in activation space) are covered in detail. Without this foundation, it is hard to see why the SAE is critical.
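The almost-orthogonality argument can be checked numerically. The following is a minimal sketch, not the derivation from the paper: it draws more random unit directions than there are dimensions and measures the worst pairwise overlap. The sizes (128 neurons, 256 features), the seed, and the function names are illustrative assumptions.

```python
import math
import random

def random_unit_vectors(count, dim, seed=0):
    """Draw `count` random directions on the unit sphere in `dim` dimensions."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(count):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        vectors.append([c / norm for c in v])
    return vectors

def max_abs_cosine(vectors):
    """Largest |cosine similarity| over all distinct pairs (0 means exactly orthogonal)."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst

# Assumed toy sizes: twice as many feature directions as neurons.
n_neurons, n_features = 128, 256
features = random_unit_vectors(n_features, n_neurons)
interference = max_abs_cosine(features)
print(f"{n_features} directions in {n_neurons} dims, worst overlap = {interference:.3f}")
```

Only n directions can be exactly orthogonal in n dimensions, but many more can be nearly orthogonal. The residual overlap is the interference that superposed features have to tolerate, which is why sparsity of feature activations matters.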
The third module builds, at the mathematical level, how the Sparse Autoencoder solves the superposition problem. The works of Cunningham et al. 2023 (the first SAE experiments on Pythia) and Anthropic's Bricken et al. 2023 (Towards Monosemanticity: an SAE on a 1-layer transformer, with interpretable features like 'Arabic text', 'DNA sequences', 'base64') are analyzed in detail. The encoder f = ReLU(W_e · x + b_e), decoder x̂ = W_d · f + b_d, and loss L = ||x - x̂||² + λ · ||f||_1 formulations are constructed step by step. The discipline of an overcomplete basis with dictionary size (M) >> input dim (d), L0 sparsity measurement, the dead-features problem, and the resampling strategy are covered hands-on. The interpretation of decoder weights as feature directions and the connection to sparse-coding theory are clarified.
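The encoder, decoder, and loss formulas above can be traced on a toy example. This is a minimal pure-Python sketch of a single forward pass, assuming W_e has shape M x d and W_d has shape d x M; the weights, the input, and λ = 0.1 are made-up numbers, and no training step is shown.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sae_forward(x, W_e, b_e, W_d, b_d, lam):
    """One vanilla SAE forward pass: returns latents, reconstruction, loss, and L0."""
    pre = [p + b for p, b in zip(matvec(W_e, x), b_e)]
    f = [max(0.0, p) for p in pre]                         # f = ReLU(W_e x + b_e)
    x_hat = [r + b for r, b in zip(matvec(W_d, f), b_d)]   # x_hat = W_d f + b_d
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))    # ||x - x_hat||^2
    l1 = sum(abs(v) for v in f)                            # ||f||_1
    l0 = sum(1 for v in f if v > 0.0)                      # number of active latents
    return {"f": f, "x_hat": x_hat, "loss": recon + lam * l1, "l0": l0}

# Hypothetical toy weights: d = 2 inputs, M = 3 latents (overcomplete).
x = [1.0, -2.0]
W_e = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_e = [0.0, 0.0, 0.0]
W_d = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]
b_d = [0.0, 0.0]

result = sae_forward(x, W_e, b_e, W_d, b_d, lam=0.1)
print(result)
```

In this example two of the three latents fire, so L0 = 2. The reconstruction error is 2.5 and the L1 penalty is 0.1 · 2 = 0.2, for a total loss of 2.7.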
The fourth module compares the modern SAE variants that overcome vanilla SAE's limitations: OpenAI Top-K SAE (Gao et al. 2024: explicit K-active selection, a hard sparsity constraint instead of an L1 penalty, dead-feature recovery via the AuxK auxiliary loss); DeepMind Gated SAE (Rajamanoharan 2024: gate vs magnitude separation); DeepMind JumpReLU SAE (2024: step-function activation + straight-through estimator training); BatchTopK (Bussmann et al. 2024); and TopK + L1 hybrid approaches. The reconstruction-sparsity Pareto frontier of each is concretely compared on Gemma 2; evidence-based recommendations are given for JumpReLU or Gated in the small-model (7B) + production scenario, and Top-K for the large-model (70B+) + research scenario.
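The difference between these variants is easiest to see in the activation function alone. The sketch below compares ReLU, Top-K, and JumpReLU on the same pre-activations, forward pass only: the straight-through estimator used to train JumpReLU thresholds and the AuxK loss are not shown. The final ReLU inside `top_k_activation` is an assumption of this sketch, and all numbers are illustrative.

```python
def relu(pre):
    """Vanilla SAE activation: sparsity comes only from the L1 penalty in the loss."""
    return [max(0.0, p) for p in pre]

def top_k_activation(pre, k):
    """Keep the k largest pre-activations and zero the rest (hard constraint: L0 <= k)."""
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(0.0, p) if i in keep else 0.0 for i, p in enumerate(pre)]

def jump_relu(pre, theta):
    """Pass each pre-activation through unchanged only if it exceeds its own threshold."""
    return [p if p > t else 0.0 for p, t in zip(pre, theta)]

# Hypothetical pre-activations for five latents.
pre = [0.2, 1.5, -0.3, 0.9, 0.4]

dense = relu(pre)                                        # small activations survive
hard_sparse = top_k_activation(pre, k=2)                 # exactly the two largest survive
thresholded = jump_relu(pre, [0.1, 2.0, 0.5, 0.5, 0.5])  # per-latent thresholds decide
print(dense, hard_sparse, thresholded)
```

Top-K fixes the number of active latents per input, while JumpReLU lets that number vary because each latent has its own learned threshold.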
The fifth module sets up the end-to-end SAE training pipeline in practice with the TransformerLens + SAELens stack. Every step is hands-on: TransformerLens HookedTransformer and hook points, SAELens config (model_name, hook_name, dataset_path, batch sizes), choice of residual stream vs MLP output vs attention output, GPU memory management with the activation buffer, tokenizer + dataset preparation (Pile-uncopyrighted, FineWeb, OpenWebText), activation normalization (unit norm vs scale invariance), hyperparameter sweep (L0, L1, learning rate, K, dictionary size), dead-feature tracking + auxiliary-loss recovery, and W&B + Neuronpedia training-run logging. By the end of the training, participants can train production-quality SAEs on an LLM of their choice (Gemma 2 9B, Llama 3.1 8B, Qwen3).
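Dead-feature tracking, one of the steps listed above, amounts to simple bookkeeping. The following sketch is library-independent and does not reflect the SAELens implementation; the `DeadFeatureTracker` class, its `window` parameter, and the toy batches are hypothetical names and values chosen for illustration.

```python
class DeadFeatureTracker:
    """Counts training steps since each SAE latent last fired on any example."""

    def __init__(self, n_features, window):
        self.window = window                      # steps of silence before "dead"
        self.steps_since_fired = [0] * n_features

    def update(self, batch_latents):
        """Record one training step; batch_latents is a list of latent vectors."""
        fired = [False] * len(self.steps_since_fired)
        for latents in batch_latents:
            for i, value in enumerate(latents):
                if value > 0.0:
                    fired[i] = True
        for i, did_fire in enumerate(fired):
            self.steps_since_fired[i] = 0 if did_fire else self.steps_since_fired[i] + 1

    def dead_features(self):
        """Indices of latents that have been silent for at least `window` steps."""
        return [i for i, s in enumerate(self.steps_since_fired) if s >= self.window]

# Hypothetical run: 3 latents, a latent counts as dead after 2 silent steps.
tracker = DeadFeatureTracker(n_features=3, window=2)
tracker.update([[0.5, 0.0, 0.0], [0.1, 0.0, 0.3]])   # latents 0 and 2 fire
tracker.update([[0.0, 0.0, 1.0]])                    # only latent 2 fires
dead = tracker.dead_features()
print(dead)
```

The indices returned by `dead_features()` are the candidates for resampling or for an auxiliary loss; real training runs use windows of many thousands of steps rather than two.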
The sixth module analyzes in detail the training and findings of 1M, 4M, and 34M feature SAEs on Claude 3 Sonnet in Anthropic's 2024 Scaling Monosemanticity paper. Safety-relevant features — deception, manipulation, weapons, code vulnerability, bias, sycophancy — are shown with concrete examples; multilingual + multimodal features (shared Turkish-English grammatical features) are exemplified. The cross-layer SAE (encoding multiple layers with a single SAE) and cross-model SAE (Claude vs GPT vs Gemini feature comparison) approaches introduced in the 2024-2025 Crosscoders papers are covered; the universal-features hypothesis (shared feature encoding across different models) is tested. The demo of transforming Claude into the Golden Gate Claude persona by amplifying the 'Golden Gate Bridge' feature via feature steering is performed; in production, feature steering is practically set up with the Goodfire AI Ember API.
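Feature steering of the Golden Gate kind can be expressed as a single vector update on the activation. The sketch below assumes a decoder W_d of shape d x M whose columns are feature directions, and shifts the activation as if one latent were clamped to a chosen value. The function name and all numbers are illustrative, and this is not the Goodfire Ember API.

```python
def clamp_feature(x, f, W_d, index, target):
    """Shift activation x as if SAE latent `index` were clamped to `target`.

    Only the difference (target - f[index]) along the decoder direction is added,
    so the part of x that the SAE does not reconstruct is left untouched.
    """
    delta = target - f[index]
    direction = [row[index] for row in W_d]     # decoder column = feature direction
    return [xi + delta * di for xi, di in zip(x, direction)]

# Hypothetical toy values: d = 2, M = 3, latent 1 is currently inactive.
x = [1.0, -2.0]
f = [1.0, 0.0, 1.0]
W_d = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]

steered = clamp_feature(x, f, W_d, index=1, target=4.0)
print(steered)
```

In a real model the same update is applied at a hook point on every token position during generation, and the clamp value is tuned until the behavior changes without degrading fluency.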
The seventh module is dedicated to the discipline of systematically discovering the meaning of millions of features after an SAE is trained. Feature labeling with top-activating examples (extracting tokens that yield max activation), the Bills et al. 2023 OpenAI auto-interpretation methodology (using GPT-5 / Claude Opus 4.7 / Gemini 2.5 Pro as feature labelers), auto-interp accuracy via simulation-based evaluation, and the specificity and sensitivity metrics are covered in detail. At the platform level, Neuronpedia (browsing 1,000+ public SAEs — GPT-2 → Gemma 2 → Claude), Goodfire AI (interactive feature exploration + steering API), and Gemma Scope (DeepMind 2024 — 400+ public SAEs on Gemma 2) are introduced. Through these platforms, the discipline of running a feature-family scan for your own domain (Turkish NLP, legal, healthcare, finance) is established.
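The two basic operations of feature labeling, collecting top-activating tokens and scoring an explanation by simulation, can be sketched without any model. The example assumes the simulated activations have already been produced by a labeler LLM from its explanation; here they are hard-coded, hypothetical numbers, and the score is a plain Pearson correlation.

```python
import math

def top_activating(tokens, activations, k):
    """Return the k (token, activation) pairs with the highest activation."""
    ranked = sorted(zip(tokens, activations), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def correlation_score(true_acts, simulated_acts):
    """Pearson correlation between real and simulated activations (1.0 = perfect)."""
    n = len(true_acts)
    mean_t = sum(true_acts) / n
    mean_s = sum(simulated_acts) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(true_acts, simulated_acts))
    var_t = sum((t - mean_t) ** 2 for t in true_acts)
    var_s = sum((s - mean_s) ** 2 for s in simulated_acts)
    if var_t == 0.0 or var_s == 0.0:
        return 0.0                      # a constant signal carries no information
    return cov / math.sqrt(var_t * var_s)

# Hypothetical feature activations on one short text.
tokens = ["the", "Golden", "Gate", "Bridge", "is"]
true_acts = [0.0, 2.0, 3.0, 4.0, 0.0]
simulated = [0.0, 1.0, 1.5, 2.0, 0.0]   # what a labeler predicted from its explanation

examples = top_activating(tokens, true_acts, k=2)
score = correlation_score(true_acts, simulated)
print(examples, score)
```

The simulated activations are half the size of the real ones yet the score is still 1.0, because correlation ignores scale. An explanation is rewarded for predicting where a feature fires, not how strongly.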
The eighth module is dedicated to circuit-analysis engineering that uses the features extracted from SAEs. Activation patching (causal intervention via clean vs corrupt run comparison), reproduction of Wang 2022's IOI (Indirect Object Identification) circuit, Olsson 2022's induction-heads finding (the 2-step circuit of in-context learning: previous-token head + induction head), Conmy 2023's ACDC (automatic circuit discovery), edge attribution patching, and EAP-IG (compute-efficient attribution via integrated gradients) are covered in detail. Sparse interpretations of large circuits are produced with path patching and direct logit attribution.
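Activation patching follows the same three-run pattern regardless of model size: a clean run, a corrupt run, and a corrupt run with one activation overwritten from the clean run. The sketch below uses a hypothetical two-component toy function in place of a transformer; the names `h_a` and `h_b`, the weights, and the inputs are all assumptions made for illustration.

```python
def toy_model(x, patch=None):
    """Tiny stand-in for a network; `patch` maps an activation name to an override value."""
    patch = patch or {}
    acts = {}
    acts["h_a"] = patch.get("h_a", x[0] + x[1])
    acts["h_b"] = patch.get("h_b", x[0] * x[1])
    acts["out"] = 2.0 * acts["h_a"] + 0.5 * acts["h_b"]
    return acts

def patching_effect(clean_x, corrupt_x, name):
    """Fraction of the clean-vs-corrupt output gap recovered by patching one activation."""
    clean = toy_model(clean_x)
    corrupt = toy_model(corrupt_x)
    patched = toy_model(corrupt_x, patch={name: clean[name]})
    return (patched["out"] - corrupt["out"]) / (clean["out"] - corrupt["out"])

clean_x, corrupt_x = [1.0, 2.0], [0.0, 0.0]
effect_a = patching_effect(clean_x, corrupt_x, "h_a")
effect_b = patching_effect(clean_x, corrupt_x, "h_b")
print(f"h_a recovers {effect_a:.3f}, h_b recovers {effect_b:.3f}")
```

Here `h_a` recovers 6/7 of the output gap and `h_b` only 1/7, so `h_a` is the component that matters for this input pair. With TransformerLens the same pattern is typically implemented with hooks on a HookedTransformer and a cached clean run.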
The ninth module covers the discipline of controlling model behavior at inference time by simply adding vectors to activations, without fine-tuning. The Arditi et al. 2024 finding in 'Refusal in Language Models Is Mediated by a Single Direction', that refusal is governed by a single activation direction, is constructed step by step. Direction extraction with harmful vs harmless prompt pairs, and refusal ablation with the 'jailbreak by orthogonalization' technique, are applied. Anthropic persona vectors (helpful, harmless, honest directions), ITI (Li 2023: truthfulness improvement via head selection), CAA (Rimsky 2023: contrastive activation addition), and the production steering API (Goodfire AI + nnsight) are covered in detail. This discipline is a critical production tool for both AI safety (jailbreak prevention) and red teaming (detecting model weaknesses).
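Direction extraction and ablation reduce to a few vector operations. The following is a minimal sketch of the difference-of-means recipe on made-up 2-dimensional activations: the direction is the normalized gap between mean harmful and mean harmless activations, ablation projects it out, and steering adds a multiple of it. The data and function names are hypothetical; real activations come from a chosen layer and token position of the model.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def extract_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(c * c for c in diff))
    return [c / norm for c in diff]

def ablate(x, direction):
    """Remove the component of x along a unit direction (orthogonal projection)."""
    proj = sum(a * d for a, d in zip(x, direction))
    return [a - proj * d for a, d in zip(x, direction)]

def add_direction(x, direction, alpha):
    """Activation addition: push x along the direction with strength alpha."""
    return [a + alpha * d for a, d in zip(x, direction)]

# Hypothetical cached activations for two harmful and two harmless prompts.
harmful = [[3.0, 1.0], [5.0, 1.0]]
harmless = [[1.0, 1.0], [1.0, 1.0]]

r_hat = extract_direction(harmful, harmless)
ablated = ablate([2.0, 5.0], r_hat)
boosted = add_direction([2.0, 5.0], r_hat, alpha=4.0)
print(r_hat, ablated, boosted)
```

Ablation is the red-team side of the technique and addition is the defensive side. The same projection value, the dot product of the activation with the direction, can also be logged as a monitoring signal.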
The tenth module applies mech interp and SAEs to production AI-safety problems. Real-time jailbreak detection via refusal-direction monitoring, reducing jailbreak success rate via safety-feature amplification (40-60% in Anthropic's 2024 experiments), feature-level fingerprint of adversarial suffix attacks, hallucination prediction via uncertainty features, knowledge cutoff + temporal feature detection, factuality monitoring in production RAG, the Anthropic 2024 deception-feature research, model-behavior audits via manipulation + sycophancy features, and producing interpretability reports for EU AI Act Article 13 transparency and KVKK compliance — concrete implementations are produced for each.
The eleventh module compares the open-source tools in the mech-interp ecosystem: TransformerLens (Neel Nanda; the Python mech-interp standard, with HookedTransformer, hook points, and ActivationCache); SAELens (Joseph Bloom; SAE training, analysis, and dashboards); nnsight (NDIF, the National Deep Inference Fabric team; distributed mech interp, remote execution, and multi-model interventions); Gemma Scope (DeepMind 2024; 400+ public SAEs on Gemma 2 2B-9B-27B); EleutherAI sae bank (Pythia + GPT-NeoX SAEs); Goodfire AI Ember API (production feature steering); Neuronpedia (a public SAE browsing platform with 1,000+ SAEs); and Anthropic Circuits Lab open-source artifacts. Scope, learning curve, right-use scenarios, and production integration of each are covered in detail.
In the capstone module, each participant designs an end-to-end mech-interp pipeline for their own scenario: use-case selection (jailbreak detector, hallucination monitor, red-team tool, custom feature catalog), base model (Gemma 2 9B, Llama 3.1 8B, or Qwen3), SAE training (Gemma Scope public SAE or custom training), feature discovery + auto-interpretation, custom feature-steering implementation, a concrete AI-safety / RAG / red-teaming use-case solution, and a 90-day operational roadmap. By the end of the training, participants have the technical competence to construct the SAE mathematical formulation step by step; make the right choice among vanilla, Top-K, Gated, and JumpReLU variants; train production-grade SAEs with TransformerLens + SAELens; apply the Anthropic Scaling Monosemanticity and Crosscoders methodology; perform circuit analysis with activation patching + ACDC; control model behavior with refusal direction + persona vectors + ITI + CAA; apply mech interp to production AI-safety problems like jailbreak prevention, hallucination detection, and alignment audit; and skillfully manage the TransformerLens, SAELens, nnsight, Gemma Scope, Goodfire, Neuronpedia toolchain. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.
Course Curriculum
104 Lessons
Instructor

Şükrü Yusuf KAYA
AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant
Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.