Advanced Level | 3 Days

Sparse Autoencoders and Mechanistic Interpretability Engineering Training (Anthropic Approach)

A 3-day advanced Turkish training that covers the 2022-2026 mechanistic-interpretability research of Anthropic, OpenAI, DeepMind, and Goodfire AI end to end: the superposition hypothesis, Sparse Autoencoders (Vanilla + Top-K + Gated + JumpReLU), Anthropic Scaling Monosemanticity, Crosscoders, refusal direction, persona vectors, circuit analysis, activation patching, and production AI-safety applications. With the TransformerLens, SAELens, nnsight, Gemma Scope, Goodfire AI, and Neuronpedia stack.

About This Course

This training is designed to be the first in Turkey to cover, end to end, mechanistic interpretability (mech interp): the discipline that reverse-engineers neural networks and dissects the internal computation of LLMs at the mathematical level. The field began with Chris Olah's 2020 Distill 'Circuits Thread', gained its theoretical framework in Anthropic's 2022 Toy Models of Superposition paper, and was carried to language models through Sparse Autoencoders (SAEs) by Cunningham et al. 2023 and by Anthropic's Bricken et al. 2023 and Templeton et al. 2024. Throughout 2024-2026 it became one of the AI ecosystem's central research areas, with developments such as Anthropic Scaling Monosemanticity (millions of interpretable features on Claude 3 Sonnet), Crosscoders, refusal direction (Arditi 2024), and persona vectors. In Turkey the discipline has barely been addressed even at the academic level, and this program is designed to close that gap.



The program's theoretical backbone consists of mech interp's three foundational concepts: feature (the model's 'unit of thought'), circuit (the computational flow between features), and superposition (a network representing more features than it has neurons, which leaves individual neurons responding to several unrelated features). The mathematical formulation of Elhage 2022's Toy Models of Superposition is derived step by step, including why N features can be encoded in n neurons (N > n) via the Johnson-Lindenstrauss bound on almost-orthogonal vectors. The polysemantic vs monosemantic neuron distinction, why the 'one neuron = one feature' assumption is wrong, and Park 2023's linear-representation hypothesis (LLM features encoded as linear directions in activation space) are covered in detail. Without this foundation, it is hard to see why the SAE is critical.
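The almost-orthogonality argument can be illustrated numerically. The sketch below is not taken from the course material: it assumes random Gaussian directions as a stand-in for learned feature directions, and the function name `max_interference` is made up for illustration. It measures the worst-case cosine similarity between 512 unit vectors packed into a low-dimensional and a higher-dimensional space.

```python
import numpy as np

def max_interference(num_features: int, dim: int, seed: int = 0) -> float:
    """Largest |cosine| between any two of `num_features` random unit vectors in R^dim."""
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((num_features, dim))
    # Normalise each row so that dot products are cosine similarities.
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    cosines = directions @ directions.T
    np.fill_diagonal(cosines, 0.0)  # ignore each vector's similarity with itself
    return float(np.abs(cosines).max())

# 512 "features" in 16 dimensions versus 512 "features" in 256 dimensions.
low_dim = max_interference(num_features=512, dim=16)
high_dim = max_interference(num_features=512, dim=256)
print(f"max |cos| in 16 dims:  {low_dim:.2f}")
print(f"max |cos| in 256 dims: {high_dim:.2f}")
```

In both cases there are more vectors than dimensions, so they cannot all be exactly orthogonal, but the interference between them shrinks as the dimension grows. That is the intuition behind storing N > n features in n neurons.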



The third module builds, at the mathematical level, how the Sparse Autoencoder addresses the superposition problem. The works of Cunningham et al. 2023 (the first SAE experiments on Pythia) and Anthropic's Bricken et al. 2023 (Towards Monosemanticity: an SAE trained on a 1-layer transformer, with interpretable features like 'Arabic text', 'DNA sequences', 'base64') are analyzed in detail. The encoder f = ReLU(W_e · x + b_e), decoder x̂ = W_d · f + b_d, and loss L = ||x - x̂||² + λ · ||f||_1 formulations are constructed step by step. The discipline of an overcomplete basis with dictionary size (M) >> input dim (d), L0 sparsity measurement, the dead-features problem, and the resampling strategy are covered hands-on. The interpretation of decoder weights as feature directions and the connection to sparse-coding theory are clarified.
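The encoder, decoder, and loss above can be written out as a minimal NumPy sketch. The weights here are hand-picked for illustration (an assumption, not a trained SAE): the dictionary has M = 2d latents, one for the positive and one for the negative part of each input coordinate, so the reconstruction is exact and only the L1 term contributes to the loss. The name `sae_forward` is hypothetical.

```python
import numpy as np

def sae_forward(x, W_e, b_e, W_d, b_d, l1_coeff):
    """Vanilla SAE: f = ReLU(W_e x + b_e), x_hat = W_d f + b_d,
    loss = ||x - x_hat||^2 + l1_coeff * ||f||_1."""
    f = np.maximum(W_e @ x + b_e, 0.0)       # sparse feature activations
    x_hat = W_d @ f + b_d                    # reconstruction
    recon = float(np.sum((x - x_hat) ** 2))  # squared reconstruction error
    l1 = float(np.sum(np.abs(f)))            # sparsity penalty
    l0 = int(np.count_nonzero(f))            # number of active features
    return f, x_hat, recon + l1_coeff * l1, l0

d = 4
eye = np.eye(d)
W_e = np.vstack([eye, -eye])   # shape (M, d) with M = 2d
W_d = np.hstack([eye, -eye])   # shape (d, M)
b_e = np.zeros(2 * d)
b_d = np.zeros(d)

x = np.array([1.0, -2.0, 0.0, 3.0])
f, x_hat, loss, l0 = sae_forward(x, W_e, b_e, W_d, b_d, l1_coeff=0.1)
print(f, x_hat, loss, l0)
```

Each column of `W_d` is the direction that one feature writes into activation space, which is why decoder columns are read as feature directions.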



The fourth module compares the modern SAE variants that overcome the vanilla SAE's limitations: OpenAI Top-K SAE (Gao et al. 2024: explicit selection of the K most active latents, a hard sparsity constraint instead of an L1 penalty, and dead-feature recovery via the AuxK auxiliary loss); DeepMind Gated SAE (Rajamanoharan et al. 2024: separation of the gate from the magnitude estimate); DeepMind JumpReLU SAE (Rajamanoharan et al. 2024: a thresholded step-function activation trained with straight-through estimators); BatchTopK (Bussmann et al. 2024); and TopK + L1 hybrid approaches. The reconstruction-sparsity Pareto frontier of each is compared concretely on Gemma 2; evidence-based recommendations are given for JumpReLU or Gated in the small-model (7B) + production scenario, and for Top-K in the large-model (70B+) + research scenario.
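The difference between the Top-K and JumpReLU activations is easiest to see on a single vector of encoder pre-activations. The sketch below covers only the forward pass; the straight-through estimator used to train JumpReLU thresholds and the AuxK loss are not shown. The function names and the example values are illustrative assumptions.

```python
import numpy as np

def topk_activation(pre: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest pre-activations and zero out everything else."""
    keep = np.argpartition(pre, -k)[-k:]  # indices of the k largest entries
    out = np.zeros_like(pre)
    out[keep] = pre[keep]
    return out

def jumprelu_activation(pre: np.ndarray, theta) -> np.ndarray:
    """Pass a pre-activation through unchanged only if it exceeds its threshold."""
    return np.where(pre > theta, pre, 0.0)

pre = np.array([0.2, 1.5, -0.3, 0.9, 0.05])
topk_out = topk_activation(pre, k=2)             # exactly 2 active latents
jump_hi = jumprelu_activation(pre, theta=0.5)    # scalar threshold
jump_lo = jumprelu_activation(pre, theta=0.1)    # a lower threshold admits more latents
print(topk_out, jump_hi, jump_lo)
```

Top-K fixes the number of active latents per input, whereas JumpReLU fixes a threshold (which may also be a per-latent vector) and lets the number of active latents vary with the input.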



The fifth module sets up the end-to-end SAE training pipeline in practice with the TransformerLens + SAELens stack. TransformerLens HookedTransformer and hook points, SAELens config (model_name, hook_name, dataset_path, batch sizes), choice of residual stream vs MLP output vs attention output, GPU memory management with the activation buffer, tokenizer + dataset preparation (Pile-uncopyrighted, FineWeb, OpenWebText), activation normalization (unit norm vs scale invariance), hyperparameter sweep (L0, L1, learning rate, K, dictionary size), dead-feature tracking + auxiliary-loss recovery, and W&B + Neuronpedia training-run logging: every step is hands-on. By the end of the training, participants can train production-quality SAEs on an LLM of their choice (Gemma 2 9B, Llama 3.1 8B, Qwen3).
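Dead-feature tracking, one of the steps listed above, can be sketched independently of any library. The class below is a hypothetical helper, not part of SAELens or TransformerLens: it counts how many training steps have passed since each latent last fired and flags latents that have been silent for a full window.

```python
import numpy as np

class DeadFeatureTracker:
    """Flags latents that have not fired for `window` consecutive training steps."""

    def __init__(self, num_features: int, window: int):
        self.window = window
        self.steps_since_fired = np.zeros(num_features, dtype=np.int64)

    def update(self, feature_acts: np.ndarray) -> None:
        """feature_acts has shape (batch, num_features): SAE activations for one step."""
        fired = (feature_acts > 0).any(axis=0)
        self.steps_since_fired += 1
        self.steps_since_fired[fired] = 0  # reset the counter for latents that fired

    def dead_mask(self) -> np.ndarray:
        return self.steps_since_fired >= self.window

tracker = DeadFeatureTracker(num_features=3, window=2)
tracker.update(np.array([[1.0, 0.5, 0.0], [0.0, 0.2, 0.0]]))  # latents 0 and 1 fire
tracker.update(np.array([[0.3, 0.0, 0.0]]))                   # only latent 0 fires
dead = tracker.dead_mask()
print(dead)
```

In a real training run the window would be far larger than 2 steps, and the mask would select which latents receive resampling or an auxiliary loss.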



The sixth module analyzes in detail the training and findings of 1M, 4M, and 34M feature SAEs on Claude 3 Sonnet in Anthropic's 2024 Scaling Monosemanticity paper. Safety-relevant features — deception, manipulation, weapons, code vulnerability, bias, sycophancy — are shown with concrete examples; multilingual + multimodal features (shared Turkish-English grammatical features) are exemplified. The cross-layer SAE (encoding multiple layers with a single SAE) and cross-model SAE (Claude vs GPT vs Gemini feature comparison) approaches introduced in the 2024-2025 Crosscoders papers are covered; the universal-features hypothesis (shared feature encoding across different models) is tested. The demo of transforming Claude into the Golden Gate Claude persona by amplifying the 'Golden Gate Bridge' feature via feature steering is performed; in production, feature steering is practically set up with the Goodfire AI Ember API.
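Feature steering of the kind used in the Golden Gate demo can be sketched as clamping one SAE latent to a target value and writing the difference back along that latent's decoder direction. The code below is a toy under stated assumptions: hand-picked weights instead of a trained SAE, and a hypothetical `clamp_feature` helper. It does not reproduce the Goodfire or Anthropic implementation.

```python
import numpy as np

def encode(x, W_e, b_e):
    """SAE encoder: f = ReLU(W_e x + b_e)."""
    return np.maximum(W_e @ x + b_e, 0.0)

def clamp_feature(x, W_e, b_e, W_d, feature_idx, target):
    """Shift x along one decoder direction so that the feature's decoded
    contribution matches `target`; everything else in x is left untouched."""
    current = encode(x, W_e, b_e)[feature_idx]
    return x + (target - current) * W_d[:, feature_idx]

d = 4
eye = np.eye(d)
W_e = np.vstack([eye, -eye])   # toy encoder, shape (2d, d)
W_d = np.hstack([eye, -eye])   # toy decoder, shape (d, 2d)
b_e = np.zeros(2 * d)

x = np.array([1.0, -2.0, 0.0, 3.0])
steered = clamp_feature(x, W_e, b_e, W_d, feature_idx=0, target=5.0)
print(steered)                        # coordinate 0 moves from 1.0 to 5.0
print(encode(steered, W_e, b_e)[0])   # the clamped feature now reads 5.0
```

Because the edit is applied as a delta to the original activation, the part of `x` that the SAE fails to reconstruct is preserved rather than thrown away.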



The seventh module is dedicated to the discipline of systematically discovering the meaning of millions of features after an SAE is trained. Feature labeling with top-activating examples (extracting tokens that yield max activation), the Bills et al. 2023 OpenAI auto-interpretation methodology (using GPT-5 / Claude Opus 4.7 / Gemini 2.5 Pro as feature labelers), auto-interp accuracy via simulation-based evaluation, and the specificity and sensitivity metrics are covered in detail. At the platform level, Neuronpedia (browsing 1,000+ public SAEs — GPT-2 → Gemma 2 → Claude), Goodfire AI (interactive feature exploration + steering API), and Gemma Scope (DeepMind 2024 — 400+ public SAEs on Gemma 2) are introduced. Through these platforms, the discipline of running a feature-family scan for your own domain (Turkish NLP, legal, healthcare, finance) is established.
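Two of the building blocks named above, top-activating examples and the sensitivity/specificity of a label, fit in a few lines. The tokens, activations, firing threshold, and predicted labels below are made-up illustrative data, and the helper names are hypothetical. A real pipeline would obtain the predictions from an LLM simulating the feature from its label.

```python
import numpy as np

def top_activating(tokens, acts, k):
    """Return the k tokens with the highest activation for one feature."""
    order = np.argsort(acts)[::-1][:k]
    return [(tokens[i], float(acts[i])) for i in order]

def sensitivity_specificity(predicted, actual):
    """predicted / actual: boolean arrays saying whether the feature fires per token."""
    tp = int(np.sum(predicted & actual))
    fn = int(np.sum(~predicted & actual))
    tn = int(np.sum(~predicted & ~actual))
    fp = int(np.sum(predicted & ~actual))
    return tp / (tp + fn), tn / (tn + fp)

tokens = ["the", "Golden", "Gate", "bridge", "is"]
acts = np.array([0.0, 2.1, 3.4, 1.2, 0.1])

top2 = top_activating(tokens, acts, k=2)
actual = acts > 0.5                                       # assumed firing threshold
predicted = np.array([False, True, True, False, False])   # hypothetical label-based guess
sens, spec = sensitivity_specificity(predicted, actual)
print(top2, sens, spec)
```

Here the label misses one token on which the feature really fires, so sensitivity drops below 1 while specificity stays at 1.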



The eighth module is dedicated to circuit-analysis engineering that uses the features extracted from SAEs. Activation patching (causal intervention via clean vs corrupt run comparison), reproduction of Wang 2022's IOI (Indirect Object Identification) circuit, Olsson 2022's induction-heads finding (the 2-step circuit of in-context learning: previous-token head + induction head), Conmy 2023's ACDC (automatic circuit discovery), edge attribution patching, and EAP-IG (compute-efficient attribution via integrated gradients) are covered in detail. Sparse interpretations of large circuits are produced with path patching and direct logit attribution.
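Activation patching can be demonstrated on a toy network before moving to a transformer. The sketch below assumes a two-layer ReLU model with hand-picked weights; the `run` helper and its `patch` argument are hypothetical. It copies one hidden activation at a time from the clean run into the corrupt run and reports how much of the clean output is recovered.

```python
import numpy as np

def run(x, W1, w2, patch=None):
    """Toy model: h = ReLU(W1 x), y = w2 . h. `patch` = (index, value) overrides one unit."""
    h = np.maximum(W1 @ x, 0.0)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value
    return h, float(w2 @ h)

W1 = np.eye(3)
w2 = np.array([1.0, 0.0, 2.0])
clean_x = np.ones(3)
corrupt_x = np.zeros(3)

clean_h, clean_y = run(clean_x, W1, w2)
_, corrupt_y = run(corrupt_x, W1, w2)

recovered = []
for unit in range(3):
    # Run on the corrupt input, but restore one hidden unit to its clean value.
    _, patched_y = run(corrupt_x, W1, w2, patch=(unit, clean_h[unit]))
    recovered.append((patched_y - corrupt_y) / (clean_y - corrupt_y))
print(recovered)
```

Unit 1 recovers nothing because its output weight is zero, which is the kind of causal evidence that patching provides and correlational analysis does not.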



The ninth module covers the discipline of controlling model behavior at inference time by simply adding vectors to activations without fine-tuning. The Arditi et al. 2024 finding 'Refusal in LLMs is mediated by a single direction' — that refusal is governed by a single activation direction — is constructed step by step. Direction extraction with harmful vs harmless prompt pairs, and refusal ablation with the 'jailbreak by orthogonalization' technique, are applied. Anthropic persona vectors (helpful, harmless, honest directions), ITI (Li 2023 — truthfulness improvement via head selection), CAA (Rimsky 2023 — contrastive activation addition), and the production steering API (Goodfire AI + nnsight) are covered in detail. This discipline is a critical production tool for both AI safety (jailbreak prevention) and red teaming (detecting model weaknesses).
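The two core operations in the refusal-direction recipe, extracting a direction as a difference of means and ablating it by projection, can be sketched with NumPy. The activations below are tiny hand-made arrays standing in for residual-stream activations collected on harmful and harmless prompts; the function names are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def difference_of_means_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean activation."""
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate_direction(acts, unit_dir):
    """Remove the component along unit_dir from every row: x - (x . r) r."""
    return acts - np.outer(acts @ unit_dir, unit_dir)

# Stand-in activations with shape (num_prompts, d_model).
harmful = np.array([[2.0, 0.0], [4.0, 0.0]])
harmless = np.array([[0.0, 1.0], [0.0, -1.0]])

refusal_dir = difference_of_means_direction(harmful, harmless)
acts = np.array([[5.0, 7.0]])
ablated = ablate_direction(acts, refusal_dir)
print(refusal_dir, ablated)
```

The same unit vector supports both uses mentioned above: monitoring the projection `acts @ refusal_dir` for safety purposes, and ablating it to probe model weaknesses in red teaming.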



The tenth module applies mech interp and SAEs to production AI-safety problems. Real-time jailbreak detection via refusal-direction monitoring, reducing jailbreak success rate via safety-feature amplification (40-60% in Anthropic's 2024 experiments), feature-level fingerprint of adversarial suffix attacks, hallucination prediction via uncertainty features, knowledge cutoff + temporal feature detection, factuality monitoring in production RAG, the Anthropic 2024 deception-feature research, model-behavior audits via manipulation + sycophancy features, and producing interpretability reports for EU AI Act Article 13 transparency and KVKK compliance — concrete implementations are produced for each.



The eleventh module compares the open-source tools in the mech-interp ecosystem: TransformerLens (Neel Nanda: the de facto Python mech-interp library, with HookedTransformer, hook points, and ActivationCache); SAELens (Joseph Bloom: SAE training, analysis, and dashboards); nnsight (NDIF: model interventions with remote execution on large models); Gemma Scope (DeepMind 2024: 400+ public SAEs on Gemma 2 2B, 9B, and 27B); EleutherAI sae bank (Pythia + GPT-NeoX SAEs); Goodfire AI Ember API (production feature steering); Neuronpedia (a public SAE browsing platform with 1,000+ SAEs); and Anthropic Circuits Lab open-source artifacts. The scope, learning curve, appropriate use cases, and production integration of each are covered in detail.



In the capstone module, each participant designs an end-to-end mech-interp pipeline for their own scenario: use-case selection (jailbreak detector, hallucination monitor, red-team tool, custom feature catalog), base model (Gemma 2 9B, Llama 3.1 8B, or Qwen3), SAE training (a Gemma Scope public SAE or custom training), feature discovery + auto-interpretation, a custom feature-steering implementation, a concrete AI-safety / RAG / red-teaming use-case solution, and a 90-day operational roadmap. By the end of the training, participants have the technical competence to construct the SAE mathematical formulation from first principles; make the right choice among vanilla, Top-K, Gated, and JumpReLU variants; train production-grade SAEs with TransformerLens + SAELens; apply the Anthropic Scaling Monosemanticity and Crosscoders methodology; perform circuit analysis with activation patching + ACDC; control model behavior with refusal direction + persona vectors + ITI + CAA; apply mech interp to production AI-safety problems like jailbreak prevention, hallucination detection, and alignment audit; and skillfully manage the TransformerLens, SAELens, nnsight, Gemma Scope, Goodfire, Neuronpedia toolchain. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.

Training Methodology

The only production-grade advanced program in Turkey that addresses Anthropic, OpenAI, DeepMind, and Goodfire AI's 2022-2026 mech-interp research

Full mathematical construction of the Sparse Autoencoder architecture family: comparison of Vanilla, Top-K (OpenAI), Gated + JumpReLU (DeepMind), BatchTopK

Hands-on analysis of the Anthropic Scaling Monosemanticity and Crosscoders methodology

End-to-end learning of the TransformerLens + SAELens + nnsight + Gemma Scope + Goodfire + Neuronpedia open-source stack

Circuit-analysis engineering with activation patching, ACDC, and attribution patching

Inference-time behavior control with refusal direction (Arditi 2024), persona vectors, ITI, CAA

Production AI-safety applications: jailbreak prevention, hallucination detection, deception audit

The discipline of producing EU AI Act Article 13 and KVKK-compliant interpretability reports

Who Is This For?

AI Researchers who want to do Anthropic / OpenAI / DeepMind-style mech-interp research
AI Safety Engineers who want to build production AI-safety pipelines by understanding LLM internals
Senior AI Engineers developing products that require jailbreak prevention, hallucination detection, and adversarial robustness
Compliance + risk managers who must perform alignment audits in enterprise LLM usage
Red Team engineers and adversarial AI-security experts
Startup technical leaders who want to build the interpretability infrastructure for their own open-source LLM (Turkish or domain-specific)

Why This Course?

1. The first advanced program in Turkey that addresses mechanistic-interpretability + Sparse Autoencoder discipline at production grade.
2. Covers 2024-2026 frontier research, including Anthropic Scaling Monosemanticity, Crosscoders, OpenAI Top-K SAE, DeepMind JumpReLU, and Gemma Scope.
3. Teaches safety-critical activation-steering techniques like refusal direction (Arditi 2024) and persona vectors.
4. Covers the TransformerLens, SAELens, nnsight, Goodfire, Neuronpedia stack end to end and hands-on.
5. Ties mech interp to production AI-safety problems like jailbreak prevention, hallucination detection, and alignment audit.
6. Instills the discipline of producing interpretability reports for EU AI Act Article 13 and KVKK compliance.
7. Through the capstone project, equips the participant with a custom feature catalog + steering pipeline applicable in their own domain.
8. Offers Anthropic / DeepMind / Goodfire-level coverage for teams wishing to contribute to AI-safety research.

Learning Outcomes

Dissect the theoretical foundations of mech interp (superposition, polysemanticity, linear representation).
Make evidence-based choices among the Vanilla, Top-K, Gated, and JumpReLU SAE variants.
Train production-grade SAEs with TransformerLens + SAELens.
Extract millions of features by applying the Anthropic Scaling Monosemanticity methodology.
Automatically label features with an auto-interpretation pipeline (GPT-5 / Claude Opus 4.7).
Perform circuit identification with activation patching + ACDC.
Establish inference-time behavior control with refusal direction + persona vectors.
Apply mech interp to jailbreak prevention, hallucination detection, and alignment audit.
Skillfully use the Gemma Scope, Goodfire AI, and Neuronpedia public banks.
Produce EU AI Act + KVKK-compliant interpretability reports.

Requirements

Active Python experience (intermediate to advanced), basic use of PyTorch and HuggingFace Transformers
Foundations in linear algebra (matrix operations, eigenvalue decomposition), probability, and gradient descent
Basic knowledge of the transformer architecture (attention, residual stream, layer norm)
A habit of reading ML/DL research papers (following Anthropic / DeepMind / OpenAI papers is recommended)
GPU access before the training (RunPod, Lambda Labs, Modal) — H100 (80GB) or 2x A100 recommended
Hugging Face + Weights & Biases + Neuronpedia accounts before the training

Course Curriculum

104 Lessons

Module 1: Strategic Introduction to the Mechanistic Interpretability Discipline (9 Lessons)
Module 2: Features, Circuits, and the Superposition Hypothesis (9 Lessons)
Module 3: Sparse Autoencoder (SAE) Foundations: Cunningham 2023 and Anthropic Bricken 2023 (9 Lessons)
Module 4: Modern SAE Architecture Families: Top-K, Gated, JumpReLU, and BatchTopK (9 Lessons)
Module 5: SAE Training: Practical Implementation (TransformerLens + SAELens) (9 Lessons)
Module 6: Anthropic Scaling Monosemanticity and Crosscoders (9 Lessons)
Module 7: Feature Discovery and Auto-Interpretation: Neuronpedia and GPT-5 Auto-Interp (9 Lessons)
Module 8: Circuit Analysis Engineering: IOI Circuit, Induction Heads, and Attribution Patching (9 Lessons)
Module 9: Activation Steering and Refusal Direction: Controlling LLM Behavior at Inference (9 Lessons)
Module 10: Production AI Safety Applications: Jailbreak, Hallucination, Alignment Audit (9 Lessons)
Module 11: Open-Source Stack and Tooling: TransformerLens, SAELens, nnsight, Gemma Scope, Goodfire (9 Lessons)
Module 12: Capstone: Custom Feature Discovery and Steering Pipeline (5 Lessons)

Instructor

Şükrü Yusuf KAYA

AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant

Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.

Frequently Asked Questions