# Sparse Autoencoders and Mechanistic Interpretability Engineering Training (Anthropic Approach)

> Source: https://sukruyusufkaya.com/en/training/sparse-autoencoders-mechanistic-interpretability-muhendisligi-egitimi
> Updated: 2026-05-19T01:12:12.037Z
> Level: advanced
> Topics: mechanistic interpretability, sparse autoencoder, sae, superposition, anthropic interpretability, scaling monosemanticity, crosscoders, refusal direction, activation steering, persona vectors, circuit analysis, activation patching, induction heads, ioi circuit, transformerlens, saelens, nnsight, gemma scope, goodfire ai, neuronpedia
**TLDR:** A 3-day advanced Turkish training that covers the 2022-2026 mechanistic-interpretability research of Anthropic, OpenAI, DeepMind, and Goodfire AI end to end: the superposition hypothesis, Sparse Autoencoders (Vanilla + Top-K + Gated + JumpReLU), Anthropic Scaling Monosemanticity, Crosscoders, refusal direction, persona vectors, circuit analysis, activation patching, and production AI-safety applications. With the TransformerLens, SAELens, nnsight, Gemma Scope, Goodfire AI, and Neuronpedia stack.

## Description

The Sparse Autoencoders and Mechanistic Interpretability Engineering Training is a 3-day advanced program that, for the first time in Turkey, covers mechanistic interpretability end to end: the discipline that reverse-engineers neural networks and dissects the internal workings of LLMs at the mathematical level. It is designed for AI Researchers, AI Safety Engineers, ML Researchers, and senior AI Engineers.

## Learning Outcomes

- Dissect the theoretical foundations of mech interp (superposition, polysemanticity, linear representation).
- Make evidence-based choices among the Vanilla, Top-K, Gated, and JumpReLU SAE variants.
- Train production-grade SAEs with TransformerLens + SAELens.
- Extract millions of features by applying the Anthropic Scaling Monosemanticity methodology.
- Automatically label features with an auto-interpretation pipeline (GPT-5 / Claude Opus 4.7).
- Perform circuit identification with activation patching + ACDC.
- Establish inference-time behavior control with refusal direction + persona vectors.
- Apply mech interp to jailbreak prevention, hallucination detection, and alignment audit.
- Work fluently with the public SAE collections and platforms: Gemma Scope, Goodfire AI, and Neuronpedia.
- Produce EU AI Act + KVKK-compliant interpretability reports.

<p>This training is designed to be the first in Turkey to address end to end the mechanistic-interpretability (mech interp) discipline, which reverse-engineers neural networks and dissects the internal computational flow of LLMs at the mathematical level. The field began with Chris Olah's 2020 Distill 'Circuits Thread' and gained its theoretical framework with Anthropic's 2022 Toy Models of Superposition paper. Sparse Autoencoders (SAEs) then carried it toward production LLMs through the work of Cunningham et al. 2023, Bricken et al. 2023, and Templeton et al. 2024. Throughout 2024-2026 it became one of the AI ecosystem's central research areas, with developments such as Anthropic's Scaling Monosemanticity (millions of interpretable features on Claude 3 Sonnet), Crosscoders, the refusal direction (Arditi et al. 2024), and persona vectors. The discipline has barely been addressed in Turkey, even at the academic level, and this program is designed to close that gap.</p>

<p>The program's theoretical backbone consists of mech interp's three foundational concepts: the feature (the model's 'unit of thought'), the circuit (the computational flow between features), and superposition (a layer representing more features than it has neurons, so that a single neuron takes part in encoding several features). The mathematical formulation of Elhage et al. 2022's Toy Models of Superposition is derived step by step, including why N features can be encoded in n neurons (N > n): by the Johnson-Lindenstrauss argument, far more than n almost-orthogonal directions fit in an n-dimensional space. The polysemantic vs monosemantic neuron distinction, why the 'one neuron = one feature' assumption is wrong, and Park et al. 2023's linear-representation hypothesis (LLM features are encoded as linear directions in activation space) are covered in detail. Without this foundation, it is hard to see why the SAE is critical.</p>
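<p>The almost-orthogonality argument can be checked numerically. The following is a minimal sketch, assuming random Gaussian directions stand in for learned feature directions; the function name <code>interference_stats</code> and the sizes (1024 features, 256 neurons) are illustrative choices, not values from the course.</p>

```python
import numpy as np

def interference_stats(num_features, num_neurons, seed=0):
    """Sample random unit feature directions and measure how much they overlap."""
    rng = np.random.default_rng(seed)
    # One random direction per feature, living in the smaller neuron space (N > n).
    directions = rng.standard_normal((num_features, num_neurons))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Pairwise cosine similarities; the diagonal (self-similarity) is excluded.
    cosines = directions @ directions.T
    off_diagonal = np.abs(cosines[~np.eye(num_features, dtype=bool)])
    return float(off_diagonal.max()), float(off_diagonal.mean())

worst, typical = interference_stats(num_features=1024, num_neurons=256)
print(f"max |cos| = {worst:.3f}, mean |cos| = {typical:.3f}")
```

<p>Four times more directions than dimensions still leaves the typical pairwise overlap small, which is the geometric room that superposition exploits.</p>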

<p>The third module builds at the mathematical level how the Sparse Autoencoder solves the superposition problem. The works of Cunningham et al. 2023 (the first SAE experiments on Pythia) and Anthropic's Bricken et al. 2023 (Towards Monosemanticity: a production-grade SAE on a 1-layer transformer, with interpretable features like 'Arabic text', 'DNA sequences', 'base64') are analyzed in detail. The encoder f = ReLU(W_e · x + b_e), decoder x̂ = W_d · f + b_d, and loss L = ||x - x̂||² + λ · ||f||_1 formulations are constructed step by step. The use of an overcomplete basis, with dictionary size M much larger than input dimension d, L0 sparsity measurement, the dead-features problem, and the resampling strategy are covered hands-on. The interpretation of decoder weights as feature directions and the connection to sparse-coding theory are clarified.</p>
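<p>The encoder, decoder, and loss above can be written out directly. This is a minimal NumPy sketch of the forward pass only (no training loop); the function name <code>sae_forward</code>, the random initialization, and the sizes <code>d = 16</code>, <code>M = 64</code> are assumptions made for illustration.</p>

```python
import numpy as np

def sae_forward(x, W_e, b_e, W_d, b_d, lam):
    """Vanilla SAE on a batch x of shape (batch, d); returns f, x_hat, loss, L0."""
    f = np.maximum(0.0, x @ W_e.T + b_e)        # f = ReLU(W_e x + b_e), shape (batch, M)
    x_hat = f @ W_d.T + b_d                     # x_hat = W_d f + b_d, shape (batch, d)
    recon = np.sum((x - x_hat) ** 2, axis=1)    # ||x - x_hat||^2 per example
    l1 = np.sum(np.abs(f), axis=1)              # ||f||_1 per example
    loss = float(np.mean(recon + lam * l1))     # batch mean of the SAE loss
    l0 = float(np.mean(np.count_nonzero(f, axis=1)))  # mean number of active features
    return f, x_hat, loss, l0

rng = np.random.default_rng(0)
d, M, batch = 16, 64, 8                         # overcomplete dictionary: M >> d
W_e = rng.standard_normal((M, d)) * 0.1
W_d = rng.standard_normal((d, M)) * 0.1
b_e, b_d = np.zeros(M), np.zeros(d)
x = rng.standard_normal((batch, d))

f, x_hat, loss, l0 = sae_forward(x, W_e, b_e, W_d, b_d, lam=1e-3)
print(f"loss = {loss:.3f}, mean L0 = {l0:.1f} of {M} features")
```

<p>Each column of <code>W_d</code> is the direction that one feature writes into activation space, which is why decoder weights are read as feature directions.</p>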

<p>The fourth module comparatively examines modern SAE variants that overcome vanilla SAE's limitations: OpenAI Top-K SAE (Gao et al. 2024: explicit K-active selection, a hard sparsity constraint instead of an L1 penalty, dead-feature recovery via the AuxK auxiliary loss); DeepMind Gated SAE (Rajamanoharan 2024: gate vs magnitude separation); DeepMind JumpReLU SAE (2024: step-function activation + straight-through estimator training); BatchTopK (Bussmann, Leask, and Nanda 2024); and TopK + L1 hybrid approaches. The reconstruction-sparsity Pareto frontier of each is concretely compared on Gemma 2; evidence-based recommendations are given for JumpReLU or Gated in the small-model (7B) + production scenario, and Top-K for the large-model (70B+) + research scenario.</p>
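<p>The difference between the variants is easiest to see in their activation functions. Below is a minimal forward-only sketch of a Top-K and a JumpReLU activation; the function names are hypothetical, applying ReLU to the kept Top-K values is an assumption of this sketch, and the straight-through estimator used to train JumpReLU thresholds is not shown.</p>

```python
import numpy as np

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations in each row and zero out the rest."""
    # argpartition places the indices of the k largest entries in the last k columns.
    top_idx = np.argpartition(pre_acts, -k, axis=1)[:, -k:]
    out = np.zeros_like(pre_acts)
    kept = np.take_along_axis(pre_acts, top_idx, axis=1)
    np.put_along_axis(out, top_idx, kept, axis=1)
    return np.maximum(out, 0.0)  # assumption: negative survivors are clipped to zero

def jumprelu_activation(pre_acts, theta):
    """Pass a pre-activation through unchanged only where it exceeds its threshold."""
    return pre_acts * (pre_acts > theta)

pre = np.array([[3.0, -1.0, 2.0, 0.5]])
topk_out = topk_activation(pre, k=2)
jump_out = jumprelu_activation(pre, theta=np.array([1.0, 1.0, 2.5, 0.1]))
print(topk_out, jump_out)
```

<p>Top-K fixes L0 per example by construction, while JumpReLU lets the number of active features vary and learns a per-feature threshold instead.</p>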

<p>The fifth module sets up the end-to-end SAE training pipeline in practice with the TransformerLens + SAELens stack. TransformerLens HookedTransformer and hook points, SAELens config (model_name, hook_name, dataset_path, batch sizes), choice of residual stream vs MLP output vs attention output, GPU memory management with the activation buffer, tokenizer + dataset preparation (Pile-uncopyrighted, FineWeb, OpenWebText), activation normalization (unit norm vs scale invariance), hyperparameter sweep (target L0, L1 coefficient, learning rate, K, dictionary size), dead-feature tracking + auxiliary-loss recovery, and W&B + Neuronpedia training-run logging: every step is hands-on. By the end of the training, participants can train production-quality SAEs on an LLM of their choice (Gemma 2 9B, Llama 3.1 8B, Qwen3).</p>
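<p>Dead-feature tracking is one pipeline step that is simple to sketch independently of any library. The class below is a hypothetical helper, not part of SAELens or TransformerLens: it counts, per feature, how many training steps have passed since the feature last fired, and reports features that exceed a chosen limit as candidates for resampling or auxiliary-loss recovery.</p>

```python
import numpy as np

class DeadFeatureTracker:
    """Counts training steps since each SAE feature last had a nonzero activation."""

    def __init__(self, num_features, dead_after_steps):
        self.steps_since_fired = np.zeros(num_features, dtype=np.int64)
        self.dead_after_steps = dead_after_steps

    def update(self, feature_acts):
        """feature_acts has shape (batch, num_features) for one training step."""
        fired = (feature_acts > 0).any(axis=0)
        self.steps_since_fired += 1
        self.steps_since_fired[fired] = 0  # reset the counter for features that fired

    def dead_mask(self):
        """Boolean mask of features that have been silent for too long."""
        return self.steps_since_fired >= self.dead_after_steps

tracker = DeadFeatureTracker(num_features=4, dead_after_steps=3)
rng = np.random.default_rng(0)
for _ in range(5):
    acts = rng.random((8, 4))
    acts[:, 3] = 0.0          # feature 3 never fires in this toy run
    tracker.update(acts)
dead = tracker.dead_mask()
print(dead)
```

<p>In a real run the limit is usually expressed in tokens seen rather than steps; the step-based counter here is a simplification.</p>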

<p>The sixth module analyzes in detail the training and findings of 1M, 4M, and 34M feature SAEs on Claude 3 Sonnet in Anthropic's 2024 Scaling Monosemanticity paper. Safety-relevant features — deception, manipulation, weapons, code vulnerability, bias, sycophancy — are shown with concrete examples; multilingual + multimodal features (shared Turkish-English grammatical features) are exemplified. The cross-layer SAE (encoding multiple layers with a single SAE) and cross-model SAE (Claude vs GPT vs Gemini feature comparison) approaches introduced in the 2024-2025 Crosscoders papers are covered; the universal-features hypothesis (shared feature encoding across different models) is tested. The demo of transforming Claude into the Golden Gate Claude persona by amplifying the 'Golden Gate Bridge' feature via feature steering is performed; in production, feature steering is practically set up with the Goodfire AI Ember API.</p>
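<p>Feature steering reduces to moving an activation along one feature's decoder direction. The sketch below uses simple additive steering on a random toy dictionary; this is an assumption for illustration, not a description of how the Golden Gate Claude demo or the Goodfire AI Ember API is implemented, and the name <code>steer</code> is hypothetical.</p>

```python
import numpy as np

def steer(resid, W_d, feature_idx, alpha):
    """Add alpha times one feature's unit-norm decoder direction to every position."""
    direction = W_d[:, feature_idx]
    direction = direction / np.linalg.norm(direction)
    return resid + alpha * direction   # broadcasts over the sequence dimension

rng = np.random.default_rng(0)
resid = rng.standard_normal((5, 16))   # toy activations: (seq_len, d_model)
W_d = rng.standard_normal((16, 64))    # toy decoder: (d_model, num_features)
steered = steer(resid, W_d, feature_idx=7, alpha=4.0)

unit = W_d[:, 7] / np.linalg.norm(W_d[:, 7])
print((steered - resid) @ unit)        # shift along the feature direction per position
```

<p>The steering strength <code>alpha</code> has to be tuned per feature and per layer; values that are too large typically degrade fluency.</p>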

<p>The seventh module is dedicated to the discipline of systematically discovering the meaning of millions of features after an SAE is trained. Feature labeling with top-activating examples (extracting tokens that yield max activation), the Bills et al. 2023 OpenAI auto-interpretation methodology (using GPT-5 / Claude Opus 4.7 / Gemini 2.5 Pro as feature labelers), auto-interp accuracy via simulation-based evaluation, and the specificity and sensitivity metrics are covered in detail. At the platform level, Neuronpedia (browsing 1,000+ public SAEs — GPT-2 → Gemma 2 → Claude), Goodfire AI (interactive feature exploration + steering API), and Gemma Scope (DeepMind 2024 — 400+ public SAEs on Gemma 2) are introduced. Through these platforms, the discipline of running a feature-family scan for your own domain (Turkish NLP, legal, healthcare, finance) is established.</p>
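<p>The evaluation metrics named above can be sketched on a handful of tokens. This is a minimal illustration, assuming the labeler's simulated activations are already available as numbers; the function names, the toy values, and the 0.25 binarization threshold are all assumptions.</p>

```python
import numpy as np

def simulation_score(true_acts, simulated_acts):
    """Pearson correlation between real and simulated feature activations."""
    return float(np.corrcoef(true_acts, simulated_acts)[0, 1])

def sensitivity_specificity(true_acts, predicted_active, threshold=0.0):
    """Treat the label as a binary classifier for 'the feature is active here'."""
    actually_active = np.asarray(true_acts) > threshold
    predicted_active = np.asarray(predicted_active, dtype=bool)
    tp = np.sum(actually_active & predicted_active)
    fn = np.sum(actually_active & ~predicted_active)
    tn = np.sum(~actually_active & ~predicted_active)
    fp = np.sum(~actually_active & predicted_active)
    return float(tp / (tp + fn)), float(tn / (tn + fp))

true_acts = np.array([0.0, 0.0, 2.0, 3.0, 0.0, 1.0])   # real activations per token
simulated = np.array([0.0, 0.5, 1.5, 3.0, 0.0, 0.0])   # labeler's simulated activations

score = simulation_score(true_acts, simulated)
sens, spec = sensitivity_specificity(true_acts, simulated > 0.25)
print(f"correlation = {score:.2f}, sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

<p>A label can score well on one metric and poorly on the other: an overly broad label has high sensitivity but low specificity, which is why both are reported.</p>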

<p>The eighth module is dedicated to circuit-analysis engineering that uses the features extracted from SAEs. Activation patching (causal intervention via clean vs corrupt run comparison), reproduction of Wang 2022's IOI (Indirect Object Identification) circuit, Olsson 2022's induction-heads finding (the 2-step circuit of in-context learning: previous-token head + induction head), Conmy 2023's ACDC (automatic circuit discovery), edge attribution patching, and EAP-IG (compute-efficient attribution via integrated gradients) are covered in detail. Sparse interpretations of large circuits are produced with path patching and direct logit attribution.</p>
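<p>The clean vs corrupt comparison behind activation patching can be shown on a toy network. The sketch below is not a transformer: it assumes a hand-written two-layer model with three hidden units, and patches each hidden unit from the clean run into the corrupt run to see how much of the clean logit difference it restores.</p>

```python
import numpy as np

# Toy model: hidden = ReLU(W1 x), logits = W2 hidden. Weights are hand-picked.
W1 = np.eye(3)
W2 = np.array([[1.0, 0.0, 0.5],
               [0.0, 1.0, 0.0]])

def run(x, patch=None):
    """Run the model; optionally overwrite one hidden unit with a given value."""
    hidden = np.maximum(0.0, W1 @ x)
    if patch is not None:
        unit, value = patch
        hidden[unit] = value
    logits = W2 @ hidden
    return hidden, float(logits[0] - logits[1])   # metric: logit difference

clean_x = np.array([2.0, 1.0, 3.0])
corrupt_x = np.array([0.0, 1.0, 1.0])

clean_hidden, clean_metric = run(clean_x)
_, corrupt_metric = run(corrupt_x)

# Patch each hidden unit's clean value into the corrupt run, one unit at a time.
recovery = np.array([
    (run(corrupt_x, patch=(unit, clean_hidden[unit]))[1] - corrupt_metric)
    / (clean_metric - corrupt_metric)
    for unit in range(len(clean_hidden))
])
print(recovery)
```

<p>A recovery near 1 marks a component that carries the behavior, and a recovery near 0 marks one that is irrelevant to it. In a real model the same loop runs over attention heads or SAE features instead of hidden units.</p>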

<p>The ninth module covers the discipline of controlling model behavior at inference time by simply adding vectors to activations, without fine-tuning. The finding of Arditi et al. 2024, 'Refusal in Language Models Is Mediated by a Single Direction' (refusal is governed by a single activation direction), is constructed step by step. Direction extraction with harmful vs harmless prompt pairs, and refusal ablation with the 'jailbreak by orthogonalization' technique, are applied. Anthropic persona vectors (helpful, harmless, honest directions), ITI (Li 2023: truthfulness improvement via head selection), CAA (Rimsky 2023: contrastive activation addition), and the production steering API (Goodfire AI + nnsight) are covered in detail. This discipline is a critical production tool for both AI safety (jailbreak prevention) and red teaming (detecting model weaknesses).</p>
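<p>Direction extraction and ablation can be sketched with a difference of means followed by a projection. The example below uses synthetic activations in which the 'harmful' set is shifted along one axis; the data, sizes, and function names are assumptions, and a real pipeline would read these activations from a chosen layer and token position of the model.</p>

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations between the two prompt sets."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove each activation's component along the direction."""
    return acts - np.outer(acts @ direction, direction)

rng = np.random.default_rng(0)
d = 32
harmless = rng.standard_normal((200, d))
harmful = rng.standard_normal((200, d))
harmful[:, 0] += 4.0                      # synthetic 'refusal' signal along axis 0

direction = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, direction)
print(f"direction[0] = {direction[0]:.3f}")
print(f"max |projection| after ablation = {np.abs(ablated @ direction).max():.2e}")
```

<p>Adding a multiple of the same direction instead of removing it is the corresponding way to induce the behavior, which is the mechanism contrastive activation addition builds on.</p>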

<p>The tenth module applies mech interp and SAEs to production AI-safety problems. Real-time jailbreak detection via refusal-direction monitoring, reducing jailbreak success rate via safety-feature amplification (40-60% in Anthropic's 2024 experiments), feature-level fingerprint of adversarial suffix attacks, hallucination prediction via uncertainty features, knowledge cutoff + temporal feature detection, factuality monitoring in production RAG, the Anthropic 2024 deception-feature research, model-behavior audits via manipulation + sycophancy features, and producing interpretability reports for EU AI Act Article 13 transparency and KVKK compliance — concrete implementations are produced for each.</p>
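<p>Direction-based monitoring can be sketched as a calibrated threshold on a projection. The class below is hypothetical: it assumes a direction has already been extracted, calibrates a baseline on benign activations, and flags inputs whose projection deviates from that baseline by more than a chosen number of standard deviations. Which direction to watch, and whether high or low projections indicate an attack, are deployment-specific assumptions.</p>

```python
import numpy as np

class DirectionMonitor:
    """Flags activations whose projection on a direction leaves the benign range."""

    def __init__(self, direction, benign_acts, num_std=3.0):
        self.direction = direction / np.linalg.norm(direction)
        scores = benign_acts @ self.direction
        self.mean = float(scores.mean())
        self.std = float(scores.std())
        self.num_std = num_std

    def score(self, act):
        return float(act @ self.direction)

    def is_flagged(self, act):
        # Two-sided check: both unusually high and unusually low projections count.
        return abs(self.score(act) - self.mean) > self.num_std * self.std

rng = np.random.default_rng(0)
d = 16
direction = rng.standard_normal(d)
benign = rng.standard_normal((500, d))    # synthetic benign activations
monitor = DirectionMonitor(direction, benign)

unit = direction / np.linalg.norm(direction)
print(monitor.is_flagged(np.zeros(d)), monitor.is_flagged(10.0 * unit))
```

<p>The number of standard deviations trades false positives against missed detections and would need calibration on real traffic.</p>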

<p>The eleventh module comparatively addresses all open-source tools in the mech-interp ecosystem: TransformerLens (Neel Nanda: the Python mech-interp standard, with HookedTransformer + hook points + ActivationCache); SAELens (Joseph Bloom: SAE training + analysis + dashboards); nnsight (NDIF, Northeastern University: distributed mech interp + remote execution + multi-model interventions); Gemma Scope (DeepMind 2024: 400+ public SAEs on Gemma 2 2B-9B-27B); EleutherAI sae bank (Pythia + GPT-NeoX SAEs); Goodfire AI Ember API (production feature steering); Neuronpedia (a public SAE browsing platform with 1,000+ SAEs); and Anthropic Circuits Lab open-source artifacts. Scope, learning curve, right-use scenarios, and production integration of each are covered in detail.</p>

<p>In the capstone module, each participant designs an end-to-end mech-interp pipeline for their own scenario: use-case selection (jailbreak detector, hallucination monitor, red-team tool, custom feature catalog), base model (Gemma 2 9B or Llama 3.1 8B or Qwen3), SAE training (Gemma Scope public SAE or custom training), feature discovery + auto-interpretation, custom feature-steering implementation, a concrete AI-safety / RAG / red-teaming use-case solution, and a 90-day operational roadmap. By the end of the training, participants reach a level of technical competence to construct the SAE mathematical formulation from first principles; make the right choice among vanilla, Top-K, Gated, and JumpReLU variants; train production-grade SAEs with TransformerLens + SAELens; apply the Anthropic Scaling Monosemanticity and Crosscoders methodology; perform circuit analysis with activation patching + ACDC; control model behavior with refusal direction + persona vectors + ITI + CAA; apply mech interp to production AI-safety problems like jailbreak prevention, hallucination detection, and alignment audit; and skillfully manage the TransformerLens, SAELens, nnsight, Gemma Scope, Goodfire, Neuronpedia toolchain. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.</p>