I can implement dual numbers in Python. How does JAX make this 'fast'?

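Three main contributions: (1) **XLA compilation**: JAX traces the function once into a graph and compiles it, then calls the optimized binary instead of dispatching every op through Python. (2) **Fused operations**: XLA combines many small ops into a single kernel, which cuts memory traffic. (3) **GPU/TPU dispatch**: the tangent arithmetic that a dual-number class does one scalar at a time is lowered to vectorized accelerator kernels. Your Python dual class pays an interpreter round trip per op; a jitted JAX function runs as compiled code with no per-op interpreter overhead. **Result**: often orders of magnitude faster on array workloads, although the exact speedup depends on the function and the hardware.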
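As a minimal sketch of the compile-once idea, `jax.jit` can wrap `jax.grad` so that the gradient is traced once and then runs as compiled code. The function `f` and the input size are illustrative assumptions, not something taken from the lesson.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative scalar function of a vector (an assumption for this sketch)
    return jnp.sum(jnp.sin(x) * jnp.exp(x))

grad_f = jax.grad(f)           # ops are dispatched one by one from Python
fast_grad_f = jax.jit(grad_f)  # traced once, compiled by XLA, then cached

x = jnp.linspace(0.0, 1.0, 1000)
g_eager = grad_f(x)
g_jit = fast_grad_f(x)         # first call compiles; later calls with the same shape reuse the binary

print(jnp.allclose(g_eager, g_jit, rtol=1e-5, atol=1e-5))
```

Both versions compute the same gradient; the difference is only in how the work is scheduled, so any timing comparison should exclude the first (compiling) call.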

Is hessian in PyTorch really scalable? For n=1B parameters?

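No, not the full Hessian. For 1B parameters it has 1B × 1B = 10¹⁸ elements, which means exabytes of memory. The **HVP** (Hessian-vector product) does scale: it costs a small constant multiple of one gradient evaluation in both time and memory. On top of HVPs you can run iterative methods such as Lanczos (for the leading Hessian eigenvalues) and conjugate gradient (for Newton-style updates). K-FAC and Shampoo take a different route and maintain structured, Kronecker-factored curvature approximations. Full second-order optimizers are not common in practical LLM training; Adam-family methods are usually sufficient. Module 17 covers the details.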

Are influence functions (Koh & Liang 2017) actually used?

Niche but applied. (1) **Data cleaning**: 'Which training example caused this test error?' (2) **Debugging LLM behavior**: 'Which training data contributed to this hallucination?' (3) **Privacy auditing**: 'How visible is a user's data in the model?' (4) **Memorization detection**: estimating how exposed the model is to extraction attacks. As of 2024, Anthropic's interpretability team uses influence function variants (seen in their papers). Practical challenge: the inverse-Hessian-vector product at the core of the method is too expensive to compute exactly at scale, so **stochastic estimates** and approximations are needed.

What's the difference between `torch.func` and `torch.autograd`? Can I use both?

Difference: (1) `torch.autograd` is mutable and stateful: calling `.backward()` accumulates gradients into each tensor's `.grad`. (2) `torch.func` is **functional**: the gradient comes back as a return value, with no side effects, and it composes with vmap. (3) `torch.func` makes higher-order autodiff easier with JAX-like APIs. Combined use: standard `torch.autograd` for the training loop, `torch.func` when you need per-sample gradients or HVPs over a batch. The two coexist without friction, but no single call bridges both APIs.

How do JAX's `pmap` and `vmap` affect autodiff?

**Composable**: `grad`, `jit`, `vmap`, and `pmap` nest freely, as long as each transform's requirements hold (for example, `grad` needs a scalar output). (1) `vmap(grad(f))`: **per-sample gradients** across a batch, used in federated learning, influence functions, and advanced regularization. (2) `grad` over `vmap(f)`: only valid once the vectorized output is reduced to a scalar (e.g. a mean), in which case it is the ordinary batch gradient. (3) `pmap(grad(f))`: per-shard gradients across devices, the foundation of data-parallel training. (4) `jit(grad(f))`: a compiled gradient, for production speed. This **transform composition** is JAX's superpower; PyTorch's `torch.func` is getting there, but in JAX it is native.
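A small sketch of the first two compositions may help. The linear model, the data, and the name `loss` are assumptions made up for illustration; only `jax.grad` and `jax.vmap` come from the text.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Squared error for a single example (illustrative model, an assumption)
    return (jnp.dot(w, x) - y) ** 2

w = jnp.array([0.5, -1.0, 2.0])
xs = jnp.array([[1.0, 2.0, 3.0],
                [0.0, 1.0, 0.0],
                [2.0, 0.0, 1.0],
                [1.0, 1.0, 1.0]])
ys = jnp.array([1.0, 0.0, 2.0, 3.0])

# vmap(grad(f)): map the per-example gradient over the batch axis of xs and ys,
# while w is shared across the batch (in_axes=None)
per_sample_grads = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))(w, xs, ys)

# grad needs a scalar, so the batched loss is reduced (mean) before differentiating
batched_loss = lambda w: jnp.mean(jax.vmap(loss, in_axes=(None, 0, 0))(w, xs, ys))
mean_grad = jax.grad(batched_loss)(w)

print(per_sample_grads.shape)  # (4, 3)
print(jnp.allclose(per_sample_grads.mean(axis=0), mean_grad))
```

Averaging the per-sample gradients reproduces the batch gradient, which is a handy sanity check when you switch between the two forms.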

Reverse-mode vs Forward-mode Autodiff: JVP, VJP, Dual Numbers, and Which to Use When in LLMs

Q: Is forward-mode autodiff really not popular? Classical courses always show backprop.

Popular, but in different places. Backprop (reverse-mode) is the daily bread of deep learning: a perfect fit for 1 scalar loss + billions of parameters. Forward-mode niches: (1) **Higher-order derivatives** (HVP, Hessian): forward-over-reverse is the standard way to get Hessian-vector products. (2) **Physics-informed NN**: PDE constraints that need derivatives with respect to the inputs. (3) **Sensitivity analysis**: systems with 1 input → many outputs. (4) **Sparse Jacobians**: when most entries are zero, a few well-chosen JVPs can recover the whole matrix. (5) **Memory-friendly**: forward-mode stores no intermediates for a backward pass (O(1) extra memory vs O(N) for reverse-mode), which is sometimes preferred on edge devices.

The two fundamental modes of automatic differentiation: forward-mode (Jacobian-vector product, dual numbers) and reverse-mode (vector-Jacobian product, backprop). Mathematical comparison, computational complexity, JAX's jvp/vjp/grad/hessian, which scenario requires which mode in LLMs.

Şükrü Yusuf KAYA

55 min read

5/13/2026

Advanced


🔄 Forward-mode: backprop's sibling

Every LLM engineer knows backprop (reverse-mode). But very few know that autodiff has two modes, and that forward-mode is used in specific scenarios too. In 55 minutes you will understand what JVP and VJP mean, how to use JAX's jvp/vjp/grad/hessian/jacfwd/jacrev APIs, and why forward-mode is critical for Hessian-vector products. This is the toolkit of advanced ML researchers.

Lesson Map (Detailed)#

Comparison of symbolic, numerical, and automatic differentiation
Forward-mode: intuition via dual numbers
Reverse-mode: the general form of backprop
Anatomy of the Jacobian: full matrix vs implicit
JVP (Jacobian-vector product): the heart of forward-mode
VJP (vector-Jacobian product): the heart of reverse-mode
Computational complexity: which is cheaper when
JAX API: jvp, vjp, grad, jacfwd, jacrev, hessian
Hessian-vector products: the forward-over-reverse trick
Higher-order autodiff: 2nd and 3rd derivatives
LLM applications: Newton's method, K-FAC, Fisher information
In practice: forward-mode in PyTorch (torch.func)

1. Methods for Computing Derivatives: Three Families#

There are three ways to compute the derivative of a function:

a) Symbolic differentiation (analytic)#

Apply the rules of calculus formally:

d/dx (x²) = 2x

d/dx (sin x) = cos x

Tools: SymPy, Mathematica, Maple.

Advantages:

Exact (no rounding)
The resulting formula is reusable

Disadvantages:

Expression swell: the derivative of a complex function can grow exponentially in size
A poor fit for conditionals and loops
Infeasible for neural networks (billions of operations)

b) Numerical differentiation (finite differences)#

f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}

Advantages:

General (works on any function you can evaluate)
Trivial to implement

Disadvantages:

Truncation error (O(h²) for the central difference above)
Round-off error (if h is too small, floating-point precision is lost)
For n parameters: 2n function evaluations with central differences (n+1 with one-sided differences) → O(n) cost
Generally used for gradient checking, not for the real computation
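The tug-of-war between truncation error and round-off error is easy to see numerically. The sketch below uses only the standard library; the test function matches the lesson's sin(x)·eˣ, while the helper names and the list of step sizes are assumptions chosen for illustration.

```python
import math

def f(x):
    return math.sin(x) * math.exp(x)

def df_exact(x):
    # Product rule: (sin x * e^x)' = e^x (sin x + cos x)
    return math.exp(x) * (math.sin(x) + math.cos(x))

def central_diff(f, x, h):
    # Central difference with truncation error O(h^2)
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 2.0
errors = {}
for h in (1e-1, 1e-3, 1e-5, 1e-9, 1e-13):
    errors[h] = abs(central_diff(f, x0, h) - df_exact(x0))
    print(f"h = {h:.0e}   abs error = {errors[h]:.3e}")
```

The error first shrinks as h gets smaller (truncation dominates) and then grows again (round-off dominates), which is why a mid-range step such as 1e-5 is a common default for gradient checks in double precision.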

c) Automatic differentiation (autodiff)#

It combines the strengths of both:

Exact up to floating-point precision (like symbolic, because it applies the chain rule exactly)
Computable (no symbolic expression swell)
Works with conditionals, loops, and recursion
O(N) cost per derivative pass (N = number of operations)

Autodiff has two modes:

Forward-mode (tangent-mode)
Reverse-mode (adjoint-mode, the general form of backprop)

python

import numpy as np
import sympy as sp

# 1. Symbolic
x_sym = sp.Symbol('x')
f_sym = sp.sin(x_sym) * sp.exp(x_sym)
df_sym = sp.diff(f_sym, x_sym)
print("Symbolic f'(x):", df_sym)
# e.g. exp(x)*sin(x) + exp(x)*cos(x)
df_at_2 = float(df_sym.subs(x_sym, 2.0))
print(f"f'(2) symbolic  = {df_at_2:.6f}")

# 2. Numerical (central finite difference)
def f(x):
    return np.sin(x) * np.exp(x)

h = 1e-5
df_num = (f(2.0 + h) - f(2.0 - h)) / (2 * h)
print(f"f'(2) numerical = {df_num:.6f}")

# 3. Automatic differentiation (with JAX)
import jax
import jax.numpy as jnp

def f_jax(x):
    return jnp.sin(x) * jnp.exp(x)

df_auto = jax.grad(f_jax)(2.0)
print(f"f'(2) autodiff  = {df_auto:.6f}")

# All three agree to the printed precision, but their performance characteristics differ a lot

The three differentiation methods side by side: same result, different nature.

2. Forward-mode: Intuition via Dual Numbers#

The idea of dual numbers (Clifford, 1873): extend a real number with an infinitesimal part.

\hat{x} = x + \epsilon \cdot \dot{x}

where ε² = 0 (ε is nilpotent). x is the primal value and ẋ is the tangent (derivative) value.

Arithmetic rules#

(x + ε ẋ) + (y + ε ẏ) = (x + y) + ε(ẋ + ẏ)
(x + ε ẋ) * (y + ε ẏ) = xy + ε(ẋy + xẏ)       [because ε² = 0]

Look at the multiplication rule: the coefficient of ε is exactly the product rule (fg)' = f'g + fg'.

Intuition#

Every dual number (x, ẋ) carries "the value of x and the derivative of x" together. As operations run in the forward pass, dual arithmetic applies the chain rule automatically.

Example#

Compute f'(2) for f(x) = x² + sin(x):

Start: x̂ = 2 + ε·1
x̂² = (2 + ε)² = 4 + 4ε + ε² = 4 + 4ε
sin(x̂) = sin(2) + ε·cos(2)
f(x̂) = (4 + 4ε) + (sin(2) + ε cos(2)) = (4 + sin(2)) + ε(4 + cos(2))

f(2) = 4 + sin(2) ≈ 4.909
f'(2) = 4 + cos(2) ≈ 3.584

That is all. A single forward pass yields both the value and the derivative.

python

import math
 
class Dual:
    """Dual number for forward-mode autodiff."""
    def __init__(self, primal, tangent=0.0):
        self.p = primal     # value
        self.t = tangent    # derivative
 
    def __repr__(self):
        return f"Dual(primal={self.p}, tangent={self.t})"
 
    def __add__(self, other):
        if isinstance(other, Dual):
            return Dual(self.p + other.p, self.t + other.t)
        return Dual(self.p + other, self.t)
 
    def __mul__(self, other):
        if isinstance(other, Dual):
            # (x + ε ẋ)(y + ε ẏ) = xy + ε(ẋy + xẏ)
            return Dual(self.p * other.p, self.t * other.p + self.p * other.t)
        return Dual(self.p * other, self.t * other)
 
    def __pow__(self, n):
        # d/dx x^n = n x^(n-1)
        return Dual(self.p ** n, n * self.p ** (n - 1) * self.t)
 
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
 
def sin(x):
    if isinstance(x, Dual):
        return Dual(math.sin(x.p), math.cos(x.p) * x.t)
    return math.sin(x)
 
def cos(x):
    if isinstance(x, Dual):
        return Dual(math.cos(x.p), -math.sin(x.p) * x.t)
    return math.cos(x)
 
# Usage: x̂ = 2 + ε·1, compute the derivative
x = Dual(2.0, 1.0)  # tangent=1 → "differentiate with respect to x"
y = x ** 2 + sin(x)
print(y)            # Dual(primal=4.909..., tangent=3.584...)
print(f"f(2) = {y.p:.6f}")
print(f"f'(2) = {y.t:.6f}")

# Check: f'(x) = 2x + cos(x) → f'(2) = 4 + cos(2) ≈ 3.584
print(f"Expected: {2*2 + math.cos(2):.6f}")

Forward-mode autodiff with dual numbers in about 40 lines.

💡 The neat secret of dual numbers

Forward-mode autodiff is really an algebraic trick. Because ε² = 0, the coefficient of ε is always the derivative. It is enough to break a complex function into elementary pieces and do the arithmetic; the chain rule is applied automatically. This makes it a bridge between symbolic differentiation (formulas) and numerical differentiation (finite differences).

3. Reverse-mode: The General Form of Backprop#

Reverse-mode traverses the computational graph twice:

Forward pass: compute the values and store the intermediate activations
Backward pass: propagate gradients "backwards" from the output to the inputs

We covered this in detail in Lessons 1.3 and 1.4. The main points:

Topological sort + reverse traversal
A local derivative at every node (the _backward closure)
Gradient accumulation (+=)

Forward vs Reverse: Mental model#

Forward: "If I nudged this one input, how would all the outputs change?"
Reverse: "For this one output, how much does each input contribute to its change?"

LLM training has a single output (the loss) and billions of inputs (the parameters) → reverse-mode is far more efficient.

Image classification: 1 loss, millions of parameters → reverse-mode. Sensitivity analysis (1 input → many outputs) → forward-mode.
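For contrast with the dual-number class, here is a minimal reverse-mode sketch in plain Python. The `Var` class, `var_sin`, and `backward` are hypothetical names for this illustration (a stripped-down cousin of micrograd, not the code from Lessons 1.3 and 1.4): the forward pass records each node's parents and local derivatives, and the backward pass walks the graph in reverse topological order, accumulating with `+=`.

```python
import math

class Var:
    """Scalar node that remembers its parents and the local derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # tuples of (parent Var, d(self)/d(parent))

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def var_sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))

def backward(out):
    # Topological sort (post-order DFS), then reverse traversal
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = 1.0  # seed: d(out)/d(out) = 1
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad  # gradient accumulation

x, y = Var(2.0), Var(3.0)
z = x * y + var_sin(x)   # z = xy + sin(x)
backward(z)
print(x.grad, y.grad)    # dz/dx = y + cos(x), dz/dy = x
```

One backward call fills in the derivative with respect to every input at once, which is exactly the property that makes reverse-mode the right tool for a scalar loss.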

4. Anatomy of the Jacobian#

For f: ℝⁿ → ℝᵐ the Jacobian is:

J = \frac{\partial f}{\partial x} \in \mathbb{R}^{m \times n}, \quad J_{ij} = \frac{\partial f_i}{\partial x_j}

Storing the full Jacobian#

For an LLM, n is in the billions and m = 1 (the loss). The 1 × n Jacobian is just the gradient vector, so it can be stored in practice.

For an intermediate layer of a neural network (e.g. hidden states), both m and n are large (say, millions). The m × n Jacobian can take terabytes, so it is never materialized.

Solution: keep the Jacobian implicit#

Instead of the full Jacobian, we only compute Jacobian × vector (or vector × Jacobian) products. This is much cheaper.

Forward-mode: J · v (Jacobian-vector product = JVP)
Reverse-mode: vᵀ · J (vector-Jacobian product = VJP)

5. JVP: Jacobian-Vector Product (the Heart of Forward-mode)#

The JVP computes the directional derivative of f along an input direction v.

\text{JVP}(f, x, v) = J_f(x) \cdot v = \lim_{h \to 0} \frac{f(x + hv) - f(x)}{h}

Intuition: if I move x a small step in direction v, how much does f change?

In practice#

Dual numbers: pushing the primal forward with ẋ = v is exactly a JVP
For m outputs and n inputs, 1 JVP costs O(N) (N = op count)
n JVPs rebuild the full Jacobian (call n times, once per input basis vector, to get the n columns)

Why is forward-mode a "Jacobian-vector product"?#

Because one forward-mode run with tangent e_i yields one column of the Jacobian (J · e_i = the i-th column), covering all m outputs at once. So the basic operation is the JVP.
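To make "seeding the tangent with v" concrete, here is a small standard-library sketch. The stripped-down `Dual` class, the map `g`, and the helper `jvp` are assumptions for illustration; the point is that the directional derivative comes out of one forward pass without ever forming J.

```python
class Dual:
    """Minimal dual number: value p and tangent t (add and multiply only)."""
    def __init__(self, p, t=0.0):
        self.p, self.t = p, t

    def __add__(self, other):
        return Dual(self.p + other.p, self.t + other.t)

    def __mul__(self, other):
        return Dual(self.p * other.p, self.t * other.p + self.p * other.t)

def g(x0, x1):
    # Illustrative map R^2 -> R^2 (an assumption): g = (x0*x1, x0^2 + x1)
    return x0 * x1, x0 * x0 + x1

def jvp(fun, primals, tangents):
    # Seed each input's tangent with the matching component of v
    duals = [Dual(p, t) for p, t in zip(primals, tangents)]
    outs = fun(*duals)
    return [o.p for o in outs], [o.t for o in outs]

values, directional = jvp(g, (3.0, 4.0), (1.0, 2.0))
print(values)       # [12.0, 13.0]
print(directional)  # [10.0, 8.0]
```

By hand: at (3, 4) the Jacobian is [[4, 3], [6, 1]], and multiplying it by v = (1, 2) gives (10, 8), matching the tangents carried through the forward pass.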

6. VJP: Vector-Jacobian Product (the Heart of Reverse-mode)#

The VJP computes the left-multiplication of the Jacobian by a cotangent u.

\text{VJP}(f, x, u) = u^T \cdot J_f(x)

Intuition: if you assign a "sensitivity" u to the outputs of f, what is the corresponding sensitivity of each input?

In practice#

The backward pass carries the gradient from the output to the inputs
For an LLM loss (m = 1, scalar), u = 1 → 1ᵀ · J = ∇L (the gradient vector)
m VJPs rebuild the full Jacobian (call m times, once per output basis vector, to get the m rows)

Why is reverse-mode a "vector-Jacobian product"?#

Because one backward run with cotangent e_i yields one row of the Jacobian (e_iᵀ · J = the i-th row), covering all n inputs at once. The basic operation is the VJP.
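The column/row picture can be checked directly in JAX. In this sketch the map `g` (with n = 3 inputs and m = 2 outputs) is an illustrative assumption; `jax.jvp` and `jax.vjp` are used as documented, with basis vectors as tangents and cotangents.

```python
import jax
import jax.numpy as jnp

def g(x):
    # Illustrative map R^3 -> R^2 (an assumption), so n = 3 and m = 2
    return jnp.array([x[0] * x[1], jnp.sin(x[2]) + x[0]])

x = jnp.array([1.0, 2.0, 3.0])
n, m = 3, 2

# Forward-mode: one JVP per input basis vector e_j gives column j
columns = [jax.jvp(g, (x,), (jnp.eye(n)[j],))[1] for j in range(n)]
J_from_jvps = jnp.stack(columns, axis=1)   # shape (m, n)

# Reverse-mode: one VJP per output basis vector e_i gives row i
_, vjp_fn = jax.vjp(g, x)
rows = [vjp_fn(jnp.eye(m)[i])[0] for i in range(m)]
J_from_vjps = jnp.stack(rows, axis=0)      # shape (m, n)

print(J_from_jvps.shape, J_from_vjps.shape)
print(jnp.allclose(J_from_jvps, J_from_vjps))
```

Here forward-mode needed 3 passes and reverse-mode only 2, which is the n versus m trade-off in miniature.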

7. Computational Complexity: Which Is Cheaper When#

For a function f: ℝⁿ → ℝᵐ:

Cost of the full Jacobian#

Method	Cost
Forward-mode (n JVPs)	O(n · N)
Reverse-mode (m VJPs)	O(m · N)

N = number of operations (the cost of one forward pass).

Rule#

n < m → forward-mode is cheaper
m < n → reverse-mode is cheaper

In the LLM context#

Scenario	n	m	Right mode
Loss → parameters (gradient)	large (millions to billions)	1	Reverse ← the daily work of LLMs
1 input → many outputs (sensitivity)	1	large	Forward
Hidden state → input gradient	millions	millions	Tie: depends on the structure of the function
Hessian (gradient of the gradient)	millions	millions	Mixed (see below)

The memory tradeoff of reverse-mode#

Forward-mode memory: O(1) extra relative to the primal computation (you only carry the tangents alongside the values). Reverse-mode memory: O(N), because all intermediate activations must be stored for the backward pass.

That is why, for large LLMs:

Reverse-mode is mandatory for the gradient
Activation checkpointing trades recomputation for memory

python

# Comparison: 100 inputs, 1 output (a typical loss)
import jax
import jax.numpy as jnp
import time

def loss_fn(x):
    return jnp.sum(jnp.sin(x) ** 2 + x ** 2)

n = 100
x = jnp.array([0.5] * n)

# Reverse-mode (grad): 1 VJP, O(N)
grad_rev = jax.grad(loss_fn)
t0 = time.perf_counter()
g = grad_rev(x).block_until_ready()  # wait for async dispatch before stopping the clock
t1 = time.perf_counter()
print(f"Reverse-mode (grad): {(t1-t0)*1e6:.1f}μs, grad shape: {g.shape}")

# Forward-mode (n JVPs): O(n·N)
def forward_grad(f, x):
    """Gradient via forward-mode: n JVPs, one per basis vector."""
    grad = jnp.zeros_like(x)
    for i in range(len(x)):
        v = jnp.zeros_like(x).at[i].set(1.0)
        _, jvp_i = jax.jvp(f, (x,), (v,))
        grad = grad.at[i].set(jvp_i)
    return grad

t0 = time.perf_counter()
g_fwd = forward_grad(loss_fn, x).block_until_ready()
t1 = time.perf_counter()
print(f"Forward-mode (n JVPs): {(t1-t0)*1e6:.1f}μs")
print("Same gradient:", jnp.allclose(g, g_fwd))

# Timings are rough: first calls include tracing overhead and nothing is jitted.
# Asymptotically forward-mode needs n passes here, so n=100 → about 100x the work
# n=10^9 (LLM) → forward-mode is infeasible

# The opposite scenario: 1 input, 100 outputs
def multi_output(x):
    return jnp.array([jnp.sin(x * i) for i in range(100)])

# Forward: 1 JVP (1 input) gives all 100 derivatives at once
y, jvp_val = jax.jvp(multi_output, (0.5,), (1.0,))
print(f"Forward: 1 JVP for m=100, derivative shape: {jvp_val.shape}")

# Reverse: 100 VJPs (one per output)
# Much slower

Complexity comparison: which mode to use when.

8. JAX API: The Autodiff Master Kit#

JAX exposes the whole autodiff family through a clean API:

Core#

Function	What it does
`jax.grad(f)`	Reverse-mode gradient (for scalar outputs)
`jax.jvp(f, primals, tangents)`	A single JVP
`jax.vjp(f, *primals)`	Returns the output and a VJP function
`jax.jacfwd(f)`	Full Jacobian via forward-mode (n JVPs)
`jax.jacrev(f)`	Full Jacobian via reverse-mode (m VJPs)
`jax.hessian(f)`	Hessian (jacfwd(jacrev(f)) = forward-over-reverse)

Higher-level#

Function	What it does
`jax.value_and_grad(f)`	Both value and gradient (efficient: one forward and one backward pass)
`jax.vmap(f)`	Vectorize (add a batch axis)
`jax.pmap(f)`	Parallelize (multi-device)

A recommendation attributed to Karpathy#

"If I knew JAX, I would pick JAX over PyTorch for most of my deep learning projects, because its autodiff API is complete and orthogonal."

python

import jax
import jax.numpy as jnp

# Test function
def f(x):
    return jnp.sin(x[0]) * jnp.cos(x[1]) + x[2] ** 2

x = jnp.array([1.0, 2.0, 3.0])

# 1. Gradient (reverse-mode)
grad = jax.grad(f)
print("grad:", grad(x))  # ∇f at x
# ≈ [-0.2248, -0.7651, 6.0]   (cos(1)cos(2), -sin(1)sin(2), 2·3)

# 2. Value + gradient (one forward and one backward pass)
val, gr = jax.value_and_grad(f)(x)
print(f"f(x) = {val:.4f}, ∇f(x) = {gr}")

# 3. JVP: f(x + ε v) ≈ f(x) + ε · J·v
v = jnp.array([1.0, 0.0, 0.0])  # tangent along x[0]
y, jvp_val = jax.jvp(f, (x,), (v,))
print(f"f(x) = {y:.4f}, J·v = {jvp_val:.4f}")
# J·v = ∂f/∂x[0] = cos(1)·cos(2) ≈ -0.225

# 4. VJP: u^T · J
y, vjp_fn = jax.vjp(f, x)
u = jnp.ones_like(y)  # cotangent for the scalar output
grad_via_vjp = vjp_fn(u)  # a tuple with one entry per primal
print(f"VJP grad: {grad_via_vjp[0]}")

# 5. Jacobian (forward vs reverse) for a multi-output function
def g(x):
    return jnp.array([x[0]**2, x[1] * x[2], jnp.sin(x[0])])

J_fwd = jax.jacfwd(g)(x)  # 3 inputs, 3 outputs → 3x3 Jacobian
J_rev = jax.jacrev(g)(x)
print("Forward Jacobian:")
print(J_fwd)
print("Reverse Jacobian:")
print(J_rev)
print("Same:", jnp.allclose(J_fwd, J_rev))  # True

# 6. Hessian
H = jax.hessian(f)(x)
print("Hessian:")
print(H)  # 3x3 symmetric

A full tour of JAX's autodiff API.

9. Hessian-Vector Products: The Forward-over-Reverse Trick#

In modern ML the full Hessian is never materialized (n × n entries, with n in the millions or more). But some algorithms (Newton-CG, Hessian-free optimization, conjugate gradient) only need Hessian-vector products:

\text{HVP}(f, x, v) = H_f(x) \cdot v

The classic route: forward-over-reverse#

Compute the gradient with reverse-mode: g = ∇f
Take the JVP of the gradient function with forward-mode: H·v = JVP(g, x, v)

In JAX this combination is jax.jvp applied to jax.grad(f).

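The reverse-over-reverse alternative#

The same result with two reverse passes: grad(lambda x: vdot(grad(f)(x), v))(x). Slower and more memory-hungry, but possible.

Why is forward-over-reverse better?#

Reverse-mode costs more memory than forward-mode because it has to store intermediates. The gradient computation is already reverse-mode; running forward-mode over it only carries tangents alongside the values, so no second set of stored intermediates is needed.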

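In practice: Conjugate Gradient#

CG solves H x = b while touching H only through products H · p_k:

\alpha_k = \frac{r_k^T r_k}{p_k^T H p_k}, \quad x_{k+1} = x_k + \alpha_k p_k, \quad r_{k+1} = r_k - \alpha_k H p_k

The full H is never needed, only HVPs.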
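A matrix-free Newton step can be sketched by plugging an HVP into a textbook conjugate-gradient loop. Everything here is illustrative: the objective `f` is an assumption chosen so that its Hessian is positive definite, and `cg_solve` is a hypothetical helper, not a JAX API.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative objective with a positive-definite Hessian (an assumption)
    return 0.5 * jnp.sum(x ** 2) + 0.25 * jnp.sum(x ** 4) + 0.5 * jnp.sum(x) ** 2

def hvp(f, x, v):
    # Forward-over-reverse: JVP of the gradient function
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def cg_solve(matvec, b, iters=20, tol=1e-5):
    """Textbook CG for H x = b; H is only accessed through matvec."""
    x = jnp.zeros_like(b)
    r = b              # residual b - H x with x = 0
    p = r
    rs = jnp.vdot(r, r)
    bb = float(jnp.vdot(b, b))
    for _ in range(iters):
        if float(rs) <= (tol ** 2) * bb:
            break      # relative residual is small enough
        Hp = matvec(p)
        alpha = rs / jnp.vdot(p, Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = jnp.vdot(r, r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

x0 = jnp.array([1.0, 2.0, 3.0, 4.0])
g = jax.grad(f)(x0)
newton_step = cg_solve(lambda v: hvp(f, x0, v), g)  # solves H · step = g
print(newton_step)
```

The loop never builds the 4 × 4 Hessian; at LLM scale the same structure is what keeps Newton-CG style methods within O(n) memory.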

python

import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 4) + jnp.sum(jnp.sin(x) ** 2)

x = jnp.array([1.0, 2.0, 3.0, 4.0])
v = jnp.array([1.0, 0.0, 1.0, 0.0])  # an arbitrary direction

# Hessian-vector product: forward-over-reverse
def hvp(f, x, v):
    """H(f, x) · v without storing the full Hessian."""
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

h_v = hvp(f, x, v)
print("H·v:", h_v)

# Check against the full Hessian
H = jax.hessian(f)(x)
h_v_direct = H @ v
print("H @ v (direct):", h_v_direct)
print("Same:", jnp.allclose(h_v, h_v_direct))

# Complexity: one HVP costs O(N); the full Hessian costs O(n·N) (n HVPs)
# For n=1000 that is roughly a 1000x difference in work

Hessian-vector product: a mainstay of modern optimization.

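10. Higher-order Autodiff: 2nd and 3rd Derivatives#

In JAX, higher-order derivatives are just nested transforms: grad(grad(f)), grad(grad(grad(f))), ...

They are possible in PyTorch too, but take care: you need backward(create_graph=True) so that the backward pass itself is recorded.

Use cases#

Newton's method: second-order optimization with the step H⁻¹·∇f
MAML (meta-learning): second-order gradients (the derivative of a gradient)
Physics-informed NN: second-order derivatives in the PDE residuals
Sharpness-aware minimization (SAM): the exact gradient contains a gradient-of-gradient (Hessian-vector) term, which is usually dropped in practice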

A practical warning#

Higher-order is expensive. Second order means differentiating through the backward pass: roughly 2-3x the cost and roughly 2x the memory. Third order costs more still. It is rarely used for LLMs (Newton's method is not practical at LLM scale).
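Nesting really is all it takes in JAX. In this sketch the function `f` is an illustrative assumption with derivatives that are easy to check by hand.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative scalar function (an assumption): f(x) = x^3 + sin(x)
    return x ** 3 + jnp.sin(x)

d1 = jax.grad(f)    # 3x^2 + cos(x)
d2 = jax.grad(d1)   # 6x - sin(x)
d3 = jax.grad(d2)   # 6 - cos(x)

x = 1.0
print(d1(x), d2(x), d3(x))
```

Each extra `grad` differentiates the program produced by the previous one, which is also why the cost grows with every level of nesting.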

11. Forward-mode Applications in LLMs#

Forward-mode plays no role in standard LLM training (m = 1, n huge → reverse-mode is the default). But it shows up in edge cases:

a) Hessian-vector products#

Newton-CG and Hessian-free optimizers need HVPs → forward-over-reverse. (K-FAC and Shampoo are second-order in spirit too, but they build Kronecker-factored curvature approximations rather than calling HVPs.)

b) Fisher Information Matrix vector products#

Natural gradient needs F⁻¹ · ∇L. The FIM is the expected outer product of log-likelihood gradients; its product with a vector is computed matrix-free, much like an HVP.
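One common matrix-free recipe, sketched here for a softmax classifier, is a JVP through the logits, a multiply by the logit-space Fisher (diag(p) − ppᵀ), and a VJP back to parameter space. The toy model `logits_fn`, its sizes, and the helper name `fisher_vector_product` are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

def logits_fn(theta, x):
    # Toy linear softmax classifier: 3 classes, 4 features (an assumption)
    return theta.reshape(3, 4) @ x

def fisher_vector_product(theta, x, v):
    f = lambda t: logits_fn(t, x)
    z, Jv = jax.jvp(f, (theta,), (v,))       # forward-mode: J v
    p = jax.nn.softmax(z)
    FzJv = p * Jv - p * jnp.vdot(p, Jv)      # (diag(p) - p p^T) (J v)
    _, vjp_fn = jax.vjp(f, theta)
    return vjp_fn(FzJv)[0]                   # reverse-mode: J^T (...)

theta = jnp.linspace(-1.0, 1.0, 12)
x = jnp.array([1.0, -0.5, 2.0, 0.3])
v = jnp.arange(12.0) / 12.0
Fv = fisher_vector_product(theta, x, v)
print(Fv.shape)  # (12,)
```

The 12 × 12 Fisher matrix is never formed; for a small model like this you can build it explicitly from the per-class log-probability gradients and confirm that multiplying it by v gives the same vector.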

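c) Influence functions#

For the question "Which training example is most responsible for the prediction on this test example?" (Koh & Liang 2017):

\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}})^T \, H^{-1} \, \nabla_\theta L(z)

HVP-based: H⁻¹ is never formed, only applied through iterative solves.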

d) Adversarial training (Lipschitz)#

Robustness analysis needs Jacobian norms → forward-mode is efficient when the input dimension is small.

e) Diffusion models (score matching)#

∇_x log p(x) and its derivatives: some algorithms involve second-order derivatives.

f) Physics-informed NN (PINN)#

PDE constraints need 2nd- or 3rd-order derivatives with respect to the input → forward-mode is usually preferred.
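As a toy version of a PINN residual, the sketch below takes two nested forward-mode derivatives with respect to the input. The candidate solution `u` and the ODE u'' + 4u = 0 are assumptions picked so that the residual should vanish; a real PINN would put a neural network in place of `u`.

```python
import jax
import jax.numpy as jnp

def u(x):
    # Candidate solution of u'' + 4u = 0 (illustrative choice, an assumption)
    return jnp.sin(2.0 * x)

# Forward-over-forward: each jacfwd pushes one tangent through the function
du = jax.jacfwd(u)
d2u = jax.jacfwd(du)

def residual(x):
    return d2u(x) + 4.0 * u(x)

x = jnp.asarray(0.3)
print(float(residual(x)))  # close to 0, up to float32 rounding
```

Because the input here is a single scalar, every derivative costs one forward pass, which is the regime where forward-mode is the natural choice.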

12. Forward-mode in PyTorch: `torch.func`#

Before PyTorch 2.0, forward-mode was not part of the stable core API (it was available through the separate functorch package). Now the torch.func module (formerly functorch) brings a JAX-like API:

import torch
from torch.func import grad, jvp, vjp, jacfwd, jacrev, hessian, vmap

def f(x):
    return torch.sin(x[0]) * torch.cos(x[1]) + x[2] ** 2

x = torch.tensor([1.0, 2.0, 3.0])

# Reverse-mode gradient
g = grad(f)(x)
print(g)

# Forward-mode Jacobian
J = jacfwd(f)(x)
print(J)

`torch.autograd.functional` (older API)#

PyTorch 1.x offers similar functionality, but it is less ergonomic. For modern PyTorch, prefer torch.func.

python

import torch
from torch.func import grad, jvp, vjp, hessian, vmap

# Forward pass of a mini transformer block (toy)
def mini_attention(qkv_weights, x):
    """qkv_weights: (3 * d_model, d_model), x: (seq, d_model)"""
    d = x.shape[-1]
    qkv = x @ qkv_weights.T                # (seq, 3*d)
    q, k, v = qkv.chunk(3, dim=-1)
    scores = q @ k.T / (d ** 0.5)
    attn = torch.softmax(scores, dim=-1)
    return (attn @ v).sum()                # scalar loss

# Test data
torch.manual_seed(0)
d_model = 8
seq_len = 4
W = torch.randn(3 * d_model, d_model)      # torch.func transforms do not need requires_grad
x = torch.randn(seq_len, d_model)

# Gradient w.r.t. W (reverse-mode, the right choice for a gradient)
grad_W = grad(lambda W: mini_attention(W, x))(W)
print(f"grad shape: {grad_W.shape}")

# Hessian-vector product (forward-over-reverse)
v = torch.randn_like(W)
hv = jvp(grad(lambda W: mini_attention(W, x)), (W,), (v,))[1]
print(f"HVP shape: {hv.shape}")

# Batched gradient with vmap: one gradient per batch element
batch_x = torch.randn(16, seq_len, d_model)
batched_grad = vmap(lambda x_b: grad(lambda W: mini_attention(W, x_b))(W))(batch_x)
print(f"Batched grad shape: {batched_grad.shape}")  # (16, 24, 8)

A JAX-like autodiff API with torch.func.

13. Autodiff Implementation Strategies (Bonus)#

How are real autodiff engines implemented? Four strategies:

a) Operator overloading (dynamic graph)#

Override the class's __add__, __mul__, and similar methods → every op creates a graph node as it executes.

PyTorch, micrograd, and a NumPy dual-number class take this approach.

b) Source transformation (static graph)#

Analyze the code at compile time and generate the derivative code automatically. Examples: Tapenade (for Fortran/C) and Zygote (Julia); TensorFlow 1.x differentiates a statically built graph in a similar spirit.

c) Tracing (hybrid)#

Execute the function once with tracer values, record the graph, then transform that graph. JAX's transforms (grad, jit) follow this pattern, with XLA compiling the result.
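You can look at the recorded program directly with `jax.make_jaxpr`. This sketch reuses the lesson's sin(x)·eˣ function; the variable names are assumptions.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * jnp.exp(x)

# Tracing: run f once on abstract tracer values and record the primitive ops
jaxpr_f = jax.make_jaxpr(f)(2.0)

# The gradient comes from transforming the traced program, not the Python source
jaxpr_df = jax.make_jaxpr(jax.grad(f))(2.0)

print(jaxpr_f)
print(jaxpr_df)
```

The first printout lists the primitives of `f` itself, and the second shows the derivative program that `grad` built from that trace, including the `cos` introduced by differentiating `sin`.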

d) Just-in-time + caching#

Cache the compiled graph for repeated input shapes. Modern torch.compile and JAX jit do this.

Which one for LLMs?#

Training: dynamic graph (PyTorch eager), for ease of debugging
Production inference: traced/jitted (torch.compile or JAX jit), for performance
Hybrid: the 2026 standard, where a torch.compile decorator speeds up eager code

14. Mini Exercises#

exp with dual numbers: add an exp function for the Dual class, using d/dx eˣ = eˣ. Test it.
JVP complexity: f: ℝ → ℝ¹⁰⁰. How many JVPs does forward-mode need for the full Jacobian? How many VJPs does reverse-mode need?
HVP verification: for the 2-d function f(x, y) = x²y + sin(xy), compute the Hessian-vector product H · v with forward-over-reverse. Compare it against the full Hessian.
Forward vs reverse trade-off: f: ℝ¹⁰⁰⁰ → ℝ¹⁰⁰⁰ (a 1000×1000 Jacobian). Which mode is cheaper? What if you want the gradient norm at every training example in the dataset?
PyTorch torch.func test: in the mini_attention above, which API would you use to compute the cross-Hessian between the Q and K weights?

What Did We Learn in This Lesson?#

✓ Comparison of symbolic, numerical, and automatic differentiation
✓ Forward-mode intuition via dual numbers (a 40-line implementation)
✓ Reverse-mode = the general form of backprop
✓ JVP (Jacobian-vector product) → the atom of forward-mode
✓ VJP (vector-Jacobian product) → the atom of reverse-mode
✓ Complexity rule: n < m → forward, m < n → reverse
✓ JAX API: jvp, vjp, grad, jacfwd, jacrev, hessian
✓ HVP = the forward-over-reverse trick, a mainstay of modern optimization
✓ Higher-order autodiff: 2nd and 3rd derivatives
✓ Forward-mode use cases in LLMs: HVP, FIM·v, influence functions
✓ PyTorch torch.func: a modern JAX-like API

Next Lesson#

2.4: Tensor Autograd from Scratch with NumPy: Building a Mini-Tinygrad. In 1.4 we wrote a scalar micrograd. Next comes a tensor-based mini-tinygrad: a Tensor class on top of NumPy, broadcasting-aware backward, view/copy-aware gradient flow, and speed that gets close to a GPU. Real autograd in ~400 lines.

