How does PyTorch detect that a DAG forms a cycle?

PyTorch autograd implicitly avoids cycles because tensors are **immutable** — you do math on a tensor to produce another, not modify itself (in forward). For inplace ops like `x += y`, PyTorch keeps a 'version counter' — during gradient computation, if version mismatch is detected, it raises `RuntimeError` ("one of the variables needed for gradient computation has been modified"). RNN-style recursion is not a cycle either because each iteration creates new nodes.

Is static graph compile time a real production problem?

Yes, account for it. JAX `jit` first call can take 1-30 seconds (graph build + XLA compile). In production: (1) Warmup at server start (preload + first call). (2) Keep shapes stable — re-compile on shape change. (3) Caching: compiled kernels for same shapes are retained. PyTorch `torch.compile` also takes 5-60s first forward; then fast. Standard practice: warm-up at server boot for LLM inference.

Does `torch.compile` always speed things up?

No — can slow down in some scenarios. (1) **Very small models** (<10M params): compile overhead exceeds gain. (2) **Dynamic shapes**: re-compile per shape. (3) **Frequent graph breaks** (custom Python objects, prints): compile useless. (4) **Already-optimized models**: NVIDIA Apex fused kernels do what `torch.compile` would. **Measure**: use `torch.utils.benchmark.Timer` to compare. For LLMs 1B+, almost always beneficial.

Is JAX's functional purity requirement a problem for LLM training?

It matters. (1) **State management**: model state (weights, optimizer state) managed via `jax.lax.scan` or `flax.training.train_state` — different from Python classes. (2) **Randomness**: explicit key splitting with `jax.random.PRNGKey` — not a free function like `torch.randn`. (3) **No side effects**: use `jax.debug.print` or outside-the-jit for print/log. Trade-off with Pythonic-ness — but performance + multi-device parallelism (`pmap`) advantages are huge. Anthropic, DeepMind, OpenAI (some parts) using JAX shows it's worthwhile.

MLX vs PyTorch (MPS backend) — which is better for Apple Silicon?

Within Apple ecosystem, **MLX** is faster (~30-50%): direct Metal Performance Shaders, unified memory design, M-series optimization. PyTorch MPS backend still maturing — some ops fallback to CPU (slow). **Practical**: MLX for local Mac development, PyTorch for cross-platform production (deploy on Linux server). Module 5 (PyTorch Engineering) deepens MPS knowledge.

Computational Graph Deep Dive: DAG Structure, Topological Sort, Eager vs Static Paradigm

Q: Why is PyTorch eager mode more popular than JAX static graph?

Eager's **debug experience** and **Pythonic feel** dramatically improved developer experience. JAX's function-pure paradigm (no state, side-effects) has steep learning curve; PyTorch lets you write ML code that feels like Python. PyTorch added `torch.compile` (2.0+) bringing static graph benefits → best-of-both-worlds. JAX still dominates Google ecosystem and TPU; strong in research but limited industry adoption.

Q: Is JAX's functional purity requirement a problem for LLM training?

It matters. (1) **State management**: model state (weights, optimizer state) managed via `jax.lax.scan` or `flax.training.train_state` — different from Python classes. (2) **Randomness**: explicit key splitting with `jax.random.PRNGKey` — not a free function like `torch.randn`. (3) **No side effects**: use `jax.debug.print` or outside-the-jit for print/log. Trade-off with Pythonic-ness — but performance + multi-device parallelism (`pmap`) advantages are huge. Anthropic, DeepMind, OpenAI (some parts) using JAX shows it's worthwhile.

Q: MLX vs PyTorch (MPS backend) — which is better for Apple Silicon?

Within Apple ecosystem, **MLX** is faster (~30-50%): direct Metal Performance Shaders, unified memory design, M-series optimization. PyTorch MPS backend still maturing — some ops fallback to CPU (slow). **Practical**: MLX for local Mac development, PyTorch for cross-platform production (deploy on Linux server). Module 5 (PyTorch Engineering) deepens MPS knowledge.

Deep analysis of the graph structure behind autograd: DAG anatomy, in-degree/out-degree, topological sort algorithms (DFS post-order, Kahn's), eager (PyTorch) vs static (TF1, JAX, XLA) paradigms, graph optimization (fusion, dead code elimination).

Şükrü Yusuf KAYA

38 min read

5/13/2026

Intermediate

Computational Graph Derinden: DAG Yapısı, Topological Sort, Eager vs Static Paradigma

🕸️ Autograd'in DNA'sı: bir graph

Ders 1.4'te micrograd ile basit DAG yaptık. Şimdi gerçek autograd'in graph yapısını detaylandıracağız: in-degree/out-degree, topological sort'un farklı algoritmaları (Kahn vs DFS), graph optimization, ve PyTorch eager vs JAX static graph paradigmaları arasındaki felsefik fark.

Ders Haritası#

DAG nedir? Graph teorisi 101
Computational graph anatomisi: forward edges, gradient edges
In-degree, out-degree: erişim sayıları
Topological sort: DFS post-order vs Kahn's
Reverse traversal: backward'ın doğru sırası
Eager execution (PyTorch, NumPy)
Static graph / lazy execution (TF1, JAX, XLA)
Graph optimization: fusion, dead code, kernel scheduling
torch.compile
— best of both worlds
Modern alternatifler: MLX, TinyGrad, Mojo

1. DAG Nedir? Graph Teorisi 101#

Graph = (V, E). V vertices (node), E edges (kenar).

Directed: kenarların yönü var (
a → b
≠
b → a
)
Acyclic: hiçbir yönlü döngü yok (a → b → c → a yasak)
Connected olmayabilir (birden çok component)

DAG (Directed Acyclic Graph) = yönlü ve döngüsüz.

Computational graph bir DAG çünkü#

Yönlü: data forward'da gidiyor (input → output)
Acyclic: bir değer kendi hesabında kullanılamaz (recursion ≠ cycle, çünkü her recursion ayrı node)

Önemli kavramlar#

Source / root: in-degree 0 olan node (girdi)
Sink / leaf: out-degree 0 olan node (çıktı)
Path: source'tan sink'e bir yol
DAG'ın "uzunluğu": en uzun path (depth)

2. Computational Graph Anatomisi#

Bir hesap için graph şu yapıdadır:

       x      y       (inputs, leaf)
       \    /
        (+)           (Add op node)
         |
         a = x + y
         |
        (*) ── w       (Mul op + extra input)
         |
         L = a * w     (output)

Node tipleri#

Leaf nodes (input): no incoming edges.
requires_grad=True
olanlar parametre.
Op nodes: incoming edges = input'lar; out-going edges = output (kullanıldığı yerler).
Output node: final loss.

Edge'ler ne taşır?#

Forward edge (input → op → output): değerleri (tensor data). Backward edge (output → op → input): gradient'leri.

Önemli: backward graph ayrı#

PyTorch'ta forward yaparken

grad_fn

linkleriyle ayrı bir backward graph kuruluyor.

backward()

çağrıldığında bu graph traverse ediliyor.

python

import torch
 
# Simple computation graph
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
 
a = x + y                    # AddBackward0
b = a * x                    # MulBackward0
L = b ** 2                   # PowBackward0
 
print(f"a.grad_fn: {a.grad_fn}")   # <AddBackward0>
print(f"b.grad_fn: {b.grad_fn}")   # <MulBackward0>
print(f"L.grad_fn: {L.grad_fn}")   # <PowBackward0>
 
# Graph traversal — next_functions
print(f"L's parents: {L.grad_fn.next_functions}")
# (b'nin grad_fn, 0)
 
# Tüm graph'i traverse et
def print_graph(grad_fn, indent=0):
    print(" " * indent + str(grad_fn))
    for parent, _ in grad_fn.next_functions:
        if parent is not None:
            print_graph(parent, indent + 2)
 
print_graph(L.grad_fn)
# <PowBackward0>
#   <MulBackward0>
#     <AddBackward0>
#       <AccumulateGrad>   # x'in leaf'i
#       <AccumulateGrad>   # y'in leaf'i
#     <AccumulateGrad>      # x tekrar
 
L.backward()
print(f"x.grad: {x.grad}, y.grad: {y.grad}")

PyTorch'un backward graph'ini açığa çıkarmak.

3. In-degree, Out-degree#

In-degree(v): v'ye giren kenar sayısı. Op node için: input sayısı. Out-degree(v): v'den çıkan kenar sayısı. Bir tensor kaç farklı yerde kullanılıyorsa.

Out-degree LLM'de önemli — niye?#

Bir tensor (örn. embedding'in çıktısı) birden çok layer'a beslenir:

Decoder layer
Residual connection
Loss hesabı

Out-degree 3 demek: o tensor'ın gradient'i 3 farklı yerden geliyor → toplanır (chain rule sum).

x = torch.tensor(2.0, requires_grad=True)
a = x ** 2          # x kullanıldı (1. use)
b = x + 1           # x kullanıldı (2. use)
L = a + b           # L = x² + x + 1
L.backward()
print(x.grad)       # 2x + 1 = 5 (her iki path'ten toplam)

In-place ops uyarısı#

x += 1

gibi inplace operasyonlar autograd graph'i karıştırabilir çünkü

x

'in eski değeri kayboluyor. PyTorch genelde detect eder ve hata fırlatır.

4. Topological Sort — DFS vs Kahn's#

Bir DAG'ı lineer sırayla dizmek. Sıra: her node ondan önce gelen tüm node'lardan sonra geliyor.

Algoritma 1: DFS Post-order (Tarjan)#

visited = set()
result = []

def dfs(node):
    if node in visited: return
    visited.add(node)
    for child in node.children:
        dfs(child)
    result.append(node)   # post-order: çocuklardan sonra ekle

dfs(root)
# result child'lardan ebeveynlere sırayla
# Reverse(result) = topological sort

Karmaşıklık: O(V+E). Stack'e bağlı (recursion depth bir kısıt).

Karpathy'nin micrograd'ı bunu kullanıyor (Ders 1.4'te gördük).

Algoritma 2: Kahn's BFS#

in_degree = {v: count for v in nodes}
queue = [v for v in nodes if in_degree[v] == 0]
result = []

while queue:
    node = queue.pop(0)
    result.append(node)
    for child in node.children:
        in_degree[child] -= 1
        if in_degree[child] == 0:
            queue.append(child)

Karmaşıklık: O(V+E). In-degree'leri precompute etmen gerek.

Hangisi ne zaman?#

Algoritma	Avantaj	Dezavantaj
DFS	Recursive, kısa kod	Stack overflow riski derin graph'larda
Kahn	Iterative, derin graph dostu	In-degree precompute gerekli

PyTorch autograd internally DFS-based. Triton ve XLA Kahn's varyantı kullanıyor.

python

# DAG node — simple example
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = children
 
# Örnek graph: a → b, c; b → d; c → d; d → e
e = Node('e')
d = Node('d', children=[e])
c = Node('c', children=[d])
b = Node('b', children=[d])
a = Node('a', children=[b, c])
 
# DFS post-order
def topo_dfs(root):
    visited = set()
    result = []
    def dfs(node):
        if node.name in visited:
            return
        visited.add(node.name)
        for child in node.children:
            dfs(child)
        result.append(node)
    dfs(root)
    return result   # reverse for backward
 
order = topo_dfs(a)
print("DFS order:", [n.name for n in order])
# ['e', 'd', 'b', 'c', 'a']
 
# Kahn's algorithm
def topo_kahn(root):
    # Önce in-degree hesapla
    nodes = set()
    in_deg = {}
    queue = [root]
    while queue:
        n = queue.pop()
        if n in nodes:
            continue
        nodes.add(n)
        in_deg.setdefault(n, 0)
        for c in n.children:
            in_deg[c] = in_deg.get(c, 0) + 1
            queue.append(c)
 
    queue = [n for n in nodes if in_deg[n] == 0]
    result = []
    while queue:
        n = queue.pop(0)
        result.append(n)
        for c in n.children:
            in_deg[c] -= 1
            if in_deg[c] == 0:
                queue.append(c)
    return result
 
order2 = topo_kahn(a)
print("Kahn's order:", [n.name for n in order2])
# ['a', 'b', 'c', 'd', 'e']  # forward order

İki topological sort algoritmasının yan yana implementasyonu.

5. Reverse Traversal — Backward'ın Doğru Sırası#

Backward pass için:

Topological order'ı hesapla
Reverse'le
Her node'un
_backward
(lokal türev)'ini çağır

def backward(root):
    topo = topo_dfs(root)
    root.grad = 1.0
    for node in reversed(topo):
        node.backward_local()

Niye reverse?#

Output'tan başlayıp input'a doğru gitmemiz lazım — gradient'ler output'tan akıyor. Forward order'da gitsek, input'tan başlardık ve gradient bilgisi henüz yok.

Memory implication#

Forward pass'te tüm ara aktiviteler saklanır (gradient hesabı için). Bu bellek pahalısı:

70B model, batch 4, seq 4096 → ~50-100 GB aktivasyon

Çözümler:

Activation checkpointing: bazı aktiviteleri sakla, gerektiğinde re-compute
Gradient accumulation: küçük batch + accumulate
Modül 17'de detayda

6. Eager Execution Paradigması#

Eager execution = her operasyon anında yürütülür. Graph build edilmez (yani implicit graph yaratılır ama hesap hemen yapılır).

import torch
x = torch.tensor([1.0, 2.0, 3.0])
y = x * 2          # hemen hesaplanır, [2, 4, 6]
z = y.sum()        # hemen hesaplanır, 12.0

Avantajlar#

Pythonic: print, breakpoint, dynamic shapes hepsi çalışır
Debugging: error stack trace doğal
Dinamik: if/else, for loops graph'a etki etmez

Dezavantajlar#

Optimization sınırı: graph görmediği için fuse edilmez, schedule edilmez
Overhead: her op için Python interpreter geçişi

PyTorch, NumPy, TensorFlow 2.0+ default'tan eager.

7. Static Graph / Lazy Execution#

Static graph = önce graph'i build et (operasyon yazılır ama yapılmaz), sonra derleyici optimize edip yürütür.

# JAX örneği
import jax.numpy as jnp
from jax import jit

def fn(x):
    return jnp.sum(x * 2)

# fn'i compile et
fn_jit = jit(fn)

# İlk çağrı: graph build + compile (yavaş)
result = fn_jit(jnp.array([1.0, 2.0, 3.0]))

# Sonraki çağrılar: optimized binary (çok hızlı)
result = fn_jit(jnp.array([4.0, 5.0, 6.0]))

Avantajlar#

Optimization: graph'i görmek = fuse + schedule + parallelize + memory plan
Distributed: graph parçalanıp farklı device'larda çalıştırılabilir
Hardware-specific compilation (TPU, GPU)

Dezavantajlar#

Dynamic shape'ler zor: graph build edilirken shape sabit olmalı
Debug zor: error mesajları compile edilmiş kodda
Python kontrolü kaybediliyor: print, breakpoint default çalışmaz

Frameworks#

TensorFlow 1.x: tam static
JAX: opt-in (
jit
)
XLA: Google'ın static graph compiler'ı (TF, JAX altında)

8. Graph Optimization — Statik Graph'ın Süper Gücü#

Static graph derleyicisi neler yapabilir?

Operator fusion#

y = relu(x * w + b)

— 3 operation (mul, add, relu). Fused: tek bir kernel. Daha az memory traffic + daha hızlı.

Dead code elimination#

Hesaplanmış ama hiç kullanılmayan ara değerleri çıkar.

Constant folding#

x * 2 + 0 * y

→

x * 2

(compile time'da hesapla).

Memory planning#

Hangi tensor ne zaman yaşayacak/ölecek → optimal allocator strategy.

Kernel scheduling#

Bağımsız operasyonları paralel çalıştır (multi-stream).

Hardware-specific tiling#

TPU bloklarına uy, CUDA warp size'a uy.

Pratik etki#

JAX (XLA arka uçla) bazı LLM workload'larında PyTorch'tan %30-50 hızlı. PyTorch buna yetişmek için

torch.compile

ekledi (2.0+).

9.
`torch.compile`
— Best of Both Worlds#

PyTorch 2.0 (2023)

torch.compile

getirdi: eager Python kodunu arkadan derler, static graph'in optimization avantajlarını kazandırır.

import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8),
    num_layers=6,
)
compiled = torch.compile(model)   # Tek satır!

# Sonraki forward'lar derlenmiş kodla
x = torch.randn(32, 100, 512)
y = compiled(x)

Nasıl çalışıyor?#

TorchDynamo: Python bytecode'unu yakalar
TorchInductor: graph'i build eder, optimize eder
Triton: GPU için optimized kernel'lar üretir

Kazanç#

LLM training: %20-30 hızlanma typical
Inference: %50+ hızlanma mümkün
DeepSeek-V3 paper:
torch.compile
adoption bahsediyor

Sınırlar#

First call yavaş (compile time)
Dynamic shape'ler sınırlı (PyTorch 2.1+'de daha iyi)
Custom Python objects bazen yakalanamaz

10. Modern Alternatifler#

MLX (Apple, 2023)#

Apple Silicon için optimize
Lazy evaluation (default JAX-vari)
Eager-like API ama static graph optimization
Mac kullanıcıları için doğal seçim

tinygrad (George Hotz)#

5000 satırlık framework
Static graph + lazy
"Pedagojik amaç" ile başladı, production'a evolve ediyor
Kavramları temiz görmek için süper

Mojo (Modular AI)#

Yeni dil (Python superset)
AI için designed
Tensor first-class, autograd built-in
2026'da öne çıkıyor, henüz erken

TensorFlow 2.x#

Hâlâ var, ama LLM topluluğu PyTorch'a kaymış
TPU'lar için JAX'i tercih eden Google

JAX#

Research'te dominant (DeepMind, Anthropic, Google)
Function-pure, compilation-first
LLM tarafında PyTorch ile rakip

11. Mini Egzersizler#

DAG validation:
a → b, b → c, c → a
. DAG mi?
Topological sort uniqueness:
a → b, a → c, b → d, c → d
. Kaç farklı topological order var?
PyTorch graph traversal: Bir torch tensor'un
grad_fn
'sini başlangıçtan recursive traverse eden bir fonksiyon yaz.
Eager vs Static:
for i in range(N): x = x * 2
— eager'da hangi pratik problem? Static'te?
torch.compile
ne yapamaz: hangi durumlarda
torch.compile
etkili değil veya kullanılamaz?

Bu Derste Neler Öğrendik?#

✓ DAG = yönlü + döngüsüz graph ✓ Computational graph anatomisi: leaf, op, output node'ları ✓ In-degree, out-degree ve gradient accumulation ✓ Topological sort: DFS post-order vs Kahn's BFS ✓ Reverse traversal = backward'ın doğru sırası ✓ Eager execution (PyTorch default) avantaj/dezavantaj ✓ Static graph (JAX, TF1) ve graph optimization ✓ torch.compile
— eager + compilation hybrid ✓ Modern alternatifler: MLX, tinygrad, Mojo, JAX

Sıradaki Ders#

2.3 — Reverse-mode vs Forward-mode Autodiff: Matematiksel Karşılaştırma Backprop'un "reverse-mode autodiff" olduğunu biliyoruz. Peki forward-mode ne? Jacobian-vector vs vector-Jacobian products. Her birinin matematik karakteristiği ve LLM'de kullanım yerleri.

Frequently Asked Questions

Eager's **debug experience** and **Pythonic feel** dramatically improved developer experience. JAX's function-pure paradigm (no state, side-effects) has steep learning curve; PyTorch lets you write ML code that feels like Python. PyTorch added `torch.compile` (2.0+) bringing static graph benefits → best-of-both-worlds. JAX still dominates Google ecosystem and TPU; strong in research but limited industry adoption.

Yorumlar & Soru-Cevap

(0)

Yorum yazmak için giriş yap.

Yorumlar yükleniyor...

Ders Haritası#

1. DAG Nedir? Graph Teorisi 101#

Computational graph bir DAG çünkü#

Önemli kavramlar#

2. Computational Graph Anatomisi#

Node tipleri#

Edge'ler ne taşır?#

Önemli: backward graph ayrı#

3. In-degree, Out-degree#

Out-degree LLM'de önemli — niye?#

In-place ops uyarısı#

4. Topological Sort — DFS vs Kahn's#

Algoritma 1: DFS Post-order (Tarjan)#

Algoritma 2: Kahn's BFS#

Hangisi ne zaman?#

5. Reverse Traversal — Backward'ın Doğru Sırası#

Niye reverse?#

Memory implication#

6. Eager Execution Paradigması#

Avantajlar#

Dezavantajlar#

7. Static Graph / Lazy Execution#

Avantajlar#

Dezavantajlar#

Frameworks#

8. Graph Optimization — Statik Graph'ın Süper Gücü#

Operator fusion#

Dead code elimination#

Constant folding#

Memory planning#

Kernel scheduling#

Hardware-specific tiling#

Pratik etki#

9. torch.compile — Best of Both Worlds#

Nasıl çalışıyor?#

Kazanç#

Sınırlar#

10. Modern Alternatifler#

MLX (Apple, 2023)#

tinygrad (George Hotz)#

Mojo (Modular AI)#

TensorFlow 2.x#

JAX#

11. Mini Egzersizler#

Bu Derste Neler Öğrendik?#

Sıradaki Ders#

Frequently Asked Questions

Why is PyTorch eager mode more popular than JAX static graph?

How does PyTorch detect that a DAG forms a cycle?

Is static graph compile time a real production problem?

Does `torch.compile` always speed things up?

Is JAX's functional purity requirement a problem for LLM training?

MLX vs PyTorch (MPS backend) — which is better for Apple Silicon?

Yorumlar & Soru-Cevap

Related Content

Who Is an LLM Engineer? The AI Engineering Career Ladder from Junior to Staff

Course Philosophy: Why This Path, Why This Order — The Skeleton of an 8-Month Curriculum

Workshop Setup: uv, PyTorch 2.5+, CUDA, WSL2, Mac MPS, Triton, FlashAttention, Nsight

9.
`torch.compile`
— Best of Both Worlds#