Why combine softmax and cross-entropy instead of computing the softmax derivative and chaining with cross-entropy?

Two reasons. (1) **Numerical stability**: computing softmax and then log as separate steps is fragile, because \(e^{z_i}\) can overflow and a probability that underflows to zero makes the log return -inf. A combined `log_softmax` uses the **log-sum-exp trick**, which stays finite for any finite logits. (2) **Derivative simplification**: the combined gradient collapses to \(y - t\), a single vector subtraction. Done separately, you would multiply the dense V×V softmax Jacobian by the cross-entropy gradient, which costs more and accumulates more rounding error. PyTorch's `F.cross_entropy` fuses both steps (it takes raw logits and integer labels).
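To make the stability point concrete, here is a minimal pure-Python sketch of the log-sum-exp trick. The helper names `log_softmax` and `naive_log_softmax` are illustrative only and are not PyTorch's implementation.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    # log-sum-exp trick: log(sum(exp(z))) = m + log(sum(exp(z - m)))
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def naive_log_softmax(logits):
    """Naive version: exponentiate first, then normalize and take the log."""
    exps = [math.exp(z) for z in logits]  # overflows for large z
    total = sum(exps)
    return [math.log(e / total) for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]

print(log_softmax(small))
print(log_softmax(large))   # same values as above: softmax is shift-invariant

try:
    naive_log_softmax(large)
except OverflowError as err:
    print("naive version failed:", err)
```

Both inputs differ only by a constant shift, so the stable version returns the same log-probabilities, while the naive version cannot even evaluate `exp(1000)`.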

If the Hessian is so expensive, how do optimizers use it?

Modern optimizers (Adam, AdamW, Lion, Muon) **never form the full Hessian**. Adam and AdamW keep \(\hat{v}_t\), an EMA of **squared gradients**: a per-parameter second-moment estimate that acts as a cheap, diagonal stand-in for curvature, at O(n) memory and time. More structured preconditioners include **K-FAC** (per-layer, Kronecker-factored), **Shampoo**, and **Muon** (2024). The full Hessian appears only in theoretical analysis (loss landscape studies); production training relies on diagonal or block approximations.
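As a rough illustration of the squared-gradient EMA, the sketch below performs one Adam update on plain Python lists. The function name `adam_step` is hypothetical, and the hyperparameters are the commonly used defaults rather than values taken from this lesson.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on plain Python lists; returns new (params, m, v)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # EMA of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # EMA of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

params = [1.0, 1.0]
grads = [100.0, 0.01]      # very different gradient scales
m = [0.0, 0.0]
v = [0.0, 0.0]
params, m, v = adam_step(params, grads, m, v, t=1)
print(params)   # both coordinates move by roughly lr despite the scale gap
```

The per-parameter division by the root of the second moment is what gives each coordinate its own effective step size, using only O(n) extra state.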

Why is backprop 'reverse mode' autodiff? Why not forward-mode?

Forward and reverse mode are the two main autodiff strategies. **Forward mode** needs one pass per input: O(n_inputs × m_ops). **Reverse mode** yields the gradients with respect to all inputs in a single backward pass: O(m_ops) per output. In LLMs the inputs (parameters) number in the billions while the output (loss) is a scalar, so **reverse mode** is far more efficient: one scalar, billions of gradients. **Forward mode** pays off when a function has few inputs and many outputs; its primitive is the Jacobian-vector product. PyTorch defaults to reverse mode; `torch.func.jvp` exposes forward mode.
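The cost difference can be seen on a toy function. The sketch below assumes a PyTorch version that ships `torch.func` (2.0 or later); the function `f` is made up for illustration.

```python
import torch
from torch.func import jvp

def f(x):
    # Toy scalar "loss" of three inputs (illustrative, not from the lesson).
    return (x ** 2).sum() + x[0] * x[1]

x = torch.tensor([1.0, 2.0, 3.0])

# Reverse mode: one backward pass gives the whole gradient.
x_rev = x.clone().requires_grad_(True)
(grad_rev,) = torch.autograd.grad(f(x_rev), x_rev)

# Forward mode: one JVP per input direction, so n passes for n inputs.
grad_fwd = []
for i in range(x.numel()):
    e_i = torch.zeros_like(x)
    e_i[i] = 1.0
    _, directional = jvp(f, (x,), (e_i,))
    grad_fwd.append(directional)
grad_fwd = torch.stack(grad_fwd)

print(grad_rev)
print(grad_fwd)
```

With three inputs the loop is harmless; with billions of parameters it would mean billions of forward passes, which is why training uses reverse mode.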

Do I need to write manual backprop, or is autograd always enough?

In production, autograd suffices 95%+ of the time. Manual backprop is needed for: (1) **Custom GPU kernels**, e.g. a hand-written backward pass for FlashAttention as a Triton kernel (Module 33). (2) **Non-differentiable ops**, such as the straight-through estimator (quantization) and hard Gumbel-softmax sampling. (3) **Memory optimization**, such as custom gradient checkpointing. (4) **Debugging**: verifying gradients by hand on small examples when you suspect an autograd bug. Karpathy's advice is to 'write backprop from scratch at least once in your career', for the understanding it gives. After that, let autograd do the work.
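As one example of case (2), a straight-through estimator can be sketched with `torch.autograd.Function`. The class name `RoundSTE` is hypothetical; this is a minimal illustration, not a production quantizer.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass, pass the gradient through unchanged."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pretend d(round(x))/dx = 1.
        return grad_output

x = torch.tensor([0.2, 1.7, -0.6], requires_grad=True)
y = RoundSTE.apply(x)
loss = (y ** 2).sum()
loss.backward()

print(y)       # rounded values
print(x.grad)  # 2 * y, as if rounding were the identity
```

Plain `torch.round` has zero gradient almost everywhere, so without the custom backward nothing upstream of the rounding would learn.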

Turkish-specific: does the gradient change because Turkish tokenization is different?

The gradient math is language-agnostic (it is just the chain rule). The practical effects differ, though: (1) **Longer T**: Turkish text consumes roughly 2-3x as many tokens, so sequences get longer and the gradient signal is spread over more positions. (2) **Embedding gradients**: rare Turkish tokens receive sparse gradients and therefore learn more slowly. (3) **Vocabulary fit**: because the vocabulary is not optimized for Turkish, each token update carries less useful signal. The fix is a Turkish-tuned tokenizer (Module 6, Capstone C13).

Derivatives, Gradients, and Matrix Calculus: The Math of Backprop from Scratch

Derivatives from scalar to vector to matrix. Jacobian, Hessian, chain rule, numerator vs denominator layout. Why the derivative of softmax + cross-entropy is so elegant. Manual backprop computation compared with PyTorch autograd.

Şükrü Yusuf KAYA

40 min read

5/13/2026

Intermediate


🧮 We will take the 'magic' out of backprop

Most students wave backpropagation away with 'PyTorch handles it'. That does not fly in senior roles: when you debug an autograd bug, write a custom backward, or reason about a FlashAttention kernel, you cannot get by without the math. Forty minutes from now you will be able to write the matrix version of the chain rule on your own.

Lesson Map

Scalar derivative refresher
Partial derivatives and the gradient (scalar → vector)
Jacobian (vector → vector)
Hessian (scalar → matrix)
The chain rule in matrix form
Numerator vs denominator layout: the big trap
Common derivatives in neural networks: sigmoid, softmax, cross-entropy
The elegant softmax + cross-entropy derivative
Manual backprop, step by step on a small network
Comparison with PyTorch autograd

1. Scalar Derivatives: A Refresher

Take a function f: ℝ → ℝ. Its derivative is:

f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

Geometrically, it is the slope of the tangent line. Physically, it is the rate of change.

Key derivatives:

d/dx [x^n] = n x^{n-1}
d/dx [e^x] = e^x
d/dx [ln(x)] = 1/x
d/dx [sin(x)] = cos(x)

Product rule:

(fg)' = f'g + fg'

Quotient rule:

(f/g)' = (f'g - fg') / g²

Chain rule:

(f(g(x)))' = f'(g(x)) · g'(x)
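These rules are easy to sanity-check numerically. The sketch below compares the chain rule and the product rule against a central-difference approximation (a slightly more accurate variant of the limit definition above); the test functions and the helper `numerical_derivative` are arbitrary choices for illustration.

```python
import math

def numerical_derivative(f, x, h=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5

# Chain rule: d/dx sin(x^2) = cos(x^2) * 2x
analytic_chain = math.cos(x ** 2) * 2 * x
numeric_chain = numerical_derivative(lambda v: math.sin(v ** 2), x)

# Product rule: d/dx [x^3 * e^x] = 3x^2 * e^x + x^3 * e^x
analytic_prod = 3 * x ** 2 * math.exp(x) + x ** 3 * math.exp(x)
numeric_prod = numerical_derivative(lambda v: v ** 3 * math.exp(v), x)

print(analytic_chain, numeric_chain)
print(analytic_prod, numeric_prod)
```

The same finite-difference idea is the basis of gradient checking: if an analytic gradient disagrees with the numerical one, the analytic derivation (or its implementation) is wrong.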

2. Partial Derivatives and the Gradient

For a multivariable function f: ℝ^n → ℝ, we can differentiate with respect to each variable separately. This is called a partial derivative.

\frac{\partial f}{\partial x_i}

When differentiating with respect to x_i, all other variables are held constant.

Gradient

The gradient collects all partial derivatives into a single vector:

\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n}\right)

Intuition: the gradient is a vector that points in the direction of steepest increase. Its magnitude is the rate of increase in that direction.

In LLMs:

The loss is L ∈ ℝ and a weight matrix is W ∈ ℝ^{d_out × d_in}. Then ∂L/∂W ∈ ℝ^{d_out × d_in}: one gradient entry per weight. We read it as the derivative of the loss with respect to the weights. Gradient descent then applies the update

W ← W - η ∇L

where η is the learning rate.

python

import torch
 
# f(x, y) = x^2 + 3*x*y + y^2
# ∂f/∂x = 2x + 3y
# ∂f/∂y = 3x + 2y
# Gradient: (2x+3y, 3x+2y)
 
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
f = x**2 + 3*x*y + y**2
 
# PyTorch autograd
f.backward()
print("∂f/∂x =", x.grad.item())   # 2*1 + 3*2 = 8
print("∂f/∂y =", y.grad.item())   # 3*1 + 2*2 = 7
 
# By hand:
# Gradient magnitude: ||∇f|| = √(8² + 7²) = √113 ≈ 10.63
# Direction of steepest ascent: (8/10.63, 7/10.63)

Computing a gradient with PyTorch autograd, with a manual check.
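To see the update rule W ← W - η ∇L in action, here is a minimal pure-Python gradient descent loop. The quadratic objective, the starting point, and the learning rate are assumptions chosen so that the minimum at (3, -1) is known in advance.

```python
def grad_f(w):
    """Gradient of f(w) = (w[0] - 3)^2 + 2 * (w[1] + 1)^2."""
    return [2 * (w[0] - 3), 4 * (w[1] + 1)]

w = [0.0, 0.0]   # starting point
eta = 0.1        # learning rate

for step in range(100):
    g = grad_f(w)
    # W <- W - eta * grad
    w = [w_i - eta * g_i for w_i, g_i in zip(w, g)]

print(w)   # close to the minimizer (3, -1)
```

Each step moves against the gradient, so the distance to the minimizer shrinks by a constant factor per iteration on this objective.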

3. The Jacobian: Vector-Valued Functions

Now take f: ℝ^n → ℝ^m (vector → vector). What is its derivative?

The answer is the Jacobian matrix, an m × n matrix:

J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Each row is the gradient of one output component. Each column shows how the output changes with respect to one input component.

Intuition

f(x + dx) ≈ f(x) + J · dx is the first-order Taylor expansion.

So the Jacobian is the local linearization of the function: linear algebra meeting calculus.

Jacobians in Neural Networks

A linear layer y = Wx + b has Jacobian ∂y/∂x = W, always. An element-wise sigmoid y = σ(x) has a diagonal Jacobian, diag(σ(x) * (1 - σ(x))). Softmax has a dense (non-diagonal) Jacobian, which is derived in Section 7.
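The two Jacobians above can be checked with `torch.autograd.functional.jacobian`. The shapes and random values in this sketch are arbitrary.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
W = torch.randn(4, 3)
b = torch.randn(4)
x = torch.randn(3)

# Linear layer: the Jacobian with respect to x is W itself.
J_linear = jacobian(lambda v: W @ v + b, x)      # shape (4, 3)

# Element-wise sigmoid: the Jacobian is diagonal.
J_sigmoid = jacobian(torch.sigmoid, x)           # shape (3, 3)
s = torch.sigmoid(x)
expected = torch.diag(s * (1 - s))

print(torch.allclose(J_linear, W))
print(torch.allclose(J_sigmoid, expected))
```

Building full Jacobians like this is only practical for tiny shapes; it is a debugging aid, not something a training loop would do.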

4. The Hessian: The Matrix of Second Derivatives

For a scalar function f: ℝ^n → ℝ, the second derivatives are collected in the Hessian matrix:

H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

It is an n × n symmetric matrix (by Schwarz's theorem, ∂²f/∂x∂y = ∂²f/∂y∂x when the second partial derivatives are continuous).

Why does it matter?

Convexity test: if the Hessian is positive definite everywhere, the function is strictly convex (any local minimum is the global one)
Newton's method: x_{k+1} = x_k - H^{-1} ∇f, which converges faster near a minimum
Loss landscape analysis: the Hessian's eigenvalues distinguish flat from sharp minima
LLM training: Adam-family optimizers use a cheap approximation of curvature

In practice

Storing the full Hessian is infeasible: a model with 10B parameters would have a 10B × 10B Hessian, i.e. 10^20 elements. Instead, a diagonal approximation or Hessian-vector products are used (details in Module 17).
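A Hessian-vector product can be computed with two backward passes and without ever materializing H. The sketch below uses a toy function (an assumption for illustration) and compares the result against the explicit Hessian, which is only feasible because n = 3 here.

```python
import torch

def f(x):
    # Toy scalar function (an assumption for illustration).
    return (x ** 3).sum() + x[0] * x[1]

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.0, -1.0])

# First backward pass; create_graph=True keeps the graph so we can differentiate again.
(grad,) = torch.autograd.grad(f(x), x, create_graph=True)

# Hessian-vector product: d/dx (grad . v) = H v, without ever forming H.
(hvp,) = torch.autograd.grad(grad @ v, x)

# Reference: build the full 3 x 3 Hessian (only feasible for tiny n).
H = torch.autograd.functional.hessian(f, x.detach())
print(hvp)
print(H @ v)
```

The product costs about as much as two gradient computations, regardless of n, which is what makes curvature information usable at scale.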

5. The Chain Rule: The Heart of Backpropagation

Scalar version:

(f(g(x)))' = f'(g(x)) · g'(x)

Matrix version (general): for y = f(g(x)), the computation flows x → g → z → f → y.

\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{z}} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{x}}

This is a matrix product of two Jacobians.

In a neural network

A 3-layer network:

x → [Layer 1] → h1 → [Layer 2] → h2 → [Layer 3] → y → L

The gradient of the loss with respect to Layer 1's weights:

\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1}

Backprop evaluates this product from the output toward the input (left to right as written above). Starting from the gradient vector ∂L/∂y, each step multiplies a vector by one Jacobian, so no full Jacobian-by-Jacobian product ever has to be formed or stored.

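💡 Backprop = chain rule + dynamic programming

The secret of backprop is never repeating the same intermediate computation. Using the intermediate outputs saved during the forward pass, it computes all gradients in a single backward pass. Expanding the chain rule naively, with a separate chain for every gradient, would redo the shared sub-products and could take exponential time on a graph with many branching paths. Backprop is O(n) in the number of operations.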
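The ordering argument can be illustrated with random stand-in Jacobians. The shapes below are arbitrary, and the matrices are not derived from a real network.

```python
import torch

torch.manual_seed(0)
# Jacobians of three stacked stages (random stand-ins, not a real network).
J3 = torch.randn(1, 64, dtype=torch.float64)    # dL/dh2: scalar loss, so a row vector
J2 = torch.randn(64, 64, dtype=torch.float64)   # dh2/dh1
J1 = torch.randn(64, 32, dtype=torch.float64)   # dh1/dx

# Backprop order: start from the loss, always a (1 x n) vector times a matrix.
v = J3 @ J2                 # shape (1, 64)
grad_backprop = v @ J1      # shape (1, 32)

# Opposite order: builds a full 64 x 32 matrix product first.
M = J2 @ J1                 # shape (64, 32)
grad_other = J3 @ M         # shape (1, 32)

print(torch.allclose(grad_backprop, grad_other))
print(grad_backprop.shape)
```

Both orders give the same gradient, but the backprop order only ever holds a vector, while the other order has to compute and store a matrix-matrix product.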

6. Numerator vs Denominator Layout: The Big Trap

Matrix calculus has two conventions for how a derivative is laid out:

Numerator layout (Jacobian convention)

∂y/∂x: the shape of y indexes the rows.

If y ∈ ℝ^m and x ∈ ℝ^n, then ∂y/∂x ∈ ℝ^{m × n} (m rows, n columns).

Denominator layout (gradient convention)

∂y/∂x: the shape of x indexes the rows.

If y ∈ ℝ^m and x ∈ ℝ^n, then ∂y/∂x ∈ ℝ^{n × m} (the transpose of the numerator-layout result).

Which one is right?

Both are correct; they are simply different conventions. The confusion arises because papers use both. PyTorch implicitly follows denominator layout: a gradient has the same shape as its parameter.

Engineering advice: pick one convention (usually denominator layout, because that is how PyTorch stores gradients) and do all of your calculations in it. Do not mix them!
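The shape rule is easy to confirm. In this small sketch the loss is an arbitrary scalar function of W, chosen so the gradient is known by hand.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 3, requires_grad=True)   # d_out = 4, d_in = 3
x = torch.randn(3)

loss = (W @ x).sum()    # any scalar function of W works here
loss.backward()

print(W.shape, W.grad.shape)   # both torch.Size([4, 3])

# For this loss, dL/dW[i, j] = x[j], so every row of the gradient equals x.
print(torch.allclose(W.grad, x.expand(4, 3)))
```

When a hand derivation produces a gradient whose shape differs from the parameter's, a transpose is missing somewhere.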

7. Common Derivatives in Neural Networks

Derivatives every engineer should know by heart:

Operation	Forward	Backward
`y = Wx + b`	`Wx + b`	`∂y/∂W = x^T`, `∂y/∂x = W`, `∂y/∂b = 1`
`y = σ(x)` (sigmoid)	`1/(1+e^{-x})`	`y(1-y)` element-wise
`y = tanh(x)`	`tanh(x)`	`1 - y²` element-wise
`y = ReLU(x)`	`max(0, x)`	`1` if x > 0, else `0`
`y = softmax(x)`	`e^{x_i} / Σe^{x_j}`	dense Jacobian (see below)
`L = CE(y, t)`	`-Σ t_i log(y_i)`	`∂L/∂y_i = -t_i / y_i`; becomes elegant when combined with softmax

Softmax Jacobian

For y_i = e^{x_i} / Σ_k e^{x_k}, the derivative is:

\frac{\partial y_i}{\partial x_j} = \begin{cases} y_i(1 - y_i) & i = j \\ -y_i y_j & i \neq j \end{cases}

In matrix form:

J = diag(y) - y y^T

This Jacobian is dense, and it depends on the input only through the softmax output y.
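The formula J = diag(y) - y y^T can be verified against finite differences. The following is a pure-Python sketch; the input vector and the helper names are arbitrary.

```python
import math

def softmax(x):
    m = max(x)                                   # shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """Analytic Jacobian: J[i][j] = y_i * (delta_ij - y_j)."""
    y = softmax(x)
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]

def numerical_jacobian(x, h=1e-6):
    """Central differences, one input coordinate (column) at a time."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus = softmax([v + (h if k == j else 0.0) for k, v in enumerate(x)])
        minus = softmax([v - (h if k == j else 0.0) for k, v in enumerate(x)])
        for i in range(n):
            J[i][j] = (plus[i] - minus[i]) / (2 * h)
    return J

x = [1.0, 2.0, 0.5]
J = softmax_jacobian(x)
J_num = numerical_jacobian(x)
max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
print(max_err)                                        # tiny
print([sum(row[j] for row in J) for j in range(3)])   # columns sum to ~0
```

The columns sum to zero because the softmax outputs always sum to 1: nudging any single input cannot change the total probability.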

8. Softmax + Cross-Entropy: The Magic Simplification

The last layer of an LLM is typically:

logits → softmax → probs → cross-entropy with target

Computing the two derivatives separately and multiplying them is messy. Differentiating the combined function gives a remarkably clean result:

Setup

Logits: z ∈ ℝ^V (V = vocab size)
Probs: y = softmax(z)
Target: t (one-hot vector or integer index)
Loss: L = -log(y_t) (the negative log-probability of the correct class only)

Derivation (by hand)

L = -log(y_t) = -log(e^{z_t} / Σ e^{z_k})
  = -z_t + log(Σ e^{z_k})

∂L/∂z_i = -[i == t] + e^{z_i} / Σ e^{z_k}
        = -[i == t] + y_i
        = y_i - [i == t]
        = y_i - t_i   (for one-hot t)

Here [i == t] is 1 when i is the target index and 0 otherwise.

Result (strikingly simple)

\frac{\partial L}{\partial z} = \mathbf{y} - \mathbf{t}

Predicted minus target. That is all.

Why does it matter?

Numerical stability: computing softmax and log separately risks overflow; the fused `log_softmax` + `nll_loss` (or `cross_entropy`) is stable.
Speed: the gradient comes out in a single pass.
Intuition: "how wrong were you" is read off directly by subtracting the target from the output.

PyTorch's `F.cross_entropy` performs this fusion automatically. You do not call `log_softmax` yourself, because `F.cross_entropy` and `nn.CrossEntropyLoss` already include it.

python

import torch
import torch.nn.functional as F
 
# Setup
torch.manual_seed(0)
logits = torch.randn(1, 5, requires_grad=True)  # 1 example, 5 classes
target = torch.tensor([2])                       # correct class: 2
 
# Forward
loss = F.cross_entropy(logits, target)
print("Loss:", loss.item())
 
# Backward (autograd)
loss.backward()
print("Autograd grad:", logits.grad)
 
# Manual: grad = softmax(logits) - one_hot(target)
probs = F.softmax(logits.detach(), dim=-1)
one_hot = torch.zeros_like(probs)
one_hot[0, target] = 1.0
manual_grad = probs - one_hot
print("Manual grad:", manual_grad)
 
# Compare
print("Diff:", (logits.grad - manual_grad).abs().max().item())  # ~0

Manually verifying the elegant softmax + cross-entropy derivative.

9. Manual Backprop: Step by Step on a Mini-Network

Now let us put everything together. A simple 2-layer network:

x ∈ ℝ^3  →  W₁ ∈ ℝ^{4×3}, b₁ ∈ ℝ^4  →  h = ReLU(W₁x + b₁)  →
            W₂ ∈ ℝ^{2×4}, b₂ ∈ ℝ^2  →  z = W₂h + b₂  →  L = CE(z, t)

Forward

z₁ = W₁ x + b₁
h = ReLU(z₁) = max(0, z₁)
z₂ = W₂ h + b₂
L = -log(softmax(z₂)[t])

Backward (chain rule)

∂L/∂z₂ = softmax(z₂) - one_hot(t)        (the softmax + cross-entropy result from Section 8)
∂L/∂W₂ = ∂L/∂z₂ · h^T                     (outer product)
∂L/∂b₂ = ∂L/∂z₂
∂L/∂h  = W₂^T · ∂L/∂z₂
∂L/∂z₁ = ∂L/∂h ⊙ (z₁ > 0)                (element-wise; ReLU derivative is 1 where z₁ > 0, else 0)
∂L/∂W₁ = ∂L/∂z₁ · x^T
∂L/∂b₁ = ∂L/∂z₁

python

import torch
 
torch.manual_seed(0)
x = torch.randn(3)
W1 = torch.randn(4, 3, requires_grad=True)
b1 = torch.randn(4, requires_grad=True)
W2 = torch.randn(2, 4, requires_grad=True)
b2 = torch.randn(2, requires_grad=True)
t = torch.tensor(1)
 
# Forward
z1 = W1 @ x + b1
h = torch.relu(z1)
z2 = W2 @ h + b2
loss = torch.nn.functional.cross_entropy(z2.unsqueeze(0), t.unsqueeze(0))
 
# Backward (autograd)
loss.backward()
 
# Manual computation (everything detached, so no extra graph is built)
probs = torch.softmax(z2.detach(), dim=0)
one_hot = torch.zeros_like(probs)
one_hot[t] = 1.0
dL_dz2 = probs - one_hot                        # (2,)
dL_dW2 = dL_dz2.unsqueeze(1) @ h.detach().unsqueeze(0)   # (2, 4) outer product
dL_db2 = dL_dz2                                  # (2,)
dL_dh = W2.detach().T @ dL_dz2                  # (4,)
dL_dz1 = dL_dh * (z1.detach() > 0).float()      # (4,) ReLU derivative
dL_dW1 = dL_dz1.unsqueeze(1) @ x.unsqueeze(0)   # (4, 3) outer product
dL_db1 = dL_dz1
 
# Compare
print("W1 grad diff:", (W1.grad - dL_dW1).abs().max().item())
print("b1 grad diff:", (b1.grad - dL_db1).abs().max().item())
print("W2 grad diff:", (W2.grad - dL_dW2).abs().max().item())
print("b2 grad diff:", (b2.grad - dL_db2).abs().max().item())
# All ~0 → the manual computation is correct ✓

Matching manual backprop against PyTorch autograd, up to floating-point error.

🎯 That is all there is to backprop

The roughly 10 lines of manual computation above are what gets repeated across the 32 layers of Llama 3 8B. The complexity comes from scale; the math is the same. Karpathy writes nanoGPT in a few hundred lines; you now understand every one of them.

10. Vanishing and Exploding Gradients

The chain rule teaches one thing above all: a gradient is a product of many matrices.

∂L/∂W₁ = (∂L/∂z₂) · (W₂^T) · (ReLU') · (W₁) · ...

If each factor shrinks vectors (roughly, its eigenvalues or singular values are < 1), the product goes to zero: vanishing gradients. The early layers stop learning.

If the factors amplify vectors (values > 1), the product blows up: exploding gradients, and the loss becomes NaN.
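A scalar caricature already shows the effect. In the sketch below every layer is assumed to scale the gradient by the same factor, which is a deliberate oversimplification of the matrix product.

```python
depth = 100

def chained_gradient(per_layer_factor, depth):
    """Product of `depth` identical per-layer gradient factors."""
    g = 1.0
    for _ in range(depth):
        g *= per_layer_factor
    return g

print(chained_gradient(0.9, depth))   # shrinks toward 0 (vanishing)
print(chained_gradient(1.0, depth))   # stays at 1
print(chained_gradient(1.1, depth))   # grows rapidly (exploding)
```

A per-layer factor only 10% away from 1 changes the gradient by several orders of magnitude over 100 layers, which is why deep networks need the stabilizing techniques discussed in this section.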

Solutions (details in Module 17)

Residual (skip) connections: the chain gets shorter and gradients flow directly
Layer normalization / RMSNorm: normalize the activations
Gradient clipping: rescale the gradient when ||grad||₂ > τ
Initialization (Xavier, He/Kaiming, μP): balance gradient variance at the start
Better activations (GeLU, SwiGLU): smoother, better-balanced gradient flow
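Of the fixes listed above, gradient clipping is the simplest to write down. The helper below is a hypothetical pure-Python sketch of norm-based clipping on a flat list of gradient values.

```python
import math

def clip_by_global_norm(grads, tau):
    """Rescale `grads` so their L2 norm is at most `tau`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > tau:
        scale = tau / norm
        return [g * scale for g in grads], norm
    return list(grads), norm

grads = [3.0, 4.0]                 # L2 norm = 5
clipped, norm_before = clip_by_global_norm(grads, tau=1.0)
print(norm_before)                 # 5.0
print(clipped)                     # approximately [0.6, 0.8]
```

Clipping changes only the length of the gradient, not its direction. In PyTorch the same idea is available as `torch.nn.utils.clip_grad_norm_`, which operates over all of a model's parameters at once.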

11. PyTorch Autograd: A Look Inside

PyTorch autograd builds a computational graph:

Conceptually, each tensor (with `requires_grad=True`) is a node
Each operation is an edge connecting them, recorded as the result's `grad_fn`
Intermediate outputs are saved during the forward pass (this costs memory!)
When `backward()` is called, the graph is traversed in reverse topological order

Inspecting the graph

import torch

x = torch.tensor(1.0, requires_grad=True)
y = x ** 2
z = y * 3
print(z.grad_fn)   # <MulBackward0 object at ...>
print(y.grad_fn)   # <PowBackward0 object at ...>
print(x.grad_fn)   # None (leaf tensor)
z.backward()
print(x.grad)      # tensor(6.)  (dz/dx = 6x = 6)

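`.detach()` and `with torch.no_grad()`

These take operations out of the graph: no gradient is computed and no intermediate results are held in memory.

import torch
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.detach()                          # shares data with x, but is cut off from the graph
with torch.no_grad():
    z = (x * 2).sum()                   # nothing is recorded; ideal for inference
print(y.requires_grad, z.requires_grad) # False False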

Calling `.backward()` a second time

By default, the computational graph is freed after `backward()`. If you want to call `backward()` again on the same graph, pass `retain_graph=True` to the earlier call.
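A short sketch of the behavior, using an arbitrary one-variable function:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

y.backward(retain_graph=True)   # keep the graph alive for another pass
print(x.grad)                   # tensor(12.)  since dy/dx = 3x^2 = 12

x.grad.zero_()                  # gradients accumulate, so reset before reuse
y.backward()                    # second pass; the graph is freed after this one
print(x.grad)                   # tensor(12.)

try:
    y.backward()                # third pass on a freed graph
except RuntimeError as err:
    print("RuntimeError:", str(err)[:60])
```

Retaining the graph keeps all saved intermediate tensors in memory, so it is worth using only when a second backward pass is really needed.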

In Module 2 we will write autograd from scratch (micrograd, in Turkish).

12. Mini Exercises

Scalar derivative: f(x) = log(1 + e^x) (softplus). What is its derivative? How is it related to the sigmoid?
Vector gradient: f(x) = ||x||₂² = x^T x (x ∈ ℝ^n). What is ∇f?
Matrix gradient: f(W) = ||Wx - y||₂². What is ∂f/∂W? (Hint: chain rule.)
Shape of the softmax derivative: what is the shape of the Jacobian for a 5-class softmax output? Is it dense or diagonal?
Cross-entropy, alternative form: if t were an integer index instead of a one-hot vector, would the formula for ∂L/∂z change?

What Did We Learn in This Lesson?

✓ Scalar derivative → partial derivative → gradient
✓ Jacobian (vector → vector) and Hessian (scalar → matrix)
✓ The matrix form of the chain rule, the heart of backprop
✓ Numerator vs denominator layout and the confusion between conventions
✓ Common derivatives in neural networks: linear, sigmoid, ReLU, softmax
✓ The elegant softmax + cross-entropy derivative: predicted - target
✓ Manual backprop, step by step on a mini-network, compared with PyTorch
✓ Intuition for vanishing/exploding gradients and their fixes
✓ A look at how PyTorch autograd works

Next Lesson

1.4: Chain Rule and Backpropagation: Mini-Autograd from Scratch. We will build Karpathy's `micrograd` from scratch, in Turkish: the `Value` class, operator overloading with `__add__` and `__mul__`, and the topological sort behind `backward()`. It is a 100-line autograd engine and one of the most instructive labs in this course.

Frequently Asked Questions

Easy way to remember numerator vs denominator layout?

Yes. **PyTorch uses denominator layout**: the gradient's shape **matches the parameter's shape**, so `W.grad` and `W` have the same shape. Adopt this convention and you can work out where the transposes go by keeping the parameter's shape intact. Papers typically use denominator layout (as in Goodfellow's Deep Learning book); numerator layout is more common in math textbooks. For PyTorch users, 'don't break the parameter shape' is the golden rule.

