Does BatchNorm actually solve 'internal covariate shift' or is the mechanism different?

Disputed. Ioffe and Szegedy (2015) attributed BN's benefit to reducing 'internal covariate shift'. Santurkar et al. (2018, "How Does Batch Normalization Help Optimization?", NeurIPS) argued that the **actual mechanism is loss-surface smoothing**: BN improves the **Lipschitz** behavior of the loss and its gradients, so gradients are more predictable and optimization is more stable. Modern view: the 'internal covariate shift' framing is misleading, and the real benefit is **smoothing**. This makes no practical difference to how BN is used, but the academic debate continues.

If ConvNeXt rivals transformers, why did everyone switch to ViT?

ConvNeXt became the niche choice. ConvNeXt (2022) competes with ViT on ImageNet, but: (1) **Tooling momentum**: HuggingFace, timm, and modern frameworks put more weight behind ViT. (2) **Multi-modal extension**: ViT fits image-text alignment naturally (CLIP, LLaVA). (3) **Foundation model paradigm**: pretrain-then-fine-tune is cleaner with ViT. (4) **Compute scaling**: ViT scaling laws are clearer. ConvNeXt is still used in practice (medical imaging, mobile), but **frontier research** happens in ViT. The Bitter Lesson again.

After AlexNet 2012, why was there a revolution roughly every year?

The field gained 'collective focus'. (1) **ImageNet as a common benchmark**: everyone competed on the same problem. (2) **Open source**: Caffe, Torch, and early TensorFlow eased code sharing. (3) **Conference hubs**: NeurIPS, ICML, and CVPR brought researchers together several times a year. (4) **Cheaper GPUs**: thousands of labs could experiment, not just one. (5) **Industry funding**: Google, Facebook, and Microsoft poured billions into ML labs from 2013 onward. This combination cut the paradigm-shift cycle from 2-3 years to about 1 year. The 2017 Transformer followed by 2018 BERT was the peak of this speed.

Why does 1×1 conv resemble a 'low-rank projection'?

Because it is one. A 1×1 conv (C_in → C_out) is a matrix multiplication applied per pixel: $y = W x$, where W has shape (C_out, C_in). If C_out < C_in, the output lives in at most C_out dimensions, so it is a low-rank linear projection. It is exactly a **fully connected layer** applied independently at every spatial position. Hence modern transformers call it 'point-wise linear' or 'channel mixing'. **LoRA** (Module 21) builds on the same idea: factor a weight update into two low-rank matrices.
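The equivalence can be made concrete with a minimal sketch in plain Python. The function name `conv1x1`, the toy input, and the weight matrix are all hypothetical, chosen only to show that every pixel gets the same matrix-vector product $y = W x$ over its channels.

```python
# Illustrative only: a 1x1 conv written as a per-pixel matrix-vector product.
def conv1x1(feature_map, weight):
    """feature_map: [C_in][H][W] nested lists; weight: [C_out][C_in]."""
    c_in = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in weight]
    for o, row in enumerate(weight):
        for i in range(height):
            for j in range(width):
                # y[o] = sum_c W[o][c] * x[c] at this pixel: exactly y = W x
                out[o][i][j] = sum(row[c] * feature_map[c][i][j] for c in range(c_in))
    return out

# Hypothetical toy input: 3 channels on a 2x2 spatial grid
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
# Projects 3 channels down to 2 (C_out < C_in)
W = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -1.0],
]
y = conv1x1(x, W)
print(len(y), len(y[0]), len(y[0][0]))   # channel count changes, spatial size does not
print(y[0][0][0], y[1][1][1])
```

The spatial loops never mix neighboring pixels, which is why the same operation shows up in transformers as a per-token linear layer.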

Is computer vision research strong in Türkiye? How did the 2012-2017 era affect it?

Academic vision is strong; industry has been slower. (1) **Academia**: Boğaziçi, METU, Bilkent, and İTÜ have vision labs. Pınar Duygulu (Hacettepe, image captioning), Sinan Kalkan (METU, vision-language), and Aydın Alatan (METU, signal processing) are notable. (2) **Industry**: Aselsan and Havelsan work on defense vision (radar, satellite imaging). (3) **Startups**: Vivense, Trendyol (recommender vision), Getir (image moderation). Türkiye's vision scene grew over 10-15 years. After ChatGPT, focus shifted to NLP, but vision remains strong. Module 60 (Turkish AI Ecosystem) covers this in detail.

Big Bang in Vision: AlexNet, VGG, Inception, ResNet, BatchNorm — Birth of Modern Architectural Components

Q: Can deep networks really not be trained without skip connections in ResNet?

Generally yes, with nuance. The ResNet paper (He et al. 2015) compared plain networks of different depths (20 vs 56 layers on CIFAR-10, 18 vs 34 layers on ImageNet): the deeper plain network had **higher training error** than the shallower one. So the problem is not just vanishing gradients but **optimization difficulty**. Residual connections broke this barrier, and 152-, 200-, and 1001-layer ResNets became possible. Modern transformers typically stop at a few dozen to roughly a hundred layers because they do not need more, but they also could not be trained that deep without skip connections.


The 2012-2017 vision revolution: AlexNet's 5 innovations, VGG's uniformity principle, Inception's multi-scale approach, ResNet's skip connection revolution, BatchNorm's response to internal covariate shift. Detailed analysis of the architectural legacy that led to Transformers.

Şükrü Yusuf KAYA

55 min read

5/13/2026

Intermediate


🌋 The years modern AI exploded

Between the world before AlexNet (2012) and the world after it, AI did not just change: it was reborn. In this five-year period (2012-2017), nearly all the core components of modern deep learning were discovered: ReLU, dropout, batch norm, skip connections, distributed training. After reading this lesson you will be able to look at Llama 3's config.json and say 'this RMSNorm is really BatchNorm's grandchild' and 'this residual connection comes from ResNet 2015'.

Lesson Map

Vision before ImageNet: where were we?
AlexNet 2012: 5 fundamental innovations
ZFNet 2013: what went wrong and how it was fixed
VGG 2014: the discovery of the 3×3 conv
Inception/GoogLeNet 2014: multi-scale + 1×1 conv
BatchNorm 2015: training stabilization
ResNet 2015: skip connections, the door to 152 layers
DenseNet 2017: connect to every previous layer
SE-Net 2018: attention's first arrival in convnets
Vision Transformer 2020: the year convnet hegemony ended
The link to Transformers: what was inherited

1. Vision Before ImageNet: Where We Came From

End of 2011: the vision world ran on hand-crafted features plus classical ML.

Classical pipeline (before 2010)

Image → SIFT/HOG/LBP features → Bag of Visual Words → SVM → Class

Components:

SIFT (Lowe 1999): Scale-Invariant Feature Transform, which finds keypoints
HOG (Dalal-Triggs 2005): Histogram of Oriented Gradients, used for pedestrian detection
LBP (Local Binary Patterns): texture
Bag of Visual Words: cluster the features and build a histogram

With these pipelines, the ImageNet 2010 and 2011 entrants reached roughly 26-28% top-5 error.

Earlier neural network attempts

LeNet-5 (LeCun 1998): under 1% error on MNIST, but it could not be scaled to ImageNet
Convolutional Restricted Boltzmann Machines (2011): neural networks for vision, but still close to feature engineering in spirit
Hinton + DBN (2006-2010): deep, but generative rather than discriminative

2011: Before SuperVision (early Krizhevsky)

In 2011, Krizhevsky showed an improvement of about 3% with neural networks on a different benchmark and was preparing for ImageNet 2012.

2. AlexNet 2012: 5 Fundamental Innovations

Krizhevsky, Sutskever, Hinton: "ImageNet Classification with Deep Convolutional Neural Networks", NeurIPS 2012.

Result: 16.4% top-5 error (15.3% for the competition ensemble; the runner-up had 26.2%). An improvement of about 10 percentage points, overnight.

Architecture

Input (3×224×224)
├── Conv 11×11, 96 filters, stride 4   → 96×55×55
├── Max-pool 3×3, stride 2              → 96×27×27
├── Conv 5×5, 256 filters               → 256×27×27
├── Max-pool                            → 256×13×13
├── Conv 3×3, 384 filters
├── Conv 3×3, 384 filters
├── Conv 3×3, 256 filters
├── Max-pool                            → 256×6×6
├── FC 4096
├── FC 4096
└── FC 1000 (softmax)

About 60M parameters, 5 conv + 3 FC layers.
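The spatial sizes in the diagram follow from the standard output-size formula, floor((size + 2·padding − kernel) / stride) + 1. The sketch below traces them. The padding values are assumptions chosen to reproduce the sizes shown for a 224×224 input, and `out_size` is a hypothetical helper name.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pool layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# Padding values are assumptions chosen to match the sizes in the diagram
s = 224
s = out_size(s, kernel=11, stride=4, padding=2)   # conv1
trace = [s]
s = out_size(s, kernel=3, stride=2)               # pool1
trace.append(s)
s = out_size(s, kernel=5, padding=2)              # conv2
trace.append(s)
s = out_size(s, kernel=3, stride=2)               # pool2
trace.append(s)
for _ in range(3):                                # conv3, conv4, conv5
    s = out_size(s, kernel=3, padding=1)
trace.append(s)
s = out_size(s, kernel=3, stride=2)               # pool3
trace.append(s)

print(trace)            # [55, 27, 27, 13, 13, 6]
print(256 * s * s)      # 9216 inputs to the first FC layer
```

The final 256 × 6 × 6 = 9216 values are what the first fully connected layer consumes, which is where most of the parameters sit.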

5 Revolutionary Innovations

Innovation 1: ReLU activation

max(0, x) instead of sigmoid/tanh. Why was it revolutionary?

Mitigates vanishing gradients: the sigmoid derivative is at most 0.25, so it decays exponentially in a deep network; the ReLU derivative is 0 or 1, so active units do not decay
Cheap to compute: max(0, x) vs 1/(1+e^-x)
Sparsity: roughly half the activations are zero, loosely resembling biological neurons

GELU and the gated variants SwiGLU and GeGLU came later, but this is the ancestor of them all.
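The exponential decay argument can be checked numerically. This is a simplified sketch that ignores the weight matrices and assumes the same hypothetical pre-activation `z` at every layer, so it only illustrates the trend, not a real network.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

depth = 20
z = 0.5  # hypothetical pre-activation, assumed equal at every layer

# Ignoring weights, the backward signal is a product of per-layer derivatives
sigmoid_chain = sigmoid_grad(z) ** depth
relu_chain = relu_grad(z) ** depth
best_case_sigmoid = 0.25 ** depth   # upper bound, since sigmoid'(0) = 0.25

print(f"sigmoid chain:            {sigmoid_chain:.3e}")
print(f"sigmoid best case:        {best_case_sigmoid:.3e}")
print(f"relu chain (active path): {relu_chain}")
```

Even in the best case the sigmoid chain shrinks below 1e-12 after 20 layers, while an active ReLU path passes the gradient through unchanged.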

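Innovation 2: Dropout

During training, randomly "kill" (zero out) 50% of the neurons on every forward pass. At test time all neurons are used, with activations scaled to match (the original scheme multiplies by the keep probability at test time; modern frameworks instead scale up during training).

Regularization: reduces overfitting
Ensemble effect: every batch trains a different sub-network

Dropout is rarely used in pretraining runs like Llama 3's (with that much data, the model barely overfits), but it is still used in fine-tuning.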
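A minimal sketch of the mechanism, written as "inverted" dropout (the variant modern frameworks use, where survivors are scaled up during training so inference needs no rescaling). The function name, the `rng` parameter, and the toy activations are illustrative assumptions.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)          # inference: identity, no scaling needed
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)               # fixed seed so the sketch is repeatable
activations = [1.0] * 10_000
dropped = dropout(activations, p=0.5, training=True, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"zeroed: {zero_fraction:.2f}, mean after dropout: {mean_after:.2f}")
print(dropout([1.0, 2.0], training=False))
```

About half the values are zeroed and the survivors are doubled, so the expected activation stays the same between training and inference.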

Innovation 3: GPU implementation (2 GPUs in parallel)

CUDA on 2 × NVIDIA GTX 580 (3 GB VRAM each). Because the model did not fit in one GPU's memory, the layers were split across the GPUs: primitive model parallelism.

Today's distributed training (FSDP, ZeRO) has its roots here.

Innovation 4: Data augmentation

Random crops, horizontal flips
PCA color augmentation (intensity perturbation)

This increased the effective dataset size by 2048× (the paper's count: 32 × 32 crop offsets × 2 flips). It is the ancestor of synthetic data.

Innovation 5: Local Response Normalization (LRN)

Normalizes each activation by the activity of neighboring channels. It was later displaced by BatchNorm, but it mattered at the time.

Social Impact

After NIPS 2012 (now NeurIPS), Google, Facebook, and Microsoft rushed to build AI labs. In 2013, DNNresearch, the company formed by Hinton's Toronto group, was acquired by Google for a reported $44M. The golden age of AI had begun.

python

import torch
import torch.nn as nn
 
# Pedagogical AlexNet in modern PyTorch (no grouped convs or LRN)
class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )
 
    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)
 
# Parameter count
model = AlexNet()
total = sum(p.numel() for p in model.parameters())
# Prints ~62.4M here; the original two-GPU (grouped-conv) AlexNet has ~61M
print(f"AlexNet parameters: {total/1e6:.1f}M")

A modern PyTorch implementation of AlexNet.

3. ZFNet 2013: AlexNet Under the Microscope

Zeiler & Fergus (NYU): "Visualizing and Understanding Convolutional Networks", ECCV 2014.

ZFNet is not really a revolution; it is a version of AlexNet with its mistakes fixed. Its importance lies elsewhere:

First feature visualization

What does each filter inside a CNN learn? Zeiler visualized it with deconvolution:

Layer 1: Gabor-like edges
Layer 2: corners, color blobs
Layer 3: textures, parts of objects
Layer 4: object parts (faces, wheels)
Layer 5: full objects

This is where interpretability research starts (an ancestor of modern mechanistic interpretability work such as Anthropic's).

Hyperparameter optimization

It shrank AlexNet's 11×11 first-layer filters to 7×7 and reduced the stride from 4 to 2. Top-5 error went from 15.3% to 11.7% (the winning ImageNet 2013 result, from Zeiler's Clarifai entry).

Practical lesson

When filter sizes and strides are reduced, the model captures more fine-grained patterns.

This is the heart of VGG 2014.

4. VGG 2014: Uniformity and the 3×3 Conv

Simonyan & Zisserman (Oxford VGG group): "Very Deep Convolutional Networks for Large-Scale Image Recognition", 2014.

Philosophy

"The same thing in every conv layer: 3×3 conv, ReLU, and a pool where needed."

The architecture is uniform, has few hyperparameters, and scales easily.

VGG-16 and VGG-19

VGG-16: 13 conv + 3 FC = 16 weight layers
VGG-19: 16 conv + 3 FC = 19 weight layers

Top-5 error: 7.3% (ImageNet 2014, second place; GoogLeNet came first).

Why 3×3?

Receptive field equivalence: two stacked 3×3 convs cover a 5×5 receptive field, and three cover 7×7. But the parameter count is lower:

7×7 conv: 49C² parameters (C = channels)
Three 3×3 convs: 27C² parameters, about 45% fewer

Plus: there are 3 ReLUs in between, so more non-linearity.
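Both claims, the receptive field equivalence and the parameter saving, reduce to short arithmetic. The sketch below assumes stride-1 convs with equal input and output channels and ignores biases; the helper names and the channel width `C = 256` are illustrative.

```python
def stacked_receptive_field(kernel, n_layers):
    """Receptive field of n stacked stride-1 convs with the same kernel size."""
    return 1 + n_layers * (kernel - 1)

def conv_weights(kernel, channels, n_layers=1):
    """Weight count (no biases) for n convs with `channels` in and out."""
    return n_layers * kernel * kernel * channels * channels

C = 256  # hypothetical channel width
single_7x7 = conv_weights(7, C)
three_3x3 = conv_weights(3, C, n_layers=3)
saving = 1 - three_3x3 / single_7x7

print(stacked_receptive_field(3, 2))   # 5
print(stacked_receptive_field(3, 3))   # 7
print(single_7x7, three_3x3)           # 49*C^2 vs 27*C^2
print(f"{saving:.1%} fewer weights")
```

The saving is 1 − 27/49 regardless of C, which is where the "about 45%" figure comes from.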

VGG's legacy

The default for modern conv architectures: 3×3, padding=1, stride=1
Widespread use as a feature extractor: VGG-16 features are still used for style transfer and perceptual loss
Popularization of the pretrain-then-fine-tune paradigm

Limitation

VGG is very large: 138M parameters (AlexNet: 61M). Memory and compute are expensive. InceptionV1 (GoogLeNet) tried to overcome this limitation.

5. Inception / GoogLeNet 2014: Multi-scale + 1×1 Conv

Szegedy et al. (Google): "Going Deeper with Convolutions", CVPR 2015. ImageNet 2014 winner: 6.7% top-5 error.

Philosophy

"Let's capture information from different receptive fields in the same layer."

Inception module

        ┌─→ 1×1 conv ────────────────┐
Input ──┼─→ 1×1 → 3×3 conv ─────────┤
        ├─→ 1×1 → 5×5 conv ─────────┼─→ concatenate
        └─→ 3×3 max-pool → 1×1 conv ┘

The same input is processed by 4 parallel branches, and the outputs are concatenated.

The secret of the 1×1 conv

Input (C_in channels) → 1×1 conv (C_out filters) → Output (C_out channels)

A 1×1 conv leaves the spatial layout untouched but changes the number of channels. Why does that matter?

Bottleneck: first reduce channels with a 1×1 (e.g. 256 → 64), then apply the 3×3 conv (few parameters), then expand back with another 1×1
Compute savings: compared with a plain 256-channel 3×3 conv, reducing to 64 channels first cuts parameters by roughly 72%, and the full 1×1 → 3×3 → 1×1 bottleneck by roughly 88%
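The size of the saving depends on which layers are counted, so it is worth doing the arithmetic explicitly. The sketch below counts weights only (no biases) for the 256 → 64 example; `conv_weights` is a hypothetical helper, and the two variants are our reading of "reduce first" versus a full reduce-and-expand bottleneck.

```python
def conv_weights(kernel, c_in, c_out):
    """Weight count of one conv layer, biases ignored."""
    return kernel * kernel * c_in * c_out

plain = conv_weights(3, 256, 256)                      # direct 3x3, 256 -> 256
reduce_then_3x3 = conv_weights(1, 256, 64) + conv_weights(3, 64, 256)
full_bottleneck = (conv_weights(1, 256, 64) + conv_weights(3, 64, 64)
                   + conv_weights(1, 64, 256))

for name, n in [("reduce then 3x3", reduce_then_3x3), ("1x1-3x3-1x1", full_bottleneck)]:
    print(f"{name}: {n} weights, {1 - n / plain:.0%} fewer than {plain}")
```

Under these assumptions, reducing before the 3×3 saves about 72% of the weights, and the full bottleneck about 88%.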

The router in a modern Mixture of Experts (MoE) layer is a per-token linear projection, which is the same operation as a 1×1 conv. Details in Module 12.

Inception versions

InceptionV1 (GoogLeNet) 2014: 22 layers, 6.7%
InceptionV2 2015: two 3×3 convs instead of a 5×5
InceptionV3 2015: factorized convolutions, label smoothing
InceptionV4 / Inception-ResNet 2016: merged with ResNet

Inception's legacy

Multi-scale processing: different scales in the same layer, an idea that multi-head attention loosely echoes
1×1 conv bottleneck: parameter-efficient architectures
Auxiliary loss heads (intermediate supervision): a precursor of modern self-distillation

6. BatchNorm 2015: Training Stabilization

Ioffe & Szegedy (Google): "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015.

Problem

In deep networks the output distribution of a layer keeps shifting, so the next layer has to keep adapting. This "internal covariate shift" slows training and makes it sensitive to weight initialization.

The BatchNorm solution

For each feature map, compute the mean and variance over the batch and normalize with them:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$y_i = \gamma \hat{x}_i + \beta$$

Here $m$ is the batch size, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable scale and shift parameters.
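The four equations translate almost line for line into code. This is a minimal sketch for a single feature in plain Python, not how a framework implements it: the class name, the momentum value of 0.1, and the running-average update rule are assumptions added to show how inference can reuse statistics collected during training.

```python
import math

class BatchNorm1Feature:
    """BatchNorm for a single feature, following the four equations above."""
    def __init__(self, eps=1e-5, momentum=0.1):
        self.gamma, self.beta = 1.0, 0.0          # learnable in a real layer
        self.eps, self.momentum = eps, momentum
        self.running_mean, self.running_var = 0.0, 1.0

    def forward(self, batch, training=True):
        if training:
            m = len(batch)
            mean = sum(batch) / m
            var = sum((x - mean) ** 2 for x in batch) / m
            # Running averages are what inference will use later
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mean, var = self.running_mean, self.running_var
        return [self.gamma * (x - mean) / math.sqrt(var + self.eps) + self.beta
                for x in batch]

bn = BatchNorm1Feature()
batch = [2.0, 4.0, 6.0, 8.0]              # hypothetical activations of one feature
out = bn.forward(batch, training=True)
print([round(v, 3) for v in out])
print(round(sum(out) / len(out), 6))      # mean of the normalized batch
print(bn.forward([5.0], training=False))  # inference uses running statistics
```

In training mode the output has zero mean and roughly unit variance by construction; in inference mode the result depends on the accumulated running statistics, so the same input can normalize differently in the two modes.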

Practical effects

Higher learning rates stay stable
Less dropout is needed (BN has a regularizing effect)
Less sensitivity to initialization
Much faster convergence (roughly 3-10x)

Limitations

Sensitive to batch size: statistics are noisy with small batches (e.g. 1-4)
Awkward for RNNs/Transformers: sequences have variable length, and batch-dimension statistics are unreliable
Train/inference mismatch: batch statistics during training, running averages at inference, with a risk of inconsistency

Modern alternatives

LayerNorm (Ba et al. 2016): normalizes over the feature dimension instead of the batch dimension. Standard in RNNs and Transformers.
GroupNorm (Wu & He 2018): splits channels into groups and normalizes within each group
RMSNorm (Zhang & Sennrich 2019): rescales by the root mean square only, with no mean subtraction. Used by Llama, Mistral, and Qwen; it is cheaper and performs comparably or slightly better

BatchNorm is not used in modern LLMs, but it is the conceptual ancestor of RMSNorm.
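The difference between LayerNorm and RMSNorm is easiest to see side by side. The sketch below normalizes one hypothetical token vector over its feature dimension; the learnable gain and bias are omitted, and the function names are illustrative.

```python
import math

def layer_norm(x, eps=1e-6):
    """Center by the mean, then divide by the standard deviation (gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """Divide by the root mean square only; no mean subtraction (gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

token = [1.0, 2.0, 3.0, 6.0]   # hypothetical feature vector of one token
ln = layer_norm(token)
rn = rms_norm(token)
print([round(v, 3) for v in ln], "mean:", round(sum(ln) / len(ln), 6))
print([round(v, 3) for v in rn], "mean square:", round(sum(v * v for v in rn) / len(rn), 6))
```

Neither function looks at other items in the batch, which is why both work with batch size 1 and variable-length sequences. RMSNorm skips the mean computation entirely, so an all-positive input stays all-positive.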

7. ResNet 2015: The Skip Connection Revolution

He, Zhang, Ren, Sun (Microsoft Research Asia): "Deep Residual Learning for Image Recognition", CVPR 2016. ImageNet 2015 winner: 3.57% top-5 error (better than the estimated human error rate).

Problem

Adding more layers to a network like VGG-19 did not help; accuracy actually dropped. This was not vanishing gradients (BatchNorm had largely addressed that). It was an optimization difficulty.

Skip connection (residual block)

        x
        │
        ├──────────────────┐
        ↓                  │
   conv → BN → ReLU        │  (identity shortcut)
        ↓                  │
   conv → BN               │
        ↓                  │
       (+) ←───────────────┘
        │
        ↓
   y = F(x) + x   (followed by ReLU)

Mathematically this is a simple trick, y = F(x) + x, but it has a deep effect.

Why does it work?

Gradient flow: in the backward pass the gradient passes straight through the identity path, so there is less vanishing
Identity mapping is easy to learn: if some layers are useless, a plain layer has to learn F(x) = x, which is hard; with a skip, learning F(x) = 0 is easy and gives y = x
Loss surface smoothing: empirical evidence (Li et al. 2018) shows the loss landscape is smoother with skips
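The gradient-flow argument can be illustrated with scalars. For y = F(x) + x the local derivative is F'(x) + 1, so the identity path contributes a 1 at every layer. The sketch below is a toy under a strong assumption (every layer has the same small derivative F'(x) = 0.01), not a model of a real network.

```python
def chain_gradient(local_grads, residual):
    """Product of per-layer derivatives dy/dx for scalar layers.

    Plain layer:    y = F(x)      -> dy/dx = F'(x)
    Residual layer: y = F(x) + x  -> dy/dx = F'(x) + 1
    """
    total = 1.0
    for g in local_grads:
        total *= (g + 1.0) if residual else g
    return total

# Hypothetical: 50 layers whose own derivative F'(x) is a small 0.01
grads = [0.01] * 50
plain = chain_gradient(grads, residual=False)
resid = chain_gradient(grads, residual=True)
print(f"plain:    {plain:.3e}")
print(f"residual: {resid:.3f}")
```

Without the skip the signal collapses to about 1e-100; with it, the product stays of order 1 because each factor is close to 1 instead of close to 0.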

ResNet versions

ResNet-50: 50 layers, ~26M parameters
ResNet-101, ResNet-152: deeper
Wide ResNet (2016): fewer layers, wider

A ResNet ensemble reached 3.57% top-5 error on ImageNet, below the roughly 5% human baseline.

The legacy of the skip connection

Every modern transformer has a residual connection in every decoder block.

Llama 3, Mistral, and other GPT-style decoders all follow:

x = x + attention(LN(x))
x = x + ffn(LN(x))

Without ResNet 2015, modern LLMs as we know them would not exist.

python

import torch
import torch.nn as nn
 
class ResidualBlock(nn.Module):
    """ResNet basic block: y = F(x) + x"""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
 
    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity            # ← skip connection
        return self.relu(out)
 
# Compare with a Transformer block
class TransformerBlock(nn.Module):
    """Modern pre-norm decoder block: the same pattern!"""
    def __init__(self, d, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(
            nn.Linear(d, 4*d), nn.GELU(), nn.Linear(4*d, d)
        )

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)                          # normalize once, reuse for q, k, v
        # A real decoder passes a causal attn_mask here
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]  # residual
        x = x + self.ffn(self.ln2(x))            # residual
        return x

# Conceptually identical: the F(x) + x pattern

From the ResNet block to the Transformer block: the same pattern.

8. After ResNet: DenseNet, SE-Net, EfficientNet

DenseNet 2017

Huang, Liu, van der Maaten, Weinberger. Every layer is connected to all previous layers (by concatenation). Fewer parameters, better feature reuse.

SE-Net (Squeeze-and-Excitation) 2018

Hu, Shen, Sun. Channel-wise attention:

1. Global average pool → channel descriptor
2. FC → ReLU → FC → sigmoid → attention weights
3. Re-scale the channels by those weights

This was attention's first widespread entry into convnets. Winner of ImageNet 2017.
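The three steps can be sketched in plain Python on a tiny feature map. Everything here is illustrative: the function name `se_block`, the hand-picked weights, and the omission of biases are assumptions, and the excitation uses a two-layer FC → ReLU → FC → sigmoid stack.

```python
import math

def se_block(feature_map, w1, w2):
    """feature_map: [C][H][W]; w1: [C_mid][C]; w2: [C][C_mid] (biases omitted)."""
    # 1. Squeeze: global average pool gives one number per channel
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 2. Excite: FC -> ReLU -> FC -> sigmoid gives a weight in (0, 1) per channel
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # 3. Re-scale every channel by its weight
    scaled = [[[v * a for v in row] for row in ch] for ch, a in zip(feature_map, weights)]
    return scaled, weights

# Hypothetical 2-channel 2x2 input and hand-picked weights
x = [[[1.0, 1.0], [1.0, 1.0]],
     [[4.0, 4.0], [4.0, 4.0]]]
w1 = [[-1.0, 1.0]]           # reduce 2 channels to 1 hidden unit
w2 = [[2.0], [-2.0]]         # expand back to 2 channels
out, weights = se_block(x, w1, w2)
print([round(a, 3) for a in weights])
print(out[0][0][0], out[1][0][0])
```

With these toy weights the block keeps the first channel almost intact and suppresses the second, which is the sense in which the weights act as channel-wise attention.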

EfficientNet 2019

Tan & Le (Google). Compound scaling: scale depth, width, and resolution together in a balanced way. State of the art on ImageNet with fewer parameters.
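Compound scaling ties the three dimensions to a single coefficient φ: depth scales as α^φ, width as β^φ, and resolution as γ^φ. The sketch below uses α = 1.2, β = 1.1, γ = 1.15, which are the values we recall from the EfficientNet paper; treat them as assumptions and check the paper before relying on them.

```python
# Coefficients as recalled from the EfficientNet paper (an assumption; verify before use)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # bases for depth, width, resolution

def compound_scale(phi):
    """Multipliers for depth, width, and resolution at compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPs grow roughly with depth * width^2 * resolution^2
flops_factor_per_step = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_factor_per_step, 3))

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 2), round(w, 2), round(r, 2))
```

The bases are chosen so that each increment of φ roughly doubles the compute, which makes φ a single knob for the compute budget.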

ConvNeXt 2022

Liu, Mao, Wu et al.: a transformer-inspired convnet design with LayerNorm, GELU, and depth-wise conv. A modern convnet that competes with pure transformers.

9. Vision Transformer 2020: The End of the Hegemony

Dosovitskiy et al. (Google Research): "An Image Is Worth 16x16 Words", ICLR 2021.

Idea

"Split the image into 16×16 patches, treat each patch as a token, and run a standard transformer."

Result

With a large enough dataset (JFT-300M, ImageNet-21k) it surpassed convnets
It left ResNet baselines behind on ImageNet
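The patch idea behind ViT comes down to a reshape. The sketch below counts the tokens for a 224×224 RGB image with 16×16 patches and shows the flattening on a toy single-channel image; the helper names and the toy input are illustrative, and the learned linear projection and position embeddings that follow in a real ViT are left out.

```python
def patch_grid(image_size=224, patch_size=16, channels=3):
    """Number of tokens and raw dimension per token for square images and patches."""
    if image_size % patch_size:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels

def patchify(image, patch_size):
    """image: [H][W] nested list (one channel); returns flattened patches in row-major order."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(patch_size) for dj in range(patch_size)]
        for i in range(0, h, patch_size)
        for j in range(0, w, patch_size)
    ]

tokens, dim = patch_grid()                 # 224x224 RGB, 16x16 patches
print(tokens, dim)                         # 196 tokens of 768 values each

tiny = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 image
print(patchify(tiny, 2))
```

A 224×224 image becomes a sequence of 196 tokens, short enough for standard self-attention, and nothing in the reshape encodes which patches are neighbors.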

Why does it matter?

ViT broke the belief that vision needs a domain-specific bias. The locality and translation-equivariance bias of convolution can be compensated for by a transformer given enough scale and data.

This is living proof of the Bitter Lesson (Module 3.2).

What came next

Swin Transformer (2021): hierarchical, window attention
DeiT (Touvron 2021): data-efficient ViT
DINOv2 (2023): self-supervised ViT pretraining
SAM, SAM 2 (Meta 2023-24): Segment Anything, ViT-based

Hybrids

MaxViT and MobileViT mix convnet and transformer components, and ConvNeXt brings transformer design choices into a pure convnet. All are still common in practical production.

10. The Legacy from Vision to Transformers

Nearly every component of modern LLMs owes something to the vision era:

| Component | Source | Modern counterpart |
| --- | --- | --- |
| ReLU | AlexNet 2012 | GELU, SwiGLU |
| Dropout | AlexNet 2012 | Still in use |
| BatchNorm | Ioffe-Szegedy 2015 | LayerNorm, RMSNorm |
| Skip connection | ResNet 2015 | Every transformer block |
| 1×1 conv (bottleneck) | Inception 2014 | Linear projection |
| Channel attention | SE-Net 2018 | Self-attention |
| Compound scaling | EfficientNet 2019 | Scaling laws |
| Multi-scale / multi-head | Inception 2014 | Multi-head attention |
| Data augmentation | AlexNet 2012 | Synthetic data, RLHF |
| GPU distributed training | AlexNet 2012 | FSDP, DeepSpeed |
| Pretrained features | VGG 2014 | Foundation models |
| Self-supervised | DINOv2 2023 | LLM pretraining |

One-sentence summary

The intellectual legacy of AlexNet's 5 innovations from 2012 (ReLU, dropout, GPU training, augmentation, LRN) runs through every LLM architecture of 2026.

11. Mini Exercises

1. AlexNet vs LeNet-5: how much bigger is it? Which architectural innovations differ?
2. 3×3 vs 5×5: two 3×3 convs have the same receptive field as one 5×5 with fewer parameters. Does this trade-off hold in every case?
3. BatchNorm vs LayerNorm: why LN instead of BN in NLP/transformers?
4. Residual connection math: for y = F(x) + x, what is ∂y/∂x in the backward pass? How does this address vanishing gradients?
5. Weaknesses of ViT: why does the convnet's spatial inductive bias beat ViT on small datasets? Does this contradict the Bitter Lesson?

What Did We Learn in This Lesson?

✓ Vision before 2012: hand-crafted features + SVM, about 26% top-5 error on ImageNet
✓ AlexNet's 5 innovations: ReLU, Dropout, GPU, Augmentation, LRN (the Big Bang)
✓ VGG 2014: 3×3 conv uniformity
✓ Inception 2014: multi-scale + 1×1 bottleneck
✓ BatchNorm 2015: internal covariate shift, training stabilization
✓ ResNet 2015: skip connections, the ancestor of the modern LLM
✓ DenseNet, SE-Net, EfficientNet: the later evolutions
✓ Vision Transformer 2020: the end of convnet hegemony
✓ 12 legacies for Transformers: ReLU → SwiGLU, BN → RMSNorm, skip → residual block, etc.

Next Lesson

3.4: Sequence Modeling, the Road from RNN, LSTM, and GRU to Attention. The evolution of NLP in parallel with vision: the RNN's vanishing gradient, the LSTM's solution, the encoder-decoder, and the birth of attention. This journey will take us to the 2017 Transformer.

