2× RTX 4090, or 1× RTX 4090 plus cloud A100 80GB by the hour: which is more cost-effective?

The cookbook's rule: for development, 1× 4090 locally (₺1.6/hour in electricity); for production training, a cloud H100 (1-2 hours, roughly $24-50). A second 4090 is an extra ~$1600 investment at ~55% scaling efficiency, so it is not rational unless you would otherwise spend ₺50K+ per year on cloud.

Are there alternatives to NCCL (RCCL, oneCCL)?

On AMD there is RCCL (ROCm); on Intel GPUs, oneCCL. The cookbook focuses on NVIDIA only. NCCL's full flag set (NCCL_IB_*, NCCL_SOCKET_IFNAME) is the most mature.

PCIe vs NVLink vs InfiniBand: The Invisible Impact of Bandwidth on Training

Bandwidth is invisible on a single 4090 but at scale-out it alone can slow training. PCIe 4.0/5.0 lane math, NVLink (and why 4090 doesn't have it), NVSwitch topology, InfiniBand 400G, threshold where NCCL all-reduce becomes network-bound, p2p_access detection, GPU-direct.

Şükrü Yusuf KAYA

32 min read

5/14/2026

Advanced


🎯 What this lesson is for

A single 4090 covers 85% of the cookbook, but from Part IV onward we move to multi-GPU. Multi-GPU performance is usually network-bound rather than compute-bound. This lesson explains how the interconnect actually works for ML: what the missing NVLink on the RTX 4090 means, what to watch for in a 2× 4090 PCIe setup, and how NVSwitch + IB are used on a cloud 8×H100 node.

1. PCIe Bandwidth Table

PCIe lane bandwidth (raw, per direction; links are full-duplex):

Gen	Per lane (GB/s)	x4	x8	x16
3.0	0.985	3.94	7.88	15.75
4.0	1.97	7.88	15.75	31.5
5.0	3.94	15.75	31.5	63.0

RTX 4090: PCIe 4.0 x16 → 31.5 GB/s (host ↔ GPU data transfer)
H100 PCIe: PCIe 5.0 x16 → 63 GB/s
H100 SXM: NVLink (≤900 GB/s) + PCIe 5.0 (link to the host)
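The table values can be reproduced from the per-lane transfer rate and the line encoding. The sketch below assumes the standard rates of 8, 16 and 32 GT/s for Gen 3.0, 4.0 and 5.0 and 128b/130b encoding; the function and constant names are illustrative, not part of the cookbook.

```python
# Per-lane transfer rate in GT/s for each PCIe generation (assumed standard values).
TRANSFER_RATE_GT_S = {"3.0": 8.0, "4.0": 16.0, "5.0": 32.0}

# Gen 3.0 to 5.0 use 128b/130b encoding: 128 payload bits per 130 bits on the wire.
ENCODING_EFFICIENCY = 128 / 130


def pcie_bandwidth_gb_s(gen: str, lanes: int) -> float:
    """Raw one-direction bandwidth of a PCIe link in GB/s."""
    gigabits_per_s = TRANSFER_RATE_GT_S[gen] * ENCODING_EFFICIENCY * lanes
    return gigabits_per_s / 8  # bits -> bytes


if __name__ == "__main__":
    for gen in TRANSFER_RATE_GT_S:
        row = [pcie_bandwidth_gb_s(gen, lanes) for lanes in (1, 4, 8, 16)]
        print(gen, " ".join(f"{value:6.2f}" for value in row))
```

These are raw link rates; measured throughput is lower once protocol overhead is included.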

What do these numbers mean in practice?

CPU → GPU data transfer (e.g. dataloader → GPU):

1 batch (8 × 4096 tokens × 2 bytes) = 64 KB → 0.002 ms (4090 PCIe 4.0)
1 batch (vision, 16 images of 224×224×3 in fp16) = 4.8 MB → 0.15 ms
Training step (Llama 8B forward) compute = ~250 ms → the CPU→GPU transfer is invisible (0.06%)

Conclusion: on a single 4090, PCIe is never the bottleneck. It starts to matter with multi-GPU or a CPU-offloaded optimizer.
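The arithmetic behind these figures is simple enough to check directly. The sketch below is a bandwidth-only estimate: it assumes the 31.5 GB/s raw PCIe 4.0 x16 figure and ignores latency and driver overhead, so real transfers will be somewhat slower. All names are illustrative.

```python
PCIE4_X16_GB_S = 31.5  # raw one-direction bandwidth of PCIe 4.0 x16


def transfer_ms(num_bytes: int, bandwidth_gb_s: float = PCIE4_X16_GB_S) -> float:
    """Bandwidth-only transfer time in milliseconds (no latency, no overhead)."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3


# Batch sizes from the text.
token_batch = 8 * 4096 * 2             # 8 sequences x 4096 tokens x 2 bytes
image_batch = 16 * 224 * 224 * 3 * 2   # 16 images, 224x224x3, fp16

STEP_COMPUTE_MS = 250.0  # approximate compute time per training step

if __name__ == "__main__":
    for name, size in (("tokens", token_batch), ("images", image_batch)):
        ms = transfer_ms(size)
        share = ms / STEP_COMPUTE_MS * 100
        print(f"{name}: {size} bytes, {ms:.4f} ms, {share:.3f}% of a step")
```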

2. NVLink: Why the RTX 4090 Does NOT Have It

NVIDIA removed NVLink from the RTX 4090 (the RTX 3090 had it). The commonly cited reason is to steer the AI compute market toward datacenter SKUs (H100/A100).

What this means:

2× RTX 4090 is PCIe-only: GPU-to-GPU transfers go over PCIe 4.0 x16 and nothing else
During training, gradient all-reduces travel over PCIe
Roughly 28-30 GB/s effective per direction (after protocol overhead)

Comparison:

Setup	GPU-GPU bandwidth
2× RTX 3090 + NVLink	~112 GB/s (NVLink 3, bidirectional)
2× RTX 4090 PCIe-only	~28-30 GB/s
8× H100 NVSwitch	~450 GB/s per pair, per direction

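Is 2× RTX 4090 multi-GPU practical?

Llama 3.1 8B QLoRA with DDP (data parallel): over PCIe on a typical x8/x8 board, ~55-58% scaling efficiency (about 1.15x a single GPU, against an ideal 2x); the rarer x16/x16 boards reach about 1.6x (~80%), per the Section 5 benchmark
Multi-GPU 4090 is not recommended in the cookbook: a single 4090 is more efficient, and 1× cloud A100 80GB is a more rational choice than 2× 4090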

3. NVSwitch + NVLink: The Datacenter Pattern

H100 SXM 8-GPU systems (DGX H100, HGX H100):

[GPU 0] ---|
[GPU 1] ---|
[GPU 2] ---|--- NVSwitch 0-3 --- all-to-all, 900 GB/s per GPU
[GPU 3] ---|     (4 NVSwitch chips per node; each GPU links to every chip)
[GPU 4-7] -|

Per-GPU NVLink bandwidth: 900 GB/s (4th-generation NVLink, 18 links × 50 GB/s)
Intra-node all-reduce of 1 GB: ~3-5 ms
NCCL ring all-reduce throughput: ~85-90% of peak
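The 1 GB all-reduce figure can be sanity-checked with the usual ring all-reduce cost model, in which each GPU sends and receives 2(N-1)/N times the buffer size. This is a sketch under stated assumptions: 450 GB/s per direction per GPU (half of the 900 GB/s bidirectional figure) and the 85-90% efficiency quoted above. NCCL may pick other algorithms in practice, so treat it as an estimate only.

```python
def ring_allreduce_seconds(size_bytes: float, num_gpus: int,
                           link_gb_s: float, efficiency: float = 1.0) -> float:
    """Estimated ring all-reduce time from the 2(N-1)/N traffic model."""
    if num_gpus < 2:
        return 0.0  # nothing to reduce across
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * size_bytes
    return traffic_bytes / (link_gb_s * 1e9 * efficiency)


if __name__ == "__main__":
    # Assumption: 450 GB/s one-direction NVLink bandwidth per GPU.
    for eff in (0.85, 0.90):
        t = ring_allreduce_seconds(1e9, num_gpus=8, link_gb_s=450, efficiency=eff)
        print(f"efficiency {eff:.2f}: {t * 1e3:.2f} ms")
```

Under these assumptions the estimate lands inside the ~3-5 ms range given above.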

Why does this make such a difference?

When fine-tuning a 70B model, the gradients total ~140 GB per step (70B parameters × 2 bytes, sharded across 8 GPUs), and every step needs an all-reduce over those 140 GB. On PCIe-only multi-GPU that is 140 GB / 30 GB/s = 4.7 seconds of pure communication. With NVSwitch it is 140 GB / 450 GB/s = 0.3 seconds.
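The same estimate as a small calculation. It uses the naive volume-over-bandwidth model from the text (no ring factor, no overlap with compute), and the helper names are illustrative.

```python
def gradient_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Gradient volume in GB for a model with 16-bit gradients by default."""
    return num_params * bytes_per_param / 1e9


def naive_comm_seconds(volume_gb: float, bandwidth_gb_s: float) -> float:
    """Communication time if the whole volume crosses one link once."""
    return volume_gb / bandwidth_gb_s


if __name__ == "__main__":
    volume = gradient_gb(70e9)  # 70B parameters x 2 bytes
    for name, bw in (("PCIe-only", 30), ("NVSwitch", 450)):
        print(f"{name}: {naive_comm_seconds(volume, bw):.1f} s per step")
```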

On an 8×H100 system this communication is billed at the node's hourly rate, about $24/hour. On an 8× RTX 4090 PCIe rig the same job runs 5-10x slower and takes two days, which works out more expensive than the cloud.

4. InfiniBand: The Multi-Node Network

In clusters of two or more nodes, node-to-node bandwidth:

Network	Bandwidth	Latency	Use case
1 Gb Ethernet	0.125 GB/s	~50 µs	not usable
10 Gb Ethernet	1.25 GB/s	~10 µs	minimal
25 Gb Ethernet (RoCEv2)	3.13 GB/s	~5 µs	budget
100 Gb IB (EDR)	12.5 GB/s	~1.2 µs	common datacenter
200 Gb IB (HDR)	25 GB/s	~1.0 µs	premium
400 Gb IB (NDR)	50 GB/s	~0.7 µs	H100 clusters, 2026 default
800 Gb IB (XDR)	100 GB/s	<0.5 µs	new in 2026

CoreWeave / Lambda 8×H100 reference: 8× 400G NDR IB = 400 GB/s of node-to-node bandwidth per node, with ~70% effective NCCL throughput.
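Network speeds are quoted in gigabits per second while the rest of this lesson uses gigabytes per second, so the conversion is worth making explicit. A minimal sketch with illustrative helper names, using the 8-NIC and ~70% figures from the reference configuration above:

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link rate from Gb/s to GB/s."""
    return gbit_per_s / 8


def node_bandwidth_gb_s(num_nics: int, gbit_per_nic: float,
                        efficiency: float = 1.0) -> float:
    """Aggregate node-to-node bandwidth in GB/s across all NICs."""
    return num_nics * gbit_to_gbyte(gbit_per_nic) * efficiency


if __name__ == "__main__":
    raw = node_bandwidth_gb_s(8, 400)             # 8x 400G NDR
    effective = node_bandwidth_gb_s(8, 400, 0.7)  # ~70% NCCL throughput
    print(f"raw {raw:.0f} GB/s, effective {effective:.0f} GB/s")
```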

The threshold where NCCL all-reduce becomes network-bound

Compute: 8B model, bf16 forward+backward, ~250 ms
All-reduce: gradient size (N parameters × 2 bytes) / network bandwidth

Gradient size	Network BW	All-reduce time	Compute time	Network-bound?
16 GB (8B model)	400 GB/s NVSwitch	0.04 s	0.25 s	no
16 GB	50 GB/s NDR IB (cross-node)	0.32 s	0.25 s	yes
16 GB	30 GB/s PCIe (2× 4090)	0.53 s	0.25 s	yes
16 GB	3 GB/s 25GbE	5.3 s	0.25 s	catastrophic

Rule: once all-reduce time exceeds compute time, communication can no longer be hidden behind compute and multi-GPU efficiency collapses. Mitigations: ZeRO-3, gradient compression, and compute/communication overlap.
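The rule also gives a break-even bandwidth: the link speed at which all-reduce time equals compute time. The sketch below applies the table's own simple model (gradient size divided by bandwidth) to the four rows; the names are illustrative.

```python
def allreduce_seconds(gradient_gb: float, bandwidth_gb_s: float) -> float:
    """All-reduce time under the simple size / bandwidth model."""
    return gradient_gb / bandwidth_gb_s


def is_network_bound(gradient_gb: float, bandwidth_gb_s: float,
                     compute_s: float) -> bool:
    """True when communication takes longer than compute."""
    return allreduce_seconds(gradient_gb, bandwidth_gb_s) > compute_s


def break_even_bandwidth_gb_s(gradient_gb: float, compute_s: float) -> float:
    """Bandwidth at which all-reduce time equals compute time."""
    return gradient_gb / compute_s


# Rows of the table: link name -> bandwidth in GB/s.
LINKS = {"NVSwitch": 400, "NDR IB (cross-node)": 50, "PCIe (2x 4090)": 30, "25GbE": 3}

if __name__ == "__main__":
    grad_gb, compute_s = 16, 0.25
    print(f"break-even: {break_even_bandwidth_gb_s(grad_gb, compute_s):.0f} GB/s")
    for name, bw in LINKS.items():
        bound = is_network_bound(grad_gb, bw, compute_s)
        print(f"{name}: {allreduce_seconds(grad_gb, bw):.2f} s, network-bound={bound}")
```

For a 16 GB gradient and 0.25 s of compute the break-even sits at 64 GB/s, which is why NDR InfiniBand at 50 GB/s already falls on the network-bound side.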

bash

# === Discover the GPU topology (the first command in every cluster Lab of the cookbook) ===
nvidia-smi topo -m

# Typical 8×H100 SXM output:
#         GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity
#  GPU0    X    NV18  NV18  NV18  NV18  NV18  NV18  NV18  0-31,64-95     0
#  GPU1   NV18   X    NV18  NV18  NV18  NV18  NV18  NV18  0-31,64-95     0
#  ...
#  GPU4   NV18  NV18  NV18  NV18   X    NV18  NV18  NV18  32-63,96-127   1
#  ...
#
# NV18 = path over a bonded set of 18 NVLinks (high bandwidth)
# SYS  = PCIe plus the inter-socket (NUMA) interconnect, the slowest path
# PHB  = PCIe through a PCIe Host Bridge (typically the CPU)
# PXB  = PCIe across multiple PCIe bridges, without crossing the host bridge
# PIX  = PCIe across at most a single PCIe bridge

# Example RTX 4090 PCIe-only 2-GPU system:
#         GPU0  GPU1
#  GPU0    X    SYS    ← traffic crosses the CPU, slow
#  GPU1   SYS   X
#
# On this system NCCL all-reduce goes over PCIe 4.0, peaking at ~28 GB/s

nvidia-smi topo -m: reading the topology
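On larger nodes the matrix is easier to check programmatically. The sketch below parses `nvidia-smi topo -m` text and lists GPU pairs whose traffic is routed through the host. It assumes plain whitespace-separated output with no terminal escape codes; the function names and the `HOST_ROUTED` set are illustrative choices, not an NVIDIA API.

```python
import re

GPU_NAME = re.compile(r"^GPU\d+$")
# Link codes that route GPU-to-GPU traffic through the host bridge or CPU interconnect.
HOST_ROUTED = {"SYS", "NODE", "PHB"}


def parse_topo(text: str) -> dict:
    """Map (row_gpu, col_gpu) -> link code from `nvidia-smi topo -m` text."""
    columns = None
    links = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or not GPU_NAME.match(tokens[0]):
            continue  # skip blank lines, NIC rows and the legend
        if columns is None:
            # The first GPU line is the header; keep only its GPU columns.
            columns = [t for t in tokens if GPU_NAME.match(t)]
            continue
        for col, code in zip(columns, tokens[1:len(columns) + 1]):
            if code != "X":  # "X" marks the GPU itself
                links[(tokens[0], col)] = code
    return links


def host_routed_pairs(links: dict) -> list:
    """GPU pairs whose traffic crosses the host, sorted for stable output."""
    return sorted(pair for pair, code in links.items() if code in HOST_ROUTED)


# The 2-GPU example matrix from the listing above, without the comment markers.
SAMPLE = """\
        GPU0  GPU1
 GPU0    X    SYS
 GPU1   SYS   X
"""

if __name__ == "__main__":
    print(host_routed_pairs(parse_topo(SAMPLE)))
```

To run it against a live system, pass the captured output of `nvidia-smi topo -m` to `parse_topo` in place of `SAMPLE`.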

python

# === P2P access check ===
import torch

def check_p2p():
    """Return an n×n matrix; [i][j] is True if GPU i can access GPU j's memory directly."""
    n = torch.cuda.device_count()
    matrix = [[False]*n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # the diagonal (a GPU with itself) stays False
                matrix[i][j] = torch.cuda.can_device_access_peer(i, j)
    return matrix

m = check_p2p()
for i, row in enumerate(m):
    print(f"GPU{i}: {row}")

# 8×H100 NVSwitch: all True (off the diagonal)
# 2×RTX 4090 PCIe-only: needs "PCI Above 4G" in the BIOS + IOMMU OFF; False on most motherboards
# WSL2: usually False (may be True on the host)

P2P access check

🐛 FMD: 'A single 4090 runs at 1.78 step/s, 2× 4090 DDP at 1.05 step/s, so scaling went backwards'

Expected: ~3.0 step/s (1.7x speedup). Actual: 1.05 step/s, slower than a single GPU. Hypotheses:
(a) PCIe 4.0 x8: the motherboard splits two GPUs into x8/x8, which halves the bandwidth. Check: read the real lane width with `lspci -vv | grep -i 'lnksta'`.
(b) P2P access is disabled (BIOS), so every tensor goes through host RAM, roughly 5x slower. Fix: enable 'Above 4G Decoding' in the BIOS and test with IOMMU off, since an enabled IOMMU can redirect peer-to-peer PCIe traffic through the CPU root complex.
(c) DDP has fallen back from NCCL to gloo (CPU). Check: confirm with `NCCL_DEBUG=INFO`.
Drill: eliminate these three hypotheses in order.
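For hypothesis (a), the `LnkSta` line reports the negotiated speed and width, which is enough to estimate the link's raw bandwidth. The sketch below parses one such line; the sample string is an illustrative line in the usual `lspci -vv` layout, not captured output, and the helper names are hypothetical. GPUs may drop the link speed while idle, so it is safer to read the value under load.

```python
import re

# Matches e.g. "LnkSta: Speed 16GT/s (ok), Width x8 (downgraded)".
LNKSTA = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_lnksta(line: str):
    """Return (speed in GT/s, lane width) or None if the line has no LnkSta."""
    match = LNKSTA.search(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def link_bandwidth_gb_s(gt_s: float, width: int) -> float:
    """Raw one-direction bandwidth in GB/s for the negotiated link."""
    # 2.5 and 5 GT/s links use 8b/10b encoding; 8 GT/s and above use 128b/130b.
    encoding = 0.8 if gt_s <= 5 else 128 / 130
    return gt_s * encoding * width / 8


# Illustrative sample line (assumption: typical pciutils formatting).
SAMPLE_LINE = "\t\tLnkSta:\tSpeed 16GT/s (ok), Width x8 (downgraded)"

if __name__ == "__main__":
    speed, width = parse_lnksta(SAMPLE_LINE)
    print(f"x{width} at {speed} GT/s: {link_bandwidth_gb_s(speed, width):.2f} GB/s")
```

A Gen 4 GPU reporting `Width x8` has half the x16 bandwidth from the PCIe table, which matches the x8/x8 row of the benchmark.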

5. Bench: Multi-GPU Scaling Efficiency

Llama 3.1 8B QLoRA, batch_per_gpu=2, gradient accumulation=4.

Setup	Bandwidth	step/s	Scaling efficiency
1× RTX 4090	—	1.78	100% (baseline)
2× RTX 4090 PCIe 4.0 x8/x8	~14 GB/s effective	2.05	58%
2× RTX 4090 PCIe 4.0 x16/x16 (rare)	~28 GB/s	2.85	80%
2× A100 80GB NVLink	600 GB/s	3.42	96%
8× H100 SXM NVSwitch	450 GB/s	13.5	95%

Takeaway: with cheap bandwidth, scaling efficiency stays around 55-60%. An expensive interconnect means an expensive GPU and an expensive hour, but an efficient hour.
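Scaling efficiency here is the measured step rate divided by the ideal rate, that is, the GPU count times the single-GPU rate. A minimal sketch using the two RTX 4090 rows, the only ones for which the table states the single-GPU baseline; the function names are illustrative.

```python
def speedup(step_rate_n: float, step_rate_1: float) -> float:
    """Throughput of the multi-GPU run relative to one GPU."""
    return step_rate_n / step_rate_1


def scaling_efficiency(step_rate_n: float, n_gpus: int, step_rate_1: float) -> float:
    """Fraction of ideal linear scaling actually achieved (1.0 = perfect)."""
    return step_rate_n / (n_gpus * step_rate_1)


if __name__ == "__main__":
    baseline = 1.78  # step/s on 1x RTX 4090
    for name, rate in (("x8/x8", 2.05), ("x16/x16", 2.85)):
        eff = scaling_efficiency(rate, 2, baseline)
        print(f"2x 4090 {name}: {speedup(rate, baseline):.2f}x, {eff:.0%} efficiency")
```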

✅ Deliverables

1) Read the output of `nvidia-smi topo -m` and `lspci -vv` to understand the topology of your own system or cloud instance.
2) Verify with an NCCL test that you are not bandwidth-bound on your single 4090.
3) Next lesson: 1.6, Storage I/O Engineering: The Dataset's Speed Problem.

Frequently Asked Questions

Multi-GPU on WSL2 is limited. WSL2 GPU passthrough usually leaves P2P access disabled, so tensors go through host RAM, which is very slow. On native Linux, with 'PCI Above 4G' and the IOMMU configured in the BIOS and a Linux kernel ≥ 6.2, P2P can be active. The cookbook's multi-GPU Labs assume native Linux. WSL2 is practical for single-GPU fine-tuning.
