FP8 Training: H100 Native, Premature on RTX 4090 — Transformer Engine Internals
FP8 is the future of AI compute. The H100 supports it natively (FP8 Tensor Cores + WGMMA + Transformer Engine). The RTX 4090 (Ada) supports FP8 GEMM, but the ecosystem is not ripe: fallbacks are common and the training pipeline is buggy. Cookbook rule: bf16 training on the 4090, FP8 inference (vLLM). FP8 training on the H100 is detailed in Part XIII.
Şükrü Yusuf KAYA
26 min read
## 1. FP8 Formats
Two main FP8 formats:
- e4m3 (sign + 4 exponent + 3 mantissa): narrower range, more precision; used for the forward pass and inference
- e5m2 (sign + 5 exponent + 2 mantissa): wider range, less precision; used for training gradients
| Format | Range | Precision (eps) | Use case |
|---|---|---|---|
| fp32 | ±3.4e38 | 1.2e-7 | reference master |
| bf16 | ±3.4e38 | 7.8e-3 | training default |
| fp16 | ±65504 | 9.8e-4 | legacy training |
| fp8-e4m3 | ±448 | 0.125 | inference / forward |
| fp8-e5m2 | ±57344 | 0.25 | training gradients |
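These numbers are easy to verify: PyTorch exposes both FP8 formats as native dtypes (`torch.float8_e4m3fn` / `torch.float8_e5m2`, available since roughly PyTorch 2.1). A minimal sketch:

```python
import torch

# Check the table's range/precision numbers against PyTorch's own finfo.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    fi = torch.finfo(dtype)
    print(f"{dtype}: max={fi.max}, eps={fi.eps}, smallest_normal={fi.tiny}")

# Precision is coarse: in [1, 2) the e4m3 grid spacing is eps = 0.125,
# so intermediate values round to the nearest grid point.
x = torch.linspace(1.0, 2.0, 17)  # step 0.0625, half the grid spacing
print(x.to(torch.float8_e4m3fn).to(torch.float32))
```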
Both ranges are tiny, so a per-tensor scaling factor is mandatory. NVIDIA's Transformer Engine manages this scaling automatically; a manual sketch of the same mechanism follows the TE example below.
```python
# === Transformer Engine FP8 (H100 native) ===
import torch
import torch.nn as nn
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

# FP8 recipe: TE handles the scaling factors automatically
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,   # e4m3 forward, e5m2 backward
    amax_history_len=16,        # scaling factor history length
    amax_compute_algo="max",
)

# Model: TE layers are drop-in replacements for nn.Linear
class Llama_FP8(nn.Module):
    def __init__(self, hidden, ffn_hidden):
        super().__init__()
        self.attn_qkv = te.Linear(hidden, 3 * hidden, bias=False)
        self.attn_o = te.Linear(hidden, hidden, bias=False)
        self.ffn_gate = te.Linear(hidden, ffn_hidden, bias=False)
        self.ffn_up = te.Linear(hidden, ffn_hidden, bias=False)
        self.ffn_down = te.Linear(ffn_hidden, hidden, bias=False)

# Training loop: forward runs under fp8_autocast, backward outside it
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = model(input)
    loss = compute_loss(output)
loss.backward()
optimizer.step()

# H100: +30-40% throughput vs bf16
# RTX 4090: +10-15% vs bf16 (kernels are Hopper-optimized)
```
Transformer Engine FP8 training (H100 native)
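To make concrete what `DelayedScaling` automates, here is a minimal per-tensor scaling sketch in plain PyTorch (the helper names are illustrative, not TE API). One simplification to note: TE derives the scale from an amax history accumulated over previous steps (hence "delayed"), while this sketch uses the current tensor's amax directly.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite e4m3 value

def quantize_e4m3(x: torch.Tensor):
    # TE takes the max over a history of amax values ("delayed" scaling);
    # here we use the current tensor's amax for simplicity.
    amax = x.abs().max().clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax                  # map amax onto the format's max
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # cast after rescaling
    return x_fp8, scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) / scale       # undo the scale

# Small-magnitude tensor: without scaling, most values would land in the
# subnormal range of e4m3 and lose nearly all precision.
x = torch.randn(4, 4) * 1e-2
x_fp8, scale = quantize_e4m3(x)
print("max round-trip error:", (x - dequantize(x_fp8, scale)).abs().max().item())
```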
## 2. FP8 Status on the RTX 4090 (Early 2026)
| Aspect | Status |
|---|---|
| FP8 Tensor Cores | ✅ present on Ada |
| Transformer Engine | ⚠️ Hopper-optimized; falls back on Ada |
| FP8 GEMM kernels | ✅ via cuBLASLt |
| FP8 attention | ⚠️ Flash-Attention v3 is Hopper-only |
| FP8 inference (vLLM) | ✅ works well (+30-40% throughput) |
| FP8 training stability | ⚠️ OK at small scale, buggy at large scale |
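Whether you land on a fast path or a fallback hinges on the GPU architecture, so a runtime check is worth two lines. A minimal sketch (Ada reports compute capability 8.9, Hopper 9.0):

```python
import torch

# Ada (RTX 4090) = sm_89, Hopper (H100) = sm_90.
major, minor = torch.cuda.get_device_capability()
has_fp8_tensor_cores = (major, minor) >= (8, 9)
on_hopper_fast_path = major >= 9   # TE / Flash-Attention v3 fast paths
print(f"sm_{major}{minor}: fp8_tensor_cores={has_fp8_tensor_cores}, "
      f"hopper_fast_path={on_hopper_fast_path}")
```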
The Cookbook's 2026 verdict:
- RTX 4090 training: bf16 (default, stable); FP8 is experimental.
- RTX 4090 inference: FP8 (vLLM supports it, real benefit; see the sketch below).
- H100 training/inference: FP8 (detailed in Part XIII).
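For that inference path, vLLM can quantize a bf16 checkpoint to FP8 at load time via `quantization="fp8"`. A minimal sketch; the model name is illustrative:

```python
from vllm import LLM, SamplingParams

# Load-time FP8 quantization of a bf16 checkpoint (model name illustrative).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain e4m3 vs e5m2 in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Running the same prompt set with and without `quantization="fp8"` gives you the bf16-vs-FP8 comparison asked for in deliverable 2 below.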
✅ Deliverables
1. On the RTX 4090, test TE FP8 training on a SMALL model (a plain MLP, not an LLM).
2. Compare vLLM FP8 inference against bf16.
3. Next lesson: 10.7, Int4 QLoRA NF4 + Double-Quant Internals.