
FP8 Training: H100 Native, Premature on RTX 4090 — Transformer Engine Internals

FP8 is the future of AI compute. The H100 supports it natively (FP8 Tensor Cores + WGMMA + Transformer Engine). The RTX 4090 (Ada) supports FP8 GEMMs, but the ecosystem is unripe: fallbacks are common and the training pipeline is buggy. Cookbook rule: bf16 training on the 4090, FP8 inference via vLLM. FP8 training on the H100 is covered in detail in Part XIII.

Şükrü Yusuf KAYA
26 min read
Advanced

1. FP8 Formats

There are two main FP8 formats:
  • e4m3 (sign + 4 exponent + 3 mantissa bits) — narrower range, more precision; used for the forward pass and inference
  • e5m2 (sign + 5 exponent + 2 mantissa bits) — wider range, less precision; used for training gradients
| Format | Range | Precision (eps) | Use case |
| --- | --- | --- | --- |
| fp32 | ±3.4e38 | 7.2e-8 | reference master |
| bf16 | ±3.4e38 | 7.8e-3 | training default |
| fp16 | ±65504 | 9.8e-4 | legacy training |
| fp8-e4m3 | ±448 | 0.125 | inference / forward |
| fp8-e5m2 | ±57344 | 0.25 | training gradients |
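A quick way to double-check the FP8 rows is PyTorch's own dtype metadata; a minimal sketch, assuming PyTorch ≥ 2.1 (which ships native float8 dtypes):

```python
import torch

# torch.finfo reproduces the range/precision columns above
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    fi = torch.finfo(dtype)
    print(f"{dtype}: max={fi.max}, eps={fi.eps}")
# torch.float8_e4m3fn: max=448.0, eps=0.125
# torch.float8_e5m2:   max=57344.0, eps=0.25
```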
This tiny range makes a per-tensor scaling factor mandatory. NVIDIA's Transformer Engine manages the scaling automatically.
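To see why the scale matters, here is a minimal per-tensor quantize/dequantize round trip in plain PyTorch (an illustration, not the Transformer Engine code path):

```python
import torch

def quantize_e4m3(x: torch.Tensor):
    """Per-tensor scaling: map the tensor's amax onto the FP8 max (448)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max        # 448.0
    scale = fp8_max / x.abs().max().clamp(min=1e-12)      # one scalar per tensor
    return (x * scale).to(torch.float8_e4m3fn), scale

x = torch.randn(4096, 4096) * 0.01        # typical activation magnitudes
x_fp8, scale = quantize_e4m3(x)
x_back = x_fp8.to(torch.float32) / scale  # dequantize
print((x - x_back).abs().max())           # small error, thanks to the scale
# Casting the raw tensor directly would crush most values into a few FP8 bins.
```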
```python
# === Transformer Engine FP8 (H100 native) ===
import torch
import torch.nn as nn
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

# FP8 recipe: TE handles the scaling factors automatically
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,    # e4m3 forward, e5m2 backward
    amax_history_len=16,         # scaling-factor history window
    amax_compute_algo="max",
)

# Model: TE layers are drop-in replacements for nn.Linear
class Llama_FP8(nn.Module):
    def __init__(self, hidden, ffn_hidden):
        super().__init__()
        self.attn_qkv = te.Linear(hidden, 3 * hidden, bias=False)
        self.attn_o = te.Linear(hidden, hidden, bias=False)
        self.ffn_gate = te.Linear(hidden, ffn_hidden, bias=False)
        self.ffn_up = te.Linear(hidden, ffn_hidden, bias=False)
        self.ffn_down = te.Linear(ffn_hidden, hidden, bias=False)

# Training loop: FP8 GEMMs run inside the autocast region,
# backward and the optimizer step stay outside
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = model(input)
loss = compute_loss(output)
loss.backward()
optimizer.step()

# On H100: +30-40% throughput over bf16
# On RTX 4090: only +10-15% (kernels are Hopper-optimized)
```
Transformer Engine FP8 training (H100 native)
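Roughly what DelayedScaling does under the hood: it keeps a rolling window of per-tensor amax values and derives the quantization scale from that history, so each step reuses recent statistics instead of synchronizing on the current tensor. A simplified sketch of the idea (illustrative only, not TE's actual implementation):

```python
import torch

FP8_E4M3_MAX = 448.0

class DelayedScale:
    """Simplified model of delayed scaling: the scale comes from past amaxes."""
    def __init__(self, history_len: int = 16):
        self.amax_history = torch.zeros(history_len)
        self.step = 0

    def scale(self) -> torch.Tensor:
        # amax_compute_algo="max": reduce the history window with max()
        amax = self.amax_history.max().clamp(min=1e-12)
        return FP8_E4M3_MAX / amax

    def update(self, t: torch.Tensor):
        # Record this step's amax for use in *future* steps (hence "delayed")
        self.amax_history[self.step % len(self.amax_history)] = t.abs().max()
        self.step += 1
```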

2. FP8 Status on the RTX 4090 (Early 2026)

| Aspect | Status |
| --- | --- |
| FP8 Tensor Cores | ✅ Ada has them |
| Transformer Engine | ⚠️ Hopper-optimized; falls back on Ada |
| FP8 GEMM kernels | ✅ via cuBLASLt |
| FP8 attention | ⚠️ FlashAttention-3 is Hopper-only |
| FP8 inference (vLLM) | ✅ works well (+30-40% throughput) |
| FP8 training stability | ⚠️ OK at small scale, buggy at large scale |
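The dividing line is the compute capability: Ada reports sm_89, Hopper sm_90. A minimal sketch of a guard you could put in a training script (the helper name is mine, not a TE API):

```python
import torch

def fp8_training_advisable() -> bool:
    """Heuristic from the table above: FP8 hardware exists on Ada and Hopper,
    but the training stack is only mature on Hopper (sm_90+)."""
    if not torch.cuda.is_available():
        return False
    cc = torch.cuda.get_device_capability()
    if cc >= (9, 0):      # Hopper and newer: FP8 training is native
        return True
    if cc == (8, 9):      # Ada: FP8 GEMMs exist, training is experimental
        print("Ada detected: prefer bf16 training, use FP8 for inference")
    return False
```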
The cookbook's 2026 verdict:
  • RTX 4090 training: bf16 (default, stable); FP8 is experimental.
  • RTX 4090 inference: FP8 (vLLM supports it, with real benefit; see the sketch after this list).
  • H100 training/inference: FP8 (detailed in Part XIII).
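For the inference half of the verdict, vLLM can quantize a bf16 checkpoint to FP8 on load; a minimal sketch (the model name is just an example):

```python
from vllm import LLM, SamplingParams

# quantization="fp8" enables on-the-fly FP8 weight quantization
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Explain FP8 scaling in one sentence."], params)
print(out[0].outputs[0].text)
```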
✅ Deliverables
  1. Test TE FP8 training on the RTX 4090 with a SMALL model (a plain MLP, not an LLM); a starting point follows below.
  2. Compare vLLM FP8 inference against bf16.
  3. Next lesson: 10.7 — Int4 QLoRA NF4 + Double-Quant Internals.
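A starting point for deliverable 1, assuming transformer_engine is installed and a GPU is available (the dimensions are arbitrary, but kept as multiples of 16, which TE's FP8 GEMMs require):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

# Plain MLP built from te.Linear: small enough to debug FP8 on a 4090
mlp = torch.nn.Sequential(
    te.Linear(1024, 4096), torch.nn.GELU(), te.Linear(4096, 1024)
).cuda()
opt = torch.optim.AdamW(mlp.parameters(), lr=1e-3)
recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

for step in range(100):
    x = torch.randn(256, 1024, device="cuda")
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        y = mlp(x)
    loss = y.pow(2).mean()    # dummy objective, just to exercise backward
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 20 == 0:
        print(step, loss.item())
```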
