FP8 Training: H100 Native, Premature on RTX 4090 — Transformer Engine Internals
FP8 is the future of AI compute. The H100 supports it natively (FP8 Tensor Cores + WGMMA + Transformer Engine). The RTX 4090 (Ada) supports FP8 GEMM, but the ecosystem is not ripe: fallbacks are common and the training pipeline is buggy. Cookbook rule: bf16 training on the 4090, FP8 inference (vLLM). FP8 training on the H100 is detailed in Part XIII.
Şükrü Yusuf KAYA
26 min read
## 1. FP8 Formats
Two main FP8 formats:
- e4m3 (sign + 4 exponent + 3 mantissa): narrower range, more precision; used for the forward pass and inference
- e5m2 (sign + 5 exponent + 2 mantissa): wider range, less precision; used for training gradients
| Format | Range | Precision (eps) | Use case |
|---|---|---|---|
| fp32 | ±3.4e38 | 1.2e-7 | reference master |
| bf16 | ±3.4e38 | 7.8e-3 | training default |
| fp16 | ±65504 | 9.8e-4 | legacy training |
| fp8-e4m3 | ±448 | 0.125 | inference / forward |
| fp8-e5m2 | ±57344 | 0.25 | training gradients |
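These numbers are easy to verify: PyTorch exposes both FP8 formats as native dtypes (`torch.float8_e4m3fn` / `torch.float8_e5m2`, available since roughly PyTorch 2.1). A minimal sketch:

```python
import torch

# Check the table's range/precision numbers against PyTorch's own finfo.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    fi = torch.finfo(dtype)
    print(f"{dtype}: max={fi.max}, eps={fi.eps}, smallest_normal={fi.tiny}")

# Precision is coarse: in [1, 2) the e4m3 grid spacing is eps = 0.125,
# so intermediate values round to the nearest grid point.
x = torch.linspace(1.0, 2.0, 17)  # step 0.0625, half the grid spacing
print(x.to(torch.float8_e4m3fn).to(torch.float32))
```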
Both ranges are tiny, so a per-tensor scaling factor is mandatory. NVIDIA's Transformer Engine manages this scaling automatically; a manual sketch of the same mechanism follows the TE example below.
```python
# === Transformer Engine FP8 (H100 native) ===
import torch
import torch.nn as nn
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

# FP8 recipe: TE handles the scaling factors automatically
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,   # e4m3 forward, e5m2 backward
    amax_history_len=16,        # scaling factor history length
    amax_compute_algo="max",
)

# Model: TE layers are drop-in replacements for nn.Linear
class Llama_FP8(nn.Module):
    def __init__(self, hidden, ffn_hidden):
        super().__init__()
        self.attn_qkv = te.Linear(hidden, 3 * hidden, bias=False)
        self.attn_o = te.Linear(hidden, hidden, bias=False)
        self.ffn_gate = te.Linear(hidden, ffn_hidden, bias=False)
        self.ffn_up = te.Linear(hidden, ffn_hidden, bias=False)
        self.ffn_down = te.Linear(ffn_hidden, hidden, bias=False)

# Training loop: forward runs under fp8_autocast, backward outside it
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = model(input)
    loss = compute_loss(output)
loss.backward()
optimizer.step()

# H100: +30-40% throughput vs bf16
# RTX 4090: +10-15% vs bf16 (kernels are Hopper-optimized)
```
Transformer Engine FP8 training (H100 native)
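To make concrete what `DelayedScaling` automates, here is a minimal per-tensor scaling sketch in plain PyTorch (the helper names are illustrative, not TE API). One simplification to note: TE derives the scale from an amax history accumulated over previous steps (hence "delayed"), while this sketch uses the current tensor's amax directly.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite e4m3 value

def quantize_e4m3(x: torch.Tensor):
    # TE takes the max over a history of amax values ("delayed" scaling);
    # here we use the current tensor's amax for simplicity.
    amax = x.abs().max().clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax                  # map amax onto the format's max
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # cast after rescaling
    return x_fp8, scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) / scale       # undo the scale

# Small-magnitude tensor: without scaling, most values would land in the
# subnormal range of e4m3 and lose nearly all precision.
x = torch.randn(4, 4) * 1e-2
x_fp8, scale = quantize_e4m3(x)
print("max round-trip error:", (x - dequantize(x_fp8, scale)).abs().max().item())
```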
## 2. FP8 Status on the RTX 4090 (Early 2026)
| Aspect | Status |
|---|---|
| FP8 Tensor Cores | ✅ present on Ada |
| Transformer Engine | ⚠️ Hopper-optimized; falls back on Ada |
| FP8 GEMM kernels | ✅ via cuBLASLt |
| FP8 attention | ⚠️ Flash-Attention v3 is Hopper-only |
| FP8 inference (vLLM) | ✅ works well (+30-40% throughput) |
| FP8 training stability | ⚠️ OK at small scale, buggy at large scale |
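Whether you land on a fast path or a fallback hinges on the GPU architecture, so a runtime check is worth two lines. A minimal sketch (Ada reports compute capability 8.9, Hopper 9.0):

```python
import torch

# Ada (RTX 4090) = sm_89, Hopper (H100) = sm_90.
major, minor = torch.cuda.get_device_capability()
has_fp8_tensor_cores = (major, minor) >= (8, 9)
on_hopper_fast_path = major >= 9   # TE / Flash-Attention v3 fast paths
print(f"sm_{major}{minor}: fp8_tensor_cores={has_fp8_tensor_cores}, "
      f"hopper_fast_path={on_hopper_fast_path}")
```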
The Cookbook's 2026 verdict:
- RTX 4090 training: bf16 (default, stable); FP8 is experimental.
- RTX 4090 inference: FP8 (vLLM supports it, real benefit; see the sketch below).
- H100 training/inference: FP8 (detailed in Part XIII).
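For that inference path, vLLM can quantize a bf16 checkpoint to FP8 at load time via `quantization="fp8"`. A minimal sketch; the model name is illustrative:

```python
from vllm import LLM, SamplingParams

# Load-time FP8 quantization of a bf16 checkpoint (model name illustrative).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain e4m3 vs e5m2 in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Running the same prompt set with and without `quantization="fp8"` gives you the bf16-vs-FP8 comparison asked for in deliverable 2 below.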
✅ Deliverables
1. On the RTX 4090, test TE FP8 training on a SMALL model (a plain MLP, not an LLM).
2. Compare vLLM FP8 inference against bf16.
3. Next lesson: 10.7, Int4 QLoRA NF4 + Double-Quant Internals.