CUDA Graph Capture: Static-Shape Inference Graph + Eliminating Latency Tail
CUDA Graph is a technique for eliminating kernel launch overhead: you 'capture' a compute graph once, then 'replay' it — each replay costs roughly 5-10 µs, versus 30-50 µs of kernel launch overhead per step. This matters for inference latency, especially on the decode fast path; vLLM relies on it. The catch: it requires static tensor shapes.
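Static shapes mean every replay must see tensors of exactly the sizes recorded at capture time. A common workaround is to pad variable-length inputs up to a small set of fixed bucket lengths and capture one graph per bucket. A minimal sketch of the padding side (the bucket sizes and helper names here are illustrative, not from any library):

```python
import torch

# Hypothetical bucket sizes: one captured graph per bucket.
BUCKETS = (128, 256, 512, 1024)

def pick_bucket(seq_len: int) -> int:
    """Smallest bucket that fits the sequence; raises if none does."""
    for b in BUCKETS:
        if seq_len <= b:
            return b
    raise ValueError(f"seq_len {seq_len} exceeds largest bucket {BUCKETS[-1]}")

def pad_to_bucket(input_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Right-pad a (batch, seq) tensor to its bucket length."""
    bucket = pick_bucket(input_ids.shape[1])
    pad = bucket - input_ids.shape[1]
    return torch.nn.functional.pad(input_ids, (0, pad), value=pad_id)

x = torch.ones(1, 300, dtype=torch.long)
print(pad_to_bucket(x).shape)  # torch.Size([1, 512])
```

The trade-off is wasted compute on padding tokens versus being able to reuse a handful of captured graphs for all request lengths.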
Şükrü Yusuf KAYA
22 min read
Advanced · Python
```python
# === CUDA Graph manual capture ===
import torch

# Warmup
for _ in range(3):
    out = model(input_ids)
torch.cuda.synchronize()

# Static buffers
static_input = torch.zeros_like(input_ids)
static_output = torch.zeros_like(out)

# Capture
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output.copy_(model(static_input))

# Replay (fast path)
def fast_inference(new_input):
    static_input.copy_(new_input)
    g.replay()
    return static_output.clone()

# Bench (Llama 3.1 8B, seq=1024, decode step):
# torch.eager:        45 µs / step
# torch.compile:      32 µs / step
# CUDA graph replay:   8 µs / step
```

Manual CUDA Graph capture
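PyTorch also ships a higher-level wrapper, `torch.cuda.make_graphed_callables`, which handles the warmup and static-buffer bookkeeping for you. A sketch with a graceful eager fallback (the tiny `Linear` model and helper name are illustrative; actual capture needs a CUDA device):

```python
import torch

def graphed_forward(model, sample):
    """Return a CUDA-graph-backed callable if a GPU is present, else the model itself."""
    if torch.cuda.is_available():
        # Captures the forward pass into a CUDA graph; the returned
        # callable replays it with internally managed static buffers.
        return torch.cuda.make_graphed_callables(model, (sample,))
    return model  # eager fallback on CPU-only machines

model = torch.nn.Linear(64, 64)
sample = torch.randn(8, 64)
fast = graphed_forward(model, sample)
print(fast(sample).shape)  # torch.Size([8, 64])
```

The manual pattern above gives you more control (e.g. over which buffers are shared across graphs); the wrapper is convenient when you just need one callable graphed.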
✅ Deliverable
1) Test CUDA Graph capture-replay on a mini model. 2) Next lesson: 13.8 — Speculative Decoding FT (EAGLE-2 + MEDUSA).