CUDA Graph Capture: Static-Shape Inference Graphs and Ending the Latency Tail

CUDA Graphs are a technique for eliminating kernel launch overhead. You capture a compute graph once, then replay it; each replay costs roughly 5-10 µs, compared with 30-50 µs for ordinary kernel launches. This matters for inference latency, especially on the token-by-token decode fast path, and vLLM uses it. The catch is that a captured graph requires static shapes: if a tensor shape changes, the graph must be re-captured.

Şükrü Yusuf KAYA

22 min read

14.05.2026

Advanced


python

# === CUDA Graph manual capture ===
# Assumes `model` (eval mode, on the GPU) and `input_ids` (a CUDA tensor) already exist.
import torch
 
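# Warmup on a side stream, as the PyTorch CUDA Graphs notes recommend before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        out = model(input_ids)
torch.cuda.current_stream().wait_stream(s)
torch.cuda.synchronize()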
 
# Static buffers
static_input = torch.zeros_like(input_ids)
static_output = torch.zeros_like(out)
 
# Capture (no_grad: inference only, so no autograd graph is built during capture)
g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_output.copy_(model(static_input))
 
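# Replay (fast path): new_input must match static_input's shape and dtype
def fast_inference(new_input):
    static_input.copy_(new_input)   # refill the captured input buffer in place
    g.replay()                      # relaunch the whole captured graph
    return static_output.clone()    # clone so the next replay cannot overwrite the result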
 
# Bench (Llama 3.1 8B, seq=1024, decode step):
# torch.eager:        45 µs / step
# torch.compile:      32 µs / step
# CUDA graph replay:  8 µs / step
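
Because a captured graph is tied to one input shape, a serving loop needs some way to avoid re-capturing on every shape change. One common approach is to capture a graph per batch-size bucket and pad each request up to the nearest bucket. The sketch below shows only the bookkeeping side of that idea, under stated assumptions: `GraphBucketCache`, `bucket_for`, `get`, and the `capture_fn` callback are hypothetical names for illustration, not part of PyTorch or vLLM, and the bucket sizes are arbitrary.

```python
import bisect
from typing import Any, Callable, Dict, List, Optional, Tuple


class GraphBucketCache:
    """Hypothetical helper: one captured graph per batch-size bucket.

    `capture_fn(bucket)` is assumed to run the warmup + capture steps for a
    static input of batch size `bucket` and return whatever object is needed
    to replay it (for example a graph plus its static buffers).
    """

    def __init__(self, bucket_sizes: List[int], capture_fn: Callable[[int], Any]):
        self.bucket_sizes = sorted(bucket_sizes)
        self.capture_fn = capture_fn
        self.graphs: Dict[int, Any] = {}

    def bucket_for(self, batch_size: int) -> Optional[int]:
        """Smallest bucket that can hold `batch_size`, or None if none fits."""
        i = bisect.bisect_left(self.bucket_sizes, batch_size)
        return self.bucket_sizes[i] if i < len(self.bucket_sizes) else None

    def get(self, batch_size: int) -> Tuple[Optional[int], Any]:
        """Return (bucket, graph), capturing lazily on first use.

        Returns (None, None) when the batch exceeds every bucket; the caller
        is then expected to fall back to the ordinary eager path.
        """
        bucket = self.bucket_for(batch_size)
        if bucket is None:
            return None, None
        if bucket not in self.graphs:
            self.graphs[bucket] = self.capture_fn(bucket)
        return bucket, self.graphs[bucket]


# Usage with a stand-in capture function that only records its calls.
captured: List[int] = []

def fake_capture(bucket: int) -> str:
    captured.append(bucket)
    return f"graph-for-batch-{bucket}"

cache = GraphBucketCache([1, 2, 4, 8], fake_capture)
first = cache.get(3)    # padded up to bucket 4, triggers a capture
second = cache.get(4)   # same bucket, reuses the captured graph
too_big = cache.get(9)  # no bucket fits, caller falls back to eager
```

In a real setup, `capture_fn` would wrap the manual capture listing above for a static input of the given batch size, and the caller would copy the real rows into the first `batch_size` rows of the static buffer before `replay()`. The trade-off is memory and startup time for each extra bucket against wasted compute on padding rows.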