Liger Kernel Tour: RMSNorm + SwiGLU + GeGLU + Fused Linear+CE — Source Reading
Liger Kernel (LinkedIn, 2024) is a production-grade Triton kernel suite: fused RMSNorm, fused SwiGLU/GeGLU/GeLU activations, fused RoPE, fused linear + cross-entropy (the big memory saver), and a chunked CrossEntropy kernel. On an RTX 4090, Llama 3.1 8B fine-tuning throughput improves by about +20% and peak memory drops about -30%. This lesson is a source reading of production Triton patterns.
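To ground the source reading, here is a minimal sketch of the fused-RMSNorm pattern Liger implements in Triton: one program per row, squares summed in registers, and normalize + scale fused so no intermediate tensor is written back to HBM. This is illustrative only (names like rmsnorm_fwd are mine, and the kernel assumes a contiguous 2-D input); Liger's real kernel also fuses the backward pass and handles strides, offsets, and dtype casting.

# === Sketch: fused RMSNorm forward in Triton (illustrative, not Liger's code) ===
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_fwd(X, W, Y, N, eps, BLOCK_N: tl.constexpr):
    # One program instance normalizes one row of a contiguous [M, N] matrix.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_N)
    mask = cols < N
    x = tl.load(X + row * N + cols, mask=mask, other=0.0).to(tl.float32)
    # Fused path: mean of squares -> inverse RMS -> scale by weight,
    # all in registers; nothing intermediate touches global memory.
    inv_rms = 1.0 / tl.sqrt(tl.sum(x * x, axis=0) / N + eps)
    w = tl.load(W + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(Y + row * N + cols, x * inv_rms * w, mask=mask)  # casts on store

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    assert x.ndim == 2 and x.is_contiguous()
    M, N = x.shape
    y = torch.empty_like(x)
    rmsnorm_fwd[(M,)](x, weight, y, N, eps, BLOCK_N=triton.next_power_of_2(N))
    return y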
Şükrü Yusuf KAYA
26 min read
Advanced · Python
# === Liger Kernel usage: Llama 3.1 8B FT ===
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# Apply Liger replacements
apply_liger_kernel_to_llama(
    rope=True,                       # Fused RoPE
    cross_entropy=False,             # Skip if not using lm_head CE
    fused_linear_cross_entropy=True, # Fused linear + CE (huge memory save)
    rms_norm=True,                   # Fused RMSNorm
    swiglu=True,                     # Fused SwiGLU
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
)
# The model now uses Liger kernels everywhere (drop-in replacement)

# Bench (RTX 4090 + Llama 3.1 8B QLoRA):
# Vanilla HF + FA2:       1.78 step/s, peak 13.4 GB
# + Liger Kernel:         2.14 step/s (+20%), peak 9.5 GB (-29%)
# + Unsloth (everything): 3.10 step/s, peak 11.8 GB

Liger Kernel + Llama 8B FT
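Why fused_linear_cross_entropy is the biggest win: with Llama 3.1's 128,256-token vocabulary, the logits for a 4,096-token micro-batch alone are 4096 × 128256 values, roughly 1 GB in bf16 before gradients and fp32 upcasts. Liger fuses the lm_head matmul with the loss and walks the tokens in chunks, so the full logits tensor is never materialized. Below is a hedged pure-PyTorch sketch of just the chunking idea; Liger implements this as a single Triton-backed autograd function, and chunked_linear_ce is an illustrative name, not Liger's API.

# === Sketch: chunked linear + cross-entropy (illustrative, not Liger's code) ===
import torch
import torch.nn.functional as F

def chunked_linear_ce(hidden, lm_head_w, labels, chunk=1024):
    # hidden: [tokens, hidden_dim], lm_head_w: [vocab, hidden_dim], labels: [tokens]
    loss_sum = hidden.new_zeros((), dtype=torch.float32)
    n_valid = 0
    for i in range(0, hidden.shape[0], chunk):
        # Only a [chunk, vocab] slice of logits is alive at any moment.
        logits = hidden[i:i + chunk] @ lm_head_w.T
        loss_sum = loss_sum + F.cross_entropy(
            logits.float(), labels[i:i + chunk],
            reduction="sum", ignore_index=-100)
        n_valid += int((labels[i:i + chunk] != -100).sum())
    return loss_sum / max(n_valid, 1)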
✅ Deliverables
1) Install Liger Kernel. 2) Benchmark your Llama 8B fine-tune with Liger on vs. off (a measurement sketch follows below). 3) Next lesson: 13.5, PagedAttention (vLLM).
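For deliverable 2, a minimal A/B timing sketch. It assumes a train_step() closure (a hypothetical helper: wrap one forward + backward + optimizer step of your own loop in it), run once with Liger applied and once without.

# === Sketch: step/s and peak-VRAM benchmark (assumes your own train_step) ===
import time
import torch

def bench(train_step, warmup: int = 3, iters: int = 20) -> None:
    for _ in range(warmup):          # let Triton JIT / cuDNN autotune settle
        train_step()
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    t0 = time.perf_counter()
    for _ in range(iters):
        train_step()
    torch.cuda.synchronize()         # make GPU time visible to the host clock
    dt = time.perf_counter() - t0
    peak_gb = torch.cuda.max_memory_allocated() / 2**30
    print(f"{iters / dt:.2f} step/s, peak {peak_gb:.1f} GB")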