
Llama 3.3 70B QLoRA + FSDP: 8×H100 SXM Recipe (5.6h 1 Epoch)

Full lab recipe for Llama 3.3 70B-Instruct: 8×H100 SXM cloud (Lambda, $24/h), QLoRA NF4 + FSDP FULL_SHARD, bitsandbytes 4-bit, gradient checkpointing, paged AdamW. 50K-sample Turkish Alpaca, 1 epoch in 5.6 h. TR-MMLU: 55.4 base → 60.8.

Şükrü Yusuf KAYA
38 min read
Advanced

1. Cost & Time

| Scenario (hardware) | Hourly rate | Cost (1 epoch) | Time (1 epoch) |
| --- | --- | --- | --- |
| 8×H100 SXM (Lambda on-demand) | $23.92/h | $134 | 5.6 h |
| 8×H100 SXM (Lambda 1-yr reserve) | $15.92/h | $89 | 5.6 h |
| 8×A100 80GB | ~$10/h | $70 | 8.3 h |
| 4×H100 PCIe 80GB | $12/h | $156 | 13 h |
| RTX 4090 + CPU offload | ₺1.75/h electricity | ₺87 | 50+ h |
Cookbook recommendation: 8×H100 SXM Lambda on-demand → 1 epoch in 5.6 hours for $134. It is not cheap, but there is no way around it: that is the structural cost of fine-tuning a 70B model.
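The totals in the table are simply hourly rate × wall-clock hours; a quick sanity check, using the table's own figures:

```python
# Cost sanity check: total = hourly rate (USD) x wall-clock hours (figures from the table above)
scenarios = {
    "8xH100 SXM (Lambda on-demand)":    (23.92, 5.6),   # -> ~$134
    "8xH100 SXM (Lambda 1-yr reserve)": (15.92, 5.6),   # -> ~$89
    "4xH100 PCIe 80GB":                 (12.00, 13.0),  # -> $156
}
for name, (rate, hours) in scenarios.items():
    print(f"{name}: ${rate * hours:.0f} per epoch")
```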
```python
# === Llama 3.3 70B QLoRA + FSDP — 8×H100 SXM ===
# Cluster setup: 1 node × 8 H100 SXM
# Run: torchrun --nproc_per_node=8 train.py
 
import os, functools, torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision, ShardingStrategy, BackwardPrefetch,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
 
# Distributed init
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
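# torchrun launches one process per GPU and sets LOCAL_RANK / RANK / WORLD_SIZE,
# so each process here binds to exactly one of the 8 H100s.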
 
# 1. NF4 quant config
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # for FSDP compatibility
)
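# Storing the 4-bit weights in bf16 containers (bnb_4bit_quant_storage) lets FSDP
# flatten and shard them together with the other bf16 parameters; with the default
# storage dtype the mixed dtypes break FULL_SHARD flattening.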
 
# 2. Model — quantized 70B
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=bnb,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)
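# prepare_model_for_kbit_training freezes the 4-bit base, upcasts the norm layers
# to fp32 for stability, and enables input gradients so gradient checkpointing
# works even though the base weights are frozen.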
 
# 3. LoRA — r=16 is enough for 70B
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
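# With r=16 on all seven projection matrices the adapter adds roughly 0.2B
# trainable parameters, well under 1% of the frozen 70B base.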
 
# 4. FSDP wrap
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    buffer_dtype=torch.bfloat16,
)

# transformer_auto_wrap_policy must be bound with functools.partial before being
# handed to FSDP; it is not called directly with transformer_layer_cls.
auto_wrap = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=mp_policy,
    auto_wrap_policy=auto_wrap,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    use_orig_params=True,
    device_id=local_rank,
    sync_module_states=True,
    forward_prefetch=True,
    limit_all_gathers=True,
)
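# FULL_SHARD splits parameters, gradients and optimizer state across all 8 ranks:
# the ~35 GB of NF4 weights shard to roughly 4-5 GB per GPU; the rest of the ~42 GB
# peak is activations, LoRA grads/optimizer state and transient all-gathers.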
 
# 5. Dataset
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
tok.pad_token = tok.eos_token
dataset = load_dataset("malhajar/alpaca-gpt4-tr", split="train")
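# Note: dataset_text_field="text" below assumes the dataset exposes a ready-made
# "text" column; if it ships raw instruction/input/output columns instead, map
# them into a single prompt string first.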
 
# 6. SFT
cfg = SFTConfig(
    output_dir="llama-3.3-70b-tr-qlora",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective 64
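    # 1 sample/GPU × 8 GPUs × 8 accumulation steps = 64 sequences per optimizer step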
    learning_rate=1e-4,
    warmup_ratio=0.03, lr_scheduler_type="cosine",
    bf16=True, optim="paged_adamw_8bit",
    max_seq_length=4096, packing=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
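    # non-reentrant checkpointing (use_reentrant=False) is the variant that plays
    # well with FSDP and with the frozen 4-bit base parameters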
    dataset_text_field="text",
    logging_steps=5, save_steps=100,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
    report_to="wandb",
)
 
trainer = SFTTrainer(model=model, tokenizer=tok, train_dataset=dataset, args=cfg)
trainer.train()
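# Checkpoints land in output_dir every save_steps; only the LoRA adapter is
# trainable, so each checkpoint is a few hundred MB rather than the full 70B.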
 
# Benchmarks (8×H100 SXM, measured):
# - 1.4 steps/s
# - 5.6 h for 1 epoch (50K samples)
# - Peak per-GPU memory: 42 GB
# - TR-MMLU: 55.4 base → 60.8 post-FT (+5.4)
# - Cost: $134 on-demand, $89 reserve
```
Llama 3.3 70B QLoRA + FSDP — full 8×H100 SXM recipe
✅ Deliverable
  1. If you have cloud access, do a mini-run on 8×H100 (1000 samples, 100 steps); see the sketch below.
  2. Measure throughput and peak memory.
  3. Next lesson: 4.8 — Qwen 2.5 32B / 72B Math + Code Mastery.
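A minimal sketch of that mini-run, assuming the train.py above is reused with only the dataset size and step count changed; max_steps overrides num_train_epochs, and torch.cuda reports the peak memory the deliverable asks for:

```python
# Mini-run: assumes model, tok and dataset from steps 1-5 of the recipe above
import time
import torch
from trl import SFTTrainer, SFTConfig

mini_ds = dataset.shuffle(seed=42).select(range(1000))   # 1000-sample subset

mini_cfg = SFTConfig(
    output_dir="llama-3.3-70b-tr-qlora-minirun",
    max_steps=100,                        # overrides num_train_epochs
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True, optim="paged_adamw_8bit",
    max_seq_length=4096, packing=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    logging_steps=5,
    report_to="none",
)

torch.cuda.reset_peak_memory_stats()
t0 = time.time()
SFTTrainer(model=model, tokenizer=tok, train_dataset=mini_ds, args=mini_cfg).train()
elapsed = time.time() - t0

print(f"throughput: {100 / elapsed:.2f} steps/s (rough, includes warmup)")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```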
