Llama 3.3 70B QLoRA + FSDP: The 8×H100 SXM Recipe (1 Epoch in 5.6 Hours)
The complete Lab recipe for Llama 3.3 70B-Instruct: 8×H100 SXM in the cloud (Lambda, $24/hr), QLoRA NF4 + FSDP FULL_SHARD, bitsandbytes 4-bit, gradient checkpointing, paged AdamW. One epoch over 50K TR Alpaca samples in 5.6 hours. TR-MMLU: 55.4 baseline → 60.8 after fine-tuning.
Şükrü Yusuf KAYA
38-minute read
1. Cost & Duration
| Scenario | Hourly rate | Total cost | Time (1 epoch) |
|---|---|---|---|
| 8×H100 SXM (Lambda on-demand) | $23.92/hr | $134 | 5.6 h |
| 8×H100 SXM (Lambda 1-yr reserve) | $15.92/hr | $89 | 5.6 h |
| 8×A100 80GB | ~$10/hr | $70 | 8.3 h |
| 4×H100 PCIe 80GB | $12/hr | $156 | 13 h |
| RTX 4090 + CPU offload | ₺1.75/hr electricity | ₺87 | 50+ h |
Cookbook recommendation: 8×H100 SXM on Lambda on-demand, giving 1 epoch in 5.6 hours for $134. It isn't cheap, but there is no way around it: this is the structural cost of fine-tuning a 70B model.
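The totals in the table are simply hourly rate × duration, as a quick check confirms (a throwaway sketch; the A100 row is left out because its ~$10/hr rate is only approximate):

```python
# Sanity-check the table: total cost = hourly rate * hours per epoch.
scenarios = {
    "8×H100 SXM (on-demand)":    (23.92, 5.6),   # ($/hr, hours per epoch)
    "8×H100 SXM (1-yr reserve)": (15.92, 5.6),
    "4×H100 PCIe 80GB":          (12.00, 13.0),
}
for name, (rate, hours) in scenarios.items():
    print(f"{name}: ${rate * hours:.0f}")
# 8×H100 SXM (on-demand): $134
# 8×H100 SXM (1-yr reserve): $89
# 4×H100 PCIe 80GB: $156
```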
```python
# === Llama 3.3 70B QLoRA + FSDP - 8×H100 SXM ===
# Cluster setup: 1 node × 8 H100 SXM
# Run: torchrun --nproc_per_node=8 train.py

import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
    BackwardPrefetch,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Distributed init
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 1. NF4 quantization config
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # required for FSDP compatibility
)

# 2. Model - the quantized 70B
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=bnb,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)

# 3. LoRA - r=16 is enough for a 70B model
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# 4. FSDP wrap
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,  # reduce gradients in fp32 for stability
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=mp_policy,
    # transformer_auto_wrap_policy is a callback FSDP invokes per module;
    # bind transformer_layer_cls via functools.partial rather than calling it directly.
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    ),
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    use_orig_params=True,
    device_id=local_rank,
    sync_module_states=True,
    forward_prefetch=True,
    limit_all_gathers=True,
)

# 5. Dataset
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
tok.pad_token = tok.eos_token
dataset = load_dataset("malhajar/alpaca-gpt4-tr", split="train")

# 6. SFT
# Note: the model is already FSDP-wrapped above, so the Trainer-side
# fsdp="full_shard auto_wrap" / fsdp_transformer_layer_cls_to_wrap settings
# are omitted here; passing both would wrap the model a second time.
cfg = SFTConfig(
    output_dir="llama-3.3-70b-tr-qlora",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch 64 (1 × 8 accum × 8 GPUs)
    learning_rate=1e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    optim="paged_adamw_8bit",
    max_seq_length=4096,
    packing=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    dataset_text_field="text",
    logging_steps=5,
    save_steps=100,
    report_to="wandb",
)

trainer = SFTTrainer(model=model, tokenizer=tok, train_dataset=dataset, args=cfg)
trainer.train()

# Bench (8×H100 SXM, measured):
# - 1.4 step/s
# - 5.6 hours per epoch (50K samples)
# - Peak per-GPU memory: 42 GB
# - TR-MMLU: 55.4 base → 60.8 post-FT (+5.4)
# - Cost: $134 on-demand, $89 reserve
```
Llama 3.3 70B QLoRA + FSDP on 8×H100 SXM: the full recipe
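The 42 GB peak in the bench comment can be reproduced from PyTorch's allocator counters right after `trainer.train()`. A minimal sketch; the cross-rank `all_reduce` is my addition so that rank 0 reports the worst GPU rather than only its own, and note that `max_memory_allocated` tracks tensor allocations, so it can read slightly below what nvidia-smi shows:

```python
# Report the per-GPU memory peak after training (cf. the 42 GB bench figure).
peak = torch.tensor(
    torch.cuda.max_memory_allocated(local_rank) / 1024**3,
    device=f"cuda:{local_rank}",
)
dist.all_reduce(peak, op=dist.ReduceOp.MAX)  # worst-case GPU across all 8 ranks
if local_rank == 0:
    print(f"peak per-GPU memory: {peak.item():.1f} GB")  # ~42 GB expected
```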
✅ Deliverables
1. If you have cloud access, do a mini-run on 8×H100 (1000 samples, 100 steps); see the sketch below.
2. Measure throughput and the per-GPU memory peak.
3. Next lesson: 4.8, Qwen 2.5 32B / 72B Math + Code Mastery.
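For the mini-run, only the dataset slice and the step budget change relative to the full recipe; the model, LoRA, and FSDP setup stay exactly as above. A sketch under those assumptions (the `-mini` output name is hypothetical; the ~1.4 step/s expectation is the bench figure from the recipe):

```python
# Mini-run deltas vs. the full recipe; reuse model, tok, local_rank from above.
dataset = load_dataset("malhajar/alpaca-gpt4-tr", split="train[:1000]")  # 1000 samples

cfg = SFTConfig(
    output_dir="llama-3.3-70b-tr-qlora-mini",  # hypothetical name
    max_steps=100,                    # hard step cap; overrides the epoch count
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    bf16=True,
    optim="paged_adamw_8bit",
    max_seq_length=4096,
    packing=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    dataset_text_field="text",
    logging_steps=5,
    report_to="none",                 # skip wandb for a smoke test
)

trainer = SFTTrainer(model=model, tokenizer=tok, train_dataset=dataset, args=cfg)
result = trainer.train()

if local_rank == 0:
    # The Trainer's speed metrics cover deliverable #2 directly.
    print(f"throughput: {result.metrics['train_steps_per_second']:.2f} step/s")  # ~1.4 expected
```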