Multi-Node Run + Fault-Tolerant Training: 2 Node × 8 H100 NCCL Cluster
The reality of cluster training: nodes fail, NCCL hangs, checkpoints get corrupted. The Cookbook's fault-tolerant recipe: NCCL_TIMEOUT, a watchdog, signal handling (SIGUSR1), an elastic launcher, and graceful preemption resume. A survival kit for a 2-day 70B training run.
Şükrü Yusuf KAYA
32 min read
1. Cluster Failure Patterns
| Failure | Symptom | Mitigation |
|---|---|---|
| Node OOM kill | Entire run hangs, NCCL timeout | Keep 20%+ memory headroom |
| GPU thermal throttle | Throughput drops suddenly | Cooling check, nvidia-smi loop (see the monitor sketch below) |
| NCCL bandwidth degradation | Step time increases 2-3x | Check IB topology |
| Disk full (logs/ckpt) | Checkpoint saves fail | Monitor NFS quota |
| Network partition | Some ranks hang | Retry / elastic launcher |
| Power outage | Entire cluster down | Cloud UPS / cold-start plan |
| Spot preemption | 30-second notice, then kill | Graceful checkpoint |
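The cooling check and the disk-quota check are the cheapest rows to automate. Below is a minimal side-car monitor sketch (not part of the Cookbook framework): it polls nvidia-smi for GPU temperature and checks free space under a checkpoint mount. The temperature threshold, poll interval, and `CKPT_DIR` path are illustrative placeholders; only `nvidia-smi` on PATH is assumed.

```python
# Minimal side-car monitor sketch (illustrative thresholds; adapt to your cluster).
# Assumes nvidia-smi is on PATH; CKPT_DIR is a hypothetical NFS mount.
import shutil, subprocess, time

CKPT_DIR = "/mnt/ckpt"       # placeholder path, change to your checkpoint mount
TEMP_LIMIT_C = 85            # illustrative warning margin, pick your own
DISK_MIN_FREE_GB = 200

def gpu_temps():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(t) for t in out.split()]

while True:
    hot = [i for i, t in enumerate(gpu_temps()) if t >= TEMP_LIMIT_C]
    if hot:
        print(f"[monitor] thermal warning on GPUs {hot}, expect throttling")
    free_gb = shutil.disk_usage(CKPT_DIR).free / 1e9
    if free_gb < DISK_MIN_FREE_GB:
        print(f"[monitor] only {free_gb:.0f} GB free under {CKPT_DIR}, checkpoint saves may fail")
    time.sleep(60)
```

Run one instance per node alongside the training job; its output in the node logs is usually enough to tell a thermal or disk problem apart from an NCCL one.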
The Cookbook's rule: a checkpoint every 100 steps + a graceful preemption handler + an NCCL watchdog.
```python
# === Fault-tolerant training framework ===
import os, signal, shutil, time
import torch
import torch.distributed as dist
from pathlib import Path

class GracefulPreemption:
    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGUSR1, self._handler)  # Slurm preempt
        signal.signal(signal.SIGTERM, self._handler)  # Docker/RunPod kill
        signal.signal(signal.SIGINT, self._handler)   # Ctrl+C

    def _handler(self, signum, frame):
        print(f"[preempt] signal {signum} received, flushing")
        self.requested = True

class NCCLWatchdog:
    """Monitor NCCL; abort if inactive for 10 min."""
    def __init__(self, threshold_min=10):
        self.last_step = time.time()
        self.threshold = threshold_min * 60
        self.thread = None  # placeholder for a background thread that calls check()

    def heartbeat(self):
        self.last_step = time.time()

    def check(self):
        if time.time() - self.last_step > self.threshold:
            print(f"[watchdog] NCCL inactive {self.threshold/60} min — aborting")
            os._exit(1)  # force exit, all ranks killed

# Setup
preempt = GracefulPreemption()
watchdog = NCCLWatchdog(threshold_min=10)

# NCCL config
os.environ["NCCL_TIMEOUT"] = "3600"  # 1 hour timeout (default 30 min)
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"

# Training loop with fault-tolerance.
# model, optimizer, loader_iter, train_step, total_steps are assumed to be
# defined by the surrounding training script.
ckpt_dir = Path("ckpt")
last_ckpt = ckpt_dir / "latest"

if last_ckpt.exists():
    print(f"[resume] loading from {last_ckpt}")
    state = torch.load(last_ckpt / "state.pt", map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    start_step = state["step"]
else:
    start_step = 0

for step in range(start_step, total_steps):
    try:
        batch = next(loader_iter)
        loss = train_step(model, batch, optimizer)
        watchdog.heartbeat()

        if step % 100 == 0 or preempt.requested:
            # Atomic checkpoint: write to a staging dir, then rename into place
            staging = ckpt_dir / f"step-{step:08d}.partial"
            staging.mkdir(parents=True, exist_ok=True)
            if dist.get_rank() == 0:
                torch.save({
                    "step": step,
                    "model": model.state_dict(),
                    "optim": optimizer.state_dict(),
                }, staging / "state.pt")
            dist.barrier()
            if dist.get_rank() == 0:
                final = ckpt_dir / f"step-{step:08d}"
                if final.exists():
                    shutil.rmtree(final)
                os.rename(staging, final)
                # Update "latest" symlink
                if last_ckpt.exists() or last_ckpt.is_symlink():
                    last_ckpt.unlink()
                last_ckpt.symlink_to(final.relative_to(ckpt_dir))
            dist.barrier()

            if preempt.requested:
                print(f"[preempt] saved at step {step}, exiting cleanly")
                break

    except Exception as e:
        print(f"[error] step {step}: {e}")
        # Cluster-wide abort + Slurm requeue
        os._exit(1)
```

fault-tolerant training framework — preempt + watchdog + atomic ckpt
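One gap in the snippet above: `NCCLWatchdog` stores a `self.thread` handle, but nothing ever calls `check()`. A minimal way to wire it up, assuming the class as defined in the framework above, is a daemon thread that polls the heartbeat:

```python
# Sketch: drive NCCLWatchdog.check() from a daemon thread so a hung NCCL
# collective is still detected while the main thread is blocked in it.
# Assumes the NCCLWatchdog instance from the framework above.
import threading
import time

def start_watchdog_thread(watchdog, poll_sec=30):
    def _loop():
        while True:
            watchdog.check()          # calls os._exit(1) if the heartbeat is stale
            time.sleep(poll_sec)
    t = threading.Thread(target=_loop, daemon=True, name="nccl-watchdog")
    t.start()
    watchdog.thread = t
    return t

# usage: start_watchdog_thread(watchdog) right after creating the watchdog,
# then keep calling watchdog.heartbeat() at every training step.
```

A daemon thread keeps polling even while the main thread is stuck inside a hung collective, which is exactly the failure mode the watchdog exists to catch.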
✅ Part IV complete
1. Integrate the framework above into your own train.py.
2. Test the graceful-save path by manually sending SIGUSR1 (see the sketch below).
3. Next Part: Part V — MoE Internals & Fine-Tuning: Mixtral, DeepSeek-V3, Qwen3-MoE.
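For exercise 2, a minimal sketch of the manual preemption test, run from a separate shell on the training node (assumes your script is named train.py, as in exercise 1):

```python
# Sketch for exercise 2: send SIGUSR1 to a running train.py to exercise the
# graceful-save path. Run from a separate shell on the same node; pgrep -f
# matches the training processes by command line.
import os, signal, subprocess

pids = subprocess.run(["pgrep", "-f", "train.py"],
                      capture_output=True, text=True).stdout.split()
for pid in pids:
    os.kill(int(pid), signal.SIGUSR1)   # GracefulPreemption._handler sets requested=True
    print(f"sent SIGUSR1 to {pid}")
# Expected result: "[preempt] saved at step N, exiting cleanly" in the training
# log, and a fresh ckpt/latest symlink on disk.
```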