Multi-Node Run + Fault-Tolerant Training: 2 Node × 8 H100 NCCL Cluster
The reality of cluster training: nodes fail, NCCL hangs, checkpoints get corrupted. The Cookbook's fault-tolerant recipe: NCCL_TIMEOUT, a watchdog, signal handling (SIGUSR1), an elastic launcher, and graceful preemption resume. A survival kit for a 2-day 70B training run.
Şükrü Yusuf KAYA
32 min read
1. Cluster Failure Patterns
| Failure | Symptom | Mitigation |
|---|---|---|
| Node OOM kill | Entire run hangs, NCCL timeout | Keep 20%+ memory headroom |
| GPU thermal throttle | Throughput drops suddenly | Cooling check, nvidia-smi loop (see the monitor sketch below) |
| NCCL bandwidth degradation | Step time increases 2-3x | Check IB topology |
| Disk full (logs/ckpt) | Checkpoint saves fail | Monitor NFS quota |
| Network partition | Some ranks hang | Retry / elastic launcher |
| Power outage | Entire cluster down | Cloud UPS / cold-start plan |
| Spot preemption | 30-second notice, then kill | Graceful checkpoint |
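The cooling check and the disk-quota check are the cheapest rows to automate. Below is a minimal side-car monitor sketch (not part of the Cookbook framework): it polls nvidia-smi for GPU temperature and checks free space under a checkpoint mount. The temperature threshold, poll interval, and `CKPT_DIR` path are illustrative placeholders; only `nvidia-smi` on PATH is assumed.

```python
# Minimal side-car monitor sketch (illustrative thresholds; adapt to your cluster).
# Assumes nvidia-smi is on PATH; CKPT_DIR is a hypothetical NFS mount.
import shutil, subprocess, time

CKPT_DIR = "/mnt/ckpt"       # placeholder path, change to your checkpoint mount
TEMP_LIMIT_C = 85            # illustrative warning margin, pick your own
DISK_MIN_FREE_GB = 200

def gpu_temps():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(t) for t in out.split()]

while True:
    hot = [i for i, t in enumerate(gpu_temps()) if t >= TEMP_LIMIT_C]
    if hot:
        print(f"[monitor] thermal warning on GPUs {hot}, expect throttling")
    free_gb = shutil.disk_usage(CKPT_DIR).free / 1e9
    if free_gb < DISK_MIN_FREE_GB:
        print(f"[monitor] only {free_gb:.0f} GB free under {CKPT_DIR}, checkpoint saves may fail")
    time.sleep(60)
```

Run one instance per node alongside the training job; its output in the node logs is usually enough to tell a thermal or disk problem apart from an NCCL one.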
The Cookbook's rule: a checkpoint every 100 steps + a graceful preemption handler + an NCCL watchdog.
```python
# === Fault-tolerant training framework ===
import os, signal, shutil, time
import torch
import torch.distributed as dist
from pathlib import Path

class GracefulPreemption:
    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGUSR1, self._handler)  # Slurm preempt
        signal.signal(signal.SIGTERM, self._handler)  # Docker/RunPod kill
        signal.signal(signal.SIGINT, self._handler)   # Ctrl+C

    def _handler(self, signum, frame):
        print(f"[preempt] signal {signum} received, flushing")
        self.requested = True

class NCCLWatchdog:
    """Monitor NCCL; abort if inactive for 10 min."""
    def __init__(self, threshold_min=10):
        self.last_step = time.time()
        self.threshold = threshold_min * 60
        self.thread = None  # placeholder for a background thread that calls check()

    def heartbeat(self):
        self.last_step = time.time()

    def check(self):
        if time.time() - self.last_step > self.threshold:
            print(f"[watchdog] NCCL inactive {self.threshold/60} min — aborting")
            os._exit(1)  # force exit, all ranks killed

# Setup
preempt = GracefulPreemption()
watchdog = NCCLWatchdog(threshold_min=10)

# NCCL config
os.environ["NCCL_TIMEOUT"] = "3600"  # 1 hour timeout (default 30 min)
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"

# Training loop with fault-tolerance.
# model, optimizer, loader_iter, train_step, total_steps are assumed to be
# defined by the surrounding training script.
ckpt_dir = Path("ckpt")
last_ckpt = ckpt_dir / "latest"

if last_ckpt.exists():
    print(f"[resume] loading from {last_ckpt}")
    state = torch.load(last_ckpt / "state.pt", map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    start_step = state["step"]
else:
    start_step = 0

for step in range(start_step, total_steps):
    try:
        batch = next(loader_iter)
        loss = train_step(model, batch, optimizer)
        watchdog.heartbeat()

        if step % 100 == 0 or preempt.requested:
            # Atomic checkpoint: write to a staging dir, then rename into place
            staging = ckpt_dir / f"step-{step:08d}.partial"
            staging.mkdir(parents=True, exist_ok=True)
            if dist.get_rank() == 0:
                torch.save({
                    "step": step,
                    "model": model.state_dict(),
                    "optim": optimizer.state_dict(),
                }, staging / "state.pt")
            dist.barrier()
            if dist.get_rank() == 0:
                final = ckpt_dir / f"step-{step:08d}"
                if final.exists():
                    shutil.rmtree(final)
                os.rename(staging, final)
                # Update "latest" symlink
                if last_ckpt.exists() or last_ckpt.is_symlink():
                    last_ckpt.unlink()
                last_ckpt.symlink_to(final.relative_to(ckpt_dir))
            dist.barrier()

            if preempt.requested:
                print(f"[preempt] saved at step {step}, exiting cleanly")
                break

    except Exception as e:
        print(f"[error] step {step}: {e}")
        # Cluster-wide abort + Slurm requeue
        os._exit(1)
```

fault-tolerant training framework — preempt + watchdog + atomic ckpt
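One gap in the snippet above: `NCCLWatchdog` stores a `self.thread` handle, but nothing ever calls `check()`. A minimal way to wire it up, assuming the class as defined in the framework above, is a daemon thread that polls the heartbeat:

```python
# Sketch: drive NCCLWatchdog.check() from a daemon thread so a hung NCCL
# collective is still detected while the main thread is blocked in it.
# Assumes the NCCLWatchdog instance from the framework above.
import threading
import time

def start_watchdog_thread(watchdog, poll_sec=30):
    def _loop():
        while True:
            watchdog.check()          # calls os._exit(1) if the heartbeat is stale
            time.sleep(poll_sec)
    t = threading.Thread(target=_loop, daemon=True, name="nccl-watchdog")
    t.start()
    watchdog.thread = t
    return t

# usage: start_watchdog_thread(watchdog) right after creating the watchdog,
# then keep calling watchdog.heartbeat() at every training step.
```

A daemon thread keeps polling even while the main thread is stuck inside a hung collective, which is exactly the failure mode the watchdog exists to catch.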
✅ Part IV complete
1. Integrate the framework above into your own train.py.
2. Test the graceful-save path by manually sending SIGUSR1 (see the sketch below).
3. Next Part: Part V — MoE Internals & Fine-Tuning: Mixtral, DeepSeek-V3, Qwen3-MoE.
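For exercise 2, a minimal sketch of the manual preemption test, run from a separate shell on the training node (assumes your script is named train.py, as in exercise 1):

```python
# Sketch for exercise 2: send SIGUSR1 to a running train.py to exercise the
# graceful-save path. Run from a separate shell on the same node; pgrep -f
# matches the training processes by command line.
import os, signal, subprocess

pids = subprocess.run(["pgrep", "-f", "train.py"],
                      capture_output=True, text=True).stdout.split()
for pid in pids:
    os.kill(int(pid), signal.SIGUSR1)   # GracefulPreemption._handler sets requested=True
    print(f"sent SIGUSR1 to {pid}")
# Expected result: "[preempt] saved at step N, exiting cleanly" in the training
# log, and a fresh ckpt/latest symlink on disk.
```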