Lambda vs CoreWeave vs RunPod: which one does the cookbook recommend?

A separate recommendation per Lab: **Spike/Reference 4090** → RunPod Community (cheapest, fast boot). **Reference 8×H100, 1-4 hours** → Lambda On-Demand (most common, no surprises). **Long-running production training** → CoreWeave 1-yr reserve or Lambda Reserve. The cookbook is not affiliated with any of them; this is a pure cost-performance analysis.

How does the cookbook manage spot instance preemption risk?

On spot, the cookbook recommends running only the **Spike + Reference** Labs. **Putting production training on spot is risky**: a 70B FT takes 8 hours, and if the spot instance is interrupted 3 times, the 3 restarts can push the total cost above reserved pricing. A 1-yr reserve comes with a ~35-40% discount, which makes it the better choice for predictable workloads.

Container & Slurm Recipes: The Bridge from a Single 4090 to Cloud Multi-Node

A guide to moving training you prepared on a single 4090 onto an 8×H100 cluster: Slurm sbatch template, multi-node NCCL setup, EFA/InfiniBand sanity check, real hourly prices from Lambda/RunPod/CoreWeave/Vast, preemption-tolerant training, checkpoint manifest, FAULT_TOLERANCE principles.

Şükrü Yusuf KAYA

38-minute read

14.05.2026

Advanced


🎯 What this lesson is for

Most of the cookbook runs on an RTX 4090, but Part IV (70B FT), Part V (DeepSeek-V3 671B), and Part XII (R1-style RL) require cloud multi-node. This lesson teaches the discipline needed to finish dataset prep and sanity checks on the 4090 and only then transfer to the cluster. Clusters are expensive; validate as much as possible on the 4090.

1. Cloud GPU Economics Table (Early 2026, Excluding Spot)

Provider	GPU	$/hour (on-demand)	$/hour (1-yr reserve)	Notes
Lambda	H100 SXM 80GB	$2.99	$1.99	most common, fast boot
Lambda	A100 80GB	$1.79	—	budget-friendly
Lambda	8×H100 SXM	$23.92	—	InfiniBand 3.2 Tb/s
RunPod (Community)	RTX 4090	$0.34-$0.69	—	spot risk, fast
RunPod (Secure)	H100 PCIe 80GB	$2.49	—	enterprise
CoreWeave	8×H100 SXM	$24.32	$18.40	400G InfiniBand
TogetherAI	H100 cluster	$2.40-$3.49	—	training service
Vast.ai	RTX 4090	$0.20-$0.50	—	spot, can be taken away at any time
AWS	p5.48xlarge (8×H100)	$98.32	$55.62	EFA, most expensive
Hyperbolic	H100	$1.49	—	new, aggressive pricing

The cookbook's rule: Lab S1 (Spike) → local RTX 4090. Lab S2 (Reference) → local 4090 plus one 8×H100 hour (~$24, about 1 hour on average). Lab S3/S4 (Production/Research) → Lambda/CoreWeave 1-year reserve or hybrid.

Hourly electricity in Turkey:

An RTX 4090 draws 450 W at full load. At ₺3.5/kWh, one Lab hour costs ≈ ₺1.6 (~$0.05). 1000 hours (40 days of full use) → ₺1600. The same duration in the cloud on a RunPod Community 4090 costs $400+. Local hardware pays for itself in 4-5 months.
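The arithmetic above is easy to re-run with your own tariff. The sketch below is only an illustration: the function names and the $0.40/hour cloud rate are assumptions (the rate simply sits inside the RunPod Community range from the table), while the 450 W draw and the ₺3.5/kWh tariff come from the text.

```python
# Sketch: local electricity cost vs. cloud rental for an RTX 4090.
# ASSUMPTION: cloud_usd_per_hour=0.40 is an illustrative value inside the
# $0.34-$0.69 RunPod Community range quoted in the table.

def electricity_cost_try(hours: float, watts: float = 450.0, try_per_kwh: float = 3.5) -> float:
    """Energy cost in Turkish lira for `hours` at full load."""
    return hours * (watts / 1000.0) * try_per_kwh


def cloud_cost_usd(hours: float, cloud_usd_per_hour: float = 0.40) -> float:
    """Rental cost in USD for the same number of hours."""
    return hours * cloud_usd_per_hour


if __name__ == "__main__":
    print(f"1 h local:    TRY {electricity_cost_try(1):.3f}")
    print(f"1000 h local: TRY {electricity_cost_try(1000):.0f}")
    print(f"1000 h cloud: USD {cloud_cost_usd(1000):.0f}")
```

Swap in your own wattage (a power-limited card draws less) and tariff before drawing conclusions about the break-even point.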

2. Slurm sbatch Template: Cookbook Reference

Slurm is an HPC cluster scheduler. Lambda, CoreWeave, AWS ParallelCluster, academic HPC: they all speak Slurm. Every multi-node Lab in the cookbook uses the following template:

bash

#!/bin/bash
#SBATCH --job-name=ftc-llama-70b-qlora
#SBATCH --nodes=2                            # 2 nodes × 8 GPUs = 16 GPUs
#SBATCH --ntasks-per-node=1                  # one torchrun launcher per node; torchrun spawns the 8 workers
#SBATCH --gres=gpu:h100:8
#SBATCH --cpus-per-task=64                   # 8 CPUs per GPU × 8 GPUs, all owned by the launcher task
#SBATCH --mem=0                              # all available
#SBATCH --time=08:00:00                      # walltime limit
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --signal=B:SIGUSR1@90                # 90s before walltime → graceful shutdown
#SBATCH --requeue                            # reschedule after preemption or node failure
 
set -euo pipefail
 
# === Repro environment ===
export PYTHONHASHSEED=42
export CUBLAS_WORKSPACE_CONFIG=:4096:8
export TOKENIZERS_PARALLELISM=false
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_HOME=/scratch/$USER/hf-cache
 
# === NCCL & distributed ===
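export NCCL_DEBUG=WARN                       # INFO for debugging, WARN in production
export NCCL_ASYNC_ERROR_HANDLING=1           # newer PyTorch releases read TORCH_NCCL_ASYNC_ERROR_HANDLING
export NCCL_IB_GID_INDEX=3                   # GID index for InfiniBand/RoCE (cluster-specific)
export NCCL_IB_DISABLE=0                     # keep IB enabled (otherwise NCCL falls back to TCP)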
export NCCL_SOCKET_IFNAME=^lo,docker0        # exclude loopback & docker
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_SOCKET_NTHREADS=2
export OMP_NUM_THREADS=4
 
# === Master addr, resolved via Slurm ===
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
MASTER_PORT=$((RANDOM % 10000 + 30000))
export MASTER_ADDR MASTER_PORT
echo "[$(date)] MASTER_ADDR=$MASTER_ADDR  MASTER_PORT=$MASTER_PORT"
 
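# === Graceful preemption handler ===
# The B: prefix delivers SIGUSR1 to this batch shell only, so the handler
# forwards it to the job steps, waits for them to checkpoint, then requeues.
trap 'echo "[$(date)] Got SIGUSR1, snapshotting checkpoint then requeue"; \
      scancel --signal=USR1 $SLURM_JOBID; \
      wait; \
      scontrol requeue $SLURM_JOBID; \
      exit 0' SIGUSR1

# === Run ===
# srun is backgrounded and followed by `wait`: bash defers trap handlers while a
# foreground command runs, so a foreground srun would block the handler above.
# `apptainer exec` runs the given command directly rather than the image runscript.
# \$SLURM_NODEID is escaped so it expands inside each task, not in the batch shell.
srun --label \
  apptainer exec --nv \
    --bind /scratch:/scratch \
    --bind /home/$USER:/home/$USER \
    ftc.sif \
    bash -c "\
      cd /workspace && \
      uv run torchrun \
        --nproc_per_node=8 \
        --nnodes=$SLURM_NNODES \
        --node_rank=\$SLURM_NODEID \
        --master_addr=$MASTER_ADDR \
        --master_port=$MASTER_PORT \
        --rdzv_id=$SLURM_JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
        train.py \
        --config configs/llama3_70b_qlora_fsdp.yaml \
        --resume_from_checkpoint /scratch/$USER/ckpts/last \
    " &
SRUN_PID=$!
wait "$SRUN_PID"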

FTC.Slurm.sbatch: multi-node, preemption-tolerant FT template

Six points to note in this template:

1) `--signal=B:SIGUSR1@90`: Slurm sends SIGUSR1 90 seconds before the walltime limit. Catch the signal, save a checkpoint, and put the job back in the queue with `scontrol requeue`; that is what makes the job preemption-tolerant.
2) `--requeue`: automatic retry on node failure.
3) `MASTER_PORT` is chosen at random so that multiple jobs on the same node do not collide.
4) `NCCL_SOCKET_IFNAME=^lo,docker0`: exclude the loopback and docker interfaces (otherwise NCCL may try to route through them and hang).
5) `rdzv_backend=c10d`: the elastic backend instead of the older `static` one; it handles nodes joining and leaving.
6) `apptainer`: the preferred way to combine Slurm with containers (no sudo access to a Docker daemon is required).
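To see which interfaces a given `NCCL_SOCKET_IFNAME` value would leave available on a node, a small emulation of the matching rules can help. This is a sketch under assumptions: it mirrors the rules described in the NCCL documentation (prefix match by default, `^` to exclude, `=` for exact match) and is not NCCL's own implementation; `filter_interfaces` is a hypothetical helper name.

```python
# Sketch: emulate how an NCCL_SOCKET_IFNAME value selects network interfaces.
# ASSUMPTION: mirrors the documented rules (prefix match, "^" excludes,
# "=" means exact match); this is not NCCL's implementation.
import socket
from typing import List


def filter_interfaces(spec: str, names: List[str]) -> List[str]:
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def matches(name: str) -> bool:
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Keep matching names in include mode, non-matching names in exclude mode.
    return [n for n in names if matches(n) != exclude]


if __name__ == "__main__":
    local = [name for _, name in socket.if_nameindex()]
    print("all interfaces:     ", local)
    print("after ^lo,docker0:  ", filter_interfaces("^lo,docker0", local))
```

Note that under prefix matching `^lo` removes every interface whose name starts with `lo`, not just the loopback device. If the filtered list comes back empty on a node, the value needs adjusting before training starts.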

3. NCCL Sanity Check: A 5-Minute Test Before You Train on the Cluster

Whenever you get onto a new cluster, run the NCCL test before starting training. About 50 lines of code and 5 minutes can save thousands of dollars.

python

# nccl_test.py — Multi-node NCCL bandwidth & latency sanity check
import os, time
import torch
import torch.distributed as dist
 
def main():
    rank = int(os.environ["RANK"])
    world = int(os.environ["WORLD_SIZE"])
    local = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local)
    dist.init_process_group("nccl", rank=rank, world_size=world)
 
    if rank == 0:
        print(f"world={world}  device={torch.cuda.get_device_name(0)}")
 
    # 1) All-reduce bandwidth: the "gradient sync" benchmark for training
    for size_mb in [1, 4, 16, 64, 256, 1024]:
        x = torch.randn(size_mb * 256 * 1024, device="cuda")  # MB → fp32 floats
        # Warmup
        for _ in range(5):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        # Bench
        t0 = time.perf_counter()
        for _ in range(20):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        t = (time.perf_counter() - t0) / 20
        bw = (size_mb * 1024 * 1024) / t / 1e9  # payload bytes / time → GB/s (size_mb MiB is already the byte count)
        if rank == 0:
            print(f"all-reduce {size_mb:5d} MB  →  {t*1000:7.2f} ms   ~{bw:6.1f} GB/s")
 
    # 2) Send/recv ping (needs at least 2 ranks, so skip it on a single GPU)
    if rank == 0 and world > 1:
        x = torch.zeros(1, device="cuda")
        dist.send(x, dst=1)
        dist.recv(x, src=1)
        print("ping/pong OK")
    elif rank == 1:
        x = torch.zeros(1, device="cuda")
        dist.recv(x, src=0)
        dist.send(x, dst=0)
 
    dist.barrier()
    dist.destroy_process_group()
 
if __name__ == "__main__":
    main()

nccl_test.py: mandatory 5-minute sanity check before training

Expected numbers (8×H100 SXM, NVLink + InfiniBand):

Op size	Time	Bandwidth
1 MB	~0.2 ms	~5 GB/s (latency-bound)
16 MB	~0.5 ms	~32 GB/s
256 MB	~3-4 ms	~70-85 GB/s
1 GB	~12-15 ms	~70-85 GB/s

If a 256 MB all-reduce takes 30 ms or more, or the bandwidth comes out below 30 GB/s, NCCL has fallen back to TCP and IB is disabled. Rerun with `NCCL_DEBUG=INFO`: if you see `Using network Socket` instead of `Using network IB`, IB is off; verify `NCCL_IB_DISABLE=0` and the interface names.
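The rule of thumb above can be turned into a small check that runs over your benchmark numbers and logs. This is a minimal sketch: the helper names are hypothetical, the thresholds are the ones from the text, and the bandwidth is the plain "payload bytes divided by time" figure that the expected-numbers table uses.

```python
# Sketch: interpret nccl_test.py output and NCCL_DEBUG=INFO logs.
# ASSUMPTION: thresholds (30 ms / 30 GB/s for a 256 MB all-reduce) are the
# rule of thumb from the text; all helper names are hypothetical.

def allreduce_bandwidth_gbs(size_mb: int, seconds: float) -> float:
    """Algorithmic bandwidth: payload bytes divided by time, in GB/s."""
    return size_mb * 1024 * 1024 / seconds / 1e9


def looks_like_tcp_fallback(ms_256mb: float) -> bool:
    """True when a 256 MB all-reduce is slow enough to suspect a TCP fallback."""
    bandwidth = allreduce_bandwidth_gbs(256, ms_256mb / 1000.0)
    return ms_256mb >= 30.0 or bandwidth < 30.0


def detect_transport(log_text: str) -> str:
    """Return 'IB', 'Socket', or 'unknown' based on NCCL_DEBUG=INFO output."""
    if "Using network IB" in log_text:
        return "IB"
    if "Using network Socket" in log_text:
        return "Socket"
    return "unknown"


if __name__ == "__main__":
    for ms in (3.5, 12.0, 35.0):
        bw = allreduce_bandwidth_gbs(256, ms / 1000.0)
        print(f"256 MB in {ms:5.1f} ms -> {bw:6.1f} GB/s, suspect TCP: {looks_like_tcp_fallback(ms)}")
```

Feeding the captured job log into `detect_transport` gives a quick yes/no before you spend cluster hours on a run that is silently using sockets.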

What does this test mean on a single 4090? With one node and one GPU the world size is 1, so all-reduce degenerates into a no-op. It is still useful: it verifies that NCCL is installed and that the environment loads correctly. On multi-GPU 4090 setups (1 PC, 2× 4090 over PCIe) bandwidth should not exceed ~50 GB/s (the PCIe 4.0 x16 limit; the link itself carries roughly 32 GB/s per direction).

4. Preemption-Tolerant Training (Life or Death for Spot Instances)

Vast.ai spot, RunPod Community, AWS Spot: you can get a 4090 for $0.20 an hour, but it can disappear at any moment. The cookbook's preemption discipline:

Checkpoint manifest

Every N steps (cookbook default: 50-100), save the full state:

python

# ckpt_manifest.py: atomic, resumable checkpoint saving
import os, json, shutil, hashlib
from pathlib import Path
import torch
from accelerate import Accelerator
 
class CheckpointManager:
    """
    Atomic checkpoint save + manifest.
    A half-written checkpoint is never referenced by the manifest, so resume falls back to the last intact one.
    """
    def __init__(self, out_dir: str, accelerator: Accelerator, keep_last: int = 3):
        self.out = Path(out_dir)
        self.out.mkdir(parents=True, exist_ok=True)
        self.acc = accelerator
        self.keep_last = keep_last
        self.manifest_path = self.out / "manifest.json"
 
    def _hash_state(self, state: dict) -> str:
        h = hashlib.sha256()
        for k in sorted(state):
            v = state[k]
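            if torch.is_tensor(v):
                # Hash raw bytes: numpy has no bfloat16, so .numpy() alone fails on bf16 tensors
                h.update(v.detach().cpu().flatten().view(torch.uint8).numpy().tobytes())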
            else:
                h.update(json.dumps(v, sort_keys=True, default=str).encode())
        return h.hexdigest()[:16]
 
    def save(self, step: int, model, optimizer, scheduler, scaler, metadata):
        tag = f"step-{step:08d}"
        staging = self.out / f"{tag}.partial"
        final = self.out / tag

        # Only the main process manages directories; assumes self.out is on a shared filesystem.
        if self.acc.is_main_process:
            if staging.exists():
                shutil.rmtree(staging)  # leftover from an interrupted save
            staging.mkdir()
        self.acc.wait_for_everyone()

        # accelerate state: called on every rank, because with sharded backends
        # (FSDP/DeepSpeed) save_state is a collective operation.
        self.acc.save_state(str(staging))
        self.acc.wait_for_everyone()

        if self.acc.is_main_process:
            # extra
            torch.save({"scheduler": scheduler.state_dict(),
                        "scaler": scaler.state_dict() if scaler else None,
                        "metadata": metadata}, staging / "extras.pt")

            manifest = {
                "step": step,
                "tag": tag,
                "metadata": metadata,
                "files": sorted(p.name for p in staging.iterdir()),
            }
            with open(staging / "manifest.json", "w") as f:
                json.dump(manifest, f, indent=2)

            # Atomic rename
            if final.exists():
                shutil.rmtree(final)
            os.rename(staging, final)

            # Update top-level manifest (atomic write)
            tmp_top = self.manifest_path.with_suffix(".tmp")
            with open(tmp_top, "w") as f:
                json.dump({"last": tag, "step": step, "metadata": metadata}, f, indent=2)
            os.replace(tmp_top, self.manifest_path)

            # Cleanup old checkpoints (stale *.partial staging dirs do not count)
            ckpts = sorted(d for d in self.out.iterdir()
                           if d.is_dir() and d.name.startswith("step-")
                           and not d.name.endswith(".partial"))
            for old in ckpts[:-self.keep_last]:
                shutil.rmtree(old)

        self.acc.wait_for_everyone()

    def load_latest(self):
        if not self.manifest_path.exists():
            return None
        with open(self.manifest_path) as f:
            mf = json.load(f)
        ckpt = self.out / mf["last"]  # save() stores the tag under the "last" key
        # A complete checkpoint always carries its own manifest.json
        return ckpt if (ckpt / "manifest.json").exists() else None

Atomic, resumable checkpoint manager (preemption-safe)
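The core trick in the manager is "write to a temporary name, then rename". The stdlib-only sketch below isolates that pattern so it can be tried without a GPU. The helper names are hypothetical, and the `os.fsync` call is an addition for illustration that the manager above does not make.

```python
# Sketch: the write-to-temp-then-rename pattern in isolation (stdlib only).
# ASSUMPTION: helper names are hypothetical; fsync is an extra safety step
# that is not part of the CheckpointManager shown above.
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, payload: dict) -> None:
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)     # atomic on POSIX within a single filesystem


def demo() -> dict:
    with tempfile.TemporaryDirectory() as d:
        manifest = Path(d) / "manifest.json"
        atomic_write_json(manifest, {"last": "step-00000050", "step": 50})
        # Simulate a crash in the middle of the next update: only the
        # temporary file is touched, so the real manifest stays readable.
        (Path(d) / "manifest.tmp").write_text('{"last": "step-000001')
        return json.loads(manifest.read_text())


if __name__ == "__main__":
    print(demo())
```

Because readers only ever open `manifest.json`, they see either the old complete content or the new complete content, never a truncated file.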

Graceful shutdown signaling

`SIGUSR1` (the Slurm preemption signal) or `SIGTERM` (Docker/RunPod kill) is caught, and a mandatory checkpoint is written before exiting:

python

# inside train.py
import signal
 
class GracefulPreemption:
    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGUSR1, self._handler)
        signal.signal(signal.SIGTERM, self._handler)
 
    def _handler(self, signum, frame):
        print(f"[preempt] received signal {signum} — flushing checkpoint")
        self.requested = True
 
preempt = GracefulPreemption()
 
# Training loop
for step, batch in enumerate(loader):
    loss = train_step(batch)
    if step % 50 == 0 or preempt.requested:
        ckpt_mgr.save(step, model, opt, sched, scaler,
                      metadata={"loss": loss.item(), "preempted": preempt.requested})
        if preempt.requested:
            print("[preempt] checkpoint saved, exiting cleanly")
            break

preemption-safe training loop
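The loop above can be rehearsed on a laptop by sending the signal to your own process. The sketch below is an illustration under assumptions: it is POSIX-only (`SIGUSR1` does not exist on Windows), `run_loop` and the in-memory `saved` list are hypothetical stand-ins for the real train step and `CheckpointManager`, and the `start_step` argument shows one way to continue from the last saved step instead of from 0.

```python
# Sketch: exercise the preemption path locally by signalling our own process.
# ASSUMPTION: POSIX only; run_loop and the `saved` list stand in for the real
# train step and CheckpointManager.
import os
import signal
from typing import List, Optional


class GracefulPreemption:
    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGUSR1, self._handler)

    def _handler(self, signum, frame):
        self.requested = True


def run_loop(start_step: int, total_steps: int,
             preempt_at: Optional[int], every: int = 50) -> List[int]:
    preempt = GracefulPreemption()
    saved = []
    for step in range(start_step, total_steps):
        if step == preempt_at:
            os.kill(os.getpid(), signal.SIGUSR1)  # simulate the scheduler's signal
        if step % every == 0 or preempt.requested:
            saved.append(step)  # the real loop would call ckpt_mgr.save(...) here
            if preempt.requested:
                break
    return saved


first = run_loop(0, 200, preempt_at=120)
resumed = run_loop(first[-1] + 1, 200, preempt_at=None)
print("first run saved at:", first)
print("resumed run saved at:", resumed)
```

In a multi-process job every rank must take the save branch at the same step, so a real implementation would also need the ranks to agree on the flag (for example with a collective); that part is left out of this sketch.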

🐛 Failure Mode Drill #3: 'I trained for 8 hours, got preempted, and it starts from scratch'

Scenario: the spot instance was preempted and restarted, and training begins again from step=0. Possible causes: (a) The checkpoint is not atomic: the last write was cut short and the manifest is broken → you need atomic rename + manifest validation. (b) The `resume_from_checkpoint` flag is missing from the sbatch script, or the path is wrong. (c) Only the weights (.safetensors) were saved, without the optimizer state → momentum/variance is lost → you are effectively starting over. (d) The scheduler state was not saved → the LR warmup restarts from zero. Drill: test all 4 hypotheses one by one; which one is your root cause?
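Hypothesis (a) can be checked mechanically. The following is a minimal sketch, assuming the directory layout written by the `CheckpointManager` in this lesson (a `step-XXXXXXXX` directory containing a `manifest.json` with a `files` list); `validate_checkpoint` is a hypothetical helper, not part of any library.

```python
# Sketch: validate a checkpoint directory against its own manifest.
# ASSUMPTION: layout follows the CheckpointManager from this lesson
# (step-XXXXXXXX/manifest.json with a "files" list); names are hypothetical.
import json
import tempfile
from pathlib import Path
from typing import List


def validate_checkpoint(ckpt_dir: Path) -> List[str]:
    """Return a list of problems; an empty list means the checkpoint looks complete."""
    problems = []
    if ckpt_dir.name.endswith(".partial"):
        problems.append("staging directory was never renamed")
    manifest = ckpt_dir / "manifest.json"
    if not manifest.exists():
        return problems + ["manifest.json missing"]
    try:
        files = json.loads(manifest.read_text())["files"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return problems + ["manifest.json unreadable"]
    problems += [f"missing file: {name}" for name in files
                 if not (ckpt_dir / name).exists()]
    return problems


def demo():
    with tempfile.TemporaryDirectory() as d:
        good = Path(d) / "step-00000050"
        good.mkdir()
        (good / "extras.pt").write_bytes(b"x")
        (good / "manifest.json").write_text(json.dumps({"files": ["extras.pt"]}))
        bad = Path(d) / "step-00000100.partial"
        bad.mkdir()
        return validate_checkpoint(good), validate_checkpoint(bad)


if __name__ == "__main__":
    print(demo())
```

Running a check like this at startup, before loading any tensors, makes it obvious whether the job is restarting from 0 because of a broken checkpoint or because of one of the other three causes.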

5. Bench: Single 4090 → 8×H100, Transfer Numbers

Stage	Single 4090	8×H100 SXM	Where to run it
Dataset prep + tokenize 50K	8 min	8 min (CPU-bound)	4090 (on the cluster it only burns paid hours on CPU work)
QLoRA Llama 3.1 8B Sanity (50 step)	1.4 min	0.4 min	4090 (cheap)
QLoRA Llama 3.1 8B Reference (3 epoch)	1.2 hours	0.3 hours	4090 (cheap)
Full SFT Llama 3.3 70B (3 epoch, 50K samples)	not possible	5.6 hours	8×H100 (required)
DPO Llama 3.1 70B (1 epoch)	not possible	7.2 hours	8×H100
GRPO Llama 3.3 70B (1000 step)	not possible	12 hours	8×H100

Transfer practice:

1) Prepare the dataset on the 4090 and run the cookbook's Spike/Reference Lab there, so you know the code works.
2) Push the same container image to the cluster (through a registry).
3) Run it with Slurm sbatch.
4) Cost control: `squeue` and `sacct -j JOBID --format=JobID,JobName,State,Elapsed,MaxRSS`.
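The `Elapsed` column from `sacct` converts directly into an approximate bill. The sketch below is illustrative only: the function names are hypothetical, the default rate is the Lambda 8×H100 on-demand price from the table applied per node, and the parser assumes the `[DD-[HH:]]MM:SS` layout that `sacct` uses for elapsed time.

```python
# Sketch: turn a Slurm Elapsed value into an approximate bill.
# ASSUMPTION: Elapsed follows sacct's [DD-[HH:]]MM:SS layout; the default rate
# is the Lambda 8xH100 on-demand price from the table, charged per node.

def elapsed_to_hours(elapsed: str) -> float:
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    parts = [int(x) for x in elapsed.split(":")]
    while len(parts) < 3:          # MM:SS → 0:MM:SS
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return days * 24 + hours + minutes / 60 + seconds / 3600


def job_cost_usd(elapsed: str, nodes: int = 1, usd_per_node_hour: float = 23.92) -> float:
    return elapsed_to_hours(elapsed) * nodes * usd_per_node_hour


if __name__ == "__main__":
    for label, elapsed in [("Full SFT 70B", "05:36:00"), ("GRPO 70B", "12:00:00")]:
        print(f"{label}: {elapsed_to_hours(elapsed):.1f} h -> ${job_cost_usd(elapsed):.2f}")
```

Pass `nodes=2` for the two-node template above; reserved or spot pricing only changes `usd_per_node_hour`.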

✅ Deliverables for this lesson

1) Run the `nccl_test.py` above on your own cluster (if you do not have one, run it intra-node on a local machine with 2× GPUs). 2) Integrate the `CheckpointManager` above into a 10-step dummy training loop; to exercise the atomic save, SIGKILL the process during a write and then resume: the manifest must stay intact. 3) Next lesson: 0.5, Experiment Tracking Architecture: W&B + Hydra + DVC.

Frequently Asked Questions

I don't have Slurm, I have K8s. Is the cookbook's sbatch template useless?

For K8s + Kubeflow MPIJob/PyTorchJob the `torchrun` invocation is the same; only the outer-loop scheduler differs. The cookbook's Part XVI (Operations) lesson also has a K8s Operator (PyTorchJob CR) recipe. The discipline (env vars, NCCL flags, preemption handler) is shared.


Part 0 — Engineering Foundations

