
Streaming & Sharded Datasets: Training on 500GB+ Data Without the Disk

A 1 TB dataset fits on a 4090 box's 2 TB NVMe, but tokenizing and caching it can demand 5 TB. The solution: streaming. This lesson covers HuggingFace datasets.IterableDataset, WebDataset .tar shards, MosaicML Streaming (MDS), S3 streaming, resumable streaming (the run dies mid-epoch, then resumes exactly where it left off), and the multi-worker collator pattern.

Şükrü Yusuf KAYA
28-minute read
Advanced

1. The 3 Streaming Options#

| Option | Ease of use | Cloud-friendly | Resumable | Shuffle quality |
|---|---|---|---|---|
| HF IterableDataset | high | good | limited | good (shuffle buffer) |
| WebDataset .tar | medium | excellent | good | shard-level |
| MosaicML MDS | low | excellent | excellent | excellent |
Cookbook recommendation:
  • 5-50 GB dataset → HF IterableDataset
  • 50 GB - 1 TB → WebDataset
  • 1 TB+ or production → MosaicML MDS
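The "good (shuffle buffer)" rating for HF IterableDataset refers to approximate shuffling: instead of a global permutation (impossible on an endless stream), a fixed-size buffer is filled from the stream and samples are drawn from it at random. A minimal pure-Python sketch of the idea (buffer size and the toy stream are illustrative, not HF's implementation):

```python
import random

def shuffle_buffer(stream, buffer_size, seed=42):
    """Approximate shuffle: hold `buffer_size` samples in memory,
    yield a random one, refill from the stream."""
    rng = random.Random(seed)
    buf = []
    for sample in stream:
        buf.append(sample)
        if len(buf) >= buffer_size:
            # swap a random buffered sample out for the incoming one
            yield buf.pop(rng.randrange(len(buf)))
    # stream exhausted: drain what is left in the buffer
    while buf:
        yield buf.pop(rng.randrange(len(buf)))

shuffled = list(shuffle_buffer(range(10), buffer_size=4))
print(shuffled)  # the same 10 elements, in locally shuffled order
```

A larger buffer gives better shuffle quality at the cost of RAM; this is exactly the `.shuffle(1000)` knob in the WebDataset pipeline below and the `buffer_size` argument of `IterableDataset.shuffle`.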
```python
# === WebDataset streaming — S3 + tokenize on-the-fly ===
import webdataset as wds
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# Shard URL pattern
urls = "pipe:aws s3 cp s3://my-bucket/tr-corpus/shard-{000000..000500}.tar -"

def tokenize_pair(text):
    # padding="max_length" so samples have equal shape and stack into a batch
    enc = tok(text, max_length=2048, truncation=True,
              padding="max_length", return_tensors="pt")
    return enc["input_ids"].squeeze(0), enc["attention_mask"].squeeze(0)

ds = (
    wds.WebDataset(urls, shardshuffle=100, resampled=True)
    .shuffle(1000)                      # sample-level shuffle buffer
    .decode()
    .to_tuple("text")
    .map(lambda sample: tokenize_pair(sample[0]))
    .batched(8)
)
loader = wds.WebLoader(ds, num_workers=4, batch_size=None,
                       persistent_workers=True)

# Training loop
for batch in loader:
    input_ids, attn_mask = batch
    # ...
```
WebDataset S3 streaming + on-the-fly tokenization
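The multi-worker collator pattern from the intro complements this: loader workers stream and tokenize in parallel, and a collate function pads each batch to its longest sequence just before it reaches the GPU, instead of padding everything to `max_length` as above. A framework-agnostic sketch (with torch you would stack the lists into tensors; `pad_id=0` is an assumption, use `tok.pad_token_id` in practice):

```python
def pad_collate(batch, pad_id=0):
    """Pad variable-length token-id lists to the longest in the batch.
    batch: list of (input_ids, attention_mask) pairs."""
    max_len = max(len(ids) for ids, _ in batch)
    input_ids, attn_mask = [], []
    for ids, mask in batch:
        pad = [pad_id] * (max_len - len(ids))
        input_ids.append(list(ids) + pad)          # pad token ids
        attn_mask.append(list(mask) + [0] * len(pad))  # mask out padding
    return input_ids, attn_mask

# With torch: DataLoader(ds, num_workers=4, collate_fn=pad_collate)
ids, mask = pad_collate([([5, 6, 7], [1, 1, 1]), ([8], [1])])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Per-batch padding wastes far fewer tokens than padding every sample to 2048, which matters when most documents are short.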

2. MosaicML Streaming (MDS)#

MDS is an alternative to tar shards: random-access + S3-friendly + resumable.
```bash
# Convert HF dataset to MDS
python -m streaming.text.convert \
  --input s3://bucket/raw/ \
  --output s3://bucket/mds/ \
  --shard_size 67108864   # 64MB shards
```
```python
from streaming import StreamingDataset

ds = StreamingDataset(
    local="/tmp/cache",          # disk cache
    remote="s3://bucket/mds/",
    shuffle=True,
    shuffle_seed=42,
    batch_size=8,
    num_canonical_nodes=1,
)
```
Advantages:
  • Random access (O(1) access to any sample)
  • Exact resume (continue from step N)
  • Multi-node deterministic sharding
  • Memory-efficient (samples are mmap'ed)
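The exact-resume advantage can be illustrated in miniature: because the sample order for a given seed is deterministic, resuming reduces to "recompute the order, skip the first N samples". A toy sketch of that idea (not the MDS implementation itself):

```python
import random

def sample_order(num_samples, seed):
    """Deterministic shuffled visit order for one epoch."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)
    return order

def resume_iter(num_samples, seed, samples_seen):
    """Exact resume: replay the same order, skip what was already consumed."""
    order = sample_order(num_samples, seed)
    yield from order[samples_seen:]

first_run = sample_order(8, seed=42)
resumed = list(resume_iter(8, seed=42, samples_seen=3))
assert first_run[3:] == resumed  # picks up exactly where the run died
```

This only works because every sample is random-access addressable; a plain tar stream would have to be re-read and skipped sequentially, which is what makes WebDataset resumability shard-level rather than sample-level.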
✅ Deliverable
  1. Split a small dataset into WebDataset shards.
  2. Run the streaming loader on a single 4090.
  3. Next lesson: 2.11 — Long-Context Dataset Engineering.
