Sequence + Context Parallel: Ulysses + Ring Attention + 1M Context

Getting past the memory limit of long-context fine-tuning: split the sequence/context across GPUs. Covers DeepSpeed-Ulysses (head-wise sequence parallelism), Ring Attention (UC Berkeley), and Megatron Sequence Parallel, the techniques that make 1M-token context feasible. This is the technical foundation of Kimi-1.5's (Moonshot) 2M context recipe.

Şükrü Yusuf KAYA

32 min read

6/26/2026

Advanced

1. Why Sequence Parallel?

The limit for long-context fine-tuning is activation memory:

seq=128K, batch=1, Llama 70B → 60 GB+ of activations
1M tokens → 500 GB+ → beyond a single GPU; even 8×H100 struggles

Solution: parallelize the computation along the sequence dimension.
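The arithmetic behind these figures can be sketched as a back-of-the-envelope estimate. This is only a rough model: it assumes activation memory grows linearly with sequence length (plausible when FlashAttention-style kernels avoid materializing the full attention matrix) and takes the 60 GB at 128K figure above as its reference point. The function name `per_gpu_activation_gb` and its parameters are hypothetical, introduced for illustration.

```python
def per_gpu_activation_gb(seq_len, sp_degree, ref_seq_len=128 * 1024, ref_gb=60.0):
    """Rough per-GPU activation estimate.

    Assumption: activations scale linearly with sequence length and are
    split evenly across `sp_degree` sequence-parallel ranks.
    """
    total_gb = ref_gb * seq_len / ref_seq_len
    return total_gb / sp_degree


ONE_MILLION = 1024 * 1024  # "1M" tokens, taken here as 2**20

for sp in (1, 4, 8):
    gb = per_gpu_activation_gb(ONE_MILLION, sp)
    print(f"SP degree {sp}: ~{gb:.0f} GB of activations per GPU")
```

Under these assumptions a 1M-token sequence needs about 480 GB of activations in total, which matches the order of magnitude quoted above, and only an 8-way split brings the per-GPU share back to the 128K level.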

Method	Split dim	Comm pattern	Use case
DeepSpeed-Ulysses	sequence outside attention, heads inside attention	4× all-to-all per attention layer (Q, K, V, output)	sequence-parallel attention
Ring Attention (UC Berkeley)	sequence dimension	ring send/recv of K/V blocks	1M context FT
Megatron Sequence Parallel	sequence dim in LayerNorm/dropout regions	all-gather + reduce-scatter	long-sequence extension of tensor parallelism (TP)

2. DeepSpeed-Ulysses

Insight: the attention operations QK^T and softmax(...)V are independent across the head dimension, so each GPU can run full-sequence attention for its own subset of heads.

GPU 0: Q, K, V for heads 0-3
GPU 1: Q, K, V for heads 4-7
...
Forward (attention):
  - Gather Q, K, V along the sequence dim while scattering along heads (input all-to-all)
  - Per-head attention compute (no comm)
  - Scatter the output back along the sequence dim (output all-to-all)

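Advantage: with N GPUs, sequences roughly N× longer become feasible. The sequence-parallel degree must divide the number of attention heads, so it cannot exceed the head count.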
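The resharding done by the two all-to-alls can be illustrated with a single-process simulation. This is only a sketch of the data layout, not DeepSpeed's implementation: ranks are plain Python lists, each tensor element is represented by its `(token, head)` coordinate, and the helper names (`shard_by_sequence`, `all_to_all_seq_to_head`, `all_to_all_head_to_seq`) are hypothetical.

```python
def shard_by_sequence(seq_len, num_heads, world):
    """Layout outside attention: each rank holds one sequence chunk, ALL heads."""
    chunk = seq_len // world
    return [
        sorted((t, h) for t in range(r * chunk, (r + 1) * chunk) for h in range(num_heads))
        for r in range(world)
    ]


def all_to_all_seq_to_head(shards, num_heads):
    """Simulated input all-to-all: entries for head group j are sent to rank j."""
    world = len(shards)
    heads_per_rank = num_heads // world
    out = [[] for _ in range(world)]
    for shard in shards:
        for t, h in shard:
            out[h // heads_per_rank].append((t, h))
    return [sorted(s) for s in out]


def all_to_all_head_to_seq(shards, seq_len):
    """Simulated output all-to-all: entries for sequence chunk i go back to rank i."""
    world = len(shards)
    chunk = seq_len // world
    out = [[] for _ in range(world)]
    for shard in shards:
        for t, h in shard:
            out[t // chunk].append((t, h))
    return [sorted(s) for s in out]


SEQ_LEN, NUM_HEADS, WORLD = 8, 4, 2
seq_shards = shard_by_sequence(SEQ_LEN, NUM_HEADS, WORLD)
head_shards = all_to_all_seq_to_head(seq_shards, NUM_HEADS)
restored = all_to_all_head_to_seq(head_shards, SEQ_LEN)

for rank, shard in enumerate(head_shards):
    tokens = sorted({t for t, _ in shard})
    heads = sorted({h for _, h in shard})
    print(f"rank {rank}: tokens {tokens[0]}..{tokens[-1]}, heads {heads}")
```

After the first all-to-all every rank sees the whole sequence but only its own heads, which is exactly what per-head attention needs; the second all-to-all restores the sequence-sharded layout for the rest of the layer.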

Numbers (4×H100 SXM unless noted):

- Standard FA2: max seq=128K (Llama 70B + grad-ckpt)
- Ulysses (4-way SP): max seq=512K
- Ulysses (8-way SP, 8 GPUs): max seq=1M

3. Ring Attention (Berkeley)

Insight: softmax(QK^T)V can be computed iteratively. Each GPU holds one chunk of the sequence and computes the result block by block while K/V chunks travel around a ring topology.

GPU 0: Q, K, V for seq tokens [0..127]
GPU 1: Q, K, V for seq tokens [128..255]
GPU 2: Q, K, V for seq tokens [256..383]
GPU 3: Q, K, V for seq tokens [384..511]

Step 1: Each GPU computes local Q × local K (sub-block)
Step 2: Pass K, V to the right neighbor (ring)
Step 3: Local Q × newly received K (cross-block)
...
After N steps: every Q has been multiplied with every K

Accumulating the partial results with online softmax keeps the computation numerically stable.
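The ring schedule and the online-softmax accumulation described above can be sketched in plain Python. This is a minimal single-process simulation under stated assumptions, not the Ring Attention reference code: one head, no causal mask, no 1/sqrt(d) scaling (matching the softmax(QK^T)V form used here), and "passing K, V to the right" is modeled by indexing the block that would have arrived at each step. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax(q . K^T) V over the whole sequence for one query."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(len(values[0]))]


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Each rank keeps its Q block and sees one K/V block per ring step."""
    world = len(q_blocks)
    outputs = []
    for rank in range(world):
        rank_out = []
        for q in q_blocks[rank]:
            m, l = -math.inf, 0.0              # running max and softmax denominator
            acc = [0.0] * len(v_blocks[0][0])  # running un-normalized output
            for step in range(world):
                src = (rank - step) % world    # block received after `step` right-shifts
                for k, v in zip(k_blocks[src], v_blocks[src]):
                    s = dot(q, k)
                    m_new = max(m, s)
                    scale = math.exp(m - m_new)  # rescale old statistics to the new max
                    p = math.exp(s - m_new)
                    l = l * scale + p
                    acc = [a * scale + p * vj for a, vj in zip(acc, v)]
                    m = m_new
            rank_out.append([a / l for a in acc])
        outputs.append(rank_out)
    return outputs


random.seed(0)
WORLD, BLOCK, DIM = 4, 3, 2


def rand_blocks():
    return [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(BLOCK)]
            for _ in range(WORLD)]


q_blocks, k_blocks, v_blocks = rand_blocks(), rand_blocks(), rand_blocks()
ring_out = ring_attention(q_blocks, k_blocks, v_blocks)

all_k = [k for blk in k_blocks for k in blk]
all_v = [v for blk in v_blocks for v in blk]
max_err = max(
    abs(a - b)
    for rank in range(WORLD)
    for q, out in zip(q_blocks[rank], ring_out[rank])
    for a, b in zip(out, full_attention(q, all_k, all_v))
)
print(f"max abs difference vs. full attention: {max_err:.2e}")
```

Because the running maximum and denominator are rescaled whenever a larger score appears, the order in which K/V blocks arrive does not change the result, which is what lets each rank start from its own local block.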

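Advantage: compute scales N-way and communication is bandwidth-efficient (ring topology: each GPU only talks to its neighbors, and K/V transfers can overlap with block compute). Disadvantage: the implementation is complex (it requires Megatron / DeepSpeed integration).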

✅ Deliverables

1) Read the DeepSpeed-Ulysses paper. 2) Simulate a long-context scenario with 4-GPU Ulysses. 3) Next lesson: 4.7, Llama 3.3 70B QLoRA + FSDP.
