# Data Parallelism (DDP): Foundation of Multi-GPU LLM Training — AllReduce and NCCL Anatomy

> Source: https://sukruyusufkaya.com/en/learn/llm-muhendisligi/ddp-data-parallel-allreduce-nccl
> Updated: 2026-05-13T13:00:29.136Z
> Category: LLM Engineering
> Module: Module 13: Distributed Training — Multi-GPU/Multi-Node

**TLDR:** Distributed Data Parallel (DDP) anatomy: the model is replicated on every GPU, each mini-batch is split across them, forward/backward runs independently per GPU, and gradients are synchronized with AllReduce. Covers NCCL (NVIDIA Collective Communication Library), the ring-allreduce algorithm and its bandwidth math, the PyTorch DDP API, launch scripts, and common pitfalls (uneven batches, batch norm sync).
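
A quick sketch of the ring-allreduce bandwidth math mentioned above (the standard result, with $S$ the gradient payload per GPU, $N$ the number of GPUs, and $B$ the per-link bandwidth):

$$
T_{\text{ring}} \approx \frac{2(N-1)}{N} \cdot \frac{S}{B} \xrightarrow{\;N \to \infty\;} \frac{2S}{B}
$$

Each GPU sends and receives $\tfrac{2(N-1)}{N} S$ bytes in total (one reduce-scatter pass plus one all-gather pass), so per-GPU traffic is nearly independent of the ring size.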


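As a concrete anchor for the PyTorch DDP API listed in the TLDR, here is a minimal single-file sketch: the toy `Linear` model, random dataset, and hyperparameters are illustrative placeholders, not taken from the article.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT in the env.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model, replicated on every GPU; DDP wraps it to AllReduce gradients.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Random stand-in dataset; DistributedSampler gives each rank a disjoint shard,
    # i.e. the mini-batch split described in the TLDR.
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP AllReduces gradient buckets during backward
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=4 train.py` (filename hypothetical): `torchrun` spawns one process per GPU and populates the environment variables the script reads, so the same code runs unchanged from one to many GPUs.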