
Tensor Parallelism

A technique that scales inference and training by splitting large model computations across devices within layers.

Tensor parallelism is a fundamental distributed-computing technique for running models that do not fit in the memory of a single device. The matrix operations inside each layer are split across multiple GPUs, so each device stores and computes only a shard of the layer's weights. This not only makes execution of very large models possible, but also shapes performance design in large-scale inference serving.
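The core idea can be illustrated with a minimal sketch, simulated on CPU with NumPy rather than real GPUs. Here a weight matrix is split column-wise across hypothetical devices (the names `num_devices`, `shards`, and the shapes are illustrative assumptions, not from any particular framework): each "device" computes a slice of the output features, and a gather step reassembles the full result.

```python
import numpy as np

rng = np.random.default_rng(0)

num_devices = 4
x = rng.standard_normal((8, 16))   # activations: (batch, d_in)
W = rng.standard_normal((16, 32))  # full layer weight: (d_in, d_out)

# Column-parallel split: each device holds one vertical shard of W.
shards = np.split(W, num_devices, axis=1)

# Each device multiplies the full input by its own shard independently...
partial_outputs = [x @ shard for shard in shards]

# ...then an all-gather (here just a concatenation) reassembles the output.
y_parallel = np.concatenate(partial_outputs, axis=1)

# The sharded computation matches the single-device result.
y_full = x @ W
assert np.allclose(y_parallel, y_full)
```

In a real system each shard lives on a different GPU and the concatenation is a collective communication step (an all-gather), which is why communication bandwidth between devices becomes a central performance concern.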