Technical Glossary: Generative AI and LLM
Tensor Parallelism
A technique that scales inference and training by splitting large model computations across devices within layers.
Tensor parallelism is one of the fundamental distributed computing techniques for running models that do not fit into the memory of a single device. The matrix operations inside each layer, such as the weight matrices of attention and feed-forward blocks, are split across multiple GPUs, which each compute a partial result that is then combined. This not only makes model execution possible, but also shapes performance design in large-scale inference serving.
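The idea can be sketched in a few lines. The snippet below simulates column-parallel splitting of a single linear layer with NumPy: the weight matrix is sharded column-wise across hypothetical "devices", each computes its slice of the output, and the slices are concatenated (the step a real multi-GPU system would do with an all-gather collective). All names and the device count here are illustrative, not a real framework API.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of input activations
W = rng.standard_normal((8, 16))   # full weight matrix of one linear layer

num_devices = 4                                 # hypothetical device count
shards = np.split(W, num_devices, axis=1)       # one column shard per device
partials = [x @ shard for shard in shards]      # each device's local matmul
y_parallel = np.concatenate(partials, axis=1)   # gather output slices

y_full = x @ W                                  # single-device reference
assert np.allclose(y_parallel, y_full)          # sharded result matches
```

No device ever needs to hold the full weight matrix, which is why the technique lets a model larger than any single GPU's memory still run; the cost is the communication needed to gather the partial outputs.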
