Multimodal Transformer

A model design that processes different data types such as text, images, audio, or video within a shared attention architecture.

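A multimodal Transformer learns relationships across modalities by mapping each input type into a shared representation space, where attention can relate elements of one modality to elements of another, for example a word to an image region. Combining contextual signals from multiple data types in this way enables richer reasoning and generation than a single-modality model can offer. Such models are a core component of multimodal agent systems and of the broader effort to build unified foundation models.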
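To make the idea of a shared attention architecture concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not any particular model's implementation: the feature sizes, the random weights, and the helper names (`random_matrix`, `matmul`, `self_attention`) are all hypothetical, and it uses a single attention head with no training, positional encoding, or normalization. It shows only the core mechanism: text tokens and image patches are projected into one shared space, concatenated into one sequence, and passed through the same self-attention step.

```python
import math
import random

random.seed(0)  # deterministic toy weights

D_MODEL = 4  # width of the shared representation space (arbitrary toy value)


def random_matrix(rows, cols):
    """Return a rows x cols matrix of small random weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one token sequence."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    weights = [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        for qi in q
    ]
    return matmul(weights, v), weights


# Hypothetical raw features: 3 text tokens (dim 6) and 2 image patches (dim 8).
text_features = random_matrix(3, 6)
image_features = random_matrix(2, 8)

# Modality-specific projections map each input into the shared D_MODEL space.
text_tokens = matmul(text_features, random_matrix(6, D_MODEL))
image_tokens = matmul(image_features, random_matrix(8, D_MODEL))

# One joint sequence: every token can attend to tokens from either modality.
sequence = text_tokens + image_tokens

w_q, w_k, w_v = (random_matrix(D_MODEL, D_MODEL) for _ in range(3))
fused, attn = self_attention(sequence, w_q, w_k, w_v)

# Share of the first text token's attention that lands on the image patches.
text_to_image = sum(attn[0][3:])
print(f"fused sequence: {len(fused)} tokens x {len(fused[0])} dims")
print(f"text token 0 -> image patches attention share: {text_to_image:.3f}")
```

Because both modalities sit in the same sequence, each row of `attn` spreads its weight over text and image positions alike, which is what lets the model relate content across data types. Real systems typically replace the random projections with learned encoders (for instance a tokenizer embedding for text and a patch embedding for images) and stack many such attention layers.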