Video Transformer

An architectural approach that splits video into tokens across both space and time and models the relations among them with attention mechanisms.

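Video Transformer architectures go beyond CNN-based video modeling by learning long-range spatio-temporal relations through attention rather than through stacked local convolutions. This can be especially powerful for complex action sequences, long video context, and global scene interactions. However, the cost of self-attention grows quadratically with the number of tokens, and the token count grows with both spatial resolution and clip length, so computational cost and context-length management remain central challenges in this area.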
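
To make the tokenization and cost trade-off concrete, here is a minimal NumPy sketch. It is one possible illustration and does not reproduce any specific published model: the function names `tubelet_tokens` and `self_attention`, the tubelet size (2 frames by 16 x 16 pixels), the projection width `d=64`, and the random projection weights are all assumptions chosen for this example. A real model would learn the projections and add positional information.

```python
import numpy as np

def tubelet_tokens(video, t=2, p=16):
    """Split a video of shape (T, H, W, C) into flattened t x p x p tubelets.

    Hypothetical helper: t and p are illustrative sizes, not fixed by any standard.
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "sizes must divide evenly"
    x = video.reshape(T // t, t, H // p, p, W // p, p, C)
    # Bring the three grid axes to the front: (nT, nH, nW, t, p, p, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per tubelet, each flattened to a vector of length t * p * p * C.
    return x.reshape(-1, t * p * p * C)

def self_attention(tokens, d=64, seed=0):
    """Single-head scaled dot-product attention with random (untrained) projections."""
    rng = np.random.default_rng(seed)
    n, d_in = tokens.shape
    Wq, Wk, Wv = (rng.standard_normal((d_in, d)) / np.sqrt(d_in) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(d)                # shape (n, n): every token vs. every token
    scores -= scores.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

# A toy clip: 8 frames of 64 x 64 RGB.
video = np.random.default_rng(0).standard_normal((8, 64, 64, 3)).astype(np.float32)
tokens = tubelet_tokens(video)           # (8/2) * (64/16) * (64/16) = 64 tokens
out, weights = self_attention(tokens)    # weights is a 64 x 64 matrix
```

The attention matrix has one entry per pair of tokens, so doubling the clip length doubles the token count and quadruples the size of that matrix. This is the quadratic growth that makes long video context expensive.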