
Video Transformer

In One Line

A neural network architecture that splits video into tokens across space and time and models the relations among them with attention mechanisms.

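Video Transformer architectures go beyond CNN-based video modeling, where convolutions see only a local neighborhood at each layer, by learning long-range spatio-temporal relations through attention mechanisms. This can be especially powerful for complex action sequences, long video context, and global scene interactions. However, computational cost and context-length management remain central challenges in this area: the cost of self-attention grows quadratically with the number of tokens, and the number of tokens grows with both clip length and spatial resolution.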
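To make the cost concern concrete, the following is a minimal sketch that counts tokens for a clip and compares how many query-key pairs attention must score. It assumes non-overlapping "tubelet" tokens (a block of a few frames by a square patch) and compares joint space-time attention with one common workaround, factorized attention (spatial attention within each time step, then temporal attention at each spatial location). The function names, the clip size, and the tubelet size are illustrative assumptions, not values taken from this entry.

```python
def token_grid(frames, height, width, tubelet_frames, patch_size):
    """Return (temporal_tokens, spatial_tokens) for non-overlapping tubelets.

    Hypothetical helper: each token covers `tubelet_frames` frames and a
    `patch_size` x `patch_size` pixel patch.
    """
    if frames % tubelet_frames or height % patch_size or width % patch_size:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    n_t = frames // tubelet_frames
    n_s = (height // patch_size) * (width // patch_size)
    return n_t, n_s


def joint_attention_pairs(n_t, n_s):
    """Query-key pairs when every token attends to every other token."""
    return (n_t * n_s) ** 2


def factorized_attention_pairs(n_t, n_s):
    """Query-key pairs for spatial attention per time step plus
    temporal attention per spatial location."""
    spatial = n_t * n_s ** 2   # n_t time steps, each with n_s x n_s pairs
    temporal = n_s * n_t ** 2  # n_s locations, each with n_t x n_t pairs
    return spatial + temporal


if __name__ == "__main__":
    # Assumed example: 32 frames at 224x224, tubelets of 2 frames x 16x16 pixels.
    n_t, n_s = token_grid(32, 224, 224, tubelet_frames=2, patch_size=16)
    print("tokens:", n_t * n_s)                                     # 3136
    print("joint pairs:", joint_attention_pairs(n_t, n_s))          # 9834496
    print("factorized pairs:", factorized_attention_pairs(n_t, n_s))  # 664832
```

Under these assumptions, doubling the clip length doubles the token count but quadruples the joint pair count, which is why long video context is hard to handle with full attention. Factorizing reduces the count substantially, at the price of modeling space and time interactions in separate steps.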

You Might Also Like

Explore these concepts to continue your artificial intelligence journey.

Action Anticipation

A task that attempts to predict a future action from a partially observed video stream before it fully unfolds.

Action Recognition

A task focused on recognizing action classes from human or object motion in video.

Additive Attention

An early attention approach that scores each query against the context representations through a small learnable combination function rather than a dot product.