Vision Transformer Features

A visual representation produced by models that split an image into patch tokens and learn features through global self-attention.

Vision Transformer (ViT) features are among the most prominent examples of representation learning outside convolutional neural networks (CNNs). The image is split into fixed-size patches, and each patch is embedded and processed as a token, much like a word in a text Transformer. Because self-attention lets every patch attend to every other patch, the approach is especially strong at capturing global contextual relations. In recent years it has become a widely used, increasingly standard feature family for classification, segmentation, and multimodal systems.
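
The patch-and-attend idea can be illustrated with a small sketch. This is a toy under stated assumptions, not a real ViT: the function names `patchify` and `global_attention` are hypothetical, the image is a single-channel grid of floats, and the attention step uses identity projections with no learned weights, positional embeddings, or class token.

```python
import math


def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch tokens, row-major."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(tokens):
    """Single-head self-attention where queries, keys, and values are the
    tokens themselves (an assumption made to keep the sketch weight-free)."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Scaled dot product of this token against every token in the image.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted mix of all tokens: global context.
        outputs.append([sum(w[j] * tokens[j][i] for j in range(len(tokens)))
                        for i in range(d)])
    return outputs, weights


# Toy 4x4 image with 2x2 patches gives 4 tokens of dimension 4.
image = [[(r * 4 + c) / 16 for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
features, attn = global_attention(tokens)
```

Each row of `attn` spans all patches, which is what lets a token in one corner of the image draw on information from any other region in a single step. At realistic sizes the same arithmetic applies: a 224x224 image cut into 16x16 patches yields 14 x 14 = 196 tokens.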