Skip to content
Technical GlossaryComputer Vision

Visual Question Answering

A multimodal task that answers natural language questions about an image based on visual context.

Visual Question Answering is a powerful task that jointly tests visual perception and language understanding. The model must connect linguistic cues in the question to relevant regions of the image in order to produce the correct answer. It goes beyond simple object recognition by requiring relations, counting, color, location, and sometimes reasoning. It is a foundational component for multimodal assistants and interactive visual agents.