Topic · Advanced

Vision-Based UI Understanding

The model interprets a UI directly from screenshots, so it can operate without access to the DOM.

2 hours · 1 resource · 1 prerequisite

Browser-based agents can query the DOM (the page's HTML) directly. Desktop apps, mobile apps, and other native software have no DOM, so there the agent has to work from vision.

Models:

  • Claude 3.5 Sonnet / Claude Sonnet 4: strong GUI element detection and click-coordinate prediction
  • GPT-4o: comparable capability
  • OmniParser (Microsoft): fine-tuned for UI element segmentation; its detected elements are passed to an agent's LLM

Pattern: screenshot → extract element bounding boxes (bboxes) with OmniParser → LLM says "click bbox 5" → map bbox 5 to screen coordinates and click its center.
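
The last two steps of the pattern can be sketched as below. This is a minimal illustration, not OmniParser's or any model's actual API: the `Element` type, the function names, and the exact reply format `click bbox N` are hypothetical, and the boxes are assumed to be normalized `(x1, y1, x2, y2)` values in the 0..1 range. The actual click would be issued by whatever input-automation layer the agent uses.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One detected UI element (hypothetical shape, not a real parser's output type)."""
    index: int                                # id the LLM refers to, e.g. 5
    label: str                                # short description of the element
    bbox: tuple[float, float, float, float]   # assumed normalized (x1, y1, x2, y2)


def parse_click_target(reply: str) -> int:
    """Extract the element index from an LLM reply such as 'click bbox 5'."""
    match = re.search(r"click\s+bbox\s+(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no click target found in reply: {reply!r}")
    return int(match.group(1))


def click_point(element: Element, width: int, height: int) -> tuple[int, int]:
    """Return the pixel center of the element's box for a width x height screenshot."""
    x1, y1, x2, y2 = element.bbox
    return round((x1 + x2) / 2 * width), round((y1 + y2) / 2 * height)


def resolve_click(reply: str, elements: list[Element],
                  width: int, height: int) -> tuple[int, int]:
    """Map the LLM's chosen bbox index to the pixel coordinates to click."""
    by_index = {element.index: element for element in elements}
    target = parse_click_target(reply)
    if target not in by_index:
        # The model named a box that the parser never produced.
        raise KeyError(f"bbox {target} is not among the detected elements")
    return click_point(by_index[target], width, height)
```

For example, with an element `Element(5, "Submit", (0.25, 0.5, 0.75, 0.75))` on a 1920×1080 screenshot, `resolve_click("click bbox 5", elements, 1920, 1080)` returns `(960, 675)`. Having the LLM choose an index rather than emit raw coordinates keeps the click anchored to a box the parser actually detected, and an unknown index fails loudly instead of clicking an arbitrary point.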

