Topic · Advanced

Vision-Based UI Understanding

The model interprets the UI directly from screenshots, so it can work without DOM access.

2 hours · 1 resource · 1 prerequisite

Browser-based agents can query the DOM (the page's HTML structure). Desktop apps, mobile apps, and other native software have no DOM, so the agent has to rely on vision.

Models:

Claude 3.5/4 Sonnet: strong GUI element detection and click-coordinate prediction
GPT-4o: similar capability
OmniParser (Microsoft): fine-tuned for UI element segmentation; its output feeds downstream agents

Pattern: take a screenshot → extract element bounding boxes (bboxes) with OmniParser → the LLM picks one by index, e.g. "click bbox 5" → convert that bbox to screen coordinates and click.
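The last two steps of the pattern (reading the LLM's choice and turning a bbox into click coordinates) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not OmniParser's actual output format or API: the `BBox` class, the normalized [0, 1] coordinates, the zero-based indexing, and the helper names `parse_action` and `resolve_click` are all hypothetical, and the detector output is hard-coded for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """A detected UI element. Coordinates are assumed normalized to [0, 1]."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center_px(self, width: int, height: int) -> tuple[int, int]:
        """Map the box center to pixel coordinates on a width x height screenshot."""
        cx = (self.x_min + self.x_max) / 2 * width
        cy = (self.y_min + self.y_max) / 2 * height
        return round(cx), round(cy)


def parse_action(reply: str, num_boxes: int) -> int:
    """Extract the bbox index from a reply such as 'click bbox 5'."""
    match = re.search(r"click\s+bbox\s+(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no click action found in reply: {reply!r}")
    index = int(match.group(1))
    # Reject indices the detector never produced (zero-based, by assumption).
    if not 0 <= index < num_boxes:
        raise ValueError(f"bbox index {index} is out of range")
    return index


def resolve_click(reply: str, boxes: list[BBox], width: int, height: int) -> tuple[int, int]:
    """Turn the LLM's reply into the pixel coordinates to click."""
    index = parse_action(reply, len(boxes))
    return boxes[index].center_px(width, height)


# Hypothetical detector output for a 1920x1080 screenshot.
boxes = [
    BBox("File menu", 0.0, 0.0, 0.125, 0.0625),
    BBox("Save button", 0.25, 0.5, 0.75, 0.75),
]

x, y = resolve_click("click bbox 1", boxes, width=1920, height=1080)
print(x, y)
```

Clicking the box center is a design choice rather than a requirement; it keeps the click away from element edges. The resulting coordinates could then be handed to an input-automation library, for example `pyautogui.click(x, y)`, provided the screenshot resolution matches the screen's coordinate space.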

Prerequisites