Topic · Advanced
Vision-Based UI Understanding
The model interprets the UI directly from screenshots, so it can work without access to the DOM.
2 hours · 1 resource · 1 prereq
Browser-based agents can query the DOM (the page's HTML structure) to find elements. Desktop apps, mobile apps, and other native software have no DOM, so the agent has to work from what is visible on screen.
Models:
- Claude 3.5/4 Sonnet: strong GUI element detection and click-coordinate prediction
- GPT-4o: similar capability
- OmniParser (Microsoft): fine-tuned for UI element detection; its output feeds agents
Pattern: take a screenshot → extract element bounding boxes (bboxes) with OmniParser → the LLM picks one by number, e.g. "click bbox 5" → convert that bbox to click coordinates.
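The last two steps of this pattern can be sketched in a few lines of Python. This is a minimal illustration, not OmniParser's or any model's real API: the `BBox` type, the `parse_click_index` and `resolve_click` helpers, the zero-based numbering, and the exact reply format "click bbox N" are all assumptions made for the example. The detector and the LLM call are left out; the sketch only covers turning the model's reply into pixel coordinates.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box in screenshot pixels (hypothetical detector output)."""
    x1: int  # left
    y1: int  # top
    x2: int  # right
    y2: int  # bottom

    def center(self) -> tuple[int, int]:
        # Clicking the center is a common, simple choice of target point.
        return ((self.x1 + self.x2) // 2, (self.y1 + self.y2) // 2)


def parse_click_index(reply: str) -> int:
    """Extract N from a reply such as 'click bbox 5' (assumed reply format)."""
    match = re.search(r"click\s+bbox\s+(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no click action found in reply: {reply!r}")
    return int(match.group(1))


def resolve_click(reply: str, boxes: list[BBox]) -> tuple[int, int]:
    """Map the model's reply to (x, y) pixel coordinates in the screenshot."""
    index = parse_click_index(reply)
    # Assumption: boxes are numbered from 0 in the order they were shown
    # to the model. Reject indices the detector never produced.
    if not 0 <= index < len(boxes):
        raise IndexError(f"model chose bbox {index}, but only {len(boxes)} exist")
    return boxes[index].center()
```

The returned `(x, y)` pair is in screenshot pixels. If the screenshot was resized before being sent to the model, or the display uses a scaling factor, the coordinates need to be scaled back before they are passed to whatever performs the actual click.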