Tested Claude Opus 4.5 on btop and GitHub screenshots. Findings: excellent text, state, and layout understanding; only approximate coordinates. Recommendation: a hybrid AT-SPI + vision approach.
# Vision Model UI Understanding Benchmark

- Date: 2025-12-29
- Model: Claude Opus 4.5 (claude-opus-4-5-20251101)
- Issue: skills-ebl
## Test Cases

### Test 1: btop System Monitor (Terminal UI)
Screenshot: /tmp/ui-test-btop.png
| Query Type | Query | Result | Accuracy |
|---|---|---|---|
| Element Location | Where is CPU usage? | Top-right quadrant, cores C0-C19 | ✓ Correct |
| Element ID | What sections visible? | CPU, Memory, Disks, Process, Network | ✓ Complete |
| State Detection | Battery status? | BAT▲ 77% charging, 11.89W | ✓ Correct |
| Text Extraction | System uptime? | 8d 20:30 | ✓ Exact |
| Layout | Describe layout | Header + 2x2 grid + process tree | ✓ Accurate |
### Test 2: GitHub Homepage (Web UI)
Screenshot: /tmp/ui-test-github.png
| Query Type | Query | Result | Accuracy |
|---|---|---|---|
| Element Location | Where is Sign up? | Top-right, ~x:1208 y:36 | ✓ Approximate |
| Element ID | What buttons visible? | 5 buttons identified | ✓ Complete |
| State Detection | Email field filled? | No, placeholder showing | ✓ Correct |
| Text Extraction | Main headline? | "The future of building..." | ✓ Exact |
| Layout | Describe navigation | Logo + 6 nav items + auth | ✓ Accurate |
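
Each row above is a single-question prompt against a single screenshot. Below is a minimal sketch of how such a query can be issued, assuming the Anthropic Python SDK with an API key in the environment; the helper name `ask_about_screenshot` is illustrative, not the actual test harness.

```python
import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask_about_screenshot(path: str, question: str) -> str:
    """Send one PNG screenshot and one UI question to the model."""
    with open(path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
    message = client.messages.create(
        model="claude-opus-4-5-20251101",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": image_b64}},
                {"type": "text", "text": question},
            ],
        }],
    )
    return message.content[0].text

# Example query from Test 2:
print(ask_about_screenshot("/tmp/ui-test-github.png", "Where is the Sign up button?"))
```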
## Findings

### Strengths
- Text extraction: Near-perfect accuracy on readable text
- Element identification: Can enumerate UI components reliably
- State detection: Understands filled/empty, checked/unchecked, enabled/disabled
- Layout understanding: Accurately describes spatial relationships
- Complex UIs: Handles busy interfaces like btop well
### Limitations
- Coordinate precision: Can give approximate regions but not pixel-accurate coordinates
- Bounding boxes: Cannot provide exact element boundaries without dedicated prompting
- Small elements: May miss very small icons or indicators
- Overlapping elements: Could struggle with layered UI
## Prompt Engineering Insights
What works:
- Direct questions about specific elements
- Asking for enumeration ("list all buttons")
- State queries ("is X checked/filled/enabled?")
- Layout descriptions ("what's in the sidebar?")
What needs refinement:
- Getting precise coordinates requires specific prompting
- Bounding box extraction is not native; it would need a grid overlay (see the sketch after this list)
- Click targets need "where would you click to X?" framing
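
A minimal sketch of the grid-overlay workaround mentioned above, assuming Pillow is installed: labeled grid lines are drawn on the screenshot so the model can answer in terms of grid coordinates instead of guessing pixel positions. The 100 px cell size and the output path are arbitrary.

```python
from PIL import Image, ImageDraw

def add_grid(src: str, dst: str, step: int = 100) -> None:
    """Overlay a labeled coordinate grid so the model can reference grid positions."""
    img = Image.open(src).convert("RGB")
    draw = ImageDraw.Draw(img)
    for x in range(0, img.width, step):    # vertical lines, labeled with x along the top
        draw.line([(x, 0), (x, img.height)], fill=(255, 0, 0), width=1)
        draw.text((x + 2, 2), str(x), fill=(255, 0, 0))
    for y in range(0, img.height, step):   # horizontal lines, labeled with y along the left
        draw.line([(0, y), (img.width, y)], fill=(255, 0, 0), width=1)
        draw.text((2, y + 2), str(y), fill=(255, 0, 0))
    img.save(dst)

add_grid("/tmp/ui-test-github.png", "/tmp/ui-test-github-grid.png")
```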
## Recommendations for Desktop Automation

- Hybrid approach: Use AT-SPI for precise coordinates and vision for semantic understanding (see the AT-SPI sketch after this list)
- Verification: Vision can verify AT-SPI found the right element
- Fallback: Vision can work when AT-SPI support is poor (Electron apps)
- Planning: Vision excels at high-level task planning ("how would I save this file?")
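
For the coordinate half of the hybrid approach, here is a minimal AT-SPI lookup sketch, assuming pyatspi on a Linux desktop with accessibility enabled; the helper name, the role/name matching, and the lack of error handling for stale accessibles are all illustrative simplifications.

```python
import pyatspi

def find_extents(app_name: str, role, name: str):
    """Return (x, y, width, height) in desktop coordinates for the first match, or None."""
    desktop = pyatspi.Registry.getDesktop(0)
    for app in desktop:
        if app is None or app.name != app_name:
            continue
        matches = pyatspi.findAllDescendants(
            app, lambda acc: acc.getRole() == role and acc.name == name)
        for acc in matches:
            ext = acc.queryComponent().getExtents(pyatspi.DESKTOP_COORDS)
            return (ext.x, ext.y, ext.width, ext.height)
    return None

# e.g. a push button labeled "Sign up" in Firefox (names vary by app and version):
print(find_extents("Firefox", pyatspi.ROLE_PUSH_BUTTON, "Sign up"))
```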
## Comparison with AT-SPI
| Capability | Vision Model | AT-SPI |
|---|---|---|
| Text extraction | ✓ Excellent | ✓ Excellent |
| Element enumeration | ✓ Good | ✓ Excellent |
| Coordinates | ~ Approximate | ✓ Exact |
| State detection | ✓ Good | ✓ Exact |
| Semantic understanding | ✓ Excellent | ✗ Limited |
| Works on Electron | ✓ Yes | ~ Varies |
| Latency | ~2-5s | ~50ms |
| Cost | API tokens | Free |
## Conclusion
Vision models are highly capable for UI understanding tasks. Best used for:
- Task planning and high-level navigation
- Verifying correct element selection
- Handling apps with poor accessibility support
- Semantic queries ("find the settings button")
AT-SPI preferred for:
- Precise click coordinates
- Fast enumeration of all elements
- Programmatic automation
- Cost-sensitive batch operations
Recommendation: Implement a hybrid approach: AT-SPI for mechanics, vision for semantics.
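
A minimal sketch of that routing, reusing the illustrative `find_extents` and `ask_about_screenshot` helpers sketched earlier; the return shape is an assumption, not part of the benchmark.

```python
def locate(app_name: str, role, label: str, screenshot: str):
    """AT-SPI first (exact, fast, free); vision fallback (approximate, slower)."""
    extents = find_extents(app_name, role, label)
    if extents is not None:
        return {"source": "at-spi", "extents": extents}
    # Fallback for apps with poor accessibility support (e.g. some Electron apps).
    answer = ask_about_screenshot(
        screenshot,
        f"Where would you click to activate '{label}'? "
        "Give an approximate x,y position.")
    return {"source": "vision", "answer": answer}
```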