Vision Model UI Understanding Benchmark

Date: 2025-12-29
Model: Claude Opus 4.5 (claude-opus-4-5-20251101)
Issue: skills-ebl

Test Cases

Test 1: btop System Monitor (Terminal UI)

Screenshot: /tmp/ui-test-btop.png

| Query Type | Query | Result | Accuracy |
|---|---|---|---|
| Element Location | Where is CPU usage? | Top-right quadrant, cores C0-C19 | ✓ Correct |
| Element ID | What sections visible? | CPU, Memory, Disks, Process, Network | ✓ Complete |
| State Detection | Battery status? | BAT▲ 77% charging, 11.89W | ✓ Correct |
| Text Extraction | System uptime? | 8d 20:30 | ✓ Exact |
| Layout | Describe layout | Header + 2x2 grid + process tree | ✓ Accurate |

Test 2: GitHub Homepage (Web UI)

Screenshot: /tmp/ui-test-github.png

| Query Type | Query | Result | Accuracy |
|---|---|---|---|
| Element Location | Where is Sign up? | Top-right, ~x:1208 y:36 | ✓ Approximate |
| Element ID | What buttons visible? | 5 buttons identified | ✓ Complete |
| State Detection | Email field filled? | No, placeholder showing | ✓ Correct |
| Text Extraction | Main headline? | "The future of building..." | ✓ Exact |
| Layout | Describe navigation | Logo + 6 nav items + auth | ✓ Accurate |
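
Each query above was a single prompt with the screenshot attached. Below is a minimal sketch of how such a query could be issued through the Anthropic Python SDK; the model ID and screenshot paths come from this document, but the `ask_about_screenshot` helper and the surrounding harness are illustrative assumptions, not the actual benchmark script.

```python
# Minimal sketch: send a screenshot plus a UI question to Claude.
# Assumes the `anthropic` Python SDK is installed and ANTHROPIC_API_KEY is set;
# this is an illustrative reconstruction, not the exact benchmark harness.
import base64
import anthropic

def ask_about_screenshot(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-opus-4-5-20251101",  # model ID from the header above
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": image_b64}},
                {"type": "text", "text": question},
            ],
        }],
    )
    return response.content[0].text

# Example query from Test 1:
# print(ask_about_screenshot("/tmp/ui-test-btop.png", "Where is CPU usage?"))
```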

Findings

Strengths

  1. Text extraction: Near-perfect accuracy on readable text
  2. Element identification: Can enumerate UI components reliably
  3. State detection: Understands filled/empty, checked/unchecked, enabled/disabled
  4. Layout understanding: Accurately describes spatial relationships
  5. Complex UIs: Handles busy interfaces like btop well

Limitations

  1. Coordinate precision: Can give approximate regions but not pixel-accurate coordinates
  2. Bounding boxes: Cannot provide exact element boundaries without prompting
  3. Small elements: May miss very small icons or indicators
  4. Overlapping elements: Could struggle with layered UI

Prompt Engineering Insights

What works:

  • Direct questions about specific elements
  • Asking for enumeration ("list all buttons")
  • State queries ("is X checked/filled/enabled?")
  • Layout descriptions ("what's in the sidebar?")

What needs refinement:

  • Getting precise coordinates requires specific prompting
  • Bounding box extraction is not native; it would need a grid overlay (see the sketch after this list)
  • Click targets need "where would you click to X?" framing
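
One way to prototype the grid-overlay idea is to stamp a labelled coordinate grid onto the screenshot before sending it, so the model can answer in grid cells instead of raw pixels. A minimal sketch using Pillow, assuming it is installed; the `add_grid_overlay` helper, the 100 px spacing, and the labelling scheme are illustrative choices, not something this benchmark validated.

```python
# Sketch: overlay a labelled grid on a screenshot so a vision model can
# reference cells ("the button is near C4") instead of raw pixel coordinates.
# Assumes Pillow is installed; grid spacing and labels are arbitrary choices.
from PIL import Image, ImageDraw

def add_grid_overlay(in_path: str, out_path: str, step: int = 100) -> None:
    img = Image.open(in_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    width, height = img.size

    # Vertical lines labelled A, B, C, ... along the top edge.
    for i, x in enumerate(range(0, width, step)):
        draw.line([(x, 0), (x, height)], fill=(255, 0, 0), width=1)
        draw.text((x + 2, 2), chr(ord("A") + i % 26), fill=(255, 0, 0))

    # Horizontal lines labelled 0, 1, 2, ... along the left edge.
    for j, y in enumerate(range(0, height, step)):
        draw.line([(0, y), (width, y)], fill=(255, 0, 0), width=1)
        draw.text((2, y + 2), str(j), fill=(255, 0, 0))

    img.save(out_path)

# add_grid_overlay("/tmp/ui-test-github.png", "/tmp/ui-test-github-grid.png")
```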

Recommendations for Desktop Automation

  1. Hybrid approach: Use AT-SPI for precise coordinates, vision for semantic understanding (see the sketch after this list)
  2. Verification: Vision can verify AT-SPI found the right element
  3. Fallback: Vision can work when AT-SPI support is poor (Electron apps)
  4. Planning: Vision excels at high-level task planning ("how would I save this file?")
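
A minimal sketch of the hybrid pattern on a Linux desktop, assuming the pyatspi bindings are available: AT-SPI walks the accessibility tree for an exact click rectangle, and the vision model is only asked to confirm the selection. The `find_button_by_name` helper and the verification prompt are illustrative assumptions, and `ask_about_screenshot` refers to the hypothetical helper sketched under Test Cases.

```python
# Sketch of the hybrid pattern: AT-SPI for exact geometry, vision for a
# semantic sanity check. Assumes pyatspi is installed (Linux, AT-SPI2) and
# reuses the hypothetical ask_about_screenshot() helper sketched earlier.
import pyatspi

def find_button_by_name(name: str):
    """Return the first push button whose accessible name contains `name`,
    plus its desktop-relative extents, or (None, None) if not found."""
    desktop = pyatspi.Registry.getDesktop(0)
    for app in desktop:
        if app is None:
            continue  # dead applications can appear as None entries
        try:
            matches = pyatspi.findAllDescendants(
                app, lambda a: a.getRole() == pyatspi.ROLE_PUSH_BUTTON)
        except Exception:
            continue  # skip applications that do not respond over AT-SPI
        for acc in matches:
            if name.lower() in (acc.name or "").lower():
                ext = acc.queryComponent().getExtents(pyatspi.DESKTOP_COORDS)
                return acc, (ext.x, ext.y, ext.width, ext.height)
    return None, None

# Hybrid flow: AT-SPI supplies the click target, vision verifies it semantically.
# button, rect = find_button_by_name("Sign up")
# if rect is not None:
#     answer = ask_about_screenshot(
#         "/tmp/ui-test-github.png",
#         f"Is the 'Sign up' button located near x={rect[0]}, y={rect[1]}? "
#         "Answer yes or no.")
```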

Comparison with AT-SPI

| Capability | Vision Model | AT-SPI |
|---|---|---|
| Text extraction | ✓ Excellent | ✓ Excellent |
| Element enumeration | ✓ Good | ✓ Excellent |
| Coordinates | ~ Approximate | ✓ Exact |
| State detection | ✓ Good | ✓ Exact |
| Semantic understanding | ✓ Excellent | ✗ Limited |
| Works on Electron | ✓ Yes | ~ Varies |
| Latency | ~2-5s | ~50ms |
| Cost | API tokens | Free |

Conclusion

Vision models are highly capable for UI understanding tasks. Best used for:

  • Task planning and high-level navigation
  • Verifying correct element selection
  • Handling apps with poor accessibility support
  • Semantic queries ("find the settings button")

AT-SPI preferred for:

  • Precise click coordinates
  • Fast enumeration of all elements
  • Programmatic automation
  • Cost-sensitive batch operations

Recommendation: Implement a hybrid approach, using AT-SPI for mechanics and vision for semantics.