# Vision Model UI Understanding Benchmark

Date: 2025-12-29
Model: Claude Opus 4.5 (claude-opus-4-5-20251101)
Issue: skills-ebl

## Test Cases

### Test 1: btop System Monitor (Terminal UI)

Screenshot: /tmp/ui-test-btop.png

| Query Type | Query | Result | Accuracy |
|------------|-------|--------|----------|
| Element Location | Where is CPU usage? | Top-right quadrant, cores C0-C19 | ✓ Correct |
| Element ID | What sections visible? | CPU, Memory, Disks, Process, Network | ✓ Complete |
| State Detection | Battery status? | BAT▲ 77% charging, 11.89W | ✓ Correct |
| Text Extraction | System uptime? | 8d 20:30 | ✓ Exact |
| Layout | Describe layout | Header + 2x2 grid + process tree | ✓ Accurate |

### Test 2: GitHub Homepage (Web UI)

Screenshot: /tmp/ui-test-github.png

| Query Type | Query | Result | Accuracy |
|------------|-------|--------|----------|
| Element Location | Where is Sign up? | Top-right, ~x:1208 y:36 | ✓ Approximate |
| Element ID | What buttons visible? | 5 buttons identified | ✓ Complete |
| State Detection | Email field filled? | No, placeholder showing | ✓ Correct |
| Text Extraction | Main headline? | "The future of building..." | ✓ Exact |
| Layout | Describe navigation | Logo + 6 nav items + auth | ✓ Accurate |

## Findings

### Strengths

1. **Text extraction**: Near-perfect accuracy on readable text
2. **Element identification**: Can enumerate UI components reliably
3. **State detection**: Understands filled/empty, checked/unchecked, enabled/disabled
4. **Layout understanding**: Accurately describes spatial relationships
5. **Complex UIs**: Handles busy interfaces like btop well

### Limitations

1. **Coordinate precision**: Can give approximate regions but not pixel-accurate coordinates
2. **Bounding boxes**: Cannot provide exact element boundaries without prompting
3. **Small elements**: May miss very small icons or indicators
4. **Overlapping elements**: Could struggle with layered UI

### Prompt Engineering Insights

**What works:**

- Direct questions about specific elements
- Asking for enumeration ("list all buttons")
- State queries ("is X checked/filled/enabled?")
- Layout descriptions ("what's in the sidebar?")

**What needs refinement:**

- Getting precise coordinates requires specific prompting
- Bounding box extraction not native - would need grid overlay (see the sketch below)
- Click targets need "where would you click to X?" framing
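As a hedged illustration of the grid-overlay idea, the sketch below draws a labelled grid on a screenshot with Pillow so the model can name a cell ("B3") instead of guessing raw pixels. The 100 px cell size, the colours, and the `add_grid_overlay`/`cell_center` names are assumptions for illustration, not part of any existing tooling.

```python
# Sketch only: overlay a labelled grid on a screenshot so the vision model
# can answer "which cell contains the Sign up button?" instead of guessing
# raw pixel coordinates. Cell size and colours are arbitrary assumptions.
from PIL import Image, ImageDraw


def add_grid_overlay(path: str, cell: int = 100) -> Image.Image:
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    # Grid lines every `cell` pixels in both directions.
    for x in range(0, img.width, cell):
        draw.line([(x, 0), (x, img.height)], fill=(255, 0, 0), width=1)
    for y in range(0, img.height, cell):
        draw.line([(0, y), (img.width, y)], fill=(255, 0, 0), width=1)
    # Label each cell "A0", "B3", ... so the model can name a region.
    for col, x in enumerate(range(0, img.width, cell)):
        for row, y in enumerate(range(0, img.height, cell)):
            draw.text((x + 2, y + 2), f"{chr(65 + col % 26)}{row}", fill=(255, 0, 0))
    return img


def cell_center(col: int, row: int, cell: int = 100) -> tuple[int, int]:
    """Map a cell named by the model back to a clickable pixel coordinate."""
    return col * cell + cell // 2, row * cell + cell // 2
```

The annotated image is sent to the model with a prompt such as "which grid cell contains the Sign up button?", and the named cell is converted back to a coordinate with `cell_center`.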
### Recommendations for Desktop Automation

1. **Hybrid approach**: Use AT-SPI for precise coordinates, vision for semantic understanding
2. **Verification**: Vision can verify AT-SPI found the right element
3. **Fallback**: Vision can work when AT-SPI support is poor (Electron apps)
4. **Planning**: Vision excels at high-level task planning ("how would I save this file?")

### Comparison with AT-SPI

| Capability | Vision Model | AT-SPI |
|------------|--------------|--------|
| Text extraction | ✓ Excellent | ✓ Excellent |
| Element enumeration | ✓ Good | ✓ Excellent |
| Coordinates | ~ Approximate | ✓ Exact |
| State detection | ✓ Good | ✓ Exact |
| Semantic understanding | ✓ Excellent | ✗ Limited |
| Works on Electron | ✓ Yes | ~ Varies |
| Latency | ~2-5s | ~50ms |
| Cost | API tokens | Free |

## Conclusion

Vision models are highly capable for UI understanding tasks.

Best used for:

- Task planning and high-level navigation
- Verifying correct element selection
- Handling apps with poor accessibility support
- Semantic queries ("find the settings button")

AT-SPI preferred for:

- Precise click coordinates
- Fast enumeration of all elements
- Programmatic automation
- Cost-sensitive batch operations

**Recommendation**: Implement hybrid approach - AT-SPI for mechanics, vision for semantics (a minimal sketch follows).
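A minimal sketch of that hybrid split, assuming a Linux session with pyatspi and the anthropic Python SDK installed. The function names (`find_button`, `element_center`, `ask_vision_model`, `verified_click_target`) and the yes/no verification prompt are illustrative assumptions, not an existing API.

```python
# Hedged sketch: AT-SPI supplies exact geometry, the vision model verifies
# that the element AT-SPI found is the one the user actually meant.
import base64

import anthropic
import pyatspi


def find_button(app_name: str, label: str):
    """Walk the AT-SPI tree of `app_name` for a push button named `label`."""
    desktop = pyatspi.Registry.getDesktop(0)
    for app in desktop:
        if app is not None and app.name == app_name:
            return pyatspi.findDescendant(
                app,
                lambda acc: acc.getRole() == pyatspi.ROLE_PUSH_BUTTON
                and acc.name == label,
            )
    return None


def element_center(accessible) -> tuple[int, int]:
    """AT-SPI gives exact pixel extents; vision only gives approximate regions."""
    ext = accessible.queryComponent().getExtents(pyatspi.DESKTOP_COORDS)
    return ext.x + ext.width // 2, ext.y + ext.height // 2


def ask_vision_model(screenshot_path: str, question: str) -> str:
    """Send a screenshot plus a question to the model used in this benchmark."""
    with open(screenshot_path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode()
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-opus-4-5-20251101",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": data}},
                {"type": "text", "text": question},
            ],
        }],
    )
    return resp.content[0].text


def verified_click_target(app_name: str, label: str, screenshot_path: str):
    """AT-SPI for mechanics (coordinates), vision for semantics (verification)."""
    target = find_button(app_name, label)
    if target is None:
        return None  # a pure-vision fallback would go here when AT-SPI coverage is poor
    x, y = element_center(target)
    answer = ask_vision_model(
        screenshot_path,
        f"Is there a button labelled '{label}' near pixel ({x}, {y})? Answer yes or no.",
    )
    return (x, y) if answer.strip().lower().startswith("yes") else None
```

This keeps the fast, free AT-SPI path on the critical loop and spends vision-model latency and tokens only on the verification step, which matches the latency and cost rows in the comparison table above.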