skills/docs/worklogs/2025-12-17-wayland-desktop-automation-seeing-problem.org


Wayland Desktop Automation: The Seeing Problem

Session Summary

Date: 2025-12-17

Focus Area: Exploring desktop automation solutions for Wayland/niri compositor

Accomplishments

  • Fixed orch skill documentation - updated to use globally installed CLI instead of cd ~/proj/orch && uv run
  • Closed skills-d87 (orch skill documentation-only bug)
  • Created epic skills-kg7 for "Desktop automation for Wayland/niri"
  • Framed "the seeing problem" - how AI agents understand screen content
  • Investigated AT-SPI accessibility framework on NixOS
  • Discovered AT-SPI is disabled (NO_AT_BRIDGE=1, GTK_A11Y=none)
  • Created task skills-pdg documenting AT-SPI enablement requirements
  • Created benchmark tasks for both vision and AT-SPI paths
  • Conducted first vision model benchmark experiment with ChatGPT screenshot
  • Identified "the acting problem" - discovered no mouse input tool installed
  • Verified ydotool would work (user already in uinput group)
  • Noted follow-up: a ydotool installation task has not yet been created

Key Decisions

Decision 1: Hybrid approach for UI understanding

  • Context: Need to understand what's on screen (element types, coordinates) for automation
  • Options considered:

    1. AT-SPI only - structured data, precise coords, but requires system config + app compliance
    2. Vision model only - universal coverage, but coordinate precision unknown
    3. Hybrid - benchmark both, use appropriately
  • Rationale: Both paths have tradeoffs; measuring both allows informed decisions per use case (see the dispatch sketch after this list)
  • Impact: Created parallel benchmark tasks, AT-SPI becomes opt-in system config
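To make the hybrid concrete, here is a minimal dispatch sketch. locate_via_atspi and locate_via_vision are hypothetical helpers (no such APIs exist yet); the fallback order reflects the tradeoffs above:

from typing import Optional, Tuple

def locate_via_atspi(label: str) -> Optional[Tuple[int, int]]:
    """Hypothetical: query the AT-SPI tree by accessible name.
    Returns screen coords, or None when the bus is down or the
    app exposes no tree."""
    ...

def locate_via_vision(label: str, screenshot: str) -> Optional[Tuple[int, int]]:
    """Hypothetical: ask a vision model to predict coords from a
    screenshot. Precision is exactly what the benchmark measures."""
    ...

def locate(label: str, screenshot: str) -> Optional[Tuple[int, int]]:
    # Prefer AT-SPI: structured data and precise coordinates.
    coords = locate_via_atspi(label)
    if coords is not None:
        return coords
    # Fall back to vision: universal coverage, unknown precision.
    return locate_via_vision(label, screenshot)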

Decision 2: Boot-time option for AT-SPI

  • Context: AT-SPI adds runtime overhead to all GTK/Qt apps
  • Options considered:

    1. Always-on accessibility
    2. Boot-time toggle
    3. Session-level toggle
  • Rationale: Overhead only worth paying when automation is needed
  • Impact: Documented in skills-pdg as future implementation direction

Decision 3: Frame as "the seeing problem"

  • Context: Needed clear terminology for the capability gap
  • Rationale: "Seeing" captures the essence - understanding what's visible on screen
  • Impact: Epic and tasks now organized around this framing; led to discovering complementary "acting problem"

Problems & Solutions

| Problem                                                             | Solution                                                              | Learning                                               |
|---------------------------------------------------------------------|-----------------------------------------------------------------------|--------------------------------------------------------|
| AT-SPI test crashed with "AT-SPI: Couldn't connect to accessibility bus" | AT-SPI needs to be running at session start; can't spin up ad hoc    | AT-SPI is session-level, not on-demand                 |
| Started fresh AT-SPI bus, showed 0 apps                             | Existing apps connected to D-Bus at their startup; they don't see the new bus | Apps must start AFTER AT-SPI is enabled        |
| bd sync failed with worktree error                                  | Manually committed beads files with git add/commit                    | bd sync has edge cases; manual fallback works          |
| Couldn't test vision model coordinate predictions                   | No mouse click tool installed (ydotool missing)                       | "Seeing" is only half the problem; "acting" matters too |

Technical Details

Code Changes

  • Total files modified: 4
  • Key files changed:

    • skills/orch/SKILL.md - Removed cd ~/proj/orch && uv run prefix from all examples
    • skills/orch/README.md - Updated prerequisites and examples
  • Beads changes:

    • .beads/issues.jsonl - Added skills-kg7 (epic), skills-pdg, skills-ebl, skills-bww

Commands Used

# Check AT-SPI status
gsettings get org.gnome.desktop.interface toolkit-accessibility
# Result: gsettings not available

# Check if AT-SPI bus running
pgrep -a at-spi
dbus-send --session --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep -i access
# Result: Nothing running

# Check environment variables
echo "NO_AT_BRIDGE=$NO_AT_BRIDGE"  # = 1
echo "GTK_A11Y=$GTK_A11Y"          # = none

# Test AT-SPI in nix-shell (failed - bus not connected)
nix-shell -p at-spi2-core python312Packages.pyatspi python312Packages.pygobject3 --run 'python3 /tmp/test_atspi.py'

# Start AT-SPI bus manually (worked, but 0 apps)
/nix/store/.../at-spi-bus-launcher &
# Output: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
# Desktop has 0 accessible apps

# Check uinput access for ydotool
ls -la /dev/uinput
# crw-rw---- 1 root uinput 10, 223 Dec 16 11:52 /dev/uinput
groups | grep -wE 'input|uinput'
# User in both input and uinput groups - ydotool would work

# Capture window for vision benchmark
niri msg action screenshot-window --id 10 --write-to-disk true

# Check available input tools
which ydotool  # not installed
which wtype    # /etc/profiles/per-user/dan/bin/wtype (keyboard only)

Architecture Notes

The Seeing Problem - Layered Model

┌─────────────────────────────────────────────────────────────┐
│                     "Seeing" Layers                         │
├─────────────────────────────────────────────────────────────┤
│ 4. Understanding: "What can I click? Where's the submit?"   │
│    → Vision model + prompt engineering                      │
│    → AT-SPI UI tree (if enabled)                           │
├─────────────────────────────────────────────────────────────┤
│ 3. Pixel capture: Screenshots                               │
│    → niri screenshot-window ✅                              │
├─────────────────────────────────────────────────────────────┤
│ 2. Window geometry: Size, position, monitor                 │
│    → niri windows/focused-window ✅                         │
├─────────────────────────────────────────────────────────────┤
│ 1. Window awareness: What's open, app_id, title             │
│    → niri windows ✅                                        │
└─────────────────────────────────────────────────────────────┘
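A minimal sketch gluing layers 1-3 together from Python. It assumes niri msg accepts a --json flag (true for current niri, but worth verifying) and reuses the screenshot action recorded under Commands Used:

import json
import subprocess

def niri_json(*args: str):
    # Layers 1-2: window awareness and geometry via compositor IPC.
    out = subprocess.run(["niri", "msg", "--json", *args],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

win = niri_json("focused-window")
width, height = win["layout"]["window_size"]
print(f"{win['app_id']} ('{win['title']}'): {width}x{height}")

# Layer 3: pixel capture of that window (same action as recorded above).
subprocess.run(["niri", "msg", "action", "screenshot-window",
                "--id", str(win["id"]), "--write-to-disk", "true"], check=True)

Layer 4 (understanding) then consumes either the screenshot or, once enabled, the AT-SPI tree.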

Wayland Security Model Impact

  • X11 allowed any app to read keystrokes, inject input, and screenshot anything
  • Wayland blocks all of this by design (security feature)
  • Workarounds require:

    • Compositor-specific IPC (niri msg)
    • Kernel bypass for input (ydotool → uinput)
    • App opt-in for UI tree (AT-SPI)
    • Portal permissions for screenshots

NixOS AT-SPI Configuration

# To enable AT-SPI
services.gnome.at-spi2-core.enable = true;
# For Qt apps also need:
# environment.variables.QT_LINUX_ACCESSIBILITY_ALWAYS_ON = "1";
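After rebuilding and re-logging in, apps must be launched fresh to register on the accessibility bus (per the timing issue in Problems & Solutions); the AT-SPI test script in Raw Notes should then report a nonzero app count.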

Process and Workflow

What Worked Well

  • Beads workflow for tracking exploration as issues
  • Incremental investigation - check what exists before planning
  • Framing the problem clearly ("seeing problem") helped structure thinking
  • Quick vision benchmark experiment provided concrete data

What Was Challenging

  • AT-SPI complexity - session vs on-demand, app registration timing
  • NixOS options discovery - no easy way to search for accessibility options
  • bd sync worktree error - had to fall back to manual git

Learning and Insights

Technical Insights

  • AT-SPI is fundamentally session-level - apps register at startup and can't be attached retroactively
  • NO_AT_BRIDGE=1 and GTK_A11Y=none are NixOS defaults when AT-SPI not enabled
  • ydotool uses kernel uinput, completely separate from AT-SPI
  • niri IPC provides rich window metadata including exact pixel dimensions
  • wtype exists for keyboard input on Wayland, but no mouse equivalent is installed

Process Insights

  • "OCR is redundant with vision models" - good insight from user that reframed AT-SPI's value
  • AT-SPI value is semantic structure + precise coordinates, not text extraction
  • Benchmark-driven approach better than picking one path upfront

Architectural Insights

  • Desktop automation on Wayland requires multiple independent pieces:

    • Compositor IPC (window management)
    • Kernel uinput (input injection)
    • D-Bus accessibility (UI tree)
    • Portal protocols (screenshots, permissions)
  • No unified "Playwright for desktop" exists because each piece has different requirements

Context for Future Work

Open Questions

  • What coordinate precision can vision models achieve? (benchmark incomplete - need clicking)
  • Which apps actually expose useful AT-SPI data?
  • What's the runtime overhead of AT-SPI in practice?
  • Should AT-SPI be boot-time toggle or always-available?

Next Steps

  • Create ydotool setup task (enables clicking to verify vision coordinates)
  • Enable AT-SPI in NixOS config (when ready to test)
  • Complete vision model benchmark with actual coordinate verification
  • Document AT-SPI coverage per app (Firefox, Ghostty, etc.)

Related Work

  • 2025-11-08: Invisible Window Capture - niri screenshot capability
  • 2025-11-08: Screenshot Analysis - earlier wayland capture research
  • skills-kg7: Desktop automation for Wayland/niri (epic)
  • skills-pdg: Enable AT-SPI for UI tree access
  • skills-ebl: Benchmark vision model UI understanding
  • skills-bww: Benchmark AT-SPI overhead and coverage

Raw Notes

Vision Model Benchmark Experiment

Captured Firefox ChatGPT window (1280x1408px). Made coordinate predictions:

| Element              | Predicted X | Predicted Y | Confidence |
|----------------------|-------------|-------------|------------|
| "New chat" button    |         100 |         176 | High       |
| "Search chats"       |         112 |         218 | High       |
| "Ask anything" input |         680 |         500 | Medium     |
| "Thinking" dropdown  |         465 |         555 | Medium     |
| Microphone icon      |         927 |         555 | Medium     |
| Black voice button   |         979 |         555 | High       |
| "+" button (attach)  |         377 |         555 | Medium     |
| User profile         |          90 |         981 | High       |

Could not verify via clicking - no mouse input tool available.
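Once ydotool lands, verification could look roughly like the sketch below. Two assumptions to check: the predictions above are window-relative, so the window's absolute position (from niri) must be added first, and the 0xC0 click code (left press+release) follows the ydotool ≥1.0 convention:

import subprocess

# Assumption: window origin on screen, e.g. from niri window/output info.
win_x, win_y = 0, 0

# A couple of the high-confidence predictions from the table above.
predictions = {"New chat": (100, 176), "Search chats": (112, 218)}

for label, (x, y) in predictions.items():
    # Move the virtual pointer to the absolute position, then left-click.
    subprocess.run(["ydotool", "mousemove", "--absolute",
                    "-x", str(win_x + x), "-y", str(win_y + y)], check=True)
    subprocess.run(["ydotool", "click", "0xC0"], check=True)
    input(f"Did that land on '{label}'? (Enter to continue) ")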

niri IPC Data Available

{
  "id": 5,
  "title": "✳ Boot-time Option",
  "app_id": "com.mitchellh.ghostty",
  "pid": 11854,
  "workspace_id": 3,
  "is_focused": true,
  "layout": {
    "tile_size": [1280.0, 1408.0],
    "window_size": [1280, 1408]
  }
}

Also available: multi-monitor info with logical positions, scale factors, modes.
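Since window_size is the window's exact pixel size, screenshot-window output should match it 1:1, which is what lets window-relative vision predictions be mapped back onto the capture.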

AT-SPI Test Script

import gi
gi.require_version("Atspi", "2.0")
from gi.repository import Atspi

# Desktop 0 is the root of the accessibility tree; every app that
# registered with the AT-SPI bus at startup appears as a direct child.
desktop = Atspi.get_desktop(0)
count = desktop.get_child_count()
print(f"Desktop has {count} accessible apps")

for i in range(count):
    app = desktop.get_child_at_index(i)
    if app:
        name = app.get_name() or "(unnamed)"
        print(f"  - {name}")

Session Metrics

  • Commits made: 4
  • Files touched: 4 (2 orch docs, 2 beads files)
  • Lines added/removed: ~+1536/-25 (mostly beads JSON)
  • Tests added: 0
  • Tests passing: N/A
  • Beads issues created: 4 (1 epic, 3 tasks)
  • Beads issues closed: 1 (skills-d87)