docs: worklog for wayland desktop automation session

dan 2025-12-17 14:32:20 -08:00
#+TITLE: Wayland Desktop Automation: The Seeing Problem
#+DATE: 2025-12-17
#+KEYWORDS: wayland, niri, desktop-automation, at-spi, vision-model, accessibility, ydotool, seeing-problem
#+COMMITS: 4
#+COMPRESSION_STATUS: uncompressed
* Session Summary
** Date: 2025-12-17
** Focus Area: Exploring desktop automation solutions for Wayland/niri compositor
* Accomplishments
- [X] Fixed orch skill documentation - updated to use globally installed CLI instead of ~cd ~/proj/orch && uv run~
- [X] Closed skills-d87 (orch skill documentation-only bug)
- [X] Created epic skills-kg7 for "Desktop automation for Wayland/niri"
- [X] Framed "the seeing problem" - how AI agents understand screen content
- [X] Investigated AT-SPI accessibility framework on NixOS
- [X] Discovered AT-SPI is disabled (~NO_AT_BRIDGE=1~, ~GTK_A11Y=none~)
- [X] Created task skills-pdg documenting AT-SPI enablement requirements
- [X] Created benchmark tasks for both vision and AT-SPI paths
- [X] Conducted first vision model benchmark experiment with ChatGPT screenshot
- [X] Identified "the acting problem" - discovered no mouse input tool installed
- [X] Verified ydotool would work (user already in uinput group)
- [ ] ydotool installation task not yet created
* Key Decisions
** Decision 1: Hybrid approach for UI understanding
- Context: Need to understand what's on screen (element types, coordinates) for automation
- Options considered:
1. AT-SPI only - structured data, precise coords, but requires system config + app compliance
2. Vision model only - universal coverage, but coordinate precision unknown
3. Hybrid - benchmark both, use appropriately
- Rationale: Both paths have tradeoffs; measuring both allows informed decisions per use case
- Impact: Created parallel benchmark tasks, AT-SPI becomes opt-in system config
** Decision 2: Boot-time option for AT-SPI
- Context: AT-SPI adds runtime overhead to all GTK/Qt apps
- Options considered:
1. Always-on accessibility
2. Boot-time toggle
3. Session-level toggle
- Rationale: Overhead only worth paying when automation is needed
- Impact: Documented in skills-pdg as future implementation direction
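One hedged sketch of what the boot-time toggle could look like on NixOS is a specialisation, selectable at the boot menu (the ~specialisation~ option is a real NixOS mechanism; whether this is the final shape is exactly what skills-pdg leaves open):
#+begin_src nix
# Hypothetical: pick the "a11y" entry at boot to get an AT-SPI-enabled session
specialisation.a11y.configuration = {
  services.gnome.at-spi2-core.enable = true;
  # Qt apps additionally need:
  # environment.variables.QT_LINUX_ACCESSIBILITY_ALWAYS_ON = "1";
};
#+end_src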
** Decision 3: Frame as "the seeing problem"
- Context: Needed clear terminology for the capability gap
- Rationale: "Seeing" captures the essence - understanding what's visible on screen
- Impact: Epic and tasks now organized around this framing; led to discovering complementary "acting problem"
* Problems & Solutions
| Problem | Solution | Learning |
|---------|----------|----------|
| AT-SPI test crashed with "AT-SPI: Couldn't connect to accessibility bus" | AT-SPI needs to be running at session start; can't spin up ad-hoc | AT-SPI is session-level, not on-demand |
| Started fresh AT-SPI bus showed 0 apps | Existing apps connected to dbus at their startup, don't see new bus | Apps must start AFTER AT-SPI is enabled |
| bd sync failed with worktree error | Manually committed beads files with git add/commit | bd sync has edge cases; manual fallback works |
| Couldn't test vision model coordinate predictions | No mouse click tool installed (ydotool missing) | "Seeing" is only half the problem; "acting" matters too |
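The registration-timing rows above imply a workflow: start the bus first, then launch the app with the bridge un-suppressed so it registers at startup. A hedged sketch (~GTK_A11Y=atspi~ is the GTK4 knob; GTK3 apps only need ~NO_AT_BRIDGE~ unset):
#+begin_src bash
# Start the registry/bus first (store path elided, as in the session notes)
/nix/store/.../at-spi-bus-launcher --launch-immediately &
# Then launch the app so it connects to the a11y bus at startup
env -u NO_AT_BRIDGE GTK_A11Y=atspi firefox &
#+end_src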
* Technical Details
** Code Changes
- Total files modified: 4
- Key files changed:
- ~skills/orch/SKILL.md~ - Removed ~cd ~/proj/orch && uv run~ prefix from all examples
- ~skills/orch/README.md~ - Updated prerequisites and examples
- Beads changes:
- ~.beads/issues.jsonl~ - Added skills-kg7 (epic), skills-pdg, skills-ebl, skills-bww
** Commands Used
#+begin_src bash
# Check AT-SPI status
gsettings get org.gnome.desktop.interface toolkit-accessibility
# Result: gsettings not available

# Check if AT-SPI bus running
pgrep -a at-spi
dbus-send --session --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep -i access
# Result: nothing running

# Check environment variables
echo "NO_AT_BRIDGE=$NO_AT_BRIDGE"  # = 1
echo "GTK_A11Y=$GTK_A11Y"          # = none

# Test AT-SPI in nix-shell (failed - bus not connected)
nix-shell -p at-spi2-core python312Packages.pyatspi python312Packages.pygobject3 --run 'python3 /tmp/test_atspi.py'

# Start AT-SPI bus manually (worked, but 0 apps)
/nix/store/.../at-spi-bus-launcher &
# Output: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
#         Desktop has 0 accessible apps

# Check uinput access for ydotool
ls -la /dev/uinput
# crw-rw---- 1 root uinput 10, 223 Dec 16 11:52 /dev/uinput
groups | grep -w input
# User in both input and uinput groups - ydotool would work

# Capture window for vision benchmark
niri msg action screenshot-window --id 10 --write-to-disk true

# Check available input tools
which ydotool  # not installed
which wtype    # /etc/profiles/per-user/dan/bin/wtype (keyboard only)
#+end_src
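Once ydotool is installed, closing the "acting" gap could look like the following (a hedged sketch; flags as in ydotool 1.x, and the daemon needs ~/dev/uinput~ access, which the group membership checked above already provides):
#+begin_src bash
ydotoold &                                   # daemon that writes to /dev/uinput
ydotool mousemove --absolute -x 100 -y 176   # move pointer to a predicted coordinate
ydotool click 0xC0                           # left button: press (0x40) + release (0x80)
#+end_src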
** Architecture Notes
*** The Seeing Problem - Layered Model
#+begin_example
┌─────────────────────────────────────────────────────────────┐
│ "Seeing" Layers │
├─────────────────────────────────────────────────────────────┤
│ 4. Understanding: "What can I click? Where's the submit?" │
│ → Vision model + prompt engineering │
│ → AT-SPI UI tree (if enabled) │
├─────────────────────────────────────────────────────────────┤
│ 3. Pixel capture: Screenshots │
│ → niri screenshot-window ✅ │
├─────────────────────────────────────────────────────────────┤
│ 2. Window geometry: Size, position, monitor │
│ → niri windows/focused-window ✅ │
├─────────────────────────────────────────────────────────────┤
│ 1. Window awareness: What's open, app_id, title │
│ → niri windows ✅ │
└─────────────────────────────────────────────────────────────┘
#+end_example
*** Wayland Security Model Impact
- X11 allowed any app to read keystrokes, inject input, screenshot anything
- Wayland blocks all of this by design (security feature)
- Workarounds require:
- Compositor-specific IPC (niri msg)
- Kernel bypass for input (ydotool → uinput)
- App opt-in for UI tree (AT-SPI)
- Portal permissions for screenshots
*** NixOS AT-SPI Configuration
#+begin_src nix
# To enable AT-SPI
services.gnome.at-spi2-core.enable = true;
# For Qt apps also need:
# environment.variables.QT_LINUX_ACCESSIBILITY_ALWAYS_ON = "1";
#+end_src
* Process and Workflow
** What Worked Well
- Beads workflow for tracking exploration as issues
- Incremental investigation - check what exists before planning
- Framing the problem clearly ("seeing problem") helped structure thinking
- Quick vision benchmark experiment provided concrete data
** What Was Challenging
- AT-SPI complexity - session vs on-demand, app registration timing
- NixOS options discovery - no easy way to search for accessibility options
- bd sync worktree error - had to fall back to manual git
* Learning and Insights
** Technical Insights
- AT-SPI is fundamentally session-level - apps register at startup and cannot be attached retroactively
- ~NO_AT_BRIDGE=1~ and ~GTK_A11Y=none~ are NixOS defaults when AT-SPI not enabled
- ydotool uses kernel uinput, completely separate from AT-SPI
- niri IPC provides rich window metadata including exact pixel dimensions
- wtype exists for keyboard input on Wayland, but no mouse equivalent is installed
** Process Insights
- "OCR is redundant with vision models" - a user insight that reframed AT-SPI's value
- AT-SPI value is semantic structure + precise coordinates, not text extraction
- Benchmark-driven approach better than picking one path upfront
** Architectural Insights
- Desktop automation on Wayland requires multiple independent pieces:
- Compositor IPC (window management)
- Kernel uinput (input injection)
- D-Bus accessibility (UI tree)
- Portal protocols (screenshots, permissions)
- No unified "Playwright for desktop" exists because each piece has different requirements
* Context for Future Work
** Open Questions
- What coordinate precision can vision models achieve? (benchmark incomplete - need clicking)
- Which apps actually expose useful AT-SPI data?
- What's the runtime overhead of AT-SPI in practice?
- Should AT-SPI be boot-time toggle or always-available?
** Next Steps
- [ ] Create ydotool setup task (enables clicking to verify vision coordinates)
- [ ] Enable AT-SPI in NixOS config (when ready to test)
- [ ] Complete vision model benchmark with actual coordinate verification
- [ ] Document AT-SPI coverage per app (Firefox, Ghostty, etc.)
** Related Work
- [[file:2025-11-08-invisible-window-capture-niri.org][2025-11-08: Invisible Window Capture]] - niri screenshot capability
- [[file:2025-11-08-screenshot-analysis-over-engineering-discovery.org][2025-11-08: Screenshot Analysis]] - earlier wayland capture research
- skills-kg7: Desktop automation for Wayland/niri (epic)
- skills-pdg: Enable AT-SPI for UI tree access
- skills-ebl: Benchmark vision model UI understanding
- skills-bww: Benchmark AT-SPI overhead and coverage
* Raw Notes
** Vision Model Benchmark Experiment
Captured Firefox ChatGPT window (1280x1408px). Made coordinate predictions:
| Element | Predicted X | Predicted Y | Confidence |
|---------+-------------+-------------+------------|
| "New chat" button | 100 | 176 | High |
| "Search chats" | 112 | 218 | High |
| "Ask anything" input | 680 | 500 | Medium |
| "Thinking" dropdown | 465 | 555 | Medium |
| Microphone icon | 927 | 555 | Medium |
| Black voice button | 979 | 555 | High |
| "+" button (attach) | 377 | 555 | Medium |
| User profile | 90 | 981 | High |
Could not verify via clicking - no mouse input tool available.
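Even without clicking, the window-relative predictions can be turned into global pointer coordinates for a future input tool. A minimal sketch, assuming a hypothetical window origin and scale factor of the kind niri's output metadata provides (the numbers are illustrative, not measured):
#+begin_src python
def to_global(pred_xy, win_origin, scale=1.0):
    """Convert a window-relative prediction to global logical coordinates."""
    px, py = pred_xy
    ox, oy = win_origin
    return ox + px * scale, oy + py * scale

# "New chat" prediction from the table above, hypothetical window origin (0, 26)
print(to_global((100, 176), (0, 26)))  # (100.0, 202.0)
#+end_src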
** niri IPC Data Available
#+begin_src json
{
  "id": 5,
  "title": "✳ Boot-time Option",
  "app_id": "com.mitchellh.ghostty",
  "pid": 11854,
  "workspace_id": 3,
  "is_focused": true,
  "layout": {
    "tile_size": [1280.0, 1408.0],
    "window_size": [1280, 1408]
  }
}
#+end_src
Also available: multi-monitor info with logical positions, scale factors, modes.
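This JSON can be consumed directly. A minimal sketch extracting the screenshot dimensions (the payload below is the focused-window sample above, abbreviated):
#+begin_src python
import json

# Abbreviated sample of `niri msg` focused-window output, as captured above
payload = '''{
  "id": 5,
  "app_id": "com.mitchellh.ghostty",
  "layout": {"tile_size": [1280.0, 1408.0], "window_size": [1280, 1408]}
}'''

win = json.loads(payload)
w, h = win["layout"]["window_size"]
print(f'{win["app_id"]}: {w}x{h}')  # com.mitchellh.ghostty: 1280x1408
#+end_src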
** AT-SPI Test Script
#+begin_src python
import gi
gi.require_version("Atspi", "2.0")
from gi.repository import Atspi

# Desktop 0 is the root of the AT-SPI tree; its children are registered apps
desktop = Atspi.get_desktop(0)
count = desktop.get_child_count()
print(f"Desktop has {count} accessible apps")
for i in range(count):
    app = desktop.get_child_at_index(i)
    if app:
        name = app.get_name() or "(unnamed)"
        print(f"  - {name}")
#+end_src
* Session Metrics
- Commits made: 4
- Files touched: 4 (2 orch docs, 2 beads files)
- Lines added/removed: ~+1536/-25 (mostly beads JSON)
- Tests added: 0
- Tests passing: N/A
- Beads issues created: 4 (1 epic, 3 tasks)
- Beads issues closed: 1 (skills-d87)