diff --git a/docs/worklogs/2025-12-17-wayland-desktop-automation-seeing-problem.org b/docs/worklogs/2025-12-17-wayland-desktop-automation-seeing-problem.org
new file mode 100644
index 0000000..abb0a7b
--- /dev/null
+++ b/docs/worklogs/2025-12-17-wayland-desktop-automation-seeing-problem.org
@@ -0,0 +1,259 @@
#+TITLE: Wayland Desktop Automation: The Seeing Problem
#+DATE: 2025-12-17
#+KEYWORDS: wayland, niri, desktop-automation, at-spi, vision-model, accessibility, ydotool, seeing-problem
#+COMMITS: 4
#+COMPRESSION_STATUS: uncompressed

* Session Summary
** Date: 2025-12-17
** Focus Area: Exploring desktop automation solutions for the Wayland/niri compositor

* Accomplishments
- [X] Fixed orch skill documentation - updated examples to use the globally installed CLI instead of ~cd ~/proj/orch && uv run~
- [X] Closed skills-d87 (orch skill documentation-only bug)
- [X] Created epic skills-kg7 for "Desktop automation for Wayland/niri"
- [X] Framed "the seeing problem" - how AI agents understand screen content
- [X] Investigated the AT-SPI accessibility framework on NixOS
- [X] Discovered AT-SPI is disabled (~NO_AT_BRIDGE=1~, ~GTK_A11Y=none~)
- [X] Created task skills-pdg documenting AT-SPI enablement requirements
- [X] Created benchmark tasks for both the vision and AT-SPI paths
- [X] Conducted first vision model benchmark experiment with a ChatGPT screenshot
- [X] Identified "the acting problem" - discovered no mouse input tool installed
- [X] Verified ydotool would work (user already in uinput group)
- [ ] ydotool installation task not yet created

* Key Decisions
** Decision 1: Hybrid approach for UI understanding
- Context: Need to understand what's on screen (element types, coordinates) for automation
- Options considered:
  1. AT-SPI only - structured data, precise coordinates, but requires system config plus app compliance
  2. Vision model only - universal coverage, but coordinate precision unknown
  3. Hybrid - benchmark both, use whichever fits each use case
- Rationale: Both paths have tradeoffs; measuring both allows informed decisions per use case
- Impact: Created parallel benchmark tasks; AT-SPI becomes opt-in system config

** Decision 2: Boot-time option for AT-SPI
- Context: AT-SPI adds runtime overhead to all GTK/Qt apps
- Options considered:
  1. Always-on accessibility
  2. Boot-time toggle
  3. Session-level toggle
- Rationale: The overhead is only worth paying when automation is needed
- Impact: Documented in skills-pdg as the future implementation direction; a runtime availability probe for the hybrid approach is sketched below
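
Because availability will vary per session under a toggle, the hybrid approach implies a runtime check for which seeing path is live. A minimal probe sketch, assuming ~dbus-send~ is on PATH (the same ListNames call recorded in Commands Used below) and assuming the a11y launcher claims ~org.a11y.Bus~ on the session bus:

#+begin_src python
import os
import subprocess

def atspi_session_available() -> bool:
    """Best-effort check: can this session serve an AT-SPI UI tree?"""
    # NixOS defaults observed this session; either one disables the bridge.
    if os.environ.get("NO_AT_BRIDGE") == "1" or os.environ.get("GTK_A11Y") == "none":
        return False
    # Assumption: at-spi-bus-launcher registers org.a11y.Bus on the session bus.
    names = subprocess.run(
        ["dbus-send", "--session", "--print-reply",
         "--dest=org.freedesktop.DBus", "/org/freedesktop/DBus",
         "org.freedesktop.DBus.ListNames"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "org.a11y.Bus" in names

if __name__ == "__main__":
    print("AT-SPI path" if atspi_session_available() else "vision-only path")
#+end_src

If this returns false, the vision path is the only option for that session.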

** Decision 3: Frame as "the seeing problem"
- Context: Needed clear terminology for the capability gap
- Rationale: "Seeing" captures the essence - understanding what's visible on screen
- Impact: Epic and tasks are now organized around this framing; it also led to discovering the complementary "acting problem"

* Problems & Solutions
| Problem | Solution | Learning |
|---------|----------|----------|
| AT-SPI test crashed with "AT-SPI: Couldn't connect to accessibility bus" | AT-SPI needs to be running at session start; it can't be spun up ad hoc | AT-SPI is session-level, not on-demand |
| Freshly started AT-SPI bus showed 0 apps | Existing apps connected to D-Bus at their startup and don't see the new bus | Apps must start AFTER AT-SPI is enabled |
| bd sync failed with a worktree error | Manually committed beads files with git add/commit | bd sync has edge cases; the manual fallback works |
| Couldn't test vision model coordinate predictions | No mouse click tool installed (ydotool missing) | "Seeing" is only half the problem; "acting" matters too |

* Technical Details

** Code Changes
- Total files modified: 4
- Key files changed:
  - ~skills/orch/SKILL.md~ - Removed the ~cd ~/proj/orch && uv run~ prefix from all examples
  - ~skills/orch/README.md~ - Updated prerequisites and examples
- Beads changes:
  - ~.beads/issues.jsonl~ - Added skills-kg7 (epic), skills-pdg, skills-ebl, skills-bww

** Commands Used
#+begin_src bash
# Check AT-SPI status
gsettings get org.gnome.desktop.interface toolkit-accessibility
# Result: gsettings not available

# Check whether an AT-SPI bus is running
pgrep -a at-spi
dbus-send --session --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep -i access
# Result: nothing running

# Check environment variables
echo "NO_AT_BRIDGE=$NO_AT_BRIDGE"  # = 1
echo "GTK_A11Y=$GTK_A11Y"          # = none

# Test AT-SPI in nix-shell (failed - bus not connected)
nix-shell -p at-spi2-core python312Packages.pyatspi python312Packages.pygobject3 --run 'python3 /tmp/test_atspi.py'

# Start the AT-SPI bus manually (worked, but 0 apps)
/nix/store/.../at-spi-bus-launcher &
# Output: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
# Desktop has 0 accessible apps

# Check uinput access for ydotool
ls -la /dev/uinput
# crw-rw---- 1 root uinput 10, 223 Dec 16 11:52 /dev/uinput
groups | grep -w input
# User is in both the input and uinput groups - ydotool would work

# Capture a window for the vision benchmark
niri msg action screenshot-window --id 10 --write-to-disk true

# Check available input tools
which ydotool  # not installed
which wtype    # /etc/profiles/per-user/dan/bin/wtype (keyboard only)
#+end_src

** Architecture Notes
*** The Seeing Problem - Layered Model
#+begin_example
┌──────────────────────────────────────────────────────────────┐
│                       "Seeing" Layers                        │
├──────────────────────────────────────────────────────────────┤
│ 4. Understanding: "What can I click? Where's the submit?"    │
│    → Vision model + prompt engineering                       │
│    → AT-SPI UI tree (if enabled)                             │
├──────────────────────────────────────────────────────────────┤
│ 3. Pixel capture: Screenshots                                │
│    → niri screenshot-window ✅                               │
├──────────────────────────────────────────────────────────────┤
│ 2. Window geometry: Size, position, monitor                  │
│    → niri windows/focused-window ✅                          │
├──────────────────────────────────────────────────────────────┤
│ 1. Window awareness: What's open, app_id, title              │
│    → niri windows ✅                                         │
└──────────────────────────────────────────────────────────────┘
#+end_example
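
Layers 1-3 are already scriptable today via niri's IPC. A sketch gluing them together - it assumes ~niri msg~ accepts a ~--json~ flag (not re-verified this session; plain ~niri msg focused-window~ prints the same data human-readably), and the screenshot invocation is the one from Commands Used above:

#+begin_src python
import json
import subprocess

def focused_window() -> dict:
    """Layers 1-2: window identity and geometry from niri IPC."""
    # --json flag is an assumption here; verify with `niri msg --help`.
    out = subprocess.run(["niri", "msg", "--json", "focused-window"],
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out)

def capture_window(window_id: int) -> None:
    """Layer 3: pixel capture, same action used for the benchmark shot."""
    subprocess.run(["niri", "msg", "action", "screenshot-window",
                    "--id", str(window_id), "--write-to-disk", "true"],
                   check=True)

if __name__ == "__main__":
    win = focused_window()
    w, h = win["layout"]["window_size"]  # field names per the Raw Notes JSON
    print(f"{win['app_id']} ({win['title']}): {w}x{h}px")
    capture_window(win["id"])  # layer 4 (understanding) starts from this image
#+end_src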
*** Wayland Security Model Impact
- X11 allowed any app to read keystrokes, inject input, and screenshot anything
- Wayland blocks all of this by design (a security feature)
- Workarounds require:
  - Compositor-specific IPC (niri msg)
  - A kernel bypass for input (ydotool → uinput)
  - App opt-in for the UI tree (AT-SPI)
  - Portal permissions for screenshots

*** NixOS AT-SPI Configuration
#+begin_src nix
# To enable AT-SPI
services.gnome.at-spi2-core.enable = true;
# Qt apps additionally need:
# environment.variables.QT_LINUX_ACCESSIBILITY_ALWAYS_ON = "1";
#+end_src

* Process and Workflow

** What Worked Well
- Beads workflow for tracking the exploration as issues
- Incremental investigation - checking what exists before planning
- Framing the problem clearly ("the seeing problem") helped structure thinking
- A quick vision benchmark experiment provided concrete data

** What Was Challenging
- AT-SPI complexity - session vs on-demand behavior, app registration timing
- NixOS options discovery - no easy way to search for accessibility options
- bd sync worktree error - had to fall back to manual git

* Learning and Insights

** Technical Insights
- AT-SPI is fundamentally session-level - apps register at startup and can't be added retroactively
- ~NO_AT_BRIDGE=1~ and ~GTK_A11Y=none~ are the NixOS defaults when AT-SPI is not enabled
- ydotool uses kernel uinput, completely separate from AT-SPI
- niri IPC provides rich window metadata, including exact pixel dimensions
- wtype exists for keyboard input on Wayland, but no mouse equivalent is installed

** Process Insights
- "OCR is redundant with vision models" - a good insight from the user that reframed AT-SPI's value
- AT-SPI's value is semantic structure plus precise coordinates, not text extraction
- A benchmark-driven approach beats picking one path upfront

** Architectural Insights
- Desktop automation on Wayland requires multiple independent pieces:
  - Compositor IPC (window management)
  - Kernel uinput (input injection)
  - D-Bus accessibility (UI tree)
  - Portal protocols (screenshots, permissions)
- No unified "Playwright for desktop" exists because each piece has different requirements

* Context for Future Work

** Open Questions
- What coordinate precision can vision models achieve? (benchmark incomplete - clicking is needed)
- Which apps actually expose useful AT-SPI data?
- What is the runtime overhead of AT-SPI in practice?
- Should AT-SPI be a boot-time toggle or always available?

** Next Steps
- [ ] Create ydotool setup task (enables clicking to verify vision coordinates; see the verification sketch below)
- [ ] Enable AT-SPI in the NixOS config (when ready to test)
- [ ] Complete the vision model benchmark with actual coordinate verification
- [ ] Document AT-SPI coverage per app (Firefox, Ghostty, etc.; a walker sketch is in the addendum at the end)
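
Once ydotool is installed, coordinate verification could look roughly like the following. Everything here is hypothetical until then: the ~mousemove --absolute~ / ~click 0xC0~ invocations are my recollection of the ydotool 1.x CLI and should be checked against the installed version, and the window-origin/scale handling assumes niri reports a logical position for the window:

#+begin_src python
import subprocess
import time

def to_screen(px: int, py: int, origin: tuple[float, float],
              scale: float) -> tuple[int, int]:
    # Predictions are window-local screenshot pixels; injected clicks land in
    # screen space, so divide by the output scale and add the window origin.
    return round(origin[0] + px / scale), round(origin[1] + py / scale)

def click(x: int, y: int) -> None:
    # Assumed ydotool 1.x syntax: absolute move, then left press+release
    # (0xC0 = button 0 down|up). Verify flags with `ydotool --help`.
    subprocess.run(["ydotool", "mousemove", "--absolute",
                    "-x", str(x), "-y", str(y)], check=True)
    time.sleep(0.1)  # give the compositor a beat to deliver the motion event
    subprocess.run(["ydotool", "click", "0xC0"], check=True)

# "New chat" prediction from the benchmark table below, assuming the window
# sits at the output origin on a 1x-scale monitor.
click(*to_screen(100, 176, origin=(0.0, 0.0), scale=1.0))
#+end_src

Re-screenshotting after each click and diffing against the pre-click capture would turn every row of the benchmark table into a pass/fail result.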
** Related Work
- [[file:2025-11-08-invisible-window-capture-niri.org][2025-11-08: Invisible Window Capture]] - niri screenshot capability
- [[file:2025-11-08-screenshot-analysis-over-engineering-discovery.org][2025-11-08: Screenshot Analysis]] - earlier Wayland capture research
- skills-kg7: Desktop automation for Wayland/niri (epic)
- skills-pdg: Enable AT-SPI for UI tree access
- skills-ebl: Benchmark vision model UI understanding
- skills-bww: Benchmark AT-SPI overhead and coverage

* Raw Notes

** Vision Model Benchmark Experiment
Captured the Firefox ChatGPT window (1280x1408px) and made coordinate predictions:

| Element               | Predicted X | Predicted Y | Confidence |
|-----------------------+-------------+-------------+------------|
| "New chat" button     |         100 |         176 | High       |
| "Search chats"        |         112 |         218 | High       |
| "Ask anything" input  |         680 |         500 | Medium     |
| "Thinking" dropdown   |         465 |         555 | Medium     |
| Microphone icon       |         927 |         555 | Medium     |
| Black voice button    |         979 |         555 | High       |
| "+" button (attach)   |         377 |         555 | Medium     |
| User profile          |          90 |         981 | High       |

Could not verify via clicking - no mouse input tool available.

** niri IPC Data Available
#+begin_src json
{
  "id": 5,
  "title": "✳ Boot-time Option",
  "app_id": "com.mitchellh.ghostty",
  "pid": 11854,
  "workspace_id": 3,
  "is_focused": true,
  "layout": {
    "tile_size": [1280.0, 1408.0],
    "window_size": [1280, 1408]
  }
}
#+end_src

Also available: multi-monitor info with logical positions, scale factors, and modes.

** AT-SPI Test Script
#+begin_src python
import gi
gi.require_version("Atspi", "2.0")
from gi.repository import Atspi

# The desktop is the root of the accessibility tree; apps register as children.
desktop = Atspi.get_desktop(0)
count = desktop.get_child_count()
print(f"Desktop has {count} accessible apps")

# With AT-SPI disabled this loop prints nothing (count is 0).
for i in range(count):
    app = desktop.get_child_at_index(i)
    if app:
        name = app.get_name() or "(unnamed)"
        print(f"  - {name}")
#+end_src

* Session Metrics
- Commits made: 4
- Files touched: 4 (2 orch docs, 2 beads files)
- Lines added/removed: ~+1536/-25 (mostly beads JSON)
- Tests added: 0
- Tests passing: N/A
- Beads issues created: 4 (1 epic, 3 tasks)
- Beads issues closed: 1 (skills-d87)
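
* Addendum: AT-SPI Coverage Walk (Sketch)
For the "document AT-SPI coverage per app" next step, a rough extension of the test script above - same PyGObject/Atspi environment as the earlier nix-shell invocation, and only meaningful once AT-SPI is enabled and apps have started afterwards. The depth limit and role tally are my own choices, not anything established this session:

#+begin_src python
import gi
gi.require_version("Atspi", "2.0")
from gi.repository import Atspi

def walk_roles(node, counts, depth=0, max_depth=4):
    """Depth-limited walk tallying which accessible roles a subtree exposes."""
    if node is None or depth > max_depth:
        return
    role = node.get_role_name()
    counts[role] = counts.get(role, 0) + 1
    for i in range(node.get_child_count()):
        walk_roles(node.get_child_at_index(i), counts, depth + 1, max_depth)

desktop = Atspi.get_desktop(0)
for i in range(desktop.get_child_count()):
    app = desktop.get_child_at_index(i)
    counts: dict[str, int] = {}
    walk_roles(app, counts)
    # Apps exposing only "frame"/"filler" roles offer little for automation.
    print(app.get_name() or "(unnamed)", counts)
#+end_src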