docs: worklog for wayland desktop automation session
Commit e366343dd7 (parent 906f2bc7ee)

#+TITLE: Wayland Desktop Automation: The Seeing Problem
#+DATE: 2025-12-17
#+KEYWORDS: wayland, niri, desktop-automation, at-spi, vision-model, accessibility, ydotool, seeing-problem
#+COMMITS: 4
#+COMPRESSION_STATUS: uncompressed

* Session Summary
** Date: 2025-12-17
** Focus Area: Exploring desktop automation solutions for Wayland/niri compositor

* Accomplishments
- [X] Fixed orch skill documentation - updated to use the globally installed CLI instead of =cd ~/proj/orch && uv run=
- [X] Closed skills-d87 (orch skill documentation-only bug)
- [X] Created epic skills-kg7 for "Desktop automation for Wayland/niri"
- [X] Framed "the seeing problem" - how AI agents understand screen content
- [X] Investigated AT-SPI accessibility framework on NixOS
- [X] Discovered AT-SPI is disabled (~NO_AT_BRIDGE=1~, ~GTK_A11Y=none~)
- [X] Created task skills-pdg documenting AT-SPI enablement requirements
- [X] Created benchmark tasks for both vision and AT-SPI paths
- [X] Conducted first vision model benchmark experiment with a ChatGPT screenshot
- [X] Identified "the acting problem" - discovered no mouse input tool is installed
- [X] Verified ydotool would work once installed (user already in uinput group)
- [ ] ydotool installation task not yet created

* Key Decisions
** Decision 1: Hybrid approach for UI understanding
- Context: Need to understand what's on screen (element types, coordinates) for automation
- Options considered:
  1. AT-SPI only - structured data, precise coords, but requires system config + app compliance
  2. Vision model only - universal coverage, but coordinate precision unknown
  3. Hybrid - benchmark both, then use whichever fits each use case
- Rationale: Both paths have tradeoffs; measuring both allows informed decisions per use case
- Impact: Created parallel benchmark tasks; AT-SPI becomes opt-in system config

** Decision 2: Boot-time option for AT-SPI
- Context: AT-SPI is expected to add runtime overhead to all GTK/Qt apps (not yet measured, see skills-bww)
- Options considered:
  1. Always-on accessibility
  2. Boot-time toggle
  3. Session-level toggle
- Rationale: Overhead is only worth paying when automation is needed
- Impact: Documented in skills-pdg as future implementation direction

** Decision 3: Frame as "the seeing problem"
- Context: Needed clear terminology for the capability gap
- Rationale: "Seeing" captures the essence - understanding what's visible on screen
- Impact: Epic and tasks now organized around this framing; led to discovering the complementary "acting problem"

* Problems & Solutions
| Problem | Solution | Learning |
|---------|----------|----------|
| AT-SPI test crashed with "AT-SPI: Couldn't connect to accessibility bus" | No ad-hoc fix: AT-SPI needs to be running at session start | AT-SPI is session-level, not on-demand |
| Freshly started AT-SPI bus showed 0 apps | No fix yet: existing apps connected to D-Bus at their own startup and do not see the new bus | Apps must start AFTER AT-SPI is enabled |
| bd sync failed with worktree error | Manually committed beads files with git add/commit | bd sync has edge cases; manual fallback works |
| Couldn't test vision model coordinate predictions | Unresolved: no mouse click tool installed (ydotool missing) | "Seeing" is only half the problem; "acting" matters too |

* Technical Details

** Code Changes
- Total files modified: 4
- Key files changed:
  - ~skills/orch/SKILL.md~ - Removed =cd ~/proj/orch && uv run= prefix from all examples
  - ~skills/orch/README.md~ - Updated prerequisites and examples
- Beads changes:
  - ~.beads/issues.jsonl~ - Added skills-kg7 (epic), skills-pdg, skills-ebl, skills-bww

** Commands Used
#+begin_src bash
# Check AT-SPI status
gsettings get org.gnome.desktop.interface toolkit-accessibility
# Result: gsettings not available

# Check if AT-SPI bus running
pgrep -a at-spi
dbus-send --session --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep -i access
# Result: Nothing running

# Check environment variables
echo "NO_AT_BRIDGE=$NO_AT_BRIDGE"  # = 1
echo "GTK_A11Y=$GTK_A11Y"          # = none

# Test AT-SPI in nix-shell (failed - bus not connected)
nix-shell -p at-spi2-core python312Packages.pyatspi python312Packages.pygobject3 --run 'python3 /tmp/test_atspi.py'

# Start AT-SPI bus manually (worked, but 0 apps)
/nix/store/.../at-spi-bus-launcher &
# Output: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
# Desktop has 0 accessible apps

# Check uinput access for ydotool
ls -la /dev/uinput
# crw-rw---- 1 root uinput 10, 223 Dec 16 11:52 /dev/uinput
groups | grep -w input
# User in both input and uinput groups - ydotool would work

# Capture window for vision benchmark
niri msg action screenshot-window --id 10 --write-to-disk true

# Check available input tools
which ydotool  # not installed
which wtype    # /etc/profiles/per-user/dan/bin/wtype (keyboard only)
#+end_src
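
The environment-variable check above could be folded into a small reusable helper. The following is only a sketch: the function name ~atspi_env_status~ and the shape of the returned dict are made up for illustration, and it assumes that either ~NO_AT_BRIDGE=1~ or ~GTK_A11Y=none~ is enough to treat the bridge as disabled.

#+begin_src python
import os


def atspi_env_status(env):
    """Summarise the two variables this worklog found set on NixOS.

    `env` is any mapping of environment variables (os.environ or a dict).
    Treating either variable as sufficient to disable the bridge is an
    assumption of this sketch, not something verified in the session.
    """
    no_bridge = env.get("NO_AT_BRIDGE")
    gtk_a11y = env.get("GTK_A11Y")
    return {
        "NO_AT_BRIDGE": no_bridge,
        "GTK_A11Y": gtk_a11y,
        "bridge_disabled": no_bridge == "1" or gtk_a11y == "none",
    }


if __name__ == "__main__":
    # Report the status of the current session.
    print(atspi_env_status(os.environ))
#+end_src

Taking the environment as a parameter keeps the helper testable without touching the real session.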

** Architecture Notes
*** The Seeing Problem - Layered Model
#+begin_example
┌─────────────────────────────────────────────────────────────┐
│                       "Seeing" Layers                       │
├─────────────────────────────────────────────────────────────┤
│  4. Understanding: "What can I click? Where's the submit?"  │
│     → Vision model + prompt engineering                     │
│     → AT-SPI UI tree (if enabled)                           │
├─────────────────────────────────────────────────────────────┤
│  3. Pixel capture: Screenshots                              │
│     → niri screenshot-window ✅                             │
├─────────────────────────────────────────────────────────────┤
│  2. Window geometry: Size, position, monitor                │
│     → niri windows/focused-window ✅                        │
├─────────────────────────────────────────────────────────────┤
│  1. Window awareness: What's open, app_id, title            │
│     → niri windows ✅                                       │
└─────────────────────────────────────────────────────────────┘
#+end_example

*** Wayland Security Model Impact
- X11 allowed any app to read keystrokes, inject input, screenshot anything
- Wayland blocks all of this by design (security feature)
- Workarounds require:
  - Compositor-specific IPC (niri msg)
  - Kernel bypass for input (ydotool → uinput)
  - App opt-in for UI tree (AT-SPI)
  - Portal permissions for screenshots
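
Because each workaround comes from a different tool, a quick inventory of what is on ~PATH~ can show which pieces are still missing. This is a rough sketch: the ~REQUIRED_TOOLS~ mapping and ~find_missing~ are illustrative names, and the mapping only covers the executables mentioned in this worklog.

#+begin_src python
import shutil

# Piece of the automation stack -> executable named in this worklog.
# The mapping is illustrative, not an exhaustive list of requirements.
REQUIRED_TOOLS = {
    "compositor IPC": "niri",
    "keyboard input": "wtype",
    "mouse input (uinput)": "ydotool",
}


def find_missing(tools, which=shutil.which):
    """Return the stack pieces whose executable is not found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return sorted(piece for piece, exe in tools.items() if which(exe) is None)


if __name__ == "__main__":
    for piece in find_missing(REQUIRED_TOOLS):
        print(f"missing: {piece} ({REQUIRED_TOOLS[piece]})")
#+end_src

An executable on ~PATH~ is necessary but not sufficient: ydotool additionally needs access to ~/dev/uinput~, as checked in the commands above.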

*** NixOS AT-SPI Configuration
#+begin_src nix
# To enable AT-SPI
services.gnome.at-spi2-core.enable = true;
# For Qt apps also need:
# environment.variables.QT_LINUX_ACCESSIBILITY_ALWAYS_ON = "1";
#+end_src

* Process and Workflow

** What Worked Well
- Beads workflow for tracking exploration as issues
- Incremental investigation - check what exists before planning
- Framing the problem clearly ("seeing problem") helped structure thinking
- Quick vision benchmark experiment provided concrete data

** What Was Challenging
- AT-SPI complexity - session vs on-demand, app registration timing
- NixOS options discovery - the relevant accessibility options were not easy to find
- bd sync worktree error - had to fall back to manual git

* Learning and Insights

** Technical Insights
- AT-SPI is fundamentally session-level - apps register at startup and cannot be added retroactively
- ~NO_AT_BRIDGE=1~ and ~GTK_A11Y=none~ are the NixOS defaults when AT-SPI is not enabled
- ydotool uses kernel uinput, completely separate from AT-SPI
- niri IPC provides rich window metadata including exact pixel dimensions
- wtype is installed for keyboard input on Wayland, but no mouse equivalent is installed

** Process Insights
- "OCR is redundant with vision models" - good insight from user that reframed AT-SPI's value
- AT-SPI's value is semantic structure + precise coordinates, not text extraction
- A benchmark-driven approach is better than picking one path upfront

** Architectural Insights
- Desktop automation on Wayland requires multiple independent pieces:
  - Compositor IPC (window management)
  - Kernel uinput (input injection)
  - D-Bus accessibility (UI tree)
  - Portal protocols (screenshots, permissions)
- No unified "Playwright for desktop" turned up in this session, likely because each piece has different requirements

* Context for Future Work

** Open Questions
- What coordinate precision can vision models achieve? (benchmark incomplete - need clicking)
- Which apps actually expose useful AT-SPI data?
- What's the runtime overhead of AT-SPI in practice?
- Should AT-SPI be a boot-time toggle or always available?

** Next Steps
- [ ] Create ydotool setup task (enables clicking to verify vision coordinates)
- [ ] Enable AT-SPI in NixOS config (when ready to test)
- [ ] Complete vision model benchmark with actual coordinate verification
- [ ] Document AT-SPI coverage per app (Firefox, Ghostty, etc.)

** Related Work
- [[file:2025-11-08-invisible-window-capture-niri.org][2025-11-08: Invisible Window Capture]] - niri screenshot capability
- [[file:2025-11-08-screenshot-analysis-over-engineering-discovery.org][2025-11-08: Screenshot Analysis]] - earlier Wayland capture research
- skills-kg7: Desktop automation for Wayland/niri (epic)
- skills-pdg: Enable AT-SPI for UI tree access
- skills-ebl: Benchmark vision model UI understanding
- skills-bww: Benchmark AT-SPI overhead and coverage

* Raw Notes

** Vision Model Benchmark Experiment
Captured Firefox ChatGPT window (1280x1408px). Coordinate predictions made from the screenshot, in screenshot pixels:

| Element              | Predicted X | Predicted Y | Confidence |
|----------------------+-------------+-------------+------------|
| "New chat" button    |         100 |         176 | High       |
| "Search chats"       |         112 |         218 | High       |
| "Ask anything" input |         680 |         500 | Medium     |
| "Thinking" dropdown  |         465 |         555 | Medium     |
| Microphone icon      |         927 |         555 | Medium     |
| Black voice button   |         979 |         555 | High       |
| "+" button (attach)  |         377 |         555 | Medium     |
| User profile         |          90 |         981 | High       |

Could not verify via clicking - no mouse input tool available.
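
Once clicking works, the predictions could be scored with something like the sketch below. The names ~out_of_bounds~ and ~pixel_error~ are illustrative, and no ground-truth coordinates exist yet, so the script only performs a bounds check against the captured window size.

#+begin_src python
from math import hypot

# Size of the captured window in pixels, from the notes above.
WINDOW_SIZE = (1280, 1408)

# Predicted (x, y) per element, copied from the table above.
PREDICTIONS = {
    '"New chat" button': (100, 176),
    '"Search chats"': (112, 218),
    '"Ask anything" input': (680, 500),
    '"Thinking" dropdown': (465, 555),
    "Microphone icon": (927, 555),
    "Black voice button": (979, 555),
    '"+" button (attach)': (377, 555),
    "User profile": (90, 981),
}


def out_of_bounds(predictions, size):
    """Return the names of predictions that fall outside the window."""
    width, height = size
    return [
        name
        for name, (x, y) in predictions.items()
        if not (0 <= x < width and 0 <= y < height)
    ]


def pixel_error(predicted, actual):
    """Euclidean distance in pixels between a prediction and a measured point."""
    return hypot(predicted[0] - actual[0], predicted[1] - actual[1])


if __name__ == "__main__":
    print("out of bounds:", out_of_bounds(PREDICTIONS, WINDOW_SIZE))
#+end_src

The bounds check is only a sanity filter; ~pixel_error~ becomes useful once measured element positions are available from the AT-SPI path or manual verification.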

** niri IPC Data Available
#+begin_src json
{
  "id": 5,
  "title": "✳ Boot-time Option",
  "app_id": "com.mitchellh.ghostty",
  "pid": 11854,
  "workspace_id": 3,
  "is_focused": true,
  "layout": {
    "tile_size": [1280.0, 1408.0],
    "window_size": [1280, 1408]
  }
}
#+end_src

Also available: multi-monitor info with logical positions, scale factors, modes.
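
A window record like the one above can be turned into pixel targets with a few lines of Python. This sketch parses an abbreviated copy of that record rather than calling niri; the helper names ~focused_window~, ~window_center~, and ~normalized_to_pixels~ are made up for illustration, and the resulting coordinates are relative to the window, not the screen.

#+begin_src python
import json

# Abbreviated copy of the window record shown above.
SAMPLE = """
{
  "id": 5,
  "app_id": "com.mitchellh.ghostty",
  "is_focused": true,
  "layout": {"tile_size": [1280.0, 1408.0], "window_size": [1280, 1408]}
}
"""


def focused_window(windows):
    """Return the first window marked as focused, or None."""
    return next((w for w in windows if w.get("is_focused")), None)


def window_center(window):
    """Center of the window in window-relative pixels."""
    width, height = window["layout"]["window_size"]
    return (width // 2, height // 2)


def normalized_to_pixels(window, nx, ny):
    """Map normalized (0..1) coordinates to window-relative pixels."""
    width, height = window["layout"]["window_size"]
    return (round(nx * width), round(ny * height))


if __name__ == "__main__":
    window = focused_window([json.loads(SAMPLE)])
    print(window["id"], window_center(window))
#+end_src

Normalized coordinates are handy when a vision model sees a downscaled screenshot and its output has to be mapped back to the real window size.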

** AT-SPI Test Script
#+begin_src python
import gi

gi.require_version("Atspi", "2.0")
from gi.repository import Atspi

# Desktop 0 is the root of the accessibility tree; its children are apps
desktop = Atspi.get_desktop(0)
count = desktop.get_child_count()
print(f"Desktop has {count} accessible apps")

# List every application registered on the accessibility bus
for i in range(count):
    app = desktop.get_child_at_index(i)
    if app:
        name = app.get_name() or "(unnamed)"
        print(f"  - {name}")
#+end_src
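
Once apps do register, the same accessor methods could be used to walk below the application level. The sketch below is duck-typed and assumes only ~get_name~, ~get_role_name~, ~get_child_count~, and ~get_child_at_index~ on each node; ~FakeNode~ is a hypothetical stand-in so the walker can run without an accessibility bus, and it has not been run against a live AT-SPI tree.

#+begin_src python
class FakeNode:
    """Hypothetical stand-in exposing the accessor methods used below."""

    def __init__(self, role, name, children=()):
        self._role, self._name, self._children = role, name, list(children)

    def get_role_name(self):
        return self._role

    def get_name(self):
        return self._name

    def get_child_count(self):
        return len(self._children)

    def get_child_at_index(self, index):
        return self._children[index]


def walk(node, depth=0, max_depth=3):
    """Yield (depth, role, name) for a node and its descendants, depth-first."""
    if node is None or depth > max_depth:
        return
    yield depth, node.get_role_name(), node.get_name() or "(unnamed)"
    for i in range(node.get_child_count()):
        yield from walk(node.get_child_at_index(i), depth + 1, max_depth)


if __name__ == "__main__":
    tree = FakeNode("application", "demo", [FakeNode("frame", "", [FakeNode("push button", "OK")])])
    for depth, role, name in walk(tree):
        print("  " * depth + f"{role}: {name}")
#+end_src

The ~max_depth~ limit is there because real UI trees can be very deep; a benchmark of AT-SPI coverage would probably want to tune it per app.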

* Session Metrics
- Commits made: 4
- Files touched: 4 (2 orch docs, 2 beads files)
- Lines added/removed: approx. +1536/-25 (mostly beads JSON)
- Tests added: 0
- Tests passing: N/A
- Beads issues created: 4 (1 epic, 3 tasks)
- Beads issues closed: 1 (skills-d87)