skills/docs/worklogs/2025-12-17-wayland-desktop-automation-seeing-problem.org


Wayland Desktop Automation: The Seeing Problem

Session Summary

Date: 2025-12-17

Focus Area: Exploring desktop automation solutions for Wayland/niri compositor

Accomplishments

  • Fixed orch skill documentation - updated to use globally installed CLI instead of cd ~/proj/orch && uv run
  • Closed skills-d87 (orch skill documentation-only bug)
  • Created epic skills-kg7 for "Desktop automation for Wayland/niri"
  • Framed "the seeing problem" - how AI agents understand screen content
  • Investigated AT-SPI accessibility framework on NixOS
  • Discovered AT-SPI is disabled (NO_AT_BRIDGE=1, GTK_A11Y=none)
  • Created task skills-pdg documenting AT-SPI enablement requirements
  • Created benchmark tasks for both vision and AT-SPI paths
  • Conducted first vision model benchmark experiment with ChatGPT screenshot
  • Identified "the acting problem" - discovered no mouse input tool installed
  • Verified ydotool would work (user already in uinput group)
  • Noted follow-up: a ydotool installation task has not yet been created

Key Decisions

Decision 1: Hybrid approach for UI understanding

  • Context: Need to understand what's on screen (element types, coordinates) for automation
  • Options considered:

    1. AT-SPI only - structured data, precise coords, but requires system config + app compliance
    2. Vision model only - universal coverage, but coordinate precision unknown
    3. Hybrid - benchmark both, use appropriately
  • Rationale: Both paths have tradeoffs; measuring both allows informed decisions per use case (see the dispatch sketch after this list)
  • Impact: Created parallel benchmark tasks, AT-SPI becomes opt-in system config
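To make the hybrid concrete, here is a minimal dispatch sketch. locate_via_atspi and locate_via_vision are hypothetical helpers (no such APIs exist yet); the fallback order reflects the tradeoffs above:

from typing import Optional, Tuple

def locate_via_atspi(label: str) -> Optional[Tuple[int, int]]:
    """Hypothetical: query the AT-SPI tree by accessible name.
    Returns screen coords, or None when the bus is down or the
    app exposes no tree."""
    ...

def locate_via_vision(label: str, screenshot: str) -> Optional[Tuple[int, int]]:
    """Hypothetical: ask a vision model to predict coords from a
    screenshot. Precision is exactly what the benchmark measures."""
    ...

def locate(label: str, screenshot: str) -> Optional[Tuple[int, int]]:
    # Prefer AT-SPI: structured data and precise coordinates.
    coords = locate_via_atspi(label)
    if coords is not None:
        return coords
    # Fall back to vision: universal coverage, unknown precision.
    return locate_via_vision(label, screenshot)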

Decision 2: Boot-time option for AT-SPI

  • Context: AT-SPI adds runtime overhead to all GTK/Qt apps
  • Options considered:

    1. Always-on accessibility
    2. Boot-time toggle
    3. Session-level toggle
  • Rationale: Overhead only worth paying when automation is needed
  • Impact: Documented in skills-pdg as future implementation direction

Decision 3: Frame as "the seeing problem"

  • Context: Needed clear terminology for the capability gap
  • Rationale: "Seeing" captures the essence - understanding what's visible on screen
  • Impact: Epic and tasks now organized around this framing; led to discovering complementary "acting problem"

Problems & Solutions

| Problem                                                             | Solution                                                              | Learning                                               |
|---------------------------------------------------------------------|-----------------------------------------------------------------------|--------------------------------------------------------|
| AT-SPI test crashed with "AT-SPI: Couldn't connect to accessibility bus" | AT-SPI needs to be running at session start; can't spin up ad hoc    | AT-SPI is session-level, not on-demand                 |
| Started fresh AT-SPI bus, showed 0 apps                             | Existing apps connected to D-Bus at their startup; they don't see the new bus | Apps must start AFTER AT-SPI is enabled        |
| bd sync failed with worktree error                                  | Manually committed beads files with git add/commit                    | bd sync has edge cases; manual fallback works          |
| Couldn't test vision model coordinate predictions                   | No mouse click tool installed (ydotool missing)                       | "Seeing" is only half the problem; "acting" matters too |

Technical Details

Code Changes

  • Total files modified: 4
  • Key files changed:

    • skills/orch/SKILL.md - Removed cd ~/proj/orch && uv run prefix from all examples
    • skills/orch/README.md - Updated prerequisites and examples
  • Beads changes:

    • .beads/issues.jsonl - Added skills-kg7 (epic), skills-pdg, skills-ebl, skills-bww

Commands Used

# Check AT-SPI status
gsettings get org.gnome.desktop.interface toolkit-accessibility
# Result: gsettings not available

# Check if AT-SPI bus running
pgrep -a at-spi
dbus-send --session --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep -i access
# Result: Nothing running

# Check environment variables
echo "NO_AT_BRIDGE=$NO_AT_BRIDGE"  # = 1
echo "GTK_A11Y=$GTK_A11Y"          # = none

# Test AT-SPI in nix-shell (failed - bus not connected)
nix-shell -p at-spi2-core python312Packages.pyatspi python312Packages.pygobject3 --run 'python3 /tmp/test_atspi.py'

# Start AT-SPI bus manually (worked, but 0 apps)
/nix/store/.../at-spi-bus-launcher &
# Output: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
# Desktop has 0 accessible apps

# Check uinput access for ydotool
ls -la /dev/uinput
# crw-rw---- 1 root uinput 10, 223 Dec 16 11:52 /dev/uinput
groups | grep -wE 'input|uinput'
# User in both input and uinput groups - ydotool would work

# Capture window for vision benchmark
niri msg action screenshot-window --id 10 --write-to-disk true

# Check available input tools
which ydotool  # not installed
which wtype    # /etc/profiles/per-user/dan/bin/wtype (keyboard only)

Architecture Notes

The Seeing Problem - Layered Model

┌─────────────────────────────────────────────────────────────┐
│                     "Seeing" Layers                         │
├─────────────────────────────────────────────────────────────┤
│ 4. Understanding: "What can I click? Where's the submit?"   │
│    → Vision model + prompt engineering                      │
│    → AT-SPI UI tree (if enabled)                           │
├─────────────────────────────────────────────────────────────┤
│ 3. Pixel capture: Screenshots                               │
│    → niri screenshot-window ✅                              │
├─────────────────────────────────────────────────────────────┤
│ 2. Window geometry: Size, position, monitor                 │
│    → niri windows/focused-window ✅                         │
├─────────────────────────────────────────────────────────────┤
│ 1. Window awareness: What's open, app_id, title             │
│    → niri windows ✅                                        │
└─────────────────────────────────────────────────────────────┘
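A minimal sketch gluing layers 1-3 together from Python. It assumes niri msg accepts a --json flag (true for current niri, but worth verifying) and reuses the screenshot action recorded under Commands Used:

import json
import subprocess

def niri_json(*args: str):
    # Layers 1-2: window awareness and geometry via compositor IPC.
    out = subprocess.run(["niri", "msg", "--json", *args],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

win = niri_json("focused-window")
width, height = win["layout"]["window_size"]
print(f"{win['app_id']} ('{win['title']}'): {width}x{height}")

# Layer 3: pixel capture of that window (same action as recorded above).
subprocess.run(["niri", "msg", "action", "screenshot-window",
                "--id", str(win["id"]), "--write-to-disk", "true"], check=True)

Layer 4 (understanding) then consumes either the screenshot or, once enabled, the AT-SPI tree.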

Wayland Security Model Impact

  • X11 allowed any app to read keystrokes, inject input, and screenshot anything
  • Wayland blocks all of this by design (security feature)
  • Workarounds require:

    • Compositor-specific IPC (niri msg)
    • Kernel bypass for input (ydotool → uinput)
    • App opt-in for UI tree (AT-SPI)
    • Portal permissions for screenshots

NixOS AT-SPI Configuration

# To enable AT-SPI
services.gnome.at-spi2-core.enable = true;
# For Qt apps also need:
# environment.variables.QT_LINUX_ACCESSIBILITY_ALWAYS_ON = "1";
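After rebuilding and re-logging in, apps must be launched fresh to register on the accessibility bus (per the timing issue in Problems & Solutions); the AT-SPI test script in Raw Notes should then report a nonzero app count.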

Process and Workflow

What Worked Well

  • Beads workflow for tracking exploration as issues
  • Incremental investigation - check what exists before planning
  • Framing the problem clearly ("seeing problem") helped structure thinking
  • Quick vision benchmark experiment provided concrete data

What Was Challenging

  • AT-SPI complexity - session vs on-demand, app registration timing
  • NixOS options discovery - no easy way to search for accessibility options
  • bd sync worktree error - had to fall back to manual git

Learning and Insights

Technical Insights

  • AT-SPI is fundamentally session-level - apps register at startup and can't be attached retroactively
  • NO_AT_BRIDGE=1 and GTK_A11Y=none are NixOS defaults when AT-SPI not enabled
  • ydotool uses kernel uinput, completely separate from AT-SPI
  • niri IPC provides rich window metadata including exact pixel dimensions
  • wtype exists for keyboard input on Wayland, but no mouse equivalent is installed

Process Insights

  • "OCR is redundant with vision models" - good insight from user that reframed AT-SPI's value
  • AT-SPI value is semantic structure + precise coordinates, not text extraction
  • Benchmark-driven approach better than picking one path upfront

Architectural Insights

  • Desktop automation on Wayland requires multiple independent pieces:

    • Compositor IPC (window management)
    • Kernel uinput (input injection)
    • D-Bus accessibility (UI tree)
    • Portal protocols (screenshots, permissions)
  • No unified "Playwright for desktop" exists because each piece has different requirements

Context for Future Work

Open Questions

  • What coordinate precision can vision models achieve? (benchmark incomplete - need clicking)
  • Which apps actually expose useful AT-SPI data?
  • What's the runtime overhead of AT-SPI in practice?
  • Should AT-SPI be boot-time toggle or always-available?

Next Steps

  • Create ydotool setup task (enables clicking to verify vision coordinates)
  • Enable AT-SPI in NixOS config (when ready to test)
  • Complete vision model benchmark with actual coordinate verification
  • Document AT-SPI coverage per app (Firefox, Ghostty, etc.)

Related Work

  • 2025-11-08: Invisible Window Capture - niri screenshot capability
  • 2025-11-08: Screenshot Analysis - earlier wayland capture research
  • skills-kg7: Desktop automation for Wayland/niri (epic)
  • skills-pdg: Enable AT-SPI for UI tree access
  • skills-ebl: Benchmark vision model UI understanding
  • skills-bww: Benchmark AT-SPI overhead and coverage

Raw Notes

Vision Model Benchmark Experiment

Captured Firefox ChatGPT window (1280x1408px). Made coordinate predictions:

| Element              | Predicted X | Predicted Y | Confidence |
|----------------------|-------------|-------------|------------|
| "New chat" button    |         100 |         176 | High       |
| "Search chats"       |         112 |         218 | High       |
| "Ask anything" input |         680 |         500 | Medium     |
| "Thinking" dropdown  |         465 |         555 | Medium     |
| Microphone icon      |         927 |         555 | Medium     |
| Black voice button   |         979 |         555 | High       |
| "+" button (attach)  |         377 |         555 | Medium     |
| User profile         |          90 |         981 | High       |

Could not verify via clicking - no mouse input tool available.
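Once ydotool lands, verification could look roughly like the sketch below. Two assumptions to check: the predictions above are window-relative, so the window's absolute position (from niri) must be added first, and the 0xC0 click code (left press+release) follows the ydotool ≥1.0 convention:

import subprocess

# Assumption: window origin on screen, e.g. from niri window/output info.
win_x, win_y = 0, 0

# A couple of the high-confidence predictions from the table above.
predictions = {"New chat": (100, 176), "Search chats": (112, 218)}

for label, (x, y) in predictions.items():
    # Move the virtual pointer to the absolute position, then left-click.
    subprocess.run(["ydotool", "mousemove", "--absolute",
                    "-x", str(win_x + x), "-y", str(win_y + y)], check=True)
    subprocess.run(["ydotool", "click", "0xC0"], check=True)
    input(f"Did that land on '{label}'? (Enter to continue) ")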

niri IPC Data Available

{
  "id": 5,
  "title": "✳ Boot-time Option",
  "app_id": "com.mitchellh.ghostty",
  "pid": 11854,
  "workspace_id": 3,
  "is_focused": true,
  "layout": {
    "tile_size": [1280.0, 1408.0],
    "window_size": [1280, 1408]
  }
}

Also available: multi-monitor info with logical positions, scale factors, modes.
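Since window_size is the window's exact pixel size, screenshot-window output should match it 1:1, which is what lets window-relative vision predictions be mapped back onto the capture.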

AT-SPI Test Script

import gi
gi.require_version("Atspi", "2.0")
from gi.repository import Atspi

# Desktop 0 is the root of the accessibility tree; every app that
# registered with the AT-SPI bus at startup appears as a direct child.
desktop = Atspi.get_desktop(0)
count = desktop.get_child_count()
print(f"Desktop has {count} accessible apps")

for i in range(count):
    app = desktop.get_child_at_index(i)
    if app:
        name = app.get_name() or "(unnamed)"
        print(f"  - {name}")

Session Metrics

  • Commits made: 4
  • Files touched: 4 (2 orch docs, 2 beads files)
  • Lines added/removed: ~+1536/-25 (mostly beads JSON)
  • Tests added: 0
  • Tests passing: N/A
  • Beads issues created: 4 (1 epic, 3 tasks)
  • Beads issues closed: 1 (skills-d87)