beads: add seeing problem benchmark tasks

- skills-ebl: vision model UI understanding benchmark - skills-bww: AT-SPI overhead/coverage benchmark - Updated skills-kg7 epic with hybrid approach
2025-12-17 14:13:51 -08:00 · 2025-12-17 14:13:51 -08:00 · 906f2bc7ee
parent 0b971558a5
commit 906f2bc7ee
1 changed files with 3 additions and 1 deletions
--- a/.beads/issues.jsonl
+++ b/.beads/issues.jsonl
@ -21,6 +21,7 @@
 {"id":"skills-al5","title":"Consider repo-setup-verification skill","description":"The dotfiles repo has a repo-setup-prompt.md verification checklist that could become a skill.\n\n**Source**: ~/proj/dotfiles/docs/repo-setup-prompt.md\n\n**What it does**:\n- Verifies .envrc has use_api_keys and skills loading\n- Checks .skills manifest exists with appropriate skills\n- Optionally checks beads setup\n- Verifies API keys are loaded\n\n**As a skill it could**:\n- Be invoked to audit any repo's agent setup\n- Offer to fix missing pieces\n- Provide consistent onboarding for new repos\n\n**Questions**:\n- Is this better as a skill vs a slash command?\n- Should it auto-fix or just report?\n- Does it belong in skills repo or dotfiles?","status":"open","priority":2,"issue_type":"task","created_at":"2025-12-06T12:38:32.561337354-08:00","updated_at":"2025-12-06T12:38:32.561337354-08:00"}
 {"id":"skills-bcu","title":"Design doc-review skill","description":"# doc-review skill\n\nFight documentation drift with a non-interactive review process that generates patchfiles for human review.\n\n## Problem\n- No consistent documentation system across repos\n- Stale content accumulates\n- Structural inconsistencies (docs not optimized for agents)\n\n## Envisioned Workflow\n\n```bash\n# Phase 1: Generate patches (non-interactive, use spare credits, test models)\ndoc-review scan ~/proj/foo --model claude-sonnet --output /tmp/foo-patches/\n\n# Phase 2: Review patches (interactive session)\ncd ~/proj/foo\nclaude  # human reviews patches, applies selectively\n```\n\n## Design Decisions Made\n\n- **Trigger**: Manual invocation (not CI). Use case includes burning extra LLM credits, testing models repeatably.\n- **Source of truth**: Style guide embedded in prompt template. Blessed defaults, overridable per-repo.\n- **Output**: Patchfiles for human review in interactive Claude session.\n- **Chunking**: Based on absolute size, not file count. Logical chunks easy for Claude to review.\n- **Scope detection**: Graph-based discovery starting from README.md or AGENTS.md, not glob-all-markdown.\n\n## Open Design Work\n\n### Agent-Friendly Doc Conventions (needs brainstorming)\nWhat makes docs agent-readable?\n- Explicit context (no \"as mentioned above\")\n- Clear section headers for navigation\n- Self-contained sections\n- Consistent terminology\n- Front-loaded summaries\n- ???\n\n### Prompt Content\nFull design round needed on:\n- What conventions to enforce\n- How to express them in prompt\n- Examples of \"good\" vs \"bad\"\n\n### Graph-Based Discovery\nHow does traversal work?\n- Parse links from README/AGENTS.md?\n- Follow relative markdown links?\n- Depth limit?\n\n## Skill Structure (tentative)\n```\nskills/doc-review/\n├── prompt.md           # Core review instructions + style guide\n├── scan.sh             # Orchestrates: find docs → invoke claude → emit patches\n└── README.md\n```\n\n## Out of Scope (for now)\n- Cross-repo standardization (broader than skills repo)\n- CI integration\n- Auto-apply without human review","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-12-04T14:01:43.305653729-08:00","updated_at":"2025-12-04T16:44:03.468118288-08:00","closed_at":"2025-12-04T16:44:03.468118288-08:00","dependencies":[{"issue_id":"skills-bcu","depends_on_id":"skills-1ig","type":"blocks","created_at":"2025-12-04T14:02:17.144414636-08:00","created_by":"daemon"},{"issue_id":"skills-bcu","depends_on_id":"skills-53k","type":"blocks","created_at":"2025-12-04T14:02:17.164968463-08:00","created_by":"daemon"}]}
 {"id":"skills-bvz","title":"spec-review: Add Definition of Ready checklists for each phase","description":"'Ready for /speckit.plan' and similar are underspecified.\n\nAdd concrete checklists:\n- Spec ready for planning: problem statement, goals, constraints, acceptance criteria, etc.\n- Plan ready for tasks: milestones, risks, dependencies, test strategy, etc.\n- Tasks ready for bd: each task has acceptance criteria, dependencies explicit, etc.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-15T00:23:24.877531852-08:00","updated_at":"2025-12-15T14:05:26.880419097-08:00","closed_at":"2025-12-15T14:05:26.880419097-08:00"}
+{"id":"skills-bww","title":"Benchmark AT-SPI overhead and coverage","description":"## Goal\nMeasure AT-SPI's runtime overhead and coverage across apps.\n\n## Prerequisites\n- Enable `services.gnome.at-spi2-core.enable = true` in NixOS\n- Set `QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1` for Qt apps\n- Rebuild and re-login\n\n## Overhead benchmarks\n1. **Startup time**: App launch with/without AT-SPI\n2. **Memory**: RSS delta with AT-SPI enabled\n3. **CPU**: Idle CPU with AT-SPI bus running\n4. **UI latency**: Input-to-paint latency (if measurable)\n\n## Coverage audit\nFor each app, document:\n- Does it expose accessibility tree?\n- How complete is the tree? (all elements vs partial)\n- Are coordinates accurate?\n- Are element types/roles correct?\n\n### Apps to test\n- [ ] Firefox\n- [ ] Ghostty terminal\n- [ ] Nautilus/file manager\n- [ ] VS Code / Electron app\n- [ ] A Qt app (if any installed)\n\n## Query benchmarks\n- Time to enumerate all elements in a window\n- Time to find element by role/name\n- Memory overhead of pyatspi queries\n\n## Depends on\n- skills-pdg (Enable AT-SPI for UI tree access)","status":"open","priority":2,"issue_type":"task","created_at":"2025-12-17T14:13:21.599259773-08:00","updated_at":"2025-12-17T14:13:21.599259773-08:00","dependencies":[{"issue_id":"skills-bww","depends_on_id":"skills-pdg","type":"blocks","created_at":"2025-12-17T14:13:41.633210539-08:00","created_by":"daemon"}]}
 {"id":"skills-cc0","title":"spec-review: Add anti-hallucination constraints to prompts","description":"Models may paraphrase and present as quotes, or invent requirements/risks not in the doc.\n\nAdd:\n- 'Quotes must be verbatim'\n- 'Do not assume technologies/constraints not stated'\n- 'If missing info, list as open questions rather than speculating'","status":"closed","priority":3,"issue_type":"task","created_at":"2025-12-15T00:23:26.045478292-08:00","updated_at":"2025-12-15T14:07:19.556888057-08:00","closed_at":"2025-12-15T14:07:19.556888057-08:00"}
 {"id":"skills-cjx","title":"Create spec-review skill for orch + spec-kit integration","description":"A new skill that integrates orch multi-model consensus with spec-kit workflows.\n\n**Purpose**: Use different models/temps/stances to review spec-kit artifacts before phase transitions.\n\n**Proposed commands**:\n- /spec-review.spec - Critique current spec for completeness, ambiguity, gaps\n- /spec-review.plan - Evaluate architecture decisions in plan\n- /spec-review.gate - Go/no-go consensus before phase transition\n\n**Structure**:\n```\nskills/spec-review/\n├── SKILL.md\n├── commands/\n│   ├── spec.md\n│   ├── plan.md\n│   └── gate.md\n└── prompts/\n    └── ...\n```\n\n**Key design points**:\n- Finds spec/plan files from current branch or specs/ directory\n- Invokes orch with appropriate prompt, models, stances\n- Presents consensus/critique results\n- AI reviewing AI is valuable redundancy (different models/temps/stances)\n\n**Dependencies**:\n- orch CLI must be available (blocked on dotfiles-3to)\n- spec-kit project structure conventions","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-12-14T17:50:13.22879874-08:00","updated_at":"2025-12-15T00:10:23.122342449-08:00","closed_at":"2025-12-15T00:10:23.122342449-08:00"}
 {"id":"skills-cnc","title":"Add direnv helper for per-repo skill deployment","description":"Create sourceable helper script and documentation for the standard per-repo skill deployment pattern using direnv + nix build.","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-11-30T12:19:20.71056749-08:00","updated_at":"2025-11-30T12:37:47.22638278-08:00","closed_at":"2025-11-30T12:37:47.22638278-08:00"}
@ -29,11 +30,12 @@
 {"id":"skills-d87","title":"orch skill is documentation-only, needs working invocation mechanism","description":"The orch skill provides SKILL.md documentation but no working invocation mechanism.\n\n**Resolution**: Install orch globally via home-manager (dotfiles-3to). The skill documents a system tool, doesn't need to bundle it.\n\n**Blocked by**: dotfiles-3to (Add orch CLI to home-manager packages)","status":"closed","priority":2,"issue_type":"bug","created_at":"2025-12-14T11:54:03.157039164-08:00","updated_at":"2025-12-16T18:45:24.39235833-08:00","closed_at":"2025-12-16T18:45:24.39235833-08:00","close_reason":"Updated docs to use globally installed orch CLI"}
 {"id":"skills-dpw","title":"orch: add command to show available/configured models","description":"## Problem\n\nWhen trying to use orch, you have to trial-and-error through models to find which ones have API keys configured. Each failure looks like:\n\n```\nError: GEMINI_API_KEY not set. Required for Google Gemini models.\n```\n\nNo way to know upfront which models are usable.\n\n## Proposed Solution\n\nAdd `orch models` or `orch status` command:\n\n```bash\n$ orch models\nAvailable models:\n  ✓ flash      (GEMINI_API_KEY set)\n  ✓ gemini     (GEMINI_API_KEY set)\n  ✗ deepseek   (OPENROUTER_KEY not set)\n  ✗ qwen       (OPENROUTER_KEY not set)\n  ✓ gpt        (OPENAI_API_KEY set)\n```\n\nOr at minimum, on failure suggest alternatives:\n```\nError: GEMINI_API_KEY not set. Try --model gpt or --model deepseek instead.\n```\n\n## Context\n\nHit this while trying to brainstorm with high-temp gemini - had to try 4 models before realizing none were configured in this environment.","status":"closed","priority":3,"issue_type":"feature","created_at":"2025-12-04T14:10:07.069103175-08:00","updated_at":"2025-12-04T14:11:05.49122538-08:00","closed_at":"2025-12-04T14:11:05.49122538-08:00"}
 {"id":"skills-ebh","title":"Compare bd-issue-tracking skill files with upstream","description":"Fetch upstream beads skill files and compare with our condensed versions to identify differences","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-03T20:14:07.886535859-08:00","updated_at":"2025-12-03T20:19:37.579815337-08:00","closed_at":"2025-12-03T20:19:37.579815337-08:00"}
+{"id":"skills-ebl","title":"Benchmark vision model UI understanding","description":"## Goal\nMeasure how well vision models can answer UI questions from screenshots.\n\n## Test cases\n1. **Element location**: \"Where is the Save button?\" → coordinates\n2. **Element identification**: \"What buttons are visible?\" → list\n3. **State detection**: \"Is the checkbox checked?\" → boolean\n4. **Text extraction**: \"What does the error message say?\" → text\n5. **Layout understanding**: \"What's in the sidebar?\" → structure\n\n## Metrics\n- Accuracy: Does the answer match ground truth?\n- Precision: How close are coordinates to actual element centers?\n- Latency: Time from query to response\n- Cost: Tokens consumed per query\n\n## Prompt engineering questions\n- Does adding a grid overlay help coordinate precision?\n- What prompt format gives most actionable coordinates?\n- Can we get bounding boxes vs point coordinates?\n\n## Comparison baseline\n- Manual annotation of test screenshots\n- AT-SPI data (once enabled) as ground truth\n\n## Depends on\n- Test screenshots from real apps\n- Ground truth annotations","status":"open","priority":2,"issue_type":"task","created_at":"2025-12-17T14:13:10.038933798-08:00","updated_at":"2025-12-17T14:13:10.038933798-08:00"}
 {"id":"skills-fo3","title":"Compare WORKFLOWS.md with upstream","description":"","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-03T20:15:54.283175561-08:00","updated_at":"2025-12-03T20:19:28.897037199-08:00","closed_at":"2025-12-03T20:19:28.897037199-08:00","dependencies":[{"issue_id":"skills-fo3","depends_on_id":"skills-ebh","type":"discovered-from","created_at":"2025-12-03T20:15:54.286009672-08:00","created_by":"daemon"}]}
 {"id":"skills-fvx","title":"use-skills.sh: stderr from nix build corrupts symlinks when repo is dirty","description":"In use-skills.sh, the line:\n\n```bash\nout=$(nix build --print-out-paths --no-link \"${SKILLS_REPO}#${skill}\" 2\u003e\u00261) || {\n```\n\nThe `2\u003e\u00261` merges stderr into stdout. When the skills repo is dirty, nix emits a warning to stderr which gets captured into $out and used as the symlink target.\n\nResult: symlinks like:\n```\norch -\u003e warning: Git tree '/home/dan/proj/skills' is dirty\n/nix/store/j952hgxixifscafb42vmw9vgdphi1djs-ai-skill-orch\n```\n\nFix: redirect stderr to /dev/null or filter it out before creating symlink.","status":"closed","priority":1,"issue_type":"bug","created_at":"2025-12-14T11:54:03.06502295-08:00","updated_at":"2025-12-14T11:59:25.472044754-08:00","closed_at":"2025-12-14T11:59:25.472044754-08:00"}
 {"id":"skills-gas","title":"spec-review: File discovery is brittle, can pick wrong file silently","description":"The fallback `find ... | head -1` is non-deterministic and can select wrong spec/plan/tasks file without user noticing. Branch names with `/` also break path construction.\n\nFixes:\n- Fail fast if expected file missing\n- Print chosen file path before proceeding\n- Require explicit confirmation if falling back\n- Handle branch names with slashes","status":"closed","priority":2,"issue_type":"bug","created_at":"2025-12-15T00:23:22.762045913-08:00","updated_at":"2025-12-15T00:46:16.231199231-08:00","closed_at":"2025-12-15T00:46:16.231199231-08:00"}
 {"id":"skills-h9f","title":"spec-review: Balance negativity bias in prompts","description":"'Be critical' and 'devil's advocate' can bias toward over-flagging without acknowledging what's good.\n\nAdd:\n- 'List top 3 strongest parts of the document'\n- 'Call out where document is sufficiently clear/testable'\n- Categorize :against concerns as confirmed/plausible/rejected","status":"closed","priority":3,"issue_type":"task","created_at":"2025-12-15T00:23:26.418087998-08:00","updated_at":"2025-12-15T14:07:39.520818417-08:00","closed_at":"2025-12-15T14:07:39.520818417-08:00"}
-{"id":"skills-kg7","title":"Desktop automation for Wayland/niri","description":"Explore and implement desktop automation solutions for Wayland (niri compositor).\n\n## First goal: Seeing what's on screen\n\n### Already have (niri-window-capture skill)\n- ✅ Window metadata: list windows, titles, app_id, workspace, position\n- ✅ Screenshots: capture any window invisibly, even from other workspaces\n\n### Missing pieces\n- ❌ **UI tree (AT-SPI)**: semantic structure - buttons, labels, text fields, menus\n  - Works for GTK/Qt apps that expose accessibility\n  - Electron apps often poor, custom toolkits nothing\n  - Query: \"what buttons are visible?\", \"where is the Save button?\"\n  \n- ❌ **OCR fallback**: extract text from screenshots when AT-SPI unavailable\n  - tesseract for local OCR\n  - vision LLM (Claude) for understanding layout + text\n  - Needed for apps that don't expose AT-SPI\n\n### Levels of \"seeing\"\n1. **Window awareness**: what's open, where (have this)\n2. **Pixel capture**: screenshots (have this)\n3. **Semantic understanding**: UI elements, their roles, states (need AT-SPI)\n4. **Text extraction**: OCR when semantic not available (need OCR)\n\n## Context\nWayland's security model intentionally blocks X11-style global automation. Solutions require:\n- Privileged daemons (uinput for input injection)\n- Compositor-specific IPC (niri msg)\n- App opt-in via AT-SPI accessibility\n- Portal permissions for screenshots\n\n## Available building blocks\n- **niri IPC**: window management, screenshots, workspace control\n- **ydotool**: kernel uinput input injection (keyboard/mouse)\n- **AT-SPI/dogtail**: app UI introspection (GTK/Qt apps that expose accessibility)\n- **xdg-desktop-portal**: screenshot permissions\n\n## Key questions\n- Which apps do we care about? (GTK, Qt, Electron, web?)\n- Is AT-SPI sufficient for our target apps?\n- When to use OCR vs AT-SPI?\n- How to combine screenshot + AT-SPI for element location?\n\n## Architecture sketch\n```\n┌─────────────────────────────────────────────┐\n│             User/Agent Scripts              │\n├─────────────────────────────────────────────┤\n│            Unified Python API?              │\n├──────────┬──────────┬──────────┬────────────┤\n│ niri IPC │ ydotool  │  AT-SPI  │  OCR       │\n│ (windows)│ (input)  │ (UI tree)│ (fallback) │\n├──────────┴──────────┴──────────┴────────────┤\n│     Wayland compositor / kernel / D-Bus     │\n└─────────────────────────────────────────────┘\n```\n\n## Approach\nEngineer solutions based on specific team needs as they arise, rather than building a monolithic framework upfront.","status":"open","priority":2,"issue_type":"epic","created_at":"2025-12-17T12:42:17.863074501-08:00","updated_at":"2025-12-17T12:56:08.188860952-08:00","dependencies":[{"issue_id":"skills-kg7","depends_on_id":"skills-pdg","type":"blocks","created_at":"2025-12-17T13:59:59.915105471-08:00","created_by":"daemon"}]}
+{"id":"skills-kg7","title":"Desktop automation for Wayland/niri","description":"Explore and implement desktop automation solutions for Wayland (niri compositor).\n\n## The Seeing Problem\n\nHow can an AI agent understand what's on screen?\n\n### Layers (bottom to top)\n1. **Window awareness** - what's open, app_id, title → niri IPC ✅\n2. **Window geometry** - size, position, monitor → niri IPC ✅\n3. **Pixel capture** - screenshots → niri screenshot-window ✅\n4. **Understanding** - UI elements, coordinates, semantics → **GAP**\n\n### Two paths to understanding (layer 4)\n\n**Path A: AT-SPI (structured)**\n- Pros: precise coordinates, semantic element types, states\n- Cons: runtime overhead, requires NixOS config, app compliance varies\n- Coverage: GTK ✅, Qt (with env var), Electron ❓\n\n**Path B: Vision model (visual)**\n- Pros: universal (works on any pixels), no system config needed\n- Cons: API latency, token cost, coordinate precision unclear\n- Coverage: anything visible ✅\n\n### Hybrid approach\nRun both, benchmark tradeoffs:\n- AT-SPI overhead vs query precision\n- Vision model latency/cost vs coverage\n- When to use which (or both for validation)\n\n## What we have\n- niri-window-capture skill (screenshots, window list)\n- niri IPC for window/monitor geometry\n\n## Context\nWayland's security model blocks X11-style automation. Solutions require:\n- Compositor-specific IPC (niri msg)\n- App opt-in via AT-SPI accessibility\n- Or vision model interpretation of pixels","status":"open","priority":2,"issue_type":"epic","created_at":"2025-12-17T12:42:17.863074501-08:00","updated_at":"2025-12-17T14:12:59.143207802-08:00","dependencies":[{"issue_id":"skills-kg7","depends_on_id":"skills-pdg","type":"blocks","created_at":"2025-12-17T13:59:59.915105471-08:00","created_by":"daemon"},{"issue_id":"skills-kg7","depends_on_id":"skills-ebl","type":"blocks","created_at":"2025-12-17T14:13:41.679692009-08:00","created_by":"daemon"},{"issue_id":"skills-kg7","depends_on_id":"skills-bww","type":"blocks","created_at":"2025-12-17T14:13:41.725196677-08:00","created_by":"daemon"}]}
 {"id":"skills-kmj","title":"Orch skill: document or handle orch not in PATH","description":"Skill docs show 'orch consensus' but orch requires 'uv run' from ~/proj/orch. Either update skill to invoke correctly or document installation requirement.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-01T17:29:48.844997238-08:00","updated_at":"2025-12-01T18:28:11.374048504-08:00","closed_at":"2025-12-01T18:28:11.374048504-08:00"}
 {"id":"skills-lie","title":"Compare DEPENDENCIES.md with upstream","description":"","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-03T20:15:53.925914243-08:00","updated_at":"2025-12-03T20:19:28.665641809-08:00","closed_at":"2025-12-03T20:19:28.665641809-08:00","dependencies":[{"issue_id":"skills-lie","depends_on_id":"skills-ebh","type":"discovered-from","created_at":"2025-12-03T20:15:53.9275694-08:00","created_by":"daemon"}]}
 {"id":"skills-lvg","title":"Compare ISSUE_CREATION.md with upstream","description":"","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-03T20:15:54.609282051-08:00","updated_at":"2025-12-03T20:19:29.134966356-08:00","closed_at":"2025-12-03T20:19:29.134966356-08:00","dependencies":[{"issue_id":"skills-lvg","depends_on_id":"skills-ebh","type":"discovered-from","created_at":"2025-12-03T20:15:54.610717055-08:00","created_by":"daemon"}]}