Vision Benchmark, Orch Patterns, and Claude Search Prototype
- Session Summary
- Accomplishments
- Key Decisions
- Decision 1: Vision models best used in hybrid approach with AT-SPI
- Decision 2: Hooks unsuitable for automatic parallel orch queries
- Decision 3: Claude search - extract existing summaries, don't generate
- Decision 4: Shared logging library not worth the overhead
- Decision 5: Cross-repo work should be filed as separate beads
- Problems & Solutions
- Technical Details
- Process and Workflow
- Learning and Insights
- Context for Future Work
- Raw Notes
- Session Metrics
Session Summary
Date: 2025-12-29 (Evening session, continuation from morning)
Focus Area: Issue triage, research tasks, and prototype development
Accomplishments
- Completed vision model UI understanding benchmark (skills-ebl)
- Added conversational patterns documentation to orch skill (skills-8d9)
- Investigated Claude Code hooks for parallel orch queries (skills-x2l)
- Built claude-search prototype for conversation history search (skills-6e3)
- Verified ARGS variable usage - closed invalid issue (skills-8cc)
- Evaluated shared logging library need - closed as not worth it (skills-r5c)
- Created dotfiles-0l3 for AT-SPI enablement (cross-repo tracking)
- Researched Claude Code LSP integration architecture
Key Decisions
Decision 1: Vision models best used in hybrid approach with AT-SPI
- Context: Benchmarking how well vision models understand UI screenshots
- Test cases: Element location, identification, state detection, text extraction, layout
- Findings:
  - Excellent at text extraction and semantic understanding
  - Approximate coordinates only (regions, not pixels)
  - Good state detection (filled/empty, checked/unchecked)
- Rationale: Vision excels at semantics, AT-SPI provides precision
- Impact: Desktop automation should use hybrid approach
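One way to picture the hybrid split is a small resolver: the vision model supplies a label and a rough region, and the accessibility tree supplies exact element bounds. The sketch below is hypothetical. `Box`, `Element`, and `resolve` are illustrative names, and the element list is hand-written stand-in data rather than output from a real AT-SPI query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in screen pixels."""
    x: int
    y: int
    w: int
    h: int

    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)

    def contains(self, point):
        px, py = point
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


@dataclass(frozen=True)
class Element:
    """Stand-in for one accessibility-tree node (hypothetical shape)."""
    name: str
    role: str
    box: Box


def resolve(label, region, elements):
    """Pick the element whose name matches the vision label and whose
    center lies inside the approximate region the vision model reported."""
    candidates = [
        e for e in elements
        if label.lower() in e.name.lower() and region.contains(e.box.center())
    ]
    if not candidates:
        return None
    # Prefer the smallest match: it is most likely the actual control.
    return min(candidates, key=lambda e: e.box.w * e.box.h)


# Illustrative data only: a vision hint plus two accessibility elements.
elements = [
    Element("Sign in", "push button", Box(900, 20, 80, 32)),
    Element("Sign in to GitHub", "heading", Box(300, 400, 400, 60)),
]
vision_region = Box(850, 0, 200, 100)  # "top-right corner", roughly

target = resolve("sign in", vision_region, elements)
click_point = target.box.center()
```

Here the vision side only has to be right about "which thing, roughly where"; the pixel-precise click point comes from the element bounds.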
Decision 2: Hooks unsuitable for automatic parallel orch queries
- Context: Could hooks automatically spin off orch queries?
- Research findings:
  - Hooks are synchronous and blocking (60s timeout)
  - Cannot run true background processes
  - Would block Claude agent while waiting
- Rationale: orch consensus already runs models in parallel internally
- Impact: Explicit skill invocation is the right pattern, not hooks
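To illustrate why an explicit call is enough, here is a minimal sketch of fanning one prompt out to several models in parallel inside a single invocation. It is not orch's actual implementation: `ask`, `consensus`, and the model names are hypothetical, and the model call is simulated with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def ask(model, prompt):
    """Hypothetical model call, simulated with a short delay."""
    time.sleep(0.1)
    return f"{model}: answer to {prompt!r}"


def consensus(prompt, models):
    """Query every model concurrently and return answers in model order."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(ask, model, prompt) for model in models]
        return [future.result() for future in futures]


answers = consensus("is this schema sound?", ["model-a", "model-b", "model-c"])
for line in answers:
    print(line)
```

Because the fan-out happens inside one explicit invocation, the caller waits roughly as long as the slowest model, and no background process has to outlive a hook.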
Decision 3: Claude search - extract existing summaries, don't generate
- Context: Making conversation history searchable
- Discovery: Claude Code JSONL files already contain summary entries
- Approach: Index existing summaries and metadata instead of generating new ones
- Impact: Simple grep-based search works well, AI summarization unnecessary
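The grep-style search can be sketched as a case-insensitive substring match over indexed summaries. This is an assumption-laden illustration, not the claude-search script itself: the index row shape (`date`, `project`, `messages`, `summaries`) and the sample rows are made up for the example.

```python
def search_index(index, query):
    """Return index rows whose summaries contain the query, newest first."""
    needle = query.lower()
    hits = [
        row for row in index
        if needle in " ".join(row["summaries"]).lower()
    ]
    # ISO dates sort correctly as plain strings.
    return sorted(hits, key=lambda row: row["date"], reverse=True)


# Illustrative index rows (hypothetical data).
index = [
    {"date": "2025-12-20", "project": "dotfiles", "messages": 42,
     "summaries": ["Niri Wayland compositor keybindings"]},
    {"date": "2025-12-29", "project": "skills", "messages": 118,
     "summaries": ["Vision benchmark on wayland screenshots",
                   "claude-search prototype"]},
    {"date": "2025-12-28", "project": "skills", "messages": 64,
     "summaries": ["code-review skill"]},
]

hits = search_index(index, "wayland")
for row in hits:
    print(row["date"], row["project"], row["messages"])
```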
Decision 4: Shared logging library not worth the overhead
- Context: Issue claimed duplicated logging functions across scripts
- Investigation: Only 2 files have logging, with different styles
- Rationale: Shared library adds complexity without proportional benefit
- Impact: Keep logging inline in each script
Decision 5: Cross-repo work should be filed as separate beads
- Context: AT-SPI enablement requires dotfiles NixOS changes
- User feedback: "modifying system config is out-of-scope, file a beads issue"
- Action: Created dotfiles-0l3, reverted direct changes
- Impact: Better tracking of cross-repo dependencies
Problems & Solutions
| Problem | Solution | Learning |
|---|---|---|
| JSONL index was pretty-printed, not one-per-line | Changed jq -n to jq -cn for compact output | JSONL requires compact JSON per line |
| skills-8cc claimed ARGS was unused | Verified with grep - ARGS used on lines 8, 58, 64 | Always verify before removing "dead code" |
| Hooks research needed official docs | Used claude-code-guide agent for accurate info | Agent has access to official documentation |
| AT-SPI work crossed repo boundaries | Created dotfiles-0l3 issue instead of direct edit | Cross-repo dependencies need proper tracking |
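The JSONL problem in the first row is easy to reproduce outside jq. The sketch below uses Python's `json` module as a stand-in for the `jq -n` versus `jq -cn` difference; the records are made-up sample data.

```python
import json

# Illustrative records (hypothetical data).
records = [
    {"session": "a1", "summary": "wayland screenshot debugging"},
    {"session": "b2", "summary": "orch consensus patterns"},
]

# Pretty-printed output spreads each object over several lines.
pretty = "\n".join(json.dumps(r, indent=2) for r in records)
# Compact output keeps exactly one object per line, which JSONL requires.
compact = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)


def parse_jsonl(text):
    """Parse text as JSONL: one complete JSON value per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


parsed = parse_jsonl(compact)

try:
    parse_jsonl(pretty)
    pretty_ok = True
except json.JSONDecodeError:
    # The first line of the pretty output is just "{", which is not valid JSON.
    pretty_ok = False
```

A line-oriented reader only works on the compact form, which is why the index build needed the compact flag.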
Technical Details
Code Changes
- Total files modified: 4
- Key files changed:
  - skills/orch/SKILL.md - Added conversational patterns section (~95 lines)
  - bin/claude-search - New prototype script (93 lines)
  - docs/research/vision-ui-benchmark-2025-12-29.md - Benchmark results
New Files Created
- bin/claude-search - Indexes and searches Claude Code conversation history
- docs/research/vision-ui-benchmark-2025-12-29.md - Vision benchmark findings
Commands Used
# Vision benchmark - capture screenshots
niri msg action screenshot-window --id 44
cd ~/proj/skills/skills/playwright-visit && nix develop --command ./scripts/visit.py screenshot "https://github.com" /tmp/ui-test-github.png --wait 2000
# Build claude-search index
~/proj/skills/bin/claude-search --rebuild
# Output: Index built: 122 sessions
# Search conversation history
~/proj/skills/bin/claude-search "wayland"
# Returns matching sessions with date, project, message count
# Check beads status
bd ready
bd show <id>
bd close <id> --reason="..."
Architecture Notes
- Claude Code stores conversations in ~/.claude/projects/<project>/<uuid>.jsonl
- JSONL contains message types: user, assistant, summary, file-history-snapshot, queue-operation
- Summary entries already exist - no need to generate them
- Claude Code LSP is plugin-based, not automatic discovery
- User has pyright-lsp and gopls-lsp plugins installed
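As a rough sketch of how the existing summary entries could be pulled out of one session file, the function below counts entries by type and collects summary text. The field names are assumptions: it presumes each entry is a JSON object with a `type` key and that summary entries carry their text under `summary`. The sample lines are invented for illustration.

```python
import json
from collections import Counter


def scan_session(lines):
    """Return (entry counts by type, summary texts) for one session's lines."""
    counts = Counter()
    summaries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        if not isinstance(entry, dict):
            continue
        kind = entry.get("type", "unknown")  # assumed field name
        counts[kind] += 1
        if kind == "summary" and isinstance(entry.get("summary"), str):
            summaries.append(entry["summary"])
    return counts, summaries


# Illustrative session content (hypothetical data).
sample = [
    '{"type": "summary", "summary": "Wayland screenshot capture with niri"}',
    '{"type": "user", "message": {"content": "what is ready?"}}',
    '{"type": "assistant", "message": {"content": "Three P2 issues."}}',
    "not json",
]

counts, summaries = scan_session(sample)
```

In practice the lines would come from opening a session's `.jsonl` file and iterating over it; the scan itself does not care where the lines originate.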
Process and Workflow
What Worked Well
- Rapid issue triage - several closed as invalid or not worth doing
- Research-first approach for hooks and LSP questions
- Using claude-code-guide agent for accurate Claude Code documentation
- User feedback loop on cross-repo work (file issues vs direct changes)
What Was Challenging
- Extract-metrics.sh failed with exit 141 (SIGPIPE)
- Initial JSONL indexing produced pretty-printed JSON instead of JSONL
- Remote git server still down - 52 commits queued
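On the exit 141 above: POSIX shells report a process killed by signal N as status 128 + N, and SIGPIPE is signal 13 on Linux. A common trigger is a downstream reader such as `head` closing the pipe early, though these notes do not record which pipeline was responsible. A small helper (the name `describe_exit_status` is illustrative) makes the decoding explicit, assuming a POSIX system:

```python
import signal


def describe_exit_status(status):
    """Translate a shell exit status into a signal name when it encodes one."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            return f"unknown signal {status - 128}"
    return f"exit code {status}"


print(describe_exit_status(141))
```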
Learning and Insights
Technical Insights
- Vision models give approximate coordinates, not pixel-precise
- Claude Code hooks are synchronous with 60s timeout - not for background work
- Claude Code JSONL already has summary entries from compaction
- LSP in Claude Code requires explicit plugin installation via /plugin
Process Insights
- Invalid issues happen - verify before acting
- "Not worth doing" is a valid close reason for premature optimization
- Cross-repo work needs proper issue tracking in both repos
Architectural Insights
- Hybrid AT-SPI + vision is the right desktop automation architecture
- Claude Code conversation search can be simple grep on indexed summaries
- Plugin-based LSP means per-repo profiles (skills-0f1) may be less relevant
Context for Future Work
Open Questions
- When will git remote come back online? (52 commits queued)
- Should claude-search have a SessionEnd hook for auto-indexing?
- Are the LSP-related issues (jbo, hh2, e96, etc.) still relevant given plugin architecture?
Next Steps
- Deploy dotfiles when ready (AT-SPI enablement)
- Push queued commits when remote recovers
- Consider adding full-text search to claude-search
- Review LSP issues against Claude Code's plugin-based approach
Related Work
- Earlier 2025-12-29 session - playwright-visit, issue triage
- 2025-12-28 session - code-review skill, orch patterns origin
Raw Notes
- Session started with "what's ready" and worked through P2 issues
- Vision benchmark used btop (terminal UI) and GitHub homepage (web UI)
- claude-search indexes 122 sessions across 20 projects
- Orch skill now documents: sessions, cross-model dialogue, iterative refinement
- User explicitly said "no" to checking AT-SPI status - "waiting for deployment"
Issues Closed This Session
| Issue | Title | Resolution |
|---|---|---|
| skills-ebl | Benchmark vision model UI understanding | Hybrid AT-SPI+vision recommended |
| skills-8d9 | Add conversational patterns to orch skill | Added to SKILL.md |
| skills-x2l | Investigate hooks for parallel orch queries | Hooks synchronous, not suitable |
| skills-6e3 | Searchable Claude Code conversation history | Prototype complete |
| skills-8cc | Remove dead code: unused ARGS variable | Invalid - ARGS is used |
| skills-r5c | Extract shared logging library from scripts | Not worth it - minimal duplication |
Session Metrics
- Commits made: 5 (+ 1 beads sync)
- Files touched: 4
- Lines added/removed: +290/-10 (estimated)
- Issues closed: 6
- New prototypes: 1 (claude-search)
- Research docs: 1 (vision benchmark)