#+TITLE: Vision Benchmark, Orch Patterns, and Claude Search Prototype
#+DATE: 2025-12-29
#+KEYWORDS: vision-benchmark, orch-skill, claude-search, lsp-research, issue-triage, hooks
#+COMMITS: 5
#+COMPRESSION_STATUS: uncompressed
* Session Summary
** Date: 2025-12-29 (Evening session, continuation from morning)
** Focus Area: Issue triage, research tasks, and prototype development
* Accomplishments
- [X] Completed vision model UI understanding benchmark (skills-ebl)
- [X] Added conversational patterns documentation to orch skill (skills-8d9)
- [X] Investigated Claude Code hooks for parallel orch queries (skills-x2l)
- [X] Built claude-search prototype for conversation history search (skills-6e3)
- [X] Verified ARGS variable usage - closed invalid issue (skills-8cc)
- [X] Evaluated shared logging library need - closed as not worth it (skills-r5c)
- [X] Created dotfiles-0l3 for AT-SPI enablement (cross-repo tracking)
- [X] Researched Claude Code LSP integration architecture
* Key Decisions
** Decision 1: Vision models best used in hybrid approach with AT-SPI
- Context: Benchmarking how well vision models understand UI screenshots
- Test cases: Element location, identification, state detection, text extraction, layout
- Findings:
  1. Excellent at text extraction and semantic understanding
  2. Approximate coordinates only (regions, not pixels)
  3. Good state detection (filled/empty, checked/unchecked)
- Rationale: Vision excels at semantics, AT-SPI provides precision
- Impact: Desktop automation should use hybrid approach
** Decision 2: Hooks unsuitable for automatic parallel orch queries
- Context: Could hooks automatically spin off orch queries?
- Research findings:
  1. Hooks are synchronous and blocking (60s timeout)
  2. Cannot run true background processes
  3. Would block Claude agent while waiting
- Rationale: orch consensus already runs models in parallel internally
- Impact: Explicit skill invocation is the right pattern, not hooks
** Decision 3: Claude search - extract existing summaries, don't generate
- Context: Making conversation history searchable
- Discovery: Claude Code JSONL files already contain summary entries
- Approach: Index existing summaries + metadata vs generating new ones
- Impact: Simple grep-based search works well, AI summarization unnecessary
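The extract-then-grep approach can be sketched as follows. This is an illustration under assumptions, not the code in =bin/claude-search=: the field names =type= and =summary= are guesses at the JSONL layout, and =extract_summaries= is a hypothetical helper.

#+begin_src bash
# Sketch only: the "type" and "summary" field names are assumed, and
# extract_summaries is a hypothetical helper, not the real claude-search code.
extract_summaries() {
  # $1: path to one session .jsonl file.
  # Emits one compact JSON object per summary entry, so the output is valid JSONL.
  jq -c --arg file "$1" \
    'select(.type == "summary") | {file: $file, summary: .summary}' "$1"
}

# Demo on a throwaway session file rather than real conversation history.
demo=$(mktemp)
printf '%s\n' \
  '{"type":"user","message":"hi"}' \
  '{"type":"summary","summary":"Debugging wayland screenshots"}' > "$demo"

# Plain grep over the extracted summaries is the whole "search engine".
extract_summaries "$demo" | grep -i "wayland"
rm -f "$demo"
#+end_src

Because the summaries already exist in the files, the index step is a filter rather than a model call, which keeps rebuilds cheap.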
** Decision 4: Shared logging library not worth the overhead
- Context: Issue claimed duplicated logging functions across scripts
- Investigation: Only 2 files have logging, with different styles
- Rationale: Shared library adds complexity without proportional benefit
- Impact: Keep logging inline in each script
** Decision 5: Cross-repo work should be filed as separate beads
- Context: AT-SPI enablement requires dotfiles NixOS changes
- User feedback: "modifying system config is out-of-scope, file a beads issue"
- Action: Created dotfiles-0l3, reverted direct changes
- Impact: Better tracking of cross-repo dependencies
* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
| JSONL index was pretty-printed instead of one object per line | Changed =jq -n= to =jq -cn= for compact output | JSONL requires one compact JSON object per line |
| skills-8cc claimed ARGS was unused | Verified with grep - ARGS used on lines 8, 58, 64 | Always verify before removing "dead code" |
| Hooks research needed official docs | Used claude-code-guide agent for accurate info | Agent has access to official documentation |
| AT-SPI work crossed repo boundaries | Created dotfiles-0l3 issue instead of direct edit | Cross-repo dependencies need proper tracking |
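The JSONL fix in the first row comes down to one flag. A minimal demonstration, using made-up field names (=session=, =messages=) rather than the real index schema:

#+begin_src bash
# jq pretty-prints by default, so a single object spans several lines
# and the result is not valid JSONL.
jq -n --arg id "abc" '{session: $id, messages: 3}' | wc -l

# -c (compact output) emits each object on exactly one line.
jq -cn --arg id "abc" '{session: $id, messages: 3}' | wc -l
#+end_src

The first command reports more than one line for a single object, while the second reports exactly one.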
* Technical Details
** Code Changes
- Total files modified: 4
- Key files changed:
  - =skills/orch/SKILL.md= - Added conversational patterns section (~95 lines)
  - =bin/claude-search= - New prototype script (93 lines)
  - =docs/research/vision-ui-benchmark-2025-12-29.md= - Benchmark results
** New Files Created
- =bin/claude-search= - Indexes and searches Claude Code conversation history
- =docs/research/vision-ui-benchmark-2025-12-29.md= - Vision benchmark findings
** Commands Used
#+begin_src bash
# Vision benchmark - capture screenshots
niri msg action screenshot-window --id 44
cd ~/proj/skills/skills/playwright-visit && nix develop --command ./scripts/visit.py screenshot "https://github.com" /tmp/ui-test-github.png --wait 2000
# Build claude-search index
~/proj/skills/bin/claude-search --rebuild
# Output: Index built: 122 sessions
# Search conversation history
~/proj/skills/bin/claude-search "wayland"
# Returns matching sessions with date, project, message count
# Check beads status
bd ready
bd show <id>
bd close <id> --reason="..."
#+end_src
** Architecture Notes
- Claude Code stores conversations in =~/.claude/projects/<project>/<uuid>.jsonl=
- JSONL contains message types: user, assistant, summary, file-history-snapshot, queue-operation
- Summary entries already exist - no need to generate them
- Claude Code LSP is plugin-based, not automatic discovery
- User has pyright-lsp and gopls-lsp plugins installed
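Given the =<project>/<uuid>.jsonl= layout above, session and project counts fall out of two =find= calls. A rough sketch, with hypothetical helper names and a temporary directory standing in for =~/.claude/projects=:

#+begin_src bash
# Hypothetical helpers; they rely only on the <root>/<project>/<uuid>.jsonl layout.
count_sessions() {
  # Session files sit exactly two levels below the root.
  find "$1" -mindepth 2 -maxdepth 2 -type f -name '*.jsonl' | wc -l | tr -d ' '
}

count_projects() {
  # Each immediate subdirectory of the root is one project.
  find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Demo against a throwaway tree instead of the real history directory.
root=$(mktemp -d)
mkdir -p "$root/proj-a" "$root/proj-b"
touch "$root/proj-a/1111.jsonl" "$root/proj-a/2222.jsonl" "$root/proj-b/3333.jsonl"
echo "sessions: $(count_sessions "$root"), projects: $(count_projects "$root")"
rm -rf "$root"
#+end_src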
* Process and Workflow
** What Worked Well
- Rapid issue triage - several closed as invalid or not worth doing
- Research-first approach for hooks and LSP questions
- Using claude-code-guide agent for accurate Claude Code documentation
- User feedback loop on cross-repo work (file issues vs direct changes)
** What Was Challenging
- =extract-metrics.sh= failed with exit code 141 (128 + 13, i.e. a process killed by SIGPIPE)
- Initial JSONL indexing produced pretty-printed JSON instead of JSONL
- Remote git server still down - 52 commits queued
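For reference, exit status 141 usually arises when a pipeline's reader exits before its writer finishes. The snippet below is a generic reproduction in bash, not a diagnosis of the script itself, since the session did not record which pipeline failed; =sigpipe_status= is a hypothetical name.

#+begin_src bash
# Generic illustration of a SIGPIPE exit status (bash-specific: uses PIPESTATUS).
sigpipe_status() {
  # head exits after one line; the next write by yes hits a closed pipe,
  # so yes is killed by SIGPIPE (signal 13) and bash reports 128 + 13.
  yes | head -n 1 > /dev/null
  echo "${PIPESTATUS[0]}"   # exit status of yes, the writer side
}

sigpipe_status   # typically 141 when SIGPIPE has its default disposition
#+end_src

Under =set -o pipefail= that writer status becomes the status of the whole pipeline, which is one common way a script ends up exiting 141.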
* Learning and Insights
** Technical Insights
- Vision models give approximate coordinates, not pixel-precise
- Claude Code hooks are synchronous with 60s timeout - not for background work
- Claude Code JSONL already has summary entries from compaction
- LSP in Claude Code requires explicit plugin installation via /plugin
** Process Insights
- Invalid issues happen - verify before acting
- "Not worth doing" is a valid close reason for premature optimization
- Cross-repo work needs proper issue tracking in both repos
** Architectural Insights
- Hybrid AT-SPI + vision is the right desktop automation architecture
- Claude Code conversation search can be simple grep on indexed summaries
- Plugin-based LSP means per-repo profiles (skills-0f1) may be less relevant
* Context for Future Work
** Open Questions
- When will git remote come back online? (52 commits queued)
- Should claude-search have a SessionEnd hook for auto-indexing?
- Are the LSP-related issues (jbo, hh2, e96, etc.) still relevant given plugin architecture?
** Next Steps
- Deploy dotfiles when ready (AT-SPI enablement)
- Push queued commits when remote recovers
- Consider adding full-text search to claude-search
- Review LSP issues against Claude Code's plugin-based approach
** Related Work
- [[file:2025-12-29-issue-triage-playwright-skill-implementation.org][Earlier 2025-12-29 session]] - playwright-visit, issue triage
- [[file:2025-12-28-code-review-skill-creation-worklog-cleanup.org][2025-12-28 session]] - code-review skill, orch patterns origin
* Raw Notes
- Session started with "what's ready" and worked through P2 issues
- Vision benchmark used btop (terminal UI) and GitHub homepage (web UI)
- claude-search indexes 122 sessions across 20 projects
- Orch skill now documents: sessions, cross-model dialogue, iterative refinement
- User explicitly said "no" to checking AT-SPI status - "waiting for deployment"
** Issues Closed This Session
| Issue | Title | Resolution |
|-------+-------+------------|
| skills-ebl | Benchmark vision model UI understanding | Hybrid AT-SPI+vision recommended |
| skills-8d9 | Add conversational patterns to orch skill | Added to SKILL.md |
| skills-x2l | Investigate hooks for parallel orch queries | Hooks synchronous, not suitable |
| skills-6e3 | Searchable Claude Code conversation history | Prototype complete |
| skills-8cc | Remove dead code: unused ARGS variable | Invalid - ARGS is used |
| skills-r5c | Extract shared logging library from scripts | Not worth it - minimal duplication |
* Session Metrics
- Commits made: 5 (+ 1 beads sync)
- Files touched: 4
- Lines added/removed: +290/-10 (estimated)
- Issues closed: 6
- New prototypes: 1 (claude-search)
- Research docs: 1 (vision benchmark)