skills/docs/worklogs/2025-12-29-vision-benchmark-orch-patterns-claude-search.org

Vision Benchmark, Orch Patterns, and Claude Search Prototype

Session Summary

Date: 2025-12-29 (Evening session, continuation from morning)

Focus Area: Issue triage, research tasks, and prototype development

Accomplishments

  • Completed vision model UI understanding benchmark (skills-ebl)
  • Added conversational patterns documentation to orch skill (skills-8d9)
  • Investigated Claude Code hooks for parallel orch queries (skills-x2l)
  • Built claude-search prototype for conversation history search (skills-6e3)
  • Verified that the ARGS variable is used and closed the issue as invalid (skills-8cc)
  • Evaluated shared logging library need - closed as not worth it (skills-r5c)
  • Created dotfiles-0l3 for AT-SPI enablement (cross-repo tracking)
  • Researched Claude Code LSP integration architecture

Key Decisions

Decision 1: Vision models best used in hybrid approach with AT-SPI

  • Context: Benchmarking how well vision models understand UI screenshots
  • Test cases: Element location, identification, state detection, text extraction, layout
  • Findings:

    1. Excellent at text extraction and semantic understanding
    2. Approximate coordinates only (regions, not pixels)
    3. Good state detection (filled/empty, checked/unchecked)
  • Rationale: Vision excels at semantics, AT-SPI provides precision
  • Impact: Desktop automation should use hybrid approach
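
The hybrid approach can be sketched roughly as follows: the vision model supplies a label and an approximate region, and the accessibility tree supplies exact bounds. This is an illustration under stated assumptions, not the benchmark code. `Region`, `Element`, and `resolve` are hypothetical names, and the hard-coded element list stands in for whatever an AT-SPI query would return.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    """Approximate area reported by a vision model (hypothetical shape)."""
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


@dataclass(frozen=True)
class Element:
    """Precise element bounds, as an accessibility tree might expose them."""
    name: str
    x: int
    y: int
    width: int
    height: int

    def center(self) -> tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def resolve(label: str, region: Region, elements: list[Element]) -> Optional[Element]:
    """Pick the element whose name matches the vision label and whose
    center lies inside the approximate region; None if nothing matches."""
    wanted = label.strip().lower()
    for element in elements:
        cx, cy = element.center()
        if wanted in element.name.lower() and region.contains(cx, cy):
            return element
    return None


# Illustrative data: two buttons with the same name in different places.
elements = [
    Element("Sign in", x=1200, y=16, width=80, height=32),
    Element("Sign in", x=600, y=700, width=120, height=40),
]
# The vision model reports "sign in" somewhere in the top-right corner.
target = resolve("sign in", Region(x=1100, y=0, width=300, height=100), elements)
click_point = target.center() if target else None
```

The vision result disambiguates which of the two identically named elements is meant, while the click coordinates come from the precise bounds rather than from the model's estimate.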

Decision 2: Hooks unsuitable for automatic parallel orch queries

  • Context: Could hooks automatically spin off orch queries?
  • Research findings:

    1. Hooks are synchronous and blocking (60s timeout)
    2. Cannot run true background processes
    3. Would block Claude agent while waiting
  • Rationale: orch consensus already runs models in parallel internally
  • Impact: Explicit skill invocation is the right pattern, not hooks
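
Since a hook would block the agent for the full duration of the query anyway, the parallelism has to live inside the tool that is invoked explicitly. A minimal sketch of that fan-out pattern follows. It is not orch's actual implementation: `ask_model` is a hypothetical stand-in for a model call, and the model names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def ask_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a network call to one model."""
    return f"{model}: answer to {prompt!r}"


def fan_out(models: list[str], prompt: str, timeout: float = 60.0) -> dict[str, str]:
    """Query every model concurrently and collect the answers by model name."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        # Submit all calls first so they run at the same time.
        futures = {model: pool.submit(ask_model, model, prompt) for model in models}
        # Then wait for each result, bounded by the timeout.
        return {model: future.result(timeout=timeout) for model, future in futures.items()}


answers = fan_out(["model-a", "model-b", "model-c"], "Is this design sound?")
```

The caller still waits for the slowest model, but the wait happens once inside a single explicit invocation instead of inside a blocking hook.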

Decision 3: Claude search - extract existing summaries, don't generate

  • Context: Making conversation history searchable
  • Discovery: Claude Code JSONL files already contain summary entries
  • Approach: Index the existing summaries and metadata instead of generating new ones
  • Impact: Simple grep-based search works well; AI summarization is unnecessary
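
Extracting the existing summaries might look like the sketch below. It is not the claude-search script itself (which is not shown here), and the key names `type` and `summary` are assumptions about the entry layout.

```python
import json
from typing import Iterable, Iterator


def iter_summaries(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each summary entry in a JSONL stream.

    Assumes entries look like {"type": "summary", "summary": "..."};
    both key names are assumptions for this sketch.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        if isinstance(entry, dict) and entry.get("type") == "summary":
            text = entry.get("summary")
            if isinstance(text, str):
                yield text


# Illustrative input, not real session data.
sample = [
    '{"type": "user", "message": "fix wayland screenshot"}',
    '{"type": "summary", "summary": "Debugging Wayland screenshot capture"}',
    'not json',
    '{"type": "summary", "summary": "Vision benchmark planning"}',
]
summaries = list(iter_summaries(sample))
```

Because only existing entries are read, no model call is needed at indexing time.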

Decision 4: Shared logging library not worth the overhead

  • Context: Issue claimed duplicated logging functions across scripts
  • Investigation: Only 2 files have logging, with different styles
  • Rationale: Shared library adds complexity without proportional benefit
  • Impact: Keep logging inline in each script

Decision 5: Cross-repo work should be filed as separate beads

  • Context: AT-SPI enablement requires dotfiles NixOS changes
  • User feedback: "modifying system config is out-of-scope, file a beads issue"
  • Action: Created dotfiles-0l3, reverted direct changes
  • Impact: Better tracking of cross-repo dependencies

Problems & Solutions

| Problem | Solution | Learning |
|---------+----------+----------|
| JSONL index was pretty-printed, not one-per-line | Changed jq -n to jq -cn for compact output | JSONL requires compact JSON per line |
| skills-8cc claimed ARGS was unused | Verified with grep - ARGS used on lines 8, 58, 64 | Always verify before removing "dead code" |
| Hooks research needed official docs | Used claude-code-guide agent for accurate info | Agent has access to official documentation |
| AT-SPI work crossed repo boundaries | Created dotfiles-0l3 issue instead of direct edit | Cross-repo dependencies need proper tracking |
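
The JSONL problem in the first row is easy to reproduce. The sketch below illustrates the same constraint that jq's -c flag addresses, using Python's json module; the record contents are made up for illustration.

```python
import json

# Illustrative record; the field names are not taken from the real index.
record = {"session": "example-uuid", "project": "skills", "messages": 42}

pretty = json.dumps(record, indent=2)                # spans several lines
compact = json.dumps(record, separators=(",", ":"))  # exactly one line


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL the way a line-oriented reader does: one object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# The pretty-printed form fails because its first line is just "{".
try:
    parse_jsonl(pretty)
    pretty_ok = True
except json.JSONDecodeError:
    pretty_ok = False

# Two compact records, one per line, parse cleanly.
round_trip = parse_jsonl(compact + "\n" + compact + "\n")
```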

Technical Details

Code Changes

  • Total files modified: 4
  • Key files changed:

    • skills/orch/SKILL.md - Added conversational patterns section (~95 lines)
    • bin/claude-search - New prototype script (93 lines)
    • docs/research/vision-ui-benchmark-2025-12-29.md - Benchmark results

New Files Created

  • bin/claude-search - Indexes and searches Claude Code conversation history
  • docs/research/vision-ui-benchmark-2025-12-29.md - Vision benchmark findings

Commands Used

# Vision benchmark - capture screenshots
niri msg action screenshot-window --id 44
cd ~/proj/skills/skills/playwright-visit && nix develop --command ./scripts/visit.py screenshot "https://github.com" /tmp/ui-test-github.png --wait 2000

# Build claude-search index
~/proj/skills/bin/claude-search --rebuild
# Output: Index built: 122 sessions

# Search conversation history
~/proj/skills/bin/claude-search "wayland"
# Returns matching sessions with date, project, message count

# Check beads status
bd ready
bd show <id>
bd close <id> --reason="..."

Architecture Notes

  • Claude Code stores conversations in ~/.claude/projects/<project>/<uuid>.jsonl
  • JSONL contains message types: user, assistant, summary, file-history-snapshot, queue-operation
  • Summary entries already exist - no need to generate them
  • Claude Code LSP support is plugin-based; language servers are not discovered automatically
  • User has pyright-lsp and gopls-lsp plugins installed
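
The storage layout noted above (one JSONL file per session under a per-project directory) can be explored with a short script. This is a sketch under assumptions: the `type` key name is assumed, and the demo runs against a temporary directory rather than real session data.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def tally_entry_types(projects_root: Path) -> Counter:
    """Count entries by type across every <project>/<uuid>.jsonl file.

    The "type" key is an assumption for this sketch.
    """
    counts: Counter = Counter()
    for session_file in sorted(projects_root.glob("*/*.jsonl")):
        with session_file.open(encoding="utf-8") as handle:
            for line in handle:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    continue  # ignore blank or malformed lines
                if isinstance(entry, dict):
                    counts[str(entry.get("type", "unknown"))] += 1
    return counts


# Demo against a throwaway directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "demo-project").mkdir()
    (root / "demo-project" / "0000.jsonl").write_text(
        '{"type": "user"}\n{"type": "assistant"}\n{"type": "summary"}\n{"type": "user"}\n',
        encoding="utf-8",
    )
    demo_counts = tally_entry_types(root)
```

Pointing the function at `Path.home() / ".claude" / "projects"` would tally a real installation, assuming the layout described above.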

Process and Workflow

What Worked Well

  • Rapid issue triage - several closed as invalid or not worth doing
  • Research-first approach for hooks and LSP questions
  • Using claude-code-guide agent for accurate Claude Code documentation
  • User feedback loop on cross-repo work (file issues vs direct changes)

What Was Challenging

  • Extract-metrics.sh failed with exit 141 (SIGPIPE)
  • Initial JSONL indexing produced pretty-printed JSON instead of JSONL
  • Remote git server still down - 52 commits queued
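
Exit status 141 follows the shell convention of 128 plus the signal number, and SIGPIPE is signal 13 on Linux. This commonly happens when a downstream reader such as head closes the pipe early, though the cause in this case was not investigated here. A small helper for decoding such codes, assuming a POSIX system (`describe_exit` is a hypothetical name):

```python
import signal


def describe_exit(code: int) -> str:
    """Explain a shell exit status, decoding the 128 + N signal convention."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit status {code} (no matching signal)"
    return f"exit status {code}"


explanation = describe_exit(141)
```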

Learning and Insights

Technical Insights

  • Vision models give approximate coordinates, not pixel-precise
  • Claude Code hooks are synchronous with 60s timeout - not for background work
  • Claude Code JSONL already has summary entries from compaction
  • LSP in Claude Code requires explicit plugin installation via /plugin

Process Insights

  • Invalid issues happen - verify before acting
  • "Not worth doing" is a valid close reason for premature optimization
  • Cross-repo work needs proper issue tracking in both repos

Architectural Insights

  • Hybrid AT-SPI + vision is the right desktop automation architecture
  • Claude Code conversation search can be simple grep on indexed summaries
  • Plugin-based LSP means per-repo profiles (skills-0f1) may be less relevant
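
The "simple grep on indexed summaries" idea amounts to a case-insensitive substring match over one-record-per-line index entries. A minimal sketch, where the index keys `date`, `project`, `messages`, and `summary` are hypothetical and the records are invented for illustration:

```python
import json
from typing import Iterable


def search_index(index_lines: Iterable[str], query: str) -> list[dict]:
    """Return index records whose summary contains the query, case-insensitively.

    Each line is assumed to be one compact JSON object with hypothetical
    keys "date", "project", "messages", and "summary".
    """
    needle = query.lower()
    hits = []
    for line in index_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if needle in str(record.get("summary", "")).lower():
            hits.append(record)
    return hits


# Illustrative index lines, not real session data.
index = [
    '{"date":"2025-12-29","project":"skills","messages":120,"summary":"Vision benchmark on Wayland screenshots"}',
    '{"date":"2025-12-28","project":"dotfiles","messages":45,"summary":"NixOS module cleanup"}',
]
hits = search_index(index, "wayland")
```

Matching only the summary field keeps the search fast; full-text search over message bodies would need a different index.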

Context for Future Work

Open Questions

  • When will git remote come back online? (52 commits queued)
  • Should claude-search have a SessionEnd hook for auto-indexing?
  • Are the LSP-related issues (jbo, hh2, e96, etc.) still relevant given plugin architecture?

Next Steps

  • Deploy dotfiles when ready (AT-SPI enablement)
  • Push queued commits when remote recovers
  • Consider adding full-text search to claude-search
  • Review LSP issues against Claude Code's plugin-based approach

Related Work

Raw Notes

  • Session started with "what's ready" and worked through P2 issues
  • Vision benchmark used btop (terminal UI) and GitHub homepage (web UI)
  • claude-search indexes 122 sessions across 20 projects
  • Orch skill now documents: sessions, cross-model dialogue, iterative refinement
  • User explicitly said "no" to checking AT-SPI status - "waiting for deployment"

Issues Closed This Session

| Issue | Title | Resolution |
|-------+-------+------------|
| skills-ebl | Benchmark vision model UI understanding | Hybrid AT-SPI+vision recommended |
| skills-8d9 | Add conversational patterns to orch skill | Added to SKILL.md |
| skills-x2l | Investigate hooks for parallel orch queries | Hooks synchronous, not suitable |
| skills-6e3 | Searchable Claude Code conversation history | Prototype complete |
| skills-8cc | Remove dead code: unused ARGS variable | Invalid - ARGS is used |
| skills-r5c | Extract shared logging library from scripts | Not worth it - minimal duplication |

Session Metrics

  • Commits made: 5 (+ 1 beads sync)
  • Files touched: 4
  • Lines added/removed: +290/-10 (estimated)
  • Issues closed: 6
  • New prototypes: 1 (claude-search)
  • Research docs: 1 (vision benchmark)