#+TITLE: Vision Benchmark, Orch Patterns, and Claude Search Prototype
#+DATE: 2025-12-29
#+KEYWORDS: vision-benchmark, orch-skill, claude-search, lsp-research, issue-triage, hooks
#+COMMITS: 5
#+COMPRESSION_STATUS: uncompressed

* Session Summary
** Date: 2025-12-29 (Evening session, continuation from morning)
** Focus Area: Issue triage, research tasks, and prototype development

* Accomplishments
- [X] Completed vision model UI understanding benchmark (skills-ebl)
- [X] Added conversational patterns documentation to orch skill (skills-8d9)
- [X] Investigated Claude Code hooks for parallel orch queries (skills-x2l)
- [X] Built claude-search prototype for conversation history search (skills-6e3)
- [X] Verified ARGS variable usage - closed invalid issue (skills-8cc)
- [X] Evaluated shared logging library need - closed as not worth it (skills-r5c)
- [X] Created dotfiles-0l3 for AT-SPI enablement (cross-repo tracking)
- [X] Researched Claude Code LSP integration architecture

* Key Decisions
** Decision 1: Vision models best used in hybrid approach with AT-SPI
- Context: Benchmarking how well vision models understand UI screenshots
- Test cases: Element location, identification, state detection, text extraction, layout
- Findings:
  1. Excellent at text extraction and semantic understanding
  2. Approximate coordinates only (regions, not pixels)
  3. Good state detection (filled/empty, checked/unchecked)
- Rationale: Vision excels at semantics, AT-SPI provides precision
- Impact: Desktop automation should use hybrid approach

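One way to picture the hybrid approach: the vision model names an element and supplies a rough point, and an accessibility dump supplies exact bounds for whatever sits under that point. The sketch below assumes a hypothetical whitespace-separated dump (=name x y width height=, one element per line). That format and the =resolve_element= name are illustrative only, not AT-SPI's real output or part of the benchmark.

#+begin_src bash
# Hypothetical input format, one element per line: name x y width height
# (a real AT-SPI dump would need converting to this shape first).
resolve_element() {
  # $1 = approximate x, $2 = approximate y (e.g. from a vision model)
  # stdin = element bounds; prints the first element containing the point
  # and returns non-zero when no element contains it.
  awk -v px="$1" -v py="$2" '
    px >= $2 && px < $2 + $4 && py >= $3 && py < $3 + $5 {
      print $1, $2, $3, $4, $5
      found = 1
      exit
    }
    END { exit (found ? 0 : 1) }
  '
}

# Example: a rough point near the right edge resolves to exact bounds.
printf '%s\n' 'search_box 100 40 300 30' 'sign_in 900 40 80 30' |
  resolve_element 930 55
#+end_src

The vision side only has to land somewhere inside the element; the precise click target then comes from the matched bounds.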
** Decision 2: Hooks unsuitable for automatic parallel orch queries
- Context: Could hooks automatically spin off orch queries?
- Research findings:
  1. Hooks are synchronous and blocking (60s timeout)
  2. Cannot run true background processes
  3. Would block Claude agent while waiting
- Rationale: orch consensus already runs models in parallel internally
- Impact: Explicit skill invocation is the right pattern, not hooks

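Because a hook blocks until it returns, any fan-out has to live inside the command that is invoked explicitly. As a generic illustration of that shape (not a description of how orch implements consensus), the sketch below starts several shell jobs in the background and waits for all of them. =query_model= and =fan_out= are hypothetical placeholders.

#+begin_src bash
# query_model is a hypothetical placeholder, not part of orch.
query_model() {
  sleep 1                      # simulate network latency
  printf '%s: ok\n' "$1"
}

fan_out() {
  # Start one background job per model, then block until all finish.
  tmp=$(mktemp -d) || return 1
  for model in "$@"; do
    query_model "$model" > "$tmp/$model.out" &
  done
  wait
  cat "$tmp"/*.out             # glob order is alphabetical, not completion order
  rm -rf "$tmp"
}

fan_out alpha beta gamma
#+end_src

The three simulated queries overlap, so the call takes roughly as long as the slowest one rather than the sum of all three.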
** Decision 3: Claude search - extract existing summaries, don't generate
- Context: Making conversation history searchable
- Discovery: Claude Code JSONL files already contain summary entries
- Approach: Index existing summaries + metadata vs generating new ones
- Impact: Simple grep-based search works well, AI summarization unnecessary

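A minimal sketch of the grep-based idea, not the actual =bin/claude-search= implementation: it lists session files whose summary entries mention a query. It assumes compact JSONL in which summary entries contain the literal text ="type":"summary"= with no space after the colon. =search_summaries= is an illustrative name.

#+begin_src bash
# Assumption: summary entries appear as compact JSON containing
# "type":"summary"; adjust the pattern if the files are laid out differently.
search_summaries() {
  # $1 = case-insensitive query, $2 = directory holding *.jsonl sessions
  # Prints the path of every session with a matching summary entry.
  find "$2" -type f -name '*.jsonl' | while IFS= read -r file; do
    if grep '"type":"summary"' "$file" | grep -i -- "$1" >/dev/null; then
      printf '%s\n' "$file"
    fi
  done
}

# Usage (path taken from the architecture notes in this worklog):
#   search_summaries wayland ~/.claude/projects
#+end_src

Filtering on the entry type first keeps matches in ordinary user or assistant messages from showing up as summary hits.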
** Decision 4: Shared logging library not worth the overhead
- Context: Issue claimed duplicated logging functions across scripts
- Investigation: Only 2 files have logging, with different styles
- Rationale: Shared library adds complexity without proportional benefit
- Impact: Keep logging inline in each script

** Decision 5: Cross-repo work should be filed as separate beads
- Context: AT-SPI enablement requires dotfiles NixOS changes
- User feedback: "modifying system config is out-of-scope, file a beads issue"
- Action: Created dotfiles-0l3, reverted direct changes
- Impact: Better tracking of cross-repo dependencies

* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
| JSONL index was pretty-printed, not one-per-line | Changed jq -n to jq -cn for compact output | JSONL requires compact JSON per line |
| skills-8cc claimed ARGS was unused | Verified with grep - ARGS used on lines 8, 58, 64 | Always verify before removing "dead code" |
| Hooks research needed official docs | Used claude-code-guide agent for accurate info | Agent has access to official documentation |
| AT-SPI work crossed repo boundaries | Created dotfiles-0l3 issue instead of direct edit | Cross-repo dependencies need proper tracking |

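The JSONL problem is easy to catch with a shape check. jq's =-c= (=--compact-output=) flag emits each value on a single line; without it an object spans several lines and no longer reads as one record per line. The sketch below is a rough check, not a JSON parser, and =is_jsonl= is an illustrative name.

#+begin_src bash
is_jsonl() {
  # Succeeds only if every non-empty line of file $1 starts with "{" and
  # ends with "}". Pretty-printed output fails on its first line, a bare "{".
  awk 'NF && !/^[{].*[}]$/ { bad = 1; exit } END { exit bad + 0 }' "$1"
}

# Example: compact records pass, a pretty-printed object does not.
sample=$(mktemp)
printf '%s\n' '{"id":"a"}' '{"id":"b"}' > "$sample"
is_jsonl "$sample" && echo "compact: ok"
printf '{\n  "id": "a"\n}\n' > "$sample"
is_jsonl "$sample" || echo "pretty-printed: rejected"
rm -f "$sample"
#+end_src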
* Technical Details

** Code Changes
- Total files modified: 4
- Key files changed:
  - =skills/orch/SKILL.md= - Added conversational patterns section (~95 lines)
  - =bin/claude-search= - New prototype script (93 lines)
  - =docs/research/vision-ui-benchmark-2025-12-29.md= - Benchmark results

** New Files Created
- =bin/claude-search= - Indexes and searches Claude Code conversation history
- =docs/research/vision-ui-benchmark-2025-12-29.md= - Vision benchmark findings

** Commands Used
#+begin_src bash
# Vision benchmark - capture screenshots
niri msg action screenshot-window --id 44
cd ~/proj/skills/skills/playwright-visit && nix develop --command ./scripts/visit.py screenshot "https://github.com" /tmp/ui-test-github.png --wait 2000

# Build claude-search index
~/proj/skills/bin/claude-search --rebuild
# Output: Index built: 122 sessions

# Search conversation history
~/proj/skills/bin/claude-search "wayland"
# Returns matching sessions with date, project, message count

# Check beads status
bd ready
bd show <id>
bd close <id> --reason="..."
#+end_src

** Architecture Notes
- Claude Code stores conversations in =~/.claude/projects/<project>/<uuid>.jsonl=
- JSONL contains message types: user, assistant, summary, file-history-snapshot, queue-operation
- Summary entries already exist - no need to generate them
- Claude Code LSP is plugin-based, not automatic discovery
- User has pyright-lsp and gopls-lsp plugins installed

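Given that layout, a per-project session count needs nothing beyond =find=. This is a minimal sketch assuming one directory per project directly under the root passed in. =count_sessions= is an illustrative name and not part of =claude-search=.

#+begin_src bash
count_sessions() {
  # $1 = projects root, e.g. ~/.claude/projects
  # Prints "<count> <project>" for each project directory that holds
  # at least one *.jsonl session file.
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue
    n=$(find "$dir" -maxdepth 1 -type f -name '*.jsonl' | wc -l | tr -d ' ')
    if [ "$n" -gt 0 ]; then
      printf '%s %s\n' "$n" "$(basename "$dir")"
    fi
  done
}

# Usage:
#   count_sessions ~/.claude/projects
#+end_src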
* Process and Workflow

** What Worked Well
- Rapid issue triage - several closed as invalid or not worth doing
- Research-first approach for hooks and LSP questions
- Using claude-code-guide agent for accurate Claude Code documentation
- User feedback loop on cross-repo work (file issues vs direct changes)

** What Was Challenging
- Extract-metrics.sh failed with exit 141 (SIGPIPE)
- Initial JSONL indexing produced pretty-printed JSON instead of JSONL
- Remote git server still down - 52 commits queued

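Exit 141 decodes as 128 + 13, and signal 13 is SIGPIPE on Linux. A common trigger is a downstream reader such as =head= closing a pipe early while =pipefail= is set, though these notes do not record the actual cause in the metrics script. The helper below is a small sketch for decoding such statuses; =describe_status= is an illustrative name.

#+begin_src bash
describe_status() {
  # Statuses above 128 conventionally mean "terminated by signal (status - 128)".
  if [ "$1" -gt 128 ]; then
    printf 'signal %s\n' "$(kill -l "$(( $1 - 128 ))")"
  else
    printf 'exit %s\n' "$1"
  fi
}

describe_status 141   # prints "signal PIPE" on Linux
#+end_src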
* Learning and Insights

** Technical Insights
- Vision models give approximate coordinates, not pixel-precise
- Claude Code hooks are synchronous with 60s timeout - not for background work
- Claude Code JSONL already has summary entries from compaction
- LSP in Claude Code requires explicit plugin installation via /plugin

** Process Insights
- Invalid issues happen - verify before acting
- "Not worth doing" is a valid close reason for premature optimization
- Cross-repo work needs proper issue tracking in both repos

** Architectural Insights
- Hybrid AT-SPI + vision is the right desktop automation architecture
- Claude Code conversation search can be simple grep on indexed summaries
- Plugin-based LSP means per-repo profiles (skills-0f1) may be less relevant

* Context for Future Work

** Open Questions
- When will git remote come back online? (52 commits queued)
- Should claude-search have a SessionEnd hook for auto-indexing?
- Are the LSP-related issues (jbo, hh2, e96, etc.) still relevant given plugin architecture?

** Next Steps
- Deploy dotfiles when ready (AT-SPI enablement)
- Push queued commits when remote recovers
- Consider adding full-text search to claude-search
- Review LSP issues against Claude Code's plugin-based approach

** Related Work
- [[file:2025-12-29-issue-triage-playwright-skill-implementation.org][Earlier 2025-12-29 session]] - playwright-visit, issue triage
- [[file:2025-12-28-code-review-skill-creation-worklog-cleanup.org][2025-12-28 session]] - code-review skill, orch patterns origin

* Raw Notes
- Session started with "what's ready" and worked through P2 issues
- Vision benchmark used btop (terminal UI) and GitHub homepage (web UI)
- claude-search indexes 122 sessions across 20 projects
- Orch skill now documents: sessions, cross-model dialogue, iterative refinement
- User explicitly said "no" to checking AT-SPI status - "waiting for deployment"

** Issues Closed This Session
| Issue | Title | Resolution |
|-------+-------+------------|
| skills-ebl | Benchmark vision model UI understanding | Hybrid AT-SPI+vision recommended |
| skills-8d9 | Add conversational patterns to orch skill | Added to SKILL.md |
| skills-x2l | Investigate hooks for parallel orch queries | Hooks synchronous, not suitable |
| skills-6e3 | Searchable Claude Code conversation history | Prototype complete |
| skills-8cc | Remove dead code: unused ARGS variable | Invalid - ARGS is used |
| skills-r5c | Extract shared logging library from scripts | Not worth it - minimal duplication |

* Session Metrics
- Commits made: 5 (+ 1 beads sync)
- Files touched: 4
- Lines added/removed: +290/-10 (estimated)
- Issues closed: 6
- New prototypes: 1 (claude-search)
- Research docs: 1 (vision benchmark)