dan 34afa86b77 docs: worklog for vision benchmark, orch patterns, claude-search

2025-12-29 20:45:01 -05:00

Vision Benchmark, Orch Patterns, and Claude Search Prototype

Session Summary

Date: 2025-12-29 (Evening session, continuation from morning)

Focus Area: Issue triage, research tasks, and prototype development

Accomplishments

Completed vision model UI understanding benchmark (skills-ebl)
Added conversational patterns documentation to orch skill (skills-8d9)
Investigated Claude Code hooks for parallel orch queries (skills-x2l)
Built claude-search prototype for conversation history search (skills-6e3)
Verified ARGS variable usage - closed invalid issue (skills-8cc)
Evaluated shared logging library need - closed as not worth it (skills-r5c)
Created dotfiles-0l3 for AT-SPI enablement (cross-repo tracking)
Researched Claude Code LSP integration architecture

Key Decisions

Decision 1: Vision models best used in hybrid approach with AT-SPI

Context: Benchmarking how well vision models understand UI screenshots
Test cases: Element location, identification, state detection, text extraction, layout
Findings:
1. Excellent at text extraction and semantic understanding
2. Approximate coordinates only (regions, not pixels)
3. Good state detection (filled/empty, checked/unchecked)
Rationale: Vision excels at semantics, AT-SPI provides precision
Impact: Desktop automation should use hybrid approach
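
A minimal sketch of the hybrid idea, assuming the vision model has already read the label of the target element and that an accessibility dump is available as tab-separated rows (name, x, y, width, height). The dump format, the sample rows, and the `resolve_click` helper are all hypothetical; real AT-SPI access would go through an accessibility client library rather than a text dump.

```shell
# Hypothetical accessibility dump: name, x, y, width, height (tab-separated).
# The rows are made-up sample data, not benchmark output.
elements=$(printf 'Search\t100\t20\t300\t30\nSign in\t900\t20\t80\t30\nSign up\t1000\t20\t80\t30\n')

# resolve_click LABEL
# Vision supplies the label (semantics); the accessibility data supplies the
# exact bounding box. Prints the centre of the matching box, or fails with
# status 1 if no element carries that label.
resolve_click() {
  printf '%s\n' "$elements" | awk -F '\t' -v want="$1" '
    $1 == want { printf "%d %d\n", $2 + $4 / 2, $3 + $5 / 2; found = 1; exit }
    END { if (!found) exit 1 }
  '
}

resolve_click "Sign in"   # prints: 940 35
```

The split keeps each side on what it is good at: the model never has to produce pixel coordinates, and the accessibility layer never has to guess which element the user meant.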

Decision 2: Hooks unsuitable for automatic parallel orch queries

Context: Could hooks automatically spin off orch queries?
Research findings:
1. Hooks are synchronous and blocking (60s timeout)
2. Cannot run true background processes
3. Would block Claude agent while waiting
Rationale: orch consensus already runs models in parallel internally
Impact: Explicit skill invocation is the right pattern, not hooks
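
To illustrate why a single explicit call is enough, here is a rough sketch of fan-out inside one blocking command. It is not how orch is implemented; `query_model`, `consensus`, and the model names are hypothetical stand-ins, and `sleep` stands in for network latency.

```shell
# query_model MODEL PROMPT: hypothetical stand-in for one model call.
query_model() {
  sleep 1                                   # pretend latency
  echo "$1: answer to '$2'"
}

# consensus PROMPT MODEL...
# Starts one background job per model, waits for all of them, then prints
# the collected answers. Wall time is roughly the slowest call, not the sum.
consensus() {
  prompt=$1; shift
  tmpdir=$(mktemp -d)
  for model in "$@"; do
    query_model "$model" "$prompt" > "$tmpdir/$model.out" &
  done
  wait                                      # block until every job is done
  cat "$tmpdir"/*.out
  rm -rf "$tmpdir"
}

consensus "Is this API design sound?" model-a model-b model-c
```

The caller still blocks until the answers arrive, but only once, which is the behaviour a hook could not improve on.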

Decision 3: Claude search - extract existing summaries, don't generate

Context: Making conversation history searchable
Discovery: Claude Code JSONL files already contain summary entries
Approach: Index existing summaries and metadata instead of generating new ones
Impact: Simple grep-based search works well; AI summarization is unnecessary
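
The approach can be sketched roughly as follows. This is not the actual bin/claude-search script. It assumes each summary entry looks like `{"type":"summary","summary":"..."}` (the `summary` field name is an assumption), and the `CLAUDE_PROJECTS_DIR` and `CLAUDE_SEARCH_INDEX` variables and both function names are hypothetical.

```shell
# Where sessions live and where the index goes (both overridable; the
# override variable names are invented for this sketch).
projects_dir=${CLAUDE_PROJECTS_DIR:-$HOME/.claude/projects}
index=${CLAUDE_SEARCH_INDEX:-/tmp/claude-search-index.jsonl}

# build_index: copy every existing summary entry into one JSONL index,
# tagging each with the session file it came from.
build_index() {
  : > "$index"
  find "$projects_dir" -name '*.jsonl' 2>/dev/null | while IFS= read -r file; do
    jq -c --arg file "$file" \
      'select(.type == "summary") | {file: $file, summary: .summary}' \
      "$file" >> "$index"
  done
}

# search_index TERM: case-insensitive grep over the index lines.
search_index() {
  grep -i -- "$1" "$index"
}

# Usage:
#   build_index && search_index wayland
```

Because the index is one compact JSON object per line, plain `grep` returns whole records and no summarization step is involved.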

Decision 4: Shared logging library not worth the overhead

Context: Issue claimed duplicated logging functions across scripts
Investigation: Only 2 files have logging, with different styles
Rationale: Shared library adds complexity without proportional benefit
Impact: Keep logging inline in each script

Decision 5: Cross-repo work should be filed as separate beads

Context: AT-SPI enablement requires dotfiles NixOS changes
User feedback: "modifying system config is out-of-scope, file a beads issue"
Action: Created dotfiles-0l3, reverted direct changes
Impact: Better tracking of cross-repo dependencies

Problems & Solutions

Problem	Solution	Learning
JSONL index was pretty-printed, not one-per-line	Changed jq -n to jq -cn for compact output	JSONL requires compact JSON per line
skills-8cc claimed ARGS was unused	Verified with grep - ARGS used on lines 8, 58, 64	Always verify before removing "dead code"
Hooks research needed official docs	Used claude-code-guide agent for accurate info	Agent has access to official documentation
AT-SPI work crossed repo boundaries	Created dotfiles-0l3 issue instead of direct edit	Cross-repo dependencies need proper tracking
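
The jq fix in the first row is easy to reproduce. The field names and values below are illustrative only, not the real index schema.

```shell
# Default output is pretty-printed: one object spans several lines,
# so appending it to a file does not produce valid JSONL.
jq -n --arg id "abc123" --argjson messages 42 '{id: $id, messages: $messages}'

# With -c the same object is emitted on a single line, one record per line.
jq -cn --arg id "abc123" --argjson messages 42 '{id: $id, messages: $messages}'
# prints: {"id":"abc123","messages":42}
```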

Technical Details

Code Changes

Total files modified: 4
Key files changed:
- skills/orch/SKILL.md - Added conversational patterns section (~95 lines)
- bin/claude-search - New prototype script (93 lines)
- docs/research/vision-ui-benchmark-2025-12-29.md - Benchmark results

New Files Created

bin/claude-search - Indexes and searches Claude Code conversation history
docs/research/vision-ui-benchmark-2025-12-29.md - Vision benchmark findings

Commands Used

# Vision benchmark - capture screenshots
niri msg action screenshot-window --id 44
cd ~/proj/skills/skills/playwright-visit && nix develop --command ./scripts/visit.py screenshot "https://github.com" /tmp/ui-test-github.png --wait 2000

# Build claude-search index
~/proj/skills/bin/claude-search --rebuild
# Output: Index built: 122 sessions

# Search conversation history
~/proj/skills/bin/claude-search "wayland"
# Returns matching sessions with date, project, message count

# Check beads status
bd ready
bd show <id>
bd close <id> --reason="..."

Architecture Notes

Claude Code stores conversations in ~/.claude/projects/<project>/<uuid>.jsonl
JSONL contains message types: user, assistant, summary, file-history-snapshot, queue-operation
Summary entries already exist - no need to generate them
Claude Code LSP support is plugin-based; language servers are not discovered automatically
User has pyright-lsp and gopls-lsp plugins installed
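
A quick way to inspect one session file is to count its entries by type. This sketch assumes every line is a JSON object with a top-level `type` field; `count_entry_types` is a hypothetical helper name.

```shell
# count_entry_types FILE: tally entries by their top-level "type" field,
# most frequent first. Entries without a type are counted as "unknown".
count_entry_types() {
  jq -r '.type // "unknown"' "$1" | sort | uniq -c | sort -rn
}

# Usage (placeholders as in the path above):
#   count_entry_types ~/.claude/projects/<project>/<uuid>.jsonl
```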

Process and Workflow

What Worked Well

Rapid issue triage - several closed as invalid or not worth doing
Research-first approach for hooks and LSP questions
Using claude-code-guide agent for accurate Claude Code documentation
User feedback loop on cross-repo work (file issues vs direct changes)

What Was Challenging

extract-metrics.sh failed with exit code 141 (128 + 13, meaning the process was killed by SIGPIPE)
Initial JSONL indexing produced pretty-printed JSON instead of JSONL
Remote git server still down - 52 commits queued
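
The notes do not record what triggered the SIGPIPE in extract-metrics.sh. One common way to get status 141, shown here purely as an illustration, is a pipeline under `pipefail` whose reader exits before the writer is done. The sketch assumes bash is installed.

```shell
# Run in an explicit bash so that pipefail is available.
# head exits after one line; yes then writes to a closed pipe and is
# killed by SIGPIPE, and pipefail surfaces that as the pipeline status.
bash -c 'set -o pipefail; yes | head -n 1 > /dev/null'
status=$?
echo "pipeline exit status: $status"

# Statuses above 128 mean "terminated by signal (status - 128)".
if [ "$status" -gt 128 ]; then
  echo "terminated by signal $((status - 128))"
fi
```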

Learning and Insights

Technical Insights

Vision models give approximate coordinates, not pixel-precise ones
Claude Code hooks are synchronous with 60s timeout - not for background work
Claude Code JSONL already has summary entries from compaction
LSP in Claude Code requires explicit plugin installation via /plugin

Process Insights

Invalid issues happen - verify before acting
"Not worth doing" is a valid close reason for premature optimization
Cross-repo work needs proper issue tracking in both repos

Architectural Insights

Hybrid AT-SPI + vision is the right desktop automation architecture
Claude Code conversation search can be simple grep on indexed summaries
Plugin-based LSP means per-repo profiles (skills-0f1) may be less relevant

Context for Future Work

Open Questions

When will the git remote come back online? (52 commits queued)
Should claude-search have a SessionEnd hook for auto-indexing?
Are the LSP-related issues (jbo, hh2, e96, etc.) still relevant given plugin architecture?

Next Steps

Deploy dotfiles when ready (AT-SPI enablement)
Push queued commits when remote recovers
Consider adding full-text search to claude-search
Review LSP issues against Claude Code's plugin-based approach

Related Work

Earlier 2025-12-29 session - playwright-visit, issue triage
2025-12-28 session - code-review skill, orch patterns origin

Raw Notes

Session started with "what's ready" and worked through P2 issues
Vision benchmark used btop (terminal UI) and GitHub homepage (web UI)
claude-search indexes 122 sessions across 20 projects
Orch skill now documents: sessions, cross-model dialogue, iterative refinement
User explicitly said "no" to checking AT-SPI status - "waiting for deployment"

Issues Closed This Session

Issue	Title	Resolution
skills-ebl	Benchmark vision model UI understanding	Hybrid AT-SPI+vision recommended
skills-8d9	Add conversational patterns to orch skill	Added to SKILL.md
skills-x2l	Investigate hooks for parallel orch queries	Hooks synchronous, not suitable
skills-6e3	Searchable Claude Code conversation history	Prototype complete
skills-8cc	Remove dead code: unused ARGS variable	Invalid - ARGS is used
skills-r5c	Extract shared logging library from scripts	Not worth it - minimal duplication

Session Metrics

Commits made: 5 (+ 1 beads sync)
Files touched: 4
Lines added/removed: +290/-10 (estimated)
Issues closed: 6
New prototypes: 1 (claude-search)
Research docs: 1 (vision benchmark)
