doc-review: design session complete, spun off to ~/proj/doc-review
- Added use_api_keys to .envrc for orch access
- Worklog documents full design process
- Beads closed: skills-bcu, skills-1ig, skills-53k, skills-d6r
- Architecture: Vale + LLM hybrid (deterministic + semantic)
- Implementation continues in dedicated repo

parent 148f219887 → commit 139a521a89

@@ -1,16 +1,21 @@
{"id":"skills-1ig","title":"Brainstorm agent-friendly doc conventions","description":"# Agent-Friendly Doc Conventions - Hybrid Architecture\n\n## FINAL ARCHITECTURE: Vale + LLM Hybrid\n\n### Insight\n\u003e \"Good old deterministic testing (dumb robots) is the best way to keep in check LLMs (smart robots) at volume.\"\n\n### Split by Tool\n\n| Category | Rubrics | Tool |\n|----------|---------|------|\n| Vale-only | Format Integrity, Deterministic Instructions, Terminology Strictness, Token Efficiency | Fast, deterministic, CI-friendly |\n| Vale + LLM | Semantic Headings, Configuration Precision, Security Boundaries | Vale flags, LLM suggests fixes |\n| LLM-only | Contextual Independence, Code Executability, Execution Verification | Semantic understanding required |\n\n### Pipeline\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│ Stage 1: Vale (deterministic, fast, free) │\n│ - Runs in CI on every commit │\n│ - Catches 40% of issues instantly │\n│ - No LLM cost for clean docs │\n└─────────────────────┬───────────────────────────────────────┘\n │ only if Vale passes\n ▼\n┌─────────────────────────────────────────────────────────────┐\n│ Stage 2: LLM Triage (cheap model) │\n│ - Evaluates 3 semantic rubrics │\n│ - Identifies which need patches │\n└─────────────────────┬───────────────────────────────────────┘\n │ only if issues found\n ▼\n┌─────────────────────────────────────────────────────────────┐\n│ Stage 3: LLM Specialists (capable model) │\n│ - One agent per failed rubric │\n│ - Generates patches │\n└─────────────────────────────────────────────────────────────┘\n```\n\n### Why This Works\n- Vale is battle-tested, fast, CI-native\n- LLM only fires when needed (adaptive cost)\n- Deterministic rules catch predictable issues\n- LLM handles semantic/contextual issues\n\n---\n\n## Vale Rules Needed\n\n### Format Integrity\n- Existence: code blocks without language tags\n- Regex for unclosed fences\n\n### Deterministic Instructions \n- Existence: hedging words (\"might\", \"may want to\", \"consider\", \"you could\")\n\n### Terminology Strictness\n- Consistency: flag term variations\n\n### Token Efficiency\n- Existence: filler phrases (\"In this section we will...\", \"As you may know...\")\n\n### Semantic Headings (partial)\n- Existence: banned headings (\"Overview\", \"Introduction\", \"Getting Started\")\n\n### Configuration Precision (partial)\n- Existence: vague versions (\"Python 3.x\", \"recent version\")\n\n### Security Boundaries (partial)\n- Existence: hardcoded API key patterns\n\n---\n\n## NEXT STEPS\n\n1. Create Vale style for doc-review rubrics\n2. Test Vale on sample docs\n3. Design LLM prompts for semantic rubrics only\n4. Wire into orch or standalone","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-04T14:02:04.898026177-08:00","updated_at":"2025-12-04T16:43:53.0608948-08:00","closed_at":"2025-12-04T16:43:53.0608948-08:00"}
{"id":"skills-20s","title":"Compare BOUNDARIES.md with upstream","description":"","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-03T20:15:53.585115099-08:00","updated_at":"2025-12-03T20:19:28.442646801-08:00","closed_at":"2025-12-03T20:19:28.442646801-08:00","dependencies":[{"issue_id":"skills-20s","depends_on_id":"skills-ebh","type":"discovered-from","created_at":"2025-12-03T20:15:53.586442134-08:00","created_by":"daemon"}]}
{"id":"skills-25l","title":"Create orch skill for multi-model consensus","description":"Build a skill that exposes orch CLI capabilities to agents for querying multiple AI models","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-11-30T15:43:49.209528963-08:00","updated_at":"2025-11-30T15:47:36.608887453-08:00","closed_at":"2025-11-30T15:47:36.608887453-08:00"}
{"id":"skills-2xo","title":"Add README.md for web-search skill","description":"web-search skill has SKILL.md and scripts but no README.md. AGENTS.md says README.md is for humans, contains installation instructions, usage examples, prerequisites.","status":"open","priority":2,"issue_type":"task","created_at":"2025-11-30T11:58:14.26066025-08:00","updated_at":"2025-11-30T12:00:25.561281052-08:00","dependencies":[{"issue_id":"skills-2xo","depends_on_id":"skills-vb5","type":"blocks","created_at":"2025-11-30T12:01:30.240439018-08:00","created_by":"daemon"}]}
{"id":"skills-39g","title":"RFC: .skills manifest pattern for per-repo skill deployment","description":"Document the .skills file pattern where projects declare skills in a manifest, .envrc reads it, and agents can query/edit it.","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-11-30T12:37:50.106992381-08:00","updated_at":"2025-11-30T12:43:04.155161727-08:00","closed_at":"2025-11-30T12:43:04.155161727-08:00"}
{"id":"skills-3o7","title":"Fix ai-skills.nix missing sha256 hash","description":"modules/ai-skills.nix:16 has empty sha256 placeholder for opencode-skills npm package. Either get actual hash or remove/comment out the incomplete fetchFromNpm approach.","status":"closed","priority":2,"issue_type":"bug","created_at":"2025-11-30T11:58:24.404929863-08:00","updated_at":"2025-11-30T12:12:39.372107348-08:00","closed_at":"2025-11-30T12:12:39.372107348-08:00"}
{"id":"skills-4yn","title":"Decide on screenshot-latest skill deployment","description":"DEPLOYED.md shows screenshot-latest as 'Not yet deployed - Pending decision'. Low risk skill that finds existing files. Need to decide whether to deploy or archive.","status":"open","priority":2,"issue_type":"task","created_at":"2025-11-30T11:58:33.099790809-08:00","updated_at":"2025-11-30T11:58:33.099790809-08:00"}
{"id":"skills-53k","title":"Design graph-based doc discovery","description":"How does doc-review find and traverse documentation?\n\nApproach: Start from README.md or AGENTS.md, graph out from there.\n\nDesign questions:\n- Parse markdown links to find related docs?\n- Follow only relative links or also section references?\n- How to handle circular references?\n- Depth limit or exhaustive traversal?\n- What about orphan docs not linked from root?\n- How to represent the graph for chunking decisions?\n\nConsiderations:\n- Large repos may have hundreds of markdown files\n- Not all .md files are \"documentation\" (changelogs, templates, etc.)\n- Some docs are generated and shouldn't be patched\n\nDeliverable: Algorithm/pseudocode for doc discovery + chunking strategy.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-04T14:02:13.316843518-08:00","updated_at":"2025-12-04T16:43:58.277061015-08:00","closed_at":"2025-12-04T16:43:58.277061015-08:00"}
{"id":"skills-5v8","title":"Replace SKILL.md with upstream version","description":"Upstream has 644 lines vs our 122. Missing: self-test questions, notes quality checks, token checkpointing, database selection, field usage table, lifecycle workflow, common patterns, troubleshooting","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-03T20:15:53.025829293-08:00","updated_at":"2025-12-03T20:16:20.470185004-08:00","closed_at":"2025-12-03T20:16:20.470185004-08:00","dependencies":[{"issue_id":"skills-5v8","depends_on_id":"skills-ebh","type":"discovered-from","created_at":"2025-12-03T20:15:53.027601712-08:00","created_by":"daemon"}]}
{"id":"skills-7s0","title":"Compare STATIC_DATA.md with upstream","description":"","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-03T20:15:55.193704589-08:00","updated_at":"2025-12-03T20:19:29.659256809-08:00","closed_at":"2025-12-03T20:19:29.659256809-08:00","dependencies":[{"issue_id":"skills-7s0","depends_on_id":"skills-ebh","type":"discovered-from","created_at":"2025-12-03T20:15:55.195160705-08:00","created_by":"daemon"}]}
{"id":"skills-7sh","title":"Set up bd-issue-tracking Claude Code skill from beads repo","description":"Install the beads Claude Code skill from https://github.com/steveyegge/beads/tree/main/examples/claude-code-skill\n\nThis skill teaches Claude how to effectively use beads for issue tracking across multi-session coding workflows. It provides strategic guidance on when/how to use beads, not just command syntax.\n\nFiles to install to ~/.claude/skills/bd-issue-tracking/:\n- SKILL.md - Core workflow patterns and decision criteria\n- BOUNDARIES.md - When to use beads vs markdown alternatives\n- CLI_REFERENCE.md - Complete command documentation\n- DEPENDENCIES.md - Relationship types and patterns\n- WORKFLOWS.md - Step-by-step procedures\n- ISSUE_CREATION.md - Quality guidelines\n- RESUMABILITY.md - Making work resumable across sessions\n- STATIC_DATA.md - Using beads as reference databases\n\nCan symlink or copy the files. Restart Claude Code after install.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-03T17:53:43.254007992-08:00","updated_at":"2025-12-03T20:04:53.416579381-08:00","closed_at":"2025-12-03T20:04:53.416579381-08:00"}
{"id":"skills-8d4","title":"Compare CLI_REFERENCE.md with upstream","description":"","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-03T20:15:53.268324087-08:00","updated_at":"2025-12-03T20:17:26.552616779-08:00","closed_at":"2025-12-03T20:17:26.552616779-08:00","dependencies":[{"issue_id":"skills-8d4","depends_on_id":"skills-ebh","type":"discovered-from","created_at":"2025-12-03T20:15:53.27265681-08:00","created_by":"daemon"}]}
{"id":"skills-a23","title":"Update main README to list all 9 skills","description":"Main README.md 'Skills Included' section only lists worklog and update-spec-kit. Repo actually has 9 skills: template, worklog, update-spec-kit, screenshot-latest, niri-window-capture, tufte-press, update-opencode, web-research, web-search.","status":"open","priority":2,"issue_type":"task","created_at":"2025-11-30T11:58:14.042397754-08:00","updated_at":"2025-11-30T12:00:18.916270858-08:00","dependencies":[{"issue_id":"skills-a23","depends_on_id":"skills-4yn","type":"blocks","created_at":"2025-11-30T12:01:30.306742184-08:00","created_by":"daemon"}]}
{"id":"skills-bcu","title":"Design doc-review skill","description":"# doc-review skill\n\nFight documentation drift with a non-interactive review process that generates patchfiles for human review.\n\n## Problem\n- No consistent documentation system across repos\n- Stale content accumulates\n- Structural inconsistencies (docs not optimized for agents)\n\n## Envisioned Workflow\n\n```bash\n# Phase 1: Generate patches (non-interactive, use spare credits, test models)\ndoc-review scan ~/proj/foo --model claude-sonnet --output /tmp/foo-patches/\n\n# Phase 2: Review patches (interactive session)\ncd ~/proj/foo\nclaude # human reviews patches, applies selectively\n```\n\n## Design Decisions Made\n\n- **Trigger**: Manual invocation (not CI). Use case includes burning extra LLM credits, testing models repeatably.\n- **Source of truth**: Style guide embedded in prompt template. Blessed defaults, overridable per-repo.\n- **Output**: Patchfiles for human review in interactive Claude session.\n- **Chunking**: Based on absolute size, not file count. Logical chunks easy for Claude to review.\n- **Scope detection**: Graph-based discovery starting from README.md or AGENTS.md, not glob-all-markdown.\n\n## Open Design Work\n\n### Agent-Friendly Doc Conventions (needs brainstorming)\nWhat makes docs agent-readable?\n- Explicit context (no \"as mentioned above\")\n- Clear section headers for navigation\n- Self-contained sections\n- Consistent terminology\n- Front-loaded summaries\n- ???\n\n### Prompt Content\nFull design round needed on:\n- What conventions to enforce\n- How to express them in prompt\n- Examples of \"good\" vs \"bad\"\n\n### Graph-Based Discovery\nHow does traversal work?\n- Parse links from README/AGENTS.md?\n- Follow relative markdown links?\n- Depth limit?\n\n## Skill Structure (tentative)\n```\nskills/doc-review/\n├── prompt.md # Core review instructions + style guide\n├── scan.sh # Orchestrates: find docs → invoke claude → emit patches\n└── README.md\n```\n\n## Out of Scope (for now)\n- Cross-repo standardization (broader than skills repo)\n- CI integration\n- Auto-apply without human review","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-12-04T14:01:43.305653729-08:00","updated_at":"2025-12-04T16:44:03.468118288-08:00","closed_at":"2025-12-04T16:44:03.468118288-08:00","dependencies":[{"issue_id":"skills-bcu","depends_on_id":"skills-1ig","type":"blocks","created_at":"2025-12-04T14:02:17.144414636-08:00","created_by":"daemon"},{"issue_id":"skills-bcu","depends_on_id":"skills-53k","type":"blocks","created_at":"2025-12-04T14:02:17.164968463-08:00","created_by":"daemon"}]}
{"id":"skills-cnc","title":"Add direnv helper for per-repo skill deployment","description":"Create sourceable helper script and documentation for the standard per-repo skill deployment pattern using direnv + nix build.","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-11-30T12:19:20.71056749-08:00","updated_at":"2025-11-30T12:37:47.22638278-08:00","closed_at":"2025-11-30T12:37:47.22638278-08:00"}
{"id":"skills-czz","title":"Research OpenCode agents for skill integration","description":"DEPLOYMENT.md:218 has TODO to research OpenCode agents. Need to understand how Build/Plan/custom agents work and whether skills need agent-specific handling.","status":"open","priority":2,"issue_type":"task","created_at":"2025-11-30T11:58:24.855701141-08:00","updated_at":"2025-11-30T11:58:24.855701141-08:00"}
{"id":"skills-d6r","title":"Design: orch as local agent framework","description":"# Orch Evolution: From Consensus Tool to Agent Framework\n\n## Current State\n- `orch consensus` - multi-model queries\n- `orch chat` - single model queries\n- No state, no pipelines, no retries\n\n## Proposed Extensions\n\n### Pipeline Mode\n```bash\norch pipeline config.yaml\n```\nWhere config.yaml defines:\n- Stages (triage → specialists → verify)\n- Routing logic (if triage finds X, run specialist Y)\n- Retry policy\n\n### Evaluate Mode (doc-review specific)\n```bash\norch evaluate doc.md --rubrics=1,4,7 --output=patches/\n```\n- Applies specific rubrics to document\n- Outputs JSON or patches\n\n### Parallel Mode\n```bash\norch parallel --fan-out=5 --template=\"evaluate {rubric}\" rubrics.txt\n```\n- Fan-out to multiple parallel calls\n- Aggregate results\n\n## Open Questions\n1. Does this belong in orch or a separate tool?\n2. Should orch pipelines be YAML-defined or code-defined?\n3. How does this relate to Claude Code Task subagents?\n4. What's the minimal viable extension?\n\n## Context\nEmerged from doc-review skill design - need multi-pass evaluation but don't want to adopt heavy framework (LangGraph, etc.)","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-04T16:06:56.681282678-08:00","updated_at":"2025-12-04T16:44:08.652185174-08:00","closed_at":"2025-12-04T16:44:08.652185174-08:00"}
{"id":"skills-dpw","title":"orch: add command to show available/configured models","description":"## Problem\n\nWhen trying to use orch, you have to trial-and-error through models to find which ones have API keys configured. Each failure looks like:\n\n```\nError: GEMINI_API_KEY not set. Required for Google Gemini models.\n```\n\nNo way to know upfront which models are usable.\n\n## Proposed Solution\n\nAdd `orch models` or `orch status` command:\n\n```bash\n$ orch models\nAvailable models:\n ✓ flash (GEMINI_API_KEY set)\n ✓ gemini (GEMINI_API_KEY set)\n ✗ deepseek (OPENROUTER_KEY not set)\n ✗ qwen (OPENROUTER_KEY not set)\n ✓ gpt (OPENAI_API_KEY set)\n```\n\nOr at minimum, on failure suggest alternatives:\n```\nError: GEMINI_API_KEY not set. Try --model gpt or --model deepseek instead.\n```\n\n## Context\n\nHit this while trying to brainstorm with high-temp gemini - had to try 4 models before realizing none were configured in this environment.","status":"closed","priority":3,"issue_type":"feature","created_at":"2025-12-04T14:10:07.069103175-08:00","updated_at":"2025-12-04T14:11:05.49122538-08:00","closed_at":"2025-12-04T14:11:05.49122538-08:00"}
{"id":"skills-ebh","title":"Compare bd-issue-tracking skill files with upstream","description":"Fetch upstream beads skill files and compare with our condensed versions to identify differences","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-03T20:14:07.886535859-08:00","updated_at":"2025-12-03T20:19:37.579815337-08:00","closed_at":"2025-12-03T20:19:37.579815337-08:00"}
{"id":"skills-fo3","title":"Compare WORKFLOWS.md with upstream","description":"","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-03T20:15:54.283175561-08:00","updated_at":"2025-12-03T20:19:28.897037199-08:00","closed_at":"2025-12-03T20:19:28.897037199-08:00","dependencies":[{"issue_id":"skills-fo3","depends_on_id":"skills-ebh","type":"discovered-from","created_at":"2025-12-03T20:15:54.286009672-08:00","created_by":"daemon"}]}
{"id":"skills-kmj","title":"Orch skill: document or handle orch not in PATH","description":"Skill docs show 'orch consensus' but orch requires 'uv run' from ~/proj/orch. Either update skill to invoke correctly or document installation requirement.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-01T17:29:48.844997238-08:00","updated_at":"2025-12-01T18:28:11.374048504-08:00","closed_at":"2025-12-01T18:28:11.374048504-08:00"}
docs/worklogs/2025-12-04-doc-review-skill-design.org (new file, 325 lines)
@@ -0,0 +1,325 @@
#+TITLE: Doc-Review Skill Design: Research and Rubric Development
#+DATE: 2025-12-04
#+KEYWORDS: doc-review, rubrics, LLM-as-judge, agent-friendly-docs, documentation-drift, AGENTS.md
#+COMMITS: 0
#+COMPRESSION_STATUS: uncompressed

* Session Summary

** Date: 2025-12-04 (Day 1 of doc-review skill)

** Focus Area: Designing a doc-review skill to normalize documentation and fight documentation drift

* Accomplishments

- [X] Created bead structure for doc-review skill design (skills-bcu with 2 blockers)
- [X] Researched agent-friendly documentation conventions via web search
- [X] Analyzed GitHub's study of 2,500+ AGENTS.md repositories
- [X] Researched LLM-as-judge rubric best practices
- [X] Drafted 7 decomposed rubrics for documentation evaluation (v1)
- [X] Filed orch UX issue (orch-z2x) for missing model availability command
- [X] Fixed .envrc to include use_api_keys for orch access
- [X] Got Gemini critique via orch - identified overlaps and 3 missing dimensions
- [X] Revised rubrics to v2 (12 dimensions) based on AI-optimization feedback
- [ ] Update rubrics-v2.md with full definitions
- [ ] Draft prompt template (next step)
- [ ] Create before/after examples (next step)

* Key Decisions

** Decision 1: Non-interactive patch-based workflow

- Context: Deciding how doc-review should integrate with the human workflow
- Options considered:
  1. Direct edit mode - LLM directly modifies docs
  2. Report-only mode - LLM produces a report, human edits
  3. Patch-based mode - LLM generates patches, human reviews and applies
- Rationale: Patch-based gives the human control, enables review before applying, and works with the existing git workflow
- Impact: Two-phase workflow: generate patches (non-interactive), then review patches (interactive Claude session; sketched below)

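A minimal sketch of the review phase, assuming the patches land as unified diffs in the output directory (paths and the loop itself are illustrative, not the final workflow):

#+begin_src bash
# Hypothetical review loop over generated patches.
cd ~/proj/foo
for p in /tmp/foo-patches/*.patch; do
  less "$p"                        # human reads the proposed change
  if git apply --check "$p"; then  # dry run: does it still apply cleanly?
    git apply "$p"                 # apply, or skip and move on to the next patch
  fi
done
#+end_src
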
** Decision 2: Decomposed rubrics over holistic evaluation

- Context: How should the LLM evaluate documentation quality?
- Options considered:
  1. Single holistic prompt ("rate this doc's quality")
  2. Multi-dimensional single pass ("rate on 10 dimensions")
  3. Decomposed rubrics (one dimension per evaluation)
- Rationale: Research shows LLMs are better at "guided summarization" than complex reasoning. The decomposed approach plays to LLM strengths.
- Impact: 7 separate rubrics, each with a clear question, scoring levels, and detect patterns

** Decision 3: Three-level scoring (PASS/MARGINAL/FAIL)

- Context: What granularity for scoring?
- Options considered:
  1. Binary (pass/fail)
  2. Three-level (pass/marginal/fail)
  3. Five- or ten-point scale
- Rationale: Binary loses nuance; 5+ point scales introduce ambiguity. Three levels give an actionable distinction.
- Impact: FAIL triggers patch generation, MARGINAL flags for human review, PASS means agent-friendly (aggregation sketched below)

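A sketch of the deterministic aggregation this implies, assuming the JSON output format drafted under Raw Notes (file names are illustrative):

#+begin_src bash
# Hypothetical routing step: parse one evaluation file and route by score.
# FAIL -> queue for patch generation; MARGINAL -> flag for human review; PASS -> no action.
jq -r '.evaluations[] | select(.score == "FAIL") | .rubric' eval.json > needs-patch.txt
jq -r '.evaluations[] | select(.score == "MARGINAL") | .rubric' eval.json > needs-review.txt
#+end_src
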
** Decision 4: Graph-based doc discovery (deferred)

- Context: How does doc-review find which docs to evaluate?
- Decision: Start from README.md or AGENTS.md and graph out via links
- Rationale: Not all .md files are documentation. Following links from the root finds the connected docs.
- Impact: Created separate bead (skills-53k) to design the discovery algorithm (first sketch below)

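A first sketch of the traversal, pending the real design in skills-53k (assumes relative markdown links only and no depth limit):

#+begin_src bash
# Hypothetical breadth-first walk over relative .md links, starting from the root doc.
root="${1:-README.md}"
queue=("$root"); seen=""
while ((${#queue[@]})); do
  doc="${queue[0]}"; queue=("${queue[@]:1}")
  case " $seen " in *" $doc "*) continue ;; esac  # skip visited docs (handles cycles)
  seen="$seen $doc"
  # Extract [text](relative/path.md) targets and enqueue the ones that exist on disk.
  while IFS= read -r link; do
    target="$(dirname "$doc")/$link"
    [[ -f "$target" ]] && queue+=("$target")
  done < <(grep -oE '\]\([^)#]+\.md\)' "$doc" 2>/dev/null | sed -E 's/\]\((.*)\)/\1/')
done
echo "$seen"
#+end_src
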
** Decision 5: "Instruction Clarity" rubric added

- Context: Initial 6 rubrics didn't cover the "boundaries" pattern from AGENTS.md research
- Discussion: User asked what "boundaries" meant. Realized the ✅/⚠️/🚫 pattern generalizes beyond AGENTS.md
- Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
- Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files

* Problems & Solutions

| Problem | Solution | Learning |
|---------+----------+----------|
| orch CLI failed - no API keys configured | Tried 4 models (gemini, flash, deepseek, gpt) - all failed; filed orch-z2x | Need `orch models` command to show available models |
| API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access |
| False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns |
| Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference |

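For reference, the fixed .envrc is just the two direnv hooks (use_api_keys is a local helper, not stock direnv):

#+begin_src bash
# .envrc - sketch of the fix
use flake
use_api_keys
#+end_src
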
* Technical Details

** Artifacts Created

- `/tmp/doc-review-drafts/rubrics-v1.md` - Full rubric definitions (181 lines)
- `skills-bcu` - Main design bead with workflow description
- `skills-1ig` - Conventions research bead (in_progress)
- `skills-53k` - Graph discovery design bead (open)
- `orch-z2x` - Filed in ~/proj/orch for model availability feature

** Commands Used

#+begin_src bash
# Web search for conventions
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
WebSearch: "LLM as judge rubric prompt engineering best practices"

# Web fetch for deep dives
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
WebFetch: docs.kapa.ai/improving/writing-best-practices
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
WebFetch: montecarlodata.com/blog-llm-as-judge/

# Beads workflow
bd create --title="Design doc-review skill" --type=feature
bd dep add skills-bcu skills-1ig
bd dep add skills-bcu skills-53k
#+end_src

** The 7 Rubrics

| # | Rubric | Key Question |
|---+--------+--------------|
| 1 | Self-Containment | Can section be understood alone? |
| 2 | Heading Structure | Logical hierarchy, descriptive names? |
| 3 | Terminology Consistency | Same term for same concept? |
| 4 | Code Example Completeness | Examples runnable as shown? |
| 5 | Explicit Context | Prerequisites stated, no implicit knowledge? |
| 6 | Technical Specificity | Exact versions, flags, commands? |
| 7 | Instruction Clarity | Required vs optional vs dangerous? |

Each rubric includes:

- Scoring levels with clear definitions
- Detect patterns (specific things to look for)
- Reasoning requirement (must quote evidence)
- Suggested fix field in output

* Process and Workflow

** What Worked Well

- Web search → web fetch pipeline for research
- Beads for tracking design decisions
- Iterative rubric development with user feedback
- User pushing back on false distinctions

** What Was Challenging

- orch unavailable due to missing API keys - blocked brainstorming with external model
- Keeping rubrics crisp - tendency to over-specify

* Learning and Insights

** Technical Insights

- LLMs are "better at guided summarization than complex reasoning"
- One few-shot example often outperforms multiple (diminishing returns)
- Decomposed evaluation + deterministic aggregation > holistic evaluation
- llms.txt is emerging as the robots.txt equivalent for LLM access
- MCP (Model Context Protocol) is an option for structured doc discovery

** Process Insights

- Research before design - the web search surfaced patterns we wouldn't have invented
- GitHub's 2,500-repo study provided concrete evidence for conventions
- Asking "what would a rubric for X look like" forces clarity on what X means

** Architectural Insights

- Two-phase workflow (generate then review) separates concerns
- Patches as the intermediate format enable git integration
- Per-dimension rubrics enable independent iteration

** Key Research Sources

- GitHub blog: 2,500+ AGENTS.md analysis - "one snippet beats three paragraphs"
- kapa.ai: proximity principle, self-contained sections, semantic discoverability
- biel.ai: single purpose per section, complete code examples
- Monte Carlo: 7 best practices for LLM-as-judge, failure modes
- ACM ICER 2025: "Rubric Is All You Need" paper

* Context for Future Work

** Open Questions

- How to handle large repos with many doc files? (chunking strategy)
- Should rubrics be weighted differently?
- How to handle generated docs (should they be excluded)?
- What's the right model for different stages (eval vs patch generation)?

** Next Steps

1. Draft the prompt template that uses these rubrics
2. Create before/after examples for few-shot
3. Design graph-based doc discovery (skills-53k)
4. Prototype scan.sh script
5. Test on real repos

** Related Work

- skills-53k: Graph-based doc discovery design
- skills-bcu: Parent design bead
- orch-z2x: Model availability feature for orch

** External References

- https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
- https://docs.kapa.ai/improving/writing-best-practices
- https://biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
- https://www.montecarlodata.com/blog-llm-as-judge/
- https://arxiv.org/abs/2503.23989 (Rubric Is All You Need)
- https://agents.md (AGENTS.md standard)

* Raw Notes

** Envisioned Workflow (from initial brainstorm)

#+begin_src bash
# Phase 1: Generate patches (non-interactive, burnable credits)
doc-review scan ~/proj/foo --model claude-sonnet --output /tmp/foo-patches/

# Phase 2: Review patches (interactive session)
cd ~/proj/foo
claude  # human reviews patches, applies selectively
#+end_src

** Skill Structure (tentative)

#+begin_example
skills/doc-review/
├── prompt.md   # Core review instructions + style guide
├── scan.sh     # Orchestrates: find docs → invoke claude → emit patches
└── README.md
#+end_example

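A stub of what scan.sh might orchestrate; discover_docs stands in for the undesigned discovery step (skills-53k), and the claude invocation is illustrative:

#+begin_src bash
#!/usr/bin/env bash
# Hypothetical scan.sh stub: discover docs, evaluate each, emit one patch per doc.
repo="$1"; out="${2:-/tmp/patches}"
mkdir -p "$out"
for doc in $(discover_docs "$repo"); do  # placeholder for graph-based discovery
  prompt="$(cat prompt.md "$doc")"       # style guide + doc content as one prompt
  claude -p "$prompt" > "$out/$(basename "$doc").patch"
done
#+end_src
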
** Output Format for Evaluations

#+begin_src json
{
  "section": "## Installation",
  "line_range": [15, 42],
  "evaluations": [
    {
      "rubric": "self_containment",
      "score": "FAIL",
      "reasoning": "Line 23 says 'as configured above' but the configuration is in a different file.",
      "evidence": "as configured above",
      "suggested_fix": "Add explicit reference: 'as configured in config.yaml (see Configuration section)'"
    }
  ]
}
#+end_src

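Since evidence is supposed to be a verbatim quote, a cheap deterministic post-check can catch hallucinated evidence (a sketch; file names are illustrative):

#+begin_src bash
# Hypothetical post-check: every non-null evidence string must appear verbatim
# in the source doc, otherwise the evaluation is suspect.
doc="docs/install.md"
jq -r '.evaluations[] | select(.evidence != null) | .evidence' eval.json |
while IFS= read -r quote; do
  grep -qF -- "$quote" "$doc" || echo "suspect evidence (not found in $doc): $quote"
done
#+end_src
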
** Key Quote from Research

#+begin_quote
"LLMs are better at guided summarization than complex reasoning. If you can break down a problem into multiple smaller steps of guided summarization by creating a rubric, you can play to the strengths of LLMs and then rely on simple deterministic code to parse the output and score the rubric."

— Monte Carlo, LLM-as-Judge best practices
#+end_quote

** Gemini Critique of Rubrics (via orch)

After fixing .envrc to include use_api_keys, ran high-temp Gemini critique.

*** Overlaps Identified

- Self-Containment vs Explicit Context had significant overlap
- Fix: Split into Contextual Independence (referential) + Environmental State (runtime)

*** Missing Dimensions for AI

1. **Verifiability** - agents get stuck when they can't tell if a command succeeded
2. **Format Integrity** - broken markdown/JSON causes hallucinations
3. **Token Efficiency** - fluff dilutes semantic weight in vector search

*** Key Insight

#+begin_quote
"Agents fail when given choices without criteria. Change 'Use a large instance' to 'Use instance type t3.2xlarge or greater.'"
#+end_quote

*** Revised Rubrics (v3 final: 10 dimensions, merged)

| Phase    | #  | Rubric |
|----------+----+--------|
| Read     | 1  | Format Integrity |
| Find     | 2  | Semantic Headings |
| Find     | 3  | Contextual Independence |
| Run      | 4  | Configuration Precision (merged: Env State + Tech Specificity) |
| Run      | 5  | Code Executability |
| Run      | 6  | Deterministic Instructions |
| Verify   | 7  | Execution Verification (merged: Verifiable Output + Error Recovery) |
| Optimize | 8  | Terminology Strictness |
| Optimize | 9  | Token Efficiency |
| Optimize | 10 | Security Boundaries |

** Prompt Template Critique (Gemini round 3)

*** Problem: 10 rubrics = cognitive overload

#+begin_quote
"Accuracy drops significantly after 3-5 distinct evaluation criteria per prompt"
#+end_quote

*** Anti-patterns identified

1. JSON-only output - model commits to a score before thinking (fix: CoT first; see the sketch below)
2. Evidence hallucination - model invents quotes (fix: allow null evidence)
3. Positional bias - middle rubrics degrade (fix: split across passes)

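One way to encode fixes 1 and 2, sketched as a single-rubric prompt skeleton (the wording is illustrative, not the final template):

#+begin_src bash
# Hypothetical per-rubric prompt skeleton: reasoning before verdict, null evidence allowed.
read -r -d '' RUBRIC_PROMPT <<'EOF' || true
Evaluate the section against ONE rubric: {rubric_name}.
1. Reason step by step first. Quote evidence verbatim, or use null if none exists.
2. Only after reasoning, emit the verdict JSON:
   {"rubric": "{rubric_name}", "score": "PASS|MARGINAL|FAIL", "evidence": ..., "suggested_fix": ...}
EOF
#+end_src
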
*** Architecture options

- **2-Pass**: Linter (structural) + Agent Simulator (semantic)
- **Many-Pass**: One rubric per pass (max accuracy, max latency)
- **Iterative/Recursive**: Quick scan → deep-dive on failures
- **Hybrid**: Single-pass quick, multi-pass on demand

*** The "Agent Simulator" persona

#+begin_quote
"Simulate execution step-by-step. Identify where you get stuck, where instructions are ambiguous, or where you might hallucinate due to missing context."
#+end_quote

** Architecture Research (web search + Gemini round 4)

*** Frameworks discovered

| Framework | Key Insight |
|-----------+-------------|
| PRE (ACM 2025) | One rubric per call for accuracy |
| G-Eval | 3-step: define → CoT → execute |
| EvalPlanner | Plan → Execute → Judge |
| Spring AI Recursive | Generate → Evaluate → Retry loop |
| ARISE | 22 specialized agents for different tasks |
| Realm | Recursive refinement with Bayesian aggregation |
| HuCoSC | Break complex problems into independent analyses |

*** Final Architecture: "Triage & Specialist" Cascade

#+begin_example
Phase 1: TRIAGE SCAN (cheap model)
├── Input: doc chunk + 10 rubric definitions
├── Output: list of suspect rubrics
└── Cost: ~0 for clean docs

Phase 2: SPECIALIST AGENTS (PRE-style, parallel)
├── One agent per suspect rubric
└── Each outputs: patch for that dimension

Phase 3: VERIFICATION LOOP (recursive)
├── Re-evaluate patched content
└── Retry with feedback if still FAIL
#+end_example

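A driver sketch for the cascade, written against the proposed (not yet built) orch extensions from skills-d6r; every orch subcommand and flag here is an assumption:

#+begin_src bash
# Hypothetical cascade driver; `orch evaluate` is the proposed interface, not a real command yet.
chunk="docs/install.md"

# Phase 1: triage with a cheap model -> list of suspect rubric ids
suspects=$(orch evaluate "$chunk" --rubrics=all --model=flash --triage-only)

# Phase 2: one specialist per suspect rubric, in parallel
for r in $suspects; do
  orch evaluate "$chunk" --rubrics="$r" --model=sonnet --output="patches/$r/" &
done
wait

# Phase 3 (not shown): re-evaluate patched content, retry with feedback while any rubric still FAILs
#+end_src
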
*** Why this pattern

- Adaptive cost: clean content (the ~80% case) pays near-zero
- PRE accuracy: one dimension at a time reduces hallucination
- Recursive safety: verify patches don't introduce regressions

* Session Metrics

- Commits made: 0 (design session, no code yet)
- Files touched: 4 (rubrics-v1.md, rubrics-v3.md, prompt-template-v1.md in /tmp; .envrc)
- Beads created: 4 (skills-bcu, skills-1ig, skills-53k, orch-z2x)
- Beads closed: 1 (skills-dpw - duplicate moved to orch repo)
- Web searches: 6 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive)
- Web fetches: 4 (GitHub blog, kapa.ai, biel.ai, Monte Carlo)
- Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis)
- Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged)
- Key outcome: "Triage & Specialist" cascade architecture