#+TITLE: Doc-Review Skill Design: Research, Rubrics, Vale Discovery, and Repo Spin-off
#+DATE: 2025-12-04
#+KEYWORDS: doc-review, rubrics, LLM-as-judge, agent-friendly-docs, documentation-drift, AGENTS.md, vale, hybrid-architecture
#+COMMITS: 1
#+COMPRESSION_STATUS: uncompressed

* Session Summary
** Date: 2025-12-04 (Day 1 of doc-review skill)
** Focus Area: Designing a doc-review skill to normalize documentation and fight documentation drift

* Accomplishments
- [X] Created bead structure for doc-review skill design (skills-bcu with 2 blockers)
- [X] Researched agent-friendly documentation conventions via web search
- [X] Analyzed GitHub's study of 2,500+ AGENTS.md repositories
- [X] Researched LLM-as-judge rubric best practices
- [X] Drafted 7 decomposed rubrics for documentation evaluation (v1)
- [X] Filed orch UX issue (orch-z2x) for missing model availability command
- [X] Fixed .envrc to include use_api_keys for orch access
- [X] Got Gemini critique via orch - identified overlaps and 3 missing dimensions
- [X] Revised rubrics to v2 (12 dimensions), merged to v3 (10 dimensions)
- [X] Created prompt-template-v1.md with full system prompt and example
- [X] Got Gemini critique on prompt - identified cognitive overload with 10 rubrics
- [X] Researched multi-pass/recursive LLM evaluation architectures (PRE, G-Eval, ARISE)
- [X] Designed "Triage & Specialist" cascade architecture
- [X] Discovered Vale prose linter - can handle 4-7 rubrics deterministically
- [X] Designed Vale + LLM hybrid architecture (deterministic first, LLM only for semantic)
- [X] Ran GPT/Gemini consensus on spin-off decision - both recommended separate repo
- [X] Created ~/proj/doc-review with beads setup
- [X] Migrated artifacts: rubrics-v3.md, prompt-template-v1.md to design/ folder
- [X] Updated doc-review/AGENTS.md with full architecture context
- [X] Created beads in doc-review: doc-review-i8n (Vale), doc-review-4to (LLM), doc-review-xy3 (CLI)
- [X] Closed skills repo beads: skills-bcu, skills-1ig, skills-53k
- [X] Committed and pushed skills repo

* Key Decisions
** Decision 1: Non-interactive patch-based workflow
- Context: Deciding how doc-review should integrate with the human workflow
- Options considered:
  1. Direct edit mode - LLM directly modifies docs
  2. Report-only mode - LLM produces report, human edits
  3. Patch-based mode - LLM generates patches, human reviews and applies
- Rationale: Patch-based gives the human control, enables review before apply, works with the existing git workflow
- Impact: Two-phase workflow: generate patches (non-interactive), review patches (interactive Claude session)

** Decision 2: Decomposed rubrics over holistic evaluation
- Context: How should the LLM evaluate documentation quality?
- Options considered:
  1. Single holistic prompt ("rate this doc's quality")
  2. Multi-dimensional single pass ("rate on 10 dimensions")
  3. Decomposed rubrics (one dimension per evaluation)
- Rationale: Research shows LLMs are better at "guided summarization" than complex reasoning. The decomposed approach plays to LLM strengths.
- Impact: 7 separate rubrics, each with a clear question, scoring levels, and detect patterns

** Decision 3: Three-level scoring (PASS/MARGINAL/FAIL)
- Context: What granularity for scoring?
- Options considered:
  1. Binary (pass/fail)
  2. Three-level (pass/marginal/fail)
  3. Five- or ten-point scale
- Rationale: Binary loses nuance; 5+ point scales introduce ambiguity. Three levels give an actionable distinction.
- Impact: FAIL triggers patch generation, MARGINAL flags for human review, PASS means agent-friendly
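Decision 3's contract is what makes deterministic aggregation possible. A minimal sketch of that routing, assuming the JSON evaluation format drafted later in Raw Notes and a hypothetical per-file `report.json` (not part of the implemented tool):

#+begin_src bash
# Hypothetical score routing: FAIL -> patch queue, MARGINAL -> human review.
# Assumes report.json matches the evaluation format sketched under Raw Notes.
report=report.json

jq -r '.evaluations[] | select(.score == "FAIL") | .rubric' "$report" \
  > /tmp/needs-patch.txt

jq -r '.evaluations[] | select(.score == "MARGINAL") | "\(.rubric): \(.evidence)"' "$report" \
  > /tmp/needs-review.txt

# Non-zero exit lets CI gate on any FAIL
[[ -s /tmp/needs-patch.txt ]] && exit 1 || exit 0
#+end_src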
** Decision 4: Graph-based doc discovery (deferred)
- Context: How does doc-review find which docs to evaluate?
- Decision: Start from README.md or AGENTS.md and graph out via links
- Rationale: Not all .md files are documentation. Following links from the root finds connected docs.
- Impact: Created a separate bead (skills-53k) to design the discovery algorithm

** Decision 5: "Instruction Clarity" rubric added
- Context: The initial 6 rubrics didn't cover the "boundaries" pattern from the AGENTS.md research
- Discussion: User asked what "boundaries" meant. Realized the ✅/⚠️/🚫 pattern generalizes beyond AGENTS.md
- Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
- Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files

** Decision 6: Vale + LLM hybrid architecture
- Context: Gemini critique revealed that 10 rubrics in one prompt cause cognitive overload
- Discovery: Vale (prose linter) can handle 4-7 rubrics deterministically with YAML rules
- Options considered:
  1. Pure LLM approach (original plan)
  2. Pure Vale approach (limited to pattern matching)
  3. Vale + LLM hybrid (deterministic first, LLM for semantic)
- Rationale: Vale catches ~40% of issues instantly, for free, in CI. The LLM is only needed for semantic analysis.
- Impact: Three-stage pipeline:
  - Stage 1: Vale (fast/free) - Format Integrity, Semantic Headings, Deterministic Instructions, Terminology, Token Efficiency
  - Stage 2: LLM Triage (cheap model) - Quick semantic scan
  - Stage 3: LLM Specialists (capable model) - Patch generation per failed rubric

** Decision 7: Spin off to separate repository
- Context: Project grew from "Claude skill" to "standalone tool with Vale + LLM"
- Ran consensus: GPT and Gemini both recommended a separate repo
  - GPT reasoning: "Start with MVP in separate repo, iterate fast"
  - Gemini reasoning: "Invest in orch integration, design for reuse"
- Decision: Created ~/proj/doc-review as a standalone project
- Impact: Clean separation, own beads, can be used outside the claude-code context

* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
| orch CLI failed - no API keys configured | Tried 4 models (gemini, flash, deepseek, gpt) - all failed | Need `orch models` command to show available models |
| API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access (sketch below) |
| False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns |
| Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference |
| 10 rubrics in one LLM prompt = cognitive overload | Split: Vale for deterministic, LLM for semantic only | Play to tool strengths - deterministic tools for pattern matching |
| OPENROUTER_KEY vs OPENROUTER_API_KEY mismatch | Filed bug orch-6o3 in ~/proj/orch | Environment variable naming must be consistent |
| bd sync failed with worktree error | Used regular git add + commit instead | bd sync has edge cases with certain git states |
| doc-review repo missing git identity | Left for user to configure user.email/name | New repos need git config before first commit |
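For the record, the .envrc fix from the table above amounts to roughly this (a sketch; `use_api_keys` is a local direnv helper in this environment, not a standard direnv function):

#+begin_src bash
# ~/proj/skills/.envrc - sketch of the fix, not copied verbatim
use flake        # was already present: loads the Nix dev shell
use_api_keys     # added: local helper that exports the API keys orch needs
#+end_src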
* Technical Details
** Artifacts Created
- `/tmp/doc-review-drafts/rubrics-v1.md` - Initial 7 rubric definitions
- `/tmp/doc-review-drafts/rubrics-v3.md` - Final 10 rubrics (320 lines)
- `/tmp/doc-review-drafts/prompt-template-v1.md` - LLM system prompt with example
- `~/proj/doc-review/docs/design/rubrics-v3.md` - Migrated final rubrics
- `~/proj/doc-review/docs/design/prompt-template-v1.md` - Migrated prompt template
- `~/proj/doc-review/AGENTS.md` - Full project context for next agent

** Beads Created/Updated
In skills repo (all closed):
- `skills-bcu` - Main design bead (closed: spun off to doc-review)
- `skills-1ig` - Conventions research (closed: completed)
- `skills-53k` - Graph discovery design (closed: moved to doc-review)

In doc-review repo:
- `doc-review-i8n` - Implement Vale style for rubrics
- `doc-review-4to` - Implement LLM prompts for semantic rubrics
- `doc-review-xy3` - Design CLI interface

In orch repo:
- `orch-z2x` - Model availability command feature
- `orch-6o3` - OPENROUTER_KEY vs OPENROUTER_API_KEY bug

** Commands Used
#+begin_src bash
# Web search for conventions and research
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
WebSearch: "LLM as judge rubric prompt engineering best practices"
WebSearch: "multi-pass LLM evaluation recursive evaluation framework"
WebSearch: "prose linter markdown documentation CI"

# Web fetch for deep dives
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
WebFetch: docs.kapa.ai/improving/writing-best-practices
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
WebFetch: montecarlodata.com/blog-llm-as-judge/
WebFetch: vale.sh

# orch consultations
orch ask gemini --temp 1.2 "Critique these rubrics for AI-optimized documentation..."
orch consensus gpt gemini "Should doc-review be a skill or separate repo?"

# Migration commands
mkdir -p ~/proj/doc-review/docs/design
cp /tmp/doc-review-drafts/*.md ~/proj/doc-review/docs/design/

# Beads workflow
bd close skills-bcu skills-1ig skills-53k
#+end_src

** The 10 Rubrics (Final v3)
Organized by "Agent's Hierarchy of Needs":

| Phase | # | Rubric | Tool | Key Question |
|-------+---+--------+------+--------------|
| Read | 1 | Format Integrity | Vale | Valid markdown? Code blocks tagged? |
| Find | 2 | Semantic Headings | Vale | Headings contain task+object keywords? |
| Find | 3 | Contextual Independence | LLM | No "as mentioned above"? |
| Run | 4 | Configuration Precision | Partial | Exact versions, flags, paths? |
| Run | 5 | Code Executability | LLM | All imports present? |
| Run | 6 | Deterministic Instructions | Vale | No hedging ("might", "consider")? |
| Verify | 7 | Execution Verification | LLM | Expected output + error recovery? |
| Optimize | 8 | Terminology Strictness | Vale | 1:1 term-concept mapping? |
| Optimize | 9 | Token Efficiency | Vale | No filler phrases? |
| Optimize | 10 | Security Boundaries | Partial | No hardcoded secrets? |

Tool assignment:
- **Vale (5)**: 1, 2, 6, 8, 9 - Pattern matching, deterministic (rule sketch below)
- **LLM (3)**: 3, 5, 7 - Semantic understanding required
- **Partial (2)**: 4, 10 - Hybrid (Vale for patterns, LLM for edge cases)
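To illustrate what the Vale-assigned rubrics could look like as rules, here is a hypothetical "existence" rule for Rubric 6 (Deterministic Instructions); the style name, file path, and token list are assumptions, not the rules tracked in doc-review-i8n:

#+begin_src bash
# Hypothetical Vale rule for Rubric 6 (Deterministic Instructions).
# Vale "existence" rules flag any occurrence of the listed tokens.
mkdir -p styles/DocReview
cat > styles/DocReview/Hedging.yml <<'EOF'
extends: existence
message: "Hedging language ('%s') - state the required action directly."
level: error
ignorecase: true
tokens:
  - might
  - consider
  - perhaps
  - if you want
EOF
#+end_src

Wiring it up would also need a `.vale.ini` whose `StylesPath` points at `styles/` and which enables the `DocReview` style for markdown files.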
Each rubric includes:
- Scoring levels with clear definitions (PASS/MARGINAL/FAIL)
- Detect patterns (specific things to look for)
- Reasoning requirement (must quote evidence)
- Suggested fix field in output

* Process and Workflow
** What Worked Well
- Web search → web fetch pipeline for research
- Beads for tracking design decisions
- Iterative rubric development with user feedback
- User pushing back on false distinctions

** What Was Challenging
- orch unavailable due to missing API keys - blocked brainstorming with external model
- Keeping rubrics crisp - tendency to over-specify

* Learning and Insights
** Technical Insights
- LLMs are "better at guided summarization than complex reasoning"
- One few-shot example often outperforms multiple (diminishing returns)
- Decomposed evaluation + deterministic aggregation > holistic evaluation
- llms.txt is emerging as a robots.txt equivalent for LLM access
- MCP (Model Context Protocol) for structured doc discovery

** Process Insights
- Research before design - the web search surfaced patterns we wouldn't have invented
- GitHub's 2,500 repo study provided concrete evidence for conventions
- Asking "what would a rubric for X look like" forces clarity on what X means

** Architectural Insights
- Two-phase workflow (generate then review) separates concerns
- Patches as an intermediate format enable git integration
- Per-dimension rubrics enable independent iteration

** Key Research Sources
- GitHub blog: 2,500+ AGENTS.md analysis - "one snippet beats three paragraphs"
- kapa.ai: proximity principle, self-contained sections, semantic discoverability
- biel.ai: single purpose per section, complete code examples
- Monte Carlo: 7 best practices for LLM-as-judge, failure modes
- ACM ICER 2025: "Rubric Is All You Need" paper

* Context for Future Work
** Open Questions
- How to handle large repos with many doc files? (chunking strategy)
- Should rubrics be weighted differently?
- How to handle generated docs (should they be excluded)?
- What's the right model for different stages (triage vs patch generation)?
- Vale vs markdownlint vs custom - which linter is best for this use case?

** Next Steps (in ~/proj/doc-review)
1. Implement Vale style for 5 deterministic rubrics (doc-review-i8n)
2. Implement LLM prompts for 3 semantic rubrics (doc-review-4to)
3. Design CLI interface (doc-review-xy3)
4. Graph-based doc discovery (from skills-53k, needs new bead; see the sketch after this list)
5. Test on real repos (the skills repo is a good candidate)
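A rough sketch of the graph-based discovery idea from step 4 (an assumption about the eventual algorithm, not an implementation): start from README.md/AGENTS.md and follow local markdown links breadth-first.

#+begin_src bash
# Hypothetical breadth-first doc discovery; only [text](path.md) links are
# followed, and filenames with spaces are ignored for simplicity.
discover_docs() {
  local queue=("$@") seen=" "
  while ((${#queue[@]} > 0)); do
    local doc="${queue[0]}"
    queue=("${queue[@]:1}")
    [[ -f "$doc" && "$seen" != *" $doc "* ]] || continue
    seen+="$doc "
    echo "$doc"
    # Extract relative markdown link targets like [text](docs/setup.md).
    while IFS= read -r target; do
      queue+=("$(dirname "$doc")/$target")
    done < <(grep -oE '\]\([^):#[:space:]]+\.md\)' "$doc" | sed 's/^](//; s/)$//')
  done
}

discover_docs README.md AGENTS.md
#+end_src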
** Related Work
- ~/proj/doc-review - New home for this project
- doc-review-i8n, doc-review-4to, doc-review-xy3 - Active beads
- orch-z2x: Model availability feature for orch
- orch-6o3: OPENROUTER env var naming bug

** External References
- https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
- https://docs.kapa.ai/improving/writing-best-practices
- https://biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
- https://www.montecarlodata.com/blog-llm-as-judge/
- https://arxiv.org/abs/2503.23989 (Rubric Is All You Need)
- https://agents.md (AGENTS.md standard)
- https://vale.sh (Vale prose linter)

* Raw Notes
** Envisioned Workflow (from initial brainstorm)
#+begin_src bash
# Phase 1: Generate patches (non-interactive, burnable credits)
doc-review scan ~/proj/foo --model claude-sonnet --output /tmp/foo-patches/

# Phase 2: Review patches (interactive session)
cd ~/proj/foo
claude  # human reviews patches, applies selectively
#+end_src

** Skill Structure (tentative)
#+begin_example
skills/doc-review/
├── prompt.md   # Core review instructions + style guide
├── scan.sh     # Orchestrates: find docs → invoke claude → emit patches
└── README.md
#+end_example

** Output Format for Evaluations
#+begin_src json
{
  "section": "## Installation",
  "line_range": [15, 42],
  "evaluations": [
    {
      "rubric": "self_containment",
      "score": "FAIL",
      "reasoning": "Line 23 says 'as configured above' but the configuration is in a different file.",
      "evidence": "as configured above",
      "suggested_fix": "Add explicit reference: 'as configured in config.yaml (see Configuration section)'"
    }
  ]
}
#+end_src

** Key Quote from Research
#+begin_quote
"LLMs are better at guided summarization than complex reasoning. If you can break down a problem into multiple smaller steps of guided summarization by creating a rubric, you can play to the strengths of LLMs and then rely on simple deterministic code to parse the output and score the rubric."
— Monte Carlo, LLM-as-Judge best practices
#+end_quote

** Gemini Critique of Rubrics (via orch)
After fixing .envrc to include use_api_keys, ran a high-temp Gemini critique.

*** Overlaps Identified
- Self-Containment vs Explicit Context had significant overlap
- Fix: Split into Contextual Independence (referential) + Environmental State (runtime)

*** Missing Dimensions for AI
1. **Verifiability** - agents get stuck when they can't tell if a command succeeded (illustrated below)
2. **Format Integrity** - broken markdown/JSON causes hallucinations
3. **Token Efficiency** - fluff dilutes semantic weight in vector search

*** Key Insight
#+begin_quote
"Agents fail when given choices without criteria. Change 'Use a large instance' to 'Use instance type t3.2xlarge or greater.'"
#+end_quote
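To make the Verifiability point concrete: an instruction becomes agent-checkable when it pairs the command with an expected signal. A purely illustrative example (the service name and messages are invented):

#+begin_src bash
# Instead of "start the service and make sure it works", give the agent a
# check it can run and an expected outcome. Invented example, not project code.
systemctl --user start myservice
if systemctl --user is-active --quiet myservice; then
  echo "OK: myservice is active"
else
  echo "FAIL: expected 'active'; inspect logs: journalctl --user -u myservice" >&2
  exit 1
fi
#+end_src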
*** Revised Rubrics (v3 final: 10 dimensions, merged)
| Phase | # | Rubric |
|-------+---+--------|
| Read | 1 | Format Integrity |
| Find | 2 | Semantic Headings |
| Find | 3 | Contextual Independence |
| Run | 4 | Configuration Precision (merged: Env State + Tech Specificity) |
| Run | 5 | Code Executability |
| Run | 6 | Deterministic Instructions |
| Verify | 7 | Execution Verification (merged: Verifiable Output + Error Recovery) |
| Optimize | 8 | Terminology Strictness |
| Optimize | 9 | Token Efficiency |
| Optimize | 10 | Security Boundaries |

** Prompt Template Critique (Gemini round 3)
*** Problem: 10 rubrics = cognitive overload
#+begin_quote
"Accuracy drops significantly after 3-5 distinct evaluation criteria per prompt"
#+end_quote

*** Anti-patterns identified
1. JSON-only output - model commits to a score before thinking (fix: CoT first)
2. Evidence hallucination - model invents quotes (fix: allow null)
3. Positional bias - middle rubrics degrade (fix: split passes)

*** Architecture options
- **2-Pass**: Linter (structural) + Agent Simulator (semantic)
- **Many-Pass**: One rubric per pass (max accuracy, max latency)
- **Iterative/Recursive**: Quick scan → deep-dive on failures
- **Hybrid**: Single-pass quick, multi-pass on demand

*** The "Agent Simulator" persona
#+begin_quote
"Simulate execution step-by-step. Identify where you get stuck, where instructions are ambiguous, or where you might hallucinate due to missing context."
#+end_quote

** Architecture Research (web search + Gemini round 4)
*** Frameworks discovered
| Framework | Key Insight |
|-----------+-------------|
| PRE (ACM 2025) | One rubric per call for accuracy |
| G-Eval | 3-step: define → CoT → execute |
| EvalPlanner | Plan → Execute → Judge |
| Spring AI Recursive | Generate → Evaluate → Retry loop |
| ARISE | 22 specialized agents for different tasks |
| Realm | Recursive refinement with Bayesian aggregation |
| HuCoSC | Break complex problems into independent analyses |

*** Final Architecture: "Triage & Specialist" Cascade
#+begin_example
Phase 1: TRIAGE SCAN (cheap model)
├── Input: doc chunk + 10 rubric definitions
├── Output: list of suspect rubrics
└── Cost: ~0 for clean docs

Phase 2: SPECIALIST AGENTS (PRE-style, parallel)
├── One agent per suspect rubric
└── Each outputs: patch for that dimension

Phase 3: VERIFICATION LOOP (recursive)
├── Re-evaluate patched content
└── Retry with feedback if still FAIL
#+end_example

*** Why this pattern
- Adaptive cost: 80% of clean content pays near-zero
- PRE accuracy: One dimension at a time reduces hallucination
- Recursive safety: Verify patches don't introduce regressions
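A back-of-the-envelope sketch of how this cascade could be wired up with orch (model names, prompts, and output handling are all assumptions; the real pipeline is being designed in doc-review-4to):

#+begin_src bash
# Hypothetical Triage & Specialist cascade driven by orch - not the real tool.
doc="docs/installation.md"
mkdir -p /tmp/doc-review-patches

# Phase 1: cheap triage pass - ask only for the names of suspect rubrics.
suspects=$(orch ask flash \
  "Given the 10 doc-review rubrics, list one per line the rubrics this document likely fails:
$(cat "$doc")")

# Phase 2: one specialist call per suspect rubric, each emitting a unified diff.
while IFS= read -r rubric; do
  [[ -n "$rubric" ]] || continue
  orch ask gemini \
    "Evaluate ONLY the rubric '$rubric'. Quote evidence, then emit a unified diff patch:
$(cat "$doc")" > "/tmp/doc-review-patches/${rubric// /_}.patch"
done <<< "$suspects"

# Phase 3 (verification loop) would re-run the triage on the patched document.
#+end_src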
repo?" GPT: "Start with MVP in separate repo. Skill integration can come later." Gemini: "Invest in orch as orchestration layer. Design doc-review for reuse." Both agreed: Separate repo is the right call. ** Final Architecture (Vale + LLM Hybrid) #+begin_example Stage 1: Vale (deterministic, fast, free) ├── Catches ~40% of issues instantly ├── Runs in CI on every commit └── No LLM cost for clean docs Stage 2: LLM Triage (cheap model) ├── Only runs if Vale passes └── Evaluates 3 semantic rubrics (contextual independence, code executability, execution verification) Stage 3: LLM Specialists (capable model) ├── One agent per failed rubric └── Generates patches #+end_example * Session Metrics - Commits made: 1 (139a521 - doc-review design session complete) - Files touched: 19 (per git diff --stat) - Lines added: +4187 - Lines removed: -65 - Beads created: 7 (skills-bcu, skills-1ig, skills-53k in skills; doc-review-i8n, doc-review-4to, doc-review-xy3 in doc-review; orch-z2x, orch-6o3 in orch) - Beads closed: 3 (skills-bcu, skills-1ig, skills-53k) - Web searches: 8 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive, prose linter, Vale) - Web fetches: 5 (GitHub blog, kapa.ai, biel.ai, Monte Carlo, vale.sh) - Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis) - Consensus runs: 2 (spin-off decision, implementation approach) - Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged) - Key outcome: Vale + LLM hybrid architecture, spun off to ~/proj/doc-review