#+TITLE: Doc-Review Skill Design: Research and Rubric Development
#+DATE: 2025-12-04
#+KEYWORDS: doc-review, rubrics, LLM-as-judge, agent-friendly-docs, documentation-drift, AGENTS.md
#+COMMITS: 0
#+COMPRESSION_STATUS: uncompressed

* Session Summary
** Date: 2025-12-04 (Day 1 of doc-review skill)
** Focus Area: Designing a doc-review skill to normalize documentation and fight documentation drift

* Accomplishments
- [X] Created bead structure for doc-review skill design (skills-bcu with 2 blockers)
- [X] Researched agent-friendly documentation conventions via web search
- [X] Analyzed GitHub's study of 2,500+ AGENTS.md repositories
- [X] Researched LLM-as-judge rubric best practices
- [X] Drafted 7 decomposed rubrics for documentation evaluation (v1)
- [X] Filed orch UX issue (orch-z2x) for missing model availability command
- [X] Fixed .envrc to include use_api_keys for orch access
- [X] Got Gemini critique via orch - identified overlaps and 3 missing dimensions
- [X] Revised rubrics to v2 (10 dimensions) based on AI-optimization feedback
- [ ] Update rubrics-v2.md with full definitions
- [ ] Prompt template (next step)
- [ ] Before/after examples (next step)

* Key Decisions
** Decision 1: Non-interactive patch-based workflow
- Context: Deciding how doc-review should integrate with the human workflow
- Options considered:
  1. Direct edit mode - LLM directly modifies docs
  2. Report-only mode - LLM produces report, human edits
  3. Patch-based mode - LLM generates patches, human reviews and applies
- Rationale: Patch-based gives the human control, enables review before apply, works with the existing git workflow
- Impact: Two-phase workflow: generate patches (non-interactive), review patches (interactive Claude session)

** Decision 2: Decomposed rubrics over holistic evaluation
- Context: How should the LLM evaluate documentation quality?
- Options considered:
  1. Single holistic prompt ("rate this doc's quality")
  2. Multi-dimensional single pass ("rate on 10 dimensions")
  3. Decomposed rubrics (one dimension per evaluation)
- Rationale: Research shows LLMs are better at "guided summarization" than complex reasoning. The decomposed approach plays to LLM strengths.
- Impact: 7 separate rubrics, each with a clear question, scoring levels, and detect patterns

** Decision 3: Three-level scoring (PASS/MARGINAL/FAIL)
- Context: What granularity for scoring?
- Options considered:
  1. Binary (pass/fail)
  2. Three-level (pass/marginal/fail)
  3. Five or ten point scale
- Rationale: Binary loses nuance, 5+ point scales introduce ambiguity. Three levels give an actionable distinction.
- Impact: FAIL triggers patch generation, MARGINAL flags for human review, PASS means agent-friendly

** Decision 4: Graph-based doc discovery (deferred)
- Context: How does doc-review find which docs to evaluate?
- Decision: Start from README.md or AGENTS.md and graph out via links (sketched below)
- Rationale: Not all .md files are documentation. Following links from the root finds connected docs.
- Impact: Created separate bead (skills-53k) to design the discovery algorithm
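For concreteness, a first-pass sketch of the link-graph traversal (Python rather than the eventual scan.sh; the README.md root, the relative-Markdown-link regex, and the breadth-first order are all assumptions to be settled in skills-53k):

#+begin_src python
#!/usr/bin/env python3
"""Illustrative only: breadth-first walk of the doc graph via relative .md links."""
import re
from collections import deque
from pathlib import Path

# Assumption: docs reference each other with [text](relative/path.md) style links.
LINK = re.compile(r"\]\(([^)#\s]+\.md)\)")

def discover_docs(root: Path) -> list[Path]:
    """Return the root doc plus every doc reachable from it through relative links."""
    start = root.resolve()
    seen, queue, ordered = {start}, deque([start]), []
    while queue:
        doc = queue.popleft()
        ordered.append(doc)
        for target in LINK.findall(doc.read_text(encoding="utf-8")):
            candidate = (doc.parent / target).resolve()
            if candidate.is_file() and candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
    return ordered

if __name__ == "__main__":
    for doc in discover_docs(Path("README.md")):  # or AGENTS.md
        print(doc)
#+end_src

Note the simplification: anything linked as a .md file counts as documentation, which is exactly the assumption skills-53k needs to examine.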
** Decision 5: "Instruction Clarity" rubric added
- Context: Initial 6 rubrics didn't cover the "boundaries" pattern from AGENTS.md research
- Discussion: User asked what "boundaries" meant. Realized the ✅/⚠️/🚫 pattern generalizes beyond AGENTS.md
- Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
- Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files

* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
| orch CLI failed - no API keys configured | Tried 4 models (gemini, flash, deepseek, gpt) - all failed | Need `orch models` command to show available models |
| API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access |
| False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns |
| Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference |

* Technical Details
** Artifacts Created
- `/tmp/doc-review-drafts/rubrics-v1.md` - Full rubric definitions (181 lines)
- `skills-bcu` - Main design bead with workflow description
- `skills-1ig` - Conventions research bead (in_progress)
- `skills-53k` - Graph discovery design bead (open)
- `orch-z2x` - Filed in ~/proj/orch for model availability feature

** Commands Used
#+begin_src bash
# Web search for conventions
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
WebSearch: "LLM as judge rubric prompt engineering best practices"

# Web fetch for deep dives
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
WebFetch: docs.kapa.ai/improving/writing-best-practices
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
WebFetch: montecarlodata.com/blog-llm-as-judge/

# Beads workflow
bd create --title="Design doc-review skill" --type=feature
bd dep add skills-bcu skills-1ig
bd dep add skills-bcu skills-53k
#+end_src

** The 7 Rubrics
| # | Rubric                    | Key Question                                 |
|---+---------------------------+----------------------------------------------|
| 1 | Self-Containment          | Can section be understood alone?             |
| 2 | Heading Structure         | Logical hierarchy, descriptive names?        |
| 3 | Terminology Consistency   | Same term for same concept?                  |
| 4 | Code Example Completeness | Examples runnable as shown?                  |
| 5 | Explicit Context          | Prerequisites stated, no implicit knowledge? |
| 6 | Technical Specificity     | Exact versions, flags, commands?             |
| 7 | Instruction Clarity       | Required vs optional vs dangerous?           |

Each rubric includes:
- Scoring levels with clear definitions
- Detect patterns (specific things to look for)
- Reasoning requirement (must quote evidence)
- Suggested fix field in output
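One way a rubric could be carried around as data, sketched in Python for illustration; the level wording and detect patterns shown here are placeholders, not the rubrics-v1.md definitions:

#+begin_src python
from dataclasses import dataclass

@dataclass
class Rubric:
    """One evaluation dimension, mirroring the fields listed above."""
    id: str
    key_question: str
    levels: dict[str, str]              # PASS / MARGINAL / FAIL definitions
    detect_patterns: tuple[str, ...]    # specific things for the evaluator to look for
    requires_evidence: bool = True      # reasoning must quote the offending text
    output_fields: tuple[str, ...] = ("score", "reasoning", "evidence", "suggested_fix")

# Placeholder wording - the real definitions live in rubrics-v1.md.
SELF_CONTAINMENT = Rubric(
    id="self_containment",
    key_question="Can this section be understood alone?",
    levels={
        "PASS": "Section stands on its own; every reference is explicit.",
        "MARGINAL": "Mostly self-contained but leans on nearby context.",
        "FAIL": "Meaning depends on content that is neither in nor linked from the section.",
    },
    detect_patterns=("as configured above", "see previous section", "as before"),
)
#+end_src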
* Process and Workflow
** What Worked Well
- Web search → web fetch pipeline for research
- Beads for tracking design decisions
- Iterative rubric development with user feedback
- User pushing back on false distinctions

** What Was Challenging
- orch unavailable due to missing API keys - blocked brainstorming with external model
- Keeping rubrics crisp - tendency to over-specify

* Learning and Insights
** Technical Insights
- LLMs are "better at guided summarization than complex reasoning"
- One few-shot example often outperforms multiple (diminishing returns)
- Decomposed evaluation + deterministic aggregation > holistic evaluation
- llms.txt is emerging as robots.txt equivalent for LLM access
- MCP (Model Context Protocol) for structured doc discovery

** Process Insights
- Research before design - the web search surfaced patterns we wouldn't have invented
- GitHub's 2,500 repo study provided concrete evidence for conventions
- Asking "what would a rubric for X look like" forces clarity on what X means

** Architectural Insights
- Two-phase workflow (generate then review) separates concerns
- Patches as intermediate format enables git integration
- Per-dimension rubrics enable independent iteration

** Key Research Sources
- GitHub blog: 2,500+ AGENTS.md analysis - "one snippet beats three paragraphs"
- kapa.ai: proximity principle, self-contained sections, semantic discoverability
- biel.ai: single purpose per section, complete code examples
- Monte Carlo: 7 best practices for LLM-as-judge, failure modes
- ACM ICER 2025: "Rubric Is All You Need" paper

* Context for Future Work
** Open Questions
- How to handle large repos with many doc files? (chunking strategy)
- Should rubrics be weighted differently?
- How to handle generated docs (should they be excluded)?
- What's the right model for different stages (eval vs patch generation)?

** Next Steps
1. Draft the prompt template that uses these rubrics
2. Create before/after examples for few-shot
3. Design graph-based doc discovery (skills-53k)
4. Prototype scan.sh script
5. Test on real repos

** Related Work
- skills-53k: Graph-based doc discovery design
- skills-bcu: Parent design bead
- orch-z2x: Model availability feature for orch

** External References
- https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
- https://docs.kapa.ai/improving/writing-best-practices
- https://biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
- https://www.montecarlodata.com/blog-llm-as-judge/
- https://arxiv.org/abs/2503.23989 (Rubric Is All You Need)
- https://agents.md (AGENTS.md standard)

* Raw Notes
** Envisioned Workflow (from initial brainstorm)
#+begin_src bash
# Phase 1: Generate patches (non-interactive, burnable credits)
doc-review scan ~/proj/foo --model claude-sonnet --output /tmp/foo-patches/

# Phase 2: Review patches (interactive session)
cd ~/proj/foo
claude  # human reviews patches, applies selectively
#+end_src

** Skill Structure (tentative)
#+begin_example
skills/doc-review/
├── prompt.md   # Core review instructions + style guide
├── scan.sh     # Orchestrates: find docs → invoke claude → emit patches
└── README.md
#+end_example

** Output Format for Evaluations
#+begin_src json
{
  "section": "## Installation",
  "line_range": [15, 42],
  "evaluations": [
    {
      "rubric": "self_containment",
      "score": "FAIL",
      "reasoning": "Line 23 says 'as configured above' but the configuration is in a different file.",
      "evidence": "as configured above",
      "suggested_fix": "Add explicit reference: 'as configured in config.yaml (see Configuration section)'"
    }
  ]
}
#+end_src

** Key Quote from Research
#+begin_quote
"LLMs are better at guided summarization than complex reasoning. If you can break down a problem into multiple smaller steps of guided summarization by creating a rubric, you can play to the strengths of LLMs and then rely on simple deterministic code to parse the output and score the rubric."
— Monte Carlo, LLM-as-Judge best practices
#+end_quote
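A sketch of the "simple deterministic code" half of that quote, collapsing the per-rubric scores from the JSON format above into the PASS/MARGINAL/FAIL actions of Decision 3 (Python for illustration; the file name is a placeholder for wherever scan.sh writes its evaluation output):

#+begin_src python
import json
from pathlib import Path

def section_verdict(evaluations: list[dict]) -> str:
    """Collapse per-rubric scores into one verdict: any FAIL wins, then MARGINAL."""
    scores = {e["score"] for e in evaluations}
    if "FAIL" in scores:
        return "FAIL"        # triggers patch generation
    if "MARGINAL" in scores:
        return "MARGINAL"    # flag for human review
    return "PASS"            # agent-friendly as-is

if __name__ == "__main__":
    # Placeholder path: wherever the evaluation JSON for a section ends up.
    report = json.loads(Path("section-eval.json").read_text(encoding="utf-8"))
    print(f'{report["section"]}: {section_verdict(report["evaluations"])}')
#+end_src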
** Gemini Critique of Rubrics (via orch)
After fixing .envrc to include use_api_keys, ran high-temp Gemini critique.

*** Overlaps Identified
- Self-Containment vs Explicit Context had significant overlap
- Fix: Split into Contextual Independence (referential) + Environmental State (runtime)

*** Missing Dimensions for AI
1. **Verifiability** - agents get stuck when they can't tell if command succeeded
2. **Format Integrity** - broken markdown/JSON causes hallucinations
3. **Token Efficiency** - fluff dilutes semantic weight in vector search

*** Key Insight
#+begin_quote
"Agents fail when given choices without criteria. Change 'Use a large instance' to 'Use instance type t3.2xlarge or greater.'"
#+end_quote

*** Revised Rubrics (v3 final: 10 dimensions, merged)
| Phase    |  # | Rubric                                                              |
|----------+----+---------------------------------------------------------------------|
| Read     |  1 | Format Integrity                                                    |
| Find     |  2 | Semantic Headings                                                   |
| Find     |  3 | Contextual Independence                                             |
| Run      |  4 | Configuration Precision (merged: Env State + Tech Specificity)      |
| Run      |  5 | Code Executability                                                  |
| Run      |  6 | Deterministic Instructions                                          |
| Verify   |  7 | Execution Verification (merged: Verifiable Output + Error Recovery) |
| Optimize |  8 | Terminology Strictness                                              |
| Optimize |  9 | Token Efficiency                                                    |
| Optimize | 10 | Security Boundaries                                                 |

** Prompt Template Critique (Gemini round 3)
*** Problem: 10 rubrics = cognitive overload
#+begin_quote
"Accuracy drops significantly after 3-5 distinct evaluation criteria per prompt"
#+end_quote

*** Anti-patterns identified
1. JSON-only output - model commits to score before thinking (fix: CoT first)
2. Evidence hallucination - model invents quotes (fix: allow null)
3. Positional bias - middle rubrics degrade (fix: split passes)

*** Architecture options
- **2-Pass**: Linter (structural) + Agent Simulator (semantic)
- **Many-Pass**: One rubric per pass (max accuracy, max latency)
- **Iterative/Recursive**: Quick scan → deep-dive on failures
- **Hybrid**: Single-pass quick, multi-pass on demand

*** The "Agent Simulator" persona
#+begin_quote
"Simulate execution step-by-step. Identify where you get stuck, where instructions are ambiguous, or where you might hallucinate due to missing context."
#+end_quote

** Architecture Research (web search + Gemini round 4)
*** Frameworks discovered
| Framework           | Key Insight                                      |
|---------------------+--------------------------------------------------|
| PRE (ACM 2025)      | One rubric per call for accuracy                 |
| G-Eval              | 3-step: define → CoT → execute                   |
| EvalPlanner         | Plan → Execute → Judge                           |
| Spring AI Recursive | Generate → Evaluate → Retry loop                 |
| ARISE               | 22 specialized agents for different tasks        |
| Realm               | Recursive refinement with Bayesian aggregation   |
| HuCoSC              | Break complex problems into independent analyses |

*** Final Architecture: "Triage & Specialist" Cascade
#+begin_example
Phase 1: TRIAGE SCAN (cheap model)
├── Input: doc chunk + 10 rubric definitions
├── Output: list of suspect rubrics
└── Cost: ~0 for clean docs

Phase 2: SPECIALIST AGENTS (PRE-style, parallel)
├── One agent per suspect rubric
└── Each outputs: patch for that dimension

Phase 3: VERIFICATION LOOP (recursive)
├── Re-evaluate patched content
└── Retry with feedback if still FAIL
#+end_example

*** Why this pattern
- Adaptive cost: 80% of clean content pays near-zero
- PRE accuracy: One dimension at a time reduces hallucination
- Recursive safety: Verify patches don't introduce regressions
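A minimal orchestration sketch of the cascade (Python for illustration; `call_model` stands in for however scan.sh ends up invoking claude or orch, and the payload keys `suspect_rubrics`, `score`, and `patched_text` are assumptions, not a settled contract):

#+begin_src python
from typing import Callable

# call_model(prompt_id, text) -> dict is a placeholder for the real model invocation;
# everything about its payload shape is an assumption for this sketch.
ModelCall = Callable[[str, str], dict]

def review_chunk(chunk: str, rubric_ids: list[str],
                 call_model: ModelCall, max_retries: int = 2) -> dict[str, dict]:
    # Phase 1: TRIAGE SCAN - one cheap pass names the rubrics that look suspect.
    triage = call_model("triage", chunk)
    suspects = [r for r in triage.get("suspect_rubrics", []) if r in rubric_ids]

    results: dict[str, dict] = {}
    for rubric_id in suspects:
        # Phase 2: SPECIALIST AGENT - one focused evaluation per suspect rubric (PRE-style).
        text = chunk
        verdict = call_model(rubric_id, text)
        # Phase 3: VERIFICATION LOOP - apply the suggested patch and re-evaluate.
        retries = 0
        while verdict.get("score") == "FAIL" and retries < max_retries:
            text = verdict.get("patched_text", text)
            verdict = call_model(rubric_id, text)
            retries += 1
        results[rubric_id] = verdict
    return results
#+end_src

The sketch runs specialists sequentially for clarity; the parallel fan-out described in Phase 2 is left to the eventual scan.sh.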
* Session Metrics
- Commits made: 0 (design session, no code yet)
- Files touched: 3 (rubrics-v1.md, rubrics-v3.md, prompt-template-v1.md in /tmp; .envrc)
- Beads created: 4 (skills-bcu, skills-1ig, skills-53k, orch-z2x)
- Beads closed: 1 (skills-dpw - duplicate moved to orch repo)
- Web searches: 6 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive)
- Web fetches: 4 (GitHub blog, kapa.ai, biel.ai, Monte Carlo)
- Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis)
- Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged)
- Key outcome: "Triage & Specialist" cascade architecture