#+TITLE: Doc-Review Skill Design: Research and Rubric Development
#+DATE: 2025-12-04
#+KEYWORDS: doc-review, rubrics, LLM-as-judge, agent-friendly-docs, documentation-drift, AGENTS.md
#+COMMITS: 0
#+COMPRESSION_STATUS: uncompressed

* Session Summary
** Date: 2025-12-04 (Day 1 of doc-review skill)
** Focus Area: Designing a doc-review skill to normalize documentation and fight documentation drift

* Accomplishments
- [X] Created bead structure for doc-review skill design (skills-bcu with 2 blockers)
- [X] Researched agent-friendly documentation conventions via web search
- [X] Analyzed GitHub's study of 2,500+ AGENTS.md repositories
- [X] Researched LLM-as-judge rubric best practices
- [X] Drafted 7 decomposed rubrics for documentation evaluation (v1)
- [X] Filed orch UX issue (orch-z2x) for missing model availability command
- [X] Fixed .envrc to include use_api_keys for orch access
- [X] Got Gemini critique via orch - identified overlaps and 3 missing dimensions
- [X] Revised rubrics to v2 (10 dimensions) based on AI-optimization feedback
- [ ] Update rubrics-v2.md with full definitions
- [ ] Prompt template (next step)
- [ ] Before/after examples (next step)

* Key Decisions
** Decision 1: Non-interactive patch-based workflow
- Context: Deciding how doc-review should integrate with the human workflow
- Options considered:
  1. Direct edit mode - LLM directly modifies docs
  2. Report-only mode - LLM produces report, human edits
  3. Patch-based mode - LLM generates patches, human reviews and applies
- Rationale: Patch-based gives the human control, enables review before apply, works with the existing git workflow
- Impact: Two-phase workflow: generate patches (non-interactive), review patches (interactive Claude session)

** Decision 2: Decomposed rubrics over holistic evaluation
- Context: How should the LLM evaluate documentation quality?
- Options considered:
  1. Single holistic prompt ("rate this doc's quality")
  2. Multi-dimensional single pass ("rate on 10 dimensions")
  3. Decomposed rubrics (one dimension per evaluation)
- Rationale: Research shows LLMs are better at "guided summarization" than complex reasoning. The decomposed approach plays to LLM strengths.
- Impact: 7 separate rubrics, each with a clear question, scoring levels, and detect patterns

** Decision 3: Three-level scoring (PASS/MARGINAL/FAIL)
- Context: What granularity for scoring?
- Options considered:
  1. Binary (pass/fail)
  2. Three-level (pass/marginal/fail)
  3. Five or ten point scale
- Rationale: Binary loses nuance, 5+ point scales introduce ambiguity. Three levels give an actionable distinction.
- Impact: FAIL triggers patch generation, MARGINAL flags for human review, PASS means agent-friendly

** Decision 4: Graph-based doc discovery (deferred)
- Context: How does doc-review find which docs to evaluate?
- Decision: Start from README.md or AGENTS.md and graph out via links (sketched below)
- Rationale: Not all .md files are documentation. Following links from the root finds connected docs.
- Impact: Created separate bead (skills-53k) to design the discovery algorithm
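For concreteness, a first-pass sketch of the link-graph traversal (Python rather than the eventual scan.sh; the README.md root, the relative-Markdown-link regex, and the breadth-first order are all assumptions to be settled in skills-53k):

#+begin_src python
#!/usr/bin/env python3
"""Illustrative only: breadth-first walk of the doc graph via relative .md links."""
import re
from collections import deque
from pathlib import Path

# Assumption: docs reference each other with [text](relative/path.md) style links.
LINK = re.compile(r"\]\(([^)#\s]+\.md)\)")

def discover_docs(root: Path) -> list[Path]:
    """Return the root doc plus every doc reachable from it through relative links."""
    start = root.resolve()
    seen, queue, ordered = {start}, deque([start]), []
    while queue:
        doc = queue.popleft()
        ordered.append(doc)
        for target in LINK.findall(doc.read_text(encoding="utf-8")):
            candidate = (doc.parent / target).resolve()
            if candidate.is_file() and candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
    return ordered

if __name__ == "__main__":
    for doc in discover_docs(Path("README.md")):  # or AGENTS.md
        print(doc)
#+end_src

Note the simplification: anything linked as a .md file counts as documentation, which is exactly the assumption skills-53k needs to examine.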
** Decision 5: "Instruction Clarity" rubric added
- Context: Initial 6 rubrics didn't cover the "boundaries" pattern from AGENTS.md research
- Discussion: User asked what "boundaries" meant. Realized the ✅/⚠️/🚫 pattern generalizes beyond AGENTS.md
- Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
- Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files

* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
| orch CLI failed - no API keys configured | Tried 4 models (gemini, flash, deepseek, gpt) - all failed | Need `orch models` command to show available models |
| API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access |
| False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns |
| Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference |

* Technical Details
** Artifacts Created
- `/tmp/doc-review-drafts/rubrics-v1.md` - Full rubric definitions (181 lines)
- `skills-bcu` - Main design bead with workflow description
- `skills-1ig` - Conventions research bead (in_progress)
- `skills-53k` - Graph discovery design bead (open)
- `orch-z2x` - Filed in ~/proj/orch for model availability feature

** Commands Used
#+begin_src bash
# Web search for conventions
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
WebSearch: "LLM as judge rubric prompt engineering best practices"

# Web fetch for deep dives
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
WebFetch: docs.kapa.ai/improving/writing-best-practices
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
WebFetch: montecarlodata.com/blog-llm-as-judge/

# Beads workflow
bd create --title="Design doc-review skill" --type=feature
bd dep add skills-bcu skills-1ig
bd dep add skills-bcu skills-53k
#+end_src

** The 7 Rubrics
| # | Rubric                    | Key Question                                 |
|---+---------------------------+----------------------------------------------|
| 1 | Self-Containment          | Can section be understood alone?             |
| 2 | Heading Structure         | Logical hierarchy, descriptive names?        |
| 3 | Terminology Consistency   | Same term for same concept?                  |
| 4 | Code Example Completeness | Examples runnable as shown?                  |
| 5 | Explicit Context          | Prerequisites stated, no implicit knowledge? |
| 6 | Technical Specificity     | Exact versions, flags, commands?             |
| 7 | Instruction Clarity       | Required vs optional vs dangerous?           |

Each rubric includes:
- Scoring levels with clear definitions
- Detect patterns (specific things to look for)
- Reasoning requirement (must quote evidence)
- Suggested fix field in output
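One way a rubric could be carried around as data, sketched in Python for illustration; the level wording and detect patterns shown here are placeholders, not the rubrics-v1.md definitions:

#+begin_src python
from dataclasses import dataclass

@dataclass
class Rubric:
    """One evaluation dimension, mirroring the fields listed above."""
    id: str
    key_question: str
    levels: dict[str, str]              # PASS / MARGINAL / FAIL definitions
    detect_patterns: tuple[str, ...]    # specific things for the evaluator to look for
    requires_evidence: bool = True      # reasoning must quote the offending text
    output_fields: tuple[str, ...] = ("score", "reasoning", "evidence", "suggested_fix")

# Placeholder wording - the real definitions live in rubrics-v1.md.
SELF_CONTAINMENT = Rubric(
    id="self_containment",
    key_question="Can this section be understood alone?",
    levels={
        "PASS": "Section stands on its own; every reference is explicit.",
        "MARGINAL": "Mostly self-contained but leans on nearby context.",
        "FAIL": "Meaning depends on content that is neither in nor linked from the section.",
    },
    detect_patterns=("as configured above", "see previous section", "as before"),
)
#+end_src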
* Process and Workflow
** What Worked Well
- Web search → web fetch pipeline for research
- Beads for tracking design decisions
- Iterative rubric development with user feedback
- User pushing back on false distinctions

** What Was Challenging
- orch unavailable due to missing API keys - blocked brainstorming with external model
- Keeping rubrics crisp - tendency to over-specify

* Learning and Insights
** Technical Insights
- LLMs are "better at guided summarization than complex reasoning"
- One few-shot example often outperforms multiple (diminishing returns)
- Decomposed evaluation + deterministic aggregation > holistic evaluation
- llms.txt is emerging as robots.txt equivalent for LLM access
- MCP (Model Context Protocol) for structured doc discovery

** Process Insights
- Research before design - the web search surfaced patterns we wouldn't have invented
- GitHub's 2,500 repo study provided concrete evidence for conventions
- Asking "what would a rubric for X look like" forces clarity on what X means

** Architectural Insights
- Two-phase workflow (generate then review) separates concerns
- Patches as intermediate format enables git integration
- Per-dimension rubrics enable independent iteration

** Key Research Sources
- GitHub blog: 2,500+ AGENTS.md analysis - "one snippet beats three paragraphs"
- kapa.ai: proximity principle, self-contained sections, semantic discoverability
- biel.ai: single purpose per section, complete code examples
- Monte Carlo: 7 best practices for LLM-as-judge, failure modes
- ACM ICER 2025: "Rubric Is All You Need" paper

* Context for Future Work
** Open Questions
- How to handle large repos with many doc files? (chunking strategy)
- Should rubrics be weighted differently?
- How to handle generated docs (should they be excluded)?
- What's the right model for different stages (eval vs patch generation)?

** Next Steps
1. Draft the prompt template that uses these rubrics
2. Create before/after examples for few-shot
3. Design graph-based doc discovery (skills-53k)
4. Prototype scan.sh script
5. Test on real repos

** Related Work
- skills-53k: Graph-based doc discovery design
- skills-bcu: Parent design bead
- orch-z2x: Model availability feature for orch

** External References
- https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
- https://docs.kapa.ai/improving/writing-best-practices
- https://biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
- https://www.montecarlodata.com/blog-llm-as-judge/
- https://arxiv.org/abs/2503.23989 (Rubric Is All You Need)
- https://agents.md (AGENTS.md standard)

* Raw Notes
** Envisioned Workflow (from initial brainstorm)
#+begin_src bash
# Phase 1: Generate patches (non-interactive, burnable credits)
doc-review scan ~/proj/foo --model claude-sonnet --output /tmp/foo-patches/

# Phase 2: Review patches (interactive session)
cd ~/proj/foo
claude  # human reviews patches, applies selectively
#+end_src

** Skill Structure (tentative)
#+begin_example
skills/doc-review/
├── prompt.md   # Core review instructions + style guide
├── scan.sh     # Orchestrates: find docs → invoke claude → emit patches
└── README.md
#+end_example

** Output Format for Evaluations
#+begin_src json
{
  "section": "## Installation",
  "line_range": [15, 42],
  "evaluations": [
    {
      "rubric": "self_containment",
      "score": "FAIL",
      "reasoning": "Line 23 says 'as configured above' but the configuration is in a different file.",
      "evidence": "as configured above",
      "suggested_fix": "Add explicit reference: 'as configured in config.yaml (see Configuration section)'"
    }
  ]
}
#+end_src

** Key Quote from Research
#+begin_quote
"LLMs are better at guided summarization than complex reasoning. If you can break down a problem into multiple smaller steps of guided summarization by creating a rubric, you can play to the strengths of LLMs and then rely on simple deterministic code to parse the output and score the rubric."
— Monte Carlo, LLM-as-Judge best practices
#+end_quote
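A sketch of the "simple deterministic code" half of that quote, collapsing the per-rubric scores from the JSON format above into the PASS/MARGINAL/FAIL actions of Decision 3 (Python for illustration; the file name is a placeholder for wherever scan.sh writes its evaluation output):

#+begin_src python
import json
from pathlib import Path

def section_verdict(evaluations: list[dict]) -> str:
    """Collapse per-rubric scores into one verdict: any FAIL wins, then MARGINAL."""
    scores = {e["score"] for e in evaluations}
    if "FAIL" in scores:
        return "FAIL"        # triggers patch generation
    if "MARGINAL" in scores:
        return "MARGINAL"    # flag for human review
    return "PASS"            # agent-friendly as-is

if __name__ == "__main__":
    # Placeholder path: wherever the evaluation JSON for a section ends up.
    report = json.loads(Path("section-eval.json").read_text(encoding="utf-8"))
    print(f'{report["section"]}: {section_verdict(report["evaluations"])}')
#+end_src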
** Gemini Critique of Rubrics (via orch)
After fixing .envrc to include use_api_keys, ran high-temp Gemini critique.

*** Overlaps Identified
- Self-Containment vs Explicit Context had significant overlap
- Fix: Split into Contextual Independence (referential) + Environmental State (runtime)

*** Missing Dimensions for AI
1. **Verifiability** - agents get stuck when they can't tell if command succeeded
2. **Format Integrity** - broken markdown/JSON causes hallucinations
3. **Token Efficiency** - fluff dilutes semantic weight in vector search

*** Key Insight
#+begin_quote
"Agents fail when given choices without criteria. Change 'Use a large instance' to 'Use instance type t3.2xlarge or greater.'"
#+end_quote

*** Revised Rubrics (v3 final: 10 dimensions, merged)
| Phase    |  # | Rubric                                                              |
|----------+----+---------------------------------------------------------------------|
| Read     |  1 | Format Integrity                                                    |
| Find     |  2 | Semantic Headings                                                   |
| Find     |  3 | Contextual Independence                                             |
| Run      |  4 | Configuration Precision (merged: Env State + Tech Specificity)      |
| Run      |  5 | Code Executability                                                  |
| Run      |  6 | Deterministic Instructions                                          |
| Verify   |  7 | Execution Verification (merged: Verifiable Output + Error Recovery) |
| Optimize |  8 | Terminology Strictness                                              |
| Optimize |  9 | Token Efficiency                                                    |
| Optimize | 10 | Security Boundaries                                                 |

** Prompt Template Critique (Gemini round 3)
*** Problem: 10 rubrics = cognitive overload
#+begin_quote
"Accuracy drops significantly after 3-5 distinct evaluation criteria per prompt"
#+end_quote

*** Anti-patterns identified
1. JSON-only output - model commits to score before thinking (fix: CoT first)
2. Evidence hallucination - model invents quotes (fix: allow null)
3. Positional bias - middle rubrics degrade (fix: split passes)

*** Architecture options
- **2-Pass**: Linter (structural) + Agent Simulator (semantic)
- **Many-Pass**: One rubric per pass (max accuracy, max latency)
- **Iterative/Recursive**: Quick scan → deep-dive on failures
- **Hybrid**: Single-pass quick, multi-pass on demand

*** The "Agent Simulator" persona
#+begin_quote
"Simulate execution step-by-step. Identify where you get stuck, where instructions are ambiguous, or where you might hallucinate due to missing context."
#+end_quote

** Architecture Research (web search + Gemini round 4)
*** Frameworks discovered
| Framework           | Key Insight                                      |
|---------------------+--------------------------------------------------|
| PRE (ACM 2025)      | One rubric per call for accuracy                 |
| G-Eval              | 3-step: define → CoT → execute                   |
| EvalPlanner         | Plan → Execute → Judge                           |
| Spring AI Recursive | Generate → Evaluate → Retry loop                 |
| ARISE               | 22 specialized agents for different tasks        |
| Realm               | Recursive refinement with Bayesian aggregation   |
| HuCoSC              | Break complex problems into independent analyses |

*** Final Architecture: "Triage & Specialist" Cascade
#+begin_example
Phase 1: TRIAGE SCAN (cheap model)
├── Input: doc chunk + 10 rubric definitions
├── Output: list of suspect rubrics
└── Cost: ~0 for clean docs

Phase 2: SPECIALIST AGENTS (PRE-style, parallel)
├── One agent per suspect rubric
└── Each outputs: patch for that dimension

Phase 3: VERIFICATION LOOP (recursive)
├── Re-evaluate patched content
└── Retry with feedback if still FAIL
#+end_example

*** Why this pattern
- Adaptive cost: 80% of clean content pays near-zero
- PRE accuracy: One dimension at a time reduces hallucination
- Recursive safety: Verify patches don't introduce regressions
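A minimal orchestration sketch of the cascade (Python for illustration; `call_model` stands in for however scan.sh ends up invoking claude or orch, and the payload keys `suspect_rubrics`, `score`, and `patched_text` are assumptions, not a settled contract):

#+begin_src python
from typing import Callable

# call_model(prompt_id, text) -> dict is a placeholder for the real model invocation;
# everything about its payload shape is an assumption for this sketch.
ModelCall = Callable[[str, str], dict]

def review_chunk(chunk: str, rubric_ids: list[str],
                 call_model: ModelCall, max_retries: int = 2) -> dict[str, dict]:
    # Phase 1: TRIAGE SCAN - one cheap pass names the rubrics that look suspect.
    triage = call_model("triage", chunk)
    suspects = [r for r in triage.get("suspect_rubrics", []) if r in rubric_ids]

    results: dict[str, dict] = {}
    for rubric_id in suspects:
        # Phase 2: SPECIALIST AGENT - one focused evaluation per suspect rubric (PRE-style).
        text = chunk
        verdict = call_model(rubric_id, text)
        # Phase 3: VERIFICATION LOOP - apply the suggested patch and re-evaluate.
        retries = 0
        while verdict.get("score") == "FAIL" and retries < max_retries:
            text = verdict.get("patched_text", text)
            verdict = call_model(rubric_id, text)
            retries += 1
        results[rubric_id] = verdict
    return results
#+end_src

The sketch runs specialists sequentially for clarity; the parallel fan-out described in Phase 2 is left to the eventual scan.sh.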
* Session Metrics
- Commits made: 0 (design session, no code yet)
- Files touched: 3 (rubrics-v1.md, rubrics-v3.md, prompt-template-v1.md in /tmp; .envrc)
- Beads created: 4 (skills-bcu, skills-1ig, skills-53k, orch-z2x)
- Beads closed: 1 (skills-dpw - duplicate moved to orch repo)
- Web searches: 6 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive)
- Web fetches: 4 (GitHub blog, kapa.ai, biel.ai, Monte Carlo)
- Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis)
- Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged)
- Key outcome: "Triage & Specialist" cascade architecture