Doc-Review Skill Design: Research, Rubrics, Vale Discovery, and Repo Spin-off

Session Summary

Date: 2025-12-04 (Day 1 of doc-review skill)

Focus Area: Designing a doc-review skill to normalize documentation and fight documentation drift

Accomplishments

  • Created bead structure for doc-review skill design (skills-bcu with 2 blockers)
  • Researched agent-friendly documentation conventions via web search
  • Analyzed GitHub's study of 2,500+ AGENTS.md repositories
  • Researched LLM-as-judge rubric best practices
  • Drafted 7 decomposed rubrics for documentation evaluation (v1)
  • Filed orch UX issue (orch-z2x) for missing model availability command
  • Fixed .envrc to include use_api_keys for orch access
  • Got Gemini critique via orch - identified overlaps and 3 missing dimensions
  • Revised rubrics to v2 (12 dimensions), merged to v3 (10 dimensions)
  • Created prompt-template-v1.md with full system prompt and example
  • Got Gemini critique on prompt - identified cognitive overload with 10 rubrics
  • Researched multi-pass/recursive LLM evaluation architectures (PRE, G-Eval, ARISE)
  • Designed "Triage & Specialist" cascade architecture
  • Discovered Vale prose linter - can handle 4-7 rubrics deterministically
  • Designed Vale + LLM hybrid architecture (deterministic first, LLM only for semantic)
  • Ran GPT/Gemini consensus on spin-off decision - both recommended separate repo
  • Created ~/proj/doc-review with beads setup
  • Migrated artifacts: rubrics-v3.md, prompt-template-v1.md to design/ folder
  • Updated doc-review/AGENTS.md with full architecture context
  • Created beads in doc-review: doc-review-i8n (Vale), doc-review-4to (LLM), doc-review-xy3 (CLI)
  • Closed skills repo beads: skills-bcu, skills-1ig, skills-53k
  • Committed and pushed skills repo

Key Decisions

Decision 1: Non-interactive patch-based workflow

  • Context: Deciding how doc-review should integrate with human workflow
  • Options considered:

    1. Direct edit mode - LLM directly modifies docs
    2. Report-only mode - LLM produces report, human edits
    3. Patch-based mode - LLM generates patches, human reviews and applies
  • Rationale: Patch-based gives human control, enables review before apply, works with existing git workflow
  • Impact: Two-phase workflow: generate patches (non-interactive), review patches (interactive Claude session)
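
A minimal sketch of what the review-and-apply side could look like, assuming the patches come out as ordinary unified diffs (paths and filenames below are illustrative):

# Inspect what a generated patch would touch
git apply --stat /tmp/foo-patches/readme-self-containment.patch

# Dry run: confirm it still applies cleanly against the working tree
git apply --check /tmp/foo-patches/readme-self-containment.patch

# Apply selectively, then review the result as a normal git diff
git apply /tmp/foo-patches/readme-self-containment.patch
git diff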

Decision 2: Decomposed rubrics over holistic evaluation

  • Context: How should the LLM evaluate documentation quality?
  • Options considered:

    1. Single holistic prompt ("rate this doc's quality")
    2. Multi-dimensional single pass ("rate on 10 dimensions")
    3. Decomposed rubrics (one dimension per evaluation)
  • Rationale: Research shows LLMs are better at "guided summarization" than complex reasoning. The decomposed approach plays to LLM strengths.
  • Impact: 7 separate rubrics, each with a clear question, scoring levels, and detect patterns

Decision 3: Three-level scoring (PASS/MARGINAL/FAIL)

  • Context: What granularity for scoring?
  • Options considered:

    1. Binary (pass/fail)
    2. Three-level (pass/marginal/fail)
    3. Five or ten point scale
  • Rationale: Binary loses nuance; 5+ point scales introduce ambiguity. Three levels give an actionable distinction.
  • Impact: FAIL triggers patch generation, MARGINAL flags for human review, PASS means agent-friendly
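
A minimal sketch of how the three levels could drive downstream behaviour (the score/rubric variables and the generate_patch helper are hypothetical):

# Hypothetical dispatch on a per-rubric score
case "$score" in
  FAIL)     generate_patch "$rubric" "$section" ;;                            # hand off to patch generation
  MARGINAL) echo "needs human review: $rubric ($section)" >> review-queue.txt ;;
  PASS)     : ;;                                                              # agent-friendly, nothing to do
esac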

Decision 4: Graph-based doc discovery (deferred)

  • Context: How does doc-review find which docs to evaluate?
  • Decision: Start from README.md or AGENTS.md and graph out via links
  • Rationale: Not all .md files are documentation. Following links from root finds connected docs.
  • Impact: Created separate bead (skills-53k) to design discovery algorithm
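
The discovery algorithm itself lives in its own bead, but as a flavour of the idea, a single-level pass over markdown links from the root could look like this (sketch only; a real version would recurse and resolve relative paths):

# Extract .md targets linked from the root doc
grep -oE '\]\([^)]+\.md\)' README.md | sed -E 's/\]\((.*)\)/\1/' | sort -u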

Decision 5: "Instruction Clarity" rubric added

  • Context: Initial 6 rubrics didn't cover the "boundaries" pattern from AGENTS.md research
  • Discussion: User asked what "boundaries" meant. Realized the /⚠️/🚫 pattern generalizes beyond AGENTS.md
  • Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
  • Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files

Decision 6: Vale + LLM hybrid architecture

  • Context: Gemini critique revealed 10 rubrics in one prompt causes cognitive overload
  • Discovery: Vale (prose linter) can handle 4-7 rubrics deterministically with YAML rules
  • Options considered:

    1. Pure LLM approach (original plan)
    2. Pure Vale approach (limited to pattern matching)
    3. Vale + LLM hybrid (deterministic first, LLM for semantic)
  • Rationale: Vale catches ~40% of issues instantly, for free, in CI. LLM only needed for semantic analysis.
  • Impact: Three-stage pipeline:

    • Stage 1: Vale (fast/free) - Format Integrity, Semantic Headings, Deterministic Instructions, Terminology, Token Efficiency
    • Stage 2: LLM Triage (cheap model) - Quick semantic scan
    • Stage 3: LLM Specialists (capable model) - Patch generation per failed rubric

Decision 7: Spin off to separate repository

  • Context: Project grew from "Claude skill" to "standalone tool with Vale + LLM"
  • Ran consensus: GPT and Gemini both recommended separate repo
  • GPT reasoning: "Start with MVP in separate repo, iterate fast"
  • Gemini reasoning: "Invest in orch integration, design for reuse"
  • Decision: Created ~/proj/doc-review as standalone project
  • Impact: Clean separation, own beads, can be used outside claude-code context

Problems & Solutions

Problem | Solution | Learning
orch CLI failed - no API keys configured | Tried 4 models (gemini, flash, deepseek, gpt) - all failed | Need `orch models` command to show available models
API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access
False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns
Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference
10 rubrics in one LLM prompt = cognitive overload | Split: Vale for deterministic, LLM for semantic only | Play to tool strengths - deterministic tools for pattern matching
OPENROUTER_KEY vs OPENROUTER_API_KEY mismatch | Filed bug orch-6o3 in ~/proj/orch | Environment variable naming must be consistent
bd sync failed with worktree error | Used regular git add + commit instead | bd sync has edge cases with certain git states
doc-review repo missing git identity | Left for user to configure user.email/name | New repos need git config before first commit
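
For reference, the .envrc fix from the second row amounts to two direnv lines (the rest of the file is assumed unchanged):

# .envrc
use flake        # was already present
use_api_keys     # added so orch can see the provider API keys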

Technical Details

Artifacts Created

  • `/tmp/doc-review-drafts/rubrics-v1.md` - Initial 7 rubric definitions
  • `/tmp/doc-review-drafts/rubrics-v3.md` - Final 10 rubrics (320 lines)
  • `/tmp/doc-review-drafts/prompt-template-v1.md` - LLM system prompt with example
  • `~/proj/doc-review/docs/design/rubrics-v3.md` - Migrated final rubrics
  • `~/proj/doc-review/docs/design/prompt-template-v1.md` - Migrated prompt template
  • `~/proj/doc-review/AGENTS.md` - Full project context for next agent

Beads Created/Updated

In skills repo (all closed):

  • `skills-bcu` - Main design bead (closed: spun off to doc-review)
  • `skills-1ig` - Conventions research (closed: completed)
  • `skills-53k` - Graph discovery design (closed: moved to doc-review)

In doc-review repo:

  • `doc-review-i8n` - Implement Vale style for rubrics
  • `doc-review-4to` - Implement LLM prompts for semantic rubrics
  • `doc-review-xy3` - Design CLI interface

In orch repo:

  • `orch-z2x` - Model availability command feature
  • `orch-6o3` - OPENROUTER_KEY vs OPENROUTER_API_KEY bug

Commands Used

# Web search for conventions and research
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
WebSearch: "LLM as judge rubric prompt engineering best practices"
WebSearch: "multi-pass LLM evaluation recursive evaluation framework"
WebSearch: "prose linter markdown documentation CI"

# Web fetch for deep dives
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
WebFetch: docs.kapa.ai/improving/writing-best-practices
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
WebFetch: montecarlodata.com/blog-llm-as-judge/
WebFetch: vale.sh

# orch consultations
orch ask gemini --temp 1.2 "Critique these rubrics for AI-optimized documentation..."
orch consensus gpt gemini "Should doc-review be a skill or separate repo?"

# Migration commands
mkdir -p ~/proj/doc-review/docs/design
cp /tmp/doc-review-drafts/*.md ~/proj/doc-review/docs/design/

# Beads workflow
bd close skills-bcu skills-1ig skills-53k

The 10 Rubrics (Final v3)

Organized by "Agent's Hierarchy of Needs":

Phase | # | Rubric | Tool | Key Question
Read | 1 | Format Integrity | Vale | Valid markdown? Code blocks tagged?
Find | 2 | Semantic Headings | Vale | Headings contain task+object keywords?
Find | 3 | Contextual Independence | LLM | No "as mentioned above"?
Run | 4 | Configuration Precision | Partial | Exact versions, flags, paths?
Run | 5 | Code Executability | LLM | All imports present?
Run | 6 | Deterministic Instructions | Vale | No hedging ("might", "consider")?
Verify | 7 | Execution Verification | LLM | Expected output + error recovery?
Optimize | 8 | Terminology Strictness | Vale | 1:1 term-concept mapping?
Optimize | 9 | Token Efficiency | Vale | No filler phrases?
Optimize | 10 | Security Boundaries | Partial | No hardcoded secrets?

Tool assignment:

  • Vale (5): 1, 2, 6, 8, 9 - Pattern matching, deterministic
  • LLM (3): 3, 5, 7 - Semantic understanding required
  • Partial (2): 4, 10 - Hybrid (Vale for patterns, LLM for edge cases)

Each rubric includes:

  • Scoring levels with clear definitions (PASS/MARGINAL/FAIL)
  • Detect patterns (specific things to look for)
  • Reasoning requirement (must quote evidence)
  • Suggested fix field in output

Process and Workflow

What Worked Well

  • Web search → web fetch pipeline for research
  • Beads for tracking design decisions
  • Iterative rubric development with user feedback
  • User pushing back on false distinctions

What Was Challenging

  • orch unavailable due to missing API keys - blocked brainstorming with external model
  • Keeping rubrics crisp - tendency to over-specify

Learning and Insights

Technical Insights

  • LLMs are "better at guided summarization than complex reasoning"
  • One few-shot example often outperforms multiple (diminishing returns)
  • Decomposed evaluation + deterministic aggregation > holistic evaluation
  • llms.txt is emerging as robots.txt equivalent for LLM access
  • MCP (Model Context Protocol) for structured doc discovery

Process Insights

  • Research before design - the web search surfaced patterns we wouldn't have invented
  • GitHub's 2,500 repo study provided concrete evidence for conventions
  • Asking "what would a rubric for X look like" forces clarity on what X means

Architectural Insights

  • Two-phase workflow (generate then review) separates concerns
  • Patches as intermediate format enables git integration
  • Per-dimension rubrics enable independent iteration

Key Research Sources

  • GitHub blog: 2,500+ AGENTS.md analysis - "one snippet beats three paragraphs"
  • kapa.ai: proximity principle, self-contained sections, semantic discoverability
  • biel.ai: single purpose per section, complete code examples
  • Monte Carlo: 7 best practices for LLM-as-judge, failure modes
  • ACM ICER 2025: "Rubric Is All You Need" paper

Context for Future Work

Open Questions

  • How to handle large repos with many doc files? (chunking strategy)
  • Should rubrics be weighted differently?
  • How to handle generated docs (should they be excluded)?
  • What's the right model for different stages (triage vs patch generation)?
  • Vale vs markdownlint vs custom - which linter is best for this use case?

Next Steps (in ~/proj/doc-review)

  1. Implement Vale style for 5 deterministic rubrics (doc-review-i8n)
  2. Implement LLM prompts for 3 semantic rubrics (doc-review-4to)
  3. Design CLI interface (doc-review-xy3)
  4. Graph-based doc discovery (from skills-53k, needs new bead)
  5. Test on real repos (skills repo is good candidate)

Related Work

  • ~/proj/doc-review - New home for this project
  • doc-review-i8n, doc-review-4to, doc-review-xy3 - Active beads
  • orch-z2x: Model availability feature for orch
  • orch-6o3: OPENROUTER env var naming bug

Raw Notes

Envisioned Workflow (from initial brainstorm)

# Phase 1: Generate patches (non-interactive, burnable credits)
doc-review scan ~/proj/foo --model claude-sonnet --output /tmp/foo-patches/

# Phase 2: Review patches (interactive session)
cd ~/proj/foo
claude  # human reviews patches, applies selectively

Skill Structure (tentative)

skills/doc-review/
├── prompt.md           # Core review instructions + style guide
├── scan.sh             # Orchestrates: find docs → invoke claude → emit patches
└── README.md

Output Format for Evaluations

{
  "section": "## Installation",
  "line_range": [15, 42],
  "evaluations": [
    {
      "rubric": "self_containment",
      "score": "FAIL",
      "reasoning": "Line 23 says 'as configured above' but the configuration is in a different file.",
      "evidence": "as configured above",
      "suggested_fix": "Add explicit reference: 'as configured in config.yaml (see Configuration section)'"
    }
  ]
}
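
With evaluations in this shape, simple deterministic code can aggregate scores without further LLM calls; for example, assuming the output above is saved as eval.json, a jq one-liner pulls out the failing rubrics and their fixes:

# List failed rubrics with their suggested fixes
jq -r '.evaluations[] | select(.score == "FAIL") | "\(.rubric): \(.suggested_fix)"' eval.json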

Key Quote from Research

"LLMs are better at guided summarization than complex reasoning. If you can break down a problem into multiple smaller steps of guided summarization by creating a rubric, you can play to the strengths of LLMs and then rely on simple deterministic code to parse the output and score the rubric." — Monte Carlo, LLM-as-Judge best practices

Gemini Critique of Rubrics (via orch)

After fixing .envrc to include use_api_keys, ran high-temp Gemini critique.

Overlaps Identified

  • Self-Containment vs Explicit Context had significant overlap
  • Fix: Split into Contextual Independence (referential) + Environmental State (runtime)

Missing Dimensions for AI

  1. Verifiability - agents get stuck when they can't tell if command succeeded
  2. Format Integrity - broken markdown/JSON causes hallucinations
  3. Token Efficiency - fluff dilutes semantic weight in vector search

Key Insight

"Agents fail when given choices without criteria. Change 'Use a large instance' to 'Use instance type t3.2xlarge or greater.'"

Revised Rubrics (v3 final: 10 dimensions, merged)

Phase | # | Rubric
Read | 1 | Format Integrity
Find | 2 | Semantic Headings
Find | 3 | Contextual Independence
Run | 4 | Configuration Precision (merged: Env State + Tech Specificity)
Run | 5 | Code Executability
Run | 6 | Deterministic Instructions
Verify | 7 | Execution Verification (merged: Verifiable Output + Error Recovery)
Optimize | 8 | Terminology Strictness
Optimize | 9 | Token Efficiency
Optimize | 10 | Security Boundaries

Prompt Template Critique (Gemini round 3)

Problem: 10 rubrics = cognitive overload

"Accuracy drops significantly after 3-5 distinct evaluation criteria per prompt"

Anti-patterns identified

  1. JSON-only output - model commits to score before thinking (fix: CoT first)
  2. Evidence hallucination - model invents quotes (fix: allow null)
  3. Positional bias - middle rubrics degrade (fix: split passes)

Architecture options

  • 2-Pass: Linter (structural) + Agent Simulator (semantic)
  • Many-Pass: One rubric per pass (max accuracy, max latency)
  • Iterative/Recursive: Quick scan → deep-dive on failures
  • Hybrid: Single-pass quick, multi-pass on demand

The "Agent Simulator" persona

"Simulate execution step-by-step. Identify where you get stuck, where instructions are ambiguous, or where you might hallucinate due to missing context."

Architecture Research (web search + Gemini round 4)

Frameworks discovered

Framework | Key Insight
PRE (ACM 2025) | One rubric per call for accuracy
G-Eval | 3-step: define → CoT → execute
EvalPlanner | Plan → Execute → Judge
Spring AI | Recursive Generate → Evaluate → Retry loop
ARISE | 22 specialized agents for different tasks
Realm | Recursive refinement with Bayesian aggregation
HuCoSC | Break complex problems into independent analyses

Final Architecture: "Triage & Specialist" Cascade

Phase 1: TRIAGE SCAN (cheap model)
├── Input: doc chunk + 10 rubric definitions
├── Output: list of suspect rubrics
└── Cost: ~0 for clean docs

Phase 2: SPECIALIST AGENTS (PRE-style, parallel)
├── One agent per suspect rubric
└── Each outputs: patch for that dimension

Phase 3: VERIFICATION LOOP (recursive)
├── Re-evaluate patched content
└── Retry with feedback if still FAIL

Why this pattern

  • Adaptive cost: the ~80% of content that is clean pays near-zero
  • PRE accuracy: One dimension at a time reduces hallucination
  • Recursive safety: Verify patches don't introduce regressions
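
A rough shell-level sketch of the specialist phase (doc-review-specialist and suspect-rubrics.txt are hypothetical names; the point is the PRE-style one-rubric-per-call shape, fanned out in parallel):

# Phase 2: one specialist invocation per suspect rubric
while read -r rubric; do
  doc-review-specialist --rubric "$rubric" --input README.md \
    --output "/tmp/patches/${rubric}.patch" &
done < suspect-rubrics.txt
wait  # collect all patches before the verification loop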

Vale Discovery (Session Turning Point)

Web search for "prose linter markdown documentation CI" revealed Vale:

  • Open-source, extensible prose linter
  • YAML-based rules (existence, substitution, consistency checks)
  • CI-friendly (runs in seconds, no LLM cost)
  • Already used by: Microsoft, Google, GitLab, DigitalOcean

Realization: 5 of our rubrics are pattern-based → Vale can handle them:

  • Format Integrity (missing language tags)
  • Semantic Headings (banned words: "Overview", "Introduction")
  • Deterministic Instructions (hedging words: "might", "consider")
  • Terminology Strictness (consistency checks)
  • Token Efficiency (filler phrases)

This shifted the architecture from "Claude skill with all-LLM" to "standalone tool with Vale + LLM hybrid".
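
A sketch of how the Vale stage could be configured (StylesPath, MinAlertLevel, and BasedOnStyles are standard .vale.ini keys; the DocReview style name is hypothetical):

# Point Vale at a project-local custom style for markdown files
cat > .vale.ini <<'EOF'
StylesPath = styles
MinAlertLevel = warning

[*.md]
BasedOnStyles = DocReview
EOF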

Consensus Results

"Should doc-review be a skill or separate repo?"

GPT: "Start with MVP in separate repo. Skill integration can come later."

Gemini: "Invest in orch as orchestration layer. Design doc-review for reuse."

Both agreed: Separate repo is the right call.

Final Architecture (Vale + LLM Hybrid)

Stage 1: Vale (deterministic, fast, free)
├── Catches ~40% of issues instantly
├── Runs in CI on every commit
└── No LLM cost for clean docs

Stage 2: LLM Triage (cheap model)
├── Only runs if Vale passes
└── Evaluates 3 semantic rubrics (contextual independence, code executability, execution verification)

Stage 3: LLM Specialists (capable model)
├── One agent per failed rubric
└── Generates patches
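
A sketch of the gating between stages, assuming a hypothetical doc-review-triage command (vale is the only real tool here):

# Stage 1: deterministic lint; stop early so clean docs cost nothing
if ! vale --config=.vale.ini --output=JSON docs/ > vale-report.json; then
  echo "Vale found deterministic issues; fix these before spending LLM tokens"
  exit 1
fi

# Stage 2: cheap-model triage over the three semantic rubrics
doc-review-triage docs/ > suspect-rubrics.txt

# Stage 3: capable-model specialists, one per suspect rubric (same loop as the cascade sketch above)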

Session Metrics

  • Commits made: 1 (139a521 - doc-review design session complete)
  • Files touched: 19 (per git diff stat)
  • Lines added: +4187
  • Lines removed: -65
  • Beads created: 8 (skills-bcu, skills-1ig, skills-53k in skills; doc-review-i8n, doc-review-4to, doc-review-xy3 in doc-review; orch-z2x, orch-6o3 in orch)
  • Beads closed: 3 (skills-bcu, skills-1ig, skills-53k)
  • Web searches: 8 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive, prose linter, Vale)
  • Web fetches: 5 (GitHub blog, kapa.ai, biel.ai, Monte Carlo, vale.sh)
  • Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis)
  • Consensus runs: 2 (spin-off decision, implementation approach)
  • Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged)
  • Key outcome: Vale + LLM hybrid architecture, spun off to ~/proj/doc-review