Doc-Review Skill Design: Research, Rubrics, Vale Discovery, and Repo Spin-off
- Session Summary
- Accomplishments
- Key Decisions
- Decision 1: Non-interactive patch-based workflow
- Decision 2: Decomposed rubrics over holistic evaluation
- Decision 3: Three-level scoring (PASS/MARGINAL/FAIL)
- Decision 4: Graph-based doc discovery (deferred)
- Decision 5: "Instruction Clarity" rubric added
- Decision 6: Vale + LLM hybrid architecture
- Decision 7: Spin off to separate repository
- Problems & Solutions
- Technical Details
- Process and Workflow
- Learning and Insights
- Context for Future Work
- Raw Notes
- Envisioned Workflow (from initial brainstorm)
- Skill Structure (tentative)
- Output Format for Evaluations
- Key Quote from Research
- Gemini Critique of Rubrics (via orch)
- Prompt Template Critique (Gemini round 3)
- Architecture Research (web search + Gemini round 4)
- Vale Discovery (Session Turning Point)
- Consensus Results
- Final Architecture (Vale + LLM Hybrid)
- Session Metrics
Session Summary
Date: 2025-12-04 (Day 1 of doc-review skill)
Focus Area: Designing a doc-review skill to normalize documentation and fight documentation drift
Accomplishments
- Created bead structure for doc-review skill design (skills-bcu with 2 blockers)
- Researched agent-friendly documentation conventions via web search
- Analyzed GitHub's study of 2,500+ AGENTS.md repositories
- Researched LLM-as-judge rubric best practices
- Drafted 7 decomposed rubrics for documentation evaluation (v1)
- Filed orch UX issue (orch-z2x) for missing model availability command
- Fixed .envrc to include use_api_keys for orch access
- Got Gemini critique via orch - identified overlaps and 3 missing dimensions
- Revised rubrics to v2 (12 dimensions), merged to v3 (10 dimensions)
- Created prompt-template-v1.md with full system prompt and example
- Got Gemini critique on prompt - identified cognitive overload with 10 rubrics
- Researched multi-pass/recursive LLM evaluation architectures (PRE, G-Eval, ARISE)
- Designed "Triage & Specialist" cascade architecture
- Discovered Vale prose linter - can handle 4-7 rubrics deterministically
- Designed Vale + LLM hybrid architecture (deterministic first, LLM only for semantic)
- Ran GPT/Gemini consensus on spin-off decision - both recommended separate repo
- Created ~/proj/doc-review with beads setup
- Migrated artifacts: rubrics-v3.md, prompt-template-v1.md to design/ folder
- Updated doc-review/AGENTS.md with full architecture context
- Created beads in doc-review: doc-review-i8n (Vale), doc-review-4to (LLM), doc-review-xy3 (CLI)
- Closed skills repo beads: skills-bcu, skills-1ig, skills-53k
- Committed and pushed skills repo
Key Decisions
Decision 1: Non-interactive patch-based workflow
- Context: Deciding how doc-review should integrate with human workflow
- Options considered:
- Direct edit mode - LLM directly modifies docs
- Report-only mode - LLM produces report, human edits
- Patch-based mode - LLM generates patches, human reviews and applies
- Rationale: Patch-based gives human control, enables review before apply, works with existing git workflow
- Impact: Two-phase workflow: generate patches (non-interactive), review patches (interactive Claude session)
Decision 2: Decomposed rubrics over holistic evaluation
- Context: How should the LLM evaluate documentation quality?
- Options considered:
- Single holistic prompt ("rate this doc's quality")
- Multi-dimensional single pass ("rate on 10 dimensions")
- Decomposed rubrics (one dimension per evaluation)
- Rationale: Research shows LLMs better at "guided summarization" than complex reasoning. Decomposed approach plays to LLM strengths.
- Impact: 7 separate rubrics, each with clear question, scoring levels, and detect patterns
Decision 3: Three-level scoring (PASS/MARGINAL/FAIL)
- Context: What granularity for scoring?
- Options considered:
- Binary (pass/fail)
- Three-level (pass/marginal/fail)
- Five or ten point scale
- Rationale: Binary loses nuance, 5+ point scales introduce ambiguity. Three levels give actionable distinction.
- Impact: FAIL triggers patch generation, MARGINAL flags for human review, PASS means agent-friendly
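Because the three levels drive different downstream behavior, the mapping can live in plain data rather than in prompts. A minimal sketch (names are hypothetical, not the final CLI's):

```python
# Hypothetical score-to-action mapping for the review pipeline.
ACTIONS = {
    "PASS": "none",            # section is agent-friendly; nothing to do
    "MARGINAL": "flag",        # surface for human review, no patch generated
    "FAIL": "generate_patch",  # hand the section to the patch-generation stage
}

def action_for(score: str) -> str:
    """Map a rubric score to a pipeline action, failing loudly on anything unexpected."""
    if score not in ACTIONS:
        raise ValueError(f"unknown score {score!r}; expected PASS, MARGINAL, or FAIL")
    return ACTIONS[score]
```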
Decision 4: Graph-based doc discovery (deferred)
- Context: How does doc-review find which docs to evaluate?
- Decision: Start from README.md or AGENTS.md and graph out via links
- Rationale: Not all .md files are documentation. Following links from root finds connected docs.
- Impact: Created separate bead (skills-53k) to design discovery algorithm
Decision 5: "Instruction Clarity" rubric added
- Context: Initial 6 rubrics didn't cover the "boundaries" pattern from AGENTS.md research
- Discussion: User asked what "boundaries" meant. Realized the ✅/⚠️/🚫 pattern generalizes beyond AGENTS.md
- Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
- Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files
Decision 6: Vale + LLM hybrid architecture
- Context: Gemini critique revealed 10 rubrics in one prompt causes cognitive overload
- Discovery: Vale (prose linter) can handle 4-7 rubrics deterministically with YAML rules
- Options considered:
- Pure LLM approach (original plan)
- Pure Vale approach (limited to pattern matching)
- Vale + LLM hybrid (deterministic first, LLM for semantic)
- Rationale: Vale catches ~40% of issues instantly, for free, in CI. LLM only needed for semantic analysis.
- Impact: Three-stage pipeline (sketched below):
- Stage 1: Vale (fast/free) - Format Integrity, Semantic Headings, Deterministic Instructions, Terminology, Token Efficiency
- Stage 2: LLM Triage (cheap model) - Quick semantic scan
- Stage 3: LLM Specialists (capable model) - Patch generation per failed rubric
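A rough orchestration sketch of the three stages. The function names, rubric identifiers, and the `vale --output=JSON` invocation are assumptions for illustration; the real interfaces are what doc-review-i8n/4to/xy3 will define.

```python
import json
import subprocess

# Rubrics that stay with the LLM because they need semantic understanding (3, 5, 7).
SEMANTIC_RUBRICS = ["contextual_independence", "code_executability", "execution_verification"]

def vale_issues(path: str) -> list[dict]:
    """Stage 1: run Vale with JSON output (assumes a doc-review Vale style is configured)."""
    out = subprocess.run(["vale", "--output=JSON", path], capture_output=True, text=True)
    results = json.loads(out.stdout or "{}")   # Vale keys its alerts by filename
    return [issue for issues in results.values() for issue in issues]

def triage(path: str, rubrics: list[str]) -> list[str]:
    """Stage 2 placeholder: a cheap model scans the doc and returns suspect rubrics."""
    return rubrics  # conservative default until the triage prompt exists

def specialist_patch(path: str, rubric: str) -> dict:
    """Stage 3 placeholder: a capable model emits a patch for exactly one rubric."""
    return {"rubric": rubric, "patch": None}

def review(path: str) -> dict:
    report = {"vale": vale_issues(path)}
    report["patches"] = [specialist_patch(path, r) for r in triage(path, SEMANTIC_RUBRICS)]
    return report
```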
Decision 7: Spin off to separate repository
- Context: Project grew from "Claude skill" to "standalone tool with Vale + LLM"
- Ran consensus: GPT and Gemini both recommended separate repo
- GPT reasoning: "Start with MVP in separate repo, iterate fast"
- Gemini reasoning: "Invest in orch integration, design for reuse"
- Decision: Created ~/proj/doc-review as standalone project
- Impact: Clean separation, own beads, can be used outside claude-code context
Problems & Solutions
| Problem | Solution | Learning |
|---|---|---|
| orch CLI failed for all 4 models tried (gemini, flash, deepseek, gpt) | Traced to missing API keys; filed orch-z2x for a model availability command | Need `orch models` command to show available models |
| API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access |
| False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns |
| Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference |
| 10 rubrics in one LLM prompt = cognitive overload | Split: Vale for deterministic, LLM for semantic only | Play to tool strengths - deterministic tools for pattern matching |
| OPENROUTER_KEY vs OPENROUTER_API_KEY mismatch | Filed bug orch-6o3 in ~/proj/orch | Environment variable naming must be consistent |
| bd sync failed with worktree error | Used regular git add + commit instead | bd sync has edge cases with certain git states |
| doc-review repo missing git identity | Left for user to configure user.email/name | New repos need git config before first commit |
Technical Details
Artifacts Created
- `/tmp/doc-review-drafts/rubrics-v1.md` - Initial 7 rubric definitions
- `/tmp/doc-review-drafts/rubrics-v3.md` - Final 10 rubrics (320 lines)
- `/tmp/doc-review-drafts/prompt-template-v1.md` - LLM system prompt with example
- `~/proj/doc-review/docs/design/rubrics-v3.md` - Migrated final rubrics
- `~/proj/doc-review/docs/design/prompt-template-v1.md` - Migrated prompt template
- `~/proj/doc-review/AGENTS.md` - Full project context for next agent
Beads Created/Updated
In skills repo (all closed):
- `skills-bcu` - Main design bead (closed: spun off to doc-review)
- `skills-1ig` - Conventions research (closed: completed)
- `skills-53k` - Graph discovery design (closed: moved to doc-review)
In doc-review repo:
- `doc-review-i8n` - Implement Vale style for rubrics
- `doc-review-4to` - Implement LLM prompts for semantic rubrics
- `doc-review-xy3` - Design CLI interface
In orch repo:
- `orch-z2x` - Model availability command feature
- `orch-6o3` - OPENROUTER_KEY vs OPENROUTER_API_KEY bug
Commands Used
# Web search for conventions and research
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
WebSearch: "LLM as judge rubric prompt engineering best practices"
WebSearch: "multi-pass LLM evaluation recursive evaluation framework"
WebSearch: "prose linter markdown documentation CI"
# Web fetch for deep dives
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
WebFetch: docs.kapa.ai/improving/writing-best-practices
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
WebFetch: montecarlodata.com/blog-llm-as-judge/
WebFetch: vale.sh
# orch consultations
orch ask gemini --temp 1.2 "Critique these rubrics for AI-optimized documentation..."
orch consensus gpt gemini "Should doc-review be a skill or separate repo?"
# Migration commands
mkdir -p ~/proj/doc-review/docs/design
cp /tmp/doc-review-drafts/*.md ~/proj/doc-review/docs/design/
# Beads workflow
bd close skills-bcu skills-1ig skills-53k
The 10 Rubrics (Final v3)
Organized by "Agent's Hierarchy of Needs":
| Phase | # | Rubric | Tool | Key Question |
|---|---|---|---|---|
| Read | 1 | Format Integrity | Vale | Valid markdown? Code blocks tagged? |
| Find | 2 | Semantic Headings | Vale | Headings contain task+object keywords? |
| Find | 3 | Contextual Independence | LLM | No "as mentioned above"? |
| Run | 4 | Configuration Precision | Partial | Exact versions, flags, paths? |
| Run | 5 | Code Executability | LLM | All imports present? |
| Run | 6 | Deterministic Instructions | Vale | No hedging ("might", "consider")? |
| Verify | 7 | Execution Verification | LLM | Expected output + error recovery? |
| Optimize | 8 | Terminology Strictness | Vale | 1:1 term-concept mapping? |
| Optimize | 9 | Token Efficiency | Vale | No filler phrases? |
| Optimize | 10 | Security Boundaries | Partial | No hardcoded secrets? |
Tool assignment:
- Vale (5): 1, 2, 6, 8, 9 - Pattern matching, deterministic
- LLM (3): 3, 5, 7 - Semantic understanding required
- Partial (2): 4, 10 - Hybrid (Vale for patterns, LLM for edge cases)
Each rubric includes:
- Scoring levels with clear definitions (PASS/MARGINAL/FAIL)
- Detect patterns (specific things to look for)
- Reasoning requirement (must quote evidence)
- Suggested fix field in output
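For the Vale-assigned rubrics the detect patterns are essentially token lists. The real implementation is a Vale YAML style (doc-review-i8n); as an illustration of what rubric 6 reduces to, here is the same check in plain Python (tokens beyond "might"/"consider" are illustrative additions):

```python
import re

# Hedging words that make an instruction non-deterministic for an agent.
# "might" and "consider" come from the rubric; the rest are illustrative additions.
HEDGING = re.compile(r"\b(might|consider|perhaps|maybe|possibly)\b", re.IGNORECASE)

def deterministic_instructions(line: str) -> tuple[str, str | None]:
    """Score a single line against rubric 6; returns (score, quoted evidence)."""
    match = HEDGING.search(line)
    return ("FAIL", match.group(0)) if match else ("PASS", None)

# deterministic_instructions("You might want to consider a larger instance")
# -> ("FAIL", "might")
```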
Process and Workflow
What Worked Well
- Web search → web fetch pipeline for research
- Beads for tracking design decisions
- Iterative rubric development with user feedback
- User pushing back on false distinctions
What Was Challenging
- orch unavailable due to missing API keys - blocked brainstorming with external model
- Keeping rubrics crisp - tendency to over-specify
Learning and Insights
Technical Insights
- LLMs are "better at guided summarization than complex reasoning"
- One few-shot example often outperforms multiple (diminishing returns)
- Decomposed evaluation + deterministic aggregation > holistic evaluation
- llms.txt is emerging as robots.txt equivalent for LLM access
- MCP (Model Context Protocol) for structured doc discovery
Process Insights
- Research before design - the web search surfaced patterns we wouldn't have invented
- GitHub's 2,500 repo study provided concrete evidence for conventions
- Asking "what would a rubric for X look like" forces clarity on what X means
Architectural Insights
- Two-phase workflow (generate then review) separates concerns
- Patches as intermediate format enables git integration
- Per-dimension rubrics enable independent iteration
Key Research Sources
- GitHub blog: 2,500+ AGENTS.md analysis - "one snippet beats three paragraphs"
- kapa.ai: proximity principle, self-contained sections, semantic discoverability
- biel.ai: single purpose per section, complete code examples
- Monte Carlo: 7 best practices for LLM-as-judge, failure modes
- ACM ICER 2025: "Rubric Is All You Need" paper
Context for Future Work
Open Questions
- How to handle large repos with many doc files? (chunking strategy)
- Should rubrics be weighted differently?
- How to handle generated docs (should they be excluded)?
- What's the right model for different stages (triage vs patch generation)?
- Vale vs markdownlint vs custom - which linter is best for this use case?
Next Steps (in ~/proj/doc-review)
- Implement Vale style for 5 deterministic rubrics (doc-review-i8n)
- Implement LLM prompts for 3 semantic rubrics (doc-review-4to)
- Design CLI interface (doc-review-xy3)
- Graph-based doc discovery (from skills-53k, needs new bead)
- Test on real repos (skills repo is good candidate)
Related Work
- ~/proj/doc-review - New home for this project
- doc-review-i8n, doc-review-4to, doc-review-xy3 - Active beads
- orch-z2x: Model availability feature for orch
- orch-6o3: OPENROUTER env var naming bug
External References
- https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
- https://docs.kapa.ai/improving/writing-best-practices
- https://biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
- https://www.montecarlodata.com/blog-llm-as-judge/
- https://arxiv.org/abs/2503.23989 (Rubric Is All You Need)
- https://agents.md (AGENTS.md standard)
- https://vale.sh (Vale prose linter)
Raw Notes
Envisioned Workflow (from initial brainstorm)
# Phase 1: Generate patches (non-interactive, burnable credits)
doc-review scan ~/proj/foo --model claude-sonnet --output /tmp/foo-patches/
# Phase 2: Review patches (interactive session)
cd ~/proj/foo
claude # human reviews patches, applies selectively
Skill Structure (tentative)
skills/doc-review/
├── prompt.md   # Core review instructions + style guide
├── scan.sh     # Orchestrates: find docs → invoke claude → emit patches
└── README.md
Output Format for Evaluations
{
"section": "## Installation",
"line_range": [15, 42],
"evaluations": [
{
"rubric": "self_containment",
"score": "FAIL",
"reasoning": "Line 23 says 'as configured above' but the configuration is in a different file.",
"evidence": "as configured above",
"suggested_fix": "Add explicit reference: 'as configured in config.yaml (see Configuration section)'"
}
]
}
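This structure is what lets simple deterministic code (see the quote below) decide what happens next, with no further LLM call. A minimal aggregation sketch over the format above:

```python
import json

def bucket_evaluations(report_json: str) -> dict[str, list[dict]]:
    """Group rubric evaluations by score: FAIL feeds patch generation,
    MARGINAL goes to human review, PASS needs no action."""
    report = json.loads(report_json)
    buckets: dict[str, list[dict]] = {"FAIL": [], "MARGINAL": [], "PASS": []}
    for ev in report["evaluations"]:
        buckets.setdefault(ev["score"], []).append({**ev, "section": report["section"]})
    return buckets
```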
Key Quote from Research
"LLMs are better at guided summarization than complex reasoning. If you can break down a problem into multiple smaller steps of guided summarization by creating a rubric, you can play to the strengths of LLMs and then rely on simple deterministic code to parse the output and score the rubric." — Monte Carlo, LLM-as-Judge best practices
Gemini Critique of Rubrics (via orch)
After fixing .envrc to include use_api_keys, ran high-temp Gemini critique.
Overlaps Identified
- Self-Containment vs Explicit Context had significant overlap
- Fix: Split into Contextual Independence (referential) + Environmental State (runtime)
Missing Dimensions for AI
- Verifiability - agents get stuck when they can't tell if command succeeded
- Format Integrity - broken markdown/JSON causes hallucinations
- Token Efficiency - fluff dilutes semantic weight in vector search
Key Insight
"Agents fail when given choices without criteria. Change 'Use a large instance' to 'Use instance type t3.2xlarge or greater.'"
Revised Rubrics (v3 final: 10 dimensions, merged)
| Phase | # | Rubric |
|---|---|---|
| Read | 1 | Format Integrity |
| Find | 2 | Semantic Headings |
| Find | 3 | Contextual Independence |
| Run | 4 | Configuration Precision (merged: Env State + Tech Specificity) |
| Run | 5 | Code Executability |
| Run | 6 | Deterministic Instructions |
| Verify | 7 | Execution Verification (merged: Verifiable Output + Error Recovery) |
| Optimize | 8 | Terminology Strictness |
| Optimize | 9 | Token Efficiency |
| Optimize | 10 | Security Boundaries |
Prompt Template Critique (Gemini round 3)
Problem: 10 rubrics = cognitive overload
"Accuracy drops significantly after 3-5 distinct evaluation criteria per prompt"
Anti-patterns identified
- JSON-only output - model commits to score before thinking (fix: CoT first)
- Evidence hallucination - model invents quotes (fix: allow null)
- Positional bias - middle rubrics degrade (fix: split passes)
Architecture options
- 2-Pass: Linter (structural) + Agent Simulator (semantic)
- Many-Pass: One rubric per pass (max accuracy, max latency)
- Iterative/Recursive: Quick scan → deep-dive on failures
- Hybrid: Single-pass quick, multi-pass on demand
The "Agent Simulator" persona
"Simulate execution step-by-step. Identify where you get stuck, where instructions are ambiguous, or where you might hallucinate due to missing context."
Architecture Research (web search + Gemini round 4)
Frameworks discovered
| Framework | Key Insight |
|---|---|
| PRE (ACM 2025) | One rubric per call for accuracy |
| G-Eval | 3-step: define → CoT → execute |
| EvalPlanner | Plan → Execute → Judge |
| Spring AI Recursive | Generate → Evaluate → Retry loop |
| ARISE | 22 specialized agents for different tasks |
| Realm | Recursive refinement with Bayesian aggregation |
| HuCoSC | Break complex problems into independent analyses |
Final Architecture: "Triage & Specialist" Cascade
Phase 1: TRIAGE SCAN (cheap model)
├── Input: doc chunk + 10 rubric definitions
├── Output: list of suspect rubrics
└── Cost: ~0 for clean docs

Phase 2: SPECIALIST AGENTS (PRE-style, parallel)
├── One agent per suspect rubric
└── Each outputs: patch for that dimension

Phase 3: VERIFICATION LOOP (recursive)
├── Re-evaluate patched content
└── Retry with feedback if still FAIL
Why this pattern
- Adaptive cost: 80% of clean content pays near-zero
- PRE accuracy: One dimension at a time reduces hallucination
- Recursive safety: Verify patches don't introduce regressions
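The recursive piece is a bounded retry around the specialist step. A sketch of the verification loop, assuming `evaluate` and `generate_patch` are wrappers around the relevant LLM calls:

```python
MAX_ROUNDS = 3  # bound the recursion so a stubborn rubric cannot loop forever

def patch_until_pass(section: str, rubric: str, evaluate, generate_patch) -> str:
    """Phase 3 sketch: re-evaluate patched content, retrying with feedback while it FAILs."""
    content = section
    for _ in range(MAX_ROUNDS):
        verdict = evaluate(content, rubric)      # e.g. {"score": ..., "reasoning": ...}
        if verdict["score"] != "FAIL":
            return content                       # PASS or MARGINAL: stop patching
        content = generate_patch(content, rubric, verdict["reasoning"])
    return content                               # keep the last attempt after MAX_ROUNDS
```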
Vale Discovery (Session Turning Point)
Web search for "prose linter markdown documentation CI" revealed Vale:
- Open-source, extensible prose linter
- YAML-based rules (existence, substitution, consistency checks)
- CI-friendly (runs in seconds, no LLM cost)
- Already used by: Microsoft, Google, GitLab, DigitalOcean
Realization: 5 of our rubrics are pattern-based → Vale can handle them:
- Format Integrity (missing language tags)
- Semantic Headings (banned words: "Overview", "Introduction")
- Deterministic Instructions (hedging words: "might", "consider")
- Terminology Strictness (consistency checks)
- Token Efficiency (filler phrases)
This shifted the architecture from "Claude skill with all-LLM" to "standalone tool with Vale + LLM hybrid".
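How the five candidates would likely map onto Vale's rule types (existence, substitution, and consistency are Vale's own extension points; which one fits each rubric is a working assumption, to be settled in doc-review-i8n):

```python
# Working plan: deterministic rubric -> likely Vale extension point (assumption).
VALE_RULE_PLAN = {
    "format_integrity":           "existence",    # flag code fences missing a language tag
    "semantic_headings":          "existence",    # ban generic headings: "Overview", "Introduction"
    "deterministic_instructions": "existence",    # hedging words: "might", "consider"
    "terminology_strictness":     "consistency",  # one term per concept across the doc
    "token_efficiency":           "existence",    # filler phrases that dilute semantic weight
}
```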
Consensus Results
"Should doc-review be a skill or separate repo?"
GPT: "Start with MVP in separate repo. Skill integration can come later." Gemini: "Invest in orch as orchestration layer. Design doc-review for reuse." Both agreed: Separate repo is the right call.
Final Architecture (Vale + LLM Hybrid)
Stage 1: Vale (deterministic, fast, free)
├── Catches ~40% of issues instantly
├── Runs in CI on every commit
└── No LLM cost for clean docs

Stage 2: LLM Triage (cheap model)
├── Only runs if Vale passes
└── Evaluates 3 semantic rubrics (contextual independence, code executability, execution verification)

Stage 3: LLM Specialists (capable model)
├── One agent per failed rubric
└── Generates patches
Session Metrics
- Commits made: 1 (139a521 - doc-review design session complete)
- Files touched: 19 (per git diff --stat)
- Lines added: +4187
- Lines removed: -65
- Beads created: 8 (skills-bcu, skills-1ig, skills-53k in skills; doc-review-i8n, doc-review-4to, doc-review-xy3 in doc-review; orch-z2x, orch-6o3 in orch)
- Beads closed: 3 (skills-bcu, skills-1ig, skills-53k)
- Web searches: 8 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive, prose linter, Vale)
- Web fetches: 5 (GitHub blog, kapa.ai, biel.ai, Monte Carlo, vale.sh)
- Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis)
- Consensus runs: 2 (spin-off decision, implementation approach)
- Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged)
- Key outcome: Vale + LLM hybrid architecture, spun off to ~/proj/doc-review