Doc-Review Skill Design: Research, Rubrics, Vale Discovery, and Repo Spin-off

Session Summary

Date: 2025-12-04 (Day 1 of doc-review skill)

Focus Area: Designing a doc-review skill to normalize documentation and fight documentation drift

Accomplishments

  • Created bead structure for doc-review skill design (skills-bcu with 2 blockers)
  • Researched agent-friendly documentation conventions via web search
  • Analyzed GitHub's study of 2,500+ AGENTS.md repositories
  • Researched LLM-as-judge rubric best practices
  • Drafted 7 decomposed rubrics for documentation evaluation (v1)
  • Filed orch UX issue (orch-z2x) for missing model availability command
  • Fixed .envrc to include use_api_keys for orch access
  • Got Gemini critique via orch - identified overlaps and 3 missing dimensions
  • Revised rubrics to v2 (12 dimensions), merged to v3 (10 dimensions)
  • Created prompt-template-v1.md with full system prompt and example
  • Got Gemini critique on prompt - identified cognitive overload with 10 rubrics
  • Researched multi-pass/recursive LLM evaluation architectures (PRE, G-Eval, ARISE)
  • Designed "Triage & Specialist" cascade architecture
  • Discovered Vale prose linter - can handle 4-7 rubrics deterministically
  • Designed Vale + LLM hybrid architecture (deterministic first, LLM only for semantic)
  • Ran GPT/Gemini consensus on spin-off decision - both recommended separate repo
  • Created ~/proj/doc-review with beads setup
  • Migrated artifacts: rubrics-v3.md, prompt-template-v1.md to design/ folder
  • Updated doc-review/AGENTS.md with full architecture context
  • Created beads in doc-review: doc-review-i8n (Vale), doc-review-4to (LLM), doc-review-xy3 (CLI)
  • Closed skills repo beads: skills-bcu, skills-1ig, skills-53k
  • Committed and pushed skills repo

Key Decisions

Decision 1: Non-interactive patch-based workflow

  • Context: Deciding how doc-review should integrate with human workflow
  • Options considered:

    1. Direct edit mode - LLM directly modifies docs
    2. Report-only mode - LLM produces report, human edits
    3. Patch-based mode - LLM generates patches, human reviews and applies
  • Rationale: Patch-based gives human control, enables review before apply, works with existing git workflow
  • Impact: Two-phase workflow: generate patches (non-interactive), review patches (interactive Claude session)
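
A minimal sketch of what the review-and-apply side could look like, assuming the patches come out as ordinary unified diffs (paths and filenames below are illustrative):

# Inspect what a generated patch would touch
git apply --stat /tmp/foo-patches/readme-self-containment.patch

# Dry run: confirm it still applies cleanly against the working tree
git apply --check /tmp/foo-patches/readme-self-containment.patch

# Apply selectively, then review the result as a normal git diff
git apply /tmp/foo-patches/readme-self-containment.patch
git diff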

Decision 2: Decomposed rubrics over holistic evaluation

  • Context: How should the LLM evaluate documentation quality?
  • Options considered:

    1. Single holistic prompt ("rate this doc's quality")
    2. Multi-dimensional single pass ("rate on 10 dimensions")
    3. Decomposed rubrics (one dimension per evaluation)
  • Rationale: Research shows LLMs are better at "guided summarization" than complex reasoning. The decomposed approach plays to LLM strengths.
  • Impact: 7 separate rubrics, each with a clear question, scoring levels, and detect patterns

Decision 3: Three-level scoring (PASS/MARGINAL/FAIL)

  • Context: What granularity for scoring?
  • Options considered:

    1. Binary (pass/fail)
    2. Three-level (pass/marginal/fail)
    3. Five or ten point scale
  • Rationale: Binary loses nuance; 5+ point scales introduce ambiguity. Three levels give an actionable distinction.
  • Impact: FAIL triggers patch generation, MARGINAL flags for human review, PASS means agent-friendly
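
A minimal sketch of how the three levels could drive downstream behaviour (the score/rubric variables and the generate_patch helper are hypothetical):

# Hypothetical dispatch on a per-rubric score
case "$score" in
  FAIL)     generate_patch "$rubric" "$section" ;;                            # hand off to patch generation
  MARGINAL) echo "needs human review: $rubric ($section)" >> review-queue.txt ;;
  PASS)     : ;;                                                              # agent-friendly, nothing to do
esac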

Decision 4: Graph-based doc discovery (deferred)

  • Context: How does doc-review find which docs to evaluate?
  • Decision: Start from README.md or AGENTS.md and graph out via links
  • Rationale: Not all .md files are documentation. Following links from root finds connected docs.
  • Impact: Created separate bead (skills-53k) to design discovery algorithm
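
The discovery algorithm itself lives in its own bead, but as a flavour of the idea, a single-level pass over markdown links from the root could look like this (sketch only; a real version would recurse and resolve relative paths):

# Extract .md targets linked from the root doc
grep -oE '\]\([^)]+\.md\)' README.md | sed -E 's/\]\((.*)\)/\1/' | sort -u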

Decision 5: "Instruction Clarity" rubric added

  • Context: Initial 6 rubrics didn't cover the "boundaries" pattern from AGENTS.md research
  • Discussion: User asked what "boundaries" meant. Realized the /⚠️/🚫 pattern generalizes beyond AGENTS.md
  • Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
  • Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files

Decision 6: Vale + LLM hybrid architecture

  • Context: Gemini critique revealed 10 rubrics in one prompt causes cognitive overload
  • Discovery: Vale (prose linter) can handle 4-7 rubrics deterministically with YAML rules
  • Options considered:

    1. Pure LLM approach (original plan)
    2. Pure Vale approach (limited to pattern matching)
    3. Vale + LLM hybrid (deterministic first, LLM for semantic)
  • Rationale: Vale catches ~40% of issues instantly, for free, in CI. LLM only needed for semantic analysis.
  • Impact: Three-stage pipeline:

    • Stage 1: Vale (fast/free) - Format Integrity, Semantic Headings, Deterministic Instructions, Terminology, Token Efficiency
    • Stage 2: LLM Triage (cheap model) - Quick semantic scan
    • Stage 3: LLM Specialists (capable model) - Patch generation per failed rubric

Decision 7: Spin off to separate repository

  • Context: Project grew from "Claude skill" to "standalone tool with Vale + LLM"
  • Ran consensus: GPT and Gemini both recommended separate repo
  • GPT reasoning: "Start with MVP in separate repo, iterate fast"
  • Gemini reasoning: "Invest in orch integration, design for reuse"
  • Decision: Created ~/proj/doc-review as standalone project
  • Impact: Clean separation, own beads, can be used outside claude-code context

Problems & Solutions

Problem | Solution | Learning
orch CLI failed - no API keys configured | Tried 4 models (gemini, flash, deepseek, gpt) - all failed | Need `orch models` command to show available models
API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access
False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns
Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference
10 rubrics in one LLM prompt = cognitive overload | Split: Vale for deterministic, LLM for semantic only | Play to tool strengths - deterministic tools for pattern matching
OPENROUTER_KEY vs OPENROUTER_API_KEY mismatch | Filed bug orch-6o3 in ~/proj/orch | Environment variable naming must be consistent
bd sync failed with worktree error | Used regular git add + commit instead | bd sync has edge cases with certain git states
doc-review repo missing git identity | Left for user to configure user.email/name | New repos need git config before first commit
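
For reference, the .envrc fix from the second row amounts to two direnv lines (the rest of the file is assumed unchanged):

# .envrc
use flake        # was already present
use_api_keys     # added so orch can see the provider API keys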

Technical Details

Artifacts Created

  • `/tmp/doc-review-drafts/rubrics-v1.md` - Initial 7 rubric definitions
  • `/tmp/doc-review-drafts/rubrics-v3.md` - Final 10 rubrics (320 lines)
  • `/tmp/doc-review-drafts/prompt-template-v1.md` - LLM system prompt with example
  • `~/proj/doc-review/docs/design/rubrics-v3.md` - Migrated final rubrics
  • `~/proj/doc-review/docs/design/prompt-template-v1.md` - Migrated prompt template
  • `~/proj/doc-review/AGENTS.md` - Full project context for next agent

Beads Created/Updated

In skills repo (all closed):

  • `skills-bcu` - Main design bead (closed: spun off to doc-review)
  • `skills-1ig` - Conventions research (closed: completed)
  • `skills-53k` - Graph discovery design (closed: moved to doc-review)

In doc-review repo:

  • `doc-review-i8n` - Implement Vale style for rubrics
  • `doc-review-4to` - Implement LLM prompts for semantic rubrics
  • `doc-review-xy3` - Design CLI interface

In orch repo:

  • `orch-z2x` - Model availability command feature
  • `orch-6o3` - OPENROUTER_KEY vs OPENROUTER_API_KEY bug

Commands Used

# Web search for conventions and research
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
WebSearch: "LLM as judge rubric prompt engineering best practices"
WebSearch: "multi-pass LLM evaluation recursive evaluation framework"
WebSearch: "prose linter markdown documentation CI"

# Web fetch for deep dives
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
WebFetch: docs.kapa.ai/improving/writing-best-practices
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
WebFetch: montecarlodata.com/blog-llm-as-judge/
WebFetch: vale.sh

# orch consultations
orch ask gemini --temp 1.2 "Critique these rubrics for AI-optimized documentation..."
orch consensus gpt gemini "Should doc-review be a skill or separate repo?"

# Migration commands
mkdir -p ~/proj/doc-review/docs/design
cp /tmp/doc-review-drafts/*.md ~/proj/doc-review/docs/design/

# Beads workflow
bd close skills-bcu skills-1ig skills-53k

The 10 Rubrics (Final v3)

Organized by "Agent's Hierarchy of Needs":

Phase | # | Rubric | Tool | Key Question
Read | 1 | Format Integrity | Vale | Valid markdown? Code blocks tagged?
Find | 2 | Semantic Headings | Vale | Headings contain task+object keywords?
Find | 3 | Contextual Independence | LLM | No "as mentioned above"?
Run | 4 | Configuration Precision | Partial | Exact versions, flags, paths?
Run | 5 | Code Executability | LLM | All imports present?
Run | 6 | Deterministic Instructions | Vale | No hedging ("might", "consider")?
Verify | 7 | Execution Verification | LLM | Expected output + error recovery?
Optimize | 8 | Terminology Strictness | Vale | 1:1 term-concept mapping?
Optimize | 9 | Token Efficiency | Vale | No filler phrases?
Optimize | 10 | Security Boundaries | Partial | No hardcoded secrets?

Tool assignment:

  • Vale (5): 1, 2, 6, 8, 9 - Pattern matching, deterministic
  • LLM (3): 3, 5, 7 - Semantic understanding required
  • Partial (2): 4, 10 - Hybrid (Vale for patterns, LLM for edge cases)

Each rubric includes:

  • Scoring levels with clear definitions (PASS/MARGINAL/FAIL)
  • Detect patterns (specific things to look for)
  • Reasoning requirement (must quote evidence)
  • Suggested fix field in output

Process and Workflow

What Worked Well

  • Web search → web fetch pipeline for research
  • Beads for tracking design decisions
  • Iterative rubric development with user feedback
  • User pushing back on false distinctions

What Was Challenging

  • orch unavailable due to missing API keys - blocked brainstorming with external model
  • Keeping rubrics crisp - tendency to over-specify

Learning and Insights

Technical Insights

  • LLMs are "better at guided summarization than complex reasoning"
  • One few-shot example often outperforms multiple (diminishing returns)
  • Decomposed evaluation + deterministic aggregation > holistic evaluation
  • llms.txt is emerging as robots.txt equivalent for LLM access
  • MCP (Model Context Protocol) for structured doc discovery

Process Insights

  • Research before design - the web search surfaced patterns we wouldn't have invented
  • GitHub's 2,500 repo study provided concrete evidence for conventions
  • Asking "what would a rubric for X look like" forces clarity on what X means

Architectural Insights

  • Two-phase workflow (generate then review) separates concerns
  • Patches as intermediate format enables git integration
  • Per-dimension rubrics enable independent iteration

Key Research Sources

  • GitHub blog: 2,500+ AGENTS.md analysis - "one snippet beats three paragraphs"
  • kapa.ai: proximity principle, self-contained sections, semantic discoverability
  • biel.ai: single purpose per section, complete code examples
  • Monte Carlo: 7 best practices for LLM-as-judge, failure modes
  • ACM ICER 2025: "Rubric Is All You Need" paper

Context for Future Work

Open Questions

  • How to handle large repos with many doc files? (chunking strategy)
  • Should rubrics be weighted differently?
  • How to handle generated docs (should they be excluded)?
  • What's the right model for different stages (triage vs patch generation)?
  • Vale vs markdownlint vs custom - which linter is best for this use case?

Next Steps (in ~/proj/doc-review)

  1. Implement Vale style for 5 deterministic rubrics (doc-review-i8n)
  2. Implement LLM prompts for 3 semantic rubrics (doc-review-4to)
  3. Design CLI interface (doc-review-xy3)
  4. Graph-based doc discovery (from skills-53k, needs new bead)
  5. Test on real repos (skills repo is good candidate)

Related Work

  • ~/proj/doc-review - New home for this project
  • doc-review-i8n, doc-review-4to, doc-review-xy3 - Active beads
  • orch-z2x: Model availability feature for orch
  • orch-6o3: OPENROUTER env var naming bug

Raw Notes

Envisioned Workflow (from initial brainstorm)

# Phase 1: Generate patches (non-interactive, burnable credits)
doc-review scan ~/proj/foo --model claude-sonnet --output /tmp/foo-patches/

# Phase 2: Review patches (interactive session)
cd ~/proj/foo
claude  # human reviews patches, applies selectively

Skill Structure (tentative)

skills/doc-review/
├── prompt.md           # Core review instructions + style guide
├── scan.sh             # Orchestrates: find docs → invoke claude → emit patches
└── README.md

Output Format for Evaluations

{
  "section": "## Installation",
  "line_range": [15, 42],
  "evaluations": [
    {
      "rubric": "self_containment",
      "score": "FAIL",
      "reasoning": "Line 23 says 'as configured above' but the configuration is in a different file.",
      "evidence": "as configured above",
      "suggested_fix": "Add explicit reference: 'as configured in config.yaml (see Configuration section)'"
    }
  ]
}
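
With evaluations in this shape, simple deterministic code can aggregate scores without further LLM calls; for example, assuming the output above is saved as eval.json, a jq one-liner pulls out the failing rubrics and their fixes:

# List failed rubrics with their suggested fixes
jq -r '.evaluations[] | select(.score == "FAIL") | "\(.rubric): \(.suggested_fix)"' eval.json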

Key Quote from Research

"LLMs are better at guided summarization than complex reasoning. If you can break down a problem into multiple smaller steps of guided summarization by creating a rubric, you can play to the strengths of LLMs and then rely on simple deterministic code to parse the output and score the rubric." — Monte Carlo, LLM-as-Judge best practices

Gemini Critique of Rubrics (via orch)

After fixing .envrc to include use_api_keys, ran high-temp Gemini critique.

Overlaps Identified

  • Self-Containment vs Explicit Context had significant overlap
  • Fix: Split into Contextual Independence (referential) + Environmental State (runtime)

Missing Dimensions for AI

  1. Verifiability - agents get stuck when they can't tell if command succeeded
  2. Format Integrity - broken markdown/JSON causes hallucinations
  3. Token Efficiency - fluff dilutes semantic weight in vector search

Key Insight

"Agents fail when given choices without criteria. Change 'Use a large instance' to 'Use instance type t3.2xlarge or greater.'"

Revised Rubrics (v3 final: 10 dimensions, merged)

Phase | # | Rubric
Read | 1 | Format Integrity
Find | 2 | Semantic Headings
Find | 3 | Contextual Independence
Run | 4 | Configuration Precision (merged: Env State + Tech Specificity)
Run | 5 | Code Executability
Run | 6 | Deterministic Instructions
Verify | 7 | Execution Verification (merged: Verifiable Output + Error Recovery)
Optimize | 8 | Terminology Strictness
Optimize | 9 | Token Efficiency
Optimize | 10 | Security Boundaries

Prompt Template Critique (Gemini round 3)

Problem: 10 rubrics = cognitive overload

"Accuracy drops significantly after 3-5 distinct evaluation criteria per prompt"

Anti-patterns identified

  1. JSON-only output - model commits to score before thinking (fix: CoT first)
  2. Evidence hallucination - model invents quotes (fix: allow null)
  3. Positional bias - middle rubrics degrade (fix: split passes)

Architecture options

  • 2-Pass: Linter (structural) + Agent Simulator (semantic)
  • Many-Pass: One rubric per pass (max accuracy, max latency)
  • Iterative/Recursive: Quick scan → deep-dive on failures
  • Hybrid: Single-pass quick, multi-pass on demand

The "Agent Simulator" persona

"Simulate execution step-by-step. Identify where you get stuck, where instructions are ambiguous, or where you might hallucinate due to missing context."

Architecture Research (web search + Gemini round 4)

Frameworks discovered

Framework | Key Insight
PRE (ACM 2025) | One rubric per call for accuracy
G-Eval | 3-step: define → CoT → execute
EvalPlanner | Plan → Execute → Judge
Spring AI | Recursive Generate → Evaluate → Retry loop
ARISE | 22 specialized agents for different tasks
Realm | Recursive refinement with Bayesian aggregation
HuCoSC | Break complex problems into independent analyses

Final Architecture: "Triage & Specialist" Cascade

Phase 1: TRIAGE SCAN (cheap model)
├── Input: doc chunk + 10 rubric definitions
├── Output: list of suspect rubrics
└── Cost: ~0 for clean docs

Phase 2: SPECIALIST AGENTS (PRE-style, parallel)
├── One agent per suspect rubric
└── Each outputs: patch for that dimension

Phase 3: VERIFICATION LOOP (recursive)
├── Re-evaluate patched content
└── Retry with feedback if still FAIL

Why this pattern

  • Adaptive cost: the ~80% of content that is clean pays near-zero
  • PRE accuracy: One dimension at a time reduces hallucination
  • Recursive safety: Verify patches don't introduce regressions
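
A rough shell-level sketch of the specialist phase (doc-review-specialist and suspect-rubrics.txt are hypothetical names; the point is the PRE-style one-rubric-per-call shape, fanned out in parallel):

# Phase 2: one specialist invocation per suspect rubric
while read -r rubric; do
  doc-review-specialist --rubric "$rubric" --input README.md \
    --output "/tmp/patches/${rubric}.patch" &
done < suspect-rubrics.txt
wait  # collect all patches before the verification loop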

Vale Discovery (Session Turning Point)

Web search for "prose linter markdown documentation CI" revealed Vale:

  • Open-source, extensible prose linter
  • YAML-based rules (existence, substitution, consistency checks)
  • CI-friendly (runs in seconds, no LLM cost)
  • Already used by: Microsoft, Google, GitLab, DigitalOcean

Realization: 5 of our rubrics are pattern-based → Vale can handle them:

  • Format Integrity (missing language tags)
  • Semantic Headings (banned words: "Overview", "Introduction")
  • Deterministic Instructions (hedging words: "might", "consider")
  • Terminology Strictness (consistency checks)
  • Token Efficiency (filler phrases)

This shifted the architecture from "Claude skill with all-LLM" to "standalone tool with Vale + LLM hybrid".
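
A sketch of how the Vale stage could be configured (StylesPath, MinAlertLevel, and BasedOnStyles are standard .vale.ini keys; the DocReview style name is hypothetical):

# Point Vale at a project-local custom style for markdown files
cat > .vale.ini <<'EOF'
StylesPath = styles
MinAlertLevel = warning

[*.md]
BasedOnStyles = DocReview
EOF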

Consensus Results

"Should doc-review be a skill or separate repo?"

GPT: "Start with MVP in separate repo. Skill integration can come later."

Gemini: "Invest in orch as orchestration layer. Design doc-review for reuse."

Both agreed: Separate repo is the right call.

Final Architecture (Vale + LLM Hybrid)

Stage 1: Vale (deterministic, fast, free)
├── Catches ~40% of issues instantly
├── Runs in CI on every commit
└── No LLM cost for clean docs

Stage 2: LLM Triage (cheap model)
├── Only runs if Vale passes
└── Evaluates 3 semantic rubrics (contextual independence, code executability, execution verification)

Stage 3: LLM Specialists (capable model)
├── One agent per failed rubric
└── Generates patches
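
A sketch of the gating between stages, assuming a hypothetical doc-review-triage command (vale is the only real tool here):

# Stage 1: deterministic lint; stop early so clean docs cost nothing
if ! vale --config=.vale.ini --output=JSON docs/ > vale-report.json; then
  echo "Vale found deterministic issues; fix these before spending LLM tokens"
  exit 1
fi

# Stage 2: cheap-model triage over the three semantic rubrics
doc-review-triage docs/ > suspect-rubrics.txt

# Stage 3: capable-model specialists, one per suspect rubric (same loop as the cascade sketch above)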

Session Metrics

  • Commits made: 1 (139a521 - doc-review design session complete)
  • Files touched: 19 (per git diff stat)
  • Lines added: +4187
  • Lines removed: -65
  • Beads created: 8 (skills-bcu, skills-1ig, skills-53k in skills; doc-review-i8n, doc-review-4to, doc-review-xy3 in doc-review; orch-z2x, orch-6o3 in orch)
  • Beads closed: 3 (skills-bcu, skills-1ig, skills-53k)
  • Web searches: 8 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive, prose linter, Vale)
  • Web fetches: 5 (GitHub blog, kapa.ai, biel.ai, Monte Carlo, vale.sh)
  • Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis)
  • Consensus runs: 2 (spin-off decision, implementation approach)
  • Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged)
  • Key outcome: Vale + LLM hybrid architecture, spun off to ~/proj/doc-review