docs: complete worklog for doc-review design session

Adds Vale discovery, spin-off decision, migration details,
and updated session metrics to the design session worklog.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
dan 2025-12-04 17:54:05 -08:00
parent 139a521a89
commit def212bc5b


@@ -1,7 +1,7 @@
#+TITLE: Doc-Review Skill Design: Research and Rubric Development
#+TITLE: Doc-Review Skill Design: Research, Rubrics, Vale Discovery, and Repo Spin-off
#+DATE: 2025-12-04
#+KEYWORDS: doc-review, rubrics, LLM-as-judge, agent-friendly-docs, documentation-drift, AGENTS.md
#+COMMITS: 0
#+KEYWORDS: doc-review, rubrics, LLM-as-judge, agent-friendly-docs, documentation-drift, AGENTS.md, vale, hybrid-architecture
#+COMMITS: 1
#+COMPRESSION_STATUS: uncompressed
* Session Summary
@@ -17,10 +17,20 @@
- [X] Filed orch UX issue (orch-z2x) for missing model availability command
- [X] Fixed .envrc to include use_api_keys for orch access
- [X] Got Gemini critique via orch - identified overlaps and 3 missing dimensions
- [X] Revised rubrics to v2 (10 dimensions) based on AI-optimization feedback
- [ ] Update rubrics-v2.md with full definitions
- [ ] Prompt template (next step)
- [ ] Before/after examples (next step)
- [X] Revised rubrics to v2 (12 dimensions), merged to v3 (10 dimensions)
- [X] Created prompt-template-v1.md with full system prompt and example
- [X] Got Gemini critique on prompt - identified cognitive overload with 10 rubrics
- [X] Researched multi-pass/recursive LLM evaluation architectures (PRE, G-Eval, ARISE)
- [X] Designed "Triage & Specialist" cascade architecture
- [X] Discovered Vale prose linter - can handle 4-7 rubrics deterministically
- [X] Designed Vale + LLM hybrid architecture (deterministic first, LLM only for semantic)
- [X] Ran GPT/Gemini consensus on spin-off decision - both recommended separate repo
- [X] Created ~/proj/doc-review with beads setup
- [X] Migrated artifacts: rubrics-v3.md, prompt-template-v1.md to design/ folder
- [X] Updated doc-review/AGENTS.md with full architecture context
- [X] Created beads in doc-review: doc-review-i8n (Vale), doc-review-4to (LLM), doc-review-xy3 (CLI)
- [X] Closed skills repo beads: skills-bcu, skills-1ig, skills-53k
- [X] Committed and pushed skills repo
* Key Decisions
** Decision 1: Non-interactive patch-based workflow
@@ -62,6 +72,27 @@
- Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
- Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files
** Decision 6: Vale + LLM hybrid architecture
- Context: Gemini critique revealed 10 rubrics in one prompt causes cognitive overload
- Discovery: Vale (prose linter) can handle 4-7 rubrics deterministically with YAML rules
- Options considered:
1. Pure LLM approach (original plan)
2. Pure Vale approach (limited to pattern matching)
3. Vale + LLM hybrid (deterministic first, LLM for semantic)
- Rationale: Vale catches ~40% of issues instantly, for free, in CI. LLM only needed for semantic analysis.
- Impact: Three-stage pipeline:
- Stage 1: Vale (fast/free) - Format Integrity, Semantic Headings, Deterministic Instructions, Terminology, Token Efficiency
- Stage 2: LLM Triage (cheap model) - Quick semantic scan
- Stage 3: LLM Specialists (capable model) - Patch generation per failed rubric
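A minimal sketch of how the three stages could be chained in a driver script; the script, file paths, and exit-code gating are assumptions, and the Stage 2/3 calls are only sketched further down in this worklog:
#+begin_src bash
# Hypothetical pipeline driver - nothing here is implemented yet.
set -euo pipefail
doc="$1"

# Stage 1: Vale. Error-level findings make vale exit non-zero; stop before spending LLM tokens.
if ! vale "$doc"; then
  echo "Fix Vale findings first; skipping LLM stages." >&2
  exit 1
fi

# Stage 2: cheap-model triage over the 3 semantic rubrics (prompt sketch under "The 10 Rubrics").
# Stage 3: one capable-model specialist per rubric the triage flags (patch sketch in Raw Notes).
echo "Vale clean: $doc ready for LLM triage"
#+end_src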
** Decision 7: Spin off to separate repository
- Context: Project grew from "Claude skill" to "standalone tool with Vale + LLM"
- Ran consensus: GPT and Gemini both recommended separate repo
- GPT reasoning: "Start with MVP in separate repo, iterate fast"
- Gemini reasoning: "Invest in orch integration, design for reuse"
- Decision: Created ~/proj/doc-review as standalone project
- Impact: Clean separation, own beads, can be used outside claude-code context
* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
@@ -69,48 +100,87 @@
| API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access |
| False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns |
| Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference |
| 10 rubrics in one LLM prompt = cognitive overload | Split: Vale for deterministic, LLM for semantic only | Play to tool strengths - deterministic tools for pattern matching |
| OPENROUTER_KEY vs OPENROUTER_API_KEY mismatch | Filed bug orch-6o3 in ~/proj/orch | Environment variable naming must be consistent |
| bd sync failed with worktree error | Used regular git add + commit instead | bd sync has edge cases with certain git states |
| doc-review repo missing git identity | Left for user to configure user.email/name | New repos need git config before first commit |
* Technical Details
** Artifacts Created
- `/tmp/doc-review-drafts/rubrics-v1.md` - Full rubric definitions (181 lines)
- `skills-bcu` - Main design bead with workflow description
- `skills-1ig` - Conventions research bead (in_progress)
- `skills-53k` - Graph discovery design bead (open)
- `orch-z2x` - Filed in ~/proj/orch for model availability feature
- `/tmp/doc-review-drafts/rubrics-v1.md` - Initial 7 rubric definitions
- `/tmp/doc-review-drafts/rubrics-v3.md` - Final 10 rubrics (320 lines)
- `/tmp/doc-review-drafts/prompt-template-v1.md` - LLM system prompt with example
- `~/proj/doc-review/docs/design/rubrics-v3.md` - Migrated final rubrics
- `~/proj/doc-review/docs/design/prompt-template-v1.md` - Migrated prompt template
- `~/proj/doc-review/AGENTS.md` - Full project context for next agent
** Beads Created/Updated
In skills repo (all closed):
- `skills-bcu` - Main design bead (closed: spun off to doc-review)
- `skills-1ig` - Conventions research (closed: completed)
- `skills-53k` - Graph discovery design (closed: moved to doc-review)
In doc-review repo:
- `doc-review-i8n` - Implement Vale style for rubrics
- `doc-review-4to` - Implement LLM prompts for semantic rubrics
- `doc-review-xy3` - Design CLI interface
In orch repo:
- `orch-z2x` - Model availability command feature
- `orch-6o3` - OPENROUTER_KEY vs OPENROUTER_API_KEY bug
** Commands Used
#+begin_src bash
# Web search for conventions
# Web search for conventions and research
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
WebSearch: "LLM as judge rubric prompt engineering best practices"
WebSearch: "multi-pass LLM evaluation recursive evaluation framework"
WebSearch: "prose linter markdown documentation CI"
# Web fetch for deep dives
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
WebFetch: docs.kapa.ai/improving/writing-best-practices
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
WebFetch: montecarlodata.com/blog-llm-as-judge/
WebFetch: vale.sh
# orch consultations
orch ask gemini --temp 1.2 "Critique these rubrics for AI-optimized documentation..."
orch consensus gpt gemini "Should doc-review be a skill or separate repo?"
# Migration commands
mkdir -p ~/proj/doc-review/docs/design
cp /tmp/doc-review-drafts/*.md ~/proj/doc-review/docs/design/
# Beads workflow
bd create --title="Design doc-review skill" --type=feature
bd dep add skills-bcu skills-1ig
bd dep add skills-bcu skills-53k
bd close skills-bcu skills-1ig skills-53k
#+end_src
** The 7 Rubrics
| # | Rubric | Key Question |
|---+--------+--------------|
| 1 | Self-Containment | Can section be understood alone? |
| 2 | Heading Structure | Logical hierarchy, descriptive names? |
| 3 | Terminology Consistency | Same term for same concept? |
| 4 | Code Example Completeness | Examples runnable as shown? |
| 5 | Explicit Context | Prerequisites stated, no implicit knowledge? |
| 6 | Technical Specificity | Exact versions, flags, commands? |
| 7 | Instruction Clarity | Required vs optional vs dangerous? |
** The 10 Rubrics (Final v3)
Organized by "Agent's Hierarchy of Needs":
| Phase | # | Rubric | Tool | Key Question |
|-------+---+--------+------+--------------|
| Read | 1 | Format Integrity | Vale | Valid markdown? Code blocks tagged? |
| Find | 2 | Semantic Headings | Vale | Headings contain task+object keywords? |
| Find | 3 | Contextual Independence | LLM | No "as mentioned above"? |
| Run | 4 | Configuration Precision | Partial | Exact versions, flags, paths? |
| Run | 5 | Code Executability | LLM | All imports present? |
| Run | 6 | Deterministic Instructions | Vale | No hedging ("might", "consider")? |
| Verify | 7 | Execution Verification | LLM | Expected output + error recovery? |
| Optimize | 8 | Terminology Strictness | Vale | 1:1 term-concept mapping? |
| Optimize | 9 | Token Efficiency | Vale | No filler phrases? |
| Optimize | 10 | Security Boundaries | Partial | No hardcoded secrets? |
Tool assignment:
- **Vale (5)**: 1, 2, 6, 8, 9 - Pattern matching, deterministic
- **LLM (3)**: 3, 5, 7 - Semantic understanding required
- **Partial (2)**: 4, 10 - Hybrid (Vale for patterns, LLM for edge cases)
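To make the Vale assignment concrete, a hedged sketch of one rule for Rubric 6; the style name, file path, and token list are assumptions, while `extends: existence` and the other YAML keys are standard Vale rule fields:
#+begin_src bash
# Hypothetical Vale rule flagging hedging words (Rubric 6: Deterministic Instructions).
mkdir -p styles/DocReview
cat > styles/DocReview/Hedging.yml <<'EOF'
extends: existence
message: "Hedging word '%s' - state the instruction deterministically."
level: error
ignorecase: true
tokens:
  - might
  - consider
  - perhaps
EOF
#+end_src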
Each rubric includes:
- Scoring levels with clear definitions
- Scoring levels with clear definitions (PASS/MARGINAL/FAIL)
- Detect patterns (specific things to look for)
- Reasoning requirement (must quote evidence)
- Suggested fix field in output
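A hedged sketch of how those output requirements could be folded into a Stage 2 call through orch; the model alias, prompt wording, and target file are assumptions:
#+begin_src bash
# Hypothetical triage call for one semantic rubric; docs/install.md is a placeholder path.
section=$(cat docs/install.md)
orch ask gemini "Evaluate this section against Rubric 3 (Contextual Independence).
Respond with a verdict (PASS/MARGINAL/FAIL), the quoted evidence behind it, and a suggested fix.

$section"
#+end_src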
@@ -159,19 +229,21 @@ Each rubric includes:
- How to handle large repos with many doc files? (chunking strategy)
- Should rubrics be weighted differently?
- How to handle generated docs (should they be excluded)?
- What's the right model for different stages (eval vs patch generation)?
- What's the right model for different stages (triage vs patch generation)?
- Vale vs markdownlint vs custom - which linter is best for this use case?
** Next Steps
1. Draft the prompt template that uses these rubrics
2. Create before/after examples for few-shot
3. Design graph-based doc discovery (skills-53k)
4. Prototype scan.sh script
5. Test on real repos
** Next Steps (in ~/proj/doc-review)
1. Implement Vale style for 5 deterministic rubrics (doc-review-i8n)
2. Implement LLM prompts for 3 semantic rubrics (doc-review-4to)
3. Design CLI interface (doc-review-xy3)
4. Graph-based doc discovery (from skills-53k, needs new bead)
5. Test on real repos (skills repo is good candidate)
** Related Work
- skills-53k: Graph-based doc discovery design
- skills-bcu: Parent design bead
- ~/proj/doc-review - New home for this project
- doc-review-i8n, doc-review-4to, doc-review-xy3 - Active beads
- orch-z2x: Model availability feature for orch
- orch-6o3: OPENROUTER env var naming bug
** External References
- https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
@@ -180,6 +252,7 @@ Each rubric includes:
- https://www.montecarlodata.com/blog-llm-as-judge/
- https://arxiv.org/abs/2503.23989 (Rubric Is All You Need)
- https://agents.md (AGENTS.md standard)
- https://vale.sh (Vale prose linter)
* Raw Notes
@@ -313,13 +386,55 @@ Phase 3: VERIFICATION LOOP (recursive)
- PRE accuracy: One dimension at a time reduces hallucination
- Recursive safety: Verify patches don't introduce regressions
** Vale Discovery (Session Turning Point)
Web search for "prose linter markdown documentation CI" revealed Vale:
- Open-source, extensible prose linter
- YAML-based rules (existence, substitution, consistency checks)
- CI-friendly (runs in seconds, no LLM cost)
- Already used by: Microsoft, Google, GitLab, DigitalOcean
Realization: 5 of our rubrics are pattern-based → Vale can handle them:
- Format Integrity (missing language tags)
- Semantic Headings (banned words: "Overview", "Introduction")
- Deterministic Instructions (hedging words: "might", "consider")
- Terminology Strictness (consistency checks)
- Token Efficiency (filler phrases)
This shifted the architecture from a "Claude skill with all-LLM" to a "standalone tool with Vale + LLM hybrid".
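For reference, a minimal sketch of how Vale could be wired into the repo; the style name and paths are assumptions, while `StylesPath`, `MinAlertLevel`, and `BasedOnStyles` are standard `.vale.ini` keys:
#+begin_src bash
# Hypothetical .vale.ini so the same rules run locally and as a CI step.
cat > .vale.ini <<'EOF'
StylesPath = styles
MinAlertLevel = warning

[*.md]
BasedOnStyles = DocReview
EOF
vale docs/   # seconds to run, no LLM cost
#+end_src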
** Consensus Results
*** "Should doc-review be a skill or separate repo?"
GPT: "Start with MVP in separate repo. Skill integration can come later."
Gemini: "Invest in orch as orchestration layer. Design doc-review for reuse."
Both agreed: Separate repo is the right call.
** Final Architecture (Vale + LLM Hybrid)
#+begin_example
Stage 1: Vale (deterministic, fast, free)
├── Catches ~40% of issues instantly
├── Runs in CI on every commit
└── No LLM cost for clean docs
Stage 2: LLM Triage (cheap model)
├── Only runs if Vale passes
└── Evaluates 3 semantic rubrics (contextual independence, code executability, execution verification)
Stage 3: LLM Specialists (capable model)
├── One agent per failed rubric
└── Generates patches
#+end_example
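A hedged sketch of a Stage 3 specialist step, tying back to the patch-based workflow from Decision 1; the model alias, prompt, and patch handling are all assumptions:
#+begin_src bash
# Hypothetical specialist call: request a unified diff for one failed rubric,
# then validate it before applying. docs/install.md is a placeholder path.
orch ask gpt "You are the specialist for Rubric 5 (Code Executability).
Produce a unified diff that fixes only this rubric in the section below.

$(cat docs/install.md)" > /tmp/rubric5.patch

git apply --check /tmp/rubric5.patch && git apply /tmp/rubric5.patch
#+end_src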
* Session Metrics
- Commits made: 0 (design session, no code yet)
- Files touched: 3 (rubrics-v1.md, rubrics-v3.md, prompt-template-v1.md in /tmp; .envrc)
- Beads created: 4 (skills-bcu, skills-1ig, skills-53k, orch-z2x)
- Beads closed: 1 (skills-dpw - duplicate moved to orch repo)
- Web searches: 6 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive)
- Web fetches: 4 (GitHub blog, kapa.ai, biel.ai, Monte Carlo)
- Commits made: 1 (139a521 - doc-review design session complete)
- Files touched: 19 (per git diff --stat)
- Lines added: +4187
- Lines removed: -65
- Beads created: 8 (skills-bcu, skills-1ig, skills-53k in skills; doc-review-i8n, doc-review-4to, doc-review-xy3 in doc-review; orch-z2x, orch-6o3 in orch)
- Beads closed: 3 (skills-bcu, skills-1ig, skills-53k)
- Web searches: 8 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive, prose linter, Vale)
- Web fetches: 5 (GitHub blog, kapa.ai, biel.ai, Monte Carlo, vale.sh)
- Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis)
- Consensus runs: 2 (spin-off decision, implementation approach)
- Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged)
- Key outcome: "Triage & Specialist" cascade architecture
- Key outcome: Vale + LLM hybrid architecture, spun off to ~/proj/doc-review