docs: complete worklog for doc-review design session

Adds Vale discovery, spin-off decision, migration details,
and updated session metrics to the design session worklog.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
dan 2025-12-04 17:54:05 -08:00
parent 139a521a89
commit def212bc5b


@@ -1,7 +1,7 @@
#+TITLE: Doc-Review Skill Design: Research and Rubric Development
#+TITLE: Doc-Review Skill Design: Research, Rubrics, Vale Discovery, and Repo Spin-off
#+DATE: 2025-12-04
#+KEYWORDS: doc-review, rubrics, LLM-as-judge, agent-friendly-docs, documentation-drift, AGENTS.md
#+COMMITS: 0
#+KEYWORDS: doc-review, rubrics, LLM-as-judge, agent-friendly-docs, documentation-drift, AGENTS.md, vale, hybrid-architecture
#+COMMITS: 1
#+COMPRESSION_STATUS: uncompressed
* Session Summary
@@ -17,10 +17,20 @@
- [X] Filed orch UX issue (orch-z2x) for missing model availability command
- [X] Fixed .envrc to include use_api_keys for orch access
- [X] Got Gemini critique via orch - identified overlaps and 3 missing dimensions
- [X] Revised rubrics to v2 (10 dimensions) based on AI-optimization feedback
- [ ] Update rubrics-v2.md with full definitions
- [ ] Prompt template (next step)
- [ ] Before/after examples (next step)
- [X] Revised rubrics to v2 (12 dimensions), merged to v3 (10 dimensions)
- [X] Created prompt-template-v1.md with full system prompt and example
- [X] Got Gemini critique on prompt - identified cognitive overload with 10 rubrics
- [X] Researched multi-pass/recursive LLM evaluation architectures (PRE, G-Eval, ARISE)
- [X] Designed "Triage & Specialist" cascade architecture
- [X] Discovered Vale prose linter - can handle 4-7 rubrics deterministically
- [X] Designed Vale + LLM hybrid architecture (deterministic first, LLM only for semantic)
- [X] Ran GPT/Gemini consensus on spin-off decision - both recommended separate repo
- [X] Created ~/proj/doc-review with beads setup
- [X] Migrated artifacts: rubrics-v3.md, prompt-template-v1.md to design/ folder
- [X] Updated doc-review/AGENTS.md with full architecture context
- [X] Created beads in doc-review: doc-review-i8n (Vale), doc-review-4to (LLM), doc-review-xy3 (CLI)
- [X] Closed skills repo beads: skills-bcu, skills-1ig, skills-53k
- [X] Committed and pushed skills repo
* Key Decisions
** Decision 1: Non-interactive patch-based workflow
@@ -62,6 +72,27 @@
- Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
- Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files
** Decision 6: Vale + LLM hybrid architecture
- Context: Gemini critique revealed 10 rubrics in one prompt causes cognitive overload
- Discovery: Vale (prose linter) can handle 4-7 rubrics deterministically with YAML rules
- Options considered:
1. Pure LLM approach (original plan)
2. Pure Vale approach (limited to pattern matching)
3. Vale + LLM hybrid (deterministic first, LLM for semantic)
- Rationale: Vale catches ~40% of issues instantly, for free, in CI. LLM only needed for semantic analysis.
- Impact: Three-stage pipeline:
- Stage 1: Vale (fast/free) - Format Integrity, Semantic Headings, Deterministic Instructions, Terminology, Token Efficiency
- Stage 2: LLM Triage (cheap model) - Quick semantic scan
- Stage 3: LLM Specialists (capable model) - Patch generation per failed rubric
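A minimal sketch of how the three stages could be chained in a driver script; the script, file paths, and exit-code gating are assumptions, and the Stage 2/3 calls are only sketched further down in this worklog:
#+begin_src bash
# Hypothetical pipeline driver - nothing here is implemented yet.
set -euo pipefail
doc="$1"

# Stage 1: Vale. Error-level findings make vale exit non-zero; stop before spending LLM tokens.
if ! vale "$doc"; then
  echo "Fix Vale findings first; skipping LLM stages." >&2
  exit 1
fi

# Stage 2: cheap-model triage over the 3 semantic rubrics (prompt sketch under "The 10 Rubrics").
# Stage 3: one capable-model specialist per rubric the triage flags (patch sketch in Raw Notes).
echo "Vale clean: $doc ready for LLM triage"
#+end_src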
** Decision 7: Spin off to separate repository
- Context: Project grew from "Claude skill" to "standalone tool with Vale + LLM"
- Ran consensus: GPT and Gemini both recommended separate repo
- GPT reasoning: "Start with MVP in separate repo, iterate fast"
- Gemini reasoning: "Invest in orch integration, design for reuse"
- Decision: Created ~/proj/doc-review as standalone project
- Impact: Clean separation, own beads, can be used outside claude-code context
* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
@@ -69,48 +100,87 @@
| API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access |
| False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns |
| Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference |
| 10 rubrics in one LLM prompt = cognitive overload | Split: Vale for deterministic, LLM for semantic only | Play to tool strengths - deterministic tools for pattern matching |
| OPENROUTER_KEY vs OPENROUTER_API_KEY mismatch | Filed bug orch-6o3 in ~/proj/orch | Environment variable naming must be consistent |
| bd sync failed with worktree error | Used regular git add + commit instead | bd sync has edge cases with certain git states |
| doc-review repo missing git identity | Left for user to configure user.email/name | New repos need git config before first commit |
* Technical Details
** Artifacts Created
- `/tmp/doc-review-drafts/rubrics-v1.md` - Full rubric definitions (181 lines)
- `skills-bcu` - Main design bead with workflow description
- `skills-1ig` - Conventions research bead (in_progress)
- `skills-53k` - Graph discovery design bead (open)
- `orch-z2x` - Filed in ~/proj/orch for model availability feature
- `/tmp/doc-review-drafts/rubrics-v1.md` - Initial 7 rubric definitions
- `/tmp/doc-review-drafts/rubrics-v3.md` - Final 10 rubrics (320 lines)
- `/tmp/doc-review-drafts/prompt-template-v1.md` - LLM system prompt with example
- `~/proj/doc-review/docs/design/rubrics-v3.md` - Migrated final rubrics
- `~/proj/doc-review/docs/design/prompt-template-v1.md` - Migrated prompt template
- `~/proj/doc-review/AGENTS.md` - Full project context for next agent
** Beads Created/Updated
In skills repo (all closed):
- `skills-bcu` - Main design bead (closed: spun off to doc-review)
- `skills-1ig` - Conventions research (closed: completed)
- `skills-53k` - Graph discovery design (closed: moved to doc-review)
In doc-review repo:
- `doc-review-i8n` - Implement Vale style for rubrics
- `doc-review-4to` - Implement LLM prompts for semantic rubrics
- `doc-review-xy3` - Design CLI interface
In orch repo:
- `orch-z2x` - Model availability command feature
- `orch-6o3` - OPENROUTER_KEY vs OPENROUTER_API_KEY bug
** Commands Used
#+begin_src bash
# Web search for conventions
# Web search for conventions and research
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
WebSearch: "LLM as judge rubric prompt engineering best practices"
WebSearch: "multi-pass LLM evaluation recursive evaluation framework"
WebSearch: "prose linter markdown documentation CI"
# Web fetch for deep dives
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
WebFetch: docs.kapa.ai/improving/writing-best-practices
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
WebFetch: montecarlodata.com/blog-llm-as-judge/
WebFetch: vale.sh
# orch consultations
orch ask gemini --temp 1.2 "Critique these rubrics for AI-optimized documentation..."
orch consensus gpt gemini "Should doc-review be a skill or separate repo?"
# Migration commands
mkdir -p ~/proj/doc-review/docs/design
cp /tmp/doc-review-drafts/*.md ~/proj/doc-review/docs/design/
# Beads workflow
bd create --title="Design doc-review skill" --type=feature
bd dep add skills-bcu skills-1ig
bd dep add skills-bcu skills-53k
bd close skills-bcu skills-1ig skills-53k
#+end_src
** The 7 Rubrics
| # | Rubric | Key Question |
|---+--------+--------------|
| 1 | Self-Containment | Can section be understood alone? |
| 2 | Heading Structure | Logical hierarchy, descriptive names? |
| 3 | Terminology Consistency | Same term for same concept? |
| 4 | Code Example Completeness | Examples runnable as shown? |
| 5 | Explicit Context | Prerequisites stated, no implicit knowledge? |
| 6 | Technical Specificity | Exact versions, flags, commands? |
| 7 | Instruction Clarity | Required vs optional vs dangerous? |
** The 10 Rubrics (Final v3)
Organized by "Agent's Hierarchy of Needs":
| Phase | # | Rubric | Tool | Key Question |
|-------+---+--------+------+--------------|
| Read | 1 | Format Integrity | Vale | Valid markdown? Code blocks tagged? |
| Find | 2 | Semantic Headings | Vale | Headings contain task+object keywords? |
| Find | 3 | Contextual Independence | LLM | No "as mentioned above"? |
| Run | 4 | Configuration Precision | Partial | Exact versions, flags, paths? |
| Run | 5 | Code Executability | LLM | All imports present? |
| Run | 6 | Deterministic Instructions | Vale | No hedging ("might", "consider")? |
| Verify | 7 | Execution Verification | LLM | Expected output + error recovery? |
| Optimize | 8 | Terminology Strictness | Vale | 1:1 term-concept mapping? |
| Optimize | 9 | Token Efficiency | Vale | No filler phrases? |
| Optimize | 10 | Security Boundaries | Partial | No hardcoded secrets? |
Tool assignment:
- **Vale (5)**: 1, 2, 6, 8, 9 - Pattern matching, deterministic
- **LLM (3)**: 3, 5, 7 - Semantic understanding required
- **Partial (2)**: 4, 10 - Hybrid (Vale for patterns, LLM for edge cases)
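To make the Vale assignment concrete, a hedged sketch of one rule for Rubric 6; the style name, file path, and token list are assumptions, while `extends: existence` and the other YAML keys are standard Vale rule fields:
#+begin_src bash
# Hypothetical Vale rule flagging hedging words (Rubric 6: Deterministic Instructions).
mkdir -p styles/DocReview
cat > styles/DocReview/Hedging.yml <<'EOF'
extends: existence
message: "Hedging word '%s' - state the instruction deterministically."
level: error
ignorecase: true
tokens:
  - might
  - consider
  - perhaps
EOF
#+end_src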
Each rubric includes:
- Scoring levels with clear definitions
- Scoring levels with clear definitions (PASS/MARGINAL/FAIL)
- Detect patterns (specific things to look for)
- Reasoning requirement (must quote evidence)
- Suggested fix field in output
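A hedged sketch of how those output requirements could be folded into a Stage 2 call through orch; the model alias, prompt wording, and target file are assumptions:
#+begin_src bash
# Hypothetical triage call for one semantic rubric; docs/install.md is a placeholder path.
section=$(cat docs/install.md)
orch ask gemini "Evaluate this section against Rubric 3 (Contextual Independence).
Respond with a verdict (PASS/MARGINAL/FAIL), the quoted evidence behind it, and a suggested fix.

$section"
#+end_src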
@@ -159,19 +229,21 @@ Each rubric includes:
- How to handle large repos with many doc files? (chunking strategy)
- Should rubrics be weighted differently?
- How to handle generated docs (should they be excluded)?
- What's the right model for different stages (eval vs patch generation)?
- What's the right model for different stages (triage vs patch generation)?
- Vale vs markdownlint vs custom - which linter is best for this use case?
** Next Steps
1. Draft the prompt template that uses these rubrics
2. Create before/after examples for few-shot
3. Design graph-based doc discovery (skills-53k)
4. Prototype scan.sh script
5. Test on real repos
** Next Steps (in ~/proj/doc-review)
1. Implement Vale style for 5 deterministic rubrics (doc-review-i8n)
2. Implement LLM prompts for 3 semantic rubrics (doc-review-4to)
3. Design CLI interface (doc-review-xy3)
4. Graph-based doc discovery (from skills-53k, needs new bead)
5. Test on real repos (skills repo is good candidate)
** Related Work
- skills-53k: Graph-based doc discovery design
- skills-bcu: Parent design bead
- ~/proj/doc-review - New home for this project
- doc-review-i8n, doc-review-4to, doc-review-xy3 - Active beads
- orch-z2x: Model availability feature for orch
- orch-6o3: OPENROUTER env var naming bug
** External References
- https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
@@ -180,6 +252,7 @@ Each rubric includes:
- https://www.montecarlodata.com/blog-llm-as-judge/
- https://arxiv.org/abs/2503.23989 (Rubric Is All You Need)
- https://agents.md (AGENTS.md standard)
- https://vale.sh (Vale prose linter)
* Raw Notes
@@ -313,13 +386,55 @@ Phase 3: VERIFICATION LOOP (recursive)
- PRE accuracy: One dimension at a time reduces hallucination
- Recursive safety: Verify patches don't introduce regressions
** Vale Discovery (Session Turning Point)
Web search for "prose linter markdown documentation CI" revealed Vale:
- Open-source, extensible prose linter
- YAML-based rules (existence, substitution, consistency checks)
- CI-friendly (runs in seconds, no LLM cost)
- Already used by: Microsoft, Google, GitLab, DigitalOcean
Realization: 5 of our rubrics are pattern-based → Vale can handle them:
- Format Integrity (missing language tags)
- Semantic Headings (banned words: "Overview", "Introduction")
- Deterministic Instructions (hedging words: "might", "consider")
- Terminology Strictness (consistency checks)
- Token Efficiency (filler phrases)
This shifted the architecture from a "Claude skill with all-LLM" to a "standalone tool with Vale + LLM hybrid".
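For reference, a minimal sketch of how Vale could be wired into the repo; the style name and paths are assumptions, while `StylesPath`, `MinAlertLevel`, and `BasedOnStyles` are standard `.vale.ini` keys:
#+begin_src bash
# Hypothetical .vale.ini so the same rules run locally and as a CI step.
cat > .vale.ini <<'EOF'
StylesPath = styles
MinAlertLevel = warning

[*.md]
BasedOnStyles = DocReview
EOF
vale docs/   # seconds to run, no LLM cost
#+end_src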
** Consensus Results
*** "Should doc-review be a skill or separate repo?"
GPT: "Start with MVP in separate repo. Skill integration can come later."
Gemini: "Invest in orch as orchestration layer. Design doc-review for reuse."
Both agreed: Separate repo is the right call.
** Final Architecture (Vale + LLM Hybrid)
#+begin_example
Stage 1: Vale (deterministic, fast, free)
├── Catches ~40% of issues instantly
├── Runs in CI on every commit
└── No LLM cost for clean docs
Stage 2: LLM Triage (cheap model)
├── Only runs if Vale passes
└── Evaluates 3 semantic rubrics (contextual independence, code executability, execution verification)
Stage 3: LLM Specialists (capable model)
├── One agent per failed rubric
└── Generates patches
#+end_example
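A hedged sketch of a Stage 3 specialist step, tying back to the patch-based workflow from Decision 1; the model alias, prompt, and patch handling are all assumptions:
#+begin_src bash
# Hypothetical specialist call: request a unified diff for one failed rubric,
# then validate it before applying. docs/install.md is a placeholder path.
orch ask gpt "You are the specialist for Rubric 5 (Code Executability).
Produce a unified diff that fixes only this rubric in the section below.

$(cat docs/install.md)" > /tmp/rubric5.patch

git apply --check /tmp/rubric5.patch && git apply /tmp/rubric5.patch
#+end_src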
* Session Metrics
- Commits made: 0 (design session, no code yet)
- Files touched: 3 (rubrics-v1.md, rubrics-v3.md, prompt-template-v1.md in /tmp; .envrc)
- Beads created: 4 (skills-bcu, skills-1ig, skills-53k, orch-z2x)
- Beads closed: 1 (skills-dpw - duplicate moved to orch repo)
- Web searches: 6 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive)
- Web fetches: 4 (GitHub blog, kapa.ai, biel.ai, Monte Carlo)
- Commits made: 1 (139a521 - doc-review design session complete)
- Files touched: 19 (per git diff --stat)
- Lines added: +4187
- Lines removed: -65
- Beads created: 8 (skills-bcu, skills-1ig, skills-53k in skills; doc-review-i8n, doc-review-4to, doc-review-xy3 in doc-review; orch-z2x, orch-6o3 in orch)
- Beads closed: 3 (skills-bcu, skills-1ig, skills-53k)
- Web searches: 8 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive, prose linter, Vale)
- Web fetches: 5 (GitHub blog, kapa.ai, biel.ai, Monte Carlo, vale.sh)
- Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis)
- Consensus runs: 2 (spin-off decision, implementation approach)
- Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged)
- Key outcome: "Triage & Specialist" cascade architecture
- Key outcome: Vale + LLM hybrid architecture, spun off to ~/proj/doc-review