docs: complete worklog for doc-review design session
Adds Vale discovery, spin-off decision, migration details, and updated session metrics to the design session worklog. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
parent
139a521a89
commit
def212bc5b
|
|
@ -1,7 +1,7 @@
|
|||
#+TITLE: Doc-Review Skill Design: Research and Rubric Development
|
||||
#+TITLE: Doc-Review Skill Design: Research, Rubrics, Vale Discovery, and Repo Spin-off
|
||||
#+DATE: 2025-12-04
|
||||
#+KEYWORDS: doc-review, rubrics, LLM-as-judge, agent-friendly-docs, documentation-drift, AGENTS.md
|
||||
#+COMMITS: 0
|
||||
#+KEYWORDS: doc-review, rubrics, LLM-as-judge, agent-friendly-docs, documentation-drift, AGENTS.md, vale, hybrid-architecture
|
||||
#+COMMITS: 1
|
||||
#+COMPRESSION_STATUS: uncompressed
|
||||
|
||||
* Session Summary
|
||||
|
|
@ -17,10 +17,20 @@
|
|||
- [X] Filed orch UX issue (orch-z2x) for missing model availability command
|
||||
- [X] Fixed .envrc to include use_api_keys for orch access
|
||||
- [X] Got Gemini critique via orch - identified overlaps and 3 missing dimensions
|
||||
- [X] Revised rubrics to v2 (10 dimensions) based on AI-optimization feedback
|
||||
- [ ] Update rubrics-v2.md with full definitions
|
||||
- [ ] Prompt template (next step)
|
||||
- [ ] Before/after examples (next step)
|
||||
- [X] Revised rubrics to v2 (12 dimensions), merged to v3 (10 dimensions)
|
||||
- [X] Created prompt-template-v1.md with full system prompt and example
|
||||
- [X] Got Gemini critique on prompt - identified cognitive overload with 10 rubrics
|
||||
- [X] Researched multi-pass/recursive LLM evaluation architectures (PRE, G-Eval, ARISE)
|
||||
- [X] Designed "Triage & Specialist" cascade architecture
|
||||
- [X] Discovered Vale prose linter - can handle 4-7 rubrics deterministically
|
||||
- [X] Designed Vale + LLM hybrid architecture (deterministic first, LLM only for semantic)
|
||||
- [X] Ran GPT/Gemini consensus on spin-off decision - both recommended separate repo
|
||||
- [X] Created ~/proj/doc-review with beads setup
|
||||
- [X] Migrated artifacts: rubrics-v3.md, prompt-template-v1.md to design/ folder
|
||||
- [X] Updated doc-review/AGENTS.md with full architecture context
|
||||
- [X] Created beads in doc-review: doc-review-i8n (Vale), doc-review-4to (LLM), doc-review-xy3 (CLI)
|
||||
- [X] Closed skills repo beads: skills-bcu, skills-1ig, skills-53k
|
||||
- [X] Committed and pushed skills repo
|
||||
|
||||
* Key Decisions
|
||||
** Decision 1: Non-interactive patch-based workflow
|
||||
|
|
@ -62,6 +72,27 @@
|
|||
- Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
|
||||
- Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files
|
||||
|
||||
** Decision 6: Vale + LLM hybrid architecture
|
||||
- Context: Gemini critique revealed 10 rubrics in one prompt causes cognitive overload
|
||||
- Discovery: Vale (prose linter) can handle 4-7 rubrics deterministically with YAML rules
|
||||
- Options considered:
|
||||
1. Pure LLM approach (original plan)
|
||||
2. Pure Vale approach (limited to pattern matching)
|
||||
3. Vale + LLM hybrid (deterministic first, LLM for semantic)
|
||||
- Rationale: Vale catches ~40% of issues instantly, for free, in CI. LLM only needed for semantic analysis.
|
||||
- Impact: Three-stage pipeline:
|
||||
- Stage 1: Vale (fast/free) - Format Integrity, Semantic Headings, Deterministic Instructions, Terminology, Token Efficiency
|
||||
- Stage 2: LLM Triage (cheap model) - Quick semantic scan
|
||||
- Stage 3: LLM Specialists (capable model) - Patch generation per failed rubric
|
||||
|
||||
** Decision 7: Spin off to separate repository
|
||||
- Context: Project grew from "Claude skill" to "standalone tool with Vale + LLM"
|
||||
- Ran consensus: GPT and Gemini both recommended separate repo
|
||||
- GPT reasoning: "Start with MVP in separate repo, iterate fast"
|
||||
- Gemini reasoning: "Invest in orch integration, design for reuse"
|
||||
- Decision: Created ~/proj/doc-review as standalone project
|
||||
- Impact: Clean separation, own beads, can be used outside claude-code context
|
||||
|
||||
* Problems & Solutions
|
||||
| Problem | Solution | Learning |
|
||||
|---------+----------+----------|
|
||||
|
|
@ -69,48 +100,87 @@
|
|||
| API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access |
|
||||
| False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns |
|
||||
| Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference |
|
||||
| 10 rubrics in one LLM prompt = cognitive overload | Split: Vale for deterministic, LLM for semantic only | Play to tool strengths - deterministic tools for pattern matching |
|
||||
| OPENROUTER_KEY vs OPENROUTER_API_KEY mismatch | Filed bug orch-6o3 in ~/proj/orch | Environment variable naming must be consistent |
|
||||
| bd sync failed with worktree error | Used regular git add + commit instead | bd sync has edge cases with certain git states |
|
||||
| doc-review repo missing git identity | Left for user to configure user.email/name | New repos need git config before first commit |
|
||||
|
||||
* Technical Details
|
||||
|
||||
** Artifacts Created
|
||||
- `/tmp/doc-review-drafts/rubrics-v1.md` - Full rubric definitions (181 lines)
|
||||
- `skills-bcu` - Main design bead with workflow description
|
||||
- `skills-1ig` - Conventions research bead (in_progress)
|
||||
- `skills-53k` - Graph discovery design bead (open)
|
||||
- `orch-z2x` - Filed in ~/proj/orch for model availability feature
|
||||
- `/tmp/doc-review-drafts/rubrics-v1.md` - Initial 7 rubric definitions
|
||||
- `/tmp/doc-review-drafts/rubrics-v3.md` - Final 10 rubrics (320 lines)
|
||||
- `/tmp/doc-review-drafts/prompt-template-v1.md` - LLM system prompt with example
|
||||
- `~/proj/doc-review/docs/design/rubrics-v3.md` - Migrated final rubrics
|
||||
- `~/proj/doc-review/docs/design/prompt-template-v1.md` - Migrated prompt template
|
||||
- `~/proj/doc-review/AGENTS.md` - Full project context for next agent
|
||||
|
||||
** Beads Created/Updated
|
||||
In skills repo (all closed):
|
||||
- `skills-bcu` - Main design bead (closed: spun off to doc-review)
|
||||
- `skills-1ig` - Conventions research (closed: completed)
|
||||
- `skills-53k` - Graph discovery design (closed: moved to doc-review)
|
||||
|
||||
In doc-review repo:
|
||||
- `doc-review-i8n` - Implement Vale style for rubrics
|
||||
- `doc-review-4to` - Implement LLM prompts for semantic rubrics
|
||||
- `doc-review-xy3` - Design CLI interface
|
||||
|
||||
In orch repo:
|
||||
- `orch-z2x` - Model availability command feature
|
||||
- `orch-6o3` - OPENROUTER_KEY vs OPENROUTER_API_KEY bug
|
||||
|
||||
** Commands Used
|
||||
#+begin_src bash
|
||||
# Web search for conventions
|
||||
# Web search for conventions and research
|
||||
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
|
||||
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
|
||||
WebSearch: "LLM as judge rubric prompt engineering best practices"
|
||||
WebSearch: "multi-pass LLM evaluation recursive evaluation framework"
|
||||
WebSearch: "prose linter markdown documentation CI"
|
||||
|
||||
# Web fetch for deep dives
|
||||
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
|
||||
WebFetch: docs.kapa.ai/improving/writing-best-practices
|
||||
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
|
||||
WebFetch: montecarlodata.com/blog-llm-as-judge/
|
||||
WebFetch: vale.sh
|
||||
|
||||
# orch consultations
|
||||
orch ask gemini --temp 1.2 "Critique these rubrics for AI-optimized documentation..."
|
||||
orch consensus gpt gemini "Should doc-review be a skill or separate repo?"
|
||||
|
||||
# Migration commands
|
||||
mkdir -p ~/proj/doc-review/docs/design
|
||||
cp /tmp/doc-review-drafts/*.md ~/proj/doc-review/docs/design/
|
||||
|
||||
# Beads workflow
|
||||
bd create --title="Design doc-review skill" --type=feature
|
||||
bd dep add skills-bcu skills-1ig
|
||||
bd dep add skills-bcu skills-53k
|
||||
bd close skills-bcu skills-1ig skills-53k
|
||||
#+end_src
|
||||
|
||||
** The 7 Rubrics
|
||||
| # | Rubric | Key Question |
|
||||
|---+--------+--------------|
|
||||
| 1 | Self-Containment | Can section be understood alone? |
|
||||
| 2 | Heading Structure | Logical hierarchy, descriptive names? |
|
||||
| 3 | Terminology Consistency | Same term for same concept? |
|
||||
| 4 | Code Example Completeness | Examples runnable as shown? |
|
||||
| 5 | Explicit Context | Prerequisites stated, no implicit knowledge? |
|
||||
| 6 | Technical Specificity | Exact versions, flags, commands? |
|
||||
| 7 | Instruction Clarity | Required vs optional vs dangerous? |
|
||||
** The 10 Rubrics (Final v3)
|
||||
Organized by "Agent's Hierarchy of Needs":
|
||||
|
||||
| Phase | # | Rubric | Tool | Key Question |
|
||||
|-------+---+--------+------+--------------|
|
||||
| Read | 1 | Format Integrity | Vale | Valid markdown? Code blocks tagged? |
|
||||
| Find | 2 | Semantic Headings | Vale | Headings contain task+object keywords? |
|
||||
| Find | 3 | Contextual Independence | LLM | No "as mentioned above"? |
|
||||
| Run | 4 | Configuration Precision | Partial | Exact versions, flags, paths? |
|
||||
| Run | 5 | Code Executability | LLM | All imports present? |
|
||||
| Run | 6 | Deterministic Instructions | Vale | No hedging ("might", "consider")? |
|
||||
| Verify | 7 | Execution Verification | LLM | Expected output + error recovery? |
|
||||
| Optimize | 8 | Terminology Strictness | Vale | 1:1 term-concept mapping? |
|
||||
| Optimize | 9 | Token Efficiency | Vale | No filler phrases? |
|
||||
| Optimize | 10 | Security Boundaries | Partial | No hardcoded secrets? |
|
||||
|
||||
Tool assignment:
|
||||
- **Vale (5)**: 1, 2, 6, 8, 9 - Pattern matching, deterministic
|
||||
- **LLM (3)**: 3, 5, 7 - Semantic understanding required
|
||||
- **Partial (2)**: 4, 10 - Hybrid (Vale for patterns, LLM for edge cases)
|
||||
|
||||
Each rubric includes:
|
||||
- Scoring levels with clear definitions
|
||||
- Scoring levels with clear definitions (PASS/MARGINAL/FAIL)
|
||||
- Detect patterns (specific things to look for)
|
||||
- Reasoning requirement (must quote evidence)
|
||||
- Suggested fix field in output
|
||||
|
|
@ -159,19 +229,21 @@ Each rubric includes:
|
|||
- How to handle large repos with many doc files? (chunking strategy)
|
||||
- Should rubrics be weighted differently?
|
||||
- How to handle generated docs (should they be excluded)?
|
||||
- What's the right model for different stages (eval vs patch generation)?
|
||||
- What's the right model for different stages (triage vs patch generation)?
|
||||
- Vale vs markdownlint vs custom - which linter is best for this use case?
|
||||
|
||||
** Next Steps
|
||||
1. Draft the prompt template that uses these rubrics
|
||||
2. Create before/after examples for few-shot
|
||||
3. Design graph-based doc discovery (skills-53k)
|
||||
4. Prototype scan.sh script
|
||||
5. Test on real repos
|
||||
** Next Steps (in ~/proj/doc-review)
|
||||
1. Implement Vale style for 5 deterministic rubrics (doc-review-i8n)
|
||||
2. Implement LLM prompts for 3 semantic rubrics (doc-review-4to)
|
||||
3. Design CLI interface (doc-review-xy3)
|
||||
4. Graph-based doc discovery (from skills-53k, needs new bead)
|
||||
5. Test on real repos (skills repo is good candidate)
|
||||
|
||||
** Related Work
|
||||
- skills-53k: Graph-based doc discovery design
|
||||
- skills-bcu: Parent design bead
|
||||
- ~/proj/doc-review - New home for this project
|
||||
- doc-review-i8n, doc-review-4to, doc-review-xy3 - Active beads
|
||||
- orch-z2x: Model availability feature for orch
|
||||
- orch-6o3: OPENROUTER env var naming bug
|
||||
|
||||
** External References
|
||||
- https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
|
||||
|
|
@ -180,6 +252,7 @@ Each rubric includes:
|
|||
- https://www.montecarlodata.com/blog-llm-as-judge/
|
||||
- https://arxiv.org/abs/2503.23989 (Rubric Is All You Need)
|
||||
- https://agents.md (AGENTS.md standard)
|
||||
- https://vale.sh (Vale prose linter)
|
||||
|
||||
* Raw Notes
|
||||
|
||||
|
|
@ -313,13 +386,55 @@ Phase 3: VERIFICATION LOOP (recursive)
|
|||
- PRE accuracy: One dimension at a time reduces hallucination
|
||||
- Recursive safety: Verify patches don't introduce regressions
|
||||
|
||||
** Vale Discovery (Session Turning Point)
|
||||
Web search for "prose linter markdown documentation CI" revealed Vale:
|
||||
- Open-source, extensible prose linter
|
||||
- YAML-based rules (existence, substitution, consistency checks)
|
||||
- CI-friendly (runs in seconds, no LLM cost)
|
||||
- Already used by: Microsoft, Google, GitLab, DigitalOcean
|
||||
|
||||
Realization: 5 of our rubrics are pattern-based → Vale can handle them:
|
||||
- Format Integrity (missing language tags)
|
||||
- Semantic Headings (banned words: "Overview", "Introduction")
|
||||
- Deterministic Instructions (hedging words: "might", "consider")
|
||||
- Terminology Strictness (consistency checks)
|
||||
- Token Efficiency (filler phrases)
|
||||
|
||||
This shifted architecture from "Claude skill with all-LLM" to "standalone tool with Vale + LLM hybrid"
|
||||
|
||||
** Consensus Results
|
||||
|
||||
*** "Should doc-review be a skill or separate repo?"
|
||||
GPT: "Start with MVP in separate repo. Skill integration can come later."
|
||||
Gemini: "Invest in orch as orchestration layer. Design doc-review for reuse."
|
||||
Both agreed: Separate repo is the right call.
|
||||
|
||||
** Final Architecture (Vale + LLM Hybrid)
|
||||
#+begin_example
|
||||
Stage 1: Vale (deterministic, fast, free)
|
||||
├── Catches ~40% of issues instantly
|
||||
├── Runs in CI on every commit
|
||||
└── No LLM cost for clean docs
|
||||
|
||||
Stage 2: LLM Triage (cheap model)
|
||||
├── Only runs if Vale passes
|
||||
└── Evaluates 3 semantic rubrics (contextual independence, code executability, execution verification)
|
||||
|
||||
Stage 3: LLM Specialists (capable model)
|
||||
├── One agent per failed rubric
|
||||
└── Generates patches
|
||||
#+end_example
|
||||
|
||||
* Session Metrics
|
||||
- Commits made: 0 (design session, no code yet)
|
||||
- Files touched: 3 (rubrics-v1.md, rubrics-v3.md, prompt-template-v1.md in /tmp; .envrc)
|
||||
- Beads created: 4 (skills-bcu, skills-1ig, skills-53k, orch-z2x)
|
||||
- Beads closed: 1 (skills-dpw - duplicate moved to orch repo)
|
||||
- Web searches: 6 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive)
|
||||
- Web fetches: 4 (GitHub blog, kapa.ai, biel.ai, Monte Carlo)
|
||||
- Commits made: 1 (139a521 - doc-review design session complete)
|
||||
- Files touched: 19 (per git diff --stat)
|
||||
- Lines added: +4187
|
||||
- Lines removed: -65
|
||||
- Beads created: 7 (skills-bcu, skills-1ig, skills-53k in skills; doc-review-i8n, doc-review-4to, doc-review-xy3 in doc-review; orch-z2x, orch-6o3 in orch)
|
||||
- Beads closed: 3 (skills-bcu, skills-1ig, skills-53k)
|
||||
- Web searches: 8 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive, prose linter, Vale)
|
||||
- Web fetches: 5 (GitHub blog, kapa.ai, biel.ai, Monte Carlo, vale.sh)
|
||||
- Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis)
|
||||
- Consensus runs: 2 (spin-off decision, implementation approach)
|
||||
- Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged)
|
||||
- Key outcome: "Triage & Specialist" cascade architecture
|
||||
- Key outcome: Vale + LLM hybrid architecture, spun off to ~/proj/doc-review
|
||||
|
|
|
|||
Loading…
Reference in a new issue