Doc-Review Skill Design: Research and Rubric Development

Session Summary

Date: 2025-12-04 (Day 1 of doc-review skill)

Focus Area: Designing a doc-review skill to normalize documentation and fight documentation drift

Accomplishments

  • Created bead structure for doc-review skill design (skills-bcu with 2 blockers)
  • Researched agent-friendly documentation conventions via web search
  • Analyzed GitHub's study of 2,500+ AGENTS.md repositories
  • Researched LLM-as-judge rubric best practices
  • Drafted 7 decomposed rubrics for documentation evaluation (v1)
  • Filed orch UX issue (orch-z2x) for missing model availability command
  • Fixed .envrc to include use_api_keys for orch access
  • Got Gemini critique via orch - identified overlaps and 3 missing dimensions
  • Revised rubrics to v2 (12 dimensions) based on AI-optimization feedback
  • Updated rubrics-v2.md with full definitions
  • Prompt template (next step)
  • Before/after examples (next step)

Key Decisions

Decision 1: Non-interactive patch-based workflow

  • Context: Deciding how doc-review should integrate with human workflow
  • Options considered:

    1. Direct edit mode - LLM directly modifies docs
    2. Report-only mode - LLM produces report, human edits
    3. Patch-based mode - LLM generates patches, human reviews and applies
  • Rationale: Patch-based gives human control, enables review before apply, works with existing git workflow
  • Impact: Two-phase workflow - generate patches (non-interactive), then review patches (interactive Claude session)

Decision 2: Decomposed rubrics over holistic evaluation

  • Context: How should the LLM evaluate documentation quality?
  • Options considered:

    1. Single holistic prompt ("rate this doc's quality")
    2. Multi-dimensional single pass ("rate on 10 dimensions")
    3. Decomposed rubrics (one dimension per evaluation)
  • Rationale: Research shows LLMs are better at "guided summarization" than at complex reasoning. The decomposed approach plays to LLM strengths.
  • Impact: 7 separate rubrics, each with a clear question, scoring levels, and detect patterns

Decision 3: Three-level scoring (PASS/MARGINAL/FAIL)

  • Context: What granularity for scoring?
  • Options considered:

    1. Binary (pass/fail)
    2. Three-level (pass/marginal/fail)
    3. Five or ten point scale
  • Rationale: Binary loses nuance; 5+ point scales introduce ambiguity. Three levels give an actionable distinction.
  • Impact: FAIL triggers patch generation, MARGINAL flags for human review, PASS means agent-friendly

Decision 4: Graph-based doc discovery (deferred)

  • Context: How does doc-review find which docs to evaluate?
  • Decision: Start from README.md or AGENTS.md and graph out via links
  • Rationale: Not all .md files are documentation. Following links from root finds connected docs.
  • Impact: Created separate bead (skills-53k) to design discovery algorithm
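
A rough sketch of this discovery pass, assuming Python; the link extraction is a naive regex over relative .md links and stands in for whatever skills-53k eventually specifies:

import re
from collections import deque
from pathlib import Path

LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#?\s]+)")  # naive: captures the target of [text](target)

def discover_docs(repo: Path, roots=("README.md", "AGENTS.md")) -> set[Path]:
    """Breadth-first walk outward from root docs, following relative .md links."""
    queue = deque(repo / r for r in roots if (repo / r).exists())
    seen: set[Path] = set()
    while queue:
        doc = queue.popleft().resolve()
        if doc in seen or not doc.exists():
            continue
        seen.add(doc)
        for target in LINK_RE.findall(doc.read_text(errors="ignore")):
            if target.endswith(".md") and "://" not in target:
                queue.append(doc.parent / target)
    return seen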

Decision 5: "Instruction Clarity" rubric added

  • Context: Initial 6 rubrics didn't cover the "boundaries" pattern from AGENTS.md research
  • Discussion: User asked what "boundaries" meant. Realized the ✅/⚠️/🚫 pattern generalizes beyond AGENTS.md
  • Decision: Added Rubric 7 - Instruction Clarity (required vs optional vs dangerous)
  • Impact: Covers clarity of optionality and risk in all documentation, not just agent instruction files

Problems & Solutions

| Problem | Solution | Learning |
|---------|----------|----------|
| orch CLI failed - no API keys configured | Tried 4 models (gemini, flash, deepseek, gpt) - all failed | Need `orch models` command to show available models |
| API keys not loading in skills repo | Added `use_api_keys` to .envrc (was only `use flake`) | Check .envrc has use_api_keys for repos needing API access |
| False distinction between "general docs" and "agent instruction files" | User pushed back - all docs are agent-readable | Don't create artificial categories; generalize patterns |
| Rubric research scattered across sources | Synthesized into single bead description | Centralize findings in beads for future reference |

Technical Details

Artifacts Created

  • `/tmp/doc-review-drafts/rubrics-v1.md` - Full rubric definitions (181 lines)
  • `skills-bcu` - Main design bead with workflow description
  • `skills-1ig` - Conventions research bead (in_progress)
  • `skills-53k` - Graph discovery design bead (open)
  • `orch-z2x` - Filed in ~/proj/orch for model availability feature

Commands Used

# Web search for conventions
WebSearch: "LLM-friendly documentation conventions AI agent readable docs best practices 2025"
WebSearch: "AGENTS.md llms.txt AI coding assistant documentation format"
WebSearch: "LLM as judge rubric prompt engineering best practices"

# Web fetch for deep dives
WebFetch: github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/
WebFetch: docs.kapa.ai/improving/writing-best-practices
WebFetch: biel.ai/blog/optimizing-docs-for-ai-agents-complete-guide
WebFetch: montecarlodata.com/blog-llm-as-judge/

# Beads workflow
bd create --title="Design doc-review skill" --type=feature
bd dep add skills-bcu skills-1ig
bd dep add skills-bcu skills-53k

The 7 Rubrics

| # | Rubric | Key Question |
|---|--------|--------------|
| 1 | Self-Containment | Can section be understood alone? |
| 2 | Heading Structure | Logical hierarchy, descriptive names? |
| 3 | Terminology Consistency | Same term for same concept? |
| 4 | Code Example Completeness | Examples runnable as shown? |
| 5 | Explicit Context | Prerequisites stated, no implicit knowledge? |
| 6 | Technical Specificity | Exact versions, flags, commands? |
| 7 | Instruction Clarity | Required vs optional vs dangerous? |

Each rubric includes:

  • Scoring levels with clear definitions
  • Detect patterns (specific things to look for)
  • Reasoning requirement (must quote evidence)
  • Suggested fix field in output
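
As a data structure, one rubric might look roughly like the sketch below (Python; field names and wording are illustrative, not the definitions in rubrics-v1.md):

from dataclasses import dataclass, field

@dataclass
class Rubric:
    key: str                                  # e.g. "self_containment"
    question: str                             # the single question the evaluator answers
    levels: dict[str, str] = field(default_factory=dict)      # PASS / MARGINAL / FAIL definitions
    detect_patterns: list[str] = field(default_factory=list)  # specific things to look for
    # Reasoning (quoted evidence) and suggested_fix live in the evaluation output, not here.

SELF_CONTAINMENT = Rubric(
    key="self_containment",
    question="Can this section be understood on its own?",
    levels={
        "PASS": "No unresolved references outside the section",
        "MARGINAL": "References resolvable elsewhere in the same file",
        "FAIL": "Depends on content in another file or on prior context",
    },
    detect_patterns=["as configured above", "see earlier", "as mentioned before"],
)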

Process and Workflow

What Worked Well

  • Web search → web fetch pipeline for research
  • Beads for tracking design decisions
  • Iterative rubric development with user feedback
  • User pushing back on false distinctions

What Was Challenging

  • orch unavailable due to missing API keys - blocked brainstorming with external model
  • Keeping rubrics crisp - tendency to over-specify

Learning and Insights

Technical Insights

  • LLMs are "better at guided summarization than complex reasoning"
  • One few-shot example often outperforms multiple (diminishing returns)
  • Decomposed evaluation + deterministic aggregation > holistic evaluation
  • llms.txt is emerging as robots.txt equivalent for LLM access
  • MCP (Model Context Protocol) for structured doc discovery

Process Insights

  • Research before design - the web search surfaced patterns we wouldn't have invented
  • GitHub's 2,500 repo study provided concrete evidence for conventions
  • Asking "what would a rubric for X look like" forces clarity on what X means

Architectural Insights

  • Two-phase workflow (generate then review) separates concerns
  • Patches as intermediate format enables git integration
  • Per-dimension rubrics enable independent iteration

Key Research Sources

  • GitHub blog: 2,500+ AGENTS.md analysis - "one snippet beats three paragraphs"
  • kapa.ai: proximity principle, self-contained sections, semantic discoverability
  • biel.ai: single purpose per section, complete code examples
  • Monte Carlo: 7 best practices for LLM-as-judge, failure modes
  • ACM ICER 2025: "Rubric Is All You Need" paper

Context for Future Work

Open Questions

  • How to handle large repos with many doc files? (chunking strategy)
  • Should rubrics be weighted differently?
  • How to handle generated docs (should they be excluded)?
  • What's the right model for different stages (eval vs patch generation)?

Next Steps

  1. Draft the prompt template that uses these rubrics
  2. Create before/after examples for few-shot
  3. Design graph-based doc discovery (skills-53k)
  4. Prototype scan.sh script
  5. Test on real repos

Related Work

  • skills-53k: Graph-based doc discovery design
  • skills-bcu: Parent design bead
  • orch-z2x: Model availability feature for orch

Raw Notes

Envisioned Workflow (from initial brainstorm)

# Phase 1: Generate patches (non-interactive, burnable credits)
doc-review scan ~/proj/foo --model claude-sonnet --output /tmp/foo-patches/

# Phase 2: Review patches (interactive session)
cd ~/proj/foo
claude  # human reviews patches, applies selectively

Skill Structure (tentative)

skills/doc-review/
├── prompt.md           # Core review instructions + style guide
├── scan.sh             # Orchestrates: find docs → invoke claude → emit patches
└── README.md

Output Format for Evaluations

{
  "section": "## Installation",
  "line_range": [15, 42],
  "evaluations": [
    {
      "rubric": "self_containment",
      "score": "FAIL",
      "reasoning": "Line 23 says 'as configured above' but the configuration is in a different file.",
      "evidence": "as configured above",
      "suggested_fix": "Add explicit reference: 'as configured in config.yaml (see Configuration section)'"
    }
  ]
}
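
Given that format, the deterministic half of the pipeline can stay trivial. A minimal aggregation sketch (Python; the bucketing rules follow Decision 3, everything else is an assumption):

import json

def aggregate(evaluation_json: str) -> dict[str, list[str]]:
    """Bucket one section's rubric scores into the actions they trigger."""
    section = json.loads(evaluation_json)
    buckets: dict[str, list[str]] = {"generate_patch": [], "flag_for_review": [], "pass": []}
    action = {"FAIL": "generate_patch", "MARGINAL": "flag_for_review", "PASS": "pass"}
    for ev in section["evaluations"]:
        buckets[action[ev["score"]]].append(ev["rubric"])
    return buckets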

Key Quote from Research

"LLMs are better at guided summarization than complex reasoning. If you can break down a problem into multiple smaller steps of guided summarization by creating a rubric, you can play to the strengths of LLMs and then rely on simple deterministic code to parse the output and score the rubric." — Monte Carlo, LLM-as-Judge best practices

Gemini Critique of Rubrics (via orch)

After fixing .envrc to include use_api_keys, ran high-temp Gemini critique.

Overlaps Identified

  • Self-Containment vs Explicit Context had significant overlap
  • Fix: Split into Contextual Independence (referential) + Environmental State (runtime)

Missing Dimensions for AI

  1. Verifiability - agents get stuck when they can't tell if command succeeded
  2. Format Integrity - broken markdown/JSON causes hallucinations
  3. Token Efficiency - fluff dilutes semantic weight in vector search

Key Insight

"Agents fail when given choices without criteria. Change 'Use a large instance' to 'Use instance type t3.2xlarge or greater.'"

Revised Rubrics (v3 final: 10 dimensions, merged)

| Phase | # | Rubric |
|-------|---|--------|
| Read | 1 | Format Integrity |
| Find | 2 | Semantic Headings |
| Find | 3 | Contextual Independence |
| Run | 4 | Configuration Precision (merged: Env State + Tech Specificity) |
| Run | 5 | Code Executability |
| Run | 6 | Deterministic Instructions |
| Verify | 7 | Execution Verification (merged: Verifiable Output + Error Recovery) |
| Optimize | 8 | Terminology Strictness |
| Optimize | 9 | Token Efficiency |
| Optimize | 10 | Security Boundaries |

Prompt Template Critique (Gemini round 3)

Problem: 10 rubrics = cognitive overload

"Accuracy drops significantly after 3-5 distinct evaluation criteria per prompt"

Anti-patterns identified

  1. JSON-only output - model commits to score before thinking (fix: CoT first)
  2. Evidence hallucination - model invents quotes (fix: allow null)
  3. Positional bias - middle rubrics degrade (fix: split passes)
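
A sketch of how those fixes could shape a single-rubric evaluation prompt (wording illustrative, not the drafted template):

You are reviewing documentation against ONE rubric: {rubric_name}.
Question: {rubric_question}

1. Reason step by step about the section below with respect to this rubric only.
2. Quote the exact text supporting your judgment; if no quote exists, use null - do not invent evidence.
3. Only after reasoning, assign a score: PASS, MARGINAL, or FAIL.

Return the reasoning first, then a JSON object with keys "rubric", "score",
"reasoning", "evidence" (string or null), and "suggested_fix".

Section:
{section_text}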

Architecture options

  • 2-Pass: Linter (structural) + Agent Simulator (semantic)
  • Many-Pass: One rubric per pass (max accuracy, max latency)
  • Iterative/Recursive: Quick scan → deep-dive on failures
  • Hybrid: Single-pass quick, multi-pass on demand

The "Agent Simulator" persona

"Simulate execution step-by-step. Identify where you get stuck, where instructions are ambiguous, or where you might hallucinate due to missing context."

Architecture Research (web search + Gemini round 4)

Frameworks discovered

| Framework | Key Insight |
|-----------|-------------|
| PRE (ACM 2025) | One rubric per call for accuracy |
| G-Eval | 3-step: define → CoT → execute |
| EvalPlanner | Plan → Execute → Judge |
| Spring AI | Recursive Generate → Evaluate → Retry loop |
| ARISE | 22 specialized agents for different tasks |
| Realm | Recursive refinement with Bayesian aggregation |
| HuCoSC | Break complex problems into independent analyses |

Final Architecture: "Triage & Specialist" Cascade

Phase 1: TRIAGE SCAN (cheap model)
├── Input: doc chunk + 10 rubric definitions
├── Output: list of suspect rubrics
└── Cost: ~0 for clean docs

Phase 2: SPECIALIST AGENTS (PRE-style, parallel)
├── One agent per suspect rubric
└── Each outputs: patch for that dimension

Phase 3: VERIFICATION LOOP (recursive)
├── Re-evaluate patched content
└── Retry with feedback if still FAIL
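
A minimal sketch of that control flow (Python; the model calls are injected as plain callables because nothing about invocation was decided, and the retry limit is also an assumption):

from typing import Callable

def review_chunk(
    chunk: str,
    rubrics: list[str],
    triage: Callable[[str, list[str]], list[str]],   # Phase 1: cheap model returns suspect rubrics
    specialist: Callable[..., str],                   # Phase 2: per-rubric agent returns a patch
    evaluate: Callable[[str, str], dict],             # Phase 3: re-scores patched content
    apply_patch: Callable[[str, str], str],
    max_retries: int = 2,
) -> dict[str, str]:
    """Triage, then one specialist per suspect rubric, then verify each patch."""
    patches: dict[str, str] = {}
    for rubric in triage(chunk, rubrics):
        patch = specialist(chunk, rubric)
        for _ in range(max_retries):
            verdict = evaluate(apply_patch(chunk, patch), rubric)
            if verdict["score"] != "FAIL":
                break
            patch = specialist(chunk, rubric, feedback=verdict["reasoning"])
        patches[rubric] = patch
    return patches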

Why this pattern

  • Adaptive cost: 80% of clean content pays near-zero
  • PRE accuracy: One dimension at a time reduces hallucination
  • Recursive safety: Verify patches don't introduce regressions

Session Metrics

  • Commits made: 0 (design session, no code yet)
  • Files touched: 4 (rubrics-v1.md, rubrics-v3.md, prompt-template-v1.md in /tmp; .envrc)
  • Beads created: 4 (skills-bcu, skills-1ig, skills-53k, orch-z2x)
  • Beads closed: 1 (skills-dpw - duplicate moved to orch repo)
  • Web searches: 6 (agent-friendly docs, AGENTS.md, LLM-as-judge, multi-pass, recursive)
  • Web fetches: 4 (GitHub blog, kapa.ai, biel.ai, Monte Carlo)
  • Gemini consultations: 4 (conventions → rubrics v2 → prompt critique → architecture synthesis)
  • Rubrics evolution: 7 (v1) → 12 (v2) → 10 (v3 merged)
  • Key outcome: "Triage & Specialist" cascade architecture