feat(skills): Add test-review and verify-work skills

- skills/verify-work: The Gatekeeper (runs verify.sh)
- skills/test-review: The Auditor (checks test quality)
- skills.nix: Registered new skills
dan 2026-01-19 15:31:22 -08:00
parent 5f1a9a4d3d
commit 97a2806d47
11 changed files with 434 additions and 0 deletions


@@ -12,6 +12,8 @@
screenshot-latest = "Find latest screenshots";
spec-review = "Technical specification review";
tufte-press = "Generate study card JSON";
test-review = "Audit test quality for flakiness, tautologies, and anti-patterns";
verify-work = "The Gatekeeper: Run project build/tests before claiming success";
worklog = "Create structured worklogs";
update-spec-kit = "Update spec-kit ecosystem";
update-opencode = "Update OpenCode via Nix";


@@ -0,0 +1,27 @@
# Test Review Skill (The Auditor)
This skill focuses on the **quality** of your test code.
While `verify-work` checks if tests *pass*, `test-review` checks if tests are *good*. A test suite can pass 100% and still be useless if it contains tautologies, brittle mocks, or race conditions.
## What It Checks
1. **Flakiness:** Arbitrary sleeps, global state, race conditions.
2. **Tautologies:** Tests that always pass (`expect(true).toBe(true)`).
3. **Brittle Mocks:** Mocking internals instead of behavior.
4. **Structure:** Arrange-Act-Assert (AAA) pattern violations.
5. **PBT Opportunities:** Suggesting Property-Based Testing for suitable logic.
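
As an illustration of item 5, a minimal property-based sketch, assuming a Jest-style runner and the `fast-check` library (the `add` function is hypothetical):

```typescript
import fc from "fast-check";

// Hypothetical function under test.
const add = (a: number, b: number): number => a + b;

test("add is commutative for all integers", () => {
  // fast-check generates many integer pairs instead of a few hand-picked cases.
  fc.assert(
    fc.property(fc.integer(), fc.integer(), (a, b) => {
      expect(add(a, b)).toBe(add(b, a));
    })
  );
});
```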
## Usage
Ask the agent to review specific test files or the whole suite:
* "Review `src/auth.test.ts` for anti-patterns"
* "Audit my test suite quality"
* "Are these tests brittle?"
## Philosophy
"Test behavior, not implementation."
"Wait for conditions, not time."
"Don't mock what you don't own."


@@ -0,0 +1,57 @@
---
name: test-review
description: Audit test quality using multiple lenses. Analyzes for flakiness, tautologies, structure, and mocking anti-patterns. Distinct from running tests (verify-work).
---
# Test Review Skill (The Auditor)
Run focused analysis on test files to ensure they are robust and meaningful.
## When to Use
Invoke this skill when:
- "Review these tests"
- "Audit my test suite"
- "Are these tests brittle?"
- "Check for test anti-patterns"
## Available Lenses
Lenses are located in `skills/test-review/lenses/`:
| Lens | Focus |
|------|-------|
| `flakiness.md` | Timing issues, sleeps, global state, race conditions |
| `lies.md` | Tautologies, meaningless assertions, swallowed errors |
| `structure.md` | Arrange-Act-Assert violations, single-action check |
| `mocking.md` | "Don't mock what you don't own", implementation vs behavior |
## Workflow
### Phase 1: Target Selection
1. Identify test files to review (e.g., `*.test.ts`, `test_*.py`).
2. Show file list to user for confirmation.
### Phase 2: Lens Execution
For each lens, analyze the target test files:
1. Apply the lens criteria.
2. Collect findings (Severity: HIGH/MED/LOW).
### Phase 3: Synthesis
1. Deduplicate findings.
2. Group related issues.
3. Present summary report.
### Phase 4: Interactive Review
Ask: "Which findings should I file as issues?"
## Guidelines
1. **Behavior Over Implementation:** Ensure tests verify what the code *does*, not how it *looks*.
2. **Dynamic vs Static:** Remind the user that this is a static quality audit; use `verify-work` to confirm that the tests actually *pass*.
3. **No Flakiness:** Be ruthless about `sleep` and `setTimeout`.
## Integration
- **Issue Tracking:** Uses `bd create` for beads issues.
- **Verification:** Works as part of the Triad (Verify -> Test-Review -> Code-Review).


@@ -0,0 +1,9 @@
# Lenses for Test Review
These lenses are used by the `test-review` skill to audit test quality.
## Installation
For global availability, these should be copied to `~/.config/lenses/test/` (or a similar path configured in your agent settings).
The `test-review` skill will also look for them in the skill's local `lenses/` directory if they are not found globally.


@@ -0,0 +1,20 @@
# Lens: Flakiness Auditor
Analyze the tests for timing dependencies and non-deterministic behavior.
## Checklist
### 1. Arbitrary Sleeps
* **❌ Flag:** `setTimeout`, `sleep(100)`, `time.Sleep(time.Second)`.
* **Why:** Slow machines fail because the sleep is too short; fast machines waste time waiting. A fixed sleep is a guess at timing.
* **Fix:** Use condition-based waiting (e.g., `waitFor(() => condition)`).
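
A minimal Jest-style sketch of the difference; the `makeJobQueue` stub is hypothetical and exists only to make the example self-contained:

```typescript
// Tiny asynchronous stub standing in for a real job queue (illustration only).
function makeJobQueue() {
  const done = new Set<string>();
  return {
    enqueue(name: string): void {
      // Completes after an unpredictable delay, like real async work.
      setTimeout(() => done.add(name), Math.random() * 200);
    },
    status(name: string): "pending" | "done" {
      return done.has(name) ? "done" : "pending";
    },
  };
}

// ❌ Flaky: guesses that 100 ms is always enough.
test("job completes (flaky)", async () => {
  const queue = makeJobQueue();
  queue.enqueue("send-email");
  await new Promise((resolve) => setTimeout(resolve, 100));
  expect(queue.status("send-email")).toBe("done");
});

// ✅ Deterministic: polls the actual condition, bounded by a hard deadline.
async function waitForCondition(check: () => boolean, timeoutMs = 2000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!check()) {
    if (Date.now() > deadline) throw new Error("Condition not met in time");
    await new Promise((resolve) => setTimeout(resolve, 10));
  }
}

test("job completes", async () => {
  const queue = makeJobQueue();
  queue.enqueue("send-email");
  await waitForCondition(() => queue.status("send-email") === "done");
  expect(queue.status("send-email")).toBe("done");
});
```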
### 2. Global State Mutation
* **❌ Flag:** Tests that modify global variables, environment variables, or singleton states without cleanup.
* **Why:** Causes order-dependent failures.
* **Fix:** Use `beforeEach`/`afterEach` or generate unique IDs per test.
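
A minimal Jest-style sketch, assuming a hypothetical `apiHost` config reader:

```typescript
// Hypothetical config reader used only for illustration.
function apiHost(): string {
  return process.env.API_HOST ?? "https://api.example.test";
}

const ORIGINAL_ENV = { ...process.env };

afterEach(() => {
  // Restore the environment so mutations never leak into other tests.
  process.env = { ...ORIGINAL_ENV };
});

test("falls back to the default host", () => {
  delete process.env.API_HOST;
  expect(apiHost()).toBe("https://api.example.test");
});

test("honours an explicit override", () => {
  process.env.API_HOST = "https://staging.example.test";
  expect(apiHost()).toBe("https://staging.example.test");
});
```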
### 3. Unmocked Network/IO
* **❌ Flag:** Tests that make real network calls or write to system-wide paths (e.g., `/tmp/test`).
* **Why:** Outages, latency, and shared paths make results depend on the environment rather than the code.
* **Fix:** Mock network boundaries or use isolated temp directories.
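
A sketch of the isolated-temp-directory approach using Node's built-in helpers (Jest-style runner assumed):

```typescript
import { mkdtempSync, rmSync, writeFileSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

let workDir: string;

beforeEach(() => {
  // Each test gets its own unique directory instead of a shared /tmp/test path.
  workDir = mkdtempSync(join(tmpdir(), "test-review-example-"));
});

afterEach(() => {
  rmSync(workDir, { recursive: true, force: true });
});

test("writes the report to disk", () => {
  const reportPath = join(workDir, "report.json");
  writeFileSync(reportPath, JSON.stringify({ ok: true }));
  expect(JSON.parse(readFileSync(reportPath, "utf8"))).toEqual({ ok: true });
});
```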


@@ -0,0 +1,24 @@
# Lens: Tautology Auditor
Check if tests are "lying" or providing false confidence by asserting things that are always true.
## Checklist
### 1. Mock Assertions
* **❌ Flag:** `expect(mock).toHaveBeenCalled()`.
* **Why:** You are testing that the mock framework works, not that the code is correct.
* **Fix:** Assert on the *result* of the call or its *observable side effects*.
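
A sketch of asserting on the outcome instead of the mock; `Mailer` and `notifyOverdue` are hypothetical:

```typescript
// Hypothetical collaborator: delivers messages.
interface Mailer {
  send(to: string, body: string): void;
}

// Hypothetical unit under test.
function notifyOverdue(mailer: Mailer, user: { email: string; overdue: boolean }): void {
  if (user.overdue) mailer.send(user.email, "Your invoice is overdue");
}

test("notifies overdue users", () => {
  // ✅ A hand-rolled fake lets us assert on the observable outcome...
  const sent: Array<{ to: string; body: string }> = [];
  const fakeMailer: Mailer = { send: (to, body) => sent.push({ to, body }) };

  notifyOverdue(fakeMailer, { email: "a@example.test", overdue: true });

  // ...rather than only `expect(mock.send).toHaveBeenCalled()`.
  expect(sent).toEqual([{ to: "a@example.test", body: "Your invoice is overdue" }]);
});
```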
### 2. Logic Duplication
* **❌ Flag:** `expect(add(a,b)).toBe(a+b)`.
* **Why:** If the same logic is wrong in both the code and the test, the test still passes.
* **Fix:** Use hardcoded expected values (`expect(add(2,2)).toBe(4)`).
### 3. Swallowed Errors
* **❌ Flag:** `try { ... } catch (e) {}` inside a test without re-throwing or failing.
* **Why:** Silences actual failures.
* **Fix:** Use the test runner's built-in error handling or `expect().toThrow()`.
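
A sketch of both patterns, assuming a Jest-style runner (the `parseConfig` function is hypothetical):

```typescript
// Hypothetical parser used only for illustration.
function parseConfig(raw: string): { port: number } {
  const parsed = JSON.parse(raw);
  if (typeof parsed.port !== "number") throw new Error("missing port");
  return parsed;
}

// ❌ Swallowed error: this test passes even if parseConfig never throws.
test("rejects bad config (lying)", () => {
  try {
    parseConfig("{}");
  } catch (e) {
    // silently ignored
  }
});

// ✅ Let the runner own the failure.
test("rejects config without a port", () => {
  expect(() => parseConfig("{}")).toThrow("missing port");
});
```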
### 4. Meaningless Assertions
* **❌ Flag:** `expect(result).toBeDefined()` when a specific value is expected (e.g., a status code).
* **Fix:** Assert on the *specific value* (`expect(result.status).toBe(200)`).


@@ -0,0 +1,18 @@
# Lens: Mocking Auditor
Analyze how external dependencies are handled.
## Checklist
### 1. "Don't Mock What You Don't Own"
* **❌ Flag:** Mocking `axios`, `fs`, `react-router`, or other third-party libraries directly.
* **Why:** Brittle. Changes in the library break your tests even if your code is fine.
* **Fix:** Suggest a wrapper (Adapter pattern) and mock your own wrapper.
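
A sketch of the adapter approach, assuming TypeScript with the global `fetch` available (all names are hypothetical):

```typescript
// Thin adapter you own; the only place that touches the third-party API (global fetch here).
interface HttpGateway {
  getJson(url: string): Promise<unknown>;
}

const fetchGateway: HttpGateway = {
  async getJson(url) {
    const res = await fetch(url);
    return res.json();
  },
};

// Unit under test depends on the adapter, not on fetch/axios directly.
async function loadUserName(http: HttpGateway, id: string): Promise<string> {
  const user = (await http.getJson(`/users/${id}`)) as { name: string };
  return user.name;
}

test("loads the user's name", async () => {
  // Mock the interface you own; the library's signature can change without breaking this test.
  const stub: HttpGateway = { getJson: async () => ({ name: "Ada" }) };
  expect(await loadUserName(stub, "42")).toBe("Ada");
});
```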
### 2. Implementation vs Behavior
* **❌ Flag:** Tests that depend on private variables or internal method names.
* **Fix:** Test only the public API.
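
A sketch contrasting the two, with a hypothetical `UserStore`:

```typescript
// Hypothetical unit with an internal cache.
class UserStore {
  private cache = new Map<string, string>();
  save(id: string, name: string): void {
    this.cache.set(id, name);
  }
  find(id: string): string | undefined {
    return this.cache.get(id);
  }
}

test("stores users (brittle)", () => {
  const store = new UserStore();
  store.save("1", "Ada");
  // ❌ Reaches into private state; breaks if the cache is renamed or replaced.
  expect((store as any).cache.get("1")).toBe("Ada");
});

test("stores users", () => {
  const store = new UserStore();
  store.save("1", "Ada");
  // ✅ Only the public API: survives any internal refactor.
  expect(store.find("1")).toBe("Ada");
});
```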
### 3. Complex Mock Setup
* **❌ Flag:** Mock setup that is > 50% of the test file.
* **Fix:** Suggest an integration test with real (managed) dependencies or a simpler mock strategy.


@@ -0,0 +1,18 @@
# Lens: Structure Auditor
Check if tests follow a clear, maintainable structure.
## Checklist
### 1. Arrange-Act-Assert (AAA)
* **Check:** Does the test have clear setup, execution, and verification phases?
* **❌ Flag:** Interleaved acting and asserting.
* **Fix:** Refactor into `// Arrange`, `// Act`, `// Assert` blocks.
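
A minimal AAA sketch (the `Cart` class is hypothetical):

```typescript
// Hypothetical shopping cart used only for illustration.
class Cart {
  private items: Array<{ sku: string; price: number }> = [];
  add(sku: string, price: number): void {
    this.items.push({ sku, price });
  }
  total(): number {
    return this.items.reduce((sum, item) => sum + item.price, 0);
  }
}

test("totals the prices of all items", () => {
  // Arrange
  const cart = new Cart();
  cart.add("book", 12);
  cart.add("pen", 3);

  // Act
  const total = cart.total();

  // Assert
  expect(total).toBe(15);
});
```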
### 2. Single Action per Test
* **❌ Flag:** Tests that exercise five different behaviors in one block.
* **Fix:** Split into multiple smaller tests.
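
A sketch of the split, with a hypothetical `isValidEmail` helper:

```typescript
// Hypothetical validator used only for illustration.
const isValidEmail = (s: string): boolean => /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(s);

// ❌ One block testing several behaviors: the first failure hides the rest.
test("email validation", () => {
  expect(isValidEmail("a@example.test")).toBe(true);
  expect(isValidEmail("")).toBe(false);
  expect(isValidEmail("no-at-sign")).toBe(false);
});

// ✅ One behavior per test: a failure names the exact broken case.
test("accepts a well-formed address", () => {
  expect(isValidEmail("a@example.test")).toBe(true);
});

test("rejects an address without an @", () => {
  expect(isValidEmail("no-at-sign")).toBe(false);
});
```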
### 3. Readability
* **Check:** Does the test description match the assertion?
* **Fix:** Use descriptive names like `test('should reject invalid emails')` instead of `test('email test')`.


@@ -0,0 +1,46 @@
# Verify Work Skill (The Gatekeeper)
A skill that enforces "The Iron Law": **No completion claims without fresh verification evidence.**
It prevents the "Hallucinated Success" problem where an AI agent claims "I fixed it" because the code *looks* right, even though it fails to compile or pass tests.
## Features
* **Universal Verification:** Automatically detects project type (Nix, Node, Rust, Python, Go, Make) and runs the appropriate checks.
* **Zero-Config:** Works out of the box for standard project structures.
* **Strict Gate:** Instructs the agent that it is *forbidden* from claiming success if verification fails.
## Usage
### For Agents
The agent invokes this skill whenever it believes a task is done.
```bash
# Agent runs this
./skills/verify-work/scripts/verify.sh
```
### For Humans
You can run the script manually to verify the current directory:
```bash
# Verify the current project
~/.claude/skills/verify-work/scripts/verify.sh
```
## Supported Project Types
1. **Nix:** `flake.nix` -> `nix flake check`
2. **Rust:** `Cargo.toml` -> `cargo test && cargo build`
3. **Node/TS:** `package.json` -> `npm test`, `npm run build`
4. **Python:** `pyproject.toml`/`requirements.txt` -> `pytest` or `unittest`
5. **Go:** `go.mod` -> `go test ./...`
6. **Makefile:** Target `test`, `build`, or `all`
## Philosophy
This skill is the "Gatekeeper". Ideally, it runs *after* code changes but *before* `code-review` or `test-review`.
1. **Verify-Work:** Does it compile/pass? (If No -> STOP).
2. **Test-Review:** Are the tests meaningful?
3. **Code-Review:** Is the code maintainable?


@@ -0,0 +1,72 @@
---
name: verify-work
description: Use BEFORE claiming completion of any task. Enforces "The Iron Law": No completion claims without fresh verification evidence. Runs project-specific build and test commands to prove correctness.
---
# Verify Work (The Gatekeeper)
## The Iron Law
```
NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
```
You are the **Gatekeeper**. Your job is to prevent "Hallucinated Success".
**Core Principle:** A claim of "Done" is a lie unless backed by a green checkmark from the compiler/test runner *in this very turn*.
## When to Use
**MANDATORY usage before:**
- Claiming a task is "Done" or "Complete"
- Creating a Pull Request
- Committing code (unless explicitly WIP)
- Delegating to another agent (handing off "working" code)
- Expressing satisfaction ("Great, it works!")
**Trigger phrases:**
- "Verify my work"
- "Run checks"
- "Is this ready?"
- "Check if I broke anything"
## The Process
1. **Detect & Run:** Execute the verification script.
```bash
./skills/verify-work/scripts/verify.sh
```
2. **Analyze Output:**
* **Exit Code 0:** ✅ PASS.
* **Exit Code != 0:** ❌ FAIL.
3. **Report:**
* **IF PASS:** "Verification Passed. [Quote output summary]. Task is complete."
* **IF FAIL:** "Verification Failed. [Quote specific error]. Task is NOT complete."
## Handling Failures
If verification fails, you are **FORBIDDEN** from claiming success.
* **Correct:** "I attempted the fix, but verification failed with [Error]. I need to fix this."
* **Incorrect:** "Fixed the bug! (But tests failed)." -> This is a lie. If tests failed, the bug is not fixed.
## Common Rationalizations (DO NOT USE)
| Rationalization | Reality |
|-----------------|---------|
| "It was a small change" | Small changes break builds. Verify. |
| "I'm confident" | Confidence is not evidence. Verify. |
| "The linter passed" | Linters aren't compilers. Verify. |
| "I'll run it later" | Then you are not "Done" now. |
| "It should work" | "Should" is the language of failure. Verify. |
## Script Behavior
The `verify.sh` script automatically detects the project type:
* **Nix:** `nix flake check` or `nix build`
* **Node:** `npm test && npm run build`
* **Rust:** `cargo test && cargo build`
* **Python:** `pytest`
* **Go:** `go test ./...`
* **Generic:** Checks for `Makefile` (target `test` or `build`)
You do not need to know the specific command; the script handles it.


@@ -0,0 +1,141 @@
#!/usr/bin/env bash
set -euo pipefail

# verify.sh - The "One Script" to verify any project
# Part of the verify-work skill

# Colors for output
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

log_info() { echo -e "${GREEN}[VERIFY]${NC} $1"; }
log_warn() { echo -e "${YELLOW}[VERIFY]${NC} $1"; }
log_err() { echo -e "${RED}[VERIFY]${NC} $1" >&2; }

# Run a command, report pass/fail, and propagate its exit status
run_cmd() {
  local cmd="$1"
  local desc="$2"
  log_info "Running $desc..."
  if eval "$cmd"; then
    echo -e "${GREEN}$desc passed${NC}"
    return 0
  else
    echo -e "${RED}$desc failed${NC}"
    return 1
  fi
}

# 1. Detect Project Root (if not current)
# We usually run from the project root, but resolve it via git to be safe.
PROJECT_ROOT=$(git rev-parse --show-toplevel 2>/dev/null || pwd)
cd "$PROJECT_ROOT"
log_info "Verifying project in: $PROJECT_ROOT"

# 2. Heuristics for Verification
# Checked from most specific to least specific.

# --- NIX ---
if [ -f "flake.nix" ]; then
  log_info "Detected Nix Flake"
  # Try 'nix flake check' first as it's the gold standard.
  if run_cmd "nix flake check --keep-going --print-build-logs" "Nix Flake Check"; then
    exit 0
  fi
  # A failed check is a hard fail; building the default package could be a future fallback.
  exit 1
fi

# --- RUST ---
if [ -f "Cargo.toml" ]; then
  log_info "Detected Rust Project"
  if run_cmd "cargo test" "Rust Tests" && run_cmd "cargo build" "Rust Build"; then
    exit 0
  fi
  exit 1
fi

# --- NODE/JS/TS ---
if [ -f "package.json" ]; then
  log_info "Detected Node Project"
  # jq is required to inspect package.json scripts; without it we cannot verify.
  if ! command -v jq >/dev/null; then
    log_err "jq is required to inspect package.json but was not found in PATH."
    exit 1
  fi
  # Check if 'test' script exists
  if jq -e '.scripts.test' package.json >/dev/null; then
    if ! run_cmd "npm test" "NPM Tests"; then
      exit 1
    fi
  else
    log_warn "No 'test' script found in package.json"
  fi
  # Check if 'build' script exists
  if jq -e '.scripts.build' package.json >/dev/null; then
    if ! run_cmd "npm run build" "NPM Build"; then
      exit 1
    fi
  fi
  # If we ran at least one check, succeed. If neither script existed, warn but don't fail.
  if ! jq -e '.scripts.test' package.json >/dev/null && ! jq -e '.scripts.build' package.json >/dev/null; then
    log_warn "No test or build scripts found. Verification inconclusive."
  fi
  exit 0
fi

# --- PYTHON ---
if [ -f "pyproject.toml" ] || [ -f "requirements.txt" ] || [ -f "setup.py" ]; then
  log_info "Detected Python Project"
  if command -v pytest >/dev/null; then
    if run_cmd "pytest" "Pytest"; then
      exit 0
    fi
    exit 1
  else
    log_warn "pytest not found in PATH. Trying unittest..."
    if run_cmd "python3 -m unittest discover" "Unittest"; then
      exit 0
    fi
    exit 1
  fi
fi

# --- GO ---
if [ -f "go.mod" ]; then
  log_info "Detected Go Project"
  if run_cmd "go test ./..." "Go Tests"; then
    exit 0
  fi
  exit 1
fi

# --- MAKEFILE ---
if [ -f "Makefile" ]; then
  log_info "Detected Makefile"
  # Try 'test' target
  if grep -q "^test:" Makefile; then
    if ! run_cmd "make test" "Make Test"; then
      exit 1
    fi
  fi
  # Try 'build' or 'all' target
  if grep -q "^build:" Makefile; then
    if ! run_cmd "make build" "Make Build"; then
      exit 1
    fi
  elif grep -q "^all:" Makefile; then
    if ! run_cmd "make all" "Make All"; then
      exit 1
    fi
  fi
  exit 0
fi

# --- UNKNOWN ---
log_err "Could not detect project type (no flake.nix, package.json, Cargo.toml, etc.)"
log_err "Please create a known configuration or add a 'test' script."
exit 1