feat(skills): Add test-review and verify-work skills

- skills/verify-work: The Gatekeeper (runs verify.sh)
- skills/test-review: The Auditor (checks test quality)
- skills.nix: Registered new skills
dan 2026-01-19 15:31:22 -08:00
parent 5f1a9a4d3d
commit 97a2806d47
11 changed files with 434 additions and 0 deletions


@@ -12,6 +12,8 @@
screenshot-latest = "Find latest screenshots";
spec-review = "Technical specification review";
tufte-press = "Generate study card JSON";
test-review = "Audit test quality for flakiness, tautologies, and anti-patterns";
verify-work = "The Gatekeeper: Run project build/tests before claiming success";
worklog = "Create structured worklogs";
update-spec-kit = "Update spec-kit ecosystem";
update-opencode = "Update OpenCode via Nix";


@@ -0,0 +1,27 @@
# Test Review Skill (The Auditor)
This skill focuses on the **quality** of your test code.
While `verify-work` checks if tests *pass*, `test-review` checks if tests are *good*. A test suite can pass 100% and still be useless if it contains tautologies, brittle mocks, or race conditions.
## What It Checks
1. **Flakiness:** Arbitrary sleeps, global state, race conditions.
2. **Tautologies:** Tests that always pass (`expect(true).toBe(true)`).
3. **Brittle Mocks:** Mocking internals instead of behavior.
4. **Structure:** Arrange-Act-Assert (AAA) pattern violations.
5. **PBT Opportunities:** Suggesting Property-Based Testing for suitable logic.
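
As an illustration of item 5, a minimal property-based sketch, assuming a Jest-style runner and the `fast-check` library (the `add` function is hypothetical):

```typescript
import fc from "fast-check";

// Hypothetical function under test.
const add = (a: number, b: number): number => a + b;

test("add is commutative for all integers", () => {
  // fast-check generates many integer pairs instead of a few hand-picked cases.
  fc.assert(
    fc.property(fc.integer(), fc.integer(), (a, b) => {
      expect(add(a, b)).toBe(add(b, a));
    })
  );
});
```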
## Usage
Ask the agent to review specific test files or the whole suite:
* "Review `src/auth.test.ts` for anti-patterns"
* "Audit my test suite quality"
* "Are these tests brittle?"
## Philosophy
"Test behavior, not implementation."
"Wait for conditions, not time."
"Don't mock what you don't own."


@@ -0,0 +1,57 @@
---
name: test-review
description: Audit test quality using multiple lenses. Analyzes for flakiness, tautologies, structure, and mocking anti-patterns. Distinct from running tests (verify-work).
---
# Test Review Skill (The Auditor)
Run focused analysis on test files to ensure they are robust and meaningful.
## When to Use
Invoke this skill when:
- "Review these tests"
- "Audit my test suite"
- "Are these tests brittle?"
- "Check for test anti-patterns"
## Available Lenses
Lenses are located in `skills/test-review/lenses/`:
| Lens | Focus |
|------|-------|
| `flakiness.md` | Timing issues, sleeps, global state, race conditions |
| `lies.md` | Tautologies, meaningless assertions, swallowed errors |
| `structure.md` | Arrange-Act-Assert violations, single-action check |
| `mocking.md` | "Don't mock what you don't own", implementation vs behavior |
## Workflow
### Phase 1: Target Selection
1. Identify test files to review (e.g., `*.test.ts`, `test_*.py`).
2. Show file list to user for confirmation.
### Phase 2: Lens Execution
For each lens, analyze the target test files:
1. Apply the lens criteria.
2. Collect findings (Severity: HIGH/MED/LOW).
### Phase 3: Synthesis
1. Deduplicate findings.
2. Group related issues.
3. Present summary report.
### Phase 4: Interactive Review
Ask: "Which findings should I file as issues?"
## Guidelines
1. **Behavior Over Implementation:** Ensure tests verify what the code *does*, not how it *looks*.
2. **Dynamic vs Static:** Remind the user that this is a static quality audit; use `verify-work` to confirm that the tests actually *pass*.
3. **No Flakiness:** Be ruthless about `sleep` and `setTimeout`.
## Integration
- **Issue Tracking:** Uses `bd create` for beads issues.
- **Verification:** Works as part of the Triad (Verify -> Test-Review -> Code-Review).


@@ -0,0 +1,9 @@
# Lenses for Test Review
These lenses are used by the `test-review` skill to audit test quality.
## Installation
For global availability, these should be copied to `~/.config/lenses/test/` (or a similar path configured in your agent settings).
The `test-review` skill will also look for them in the skill's local `lenses/` directory if they are not found globally.


@@ -0,0 +1,20 @@
# Lens: Flakiness Auditor
Analyze the tests for timing dependencies and non-deterministic behavior.
## Checklist
### 1. Arbitrary Sleeps
* **❌ Flag:** `setTimeout`, `sleep(100)`, `time.Sleep(time.Second)`.
* **Why:** Slow machines fail because the sleep is too short; fast machines waste time waiting. A fixed sleep is a guess at timing.
* **Fix:** Use condition-based waiting (e.g., `waitFor(() => condition)`).
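
A minimal Jest-style sketch of the difference; the `makeJobQueue` stub is hypothetical and exists only to make the example self-contained:

```typescript
// Tiny asynchronous stub standing in for a real job queue (illustration only).
function makeJobQueue() {
  const done = new Set<string>();
  return {
    enqueue(name: string): void {
      // Completes after an unpredictable delay, like real async work.
      setTimeout(() => done.add(name), Math.random() * 200);
    },
    status(name: string): "pending" | "done" {
      return done.has(name) ? "done" : "pending";
    },
  };
}

// ❌ Flaky: guesses that 100 ms is always enough.
test("job completes (flaky)", async () => {
  const queue = makeJobQueue();
  queue.enqueue("send-email");
  await new Promise((resolve) => setTimeout(resolve, 100));
  expect(queue.status("send-email")).toBe("done");
});

// ✅ Deterministic: polls the actual condition, bounded by a hard deadline.
async function waitForCondition(check: () => boolean, timeoutMs = 2000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!check()) {
    if (Date.now() > deadline) throw new Error("Condition not met in time");
    await new Promise((resolve) => setTimeout(resolve, 10));
  }
}

test("job completes", async () => {
  const queue = makeJobQueue();
  queue.enqueue("send-email");
  await waitForCondition(() => queue.status("send-email") === "done");
  expect(queue.status("send-email")).toBe("done");
});
```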
### 2. Global State Mutation
* **❌ Flag:** Tests that modify global variables, environment variables, or singleton states without cleanup.
* **Why:** Causes order-dependent failures.
* **Fix:** Use `beforeEach`/`afterEach` or generate unique IDs per test.
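
A minimal Jest-style sketch, assuming a hypothetical `apiHost` config reader:

```typescript
// Hypothetical config reader used only for illustration.
function apiHost(): string {
  return process.env.API_HOST ?? "https://api.example.test";
}

const ORIGINAL_ENV = { ...process.env };

afterEach(() => {
  // Restore the environment so mutations never leak into other tests.
  process.env = { ...ORIGINAL_ENV };
});

test("falls back to the default host", () => {
  delete process.env.API_HOST;
  expect(apiHost()).toBe("https://api.example.test");
});

test("honours an explicit override", () => {
  process.env.API_HOST = "https://staging.example.test";
  expect(apiHost()).toBe("https://staging.example.test");
});
```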
### 3. Unmocked Network/IO
* **❌ Flag:** Tests that make real network calls or write to system-wide paths (e.g., `/tmp/test`).
* **Why:** Outages, latency, and shared paths make results depend on the environment rather than the code.
* **Fix:** Mock network boundaries or use isolated temp directories.
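
A sketch of the isolated-temp-directory approach using Node's built-in helpers (Jest-style runner assumed):

```typescript
import { mkdtempSync, rmSync, writeFileSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

let workDir: string;

beforeEach(() => {
  // Each test gets its own unique directory instead of a shared /tmp/test path.
  workDir = mkdtempSync(join(tmpdir(), "test-review-example-"));
});

afterEach(() => {
  rmSync(workDir, { recursive: true, force: true });
});

test("writes the report to disk", () => {
  const reportPath = join(workDir, "report.json");
  writeFileSync(reportPath, JSON.stringify({ ok: true }));
  expect(JSON.parse(readFileSync(reportPath, "utf8"))).toEqual({ ok: true });
});
```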


@@ -0,0 +1,24 @@
# Lens: Tautology Auditor
Check if tests are "lying" or providing false confidence by asserting things that are always true.
## Checklist
### 1. Mock Assertions
* **❌ Flag:** `expect(mock).toHaveBeenCalled()`.
* **Why:** You are testing that the mock framework works, not that the code is correct.
* **Fix:** Assert on the *result* of the call or its *observable side effects*.
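
A sketch of asserting on the outcome instead of the mock; `Mailer` and `notifyOverdue` are hypothetical:

```typescript
// Hypothetical collaborator: delivers messages.
interface Mailer {
  send(to: string, body: string): void;
}

// Hypothetical unit under test.
function notifyOverdue(mailer: Mailer, user: { email: string; overdue: boolean }): void {
  if (user.overdue) mailer.send(user.email, "Your invoice is overdue");
}

test("notifies overdue users", () => {
  // ✅ A hand-rolled fake lets us assert on the observable outcome...
  const sent: Array<{ to: string; body: string }> = [];
  const fakeMailer: Mailer = { send: (to, body) => sent.push({ to, body }) };

  notifyOverdue(fakeMailer, { email: "a@example.test", overdue: true });

  // ...rather than only `expect(mock.send).toHaveBeenCalled()`.
  expect(sent).toEqual([{ to: "a@example.test", body: "Your invoice is overdue" }]);
});
```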
### 2. Logic Duplication
* **❌ Flag:** `expect(add(a,b)).toBe(a+b)`.
* **Why:** If the same logic is wrong in both the code and the test, the test still passes.
* **Fix:** Use hardcoded expected values (`expect(add(2,2)).toBe(4)`).
### 3. Swallowed Errors
* **❌ Flag:** `try { ... } catch (e) {}` inside a test without re-throwing or failing.
* **Why:** Silences actual failures.
* **Fix:** Use the test runner's built-in error handling or `expect().toThrow()`.
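
A sketch of both patterns, assuming a Jest-style runner (the `parseConfig` function is hypothetical):

```typescript
// Hypothetical parser used only for illustration.
function parseConfig(raw: string): { port: number } {
  const parsed = JSON.parse(raw);
  if (typeof parsed.port !== "number") throw new Error("missing port");
  return parsed;
}

// ❌ Swallowed error: this test passes even if parseConfig never throws.
test("rejects bad config (lying)", () => {
  try {
    parseConfig("{}");
  } catch (e) {
    // silently ignored
  }
});

// ✅ Let the runner own the failure.
test("rejects config without a port", () => {
  expect(() => parseConfig("{}")).toThrow("missing port");
});
```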
### 4. Meaningless Assertions
* **❌ Flag:** `expect(result).toBeDefined()` when a specific value is expected (e.g., a status code).
* **Fix:** Assert on the *specific value* (`expect(result.status).toBe(200)`).


@@ -0,0 +1,18 @@
# Lens: Mocking Auditor
Analyze how external dependencies are handled.
## Checklist
### 1. "Don't Mock What You Don't Own"
* **❌ Flag:** Mocking `axios`, `fs`, `react-router`, or other third-party libraries directly.
* **Why:** Brittle. Changes in the library break your tests even if your code is fine.
* **Fix:** Suggest a wrapper (Adapter pattern) and mock your own wrapper.
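
A sketch of the adapter approach, assuming TypeScript with the global `fetch` available (all names are hypothetical):

```typescript
// Thin adapter you own; the only place that touches the third-party API (global fetch here).
interface HttpGateway {
  getJson(url: string): Promise<unknown>;
}

const fetchGateway: HttpGateway = {
  async getJson(url) {
    const res = await fetch(url);
    return res.json();
  },
};

// Unit under test depends on the adapter, not on fetch/axios directly.
async function loadUserName(http: HttpGateway, id: string): Promise<string> {
  const user = (await http.getJson(`/users/${id}`)) as { name: string };
  return user.name;
}

test("loads the user's name", async () => {
  // Mock the interface you own; the library's signature can change without breaking this test.
  const stub: HttpGateway = { getJson: async () => ({ name: "Ada" }) };
  expect(await loadUserName(stub, "42")).toBe("Ada");
});
```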
### 2. Implementation vs Behavior
* **❌ Flag:** Tests that depend on private variables or internal method names.
* **Fix:** Test only the public API.
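
A sketch contrasting the two, with a hypothetical `UserStore`:

```typescript
// Hypothetical unit with an internal cache.
class UserStore {
  private cache = new Map<string, string>();
  save(id: string, name: string): void {
    this.cache.set(id, name);
  }
  find(id: string): string | undefined {
    return this.cache.get(id);
  }
}

test("stores users (brittle)", () => {
  const store = new UserStore();
  store.save("1", "Ada");
  // ❌ Reaches into private state; breaks if the cache is renamed or replaced.
  expect((store as any).cache.get("1")).toBe("Ada");
});

test("stores users", () => {
  const store = new UserStore();
  store.save("1", "Ada");
  // ✅ Only the public API: survives any internal refactor.
  expect(store.find("1")).toBe("Ada");
});
```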
### 3. Complex Mock Setup
* **❌ Flag:** Mock setup that is > 50% of the test file.
* **Fix:** Suggest an integration test with real (managed) dependencies or a simpler mock strategy.


@@ -0,0 +1,18 @@
# Lens: Structure Auditor
Check if tests follow a clear, maintainable structure.
## Checklist
### 1. Arrange-Act-Assert (AAA)
* **Check:** Does the test have clear setup, execution, and verification phases?
* **❌ Flag:** Interleaved acting and asserting.
* **Fix:** Refactor into `// Arrange`, `// Act`, `// Assert` blocks.
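
A minimal AAA sketch (the `Cart` class is hypothetical):

```typescript
// Hypothetical shopping cart used only for illustration.
class Cart {
  private items: Array<{ sku: string; price: number }> = [];
  add(sku: string, price: number): void {
    this.items.push({ sku, price });
  }
  total(): number {
    return this.items.reduce((sum, item) => sum + item.price, 0);
  }
}

test("totals the prices of all items", () => {
  // Arrange
  const cart = new Cart();
  cart.add("book", 12);
  cart.add("pen", 3);

  // Act
  const total = cart.total();

  // Assert
  expect(total).toBe(15);
});
```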
### 2. Single Action per Test
* **❌ Flag:** Tests that exercise five different behaviors in one block.
* **Fix:** Split into multiple smaller tests.
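
A sketch of the split, with a hypothetical `isValidEmail` helper:

```typescript
// Hypothetical validator used only for illustration.
const isValidEmail = (s: string): boolean => /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(s);

// ❌ One block testing several behaviors: the first failure hides the rest.
test("email validation", () => {
  expect(isValidEmail("a@example.test")).toBe(true);
  expect(isValidEmail("")).toBe(false);
  expect(isValidEmail("no-at-sign")).toBe(false);
});

// ✅ One behavior per test: a failure names the exact broken case.
test("accepts a well-formed address", () => {
  expect(isValidEmail("a@example.test")).toBe(true);
});

test("rejects an address without an @", () => {
  expect(isValidEmail("no-at-sign")).toBe(false);
});
```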
### 3. Readability
* **Check:** Does the test description match the assertion?
* **Fix:** Use descriptive names like `test('should reject invalid emails')` instead of `test('email test')`.


@@ -0,0 +1,46 @@
# Verify Work Skill (The Gatekeeper)
A skill that enforces "The Iron Law": **No completion claims without fresh verification evidence.**
It prevents the "Hallucinated Success" problem where an AI agent claims "I fixed it" because the code *looks* right, even though it fails to compile or pass tests.
## Features
* **Universal Verification:** Automatically detects project type (Nix, Node, Rust, Python, Go, Make) and runs the appropriate checks.
* **Zero-Config:** Works out of the box for standard project structures.
* **Strict Gate:** Instructs the agent that it is *forbidden* from claiming success if verification fails.
## Usage
### For Agents
The agent invokes this skill whenever it believes a task is done.
```bash
# Agent runs this
./skills/verify-work/scripts/verify.sh
```
### For Humans
You can run the script manually to verify the current directory:
```bash
# Verify the current project
~/.claude/skills/verify-work/scripts/verify.sh
```
## Supported Project Types
1. **Nix:** `flake.nix` -> `nix flake check`
2. **Rust:** `Cargo.toml` -> `cargo test && cargo build`
3. **Node/TS:** `package.json` -> `npm test`, `npm run build`
4. **Python:** `pyproject.toml`/`requirements.txt` -> `pytest` or `unittest`
5. **Go:** `go.mod` -> `go test ./...`
6. **Makefile:** Target `test`, `build`, or `all`
## Philosophy
This skill is the "Gatekeeper". Ideally, it runs *after* code changes but *before* `code-review` or `test-review`.
1. **Verify-Work:** Does it compile/pass? (If No -> STOP).
2. **Test-Review:** Are the tests meaningful?
3. **Code-Review:** Is the code maintainable?


@@ -0,0 +1,72 @@
---
name: verify-work
description: Use BEFORE claiming completion of any task. Enforces "The Iron Law": No completion claims without fresh verification evidence. Runs project-specific build and test commands to prove correctness.
---
# Verify Work (The Gatekeeper)
## The Iron Law
```
NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
```
You are the **Gatekeeper**. Your job is to prevent "Hallucinated Success".
**Core Principle:** A claim of "Done" is a lie unless backed by a green checkmark from the compiler/test runner *in this very turn*.
## When to Use
**MANDATORY usage before:**
- Claiming a task is "Done" or "Complete"
- Creating a Pull Request
- Committing code (unless explicitly WIP)
- Delegating to another agent (handing off "working" code)
- Expressing satisfaction ("Great, it works!")
**Trigger phrases:**
- "Verify my work"
- "Run checks"
- "Is this ready?"
- "Check if I broke anything"
## The Process
1. **Detect & Run:** Execute the verification script.
```bash
./skills/verify-work/scripts/verify.sh
```
2. **Analyze Output:**
* **Exit Code 0:** ✅ PASS.
* **Exit Code != 0:** ❌ FAIL.
3. **Report:**
* **IF PASS:** "Verification Passed. [Quote output summary]. Task is complete."
* **IF FAIL:** "Verification Failed. [Quote specific error]. Task is NOT complete."
## Handling Failures
If verification fails, you are **FORBIDDEN** from claiming success.
* **Correct:** "I attempted the fix, but verification failed with [Error]. I need to fix this."
* **Incorrect:** "Fixed the bug! (But tests failed)." -> This is a lie. If tests failed, the bug is not fixed.
## Common Rationalizations (DO NOT USE)
| Rationalization | Reality |
|-----------------|---------|
| "It was a small change" | Small changes break builds. Verify. |
| "I'm confident" | Confidence is not evidence. Verify. |
| "The linter passed" | Linters aren't compilers. Verify. |
| "I'll run it later" | Then you are not "Done" now. |
| "It should work" | "Should" is the language of failure. Verify. |
## Script Behavior
The `verify.sh` script automatically detects the project type:
* **Nix:** `nix flake check` or `nix build`
* **Node:** `npm test && npm run build`
* **Rust:** `cargo test && cargo build`
* **Python:** `pytest`
* **Go:** `go test ./...`
* **Generic:** Checks for `Makefile` (target `test` or `build`)
You do not need to know the specific command; the script handles it.


@@ -0,0 +1,141 @@
#!/usr/bin/env bash
set -euo pipefail

# verify.sh - The "One Script" to verify any project
# Part of the verify-work skill

# Colors for output
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

log_info() { echo -e "${GREEN}[VERIFY]${NC} $1"; }
log_warn() { echo -e "${YELLOW}[VERIFY]${NC} $1"; }
log_err() { echo -e "${RED}[VERIFY]${NC} $1" >&2; }

# Run a command, report pass/fail, and propagate its exit status
run_cmd() {
  local cmd="$1"
  local desc="$2"
  log_info "Running $desc..."
  if eval "$cmd"; then
    echo -e "${GREEN}$desc passed${NC}"
    return 0
  else
    echo -e "${RED}$desc failed${NC}"
    return 1
  fi
}

# 1. Detect Project Root (if not current)
# We usually run from the project root, but resolve it via git to be safe.
PROJECT_ROOT=$(git rev-parse --show-toplevel 2>/dev/null || pwd)
cd "$PROJECT_ROOT"
log_info "Verifying project in: $PROJECT_ROOT"

# 2. Heuristics for Verification
# Checked from most specific to least specific.

# --- NIX ---
if [ -f "flake.nix" ]; then
  log_info "Detected Nix Flake"
  # Try 'nix flake check' first as it's the gold standard.
  if run_cmd "nix flake check --keep-going --print-build-logs" "Nix Flake Check"; then
    exit 0
  fi
  # A failed check is a hard fail; building the default package could be a future fallback.
  exit 1
fi

# --- RUST ---
if [ -f "Cargo.toml" ]; then
  log_info "Detected Rust Project"
  if run_cmd "cargo test" "Rust Tests" && run_cmd "cargo build" "Rust Build"; then
    exit 0
  fi
  exit 1
fi

# --- NODE/JS/TS ---
if [ -f "package.json" ]; then
  log_info "Detected Node Project"
  # jq is required to inspect package.json scripts; without it we cannot verify.
  if ! command -v jq >/dev/null; then
    log_err "jq is required to inspect package.json but was not found in PATH."
    exit 1
  fi
  # Check if 'test' script exists
  if jq -e '.scripts.test' package.json >/dev/null; then
    if ! run_cmd "npm test" "NPM Tests"; then
      exit 1
    fi
  else
    log_warn "No 'test' script found in package.json"
  fi
  # Check if 'build' script exists
  if jq -e '.scripts.build' package.json >/dev/null; then
    if ! run_cmd "npm run build" "NPM Build"; then
      exit 1
    fi
  fi
  # If we ran at least one check, succeed. If neither script existed, warn but don't fail.
  if ! jq -e '.scripts.test' package.json >/dev/null && ! jq -e '.scripts.build' package.json >/dev/null; then
    log_warn "No test or build scripts found. Verification inconclusive."
  fi
  exit 0
fi

# --- PYTHON ---
if [ -f "pyproject.toml" ] || [ -f "requirements.txt" ] || [ -f "setup.py" ]; then
  log_info "Detected Python Project"
  if command -v pytest >/dev/null; then
    if run_cmd "pytest" "Pytest"; then
      exit 0
    fi
    exit 1
  else
    log_warn "pytest not found in PATH. Trying unittest..."
    if run_cmd "python3 -m unittest discover" "Unittest"; then
      exit 0
    fi
    exit 1
  fi
fi

# --- GO ---
if [ -f "go.mod" ]; then
  log_info "Detected Go Project"
  if run_cmd "go test ./..." "Go Tests"; then
    exit 0
  fi
  exit 1
fi

# --- MAKEFILE ---
if [ -f "Makefile" ]; then
  log_info "Detected Makefile"
  # Try 'test' target
  if grep -q "^test:" Makefile; then
    if ! run_cmd "make test" "Make Test"; then
      exit 1
    fi
  fi
  # Try 'build' or 'all' target
  if grep -q "^build:" Makefile; then
    if ! run_cmd "make build" "Make Build"; then
      exit 1
    fi
  elif grep -q "^all:" Makefile; then
    if ! run_cmd "make all" "Make All"; then
      exit 1
    fi
  fi
  exit 0
fi

# --- UNKNOWN ---
log_err "Could not detect project type (no flake.nix, package.json, Cargo.toml, etc.)"
log_err "Please create a known configuration or add a 'test' script."
exit 1