# ops-review Skill Design

A multi-lens review skill for operational infrastructure, modeled on code-review.

## Problem Statement

Ops artifacts (Nix configs, shell scripts, Python automation, Docker Compose, CI/CD) accumulate technical debt and security issues just like application code. Unlike code, they rarely get systematic review.

## Target Artifacts

Based on actual infrastructure in dotfiles and prox-setup:

| Category | Examples |
|----------|----------|
| **Nix/NixOS** | flake.nix, modules/*.nix, home-manager configs |
| **Shell Scripts** | bin/*.sh, setup_*.sh, fix_*.sh, deploy.sh |
| **Python Automation** | Proxmox API scripts, multi-stage deployments |
| **Container Configs** | docker-compose.yml, Dockerfile |
| **CI/CD** | .gitea/workflows/*.yml, .github/actions/*.yml |
| **Service Configs** | systemd units, Ory configs, SOPS files |

## Architecture: Linter-First Hybrid

**Consensus from model review**: Use deterministic tools as primary signals, LLM for interpretation and semantic analysis.

```
Stage 1: Static Tools (fast, deterministic)
    ├── shellcheck for shell scripts
    ├── statix + deadnix for Nix
    ├── hadolint for Dockerfiles
    └── yamllint for YAML configs

Stage 2: LLM Analysis (semantic, contextual)
    ├── Interprets tool output in context
    ├── Finds logic bugs tools miss
    ├── Synthesizes cross-file issues
    └── Suggests actionable fixes
```

**Why**: LLMs hallucinate syntax but excel at understanding intent and impact. Tools catch syntax but miss semantics.

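A minimal sketch of how Stage 1 might route each file to its static tools. The function name `stage1_commands` and the file-matching rules are assumptions for illustration; only basic invocations of each tool are used, and parsing their output is left out.

```python
from pathlib import PurePath


def stage1_commands(path: str) -> list[list[str]]:
    """Return the static-tool command lines to run for one file (hypothetical routing)."""
    name = PurePath(path).name
    suffix = PurePath(path).suffix
    if suffix == ".sh":
        return [["shellcheck", "--format=json", path]]
    if suffix == ".nix":
        # Two tools cover Nix: statix for anti-patterns, deadnix for dead code.
        return [["statix", "check", path], ["deadnix", path]]
    if name == "Dockerfile" or name.startswith("Dockerfile."):
        return [["hadolint", path]]
    if suffix in (".yml", ".yaml"):
        return [["yamllint", path]]
    # No static tool applies: Stage 2 reviews the file without tool output.
    return []
```

Each command list can be handed to `subprocess.run`; the collected output then becomes context for the Stage 2 pass.
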
## Proposed Lenses (10 total)

### Core Safety (Phase 1)

#### 1. secrets
**Focus**: Credential hygiene
- Hardcoded secrets, API keys, tokens
- SOPS config issues
- Secrets in logs or error messages
- Secrets passed via CLI args (visible in process list)
- Missing encryption for sensitive data

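As a rough illustration of the "hardcoded secrets" check, a line-based scan could look like the sketch below. The pattern set is an assumption and deliberately small; a real lens would need a broader, tuned rule set and would still rely on the LLM pass to judge context.

```python
import re

# Illustrative patterns only (assumed, not the lens's actual rules).
SECRET_PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # A quoted literal of 8+ characters; values starting with "$" are treated
    # as variable references and skipped.
    "inline-credential": re.compile(
        r"(?i)\b(password|passwd|secret|token|api_key)\s*[:=]\s*['\"][^'\"$]{8,}['\"]"
    ),
}


def scan_for_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for each suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```
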
#### 2. shell-safety
**Focus**: Shell script robustness (backed by shellcheck)
- Missing `set -euo pipefail`
- Unquoted variables (SC2086)
- Unsafe command substitution
- Missing error handling
- Hardcoded paths that should be parameters

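The `set -euo pipefail` check is simple enough to sketch directly. This is an assumed helper, not part of shellcheck; it only inspects plain `set` lines and strips comments naively at the first `#`.

```python
LONG_OPTIONS = ("errexit", "nounset", "pipefail")


def missing_strict_mode(script: str) -> list[str]:
    """Return which of errexit, nounset, pipefail the script never enables."""
    enabled = set()
    for line in script.splitlines():
        tokens = line.split("#", 1)[0].split()  # naive comment stripping
        if not tokens or tokens[0] != "set":
            continue
        for i, tok in enumerate(tokens[1:], start=1):
            prev = tokens[i - 1]
            if tok in LONG_OPTIONS and prev.startswith("-") and prev.endswith("o"):
                enabled.add(tok)  # long form, e.g. `set -o pipefail`
            elif tok.startswith("-") and not tok.startswith("--"):
                if "e" in tok:
                    enabled.add("errexit")
                if "u" in tok:
                    enabled.add("nounset")
    return [opt for opt in LONG_OPTIONS if opt not in enabled]
```
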
#### 3. blast-radius
**Focus**: Change safety and risk containment
- Destructive operations without confirmation
- Missing dry-run mode
- No rollback strategy
- Bulk operations without batching
- Missing pre-flight checks
- No canary/progressive approach

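For reference, the pattern this lens looks for (dry-run by default, a confirmation gate, and batching) can be sketched as follows. `run_destructive` and its parameters are hypothetical names chosen for this example.

```python
from typing import Callable, Iterable


def run_destructive(
    targets: Iterable[str],
    action: Callable[[str], None],
    *,
    dry_run: bool = True,                              # safe default: report only
    confirm: Callable[[str], bool] = lambda _prompt: False,
    batch_size: int = 5,
) -> list[str]:
    """Apply `action` to targets in confirmed batches; return a log of what happened."""
    targets = list(targets)
    if dry_run:
        return [f"would act on {t}" for t in targets]
    log = []
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        if not confirm(f"Proceed with {len(batch)} target(s): {', '.join(batch)}?"):
            log.append(f"aborted before {batch[0]}")
            break
        for target in batch:
            action(target)
            log.append(f"done {target}")
    return log
```

A script that lacks all three of these guards around a destructive call is the typical blast-radius finding.
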
#### 4. privilege
**Focus**: Least privilege violations
- Unnecessary sudo/root usage
- Containers running as root
- Overly permissive file modes (chmod 777)
- Missing capability drops
- Docker socket mounting
- systemd units without sandboxing (ProtectSystem, PrivateTmp)

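The systemd sandboxing check could be approximated as below. This is a sketch under assumptions: it reads only the `[Service]` section, treats an explicit off value as missing, and checks just the two directives named above (the `required` tuple is meant to be extended).

```python
OFF_VALUES = {"", "no", "false", "off", "0"}


def missing_sandboxing(
    unit_text: str,
    required: tuple[str, ...] = ("ProtectSystem", "PrivateTmp"),
) -> list[str]:
    """Return the required directives that the [Service] section does not enable."""
    present = set()
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, value = line.split("=", 1)
            if value.strip().lower() not in OFF_VALUES:
                present.add(key.strip())
    return [directive for directive in required if directive not in present]
```
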
### Reliability (Phase 2)

#### 5. idempotency
**Focus**: Safe re-execution and convergence
- Scripts that break on re-run
- Missing existence checks (create-if-not-exists)
- Non-atomic operations (partial failure states)
- Check-then-act race conditions
- Missing cleanup on failure

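A small sketch of what "idempotent and atomic" looks like in Python automation: converge a file to the desired content, write through a temp file, and clean up on failure. The helper name `write_atomically` is an assumption; note that the resulting file inherits the restrictive mode `mkstemp` sets.

```python
import os
import tempfile


def write_atomically(path: str, content: str) -> bool:
    """Converge `path` to `content`; return True only if a change was made."""
    try:
        with open(path, encoding="utf-8") as existing:
            if existing.read() == content:
                return False  # already converged: re-running is a no-op
    except FileNotFoundError:
        pass
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)  # create-if-not-exists
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
        os.replace(tmp_path, path)  # atomic rename within the same directory
    except BaseException:
        os.unlink(tmp_path)  # cleanup on failure, no partial file left behind
        raise
    return True
```
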
#### 6. supply-chain
**Focus**: Dependency provenance and pinning
- Unpinned versions (`latest` tags, floating refs)
- GitHub/Gitea actions not pinned to SHA
- Missing Nix flake.lock or SRI hashes
- Unsigned artifacts
- Untrusted substituters/registries

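The first two checks lend themselves to a line-based scan of workflow and Compose files. The sketch below is an assumption about how that might be done: it uses regexes rather than a YAML parser, accepts any explicit image tag other than `latest` as pinned, and requires a full 40-character SHA for actions.

```python
import re

USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)")
IMAGE_RE = re.compile(r"^\s*image:\s*([^\s#]+)")
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_refs(yaml_text: str) -> list[tuple[int, str]]:
    """Return (line_number, reference) for actions and images that float."""
    findings = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if match:
            ref = match.group(1).strip("'\"")
            if ref.startswith("./"):
                continue  # local action, versioned with the repository
            version = ref.partition("@")[2]
            if not FULL_SHA_RE.match(version):
                findings.append((lineno, ref))
            continue
        match = IMAGE_RE.match(line)
        if match:
            image = match.group(1).strip("'\"")
            if "@sha256:" in image:
                continue  # digest-pinned
            # Look for the tag only after the last "/" so registry ports are ignored.
            tag = image.rsplit("/", 1)[-1].partition(":")[2]
            if tag in ("", "latest"):
                findings.append((lineno, image))
    return findings
```
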
#### 7. observability
**Focus**: Visibility into system state
- Silent failures (no logging/alerting)
- Missing health checks (Docker healthcheck, systemd ExecStartPre)
- Incomplete metrics coverage
- Missing structured logging
- No correlation IDs in multi-step scripts

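For Python automation, structured logging with a correlation ID needs only the standard library. The sketch below is one possible shape, not a prescribed format; the field names and the `make_run_logger` helper are assumptions.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "run_id": getattr(record, "run_id", None),
            "message": record.getMessage(),
        })


def make_run_logger(stream, run_id=None) -> logging.LoggerAdapter:
    """Return a logger whose every line carries the same run_id."""
    run_id = run_id or uuid.uuid4().hex
    logger = logging.getLogger(f"ops.{run_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output on our handler only
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    # The adapter injects run_id into every record as an extra attribute.
    return logging.LoggerAdapter(logger, {"run_id": run_id})
```

Every step of a multi-stage deployment then logs through the same adapter, so lines from one run can be grouped by `run_id`.
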
### Architecture (Phase 3)

#### 8. nix-hygiene
**Focus**: Nix-specific quality (backed by statix/deadnix)
- Dead code (unused let bindings, imports)
- Anti-patterns (with lib abuse, IFD without justification)
- Module boundary violations
- Overlay/override issues
- Missing type annotations on options

#### 9. resilience
**Focus**: Runtime fault tolerance
- Missing timeouts on network calls
- No retries with backoff/jitter
- Missing circuit breakers for API calls
- No graceful shutdown handling (SIGTERM)
- Missing resource limits (systemd MemoryMax, Docker mem_limit)

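"Retries with backoff/jitter" can be sketched in a few lines. This version uses exponential backoff with full jitter; the `retry` helper, its defaults, and the set of retryable exceptions are assumptions, and `sleep` and `rng` are injectable only to keep the example testable.

```python
import random
import time


def retry(
    call,
    *,
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep=time.sleep,
    rng=random.random,
):
    """Call `call()` until it succeeds or `attempts` is exhausted."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: random delay in [0, cap)
```

The wrapped call should still set its own timeout; retries do not replace one.
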
#### 10. orchestration
**Focus**: Execution ordering and coupling (formerly dependency-chains)
- Unclear prerequisites
- Missing documentation of execution order
- Circular dependencies
- Scripts assuming prior state without checking
- Implicit coupling between components

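Once prerequisites are written down as a mapping, both execution order and circular dependencies fall out of a topological sort. A minimal sketch using the standard library's `graphlib`; the function names and any script names passed in are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(prereqs: dict[str, set[str]]) -> list[str]:
    """Order scripts so each runs after its prerequisites.

    `prereqs` maps a script to the set of scripts that must run before it.
    """
    return list(TopologicalSorter(prereqs).static_order())


def find_cycle(prereqs: dict[str, set[str]]):
    """Return one circular dependency as a list of nodes, or None."""
    try:
        execution_order(prereqs)
    except CycleError as exc:
        return exc.args[1]  # the cycle; first and last node are the same
    return None
```
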
## Crisp Boundaries

To avoid duplicate findings across overlapping lenses:

| Lens | Owns | Does NOT Own |
|------|------|--------------|
| **idempotency** | Safe re-run, convergence, atomic writes, create-if-not-exists | Rollback (blast-radius), retries (resilience) |
| **resilience** | Runtime fault tolerance, timeouts, retries, graceful shutdown | Change safety (blast-radius), re-run safety (idempotency) |
| **blast-radius** | Change safety, dry-run, rollback, confirmation gates, batching | Runtime behavior (resilience), re-run safety (idempotency) |

## Skill Structure

```
skills/ops-review/
├── SKILL.md              # Agent instructions (workflow)
├── README.md             # User documentation
└── lenses/
    ├── README.md         # Lens index
    ├── secrets.md
    ├── shell-safety.md
    ├── blast-radius.md
    ├── privilege.md
    ├── idempotency.md
    ├── supply-chain.md
    ├── observability.md
    ├── nix-hygiene.md
    ├── resilience.md
    └── orchestration.md
```

Lenses deploy to `~/.config/lenses/ops/` via home-manager.

## Workflow

### Standard Mode
1. **Target selection** - files/directory to review
2. **Pre-pass** - Run static tools (shellcheck, statix, etc.)
3. **Reference mapping** - Build lightweight call graph (source, imports, ExecStart)
4. **Lens execution** - One pass per lens, tool output in context
5. **Synthesis** - Dedupe across lenses, rank by severity
6. **Interactive review** - User approves findings
7. **Issue filing** - `bd create` for approved items

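The synthesis step (step 5) can be sketched as a dedupe-then-sort over lens findings. The `Finding` fields and the dedupe key (file, line, issue text) are assumptions; in practice, matching findings across lenses probably needs fuzzier comparison than exact issue text.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"HIGH": 0, "MED": 1, "LOW": 2}


@dataclass(frozen=True)
class Finding:
    lens: str
    severity: str
    file: str
    line: int
    issue: str


def synthesize(findings: list[Finding]) -> list[Finding]:
    """Drop duplicates reported by several lenses, then rank by severity."""
    best: dict[tuple[str, int, str], Finding] = {}
    for finding in findings:
        key = (finding.file, finding.line, finding.issue)
        kept = best.get(key)
        # Keep the most severe report when two lenses flag the same spot.
        if kept is None or SEVERITY_RANK[finding.severity] < SEVERITY_RANK[kept.severity]:
            best[key] = finding
    return sorted(
        best.values(),
        key=lambda f: (SEVERITY_RANK[f.severity], f.file, f.line),
    )
```
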
### Quick Mode (`--quick`)
Runs Phase 1 lenses only: secrets, shell-safety, blast-radius, privilege.
Ideal for pre-commit or CI gates.

## Output Format

Per-lens findings:
```
[LENS-TAG] <severity:HIGH|MED|LOW> <file:line>
Issue: <what's wrong>
Suggest: <how to fix>
Evidence: <why it matters>
```

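One plausible rendering of this template, as a sketch: the exact header layout (upper-cased lens tag, bare severity, `file:line`) is an assumption about how the placeholders are filled, and `format_finding` is a hypothetical helper.

```python
VALID_SEVERITIES = ("HIGH", "MED", "LOW")


def format_finding(
    lens: str,
    severity: str,
    file: str,
    line: int,
    issue: str,
    suggest: str,
    evidence: str,
) -> str:
    """Render one finding in the four-line per-lens format."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {VALID_SEVERITIES}, got {severity!r}")
    return "\n".join([
        f"[{lens.upper()}] {severity} {file}:{line}",
        f"Issue: {issue}",
        f"Suggest: {suggest}",
        f"Evidence: {evidence}",
    ])
```
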
### Severity Rubric

| Severity | Criteria |
|----------|----------|
| **HIGH** | Exploitable vulnerability, data loss risk, or will break on next run |
| **MED** | Reliability issue, tech debt, or violation of best practice |
| **LOW** | Polish, maintainability, or defense-in-depth improvement |

Context matters: the same issue may be HIGH in production but LOW in a homelab.

## Cross-File Awareness

Build a simple reference map before review:
- **Shell**: `source`, `.` includes, invoked scripts
- **Nix**: imports, flake inputs
- **CI**: referenced scripts, env vars, secrets names
- **Compose**: service dependencies, volumes, env files
- **systemd**: ExecStart targets, dependencies

This enables finding issues in the seams between components.

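For the shell case, a grep-style reference map might look like the sketch below. It covers only `source` and `.` includes at the start of a line; invoked scripts and the other artifact types would need their own patterns. The function name and input shape (file name to file text) are assumptions.

```python
import re

# Matches `source path` or `. path` at the start of a line.
SOURCE_RE = re.compile(r"^\s*(?:source|\.)\s+([^\s;&|]+)")


def shell_references(scripts: dict[str, str]) -> dict[str, list[str]]:
    """Map each script name to the files it sources, in order of appearance."""
    refs = {}
    for name, text in scripts.items():
        found = []
        for line in text.splitlines():
            match = SOURCE_RE.match(line)
            if match:
                found.append(match.group(1).strip("'\""))
        refs[name] = found
    return refs
```
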
## Implementation Phases

### Phase 1: Safety Net (High ROI, Low Ambiguity)
1. **secrets** - Non-negotiable, prevents catastrophes
2. **shell-safety** - Most brittle artifact type, shellcheck-backed
3. **blast-radius** - Where LLMs shine (understanding implications)
4. **privilege** - Highly actionable, high impact

### Phase 2: Reliability Layer
5. **idempotency** - Essential for setup/deploy scripts
6. **supply-chain** - Critical for reproducibility
7. **observability** - Easy to check, high debugging value

### Phase 3: Architecture Polish
8. **nix-hygiene** - statix/deadnix backed, LLM explains
9. **resilience** - Needs nuance to avoid bad advice
10. **orchestration** - Most complex, needs full context

## Design Decisions

1. **Linter-first, LLM-second**: Static tools for syntax, LLM for semantics
2. **Crisp lens boundaries**: Each rule has one primary owner
3. **Severity tied to impact**: Not all violations are equal
4. **Quick mode**: Phase 1 for pre-commit/CI
5. **Cross-file awareness**: Grep-based reference mapping
6. **Escape hatches**: Intentional patterns can be flagged + suppressed

## Success Criteria

- Can review dotfiles/ and find real issues
- Can review prox-setup/ and find real issues
- Findings are actionable, not noise
- Phase 1 lenses have <10% false positive rate
- Integrates with existing bd issue tracking
- Quick mode runs in <30 seconds

## Open Questions (Resolved)

| Question | Resolution |
|----------|------------|
| Nix: statix/deadnix or pure LLM? | **Hybrid**: Tools first, LLM interprets |
| Shell: integrate shellcheck? | **Yes**: Treat as compiler, LLM groups/prioritizes |
| Multi-file dependencies? | **Grep-based reference map** pre-pass |
| Quick mode? | **Yes**: Phase 1 lenses only |
| Prioritize across artifact types? | **By risk**: secrets/destructive ops first, not file type |

## References

- [Google SRE Book](https://sre.google/sre-book/table-of-contents/)
- [OWASP Infrastructure Security](https://owasp.org/www-project-devsecops-guideline/)
- Consensus review: sonar, flash-or, gemini, gpt (2025-01-01)