# Human Observability Design
**Status**: Draft
**Bead**: skills-yak
**Epic**: skills-s6y (Multi-agent orchestration: Lego brick architecture)
## Overview
This document defines the observability interface for human orchestrators managing AI workers. The design follows kubectl/docker patterns: one root command with subcommands, table output, watch mode, and detailed describe views.
## Design Decisions (from orch consensus)
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Command structure | One root, many subcommands | kubectl-style; composable; easy to extend |
| Status columns | id, state, age, heartbeat, staleness | At-a-glance signal for stuck/healthy workers |
| Stale threshold | 3x heartbeat = WARN, 10x = STALE | Avoid false positives from jitter |
| Watch mode | `--watch` flag | Multi-agent is event-driven; real-time visibility |
| Detail view | Full spec, transitions, errors, "why stuck" | Answer "what, why, where stuck?" |
## Commands
### Command Overview
```
worker status [options]    # Dashboard table
worker show <task-id>      # Detailed view
worker logs <task-id>      # Message history
worker stats               # Aggregate metrics
```
### `worker status`
Dashboard view of all workers.
```bash
worker status [--state STATE] [--stale] [--watch] [--json] [--wide]
```
**Default Output**:
```
TASK         STATE       AGE    HEARTBEAT  STATUS  SUMMARY
skills-abc   WORKING     45m    2m ago     ok      Fix auth bug
skills-xyz   IN_REVIEW   2h     --         ok      Add login form
skills-123   WORKING     3h     12m ago    STALE   Refactor database
skills-456   CONFLICTED  1h     5m ago     blocked Update API endpoints
**Columns**:
| Column | Source | Description |
|--------|--------|-------------|
| TASK | state file | Task ID |
| STATE | state file | Current state (color-coded) |
| AGE | state file | Time since creation |
| HEARTBEAT | bus.db | Time since last heartbeat |
| STATUS | computed | ok, WARN, STALE, DEAD, blocked, error, stuck |
| SUMMARY | task description | First 30 chars of description |
**State Colors** (if the terminal supports color):
- 🟢 Green: WORKING, APPROVED
- 🟡 Yellow: IN_REVIEW, ASSIGNED
- 🔴 Red: FAILED, CONFLICTED, STALE
- ⚪ Gray: COMPLETED
**Options**:
| Option | Description |
|--------|-------------|
| `--state STATE` | Filter by state (WORKING, IN_REVIEW, etc.) |
| `--stale` | Show only stale workers |
| `--watch` | Refresh every 2 seconds |
| `--json` | Output as JSON array |
| `--wide` | Show additional columns (branch, worker type) |
**Watch Mode**:
```bash
worker status --watch
# Clears screen, refreshes every 2s
# Ctrl+C to exit
```
### `worker show <task-id>`
Detailed view of a single worker.
```bash
worker show skills-abc [--events] [--json]
```
**Output**:
```
Task:           skills-abc
Description:    Fix authentication bug in login flow
State:          WORKING
Branch:         feat/skills-abc
Worktree:       worktrees/skills-abc

Created:        2026-01-10 14:00:00 (45m ago)
State Changed:  2026-01-10 14:05:00 (40m ago)
Last Heartbeat: 2026-01-10 14:43:00 (2m ago)

Status: ok
  ✓ Heartbeat within threshold
  ✓ State progressing normally

State History:
  14:00:00 → ASSIGNED (spawned)
  14:05:00 → WORKING (worker start)

Git Status:
  Branch: feat/skills-abc
  Ahead of integration: 3 commits
  Behind integration: 0 commits
  Uncommitted changes: 2 files

Recent Messages:
  14:43:00 heartbeat     status=working, progress=0.6
  14:35:00 heartbeat     status=working, progress=0.4
  14:05:00 state_change  ASSIGNED → WORKING
  14:00:00 task_assign   from=orchestrator
```
**With `--events`**: Show full message history.
**Sections**:
1. **Header**: Task ID, description, current state, branch
2. **Timestamps**: Created, state changed, last heartbeat
3. **Status Check**: Is it healthy? What's wrong?
4. **State History**: Timeline of transitions
5. **Git Status**: Branch status, commits, conflicts
6. **Recent Messages**: Last 10 messages from bus.db
### `worker logs <task-id>`
Stream message history for a task.
```bash
worker logs skills-abc [--follow] [--since 1h] [--type heartbeat]
```
**Output**:
```
2026-01-10 14:00:00 task_assign from=orchestrator
2026-01-10 14:05:00 state_change ASSIGNED → WORKING
2026-01-10 14:15:00 heartbeat status=working
2026-01-10 14:25:00 heartbeat status=working
...
```
**Options**:
| Option | Description |
|--------|-------------|
| `--follow, -f` | Stream new messages as they arrive |
| `--since DURATION` | Show messages from last N minutes/hours |
| `--type TYPE` | Filter by message type |
| `--limit N` | Show last N messages (default: 50) |
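
The `--since DURATION` option needs a small parser for values such as `1h` or `24h`. A minimal sketch follows; the function name `parse_duration` and the accepted unit suffixes (`s`, `m`, `h`, `d`) are assumptions, since this document only shows hour values.

```python
import re
from datetime import timedelta

# Assumed unit suffixes; only "h" appears in the examples in this document.
_UNITS = {'s': 'seconds', 'm': 'minutes', 'h': 'hours', 'd': 'days'}


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '90s', '30m', '1h', or '2d' into a timedelta."""
    match = re.fullmatch(r'(\d+)([smhd])', text.strip().lower())
    if match is None:
        raise ValueError(f'invalid duration: {text!r}')
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


print(parse_duration('1h'))  # 1:00:00
```

The resulting `timedelta` can be subtracted from the current time to get the lower bound for the message query.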
### `worker stats`
Aggregate metrics across all workers.
```bash
worker stats [--since 24h]
```
**Output**:
```
Workers by State:
  WORKING:    2
  IN_REVIEW:  1
  COMPLETED:  5
  FAILED:     1
  Total:      9

Health:
  Healthy:  3 (33%)
  Stale:    1 (11%)

Timing (median):
  ASSIGNED → WORKING:    2m
  WORKING → IN_REVIEW:   45m
  IN_REVIEW → APPROVED:  15m
  Full cycle:            1h 10m

Failures (last 24h): 1
  skills-789: Rebase conflict (3h ago)
```
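
The aggregates in this output could be computed along the following lines. This is a sketch only: the helper names and the dict-based worker records are hypothetical, and the sample data simply mirrors the counts shown above.

```python
from collections import Counter
from statistics import median


def count_by_state(workers):
    """Count workers per state; each worker is a dict with a 'state' key."""
    return Counter(w['state'] for w in workers)


def percent(part, total):
    """Whole-number percentage, or 0 when there are no workers."""
    return round(100 * part / total) if total else 0


def median_seconds(durations):
    """Median transition time in seconds, or None when there is no data."""
    return median(durations) if durations else None


# Hypothetical sample matching the counts in the example output.
workers = (
    [{'state': 'WORKING'}] * 2
    + [{'state': 'IN_REVIEW'}]
    + [{'state': 'COMPLETED'}] * 5
    + [{'state': 'FAILED'}]
)
counts = count_by_state(workers)
total = sum(counts.values())
print(counts['WORKING'], total, percent(3, total))  # 2 9 33
```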
## Stale Detection
### Thresholds
Based on heartbeat interval `H` (default: 10s):
| Level | Threshold | Meaning |
|-------|-----------|---------|
| OK | < 3H (30s) | Normal operation |
| WARN | 3H - 10H (30s - 100s) | Possible issue |
| STALE | > 10H (100s) | Worker likely stuck/dead |
| DEAD | > 30H (5m) | Worker definitely dead |
### Stale vs Stuck
Two different problems:
| Condition | Symptom | Detection |
|-----------|---------|-----------|
| **Stale worker** | No heartbeats | `now - last_heartbeat > threshold` |
| **Stuck task** | Heartbeating but no progress | Same state for > N minutes |
**Stuck detection**:
```sql
SELECT task_id FROM workers
WHERE state = 'WORKING'
AND state_changed_at < datetime('now', '-30 minutes')
AND last_heartbeat > datetime('now', '-1 minute')
```
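
To see how the stuck query separates stuck tasks from stale ones, it can be run against an in-memory SQLite database. The `workers` table layout below is an assumption inferred from the column names in the query, not a confirmed schema.

```python
import sqlite3

STUCK_QUERY = """
SELECT task_id FROM workers
WHERE state = 'WORKING'
  AND state_changed_at < datetime('now', '-30 minutes')
  AND last_heartbeat > datetime('now', '-1 minute')
"""


def find_stuck(conn):
    """Return task IDs that are heartbeating but have not changed state."""
    return [row[0] for row in conn.execute(STUCK_QUERY)]


# Demo data; the table schema is hypothetical.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE workers '
    '(task_id TEXT, state TEXT, state_changed_at TEXT, last_heartbeat TEXT)'
)
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, datetime('now', ?), datetime('now', ?))",
    [
        ('skills-abc', 'WORKING', '-45 minutes', '-10 seconds'),  # stuck
        ('skills-xyz', 'WORKING', '-5 minutes', '-10 seconds'),   # progressing
        ('skills-123', 'WORKING', '-3 hours', '-12 minutes'),     # stale, not stuck
    ],
)
print(find_stuck(conn))  # ['skills-abc']
```

The third row shows why the two conditions are tracked separately: a worker that has stopped heartbeating is reported as stale by the heartbeat check, not as stuck.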
## Status Computation
```python
from datetime import datetime, timezone

H = 10  # heartbeat interval in seconds (default)

def compute_status(worker: 'Worker') -> str:
    # Assumes worker timestamps are timezone-aware UTC datetimes
    now = datetime.now(timezone.utc)
    state_age = (now - worker.state_changed_at).total_seconds()

    # Check heartbeat freshness; skipped when no heartbeat has been recorded
    if worker.state in ('ASSIGNED', 'WORKING') and worker.last_heartbeat is not None:
        heartbeat_age = (now - worker.last_heartbeat).total_seconds()
        if heartbeat_age > 30 * H:  # 5 min
            return 'DEAD'
        if heartbeat_age > 10 * H:  # 100s
            return 'STALE'
        if heartbeat_age > 3 * H:  # 30s
            return 'WARN'

    # Check for conflicts/failures
    if worker.state == 'CONFLICTED':
        return 'blocked'
    if worker.state == 'FAILED':
        return 'error'

    # Check for stuck (working but no progress)
    if worker.state == 'WORKING' and state_age > 30 * 60:  # 30 min
        return 'stuck'

    return 'ok'
```
## Output Formatting
### Table Format
Use fixed-width columns for alignment:
```python
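COLUMNS = [
    ('TASK', 12),
    ('STATE', 11),
    ('AGE', 6),
    ('HEARTBEAT', 10),
    ('STATUS', 7),
    ('SUMMARY', 30),
]

def format_row(worker):
    # Attribute names other than task_id, state, and age are assumptions;
    # worker.heartbeat is taken to be pre-formatted, e.g. '2m ago' or '--'.
    cells = [worker.task_id, worker.state, format_age(worker.age),
             worker.heartbeat, worker.status, worker.summary[:30]]
    return ' '.join(f'{cell:<{width}}' for cell, (_, width) in zip(cells, COLUMNS))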
```
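
`format_row` relies on a `format_age` helper that is not defined in this document. One possible version is sketched below, assuming the age is given in seconds and that only the largest unit is shown, as in the `45m` and `2h` values of the example table.

```python
def format_age(seconds: float) -> str:
    """Format an age in seconds as a compact string such as '45s', '45m', '2h', '3d'."""
    seconds = int(seconds)
    if seconds < 60:
        return f'{seconds}s'
    if seconds < 3600:
        return f'{seconds // 60}m'
    if seconds < 86400:
        return f'{seconds // 3600}h'
    return f'{seconds // 86400}d'


print(format_age(2700))  # 45m
```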
### JSON Format
For scripting:
```json
[
{
"task_id": "skills-abc",
"state": "WORKING",
"age_seconds": 2700,
"last_heartbeat": "2026-01-10T14:43:00Z",
"status": "ok",
"branch": "feat/skills-abc"
}
]
```
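
A sketch of how the `--json` output could be produced from in-memory rows. `WorkerRow` and `to_json` are hypothetical names; the fields mirror the JSON example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class WorkerRow:
    """Hypothetical row type; field names mirror the JSON example."""
    task_id: str
    state: str
    age_seconds: int
    last_heartbeat: Optional[str]  # ISO 8601 UTC string, or None if no heartbeat yet
    status: str
    branch: str


def to_json(rows):
    """Serialize rows as a JSON array, one object per worker."""
    return json.dumps([asdict(row) for row in rows], indent=2)


rows = [
    WorkerRow('skills-abc', 'WORKING', 2700, '2026-01-10T14:43:00Z', 'ok', 'feat/skills-abc'),
]
print(to_json(rows))
```

A missing heartbeat serializes as `null`, which scripts can distinguish from a stale timestamp.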
### Color Codes
```python
STATE_COLORS = {
    'ASSIGNED': 'yellow',
    'WORKING': 'green',
    'IN_REVIEW': 'yellow',
    'APPROVED': 'green',
    'COMPLETED': 'dim',
    'CONFLICTED': 'red',
    'FAILED': 'red',
    'STALE': 'red',
}

STATUS_COLORS = {
    'ok': 'green',
    'WARN': 'yellow',
    'STALE': 'red',
    'DEAD': 'red',
    'blocked': 'yellow',
    'error': 'red',
    'stuck': 'yellow',
}
```
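
The color names above still have to be mapped to terminal escape sequences. The following sketch uses standard ANSI SGR codes; the helper names are hypothetical, and honoring the `NO_COLOR` environment variable is an assumption rather than something this design specifies.

```python
import os
import sys

# Standard ANSI SGR codes for the color names used in the tables above.
ANSI = {'green': '32', 'yellow': '33', 'red': '31', 'dim': '2'}


def supports_color(stream=sys.stdout) -> bool:
    """Heuristic: color only on a TTY and only when NO_COLOR is unset or empty."""
    return stream.isatty() and not os.environ.get('NO_COLOR')


def colorize(text, color, enabled=None):
    """Wrap text in an ANSI color sequence, or return it unchanged if disabled."""
    if enabled is None:
        enabled = supports_color()
    if not enabled or color not in ANSI:
        return text
    return f'\033[{ANSI[color]}m{text}\033[0m'


print(repr(colorize('WORKING', 'green', enabled=True)))  # '\x1b[32mWORKING\x1b[0m'
```

Because the escape sequences add non-printing characters, cells should be padded to their column width before they are colorized, otherwise the table alignment breaks.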
## Integration
### With Worker CLI (skills-sse)
`worker status` is part of the same CLI:
```python
import click

@click.group()
def worker():
    """Root command; assumed to be defined once by the worker CLI."""

@worker.command()
@click.option('--watch', is_flag=True)
@click.option('--state')
@click.option('--stale', is_flag=True)
@click.option('--json', 'as_json', is_flag=True)
def status(watch, state, stale, as_json):
    """Show worker dashboard."""
    ...
```
### With Message Bus (skills-ms5)
Query SQLite for heartbeats:
```sql
SELECT task_id, MAX(ts) as last_heartbeat
FROM messages
WHERE type = 'heartbeat'
GROUP BY task_id
```
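
From Python, the heartbeat query can be run with the standard `sqlite3` module. In this sketch the `messages` table layout is an assumption based on the columns the query uses, and the demo runs against an in-memory database rather than the real `bus.db`.

```python
import sqlite3

LAST_HEARTBEAT_QUERY = """
SELECT task_id, MAX(ts) AS last_heartbeat
FROM messages
WHERE type = 'heartbeat'
GROUP BY task_id
"""


def last_heartbeats(conn):
    """Return {task_id: timestamp of most recent heartbeat}."""
    return dict(conn.execute(LAST_HEARTBEAT_QUERY).fetchall())


# Demo data; the table schema is hypothetical.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE messages (task_id TEXT, type TEXT, ts TEXT)')
conn.executemany('INSERT INTO messages VALUES (?, ?, ?)', [
    ('skills-abc', 'task_assign', '2026-01-10 14:00:00'),
    ('skills-abc', 'heartbeat', '2026-01-10 14:35:00'),
    ('skills-abc', 'heartbeat', '2026-01-10 14:43:00'),
    ('skills-xyz', 'state_change', '2026-01-10 14:05:00'),
])
print(last_heartbeats(conn))  # {'skills-abc': '2026-01-10 14:43:00'}
```

Tasks that have never sent a heartbeat are absent from the result, so the dashboard can render `--` for them. Since the dashboard only reads, the database could be opened read-only, for example with `sqlite3.connect('file:bus.db?mode=ro', uri=True)`.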
### With State Files
Read `.worker-state/workers/*.json` for current state.
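
A minimal sketch of reading the state files. The function name `load_states` is hypothetical, and skipping unreadable files is an assumed policy chosen so that a file caught mid-write does not crash the dashboard.

```python
import json
from pathlib import Path


def load_states(root='.worker-state/workers'):
    """Load every worker state file, skipping files that fail to read or parse."""
    states = []
    for path in sorted(Path(root).glob('*.json')):
        try:
            states.append(json.loads(path.read_text()))
        except (OSError, json.JSONDecodeError):
            continue  # file may be mid-write or already removed
    return states
```

If the directory does not exist yet, the glob yields nothing and the function returns an empty list.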
## Implementation
### Python Package
```
skills/worker/
├── commands/
│   ├── status.py    # Dashboard
│   ├── show.py      # Detail view
│   ├── logs.py      # Message history
│   └── stats.py     # Aggregate metrics
└── display/
    ├── table.py     # Table formatting
    ├── colors.py    # Terminal colors
    └── watch.py     # Watch mode loop
```
### Watch Mode Implementation
```python
import os
import time

def watch_loop(render_fn, interval=2):
    """Clear screen and re-render every interval seconds."""
    try:
        while True:
            # 'cls' on Windows, 'clear' elsewhere; curses is an alternative
            os.system('cls' if os.name == 'nt' else 'clear')
            render_fn()
            time.sleep(interval)
    except KeyboardInterrupt:
        pass  # Ctrl+C exits watch mode
```
## MVP Scope
For MVP, implement:
1. `worker status` - basic table with state, age, heartbeat
2. `worker status --watch` - refresh mode
3. `worker status --json` - JSON output
4. `worker show <id>` - basic detail view
5. ⏸️ `worker logs` - defer (can query SQLite directly)
6. ⏸️ `worker stats` - defer (nice to have)
7. ⏸️ TUI dashboard - defer (CLI sufficient for 2-4 workers)
## Open Questions
1. **Notification/alerts**: Should `worker status --watch` accept an `--alert` flag that beeps on failures?
2. **Shell integration**: Prompt integration showing worker count/status?
3. **Log streaming**: Real-time follow for `worker logs --follow`?
## References
- `docker ps` output format
- `kubectl get pods` and `kubectl describe`
- `systemctl status` detail level
- `htop` for TUI inspiration (future)