# Human Observability Design **Status**: Draft **Bead**: skills-yak **Epic**: skills-s6y (Multi-agent orchestration: Lego brick architecture) ## Overview This document defines the observability interface for human orchestrators managing AI workers. The design follows kubectl/docker patterns: one root command with subcommands, table output, watch mode, and detailed describe views. ## Design Decisions (from orch consensus) | Decision | Choice | Rationale | |----------|--------|-----------| | Command structure | One root, many subcommands | kubectl-style; composable; easy to extend | | Status columns | id, state, age, heartbeat, staleness | At-a-glance signal for stuck/healthy workers | | Stale threshold | 3x heartbeat = WARN, 10x = STALE | Avoid false positives from jitter | | Watch mode | `--watch` flag | Multi-agent is event-driven; real-time visibility | | Detail view | Full spec, transitions, errors, "why stuck" | Answer "what, why, where stuck?" | ## Commands ### Command Overview ``` worker status [options] # Dashboard table worker show # Detailed view worker logs # Message history worker stats # Aggregate metrics ``` ### `worker status` Dashboard view of all workers. ```bash worker status [--state STATE] [--stale] [--watch] [--json] ``` **Default Output**: ``` TASK STATE AGE HEARTBEAT STATUS SUMMARY skills-abc WORKING 45m 2m ago ok Fix auth bug skills-xyz IN_REVIEW 2h -- ok Add login form skills-123 WORKING 3h 12m ago STALE Refactor database skills-456 CONFLICTED 1h 5m ago blocked Update API endpoints ``` **Columns**: | Column | Source | Description | |--------|--------|-------------| | TASK | state file | Task ID | | STATE | state file | Current state (color-coded) | | AGE | state file | Time since created | | HEARTBEAT | bus.db | Time since last heartbeat | | STATUS | computed | ok, WARN, STALE, blocked, error | | SUMMARY | task description | First 30 chars of description | **State Colors** (if terminal supports): - 🟒 Green: WORKING - 🟑 Yellow: IN_REVIEW, ASSIGNED - πŸ”΄ Red: FAILED, CONFLICTED, STALE - βšͺ Gray: COMPLETED **Options**: | Option | Description | |--------|-------------| | `--state STATE` | Filter by state (WORKING, IN_REVIEW, etc.) | | `--stale` | Show only stale workers | | `--watch` | Refresh every 2 seconds | | `--json` | Output as JSON array | | `--wide` | Show additional columns (branch, worker type) | **Watch Mode**: ```bash worker status --watch # Clears screen, refreshes every 2s # Ctrl+C to exit ``` ### `worker show ` Detailed view of a single worker. ```bash worker show skills-abc [--events] [--json] ``` **Output**: ``` Task: skills-abc Description: Fix authentication bug in login flow State: WORKING Branch: feat/skills-abc Worktree: worktrees/skills-abc Created: 2026-01-10 14:00:00 (45m ago) State Changed: 2026-01-10 14:05:00 (40m ago) Last Heartbeat: 2026-01-10 14:43:00 (2m ago) Status: ok βœ“ Heartbeat within threshold βœ“ State progressing normally State History: 14:00:00 β†’ ASSIGNED (spawned) 14:05:00 β†’ WORKING (worker start) Git Status: Branch: feat/skills-abc Ahead of integration: 3 commits Behind integration: 0 commits Uncommitted changes: 2 files Recent Messages: 14:43:00 heartbeat status=working, progress=0.6 14:35:00 heartbeat status=working, progress=0.4 14:05:00 state_change ASSIGNED β†’ WORKING 14:00:00 task_assign from=orchestrator ``` **With `--events`**: Show full message history. **Sections**: 1. **Header**: Task ID, description, current state, branch 2. **Timestamps**: Created, state changed, last heartbeat 3. **Status Check**: Is it healthy? What's wrong? 4. **State History**: Timeline of transitions 5. **Git Status**: Branch status, commits, conflicts 6. **Recent Messages**: Last 10 messages from bus.db ### `worker logs ` Stream message history for a task. ```bash worker logs skills-abc [--follow] [--since 1h] [--type heartbeat] ``` **Output**: ``` 2026-01-10 14:00:00 task_assign from=orchestrator 2026-01-10 14:05:00 state_change ASSIGNED β†’ WORKING 2026-01-10 14:15:00 heartbeat status=working 2026-01-10 14:25:00 heartbeat status=working ... ``` **Options**: | Option | Description | |--------|-------------| | `--follow, -f` | Stream new messages as they arrive | | `--since DURATION` | Show messages from last N minutes/hours | | `--type TYPE` | Filter by message type | | `--limit N` | Show last N messages (default: 50) | ### `worker stats` Aggregate metrics across all workers. ```bash worker stats [--since 24h] ``` **Output**: ``` Workers by State: WORKING: 2 IN_REVIEW: 1 COMPLETED: 5 FAILED: 1 Total: 9 Health: Healthy: 3 (33%) Stale: 1 (11%) Timing (median): ASSIGNED β†’ WORKING: 2m WORKING β†’ IN_REVIEW: 45m IN_REVIEW β†’ APPROVED: 15m Full cycle: 1h 10m Failures (last 24h): 1 skills-789: Rebase conflict (3h ago) ``` ## Stale Detection ### Thresholds Based on heartbeat interval `H` (default: 10s): | Level | Threshold | Meaning | |-------|-----------|---------| | OK | < 3H (30s) | Normal operation | | WARN | 3H - 10H (30s - 100s) | Possible issue | | STALE | > 10H (100s) | Worker likely stuck/dead | | DEAD | > 30H (5m) | Worker definitely dead | ### Stale vs Stuck Two different problems: | Condition | Symptom | Detection | |-----------|---------|-----------| | **Stale worker** | No heartbeats | `now - last_heartbeat > threshold` | | **Stuck task** | Heartbeating but no progress | Same state for > N minutes | **Stuck detection**: ```sql SELECT task_id FROM workers WHERE state = 'WORKING' AND state_changed_at < datetime('now', '-30 minutes') AND last_heartbeat > datetime('now', '-1 minute') ``` ## Status Computation ```python def compute_status(worker: Worker) -> str: now = datetime.utcnow() heartbeat_age = (now - worker.last_heartbeat).total_seconds() state_age = (now - worker.state_changed_at).total_seconds() H = 10 # heartbeat interval # Check heartbeat freshness if worker.state in ('ASSIGNED', 'WORKING'): if heartbeat_age > 30 * H: # 5 min return 'DEAD' if heartbeat_age > 10 * H: # 100s return 'STALE' if heartbeat_age > 3 * H: # 30s return 'WARN' # Check for conflicts/failures if worker.state == 'CONFLICTED': return 'blocked' if worker.state == 'FAILED': return 'error' # Check for stuck (working but no progress) if worker.state == 'WORKING' and state_age > 30 * 60: # 30 min return 'stuck' return 'ok' ``` ## Output Formatting ### Table Format Use fixed-width columns for alignment: ```python COLUMNS = [ ('TASK', 12), ('STATE', 11), ('AGE', 6), ('HEARTBEAT', 10), ('STATUS', 7), ('SUMMARY', 30), ] def format_row(worker): return f"{worker.task_id:<12} {worker.state:<11} {format_age(worker.age):<6} ..." ``` ### JSON Format For scripting: ```json [ { "task_id": "skills-abc", "state": "WORKING", "age_seconds": 2700, "last_heartbeat": "2026-01-10T14:43:00Z", "status": "ok", "branch": "feat/skills-abc" } ] ``` ### Color Codes ```python STATE_COLORS = { 'ASSIGNED': 'yellow', 'WORKING': 'green', 'IN_REVIEW': 'yellow', 'APPROVED': 'green', 'COMPLETED': 'dim', 'CONFLICTED': 'red', 'FAILED': 'red', 'STALE': 'red', } STATUS_COLORS = { 'ok': 'green', 'WARN': 'yellow', 'STALE': 'red', 'DEAD': 'red', 'blocked': 'yellow', 'error': 'red', 'stuck': 'yellow', } ``` ## Integration ### With Worker CLI (skills-sse) `worker status` is part of the same CLI: ```python @worker.command() @click.option('--watch', is_flag=True) @click.option('--state') @click.option('--stale', is_flag=True) @click.option('--json', 'as_json', is_flag=True) def status(watch, state, stale, as_json): """Show worker dashboard.""" ... ``` ### With Message Bus (skills-ms5) Query SQLite for heartbeats: ```sql SELECT task_id, MAX(ts) as last_heartbeat FROM messages WHERE type = 'heartbeat' GROUP BY task_id ``` ### With State Files Read `.worker-state/workers/*.json` for current state. ## Implementation ### Python Package ``` skills/worker/ β”œβ”€β”€ commands/ β”‚ β”œβ”€β”€ status.py # Dashboard β”‚ β”œβ”€β”€ show.py # Detail view β”‚ β”œβ”€β”€ logs.py # Message history β”‚ └── stats.py # Aggregate metrics └── display/ β”œβ”€β”€ table.py # Table formatting β”œβ”€β”€ colors.py # Terminal colors └── watch.py # Watch mode loop ``` ### Watch Mode Implementation ```python import time import os def watch_loop(render_fn, interval=2): """Clear screen and re-render every interval seconds.""" try: while True: os.system('clear') # or use curses render_fn() time.sleep(interval) except KeyboardInterrupt: pass ``` ## MVP Scope For MVP, implement: 1. βœ… `worker status` - basic table with state, age, heartbeat 2. βœ… `worker status --watch` - refresh mode 3. βœ… `worker status --json` - JSON output 4. βœ… `worker show ` - basic detail view 5. ⏸️ `worker logs` - defer (can query SQLite directly) 6. ⏸️ `worker stats` - defer (nice to have) 7. ⏸️ TUI dashboard - defer (CLI sufficient for 2-4 workers) ## Open Questions 1. **Notification/alerts**: Should there be `worker watch --alert` that beeps on failures? 2. **Shell integration**: Prompt integration showing worker count/status? 3. **Log streaming**: Real-time follow for `worker logs --follow`? ## References - `docker ps` output format - `kubectl get pods` and `kubectl describe` - `systemctl status` detail level - `htop` for TUI inspiration (future)