skills/docs/design/human-observability.md
dan 1c66d019bd feat: add worker CLI scaffold in Nim
Multi-agent coordination CLI with SQLite message bus:
- State machine: ASSIGNED -> WORKING -> IN_REVIEW -> APPROVED -> COMPLETED
- Commands: spawn, start, done, approve, merge, cancel, fail, heartbeat
- SQLite WAL mode, dedicated heartbeat thread, channel-based IPC
- cligen for CLI, tiny_sqlite for DB, ORC memory management

Design docs for branch-per-worker, state machine, message passing,
and human observability patterns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 18:47:47 -08:00


Human Observability Design

Status: Draft
Bead: skills-yak
Epic: skills-s6y (Multi-agent orchestration: Lego brick architecture)

Overview

This document defines the observability interface for human orchestrators managing AI workers. The design follows kubectl/docker patterns: one root command with subcommands, table output, watch mode, and detailed describe views.

Design Decisions (from orch consensus)

| Decision | Choice | Rationale |
|---|---|---|
| Command structure | One root, many subcommands | kubectl-style; composable; easy to extend |
| Status columns | id, state, age, heartbeat, staleness | At-a-glance signal for stuck/healthy workers |
| Stale threshold | 3x heartbeat = WARN, 10x = STALE | Avoid false positives from jitter |
| Watch mode | --watch flag | Multi-agent is event-driven; real-time visibility |
| Detail view | Full spec, transitions, errors, "why stuck" | Answer "what, why, where stuck?" |

Commands

Command Overview

worker status [options]     # Dashboard table
worker show <task-id>       # Detailed view
worker logs <task-id>       # Message history
worker stats                # Aggregate metrics

worker status

Dashboard view of all workers.

worker status [--state STATE] [--stale] [--watch] [--json] [--wide]

Default Output:

TASK         STATE       AGE    HEARTBEAT   STATUS    SUMMARY
skills-abc   WORKING     45m    2m ago      ok        Fix auth bug
skills-xyz   IN_REVIEW   2h     --          ok        Add login form
skills-123   WORKING     3h     12m ago     STALE     Refactor database
skills-456   CONFLICTED  1h     5m ago      blocked   Update API endpoints

Columns:

| Column | Source | Description |
|---|---|---|
| TASK | state file | Task ID |
| STATE | state file | Current state (color-coded) |
| AGE | state file | Time since created |
| HEARTBEAT | bus.db | Time since last heartbeat |
| STATUS | computed | ok, WARN, STALE, DEAD, blocked, error, stuck |
| SUMMARY | task description | First 30 chars of description |

State Colors (if terminal supports):

  • 🟢 Green: WORKING, APPROVED
  • 🟡 Yellow: IN_REVIEW, ASSIGNED
  • 🔴 Red: FAILED, CONFLICTED, and any worker whose status is STALE
  • Gray: COMPLETED

Options:

| Option | Description |
|---|---|
| --state STATE | Filter by state (WORKING, IN_REVIEW, etc.) |
| --stale | Show only stale workers |
| --watch | Refresh every 2 seconds |
| --json | Output as JSON array |
| --wide | Show additional columns (branch, worker type) |

Watch Mode:

worker status --watch
# Clears screen, refreshes every 2s
# Ctrl+C to exit

worker show <task-id>

Detailed view of a single worker.

worker show skills-abc [--events] [--json]

Output:

Task: skills-abc
Description: Fix authentication bug in login flow
State: WORKING
Branch: feat/skills-abc
Worktree: worktrees/skills-abc

Created:        2026-01-10 14:00:00 (45m ago)
State Changed:  2026-01-10 14:05:00 (40m ago)
Last Heartbeat: 2026-01-10 14:43:00 (2m ago)

Status: ok
  ✓ Heartbeat within threshold
  ✓ State progressing normally

State History:
  14:00:00  → ASSIGNED   (spawned)
  14:05:00  → WORKING    (worker start)

Git Status:
  Branch: feat/skills-abc
  Ahead of integration: 3 commits
  Behind integration: 0 commits
  Uncommitted changes: 2 files

Recent Messages:
  14:43:00  heartbeat    status=working, progress=0.6
  14:35:00  heartbeat    status=working, progress=0.4
  14:05:00  state_change ASSIGNED → WORKING
  14:00:00  task_assign  from=orchestrator

With --events: Show full message history.

Sections:

  1. Header: Task ID, description, current state, branch
  2. Timestamps: Created, state changed, last heartbeat
  3. Status Check: Is it healthy? What's wrong?
  4. State History: Timeline of transitions
  5. Git Status: Branch status, commits, conflicts
  6. Recent Messages: Last 10 messages from bus.db

worker logs <task-id>

Stream message history for a task.

worker logs skills-abc [--follow] [--since 1h] [--type heartbeat]

Output:

2026-01-10 14:00:00  task_assign     from=orchestrator
2026-01-10 14:05:00  state_change    ASSIGNED → WORKING
2026-01-10 14:15:00  heartbeat       status=working
2026-01-10 14:25:00  heartbeat       status=working
...

Options:

| Option | Description |
|---|---|
| --follow, -f | Stream new messages as they arrive |
| --since DURATION | Show messages from last N minutes/hours |
| --type TYPE | Filter by message type |
| --limit N | Show last N messages (default: 50) |
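
The --since option takes a duration such as 1h or 24h. A minimal sketch of how such a value could be parsed is shown here; the function name parse_duration and the accepted unit suffixes are assumptions for illustration (the design only mentions minutes and hours), not part of the specified CLI.

```python
import re
from datetime import timedelta

# Assumed unit suffixes. The design mentions minutes and hours;
# seconds and days are included here only for convenience.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '30m', '1h', or '24h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})
```

The resulting timedelta can be subtracted from the current time to get the cutoff timestamp for the message query.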

worker stats

Aggregate metrics across all workers.

worker stats [--since 24h]

Output:

Workers by State:
  WORKING:    2
  IN_REVIEW:  1
  COMPLETED:  5
  FAILED:     1
  Total:      9

Health:
  Healthy:    3 (33%)
  Stale:      1 (11%)

Timing (median):
  ASSIGNED → WORKING:     2m
  WORKING → IN_REVIEW:    45m
  IN_REVIEW → APPROVED:   15m
  Full cycle:             1h 10m

Failures (last 24h): 1
  skills-789: Rebase conflict (3h ago)
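
The median timings above could be derived from each task's state history. The sketch below assumes a hypothetical in-memory shape (task id mapped to a time-ordered list of (epoch_seconds, state) tuples); the real data would come from the state files or bus.db.

```python
from statistics import median


def median_transition_seconds(histories, src, dst):
    """Median seconds between entering `src` and then entering `dst`.

    `histories` maps task id -> list of (epoch_seconds, state) tuples,
    ordered by time. This shape is an assumption for illustration.
    Returns None when no task made the transition.
    """
    durations = []
    for events in histories.values():
        # Walk consecutive pairs of state entries for this task.
        for (t0, s0), (t1, s1) in zip(events, events[1:]):
            if s0 == src and s1 == dst:
                durations.append(t1 - t0)
    return median(durations) if durations else None


# Made-up sample data, not taken from a real run.
histories = {
    "skills-abc": [(0, "ASSIGNED"), (120, "WORKING"), (2820, "IN_REVIEW")],
    "skills-xyz": [(0, "ASSIGNED"), (60, "WORKING"), (3060, "IN_REVIEW")],
    "skills-123": [(0, "ASSIGNED"), (300, "WORKING")],
}
```

Per-transition medians do not need to sum to the full-cycle median, since each is computed over a different set of tasks.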

Stale Detection

Thresholds

Based on heartbeat interval H (default: 10s):

| Level | Threshold | Meaning |
|---|---|---|
| OK | < 3H (30s) | Normal operation |
| WARN | 3H - 10H (30s - 100s) | Possible issue |
| STALE | 10H - 30H (100s - 5m) | Worker likely stuck/dead |
| DEAD | > 30H (5m) | Worker definitely dead |

Stale vs Stuck

Two different problems:

| Condition | Symptom | Detection |
|---|---|---|
| Stale worker | No heartbeats | now - last_heartbeat > threshold |
| Stuck task | Heartbeating but no progress | Same state for > N minutes |

Stuck detection:

SELECT task_id FROM workers
WHERE state = 'WORKING'
  AND state_changed_at < datetime('now', '-30 minutes')
  AND last_heartbeat > datetime('now', '-1 minute')
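
To try the stuck-detection query without a real bus, it can be run against an in-memory SQLite database. This is a sketch only: the workers table layout and the sample rows are assumptions, and timestamps are stored in SQLite's default 'YYYY-MM-DD HH:MM:SS' UTC text format so that string comparison matches time order.

```python
import sqlite3

STUCK_SQL = """
SELECT task_id FROM workers
WHERE state = 'WORKING'
  AND state_changed_at < datetime('now', '-30 minutes')
  AND last_heartbeat > datetime('now', '-1 minute')
ORDER BY task_id
"""


def find_stuck(conn: sqlite3.Connection) -> list[str]:
    """Return task ids that are heartbeating but have not changed state."""
    return [row[0] for row in conn.execute(STUCK_SQL)]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with an assumed `workers` schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workers (task_id TEXT PRIMARY KEY, state TEXT, "
        "state_changed_at TEXT, last_heartbeat TEXT)"
    )
    conn.executemany(
        "INSERT INTO workers VALUES (?, ?, datetime('now', ?), datetime('now', ?))",
        [
            ("skills-abc", "WORKING", "-5 minutes", "-10 seconds"),   # healthy
            ("skills-123", "WORKING", "-45 minutes", "-10 seconds"),  # stuck
            ("skills-456", "WORKING", "-45 minutes", "-20 minutes"),  # stale, not stuck
        ],
    )
    return conn
```

Only the second row matches: it has been in WORKING for more than 30 minutes while still sending heartbeats.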

Status Computation

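from datetime import datetime, timezone

H = 10  # heartbeat interval in seconds

def compute_status(worker: "Worker") -> str:
    # Assumes worker timestamps are timezone-aware UTC datetimes.
    now = datetime.now(timezone.utc)
    state_age = (now - worker.state_changed_at).total_seconds()

    # Check heartbeat freshness. Only ASSIGNED and WORKING are expected to
    # heartbeat; other states (e.g. IN_REVIEW) may have no heartbeat at all.
    if worker.state in ('ASSIGNED', 'WORKING'):
        # Assumption: with no heartbeat yet, measure from the last state change.
        last_seen = worker.last_heartbeat or worker.state_changed_at
        heartbeat_age = (now - last_seen).total_seconds()
        if heartbeat_age > 30 * H:  # 5 min
            return 'DEAD'
        if heartbeat_age > 10 * H:  # 100s
            return 'STALE'
        if heartbeat_age > 3 * H:   # 30s
            return 'WARN'

    # Check for conflicts/failures
    if worker.state == 'CONFLICTED':
        return 'blocked'
    if worker.state == 'FAILED':
        return 'error'

    # Check for stuck (heartbeat is fresh, but no state change for 30 min)
    if worker.state == 'WORKING' and state_age > 30 * 60:
        return 'stuck'

    return 'ok'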

Output Formatting

Table Format

Use fixed-width columns for alignment:

COLUMNS = [
    ('TASK', 12),
    ('STATE', 11),
    ('AGE', 6),
    ('HEARTBEAT', 10),
    ('STATUS', 7),
    ('SUMMARY', 30),
]

def format_row(worker):
    return f"{worker.task_id:<12} {worker.state:<11} {format_age(worker.age):<6} ..."
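
A fuller sketch of the row formatter, driven by the column widths, might look like the following. The format_age helper and the choice to pass pre-formatted cell strings (rather than a worker object) are assumptions made here to keep the example self-contained.

```python
COLUMNS = [
    ("TASK", 12),
    ("STATE", 11),
    ("AGE", 6),
    ("HEARTBEAT", 10),
    ("STATUS", 7),
    ("SUMMARY", 30),
]


def format_age(seconds: int) -> str:
    """Render a duration as a short label such as '45s', '45m', '2h', '3d'."""
    if seconds < 60:
        return f"{seconds}s"
    if seconds < 3600:
        return f"{seconds // 60}m"
    if seconds < 86400:
        return f"{seconds // 3600}h"
    return f"{seconds // 86400}d"


def format_row(cells) -> str:
    """Truncate and left-pad each cell to its column width, then join."""
    padded = (
        f"{str(cell)[:width]:<{width}}"
        for cell, (_, width) in zip(cells, COLUMNS)
    )
    return " ".join(padded).rstrip()


header = format_row(name for name, _ in COLUMNS)
row = format_row(
    ["skills-abc", "WORKING", format_age(2700), "2m ago", "ok", "Fix auth bug"]
)
```

Truncating to the column width is what enforces the 30-character limit on SUMMARY.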

JSON Format

For scripting:

[
  {
    "task_id": "skills-abc",
    "state": "WORKING",
    "age_seconds": 2700,
    "last_heartbeat": "2026-01-10T14:43:00Z",
    "status": "ok",
    "branch": "feat/skills-abc"
  }
]
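
One way to produce this array is sketched below. The input shape (a list of dicts holding timezone-aware datetimes) and the function name to_json are assumptions for illustration; only the output keys come from the example above.

```python
import json
from datetime import datetime, timezone


def to_json(workers, now: datetime) -> str:
    """Serialize workers into the JSON array shape shown above.

    Each worker is assumed to be a dict whose `created_at` and
    `last_heartbeat` values are timezone-aware UTC datetimes
    (`last_heartbeat` may be None).
    """
    records = []
    for w in workers:
        hb = w["last_heartbeat"]
        records.append({
            "task_id": w["task_id"],
            "state": w["state"],
            "age_seconds": int((now - w["created_at"]).total_seconds()),
            "last_heartbeat": hb.strftime("%Y-%m-%dT%H:%M:%SZ") if hb else None,
            "status": w["status"],
            "branch": w["branch"],
        })
    return json.dumps(records, indent=2)


sample = [{
    "task_id": "skills-abc",
    "state": "WORKING",
    "created_at": datetime(2026, 1, 10, 14, 0, tzinfo=timezone.utc),
    "last_heartbeat": datetime(2026, 1, 10, 14, 43, tzinfo=timezone.utc),
    "status": "ok",
    "branch": "feat/skills-abc",
}]
output = to_json(sample, now=datetime(2026, 1, 10, 14, 45, tzinfo=timezone.utc))
```

Passing now explicitly keeps the function deterministic, which makes the output easy to test.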

Color Codes

STATE_COLORS = {
    'ASSIGNED': 'yellow',
    'WORKING': 'green',
    'IN_REVIEW': 'yellow',
    'APPROVED': 'green',
    'COMPLETED': 'dim',
    'CONFLICTED': 'red',
    'FAILED': 'red',
    'STALE': 'red',
}

STATUS_COLORS = {
    'ok': 'green',
    'WARN': 'yellow',
    'STALE': 'red',
    'DEAD': 'red',
    'blocked': 'yellow',
    'error': 'red',
    'stuck': 'yellow',
}
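
The color names above still need to be mapped to terminal escape sequences. A minimal sketch using standard ANSI SGR codes follows; the colorize helper and its enabled parameter are hypothetical, and a real implementation might also honor conventions such as NO_COLOR.

```python
import sys

# Standard ANSI SGR escape codes for the color names used above.
ANSI = {
    "green": "\033[32m",
    "yellow": "\033[33m",
    "red": "\033[31m",
    "dim": "\033[2m",
}
RESET = "\033[0m"


def colorize(text: str, color: str, enabled=None) -> str:
    """Wrap text in an ANSI color, or return it unchanged.

    When `enabled` is None, color is used only if stdout is a terminal.
    Unknown color names leave the text unchanged.
    """
    if enabled is None:
        enabled = sys.stdout.isatty()
    code = ANSI.get(color)
    if not enabled or code is None:
        return text
    return f"{code}{text}{RESET}"
```

Cells should be padded to their column width before being colorized, because escape codes add invisible characters that would otherwise break alignment.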

Integration

With Worker CLI (skills-sse)

worker status is part of the same CLI:

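import click

@click.group()
def worker():
    """Worker coordination CLI."""

@worker.command()
@click.option('--watch', is_flag=True, help='Refresh every 2 seconds')
@click.option('--state', help='Filter by state')
@click.option('--stale', is_flag=True, help='Show only stale workers')
@click.option('--json', 'as_json', is_flag=True, help='Output as JSON array')
@click.option('--wide', is_flag=True, help='Show additional columns')
def status(watch, state, stale, as_json, wide):
    """Show worker dashboard."""
    ...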

With Message Bus (skills-ms5)

Query SQLite for heartbeats:

SELECT task_id, MAX(ts) as last_heartbeat
FROM messages
WHERE type = 'heartbeat'
GROUP BY task_id
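
From Python, the same query could be wrapped as below. This is a sketch under assumptions: the messages table is taken to have task_id, type, and ts columns (only those named in the query), and ts is stored as sortable ISO-style text.

```python
import sqlite3

HEARTBEAT_SQL = """
SELECT task_id, MAX(ts) AS last_heartbeat
FROM messages
WHERE type = 'heartbeat'
GROUP BY task_id
"""


def last_heartbeats(conn: sqlite3.Connection) -> dict:
    """Map each task id to the timestamp of its most recent heartbeat."""
    return dict(conn.execute(HEARTBEAT_SQL))


# In-memory demo with an assumed minimal schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (task_id TEXT, type TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("skills-abc", "task_assign", "2026-01-10 14:00:00"),
        ("skills-abc", "heartbeat", "2026-01-10 14:35:00"),
        ("skills-abc", "heartbeat", "2026-01-10 14:43:00"),
        ("skills-123", "heartbeat", "2026-01-10 14:33:00"),
    ],
)
heartbeats = last_heartbeats(conn)
```

Tasks that have never sent a heartbeat are absent from the result, so callers should treat a missing key as "no heartbeat yet".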

With State Files

Read .worker-state/workers/*.json for current state.
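
Loading the state files could be as simple as the sketch below. The function name load_worker_states is hypothetical, the files are assumed to contain one JSON object each, and skipping unreadable files is an assumption about how to handle a file that is mid-write.

```python
import json
from pathlib import Path


def load_worker_states(root=".worker-state/workers"):
    """Load every *.json state file under `root`, sorted by filename.

    Files that cannot be read or parsed are skipped rather than
    aborting the whole dashboard.
    """
    states = []
    for path in sorted(Path(root).glob("*.json")):
        try:
            states.append(json.loads(path.read_text(encoding="utf-8")))
        except (OSError, json.JSONDecodeError):
            continue  # e.g. a file caught mid-write or corrupted
    return states
```

A missing directory yields an empty list, which lets worker status print an empty table instead of failing.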

Implementation

Python Package

skills/worker/
├── commands/
│   ├── status.py      # Dashboard
│   ├── show.py        # Detail view
│   ├── logs.py        # Message history
│   └── stats.py       # Aggregate metrics
└── display/
    ├── table.py       # Table formatting
    ├── colors.py      # Terminal colors
    └── watch.py       # Watch mode loop

Watch Mode Implementation

import time
import os

def watch_loop(render_fn, interval=2):
    """Clear screen and re-render every interval seconds."""
    try:
        while True:
            os.system('clear')  # or use curses
            render_fn()
            time.sleep(interval)
    except KeyboardInterrupt:
        pass

MVP Scope

For MVP, implement items 1-4; items marked ⏸️ are deferred:

  1. worker status - basic table with state, age, heartbeat
  2. worker status --watch - refresh mode
  3. worker status --json - JSON output
  4. worker show <id> - basic detail view
  5. ⏸️ worker logs - defer (can query SQLite directly)
  6. ⏸️ worker stats - defer (nice to have)
  7. ⏸️ TUI dashboard - defer (CLI sufficient for 2-4 workers)

Open Questions

  1. Notification/alerts: Should there be worker watch --alert that beeps on failures?
  2. Shell integration: Prompt integration showing worker count/status?
  3. Log streaming: Real-time follow for worker logs --follow?

References

  • docker ps output format
  • kubectl get pods and kubectl describe
  • systemctl status detail level
  • htop for TUI inspiration (future)