dan/skills

dan 1c66d019bd feat: add worker CLI scaffold in Nim

Multi-agent coordination CLI with SQLite message bus:
- State machine: ASSIGNED -> WORKING -> IN_REVIEW -> APPROVED -> COMPLETED
- Commands: spawn, start, done, approve, merge, cancel, fail, heartbeat
- SQLite WAL mode, dedicated heartbeat thread, channel-based IPC
- cligen for CLI, tiny_sqlite for DB, ORC memory management

Design docs for branch-per-worker, state machine, message passing,
and human observability patterns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-10 18:47:47 -08:00


Human Observability Design

Status: Draft
Bead: skills-yak
Epic: skills-s6y (Multi-agent orchestration: Lego brick architecture)

Overview

This document defines the observability interface for human orchestrators managing AI workers. The design follows kubectl/docker patterns: one root command with subcommands, table output, watch mode, and detailed describe views.

Design Decisions (from orch consensus)

Decision	Choice	Rationale
Command structure	One root, many subcommands	kubectl-style; composable; easy to extend
Status columns	id, state, age, heartbeat, staleness	At-a-glance signal for stuck/healthy workers
Stale threshold	3x heartbeat = WARN, 10x = STALE	Avoid false positives from jitter
Watch mode	`--watch` flag	Multi-agent is event-driven; real-time visibility
Detail view	Full spec, transitions, errors, "why stuck"	Answer "what, why, where stuck?"

Commands

Command Overview

worker status [options]     # Dashboard table
worker show <task-id>       # Detailed view
worker logs <task-id>       # Message history
worker stats                # Aggregate metrics

`worker status`

Dashboard view of all workers.

worker status [--state STATE] [--stale] [--watch] [--json] [--wide]

Default Output:

TASK         STATE       AGE    HEARTBEAT   STATUS    SUMMARY
skills-abc   WORKING     45m    2m ago      ok        Fix auth bug
skills-xyz   IN_REVIEW   2h     --          ok        Add login form
skills-123   WORKING     3h     12m ago     STALE     Refactor database
skills-456   CONFLICTED  1h     5m ago      blocked   Update API endpoints

Columns:

Column	Source	Description
TASK	state file	Task ID
STATE	state file	Current state (color-coded)
AGE	state file	Time since created
HEARTBEAT	bus.db	Time since last heartbeat
STATUS	computed	ok, WARN, STALE, DEAD, blocked, error, stuck
SUMMARY	task description	First 30 chars of description
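The AGE, HEARTBEAT, and SUMMARY cells could be produced by small helpers like the ones below. This is a minimal sketch: the function names `format_age`, `format_heartbeat`, and `summarize` are illustrative assumptions, as is the choice to round durations down to the largest whole unit.

```python
from typing import Optional


def format_age(seconds: float) -> str:
    """Render a duration compactly, rounding down to the largest whole unit."""
    seconds = int(seconds)
    if seconds < 60:
        return f"{seconds}s"
    if seconds < 3600:
        return f"{seconds // 60}m"
    if seconds < 86400:
        return f"{seconds // 3600}h"
    return f"{seconds // 86400}d"


def format_heartbeat(seconds_since: Optional[float]) -> str:
    """Render the HEARTBEAT cell; '--' when no heartbeat is recorded."""
    if seconds_since is None:
        return "--"
    return f"{format_age(seconds_since)} ago"


def summarize(description: str, width: int = 30) -> str:
    """Keep only the first `width` characters of the task description."""
    return description[:width]
```

For example, `format_age(2700)` yields `45m` and `format_heartbeat(None)` yields `--`, matching the sample rows above.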

State Colors (if terminal supports):

🟢 Green: WORKING, APPROVED
🟡 Yellow: IN_REVIEW, ASSIGNED
🔴 Red: FAILED, CONFLICTED, STALE
⚪ Gray: COMPLETED

Options:

Option	Description
`--state STATE`	Filter by state (WORKING, IN_REVIEW, etc.)
`--stale`	Show only stale workers
`--watch`	Refresh every 2 seconds
`--json`	Output as JSON array
`--wide`	Show additional columns (branch, worker type)
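The `--state` and `--stale` filters can be applied to already-computed rows before rendering. The sketch below assumes each worker is a dict with `state` and `status` keys, and assumes that `--stale` should also match DEAD workers; neither detail is fixed by this design.

```python
from typing import Iterable, List, Optional


def filter_workers(
    workers: Iterable[dict],
    state: Optional[str] = None,
    stale_only: bool = False,
) -> List[dict]:
    """Apply the --state and --stale filters to computed worker rows."""
    selected = []
    for worker in workers:
        # --state: case-insensitive match on the state name
        if state is not None and worker["state"] != state.upper():
            continue
        # --stale: keep only workers whose computed status is STALE or DEAD
        # (including DEAD is an assumption of this sketch)
        if stale_only and worker["status"] not in ("STALE", "DEAD"):
            continue
        selected.append(worker)
    return selected
```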

Watch Mode:

worker status --watch
# Clears screen, refreshes every 2s
# Ctrl+C to exit

`worker show <task-id>`

Detailed view of a single worker.

worker show skills-abc [--events] [--json]

Output:

Task: skills-abc
Description: Fix authentication bug in login flow
State: WORKING
Branch: feat/skills-abc
Worktree: worktrees/skills-abc

Created:        2026-01-10 14:00:00 (45m ago)
State Changed:  2026-01-10 14:05:00 (40m ago)
Last Heartbeat: 2026-01-10 14:43:00 (2m ago)

Status: ok
  ✓ Heartbeat within threshold
  ✓ State progressing normally

State History:
  14:00:00  → ASSIGNED   (spawned)
  14:05:00  → WORKING    (worker start)

Git Status:
  Branch: feat/skills-abc
  Ahead of integration: 3 commits
  Behind integration: 0 commits
  Uncommitted changes: 2 files

Recent Messages:
  14:43:00  heartbeat    status=working, progress=0.6
  14:35:00  heartbeat    status=working, progress=0.4
  14:05:00  state_change ASSIGNED → WORKING
  14:00:00  task_assign  from=orchestrator

With `--events`, the full message history is shown instead of only the most recent messages.

Sections:

Header: Task ID, description, current state, branch
Timestamps: Created, state changed, last heartbeat
Status Check: Is it healthy? What's wrong?
State History: Timeline of transitions
Git Status: Branch status, commits, conflicts
Recent Messages: Last 10 messages from bus.db

`worker logs <task-id>`

Stream message history for a task.

worker logs skills-abc [--follow] [--since 1h] [--type heartbeat] [--limit N]

Output:

2026-01-10 14:00:00  task_assign     from=orchestrator
2026-01-10 14:05:00  state_change    ASSIGNED → WORKING
2026-01-10 14:15:00  heartbeat       status=working
2026-01-10 14:25:00  heartbeat       status=working
...

Options:

Option	Description
`--follow, -f`	Stream new messages as they arrive
`--since DURATION`	Show messages from last N minutes/hours
`--type TYPE`	Filter by message type
`--limit N`	Show last N messages (default: 50)
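The `--since DURATION` value needs to be turned into a time window. One possible parser is sketched below; the accepted suffixes (`s`, `m`, `h`, `d`) and the name `parse_duration` are assumptions, since the design only shows `1h` and `24h` as examples.

```python
import re
from datetime import timedelta

# Map each suffix to the matching timedelta keyword argument.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '90s', '30m', '1h', or '2d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})
```

The cutoff for the query would then be the current time minus `parse_duration(since)`.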

`worker stats`

Aggregate metrics across all workers.

worker stats [--since 24h]

Output:

Workers by State:
  WORKING:    2
  IN_REVIEW:  1
  COMPLETED:  5
  FAILED:     1
  Total:      9

Health:
  Healthy:    3 (33%)
  Stale:      1 (11%)

Timing (median):
  ASSIGNED → WORKING:     2m
  WORKING → IN_REVIEW:    45m
  IN_REVIEW → APPROVED:   15m
  Full cycle:             1h 10m

Failures (last 24h): 1
  skills-789: Rebase conflict (3h ago)
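The median timings could be derived from each task's state history. The sketch below assumes a history is a list of `(state, entered_at)` pairs with `entered_at` in epoch seconds, and that each state is entered at most once per task; the function name is hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple

History = List[Tuple[str, float]]  # (state, entered_at in epoch seconds)


def median_transition_seconds(
    histories: List[History], from_state: str, to_state: str
) -> Optional[float]:
    """Median time between entering from_state and entering to_state."""
    durations = []
    for history in histories:
        entered = dict(history)  # state -> time the task entered it
        if from_state in entered and to_state in entered:
            durations.append(entered[to_state] - entered[from_state])
    # No task has completed this transition yet
    if not durations:
        return None
    return median(durations)
```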

Stale Detection

Thresholds

Based on heartbeat interval H (default: 10s):

Level	Threshold	Meaning
OK	< 3H (30s)	Normal operation
WARN	3H - 10H (30s - 100s)	Possible issue
STALE	10H - 30H (100s - 5m)	Worker likely stuck/dead
DEAD	> 30H (5m)	Worker definitely dead

Stale vs Stuck

Two different problems:

Condition	Symptom	Detection
Stale worker	No heartbeats	`now - last_heartbeat > threshold`
Stuck task	Heartbeating but no progress	Same state for > N minutes

Stuck detection:

SELECT task_id FROM workers
WHERE state = 'WORKING'
  AND state_changed_at < datetime('now', '-30 minutes')
  AND last_heartbeat > datetime('now', '-1 minute')
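To try the stuck query in isolation, it can be run against an in-memory database. The `workers` table schema below is an assumption inferred from the column names in the query, and timestamps are stored in SQLite's default `YYYY-MM-DD HH:MM:SS` UTC text format so that string comparison orders them correctly.

```python
import sqlite3

STUCK_QUERY = """
SELECT task_id FROM workers
WHERE state = 'WORKING'
  AND state_changed_at < datetime('now', '-30 minutes')
  AND last_heartbeat > datetime('now', '-1 minute')
"""


def find_stuck(conn: sqlite3.Connection) -> list:
    """Return task IDs that are heartbeating but have not changed state."""
    return [row[0] for row in conn.execute(STUCK_QUERY)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers ("
    "task_id TEXT PRIMARY KEY, state TEXT, "
    "state_changed_at TEXT, last_heartbeat TEXT)"
)
samples = [
    ("skills-abc", "WORKING", "-45 minutes", "-10 seconds"),  # stuck
    ("skills-xyz", "WORKING", "-5 minutes", "-10 seconds"),   # progressing
    ("skills-123", "WORKING", "-3 hours", "-12 minutes"),     # stale, not stuck
]
for task_id, state, changed, beat in samples:
    # Timestamps are generated relative to 'now' so the demo is repeatable
    conn.execute(
        "INSERT INTO workers VALUES (?, ?, datetime('now', ?), datetime('now', ?))",
        (task_id, state, changed, beat),
    )

stuck = find_stuck(conn)
```

Only the first sample row satisfies both conditions: it has been in WORKING for more than 30 minutes and has sent a heartbeat within the last minute.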

Status Computation

from datetime import datetime, timezone

def compute_status(worker: "Worker") -> str:
    """Return the STATUS column value for one worker."""
    # Assumes worker timestamps are timezone-aware UTC datetimes
    now = datetime.now(timezone.utc)
    state_age = (now - worker.state_changed_at).total_seconds()

    H = 10  # heartbeat interval in seconds

    # Check heartbeat freshness (skipped if no heartbeat has been recorded yet)
    if worker.state in ('ASSIGNED', 'WORKING') and worker.last_heartbeat is not None:
        heartbeat_age = (now - worker.last_heartbeat).total_seconds()
        if heartbeat_age > 30 * H:  # 5 min
            return 'DEAD'
        if heartbeat_age > 10 * H:  # 100s
            return 'STALE'
        if heartbeat_age > 3 * H:   # 30s
            return 'WARN'

    # Check for conflicts/failures
    if worker.state == 'CONFLICTED':
        return 'blocked'
    if worker.state == 'FAILED':
        return 'error'

    # Check for stuck (heartbeating but no state change)
    if worker.state == 'WORKING' and state_age > 30 * 60:  # 30 min
        return 'stuck'

    return 'ok'

Output Formatting

Table Format

Use fixed-width columns for alignment:

COLUMNS = [
    ('TASK', 12),
    ('STATE', 11),
    ('AGE', 6),
    ('HEARTBEAT', 10),
    ('STATUS', 7),
    ('SUMMARY', 30),
]

def format_row(worker):
    return f"{worker.task_id:<12} {worker.state:<11} {format_age(worker.age):<6} ..."
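A complete row formatter can be driven by the column list itself, so that widths are defined in one place. This is a sketch under the assumption that cells arrive as pre-formatted strings keyed by column name; over-long cells are truncated to the column width, which is a choice of this example rather than something the design specifies.

```python
COLUMNS = [
    ("TASK", 12),
    ("STATE", 11),
    ("AGE", 6),
    ("HEARTBEAT", 10),
    ("STATUS", 7),
    ("SUMMARY", 30),
]


def format_header() -> str:
    """Render the header line using the same widths as the rows."""
    return " ".join(f"{name:<{width}}" for name, width in COLUMNS).rstrip()


def format_cells(cells: dict) -> str:
    """Render one row from a dict of pre-formatted strings keyed by column name."""
    parts = []
    for name, width in COLUMNS:
        text = str(cells.get(name, ""))[:width]  # truncate over-long cells
        parts.append(f"{text:<{width}}")
    return " ".join(parts).rstrip()


row = format_cells({
    "TASK": "skills-abc",
    "STATE": "WORKING",
    "AGE": "45m",
    "HEARTBEAT": "2m ago",
    "STATUS": "ok",
    "SUMMARY": "Fix auth bug",
})
```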

JSON Format

For scripting:

[
  {
    "task_id": "skills-abc",
    "state": "WORKING",
    "age_seconds": 2700,
    "last_heartbeat": "2026-01-10T14:43:00Z",
    "status": "ok",
    "branch": "feat/skills-abc"
  }
]
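Producing this array is mostly a matter of serializing timestamps consistently. The sketch below assumes each worker is a dict whose `last_heartbeat` is a timezone-aware `datetime` or `None`; the function name `to_json` is hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Optional


def _iso_utc(ts: Optional[datetime]) -> Optional[str]:
    """Format an aware datetime as ISO 8601 UTC with a trailing 'Z'."""
    if ts is None:
        return None
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def to_json(workers: Iterable[dict]) -> str:
    """Serialize worker rows to the JSON array shape shown above."""
    payload = [
        {
            "task_id": w["task_id"],
            "state": w["state"],
            "age_seconds": int(w["age_seconds"]),
            "last_heartbeat": _iso_utc(w["last_heartbeat"]),
            "status": w["status"],
            "branch": w["branch"],
        }
        for w in workers
    ]
    return json.dumps(payload, indent=2)
```

A worker without a recorded heartbeat serializes `last_heartbeat` as `null` in this sketch.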

Color Codes

STATE_COLORS = {
    'ASSIGNED': 'yellow',
    'WORKING': 'green',
    'IN_REVIEW': 'yellow',
    'APPROVED': 'green',
    'COMPLETED': 'dim',
    'CONFLICTED': 'red',
    'FAILED': 'red',
    'STALE': 'red',
}

STATUS_COLORS = {
    'ok': 'green',
    'WARN': 'yellow',
    'STALE': 'red',
    'DEAD': 'red',
    'blocked': 'yellow',
    'error': 'red',
    'stuck': 'yellow',
}
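The color names above still have to be mapped to terminal escape sequences, and coloring should be skipped when output is not a terminal. The following sketch uses standard ANSI SGR codes; the `colorize` helper and its `enabled` parameter are assumptions for illustration.

```python
import sys
from typing import Optional

# Standard ANSI SGR escape codes for the color names used in this document.
ANSI = {
    "green": "\033[32m",
    "yellow": "\033[33m",
    "red": "\033[31m",
    "dim": "\033[2m",
}
RESET = "\033[0m"


def colorize(text: str, color: str, enabled: Optional[bool] = None) -> str:
    """Wrap text in ANSI codes; plain text if colors are off or unknown."""
    if enabled is None:
        # Default: only color when stdout is an interactive terminal
        enabled = sys.stdout.isatty()
    code = ANSI.get(color)
    if not enabled or code is None:
        return text
    return f"{code}{text}{RESET}"
```

Padding should be applied before coloring, since the escape codes add characters that would otherwise throw off column alignment.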

Integration

With Worker CLI (skills-sse)

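`worker status` is part of the same CLI:

import click

@click.group()
def worker():
    """Root command group (defined by the worker CLI, skills-sse)."""

@worker.command()
@click.option('--watch', is_flag=True, help='Refresh every 2 seconds')
@click.option('--state', help='Filter by state')
@click.option('--stale', is_flag=True, help='Show only stale workers')
@click.option('--json', 'as_json', is_flag=True, help='Output as JSON array')
def status(watch, state, stale, as_json):
    """Show worker dashboard."""
    ...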

With Message Bus (skills-ms5)

Query SQLite for heartbeats:

SELECT task_id, MAX(ts) as last_heartbeat
FROM messages
WHERE type = 'heartbeat'
GROUP BY task_id
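From Python, the heartbeat query could be wrapped as shown below. The `messages` schema here is an assumption reduced to the three columns the query touches, and the in-memory database stands in for bus.db.

```python
import sqlite3

LAST_HEARTBEAT_QUERY = """
SELECT task_id, MAX(ts) AS last_heartbeat
FROM messages
WHERE type = 'heartbeat'
GROUP BY task_id
"""


def last_heartbeats(conn: sqlite3.Connection) -> dict:
    """Map task_id -> timestamp of its most recent heartbeat message."""
    return dict(conn.execute(LAST_HEARTBEAT_QUERY).fetchall())


# Stand-in for bus.db with a minimal, assumed schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (task_id TEXT, type TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("skills-abc", "task_assign", "2026-01-10 14:00:00"),
        ("skills-abc", "heartbeat", "2026-01-10 14:35:00"),
        ("skills-abc", "heartbeat", "2026-01-10 14:43:00"),
        ("skills-xyz", "task_assign", "2026-01-10 12:45:00"),
    ],
)

heartbeats = last_heartbeats(conn)
```

Tasks that have never sent a heartbeat are absent from the result, which corresponds to the `--` shown in the HEARTBEAT column.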

With State Files

Read .worker-state/workers/*.json for current state.
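Loading the state files might look like the sketch below. The JSON layout inside each file is not specified here, so the loader returns the parsed objects untouched; skipping unreadable or half-written files is an assumption meant to keep the dashboard from crashing while a worker is mid-write.

```python
import json
from pathlib import Path
from typing import List


def load_worker_states(root: str = ".worker-state/workers") -> List[dict]:
    """Parse every *.json state file under root, in filename order."""
    states = []
    for path in sorted(Path(root).glob("*.json")):
        try:
            states.append(json.loads(path.read_text(encoding="utf-8")))
        except (OSError, json.JSONDecodeError):
            # Skip files that vanished or are partially written
            continue
    return states
```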

Implementation

Python Package

skills/worker/
├── commands/
│   ├── status.py      # Dashboard
│   ├── show.py        # Detail view
│   ├── logs.py        # Message history
│   └── stats.py       # Aggregate metrics
└── display/
    ├── table.py       # Table formatting
    ├── colors.py      # Terminal colors
    └── watch.py       # Watch mode loop

Watch Mode Implementation

import time
import os

def watch_loop(render_fn, interval=2):
    """Clear screen and re-render every interval seconds."""
    try:
        while True:
            os.system('clear')  # or use curses
            render_fn()
            time.sleep(interval)
    except KeyboardInterrupt:
        pass

MVP Scope

For MVP, implement:

✅ worker status - basic table with state, age, heartbeat
✅ worker status --watch - refresh mode
✅ worker status --json - JSON output
✅ worker show <id> - basic detail view
⏸️ worker logs - defer (can query SQLite directly)
⏸️ worker stats - defer (nice to have)
⏸️ TUI dashboard - defer (CLI sufficient for 2-4 workers)

Open Questions

Notification/alerts: Should there be worker watch --alert that beeps on failures?
Shell integration: Prompt integration showing worker count/status?
Log streaming: Real-time follow for worker logs --follow?

References

docker ps output format
kubectl get pods and kubectl describe
systemctl status detail level
htop for TUI inspiration (future)
