Multi-agent coordination CLI with SQLite message bus: - State machine: ASSIGNED -> WORKING -> IN_REVIEW -> APPROVED -> COMPLETED - Commands: spawn, start, done, approve, merge, cancel, fail, heartbeat - SQLite WAL mode, dedicated heartbeat thread, channel-based IPC - cligen for CLI, tiny_sqlite for DB, ORC memory management Design docs for branch-per-worker, state machine, message passing, and human observability patterns. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
9.8 KiB
Human Observability Design
Status: Draft Bead: skills-yak Epic: skills-s6y (Multi-agent orchestration: Lego brick architecture)
Overview
This document defines the observability interface for human orchestrators managing AI workers. The design follows kubectl/docker patterns: one root command with subcommands, table output, watch mode, and detailed describe views.
Design Decisions (from orch consensus)
| Decision | Choice | Rationale |
|---|---|---|
| Command structure | One root, many subcommands | kubectl-style; composable; easy to extend |
| Status columns | id, state, age, heartbeat, staleness | At-a-glance signal for stuck/healthy workers |
| Stale threshold | 3x heartbeat = WARN, 10x = STALE | Avoid false positives from jitter |
| Watch mode | --watch flag |
Multi-agent is event-driven; real-time visibility |
| Detail view | Full spec, transitions, errors, "why stuck" | Answer "what, why, where stuck?" |
Commands
Command Overview
worker status [options] # Dashboard table
worker show <task-id> # Detailed view
worker logs <task-id> # Message history
worker stats # Aggregate metrics
worker status
Dashboard view of all workers.
worker status [--state STATE] [--stale] [--watch] [--json]
Default Output:
TASK STATE AGE HEARTBEAT STATUS SUMMARY
skills-abc WORKING 45m 2m ago ok Fix auth bug
skills-xyz IN_REVIEW 2h -- ok Add login form
skills-123 WORKING 3h 12m ago STALE Refactor database
skills-456 CONFLICTED 1h 5m ago blocked Update API endpoints
Columns:
| Column | Source | Description |
|---|---|---|
| TASK | state file | Task ID |
| STATE | state file | Current state (color-coded) |
| AGE | state file | Time since created |
| HEARTBEAT | bus.db | Time since last heartbeat |
| STATUS | computed | ok, WARN, STALE, blocked, error |
| SUMMARY | task description | First 30 chars of description |
State Colors (if terminal supports):
- 🟢 Green: WORKING
- 🟡 Yellow: IN_REVIEW, ASSIGNED
- 🔴 Red: FAILED, CONFLICTED, STALE
- ⚪ Gray: COMPLETED
Options:
| Option | Description |
|---|---|
--state STATE |
Filter by state (WORKING, IN_REVIEW, etc.) |
--stale |
Show only stale workers |
--watch |
Refresh every 2 seconds |
--json |
Output as JSON array |
--wide |
Show additional columns (branch, worker type) |
Watch Mode:
worker status --watch
# Clears screen, refreshes every 2s
# Ctrl+C to exit
worker show <task-id>
Detailed view of a single worker.
worker show skills-abc [--events] [--json]
Output:
Task: skills-abc
Description: Fix authentication bug in login flow
State: WORKING
Branch: feat/skills-abc
Worktree: worktrees/skills-abc
Created: 2026-01-10 14:00:00 (45m ago)
State Changed: 2026-01-10 14:05:00 (40m ago)
Last Heartbeat: 2026-01-10 14:43:00 (2m ago)
Status: ok
✓ Heartbeat within threshold
✓ State progressing normally
State History:
14:00:00 → ASSIGNED (spawned)
14:05:00 → WORKING (worker start)
Git Status:
Branch: feat/skills-abc
Ahead of integration: 3 commits
Behind integration: 0 commits
Uncommitted changes: 2 files
Recent Messages:
14:43:00 heartbeat status=working, progress=0.6
14:35:00 heartbeat status=working, progress=0.4
14:05:00 state_change ASSIGNED → WORKING
14:00:00 task_assign from=orchestrator
With --events: Show full message history.
Sections:
- Header: Task ID, description, current state, branch
- Timestamps: Created, state changed, last heartbeat
- Status Check: Is it healthy? What's wrong?
- State History: Timeline of transitions
- Git Status: Branch status, commits, conflicts
- Recent Messages: Last 10 messages from bus.db
worker logs <task-id>
Stream message history for a task.
worker logs skills-abc [--follow] [--since 1h] [--type heartbeat]
Output:
2026-01-10 14:00:00 task_assign from=orchestrator
2026-01-10 14:05:00 state_change ASSIGNED → WORKING
2026-01-10 14:15:00 heartbeat status=working
2026-01-10 14:25:00 heartbeat status=working
...
Options:
| Option | Description |
|---|---|
--follow, -f |
Stream new messages as they arrive |
--since DURATION |
Show messages from last N minutes/hours |
--type TYPE |
Filter by message type |
--limit N |
Show last N messages (default: 50) |
worker stats
Aggregate metrics across all workers.
worker stats [--since 24h]
Output:
Workers by State:
WORKING: 2
IN_REVIEW: 1
COMPLETED: 5
FAILED: 1
Total: 9
Health:
Healthy: 3 (33%)
Stale: 1 (11%)
Timing (median):
ASSIGNED → WORKING: 2m
WORKING → IN_REVIEW: 45m
IN_REVIEW → APPROVED: 15m
Full cycle: 1h 10m
Failures (last 24h): 1
skills-789: Rebase conflict (3h ago)
Stale Detection
Thresholds
Based on heartbeat interval H (default: 10s):
| Level | Threshold | Meaning |
|---|---|---|
| OK | < 3H (30s) | Normal operation |
| WARN | 3H - 10H (30s - 100s) | Possible issue |
| STALE | > 10H (100s) | Worker likely stuck/dead |
| DEAD | > 30H (5m) | Worker definitely dead |
Stale vs Stuck
Two different problems:
| Condition | Symptom | Detection |
|---|---|---|
| Stale worker | No heartbeats | now - last_heartbeat > threshold |
| Stuck task | Heartbeating but no progress | Same state for > N minutes |
Stuck detection:
SELECT task_id FROM workers
WHERE state = 'WORKING'
AND state_changed_at < datetime('now', '-30 minutes')
AND last_heartbeat > datetime('now', '-1 minute')
Status Computation
def compute_status(worker: Worker) -> str:
now = datetime.utcnow()
heartbeat_age = (now - worker.last_heartbeat).total_seconds()
state_age = (now - worker.state_changed_at).total_seconds()
H = 10 # heartbeat interval
# Check heartbeat freshness
if worker.state in ('ASSIGNED', 'WORKING'):
if heartbeat_age > 30 * H: # 5 min
return 'DEAD'
if heartbeat_age > 10 * H: # 100s
return 'STALE'
if heartbeat_age > 3 * H: # 30s
return 'WARN'
# Check for conflicts/failures
if worker.state == 'CONFLICTED':
return 'blocked'
if worker.state == 'FAILED':
return 'error'
# Check for stuck (working but no progress)
if worker.state == 'WORKING' and state_age > 30 * 60: # 30 min
return 'stuck'
return 'ok'
Output Formatting
Table Format
Use fixed-width columns for alignment:
COLUMNS = [
('TASK', 12),
('STATE', 11),
('AGE', 6),
('HEARTBEAT', 10),
('STATUS', 7),
('SUMMARY', 30),
]
def format_row(worker):
return f"{worker.task_id:<12} {worker.state:<11} {format_age(worker.age):<6} ..."
JSON Format
For scripting:
[
{
"task_id": "skills-abc",
"state": "WORKING",
"age_seconds": 2700,
"last_heartbeat": "2026-01-10T14:43:00Z",
"status": "ok",
"branch": "feat/skills-abc"
}
]
Color Codes
STATE_COLORS = {
'ASSIGNED': 'yellow',
'WORKING': 'green',
'IN_REVIEW': 'yellow',
'APPROVED': 'green',
'COMPLETED': 'dim',
'CONFLICTED': 'red',
'FAILED': 'red',
'STALE': 'red',
}
STATUS_COLORS = {
'ok': 'green',
'WARN': 'yellow',
'STALE': 'red',
'DEAD': 'red',
'blocked': 'yellow',
'error': 'red',
'stuck': 'yellow',
}
Integration
With Worker CLI (skills-sse)
worker status is part of the same CLI:
@worker.command()
@click.option('--watch', is_flag=True)
@click.option('--state')
@click.option('--stale', is_flag=True)
@click.option('--json', 'as_json', is_flag=True)
def status(watch, state, stale, as_json):
"""Show worker dashboard."""
...
With Message Bus (skills-ms5)
Query SQLite for heartbeats:
SELECT task_id, MAX(ts) as last_heartbeat
FROM messages
WHERE type = 'heartbeat'
GROUP BY task_id
With State Files
Read .worker-state/workers/*.json for current state.
Implementation
Python Package
skills/worker/
├── commands/
│ ├── status.py # Dashboard
│ ├── show.py # Detail view
│ ├── logs.py # Message history
│ └── stats.py # Aggregate metrics
└── display/
├── table.py # Table formatting
├── colors.py # Terminal colors
└── watch.py # Watch mode loop
Watch Mode Implementation
import time
import os
def watch_loop(render_fn, interval=2):
"""Clear screen and re-render every interval seconds."""
try:
while True:
os.system('clear') # or use curses
render_fn()
time.sleep(interval)
except KeyboardInterrupt:
pass
MVP Scope
For MVP, implement:
- ✅
worker status- basic table with state, age, heartbeat - ✅
worker status --watch- refresh mode - ✅
worker status --json- JSON output - ✅
worker show <id>- basic detail view - ⏸️
worker logs- defer (can query SQLite directly) - ⏸️
worker stats- defer (nice to have) - ⏸️ TUI dashboard - defer (CLI sufficient for 2-4 workers)
Open Questions
- Notification/alerts: Should there be
worker watch --alertthat beeps on failures? - Shell integration: Prompt integration showing worker count/status?
- Log streaming: Real-time follow for
worker logs --follow?
References
docker psoutput formatkubectl get podsandkubectl describesystemctl statusdetail levelhtopfor TUI inspiration (future)