# Human Observability Design

**Status**: Draft
**Bead**: skills-yak
**Epic**: skills-s6y (Multi-agent orchestration: Lego brick architecture)

## Overview

This document defines the observability interface for human orchestrators managing AI workers. The design follows kubectl/docker patterns: one root command with subcommands, table output, watch mode, and detailed describe views.

## Design Decisions (from orch consensus)

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Command structure | One root, many subcommands | kubectl-style; composable; easy to extend |
| Status columns | id, state, age, heartbeat, staleness | At-a-glance signal for stuck/healthy workers |
| Stale threshold | 3x heartbeat = WARN, 10x = STALE | Avoid false positives from jitter |
| Watch mode | `--watch` flag | Multi-agent is event-driven; real-time visibility |
| Detail view | Full spec, transitions, errors, "why stuck" | Answer "what, why, where stuck?" |

## Commands

### Command Overview

```
worker status [options]     # Dashboard table
worker show <task-id>       # Detailed view
worker logs <task-id>       # Message history
worker stats                # Aggregate metrics
```

### `worker status`

Dashboard view of all workers.

```bash
worker status [--state STATE] [--stale] [--watch] [--json] [--wide]
```

**Default Output**:
```
TASK         STATE       AGE    HEARTBEAT  STATUS  SUMMARY
skills-abc   WORKING     45m    2m ago     ok      Fix auth bug
skills-xyz   IN_REVIEW   2h     --         ok      Add login form
skills-123   WORKING     3h     12m ago    STALE   Refactor database
skills-456   CONFLICTED  1h     5m ago     blocked Update API endpoints
```

**Columns**:

| Column | Source | Description |
|--------|--------|-------------|
| TASK | state file | Task ID |
| STATE | state file | Current state (color-coded) |
| AGE | state file | Time since created |
| HEARTBEAT | bus.db | Time since last heartbeat |
| STATUS | computed | ok, WARN, STALE, DEAD, blocked, error, stuck |
| SUMMARY | task description | First 30 chars of description |

**State Colors** (if the terminal supports color):
- 🟢 Green: WORKING, APPROVED
- 🟡 Yellow: IN_REVIEW, ASSIGNED
- 🔴 Red: FAILED, CONFLICTED, STALE
- ⚪ Gray: COMPLETED

**Options**:

| Option | Description |
|--------|-------------|
| `--state STATE` | Filter by state (WORKING, IN_REVIEW, etc.) |
| `--stale` | Show only stale workers |
| `--watch` | Refresh every 2 seconds |
| `--json` | Output as JSON array |
| `--wide` | Show additional columns (branch, worker type) |

**Watch Mode**:
```bash
worker status --watch
# Clears screen, refreshes every 2s
# Ctrl+C to exit
```

### `worker show <task-id>`

Detailed view of a single worker.

```bash
worker show skills-abc [--events] [--json]
```

**Output**:
```
Task: skills-abc
Description: Fix authentication bug in login flow
State: WORKING
Branch: feat/skills-abc
Worktree: worktrees/skills-abc

Created: 2026-01-10 14:00:00 (45m ago)
State Changed: 2026-01-10 14:05:00 (40m ago)
Last Heartbeat: 2026-01-10 14:43:00 (2m ago)

Status: ok
  ✓ Heartbeat within threshold
  ✓ State progressing normally

State History:
  14:00:00 → ASSIGNED (spawned)
  14:05:00 → WORKING (worker start)

Git Status:
  Branch: feat/skills-abc
  Ahead of integration: 3 commits
  Behind integration: 0 commits
  Uncommitted changes: 2 files

Recent Messages:
  14:43:00 heartbeat status=working, progress=0.6
  14:35:00 heartbeat status=working, progress=0.4
  14:05:00 state_change ASSIGNED → WORKING
  14:00:00 task_assign from=orchestrator
```

**With `--events`**: Show full message history.

**Sections**:

1. **Header**: Task ID, description, current state, branch
2. **Timestamps**: Created, state changed, last heartbeat
3. **Status Check**: Is it healthy? What's wrong?
4. **State History**: Timeline of transitions
5. **Git Status**: Branch status, commits, conflicts
6. **Recent Messages**: Last 10 messages from bus.db

### `worker logs <task-id>`

Stream message history for a task.

```bash
worker logs skills-abc [--follow] [--since 1h] [--type heartbeat] [--limit N]
```

**Output**:
```
2026-01-10 14:00:00 task_assign from=orchestrator
2026-01-10 14:05:00 state_change ASSIGNED → WORKING
2026-01-10 14:15:00 heartbeat status=working
2026-01-10 14:25:00 heartbeat status=working
...
```

**Options**:

| Option | Description |
|--------|-------------|
| `--follow, -f` | Stream new messages as they arrive |
| `--since DURATION` | Show messages from the last N minutes/hours |
| `--type TYPE` | Filter by message type |
| `--limit N` | Show last N messages (default: 50) |

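The `--since DURATION` option takes values such as `1h` or `24h`. A minimal sketch of how such a value could be parsed is shown here; the helper name `parse_duration` and the accepted unit suffixes (`s`, `m`, `h`, `d`) are assumptions for illustration, not part of the design.

```python
import re
from datetime import timedelta

# Hypothetical unit table; the design itself only shows hour values ("1h", "24h").
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '30m' or '24h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})
```

Rejecting anything that does not match the pattern keeps a typo like `--since 1hr` from silently selecting the wrong time window.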
### `worker stats`

Aggregate metrics across all workers.

```bash
worker stats [--since 24h]
```

**Output**:
```
Workers by State:
  WORKING: 2
  IN_REVIEW: 1
  COMPLETED: 5
  FAILED: 1
  Total: 9

Health:
  Healthy: 3 (33%)
  Stale: 1 (11%)

Timing (median):
  ASSIGNED → WORKING: 2m
  WORKING → IN_REVIEW: 45m
  IN_REVIEW → APPROVED: 15m
  Full cycle: 1h 10m

Failures (last 24h): 1
  skills-789: Rebase conflict (3h ago)
```

## Stale Detection

### Thresholds

Based on heartbeat interval `H` (default: 10s):

| Level | Threshold | Meaning |
|-------|-----------|---------|
| OK | < 3H (30s) | Normal operation |
| WARN | 3H - 10H (30s - 100s) | Possible issue |
| STALE | 10H - 30H (100s - 5m) | Worker likely stuck/dead |
| DEAD | > 30H (5m) | Worker definitely dead |

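The thresholds are multiples of `H`, so they scale with whatever heartbeat interval is configured. A minimal sketch of the classification with the interval as a parameter follows; the function name `heartbeat_level` is hypothetical, and the strict `>` comparisons are an assumption about how the exact boundary values are treated.

```python
def heartbeat_level(age_seconds: float, interval: float = 10.0) -> str:
    """Classify heartbeat age against multiples of the heartbeat interval."""
    # Check the largest threshold first so each level is reported only once.
    if age_seconds > 30 * interval:
        return "DEAD"
    if age_seconds > 10 * interval:
        return "STALE"
    if age_seconds > 3 * interval:
        return "WARN"
    return "OK"
```

With the default 10s interval, a 45s-old heartbeat is `WARN`; with a 60s interval the same age is still `OK`.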
### Stale vs Stuck

Two different problems:

| Condition | Symptom | Detection |
|-----------|---------|-----------|
| **Stale worker** | No heartbeats | `now - last_heartbeat > threshold` |
| **Stuck task** | Heartbeating but no progress | Same state for > N minutes |

**Stuck detection**:
```sql
SELECT task_id FROM workers
WHERE state = 'WORKING'
  AND state_changed_at < datetime('now', '-30 minutes')
  AND last_heartbeat > datetime('now', '-1 minute');
```

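To make the stuck-detection query concrete, here is a small sketch that runs it against an in-memory SQLite database. The `workers` table layout (text timestamps in SQLite's `YYYY-MM-DD HH:MM:SS` UTC format) and the task IDs are assumptions for the example; the real schema may differ.

```python
import sqlite3

STUCK_QUERY = """
SELECT task_id FROM workers
WHERE state = 'WORKING'
  AND state_changed_at < datetime('now', '-30 minutes')
  AND last_heartbeat > datetime('now', '-1 minute')
"""


def find_stuck(conn: sqlite3.Connection) -> list[str]:
    """Return task IDs that are heartbeating but have not changed state."""
    return [row[0] for row in conn.execute(STUCK_QUERY)]


def demo_stuck() -> list[str]:
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: timestamps stored as UTC text.
    conn.execute(
        "CREATE TABLE workers (task_id TEXT PRIMARY KEY, state TEXT,"
        " state_changed_at TEXT, last_heartbeat TEXT)"
    )
    conn.executemany(
        "INSERT INTO workers VALUES (?, ?, datetime('now', ?), datetime('now', ?))",
        [
            ("task-stuck", "WORKING", "-45 minutes", "-10 seconds"),
            ("task-fresh", "WORKING", "-5 minutes", "-10 seconds"),
            ("task-stale", "WORKING", "-3 hours", "-12 minutes"),
        ],
    )
    return find_stuck(conn)
```

Only `task-stuck` matches: `task-fresh` changed state recently, and `task-stale` has stopped heartbeating, which is the stale case rather than the stuck one.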
## Status Computation

```python
from datetime import datetime, timezone

H = 10  # heartbeat interval in seconds


def compute_status(worker: "Worker") -> str:
    # Assumes worker timestamps are timezone-aware UTC datetimes.
    now = datetime.now(timezone.utc)
    heartbeat_age = (now - worker.last_heartbeat).total_seconds()
    state_age = (now - worker.state_changed_at).total_seconds()

    # Check heartbeat freshness
    if worker.state in ('ASSIGNED', 'WORKING'):
        if heartbeat_age > 30 * H:  # 5 min
            return 'DEAD'
        if heartbeat_age > 10 * H:  # 100s
            return 'STALE'
        if heartbeat_age > 3 * H:  # 30s
            return 'WARN'

    # Check for conflicts/failures
    if worker.state == 'CONFLICTED':
        return 'blocked'
    if worker.state == 'FAILED':
        return 'error'

    # Check for stuck (working but no progress)
    if worker.state == 'WORKING' and state_age > 30 * 60:  # 30 min
        return 'stuck'

    return 'ok'
```

## Output Formatting

### Table Format

Use fixed-width columns for alignment:

```python
COLUMNS = [
    ('TASK', 12),
    ('STATE', 11),
    ('AGE', 6),
    ('HEARTBEAT', 10),
    ('STATUS', 7),
    ('SUMMARY', 30),
]

def format_row(worker):
    return f"{worker.task_id:<12} {worker.state:<11} {format_age(worker.age):<6} ..."
```

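A possible way to fill in the fixed-width formatting is sketched below. It is an illustration under assumptions: `format_age` is given one plausible implementation (largest whole unit only), and `format_cells` is a hypothetical helper that takes already-formatted cell strings rather than a worker object.

```python
COLUMNS = [
    ("TASK", 12),
    ("STATE", 11),
    ("AGE", 6),
    ("HEARTBEAT", 10),
    ("STATUS", 7),
    ("SUMMARY", 30),
]


def format_age(seconds: int) -> str:
    """Render a duration compactly using its largest whole unit."""
    if seconds < 60:
        return f"{seconds}s"
    if seconds < 3600:
        return f"{seconds // 60}m"
    if seconds < 86400:
        return f"{seconds // 3600}h"
    return f"{seconds // 86400}d"


def format_cells(cells) -> str:
    """Truncate and left-pad each cell to its column width, space-separated."""
    padded = (
        f"{str(cell)[:width]:<{width}}"
        for cell, (_, width) in zip(cells, COLUMNS)
    )
    return " ".join(padded).rstrip()


header = format_cells(name for name, _ in COLUMNS)
row = format_cells(
    ["skills-abc", "WORKING", format_age(2700), "2m ago", "ok", "Fix auth bug"]
)
```

Driving both the header and the rows from `COLUMNS` keeps widths in one place, and truncating each cell means a long description cannot push later columns out of alignment.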
### JSON Format

For scripting:

```json
[
  {
    "task_id": "skills-abc",
    "state": "WORKING",
    "age_seconds": 2700,
    "last_heartbeat": "2026-01-10T14:43:00Z",
    "status": "ok",
    "branch": "feat/skills-abc"
  }
]
```

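As an example of the scripting use case, a consumer could filter the JSON array for workers that need attention. This is a sketch only: the function name `needs_attention` is hypothetical, and the second record in the sample payload is made up for the example.

```python
import json


def needs_attention(payload: str) -> list[str]:
    """Return task IDs whose computed status is anything other than 'ok'."""
    workers = json.loads(payload)
    return [w["task_id"] for w in workers if w["status"] != "ok"]


# Sample payload in the shape shown above; the second record is illustrative.
SAMPLE = """
[
  {"task_id": "skills-abc", "state": "WORKING", "age_seconds": 2700,
   "last_heartbeat": "2026-01-10T14:43:00Z", "status": "ok",
   "branch": "feat/skills-abc"},
  {"task_id": "skills-123", "state": "WORKING", "age_seconds": 10800,
   "last_heartbeat": "2026-01-10T14:33:00Z", "status": "STALE",
   "branch": "feat/skills-123"}
]
"""

flagged = needs_attention(SAMPLE)
```

In practice the payload would come from the output of `worker status --json` instead of a string literal.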
### Color Codes

```python
STATE_COLORS = {
    'ASSIGNED': 'yellow',
    'WORKING': 'green',
    'IN_REVIEW': 'yellow',
    'APPROVED': 'green',
    'COMPLETED': 'dim',
    'CONFLICTED': 'red',
    'FAILED': 'red',
    'STALE': 'red',
}

STATUS_COLORS = {
    'ok': 'green',
    'WARN': 'yellow',
    'STALE': 'red',
    'DEAD': 'red',
    'blocked': 'yellow',
    'error': 'red',
    'stuck': 'yellow',
}
```

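The color names in these mappings still have to be turned into terminal escape sequences. One minimal way to do that with standard ANSI SGR codes is sketched here; the `colorize` helper and the `ANSI` table are assumptions for illustration, and a real implementation might use a library instead.

```python
import sys

# Standard ANSI SGR codes for the color names used in the mappings.
ANSI = {"green": "32", "yellow": "33", "red": "31", "dim": "2"}


def colorize(text: str, color: str, enabled=None) -> str:
    """Wrap text in an ANSI color sequence when color output is enabled."""
    if enabled is None:
        # Default to color only when stdout is an interactive terminal.
        enabled = sys.stdout.isatty()
    code = ANSI.get(color)
    if not enabled or code is None:
        return text
    return f"\033[{code}m{text}\033[0m"
```

Falling back to plain text when output is piped keeps `--json` and other scripted output free of escape codes.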
## Integration

### With Worker CLI (skills-sse)

`worker status` is part of the same CLI:

```python
import click


@click.group()
def worker():
    """Root command for the worker CLI."""


@worker.command()
@click.option('--watch', is_flag=True)
@click.option('--state')
@click.option('--stale', is_flag=True)
@click.option('--json', 'as_json', is_flag=True)
def status(watch, state, stale, as_json):
    """Show worker dashboard."""
    ...
```

### With Message Bus (skills-ms5)

Query SQLite for heartbeats:

```sql
SELECT task_id, MAX(ts) AS last_heartbeat
FROM messages
WHERE type = 'heartbeat'
GROUP BY task_id;
```

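A short sketch of running the heartbeat query from Python follows. The `messages` table layout (`task_id`, `type`, `ts` as text) is inferred from the query and is otherwise an assumption, as are the helper names; an in-memory database stands in for `bus.db`.

```python
import sqlite3

LAST_HEARTBEAT_QUERY = """
SELECT task_id, MAX(ts) AS last_heartbeat
FROM messages
WHERE type = 'heartbeat'
GROUP BY task_id
"""


def last_heartbeats(conn: sqlite3.Connection) -> dict[str, str]:
    """Map each task ID to the timestamp of its most recent heartbeat."""
    return dict(conn.execute(LAST_HEARTBEAT_QUERY).fetchall())


def demo_heartbeats() -> dict[str, str]:
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schema; the real bus.db may carry more columns.
    conn.execute("CREATE TABLE messages (task_id TEXT, type TEXT, ts TEXT)")
    conn.executemany(
        "INSERT INTO messages VALUES (?, ?, ?)",
        [
            ("skills-abc", "task_assign", "2026-01-10 14:00:00"),
            ("skills-abc", "heartbeat", "2026-01-10 14:35:00"),
            ("skills-abc", "heartbeat", "2026-01-10 14:43:00"),
            ("skills-xyz", "state_change", "2026-01-10 14:05:00"),
        ],
    )
    return last_heartbeats(conn)
```

Tasks that have never sent a heartbeat are absent from the result, which is the case the dashboard would render as `--`.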
### With State Files

Read `.worker-state/workers/*.json` for current state.

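Reading the state files might look like the sketch below. It assumes one JSON file per task, named after the task ID; the function name `load_states` and the choice to skip unreadable files are illustrative, not decided by this design.

```python
import json
from pathlib import Path


def load_states(root: str = ".worker-state/workers") -> dict[str, dict]:
    """Load every *.json state file under root, keyed by file stem."""
    states = {}
    for path in sorted(Path(root).glob("*.json")):
        try:
            states[path.stem] = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            # A worker may be mid-write; skip the file and pick it up next refresh.
            continue
    return states
```

Skipping a file that fails to parse keeps the dashboard rendering in watch mode instead of crashing on a partially written state file.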
## Implementation

### Python Package

```
skills/worker/
├── commands/
│   ├── status.py      # Dashboard
│   ├── show.py        # Detail view
│   ├── logs.py        # Message history
│   └── stats.py       # Aggregate metrics
└── display/
    ├── table.py       # Table formatting
    ├── colors.py      # Terminal colors
    └── watch.py       # Watch mode loop
```

### Watch Mode Implementation

```python
import os
import time


def watch_loop(render_fn, interval=2):
    """Clear screen and re-render every interval seconds."""
    try:
        while True:
            # 'cls' on Windows, 'clear' elsewhere (or use curses)
            os.system('cls' if os.name == 'nt' else 'clear')
            render_fn()
            time.sleep(interval)
    except KeyboardInterrupt:
        # Ctrl+C exits watch mode quietly
        pass
```

## MVP Scope

For MVP, implement:

1. ✅ `worker status` - basic table with state, age, heartbeat
2. ✅ `worker status --watch` - refresh mode
3. ✅ `worker status --json` - JSON output
4. ✅ `worker show <id>` - basic detail view
5. ⏸️ `worker logs` - defer (can query SQLite directly)
6. ⏸️ `worker stats` - defer (nice to have)
7. ⏸️ TUI dashboard - defer (CLI sufficient for 2-4 workers)

## Open Questions

1. **Notification/alerts**: Should there be `worker watch --alert` that beeps on failures?
2. **Shell integration**: Prompt integration showing worker count/status?
3. **Log streaming**: Real-time follow for `worker logs --follow`?

## References

- `docker ps` output format
- `kubectl get pods` and `kubectl describe`
- `systemctl status` detail level
- `htop` for TUI inspiration (future)