skills/docs/design/worker-state-machine.md
dan 1c66d019bd feat: add worker CLI scaffold in Nim
Multi-agent coordination CLI with SQLite message bus:
- State machine: ASSIGNED -> WORKING -> IN_REVIEW -> APPROVED -> COMPLETED
- Commands: spawn, start, done, approve, merge, cancel, fail, heartbeat
- SQLite WAL mode, dedicated heartbeat thread, channel-based IPC
- cligen for CLI, tiny_sqlite for DB, ORC memory management

Design docs for branch-per-worker, state machine, message passing,
and human observability patterns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 18:47:47 -08:00

259 lines
10 KiB
Markdown

# Worker State Machine Design
**Status**: Draft
**Bead**: skills-4oj
**Epic**: skills-s6y (Multi-agent orchestration: Lego brick architecture)
## Overview
This document defines the state machine for background worker agents in the multi-agent coordination system. Workers are AI coding agents that operate on separate git branches, coordinated through local filesystem state.
## Design Principles
1. **File-based, no database** - State in JSON files, messages in append-only JSONL
2. **Minimal states** - Only what's needed, resist over-engineering
3. **Atomic updates** - Write to temp file, fsync, atomic rename
4. **Crash recovery** - Detect stale workers via timestamp + process checks
5. **Cross-agent compatible** - Any agent that reads/writes files can participate
## State Diagram
```
┌─────────────────────────────────────────────┐
│ │
▼ │
┌──────────┐ assign ┌──────────┐ start ┌─────────┐ │
│ IDLE │ ──────────► │ ASSIGNED │ ─────────► │ WORKING │ ◄───┘
└──────────┘ └──────────┘ └────┬────┘ changes
▲ │ requested
│ │
│ reset submit_pr │
│ ▼
┌──────────┐ timeout ┌──────────┐ review ┌─────────────┐
│ FAILED │ ◄────────── │ STALE │ ◄───────── │ IN_REVIEW │
└──────────┘ └──────────┘ └──────┬──────┘
│ │
│ approve │
│ ▼
│ ┌──────────┐ merge ┌─────────────┐
└───────────────────│ COMPLETED│ ◄───────── │ APPROVED │
retry └──────────┘ └─────────────┘
```
## States
| State | Description | Entry Condition | Exit Condition |
|-------|-------------|-----------------|----------------|
| **IDLE** | Worker available, no task | Initial / after reset | Task assigned |
| **ASSIGNED** | Task claimed, preparing | Orchestrator assigns task | Branch created, work starts |
| **WORKING** | Actively coding/testing | `git checkout -b` succeeds | PR submitted or failure |
| **IN_REVIEW** | PR open, awaiting review | PR created | Approved, changes requested, or timeout |
| **APPROVED** | Review passed, ready to merge | Reviewer approves | Merge succeeds or conflicts |
| **COMPLETED** | Work merged to main | Merge succeeds | (terminal) or recycle to IDLE |
| **STALE** | No progress detected | Heartbeat timeout | Manual intervention or auto-reset |
| **FAILED** | Unrecoverable error | Max retries exceeded | Manual reset or reassign |
## State File Schema
Location: `.worker-state/{worker-id}.json`
```json
{
"worker_id": "worker-auth",
"state": "WORKING",
"task_id": "skills-abc",
"branch": "worker-auth/skills-abc",
"assigned_at": "2026-01-10T14:00:00Z",
"state_changed_at": "2026-01-10T14:05:00Z",
"last_heartbeat": "2026-01-10T14:32:00Z",
"pid": 12345,
"attempt": 1,
"max_attempts": 3,
"last_error": null,
"pr_url": null,
"review_state": null
}
```
### Field Definitions
| Field | Type | Description |
|-------|------|-------------|
| `worker_id` | string | Unique worker identifier (kebab-case) |
| `state` | enum | Current state (see States table) |
| `task_id` | string | Bead ID of assigned task |
| `branch` | string | Git branch name: `{worker_id}/{task_id}` |
| `assigned_at` | ISO8601 | When task was assigned |
| `state_changed_at` | ISO8601 | Last state transition timestamp |
| `last_heartbeat` | ISO8601 | Last activity timestamp |
| `pid` | int | Process ID (for crash detection) |
| `attempt` | int | Current attempt number (1-indexed) |
| `max_attempts` | int | Max retries before FAILED |
| `last_error` | string? | Error message if failed |
| `pr_url` | string? | Pull request URL when in IN_REVIEW |
| `review_state` | enum? | `pending`, `approved`, `changes_requested` |
## Transitions
### Valid Transitions
```
IDLE → ASSIGNED (assign_task)
ASSIGNED → WORKING (start_work)
ASSIGNED → FAILED (setup_failed)
WORKING → IN_REVIEW (submit_pr)
WORKING → FAILED (work_failed, max_attempts)
WORKING → STALE (heartbeat_timeout)
IN_REVIEW → APPROVED (review_approved)
IN_REVIEW → WORKING (changes_requested)
IN_REVIEW → STALE (review_timeout)
APPROVED → COMPLETED (merge_success)
APPROVED → WORKING (merge_conflict → rebase required)
STALE → WORKING (worker_recovered)
STALE → FAILED (timeout_exceeded)
FAILED → IDLE (reset/retry)
COMPLETED → IDLE (recycle)
```
### Transition Commands
Each transition is triggered by a command or detected condition:
| Command | From | To | Action |
|---------|------|----|----|
| `worker assign <id> <task>` | IDLE | ASSIGNED | Write state file, record task |
| `worker start <id>` | ASSIGNED | WORKING | Create branch, start heartbeat |
| `worker submit <id>` | WORKING | IN_REVIEW | Push branch, create PR |
| `worker approve <id>` | IN_REVIEW | APPROVED | Record approval |
| `worker merge <id>` | APPROVED | COMPLETED | Merge PR, delete branch |
| `worker reset <id>` | FAILED | IDLE | Clear state, ready for new task |
## File Operations
### Atomic State Updates
Never write directly to state file. Use atomic rename pattern:
```bash
# Write to temp file
echo "$new_state" > .worker-state/${worker_id}.json.tmp
# Sync to disk
sync .worker-state/${worker_id}.json.tmp
# Atomic rename
mv .worker-state/${worker_id}.json.tmp .worker-state/${worker_id}.json
```
### Optimistic Concurrency
Include `state_changed_at` as a version marker:
1. Read current state file
2. Compute new state
3. Before write, re-read and verify `state_changed_at` unchanged
4. If changed, retry from step 1
### Lock File Pattern
For operations needing exclusive access:
```bash
# Acquire lock (atomic create with O_EXCL)
if ! (set -C; echo "$$" > .worker-state/${worker_id}.lock) 2>/dev/null; then
# Lock held - check if stale
lock_pid=$(cat .worker-state/${worker_id}.lock)
if ! kill -0 "$lock_pid" 2>/dev/null; then
# Process dead, steal lock
rm -f .worker-state/${worker_id}.lock
echo "$$" > .worker-state/${worker_id}.lock
else
exit 1 # Lock held by live process
fi
fi
# ... do work ...
# Release lock
rm -f .worker-state/${worker_id}.lock
```
## Stale Worker Detection
Workers must update `last_heartbeat` during active work (every 30-60 seconds).
Detection algorithm (run by orchestrator or patrol agent):
```
for each worker file in .worker-state/*.json:
if state in (ASSIGNED, WORKING, IN_REVIEW):
if (now - last_heartbeat) > STALE_TIMEOUT:
if pid is set and process not running:
transition to STALE (worker crashed)
elif (now - last_heartbeat) > DEAD_TIMEOUT:
transition to STALE (presumed dead)
```
### Timeout Values
| Timeout | Duration | Description |
|---------|----------|-------------|
| `STALE_TIMEOUT` | 5 minutes | Mark as stale, eligible for intervention |
| `DEAD_TIMEOUT` | 15 minutes | Presume dead, ready for reassignment |
| `REVIEW_TIMEOUT` | 1 hour | Review taking too long, alert human |
## Directory Structure
```
.worker-state/
├── worker-auth.json # Worker state file
├── worker-auth.lock # Lock file (when held)
├── worker-refactor.json
├── messages/
│ ├── orchestrator.jsonl # Orchestrator → workers
│ ├── worker-auth.jsonl # Worker → orchestrator
│ └── worker-refactor.jsonl
└── reviews/
├── worker-auth.json # Review state (review-gate)
└── worker-refactor.json
```
## Integration Points
### With review-gate (skills-byq)
When worker submits PR (WORKING → IN_REVIEW):
1. Worker writes review request to `reviews/{worker-id}.json`
2. review-gate Stop hook checks review state
3. On approval, orchestrator transitions to APPROVED
### With message passing (skills-ms5)
Workers append to their outbox: `.worker-state/messages/{worker-id}.jsonl`
Orchestrator appends to: `.worker-state/messages/orchestrator.jsonl`
Message format:
```json
{"ts": "2026-01-10T14:00:00Z", "from": "worker-auth", "type": "done", "task": "skills-abc"}
```
### With branch isolation (skills-roq)
- Branch naming: `{worker-id}/{task-id}`
- Mandatory rebase before IN_REVIEW → APPROVED
- Merge to main only from APPROVED state
## Open Questions
1. **Heartbeat mechanism**: Should workers write to their state file, a separate heartbeat file, or append to message log?
2. **PR creation**: Who creates the PR - worker agent or orchestrator?
3. **Review source**: Human review, AI review (review-gate), or both?
4. **Recycle vs terminate**: Should COMPLETED workers return to IDLE or be terminated?
## References
- Temporal.io: Event-sourced durable execution
- LangGraph: Graph-based state machines for agents
- Airflow: DAG-based task lifecycle (None → Scheduled → Running → Success)
- OpenHands: Event-sourced (Observation, Thought, Action) loop