Worker State Machine Design
Status: Draft
Bead: skills-4oj
Epic: skills-s6y (Multi-agent orchestration: Lego brick architecture)
Overview
This document defines the state machine for background worker agents in the multi-agent coordination system. Workers are AI coding agents that operate on separate git branches, coordinated through local filesystem state.
Design Principles
- File-based, no database - State in JSON files, messages in append-only JSONL
- Minimal states - Only what's needed, resist over-engineering
- Atomic updates - Write to temp file, fsync, atomic rename
- Crash recovery - Detect stale workers via timestamp + process checks
- Cross-agent compatible - Any agent that reads/writes files can participate
State Diagram
                   ┌─────────────────────────────────────────────┐
                   │                                             │
                   ▼                                             │
┌──────────┐   assign    ┌──────────┐   start    ┌─────────┐     │
│   IDLE   │ ──────────► │ ASSIGNED │ ─────────► │ WORKING │ ◄───┘
└──────────┘             └──────────┘            └────┬────┘ changes
     ▲                                                │      requested
     │                                                │
     │ reset                                submit_pr │
     │                                                ▼
┌──────────┐   timeout   ┌──────────┐   review   ┌─────────────┐
│  FAILED  │ ◄────────── │  STALE   │ ◄───────── │  IN_REVIEW  │
└──────────┘             └──────────┘            └──────┬──────┘
     │                                                  │
     │                                          approve │
     │                                                  ▼
     │                   ┌──────────┐   merge    ┌─────────────┐
     └───────────────────│ COMPLETED│ ◄───────── │  APPROVED   │
            retry        └──────────┘            └─────────────┘
States
| State | Description | Entry Condition | Exit Condition |
|---|---|---|---|
| IDLE | Worker available, no task | Initial / after reset | Task assigned |
| ASSIGNED | Task claimed, preparing | Orchestrator assigns task | Branch created, work starts |
| WORKING | Actively coding/testing | `git checkout -b` succeeds | PR submitted or failure |
| IN_REVIEW | PR open, awaiting review | PR created | Approved, changes requested, or timeout |
| APPROVED | Review passed, ready to merge | Reviewer approves | Merge succeeds or conflicts |
| COMPLETED | Work merged to main | Merge succeeds | (terminal) or recycle to IDLE |
| STALE | No progress detected | Heartbeat timeout | Manual intervention or auto-reset |
| FAILED | Unrecoverable error | Max retries exceeded | Manual reset or reassign |
State File Schema
Location: .worker-state/{worker-id}.json
{
  "worker_id": "worker-auth",
  "state": "WORKING",
  "task_id": "skills-abc",
  "branch": "worker-auth/skills-abc",
  "assigned_at": "2026-01-10T14:00:00Z",
  "state_changed_at": "2026-01-10T14:05:00Z",
  "last_heartbeat": "2026-01-10T14:32:00Z",
  "pid": 12345,
  "attempt": 1,
  "max_attempts": 3,
  "last_error": null,
  "pr_url": null,
  "review_state": null
}
Field Definitions
| Field | Type | Description |
|---|---|---|
| `worker_id` | string | Unique worker identifier (kebab-case) |
| `state` | enum | Current state (see States table) |
| `task_id` | string | Bead ID of assigned task |
| `branch` | string | Git branch name: {worker_id}/{task_id} |
| `assigned_at` | ISO8601 | When task was assigned |
| `state_changed_at` | ISO8601 | Last state transition timestamp |
| `last_heartbeat` | ISO8601 | Last activity timestamp |
| `pid` | int | Process ID (for crash detection) |
| `attempt` | int | Current attempt number (1-indexed) |
| `max_attempts` | int | Max retries before FAILED |
| `last_error` | string? | Error message if failed |
| `pr_url` | string? | Pull request URL when in IN_REVIEW |
| `review_state` | enum? | pending, approved, changes_requested |
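As a minimal sketch of how a participant might load and sanity-check a state file against the field table, assuming Python and treating the names `WorkerState` and `parse_worker_state` as hypothetical (they are not part of this design):

```python
import json
from dataclasses import dataclass
from typing import Optional

# State and review-state names copied from the tables in this document
STATES = {"IDLE", "ASSIGNED", "WORKING", "IN_REVIEW",
          "APPROVED", "COMPLETED", "STALE", "FAILED"}
REVIEW_STATES = {"pending", "approved", "changes_requested"}


@dataclass
class WorkerState:
    """Hypothetical in-memory mirror of .worker-state/{worker-id}.json."""
    worker_id: str
    state: str
    task_id: str
    branch: str
    assigned_at: str        # ISO8601 strings are kept as-is in this sketch
    state_changed_at: str
    last_heartbeat: str
    pid: int
    attempt: int
    max_attempts: int
    last_error: Optional[str] = None
    pr_url: Optional[str] = None
    review_state: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject values outside the documented enums and ranges
        if self.state not in STATES:
            raise ValueError(f"unknown state: {self.state!r}")
        if self.review_state is not None and self.review_state not in REVIEW_STATES:
            raise ValueError(f"unknown review_state: {self.review_state!r}")
        if self.attempt < 1:
            raise ValueError("attempt is 1-indexed")


def parse_worker_state(text: str) -> WorkerState:
    """Parse the JSON text of a state file; unknown keys raise TypeError."""
    return WorkerState(**json.loads(text))
```

Timestamps stay as strings here so the file round-trips unchanged; a stricter loader could parse them into datetimes.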
Transitions
Valid Transitions
IDLE → ASSIGNED (assign_task)
ASSIGNED → WORKING (start_work)
ASSIGNED → FAILED (setup_failed)
WORKING → IN_REVIEW (submit_pr)
WORKING → FAILED (work_failed, max_attempts)
WORKING → STALE (heartbeat_timeout)
IN_REVIEW → APPROVED (review_approved)
IN_REVIEW → WORKING (changes_requested)
IN_REVIEW → STALE (review_timeout)
APPROVED → COMPLETED (merge_success)
APPROVED → WORKING (merge_conflict → rebase required)
STALE → WORKING (worker_recovered)
STALE → FAILED (timeout_exceeded)
FAILED → IDLE (reset/retry)
COMPLETED → IDLE (recycle)
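The list above can be encoded as a lookup table so that every command checks a transition before writing. This is a sketch in Python; `VALID_TRANSITIONS`, `can_transition`, and `check_transition` are illustrative names, not part of the design:

```python
# Each key is a source state; the value is the set of states it may move to,
# transcribed from the Valid Transitions list.
VALID_TRANSITIONS = {
    "IDLE": {"ASSIGNED"},
    "ASSIGNED": {"WORKING", "FAILED"},
    "WORKING": {"IN_REVIEW", "FAILED", "STALE"},
    "IN_REVIEW": {"APPROVED", "WORKING", "STALE"},
    "APPROVED": {"COMPLETED", "WORKING"},
    "STALE": {"WORKING", "FAILED"},
    "FAILED": {"IDLE"},
    "COMPLETED": {"IDLE"},
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in VALID_TRANSITIONS."""


def can_transition(current: str, target: str) -> bool:
    """Return True if current -> target appears in the table."""
    return target in VALID_TRANSITIONS.get(current, set())


def check_transition(current: str, target: str) -> None:
    """Raise InvalidTransition unless current -> target is allowed."""
    if not can_transition(current, target):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
```

For example, `can_transition("APPROVED", "WORKING")` is true (the merge-conflict path), while `can_transition("IDLE", "WORKING")` is false because a worker must pass through ASSIGNED first.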
Transition Commands
Each transition is triggered by a command or detected condition:
| Command | From | To | Action |
|---|---|---|---|
| `worker assign <id> <task>` | IDLE | ASSIGNED | Write state file, record task |
| `worker start <id>` | ASSIGNED | WORKING | Create branch, start heartbeat |
| `worker submit <id>` | WORKING | IN_REVIEW | Push branch, create PR |
| `worker approve <id>` | IN_REVIEW | APPROVED | Record approval |
| `worker merge <id>` | APPROVED | COMPLETED | Merge PR, delete branch |
| `worker reset <id>` | FAILED | IDLE | Clear state, ready for new task |
File Operations
Atomic State Updates
Never write directly to the state file. Write to a temp file in the same directory, sync it, then rename it over the original:
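# Write to a temp file in the same directory (rename is atomic only within one filesystem)
printf '%s\n' "$new_state" > ".worker-state/${worker_id}.json.tmp"
# Sync to disk (GNU coreutils sync accepts a file argument; where it does not, use plain sync)
sync ".worker-state/${worker_id}.json.tmp"
# Atomic rename over the live state file
mv ".worker-state/${worker_id}.json.tmp" ".worker-state/${worker_id}.json"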
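For agents that are not shell scripts, the same temp-file, fsync, rename sequence might look like the following Python sketch. The function name `write_state_atomic` is an assumption for illustration only:

```python
import json
import os
import tempfile


def write_state_atomic(path: str, state: dict) -> None:
    """Replace the JSON file at path without readers ever seeing a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on
    # one filesystem. Note that mkstemp creates it with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.write("\n")
            tmp.flush()
            os.fsync(tmp.fileno())   # force the bytes to disk before renaming
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        # Remove the temp file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Using a unique temp name per call, rather than a fixed `.json.tmp`, avoids two writers clobbering each other's temp file; it does not by itself decide which writer wins.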
Optimistic Concurrency
Include state_changed_at as a version marker:
1. Read current state file
2. Compute new state
3. Before write, re-read and verify state_changed_at is unchanged
4. If changed, retry from step 1
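The read, compute, verify, retry loop could be sketched as below. This is an illustration under assumptions: `transition_state` and its `mutate` callback are hypothetical names, and the timestamp is written with microsecond resolution so that two transitions in the same second still produce different version markers.

```python
import json
import os
from datetime import datetime, timezone
from typing import Callable


def transition_state(path: str, mutate: Callable[[dict], dict],
                     max_retries: int = 5) -> dict:
    """Apply mutate() to the state file, retrying if another writer got in first."""
    for _ in range(max_retries):
        # Step 1: read the current state and remember its version marker
        with open(path, encoding="utf-8") as f:
            current = json.load(f)
        version = current["state_changed_at"]

        # Step 2: compute the new state on a copy
        new_state = mutate(dict(current))
        new_state["state_changed_at"] = datetime.now(timezone.utc).isoformat()

        # Step 3: re-read and verify the marker is unchanged
        with open(path, encoding="utf-8") as f:
            latest = json.load(f)
        if latest["state_changed_at"] != version:
            continue  # Step 4: someone else wrote first, retry from step 1

        # Commit through a temp file and an atomic rename
        tmp_path = f"{path}.{os.getpid()}.tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(new_state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
        return new_state
    raise RuntimeError(f"gave up after {max_retries} conflicting updates")
```

The check narrows the window for lost updates but does not close it: another writer can still slip in between the re-read and the rename. Operations that need strict exclusion should also hold the lock file.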
Lock File Pattern
For operations needing exclusive access:
lock=".worker-state/${worker_id}.lock"
# Acquire lock (noclobber makes the redirect an atomic create with O_EXCL)
if ! (set -C; echo "$$" > "$lock") 2>/dev/null; then
  # Lock held - check if the holder is still alive
  lock_pid=$(cat "$lock" 2>/dev/null)
  if [ -n "$lock_pid" ] && kill -0 "$lock_pid" 2>/dev/null; then
    exit 1  # Lock held by live process
  fi
  # Process dead: remove the stale lock, then re-acquire with noclobber so
  # that only one of several competing stealers can win
  rm -f "$lock"
  (set -C; echo "$$" > "$lock") 2>/dev/null || exit 1
fi
# ... do work ...
# Release lock
rm -f "$lock"
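A Python equivalent of the lock pattern, offered as a sketch rather than a reference implementation. The names `pid_alive`, `try_acquire`, and `worker_lock` are hypothetical, and the liveness probe assumes a POSIX system, since `os.kill(pid, 0)` does not behave as a harmless probe on Windows.

```python
import os
from contextlib import contextmanager


def pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks existence without sending anything."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def try_acquire(lock_path: str) -> bool:
    """Atomically create the lock file; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(str(os.getpid()))
    return True


@contextmanager
def worker_lock(lock_path: str):
    """Hold the lock for the duration of a with-block, stealing it if stale."""
    if not try_acquire(lock_path):
        try:
            # A lock file without a parseable PID raises ValueError here and
            # is deliberately left for a human to inspect.
            with open(lock_path, encoding="utf-8") as f:
                holder = int(f.read().strip())
        except FileNotFoundError:
            holder = None  # released between our two calls
        if holder is not None and pid_alive(holder):
            raise RuntimeError(f"lock held by live process {holder}")
        if holder is not None:
            try:
                os.unlink(lock_path)  # holder is dead: remove the stale lock
            except FileNotFoundError:
                pass
        # Re-acquire with O_EXCL so only one competing stealer succeeds
        if not try_acquire(lock_path):
            raise RuntimeError("lost the race for a stale lock")
    try:
        yield
    finally:
        os.unlink(lock_path)
```

Usage would be `with worker_lock(".worker-state/worker-auth.lock"): ...`. As in the shell version, stealing is best-effort: a PID can be reused by an unrelated process, in which case the lock looks live until someone removes it by hand.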
Stale Worker Detection
Workers must update last_heartbeat during active work (every 30-60 seconds).
Detection algorithm (run by orchestrator or patrol agent):
for each worker file in .worker-state/*.json:
    if state in (ASSIGNED, WORKING, IN_REVIEW):
        if (now - last_heartbeat) > STALE_TIMEOUT:
            if pid is set and process not running:
                transition to STALE (worker crashed)
            elif (now - last_heartbeat) > DEAD_TIMEOUT:
                transition to STALE (presumed dead)
Timeout Values
| Timeout | Duration | Description |
|---|---|---|
| `STALE_TIMEOUT` | 5 minutes | Mark as stale, eligible for intervention |
| `DEAD_TIMEOUT` | 15 minutes | Presume dead, ready for reassignment |
| `REVIEW_TIMEOUT` | 1 hour | Review taking too long, alert human |
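The detection algorithm, with the STALE and DEAD timeout values from the table, might be expressed as a pure function that a patrol agent calls per state file. This is a sketch: `should_mark_stale` and its injectable `is_running` probe are illustrative names, and the default probe assumes POSIX.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

STALE_TIMEOUT = timedelta(minutes=5)
DEAD_TIMEOUT = timedelta(minutes=15)
MONITORED_STATES = {"ASSIGNED", "WORKING", "IN_REVIEW"}


def pid_running(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks whether the process exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def should_mark_stale(state: dict, now: datetime,
                      is_running: Callable[[int], bool] = pid_running) -> Optional[str]:
    """Return the reason a worker should move to STALE, or None to leave it alone."""
    if state["state"] not in MONITORED_STATES:
        return None
    # Accept the trailing "Z" used in the state files on older Python versions
    last = datetime.fromisoformat(state["last_heartbeat"].replace("Z", "+00:00"))
    silence = now - last
    if silence <= STALE_TIMEOUT:
        return None
    pid = state.get("pid")
    if pid is not None and not is_running(pid):
        return "worker crashed"
    if silence > DEAD_TIMEOUT:
        return "presumed dead"
    return None
```

Passing `now` and `is_running` as arguments keeps the decision testable without real processes or clocks. A worker whose process is alive and whose silence falls between the two timeouts is left alone, exactly as in the pseudocode.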
Directory Structure
.worker-state/
├── worker-auth.json           # Worker state file
├── worker-auth.lock           # Lock file (when held)
├── worker-refactor.json
├── messages/
│   ├── orchestrator.jsonl     # Orchestrator → workers
│   ├── worker-auth.jsonl      # Worker → orchestrator
│   └── worker-refactor.jsonl
└── reviews/
    ├── worker-auth.json       # Review state (review-gate)
    └── worker-refactor.json
Integration Points
With review-gate (skills-byq)
When a worker submits a PR (WORKING → IN_REVIEW):
- Worker writes review request to reviews/{worker-id}.json
- review-gate Stop hook checks review state
- On approval, orchestrator transitions to APPROVED
With message passing (skills-ms5)
Workers append to their outbox: .worker-state/messages/{worker-id}.jsonl
Orchestrator appends to: .worker-state/messages/orchestrator.jsonl
Message format:
{"ts": "2026-01-10T14:00:00Z", "from": "worker-auth", "type": "done", "task": "skills-abc"}
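Appending to and reading back a JSONL outbox could look like this Python sketch. The helper names `append_message` and `read_messages` are assumptions; the message fields follow the example line above.

```python
import json
import os
from datetime import datetime, timezone


def append_message(log_path: str, sender: str, msg_type: str, task: str) -> dict:
    """Append one message as a single JSON line and return it."""
    message = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "from": sender,
        "type": msg_type,
        "task": task,
    }
    line = (json.dumps(message) + "\n").encode("utf-8")
    os.makedirs(os.path.dirname(os.path.abspath(log_path)), exist_ok=True)
    # O_APPEND positions every write at the current end of file; issuing one
    # write call per line keeps each message contiguous in this sketch.
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, line)
    finally:
        os.close(fd)
    return message


def read_messages(log_path: str) -> list:
    """Return all messages in the log, skipping blank lines."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each participant appends only to its own file, writers do not contend with one another; readers simply re-read or tail the files they care about.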
With branch isolation (skills-roq)
- Branch naming: {worker-id}/{task-id}
- Mandatory rebase before IN_REVIEW → APPROVED
- Merge to main only from APPROVED state
Open Questions
- Heartbeat mechanism: Should workers write to their state file, a separate heartbeat file, or append to message log?
- PR creation: Who creates the PR - worker agent or orchestrator?
- Review source: Human review, AI review (review-gate), or both?
- Recycle vs terminate: Should COMPLETED workers return to IDLE or be terminated?
References
- Temporal.io: Event-sourced durable execution
- LangGraph: Graph-based state machines for agents
- Airflow: DAG-based task lifecycle (None → Scheduled → Running → Success)
- OpenHands: Event-sourced (Observation, Thought, Action) loop