Multi-agent coordination CLI with SQLite message bus: - State machine: ASSIGNED -> WORKING -> IN_REVIEW -> APPROVED -> COMPLETED - Commands: spawn, start, done, approve, merge, cancel, fail, heartbeat - SQLite WAL mode, dedicated heartbeat thread, channel-based IPC - cligen for CLI, tiny_sqlite for DB, ORC memory management Design docs for branch-per-worker, state machine, message passing, and human observability patterns. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
259 lines
10 KiB
Markdown
259 lines
10 KiB
Markdown
# Worker State Machine Design
|
|
|
|
**Status**: Draft
|
|
**Bead**: skills-4oj
|
|
**Epic**: skills-s6y (Multi-agent orchestration: Lego brick architecture)
|
|
|
|
## Overview
|
|
|
|
This document defines the state machine for background worker agents in the multi-agent coordination system. Workers are AI coding agents that operate on separate git branches, coordinated through local filesystem state.
|
|
|
|
## Design Principles
|
|
|
|
1. **File-based, no database** - State in JSON files, messages in append-only JSONL
|
|
2. **Minimal states** - Only what's needed, resist over-engineering
|
|
3. **Atomic updates** - Write to temp file, fsync, atomic rename
|
|
4. **Crash recovery** - Detect stale workers via timestamp + process checks
|
|
5. **Cross-agent compatible** - Any agent that reads/writes files can participate
|
|
|
|
## State Diagram
|
|
|
|
```
|
|
┌─────────────────────────────────────────────┐
|
|
│ │
|
|
▼ │
|
|
┌──────────┐ assign ┌──────────┐ start ┌─────────┐ │
|
|
│ IDLE │ ──────────► │ ASSIGNED │ ─────────► │ WORKING │ ◄───┘
|
|
└──────────┘ └──────────┘ └────┬────┘ changes
|
|
▲ │ requested
|
|
│ │
|
|
│ reset submit_pr │
|
|
│ ▼
|
|
┌──────────┐ timeout ┌──────────┐ review ┌─────────────┐
|
|
│ FAILED │ ◄────────── │ STALE │ ◄───────── │ IN_REVIEW │
|
|
└──────────┘ └──────────┘ └──────┬──────┘
|
|
│ │
|
|
│ approve │
|
|
│ ▼
|
|
│ ┌──────────┐ merge ┌─────────────┐
|
|
└───────────────────│ COMPLETED│ ◄───────── │ APPROVED │
|
|
retry └──────────┘ └─────────────┘
|
|
```
|
|
|
|
## States
|
|
|
|
| State | Description | Entry Condition | Exit Condition |
|
|
|-------|-------------|-----------------|----------------|
|
|
| **IDLE** | Worker available, no task | Initial / after reset | Task assigned |
|
|
| **ASSIGNED** | Task claimed, preparing | Orchestrator assigns task | Branch created, work starts |
|
|
| **WORKING** | Actively coding/testing | `git checkout -b` succeeds | PR submitted or failure |
|
|
| **IN_REVIEW** | PR open, awaiting review | PR created | Approved, changes requested, or timeout |
|
|
| **APPROVED** | Review passed, ready to merge | Reviewer approves | Merge succeeds or conflicts |
|
|
| **COMPLETED** | Work merged to main | Merge succeeds | (terminal) or recycle to IDLE |
|
|
| **STALE** | No progress detected | Heartbeat timeout | Manual intervention or auto-reset |
|
|
| **FAILED** | Unrecoverable error | Max retries exceeded | Manual reset or reassign |
|
|
|
|
## State File Schema
|
|
|
|
Location: `.worker-state/{worker-id}.json`
|
|
|
|
```json
|
|
{
|
|
"worker_id": "worker-auth",
|
|
"state": "WORKING",
|
|
"task_id": "skills-abc",
|
|
"branch": "worker-auth/skills-abc",
|
|
"assigned_at": "2026-01-10T14:00:00Z",
|
|
"state_changed_at": "2026-01-10T14:05:00Z",
|
|
"last_heartbeat": "2026-01-10T14:32:00Z",
|
|
"pid": 12345,
|
|
"attempt": 1,
|
|
"max_attempts": 3,
|
|
"last_error": null,
|
|
"pr_url": null,
|
|
"review_state": null
|
|
}
|
|
```
|
|
|
|
### Field Definitions
|
|
|
|
| Field | Type | Description |
|
|
|-------|------|-------------|
|
|
| `worker_id` | string | Unique worker identifier (kebab-case) |
|
|
| `state` | enum | Current state (see States table) |
|
|
| `task_id` | string | Bead ID of assigned task |
|
|
| `branch` | string | Git branch name: `{worker_id}/{task_id}` |
|
|
| `assigned_at` | ISO8601 | When task was assigned |
|
|
| `state_changed_at` | ISO8601 | Last state transition timestamp |
|
|
| `last_heartbeat` | ISO8601 | Last activity timestamp |
|
|
| `pid` | int | Process ID (for crash detection) |
|
|
| `attempt` | int | Current attempt number (1-indexed) |
|
|
| `max_attempts` | int | Max retries before FAILED |
|
|
| `last_error` | string? | Error message if failed |
|
|
| `pr_url` | string? | Pull request URL when in IN_REVIEW |
|
|
| `review_state` | enum? | `pending`, `approved`, `changes_requested` |
|
|
|
|
## Transitions
|
|
|
|
### Valid Transitions
|
|
|
|
```
|
|
IDLE → ASSIGNED (assign_task)
|
|
ASSIGNED → WORKING (start_work)
|
|
ASSIGNED → FAILED (setup_failed)
|
|
WORKING → IN_REVIEW (submit_pr)
|
|
WORKING → FAILED (work_failed, max_attempts)
|
|
WORKING → STALE (heartbeat_timeout)
|
|
IN_REVIEW → APPROVED (review_approved)
|
|
IN_REVIEW → WORKING (changes_requested)
|
|
IN_REVIEW → STALE (review_timeout)
|
|
APPROVED → COMPLETED (merge_success)
|
|
APPROVED → WORKING (merge_conflict → rebase required)
|
|
STALE → WORKING (worker_recovered)
|
|
STALE → FAILED (timeout_exceeded)
|
|
FAILED → IDLE (reset/retry)
|
|
COMPLETED → IDLE (recycle)
|
|
```
|
|
|
|
### Transition Commands
|
|
|
|
Each transition is triggered by a command or detected condition:
|
|
|
|
| Command | From | To | Action |
|
|
|---------|------|----|----|
|
|
| `worker assign <id> <task>` | IDLE | ASSIGNED | Write state file, record task |
|
|
| `worker start <id>` | ASSIGNED | WORKING | Create branch, start heartbeat |
|
|
| `worker submit <id>` | WORKING | IN_REVIEW | Push branch, create PR |
|
|
| `worker approve <id>` | IN_REVIEW | APPROVED | Record approval |
|
|
| `worker merge <id>` | APPROVED | COMPLETED | Merge PR, delete branch |
|
|
| `worker reset <id>` | FAILED | IDLE | Clear state, ready for new task |
|
|
|
|
## File Operations
|
|
|
|
### Atomic State Updates
|
|
|
|
Never write directly to state file. Use atomic rename pattern:
|
|
|
|
```bash
|
|
# Write to temp file
|
|
echo "$new_state" > .worker-state/${worker_id}.json.tmp
|
|
|
|
# Sync to disk
|
|
sync .worker-state/${worker_id}.json.tmp
|
|
|
|
# Atomic rename
|
|
mv .worker-state/${worker_id}.json.tmp .worker-state/${worker_id}.json
|
|
```
|
|
|
|
### Optimistic Concurrency
|
|
|
|
Include `state_changed_at` as a version marker:
|
|
|
|
1. Read current state file
|
|
2. Compute new state
|
|
3. Before write, re-read and verify `state_changed_at` unchanged
|
|
4. If changed, retry from step 1
|
|
|
|
### Lock File Pattern
|
|
|
|
For operations needing exclusive access:
|
|
|
|
```bash
|
|
# Acquire lock (atomic create with O_EXCL)
|
|
if ! (set -C; echo "$$" > .worker-state/${worker_id}.lock) 2>/dev/null; then
|
|
# Lock held - check if stale
|
|
lock_pid=$(cat .worker-state/${worker_id}.lock)
|
|
if ! kill -0 "$lock_pid" 2>/dev/null; then
|
|
# Process dead, steal lock
|
|
rm -f .worker-state/${worker_id}.lock
|
|
echo "$$" > .worker-state/${worker_id}.lock
|
|
else
|
|
exit 1 # Lock held by live process
|
|
fi
|
|
fi
|
|
|
|
# ... do work ...
|
|
|
|
# Release lock
|
|
rm -f .worker-state/${worker_id}.lock
|
|
```
|
|
|
|
## Stale Worker Detection
|
|
|
|
Workers must update `last_heartbeat` during active work (every 30-60 seconds).
|
|
|
|
Detection algorithm (run by orchestrator or patrol agent):
|
|
|
|
```
|
|
for each worker file in .worker-state/*.json:
|
|
if state in (ASSIGNED, WORKING, IN_REVIEW):
|
|
if (now - last_heartbeat) > STALE_TIMEOUT:
|
|
if pid is set and process not running:
|
|
transition to STALE (worker crashed)
|
|
elif (now - last_heartbeat) > DEAD_TIMEOUT:
|
|
transition to STALE (presumed dead)
|
|
```
|
|
|
|
### Timeout Values
|
|
|
|
| Timeout | Duration | Description |
|
|
|---------|----------|-------------|
|
|
| `STALE_TIMEOUT` | 5 minutes | Mark as stale, eligible for intervention |
|
|
| `DEAD_TIMEOUT` | 15 minutes | Presume dead, ready for reassignment |
|
|
| `REVIEW_TIMEOUT` | 1 hour | Review taking too long, alert human |
|
|
|
|
## Directory Structure
|
|
|
|
```
|
|
.worker-state/
|
|
├── worker-auth.json # Worker state file
|
|
├── worker-auth.lock # Lock file (when held)
|
|
├── worker-refactor.json
|
|
├── messages/
|
|
│ ├── orchestrator.jsonl # Orchestrator → workers
|
|
│ ├── worker-auth.jsonl # Worker → orchestrator
|
|
│ └── worker-refactor.jsonl
|
|
└── reviews/
|
|
├── worker-auth.json # Review state (review-gate)
|
|
└── worker-refactor.json
|
|
```
|
|
|
|
## Integration Points
|
|
|
|
### With review-gate (skills-byq)
|
|
|
|
When worker submits PR (WORKING → IN_REVIEW):
|
|
1. Worker writes review request to `reviews/{worker-id}.json`
|
|
2. review-gate Stop hook checks review state
|
|
3. On approval, orchestrator transitions to APPROVED
|
|
|
|
### With message passing (skills-ms5)
|
|
|
|
Workers append to their outbox: `.worker-state/messages/{worker-id}.jsonl`
|
|
Orchestrator appends to: `.worker-state/messages/orchestrator.jsonl`
|
|
|
|
Message format:
|
|
```json
|
|
{"ts": "2026-01-10T14:00:00Z", "from": "worker-auth", "type": "done", "task": "skills-abc"}
|
|
```
|
|
|
|
### With branch isolation (skills-roq)
|
|
|
|
- Branch naming: `{worker-id}/{task-id}`
|
|
- Mandatory rebase before IN_REVIEW → APPROVED
|
|
- Merge to main only from APPROVED state
|
|
|
|
## Open Questions
|
|
|
|
1. **Heartbeat mechanism**: Should workers write to their state file, a separate heartbeat file, or append to message log?
|
|
2. **PR creation**: Who creates the PR - worker agent or orchestrator?
|
|
3. **Review source**: Human review, AI review (review-gate), or both?
|
|
4. **Recycle vs terminate**: Should COMPLETED workers return to IDLE or be terminated?
|
|
|
|
## References
|
|
|
|
- Temporal.io: Event-sourced durable execution
|
|
- LangGraph: Graph-based state machines for agents
|
|
- Airflow: DAG-based task lifecycle (None → Scheduled → Running → Success)
|
|
- OpenHands: Event-sourced (Observation, Thought, Action) loop
|