skills/docs/design/worker-state-machine.md
Worker State Machine Design

Status: Draft
Bead: skills-4oj
Epic: skills-s6y (Multi-agent orchestration: Lego brick architecture)

Overview

This document defines the state machine for background worker agents in the multi-agent coordination system. Workers are AI coding agents that operate on separate git branches, coordinated through local filesystem state.

Design Principles

  1. File-based, no database - State in JSON files, messages in append-only JSONL
  2. Minimal states - Only what's needed, resist over-engineering
  3. Atomic updates - Write to temp file, fsync, atomic rename
  4. Crash recovery - Detect stale workers via timestamp + process checks
  5. Cross-agent compatible - Any agent that reads/writes files can participate

State Diagram

                    ┌─────────────────────────────────────────────┐
                    │                                             │
                    ▼                                             │
┌──────────┐   assign    ┌──────────┐   start    ┌─────────┐     │
│   IDLE   │ ──────────► │ ASSIGNED │ ─────────► │ WORKING │ ◄───┘
└──────────┘             └──────────┘            └────┬────┘   changes
     ▲                                                │        requested
     │                                                │
     │ reset                              submit_pr   │
     │                                                ▼
┌──────────┐   timeout   ┌──────────┐   review   ┌─────────────┐
│  FAILED  │ ◄────────── │  STALE   │ ◄───────── │ IN_REVIEW   │
└──────────┘             └──────────┘            └──────┬──────┘
     │                                                  │
     │                                         approve  │
     │                                                  ▼
     │                   ┌──────────┐   merge    ┌─────────────┐
     └───────────────────│ COMPLETED│ ◄───────── │  APPROVED   │
           retry         └──────────┘            └─────────────┘

States

| State | Description | Entry Condition | Exit Condition |
|-------|-------------|-----------------|----------------|
| IDLE | Worker available, no task | Initial / after reset | Task assigned |
| ASSIGNED | Task claimed, preparing | Orchestrator assigns task | Branch created, work starts |
| WORKING | Actively coding/testing | git checkout -b succeeds | PR submitted or failure |
| IN_REVIEW | PR open, awaiting review | PR created | Approved, changes requested, or timeout |
| APPROVED | Review passed, ready to merge | Reviewer approves | Merge succeeds or conflicts |
| COMPLETED | Work merged to main | Merge succeeds | Terminal, or recycle to IDLE |
| STALE | No progress detected | Heartbeat timeout | Manual intervention or auto-reset |
| FAILED | Unrecoverable error | Max retries exceeded | Manual reset or reassign |

State File Schema

Location: .worker-state/{worker-id}.json

{
  "worker_id": "worker-auth",
  "state": "WORKING",
  "task_id": "skills-abc",
  "branch": "worker-auth/skills-abc",
  "assigned_at": "2026-01-10T14:00:00Z",
  "state_changed_at": "2026-01-10T14:05:00Z",
  "last_heartbeat": "2026-01-10T14:32:00Z",
  "pid": 12345,
  "attempt": 1,
  "max_attempts": 3,
  "last_error": null,
  "pr_url": null,
  "review_state": null
}

Field Definitions

| Field | Type | Description |
|-------|------|-------------|
| worker_id | string | Unique worker identifier (kebab-case) |
| state | enum | Current state (see States table) |
| task_id | string | Bead ID of assigned task |
| branch | string | Git branch name: {worker_id}/{task_id} |
| assigned_at | ISO8601 | When task was assigned |
| state_changed_at | ISO8601 | Last state transition timestamp |
| last_heartbeat | ISO8601 | Last activity timestamp |
| pid | int | Process ID (for crash detection) |
| attempt | int | Current attempt number (1-indexed) |
| max_attempts | int | Max retries before FAILED |
| last_error | string? | Error message if failed |
| pr_url | string? | Pull request URL when in IN_REVIEW |
| review_state | enum? | pending, approved, changes_requested |
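
As an illustration of the schema, the sketch below loads a state file into a typed record and rejects values the tables above do not allow. It is a minimal sketch, not the project's implementation: `WorkerState`, `load_worker_state`, and the two constant sets are hypothetical names, and it assumes every non-optional field in the table is always present, which may not hold for an IDLE worker with no task.

```python
import json
from dataclasses import dataclass
from typing import Optional

# State names from the States table; review values from the field table.
VALID_STATES = {
    "IDLE", "ASSIGNED", "WORKING", "IN_REVIEW",
    "APPROVED", "COMPLETED", "STALE", "FAILED",
}
VALID_REVIEW_STATES = {"pending", "approved", "changes_requested"}


@dataclass
class WorkerState:
    """One record per .worker-state/{worker-id}.json file (hypothetical type)."""
    worker_id: str
    state: str
    task_id: str
    branch: str
    assigned_at: str
    state_changed_at: str
    last_heartbeat: str
    pid: int
    attempt: int
    max_attempts: int
    last_error: Optional[str] = None
    pr_url: Optional[str] = None
    review_state: Optional[str] = None

    def __post_init__(self) -> None:
        if self.state not in VALID_STATES:
            raise ValueError(f"unknown state: {self.state!r}")
        if self.review_state is not None and self.review_state not in VALID_REVIEW_STATES:
            raise ValueError(f"unknown review_state: {self.review_state!r}")
        # Branch naming rule from the field table: {worker_id}/{task_id}
        expected_branch = f"{self.worker_id}/{self.task_id}"
        if self.branch != expected_branch:
            raise ValueError(f"branch should be {expected_branch!r}, got {self.branch!r}")


def load_worker_state(text: str) -> WorkerState:
    """Parse the JSON text of a state file; unknown keys raise TypeError."""
    return WorkerState(**json.loads(text))
```

Validating on load means a hand-edited or half-understood state file fails loudly at read time rather than during a later transition.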

Transitions

Valid Transitions

IDLE         → ASSIGNED       (assign_task)
ASSIGNED     → WORKING        (start_work)
ASSIGNED     → FAILED         (setup_failed)
WORKING      → IN_REVIEW      (submit_pr)
WORKING      → FAILED         (work_failed, max_attempts)
WORKING      → STALE          (heartbeat_timeout)
IN_REVIEW    → APPROVED       (review_approved)
IN_REVIEW    → WORKING        (changes_requested)
IN_REVIEW    → STALE          (review_timeout)
APPROVED     → COMPLETED      (merge_success)
APPROVED     → WORKING        (merge_conflict → rebase required)
STALE        → WORKING        (worker_recovered)
STALE        → FAILED         (timeout_exceeded)
FAILED       → IDLE           (reset/retry)
COMPLETED    → IDLE           (recycle)
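
The transition list above maps directly onto a lookup table. The following is a minimal Python sketch under stated assumptions, not the project's implementation: `State`, `TRANSITIONS`, `InvalidTransition`, and `next_state` are hypothetical names, and the event names are taken from the parenthesized labels, with "work_failed, max_attempts" shortened to `work_failed` and "reset/retry" to `reset`.

```python
from enum import Enum


class State(str, Enum):
    IDLE = "IDLE"
    ASSIGNED = "ASSIGNED"
    WORKING = "WORKING"
    IN_REVIEW = "IN_REVIEW"
    APPROVED = "APPROVED"
    COMPLETED = "COMPLETED"
    STALE = "STALE"
    FAILED = "FAILED"


# (from_state, event) -> to_state, one entry per line of the list above.
TRANSITIONS = {
    (State.IDLE, "assign_task"): State.ASSIGNED,
    (State.ASSIGNED, "start_work"): State.WORKING,
    (State.ASSIGNED, "setup_failed"): State.FAILED,
    (State.WORKING, "submit_pr"): State.IN_REVIEW,
    (State.WORKING, "work_failed"): State.FAILED,
    (State.WORKING, "heartbeat_timeout"): State.STALE,
    (State.IN_REVIEW, "review_approved"): State.APPROVED,
    (State.IN_REVIEW, "changes_requested"): State.WORKING,
    (State.IN_REVIEW, "review_timeout"): State.STALE,
    (State.APPROVED, "merge_success"): State.COMPLETED,
    (State.APPROVED, "merge_conflict"): State.WORKING,
    (State.STALE, "worker_recovered"): State.WORKING,
    (State.STALE, "timeout_exceeded"): State.FAILED,
    (State.FAILED, "reset"): State.IDLE,
    (State.COMPLETED, "recycle"): State.IDLE,
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def next_state(current: State, event: str) -> State:
    """Return the state reached by `event`, or raise InvalidTransition."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise InvalidTransition(f"{current.value} does not accept {event!r}") from None
```

Keeping the table as data means a command handler only has to call `next_state` before touching the state file; anything not listed is rejected by default.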

Transition Commands

Each transition is triggered by a command or detected condition:

| Command | From | To | Action |
|---------|------|----|--------|
| `worker assign <id> <task>` | IDLE | ASSIGNED | Write state file, record task |
| `worker start <id>` | ASSIGNED | WORKING | Create branch, start heartbeat |
| `worker submit <id>` | WORKING | IN_REVIEW | Push branch, create PR |
| `worker approve <id>` | IN_REVIEW | APPROVED | Record approval |
| `worker merge <id>` | APPROVED | COMPLETED | Merge PR, delete branch |
| `worker reset <id>` | FAILED | IDLE | Clear state, ready for new task |

File Operations

Atomic State Updates

Never write directly to the state file; a crash mid-write would leave a truncated file for readers. Use the atomic rename pattern:

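# Write to a temp file in the same directory, so the rename stays on one filesystem
printf '%s\n' "$new_state" > ".worker-state/${worker_id}.json.tmp"

# Sync to disk (GNU coreutils sync accepts a file argument;
# other implementations may only support a bare `sync`)
sync ".worker-state/${worker_id}.json.tmp"

# Atomic rename: readers see either the old file or the new one, never a partial write
mv ".worker-state/${worker_id}.json.tmp" ".worker-state/${worker_id}.json"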
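
For agents that are not shell scripts, the same write, fsync, rename sequence can be expressed in Python. This is a minimal sketch assuming a POSIX filesystem; `write_state_atomic` is a hypothetical helper name, not part of any existing tool.

```python
import json
import os
import tempfile


def write_state_atomic(path: str, state: dict) -> None:
    """Write `state` as JSON to `path` via temp file, fsync, and rename."""
    directory = os.path.dirname(path) or "."
    # The temp file lives next to the target so the rename never crosses filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
            f.write("\n")
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic replacement of the old file
    except BaseException:
        # Best-effort cleanup so failed writes do not leave temp files behind.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Using a uniquely named temp file, rather than a fixed `.tmp` suffix, also keeps two concurrent writers from clobbering each other's temp data.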

Optimistic Concurrency

Use state_changed_at as a version marker so that a writer can detect an update made by someone else since it last read the file:

  1. Read current state file
  2. Compute new state
  3. Before write, re-read and verify state_changed_at unchanged
  4. If changed, retry from step 1

Lock File Pattern

For operations needing exclusive access:

lock=".worker-state/${worker_id}.lock"

# Acquire lock (noclobber makes the redirect an atomic create with O_EXCL)
if ! (set -C; echo "$$" > "$lock") 2>/dev/null; then
    # Lock held - check if stale
    lock_pid=$(cat "$lock" 2>/dev/null)
    if ! kill -0 "$lock_pid" 2>/dev/null; then
        # Process dead: remove the stale lock, then re-acquire with the same
        # atomic create, so two processes stealing at once cannot both win
        rm -f "$lock"
        (set -C; echo "$$" > "$lock") 2>/dev/null || exit 1
    else
        exit 1  # Lock held by live process
    fi
fi

# ... do work ...

# Release lock
rm -f "$lock"
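
The same lock protocol can be written in Python with an explicit `O_EXCL` create. This is a sketch under assumptions: `try_acquire`, `release`, and `_pid_alive` are hypothetical names, the liveness probe (`os.kill(pid, 0)`) is POSIX-only, and a lock file whose contents cannot be parsed is treated as held rather than stolen, since its owner may simply not have written its PID yet.

```python
import os


def _pid_alive(pid: int) -> bool:
    """Probe a process with signal 0 (POSIX only; no signal is delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def try_acquire(lock_path: str) -> bool:
    """Try to take the lock; steal it once if the recorded holder is dead."""
    for _ in range(2):
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                with open(lock_path, encoding="utf-8") as f:
                    holder = int(f.read().strip())
            except FileNotFoundError:
                continue  # released in the meantime, try again
            except ValueError:
                return False  # unreadable contents: assume it is held
            if _pid_alive(holder):
                return False
            try:
                os.unlink(lock_path)  # stale lock left by a dead process
            except FileNotFoundError:
                pass
            continue  # re-acquire through the same atomic create
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(str(os.getpid()))
        return True
    return False


def release(lock_path: str) -> None:
    """Remove the lock file; a missing file is not an error."""
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass
```

Callers would typically wrap the protected work in `try`/`finally` so that `release` runs even when the work raises.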

Stale Worker Detection

Workers must update last_heartbeat during active work (every 30-60 seconds).

Detection algorithm (run by orchestrator or patrol agent):

for each worker file in .worker-state/*.json:
    if state in (ASSIGNED, WORKING, IN_REVIEW):
        if (now - last_heartbeat) > STALE_TIMEOUT:
            if pid is set and process not running:
                transition to STALE (worker crashed)
            elif (now - last_heartbeat) > DEAD_TIMEOUT:
                transition to STALE (presumed dead)

Timeout Values

| Timeout | Duration | Description |
|---------|----------|-------------|
| STALE_TIMEOUT | 5 minutes | Mark as stale, eligible for intervention |
| DEAD_TIMEOUT | 15 minutes | Presume dead, ready for reassignment |
| REVIEW_TIMEOUT | 1 hour | Review taking too long, alert human |
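
The detection algorithm and the first two timeouts can be sketched in Python as follows. This is an illustrative sketch, not the orchestrator's code: `find_stale`, `parse_ts`, and `pid_running` are hypothetical names, timestamps are assumed to use the whole-second `Z` format from the schema example, and the function only reports which workers should move to STALE rather than rewriting their state files. REVIEW_TIMEOUT is left out because the pseudocode does not use it.

```python
import json
import os
from datetime import datetime, timedelta, timezone
from pathlib import Path

STALE_TIMEOUT = timedelta(minutes=5)
DEAD_TIMEOUT = timedelta(minutes=15)
ACTIVE_STATES = {"ASSIGNED", "WORKING", "IN_REVIEW"}


def parse_ts(value: str) -> datetime:
    """Parse a timestamp such as 2026-01-10T14:32:00Z as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def pid_running(pid: int) -> bool:
    """POSIX-only liveness probe using signal 0."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def find_stale(state_dir, now=None, is_running=pid_running) -> list:
    """Return (worker_id, reason) pairs for workers that should become STALE."""
    now = now or datetime.now(timezone.utc)
    stale = []
    # Non-recursive glob: files under reviews/ and messages/ are not matched.
    for path in sorted(Path(state_dir).glob("*.json")):
        worker = json.loads(path.read_text(encoding="utf-8"))
        if worker.get("state") not in ACTIVE_STATES:
            continue
        silence = now - parse_ts(worker["last_heartbeat"])
        if silence <= STALE_TIMEOUT:
            continue
        pid = worker.get("pid")
        if pid is not None and not is_running(pid):
            stale.append((worker["worker_id"], "worker crashed"))
        elif silence > DEAD_TIMEOUT:
            stale.append((worker["worker_id"], "presumed dead"))
    return stale
```

Passing `now` and `is_running` as parameters keeps the check deterministic in tests; a patrol agent would call `find_stale(".worker-state")` with the defaults.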

Directory Structure

.worker-state/
├── worker-auth.json          # Worker state file
├── worker-auth.lock          # Lock file (when held)
├── worker-refactor.json
├── messages/
│   ├── orchestrator.jsonl    # Orchestrator → workers
│   ├── worker-auth.jsonl     # Worker → orchestrator
│   └── worker-refactor.jsonl
└── reviews/
    ├── worker-auth.json      # Review state (review-gate)
    └── worker-refactor.json

Integration Points

With review-gate (skills-byq)

When worker submits PR (WORKING → IN_REVIEW):

  1. Worker writes review request to reviews/{worker-id}.json
  2. review-gate Stop hook checks review state
  3. On approval, orchestrator transitions to APPROVED

With message passing (skills-ms5)

Workers append to their outbox: .worker-state/messages/{worker-id}.jsonl

Orchestrator appends to: .worker-state/messages/orchestrator.jsonl

Message format:

{"ts": "2026-01-10T14:00:00Z", "from": "worker-auth", "type": "done", "task": "skills-abc"}
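
Appending and reading such messages might look like the sketch below. It is an assumption-laden illustration: `append_message` and `read_messages` are hypothetical helpers, only the four fields shown in the example are written, and the file is opened with `O_APPEND` so that each record is written as one line at the end of the log.

```python
import json
import os
from datetime import datetime, timezone


def append_message(messages_dir, sender, msg_type, task, ts=None) -> dict:
    """Append one JSON line to {messages_dir}/{sender}.jsonl and return the record."""
    ts = ts or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    record = {"ts": ts, "from": sender, "type": msg_type, "task": task}
    line = json.dumps(record) + "\n"
    os.makedirs(messages_dir, exist_ok=True)
    path = os.path.join(messages_dir, f"{sender}.jsonl")
    # O_APPEND positions every write at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line.encode("utf-8"))
    finally:
        os.close(fd)
    return record


def read_messages(path) -> list:
    """Read every message from a JSONL log, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because the log is append-only, a reader can remember how many lines it has already processed and resume from there on the next poll.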

With branch isolation (skills-roq)

  • Branch naming: {worker-id}/{task-id}
  • Mandatory rebase before IN_REVIEW → APPROVED
  • Merge to main only from APPROVED state

Open Questions

  1. Heartbeat mechanism: Should workers write to their state file, a separate heartbeat file, or append to message log?
  2. PR creation: Who creates the PR - worker agent or orchestrator?
  3. Review source: Human review, AI review (review-gate), or both?
  4. Recycle vs terminate: Should COMPLETED workers return to IDLE or be terminated?

References

  • Temporal.io: Event-sourced durable execution
  • LangGraph: Graph-based state machines for agents
  • Airflow: DAG-based task lifecycle (None → Scheduled → Running → Success)
  • OpenHands: Event-sourced (Observation, Thought, Action) loop