dan/skills

dan 1c66d019bd feat: add worker CLI scaffold in Nim

Multi-agent coordination CLI with SQLite message bus:
- State machine: ASSIGNED -> WORKING -> IN_REVIEW -> APPROVED -> COMPLETED
- Commands: spawn, start, done, approve, merge, cancel, fail, heartbeat
- SQLite WAL mode, dedicated heartbeat thread, channel-based IPC
- cligen for CLI, tiny_sqlite for DB, ORC memory management

Design docs for branch-per-worker, state machine, message passing,
and human observability patterns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-10 18:47:47 -08:00

10 KiB

Worker State Machine Design

Status: Draft
Bead: skills-4oj
Epic: skills-s6y (Multi-agent orchestration: Lego brick architecture)

Overview

This document defines the state machine for background worker agents in the multi-agent coordination system. Workers are AI coding agents that operate on separate git branches, coordinated through local filesystem state.

Design Principles

File-based, no database - State in JSON files, messages in append-only JSONL
Minimal states - Only what's needed, resist over-engineering
Atomic updates - Write to temp file, fsync, atomic rename
Crash recovery - Detect stale workers via timestamp + process checks
Cross-agent compatible - Any agent that reads/writes files can participate

State Diagram

                    ┌─────────────────────────────────────────────┐
                    │                                             │
                    ▼                                             │
┌──────────┐   assign    ┌──────────┐   start    ┌─────────┐     │
│   IDLE   │ ──────────► │ ASSIGNED │ ─────────► │ WORKING │ ◄───┘
└──────────┘             └──────────┘            └────┬────┘   changes
     ▲                                                │        requested
     │                                                │
     │ reset                              submit_pr   │
     │                                                ▼
┌──────────┐   timeout   ┌──────────┐   review   ┌─────────────┐
│  FAILED  │ ◄────────── │  STALE   │ ◄───────── │ IN_REVIEW   │
└──────────┘             └──────────┘            └──────┬──────┘
     │                                                  │
     │                                         approve  │
     │                                                  ▼
     │                   ┌──────────┐   merge    ┌─────────────┐
     └───────────────────│ COMPLETED│ ◄───────── │  APPROVED   │
           retry         └──────────┘            └─────────────┘

States

State	Description	Entry Condition	Exit Condition
IDLE	Worker available, no task	Initial / after reset	Task assigned
ASSIGNED	Task claimed, preparing	Orchestrator assigns task	Branch created, work starts
WORKING	Actively coding/testing	`git checkout -b` succeeds	PR submitted or failure
IN_REVIEW	PR open, awaiting review	PR created	Approved, changes requested, or timeout
APPROVED	Review passed, ready to merge	Reviewer approves	Merge succeeds or conflicts
COMPLETED	Work merged to main	Merge succeeds	(terminal) or recycle to IDLE
STALE	No progress detected	Heartbeat or review timeout	Manual intervention or auto-reset
FAILED	Unrecoverable error	Setup failure, max retries exceeded, or stale timeout exceeded	Manual reset or reassign
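
As a minimal sketch, the states table could be encoded as an enum so that an unknown state string in a state file is rejected early. The names `WorkerState`, `ACTIVE_STATES`, and `parse_state` are illustrative assumptions, not part of the design.

```python
from enum import Enum


class WorkerState(str, Enum):
    """The eight worker states listed in the States table."""
    IDLE = "IDLE"
    ASSIGNED = "ASSIGNED"
    WORKING = "WORKING"
    IN_REVIEW = "IN_REVIEW"
    APPROVED = "APPROVED"
    COMPLETED = "COMPLETED"
    STALE = "STALE"
    FAILED = "FAILED"


# Assumption: states in which a worker holds a task and is expected to
# show progress (the same set the stale-detection algorithm inspects).
ACTIVE_STATES = frozenset(
    {WorkerState.ASSIGNED, WorkerState.WORKING, WorkerState.IN_REVIEW}
)


def parse_state(raw: str) -> WorkerState:
    """Convert the "state" string from a state file, rejecting unknown values."""
    try:
        return WorkerState(raw)
    except ValueError:
        raise ValueError(f"unknown worker state: {raw!r}") from None
```

Subclassing `str` keeps the enum members directly JSON-serializable, which suits a design where state lives in plain JSON files.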

State File Schema

Location: .worker-state/{worker-id}.json

{
  "worker_id": "worker-auth",
  "state": "WORKING",
  "task_id": "skills-abc",
  "branch": "worker-auth/skills-abc",
  "assigned_at": "2026-01-10T14:00:00Z",
  "state_changed_at": "2026-01-10T14:05:00Z",
  "last_heartbeat": "2026-01-10T14:32:00Z",
  "pid": 12345,
  "attempt": 1,
  "max_attempts": 3,
  "last_error": null,
  "pr_url": null,
  "review_state": null
}

Field Definitions

Field	Type	Description
`worker_id`	string	Unique worker identifier (kebab-case)
`state`	enum	Current state (see States table)
`task_id`	string	Bead ID of assigned task
`branch`	string	Git branch name: `{worker_id}/{task_id}`
`assigned_at`	ISO8601	When task was assigned
`state_changed_at`	ISO8601	Last state transition timestamp
`last_heartbeat`	ISO8601	Last activity timestamp
`pid`	int	Process ID (for crash detection)
`attempt`	int	Current attempt number (1-indexed)
`max_attempts`	int	Max retries before FAILED
`last_error`	string?	Error message if failed
`pr_url`	string?	Pull request URL when in IN_REVIEW
`review_state`	enum?	`pending`, `approved`, `changes_requested`
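
The schema above could be mirrored by a small typed record that validates a state file on load. This is a sketch under assumptions: the class name `WorkerRecord`, the function `parse_record`, and the specific checks (branch naming, attempt bounds) are illustrative and derived only from the field definitions in the table.

```python
import json
from dataclasses import dataclass
from typing import Optional

VALID_STATES = {
    "IDLE", "ASSIGNED", "WORKING", "IN_REVIEW",
    "APPROVED", "COMPLETED", "STALE", "FAILED",
}
VALID_REVIEW_STATES = {None, "pending", "approved", "changes_requested"}


@dataclass
class WorkerRecord:
    worker_id: str
    state: str
    task_id: str
    branch: str
    assigned_at: str        # ISO8601
    state_changed_at: str   # ISO8601
    last_heartbeat: str     # ISO8601
    pid: int
    attempt: int
    max_attempts: int
    last_error: Optional[str] = None
    pr_url: Optional[str] = None
    review_state: Optional[str] = None

    def validate(self) -> None:
        """Check the constraints stated in the field definitions."""
        if self.state not in VALID_STATES:
            raise ValueError(f"unknown state: {self.state!r}")
        if self.review_state not in VALID_REVIEW_STATES:
            raise ValueError(f"unknown review_state: {self.review_state!r}")
        if self.branch != f"{self.worker_id}/{self.task_id}":
            raise ValueError(f"branch {self.branch!r} is not {{worker_id}}/{{task_id}}")
        if not 1 <= self.attempt <= self.max_attempts:
            raise ValueError("attempt must be between 1 and max_attempts")


def parse_record(text: str) -> WorkerRecord:
    """Parse and validate the JSON text of one state file."""
    record = WorkerRecord(**json.loads(text))  # TypeError on missing/extra keys
    record.validate()
    return record
```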

Transitions

Valid Transitions

IDLE         → ASSIGNED       (assign_task)
ASSIGNED     → WORKING        (start_work)
ASSIGNED     → FAILED         (setup_failed)
WORKING      → IN_REVIEW      (submit_pr)
WORKING      → FAILED         (work_failed, max_attempts)
WORKING      → STALE          (heartbeat_timeout)
IN_REVIEW    → APPROVED       (review_approved)
IN_REVIEW    → WORKING        (changes_requested)
IN_REVIEW    → STALE          (review_timeout)
APPROVED     → COMPLETED      (merge_success)
APPROVED     → WORKING        (merge_conflict → rebase required)
STALE        → WORKING        (worker_recovered)
STALE        → FAILED         (timeout_exceeded)
FAILED       → IDLE           (reset/retry)
COMPLETED    → IDLE           (recycle)
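
The transition list above maps naturally onto a lookup table keyed by (from, to) pairs. The following is a minimal sketch; `TRANSITIONS`, `check_transition`, and `next_states` are hypothetical names, and the event strings are copied from the list.

```python
# (from_state, to_state) -> triggering event, copied from the list above.
TRANSITIONS = {
    ("IDLE", "ASSIGNED"): "assign_task",
    ("ASSIGNED", "WORKING"): "start_work",
    ("ASSIGNED", "FAILED"): "setup_failed",
    ("WORKING", "IN_REVIEW"): "submit_pr",
    ("WORKING", "FAILED"): "work_failed",
    ("WORKING", "STALE"): "heartbeat_timeout",
    ("IN_REVIEW", "APPROVED"): "review_approved",
    ("IN_REVIEW", "WORKING"): "changes_requested",
    ("IN_REVIEW", "STALE"): "review_timeout",
    ("APPROVED", "COMPLETED"): "merge_success",
    ("APPROVED", "WORKING"): "merge_conflict",
    ("STALE", "WORKING"): "worker_recovered",
    ("STALE", "FAILED"): "timeout_exceeded",
    ("FAILED", "IDLE"): "reset",
    ("COMPLETED", "IDLE"): "recycle",
}


def check_transition(current: str, new: str) -> str:
    """Return the event name for a valid transition, or raise ValueError."""
    try:
        return TRANSITIONS[(current, new)]
    except KeyError:
        raise ValueError(f"invalid transition: {current} -> {new}") from None


def next_states(current: str) -> set:
    """All states reachable from `current` in one step."""
    return {to for (frm, to) in TRANSITIONS if frm == current}
```

Keeping the table as data rather than branching logic makes it easy to diff against this document when the design changes.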

Transition Commands

Each transition is triggered by a command or detected condition:

Command	From	To	Action
`worker assign <id> <task>`	IDLE	ASSIGNED	Write state file, record task
`worker start <id>`	ASSIGNED	WORKING	Create branch, start heartbeat
`worker submit <id>`	WORKING	IN_REVIEW	Push branch, create PR
`worker approve <id>`	IN_REVIEW	APPROVED	Record approval
`worker merge <id>`	APPROVED	COMPLETED	Merge PR, delete branch
`worker reset <id>`	FAILED	IDLE	Clear state, ready for new task
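
One way to sketch the command table is a mapping from command name to its required source state and target state, with a pure function that produces the updated record. The names `COMMANDS` and `apply_command` are assumptions, and the side effects in the Action column (creating branches, opening PRs, starting the heartbeat) are deliberately left out.

```python
from datetime import datetime, timezone
from typing import Optional

# command -> (required current state, resulting state), from the table above.
COMMANDS = {
    "assign": ("IDLE", "ASSIGNED"),
    "start": ("ASSIGNED", "WORKING"),
    "submit": ("WORKING", "IN_REVIEW"),
    "approve": ("IN_REVIEW", "APPROVED"),
    "merge": ("APPROVED", "COMPLETED"),
    "reset": ("FAILED", "IDLE"),
}


def apply_command(record: dict, command: str,
                  now: Optional[datetime] = None) -> dict:
    """Return a copy of `record` after `command`; the input is not modified."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    required, target = COMMANDS[command]
    if record["state"] != required:
        raise ValueError(
            f"'{command}' requires state {required}, worker is {record['state']}"
        )
    now = now or datetime.now(timezone.utc)
    updated = dict(record)
    updated["state"] = target
    # Every transition bumps the timestamp used as a version marker.
    updated["state_changed_at"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return updated
```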

File Operations

Atomic State Updates

Never write directly to the state file. Use the atomic rename pattern:

tmp=".worker-state/${worker_id}.json.tmp"

# Write to a temp file in the same directory (rename is only atomic within one filesystem)
printf '%s\n' "$new_state" > "$tmp"

# Sync to disk (GNU sync accepts a file argument; other implementations may not)
sync "$tmp"

# Atomic rename
mv "$tmp" ".worker-state/${worker_id}.json"
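
For agents that are not shell scripts, the same write, fsync, rename sequence can be sketched in Python. The function name `write_state_atomic` is an assumption. The directory fsync at the end is an addition beyond the shell snippet and works on POSIX systems only.

```python
import json
import os


def write_state_atomic(path: str, record: dict) -> None:
    """Replace the state file at `path` so readers never see a partial write."""
    directory = os.path.dirname(path) or "."
    tmp_path = path + ".tmp"  # same directory, so the rename stays on one filesystem

    # 1. Write the new content to the temp file and force it to disk.
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
        f.write("\n")
        f.flush()
        os.fsync(f.fileno())

    # 2. Atomically swap the temp file into place.
    os.replace(tmp_path, path)

    # 3. Flush the directory entry so the rename itself survives a crash
    #    (POSIX only; opening a directory this way fails on Windows).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Because the temp file name is fixed, two writers for the same worker would collide on it; this sketch assumes one writer at a time per worker.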

Optimistic Concurrency

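Use state_changed_at as a version marker:

1. Read the current state file
2. Compute the new state
3. Before writing, re-read the file and verify that state_changed_at is unchanged
4. If it changed, retry from step 1

The check in step 3 narrows the race window but does not close it: another writer can still slip in between the re-read and the rename. When that matters, hold the lock file (see Lock File Pattern) across the check and the write.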

Lock File Pattern

For operations needing exclusive access:

lock=".worker-state/${worker_id}.lock"

# Acquire lock (noclobber makes the redirect fail if the file exists, like O_EXCL)
if ! (set -C; echo "$$" > "$lock") 2>/dev/null; then
    # Lock held - check whether the holder is still alive
    lock_pid=$(cat "$lock" 2>/dev/null)
    if [ -n "$lock_pid" ] && kill -0 "$lock_pid" 2>/dev/null; then
        exit 1  # Lock held by live process
    fi
    # Holder is dead: remove the stale lock, then re-acquire exclusively
    # so that two processes stealing at the same time cannot both succeed
    rm -f "$lock"
    (set -C; echo "$$" > "$lock") 2>/dev/null || exit 1
fi

# Release the lock on exit, even if the work below fails
trap 'rm -f "$lock"' EXIT

# ... do work ...
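
The same lock protocol can be sketched in Python with `os.open` and `O_EXCL`. The context manager `worker_lock` and its helpers are hypothetical names. Liveness is checked with signal 0, which is POSIX only, and an unreadable lock file is treated as stale, which is an assumption of this sketch.

```python
import os
from contextlib import contextmanager


def _pid_alive(pid: int) -> bool:
    """Signal 0 checks for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def _try_create(lock_path: str) -> bool:
    """Create the lock file exclusively; False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


@contextmanager
def worker_lock(lock_path: str):
    """Hold the lock for the duration of the with-block."""
    if not _try_create(lock_path):
        try:
            with open(lock_path) as f:
                holder = int(f.read().strip())
        except (FileNotFoundError, ValueError):
            holder = None  # assumption: unreadable lock counts as stale
        if holder is not None and _pid_alive(holder):
            raise BlockingIOError(f"lock held by live process {holder}")
        # Stale lock: remove it, then re-acquire exclusively so that
        # two processes stealing at once cannot both win.
        try:
            os.unlink(lock_path)
        except FileNotFoundError:
            pass
        if not _try_create(lock_path):
            raise BlockingIOError("lost the race while stealing a stale lock")
    try:
        yield
    finally:
        try:
            os.unlink(lock_path)
        except FileNotFoundError:
            pass
```

A caller would wrap the critical section as `with worker_lock(".worker-state/worker-auth.lock"): ...` and treat `BlockingIOError` as "try again later".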

Stale Worker Detection

Workers must update last_heartbeat during active work (every 30-60 seconds).

Detection algorithm (run by orchestrator or patrol agent):

for each worker file in .worker-state/*.json:
    if state in (ASSIGNED, WORKING, IN_REVIEW):
        if (now - last_heartbeat) > STALE_TIMEOUT:
            if pid is set and process not running:
                transition to STALE (worker crashed)
            elif (now - last_heartbeat) > DEAD_TIMEOUT:
                transition to STALE (presumed dead)

Timeout Values

Timeout	Duration	Description
`STALE_TIMEOUT`	5 minutes	Mark as stale, eligible for intervention
`DEAD_TIMEOUT`	15 minutes	Presume dead, ready for reassignment
`REVIEW_TIMEOUT`	1 hour	Review taking too long, alert human
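
The detection algorithm and the first two timeouts can be sketched as below. It follows the pseudocode literally, so a worker whose process is still alive is only flagged once `DEAD_TIMEOUT` has passed. The function names and the injectable `is_running` parameter are assumptions added for testability. `REVIEW_TIMEOUT` is not handled here because the pseudocode does not use it.

```python
import glob
import json
import os
from datetime import datetime, timedelta, timezone

STALE_TIMEOUT = timedelta(minutes=5)
DEAD_TIMEOUT = timedelta(minutes=15)
CHECKED_STATES = {"ASSIGNED", "WORKING", "IN_REVIEW"}


def parse_ts(value: str) -> datetime:
    """Parse timestamps of the form used in the state file example."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def pid_running(pid: int) -> bool:
    """POSIX liveness check via signal 0."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def stale_reason(record: dict, now: datetime, is_running=pid_running):
    """Return why the worker should move to STALE, or None if it should not."""
    if record.get("state") not in CHECKED_STATES:
        return None
    age = now - parse_ts(record["last_heartbeat"])
    if age <= STALE_TIMEOUT:
        return None
    pid = record.get("pid")
    if pid is not None and not is_running(pid):
        return "worker crashed"
    if age > DEAD_TIMEOUT:
        return "presumed dead"
    return None


def find_stale_workers(state_dir: str, now=None, is_running=pid_running) -> dict:
    """Scan top-level state files; returns {worker_id: reason}."""
    now = now or datetime.now(timezone.utc)
    found = {}
    for path in sorted(glob.glob(os.path.join(state_dir, "*.json"))):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        reason = stale_reason(record, now, is_running)
        if reason:
            found[record["worker_id"]] = reason
    return found
```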

Directory Structure

.worker-state/
├── worker-auth.json          # Worker state file
├── worker-auth.lock          # Lock file (when held)
├── worker-refactor.json
├── messages/
│   ├── orchestrator.jsonl    # Orchestrator → workers
│   ├── worker-auth.jsonl     # Worker → orchestrator
│   └── worker-refactor.jsonl
└── reviews/
    ├── worker-auth.json      # Review state (review-gate)
    └── worker-refactor.json

Integration Points

With review-gate (skills-byq)

When worker submits PR (WORKING → IN_REVIEW):

Worker writes review request to reviews/{worker-id}.json
review-gate Stop hook checks review state
On approval, orchestrator transitions to APPROVED

With message passing (skills-ms5)

Workers append to their outbox: .worker-state/messages/{worker-id}.jsonl
Orchestrator appends to: .worker-state/messages/orchestrator.jsonl

Message format:

{"ts": "2026-01-10T14:00:00Z", "from": "worker-auth", "type": "done", "task": "skills-abc"}
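
Appending to an outbox in this format could look like the sketch below. `append_message` and `read_messages` are hypothetical helpers. The sketch assumes one writer per outbox file and writes each message as a single line.

```python
import json
import os
from datetime import datetime, timezone


def append_message(state_dir: str, sender: str, msg_type: str, **fields) -> dict:
    """Append one JSON line to {state_dir}/messages/{sender}.jsonl."""
    path = os.path.join(state_dir, "messages", f"{sender}.jsonl")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    message = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "from": sender,
        "type": msg_type,
        **fields,
    }
    line = json.dumps(message) + "\n"
    # Append mode never truncates; the whole line goes out in one write call.
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())
    return message


def read_messages(path: str) -> list:
    """Read every message from a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `append_message(".worker-state", "worker-auth", "done", task="skills-abc")` would produce a line shaped like the one shown above.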

With branch isolation (skills-roq)

Branch naming: {worker-id}/{task-id}
Mandatory rebase before IN_REVIEW → APPROVED
Merge to main only from APPROVED state

Open Questions

Heartbeat mechanism: Should workers write to their state file, a separate heartbeat file, or append to message log?
PR creation: Who creates the PR - worker agent or orchestrator?
Review source: Human review, AI review (review-gate), or both?
Recycle vs terminate: Should COMPLETED workers return to IDLE or be terminated?

