# Worker State Machine Design **Status**: Draft **Bead**: skills-4oj **Epic**: skills-s6y (Multi-agent orchestration: Lego brick architecture) ## Overview This document defines the state machine for background worker agents in the multi-agent coordination system. Workers are AI coding agents that operate on separate git branches, coordinated through local filesystem state. ## Design Principles 1. **File-based, no database** - State in JSON files, messages in append-only JSONL 2. **Minimal states** - Only what's needed, resist over-engineering 3. **Atomic updates** - Write to temp file, fsync, atomic rename 4. **Crash recovery** - Detect stale workers via timestamp + process checks 5. **Cross-agent compatible** - Any agent that reads/writes files can participate ## State Diagram ``` ┌─────────────────────────────────────────────┐ │ │ ▼ │ ┌──────────┐ assign ┌──────────┐ start ┌─────────┐ │ │ IDLE │ ──────────► │ ASSIGNED │ ─────────► │ WORKING │ ◄───┘ └──────────┘ └──────────┘ └────┬────┘ changes ▲ │ requested │ │ │ reset submit_pr │ │ ▼ ┌──────────┐ timeout ┌──────────┐ review ┌─────────────┐ │ FAILED │ ◄────────── │ STALE │ ◄───────── │ IN_REVIEW │ └──────────┘ └──────────┘ └──────┬──────┘ │ │ │ approve │ │ ▼ │ ┌──────────┐ merge ┌─────────────┐ └───────────────────│ COMPLETED│ ◄───────── │ APPROVED │ retry └──────────┘ └─────────────┘ ``` ## States | State | Description | Entry Condition | Exit Condition | |-------|-------------|-----------------|----------------| | **IDLE** | Worker available, no task | Initial / after reset | Task assigned | | **ASSIGNED** | Task claimed, preparing | Orchestrator assigns task | Branch created, work starts | | **WORKING** | Actively coding/testing | `git checkout -b` succeeds | PR submitted or failure | | **IN_REVIEW** | PR open, awaiting review | PR created | Approved, changes requested, or timeout | | **APPROVED** | Review passed, ready to merge | Reviewer approves | Merge succeeds or conflicts | | **COMPLETED** | Work merged to main | Merge succeeds | (terminal) or recycle to IDLE | | **STALE** | No progress detected | Heartbeat timeout | Manual intervention or auto-reset | | **FAILED** | Unrecoverable error | Max retries exceeded | Manual reset or reassign | ## State File Schema Location: `.worker-state/{worker-id}.json` ```json { "worker_id": "worker-auth", "state": "WORKING", "task_id": "skills-abc", "branch": "worker-auth/skills-abc", "assigned_at": "2026-01-10T14:00:00Z", "state_changed_at": "2026-01-10T14:05:00Z", "last_heartbeat": "2026-01-10T14:32:00Z", "pid": 12345, "attempt": 1, "max_attempts": 3, "last_error": null, "pr_url": null, "review_state": null } ``` ### Field Definitions | Field | Type | Description | |-------|------|-------------| | `worker_id` | string | Unique worker identifier (kebab-case) | | `state` | enum | Current state (see States table) | | `task_id` | string | Bead ID of assigned task | | `branch` | string | Git branch name: `{worker_id}/{task_id}` | | `assigned_at` | ISO8601 | When task was assigned | | `state_changed_at` | ISO8601 | Last state transition timestamp | | `last_heartbeat` | ISO8601 | Last activity timestamp | | `pid` | int | Process ID (for crash detection) | | `attempt` | int | Current attempt number (1-indexed) | | `max_attempts` | int | Max retries before FAILED | | `last_error` | string? | Error message if failed | | `pr_url` | string? | Pull request URL when in IN_REVIEW | | `review_state` | enum? | `pending`, `approved`, `changes_requested` | ## Transitions ### Valid Transitions ``` IDLE → ASSIGNED (assign_task) ASSIGNED → WORKING (start_work) ASSIGNED → FAILED (setup_failed) WORKING → IN_REVIEW (submit_pr) WORKING → FAILED (work_failed, max_attempts) WORKING → STALE (heartbeat_timeout) IN_REVIEW → APPROVED (review_approved) IN_REVIEW → WORKING (changes_requested) IN_REVIEW → STALE (review_timeout) APPROVED → COMPLETED (merge_success) APPROVED → WORKING (merge_conflict → rebase required) STALE → WORKING (worker_recovered) STALE → FAILED (timeout_exceeded) FAILED → IDLE (reset/retry) COMPLETED → IDLE (recycle) ``` ### Transition Commands Each transition is triggered by a command or detected condition: | Command | From | To | Action | |---------|------|----|----| | `worker assign ` | IDLE | ASSIGNED | Write state file, record task | | `worker start ` | ASSIGNED | WORKING | Create branch, start heartbeat | | `worker submit ` | WORKING | IN_REVIEW | Push branch, create PR | | `worker approve ` | IN_REVIEW | APPROVED | Record approval | | `worker merge ` | APPROVED | COMPLETED | Merge PR, delete branch | | `worker reset ` | FAILED | IDLE | Clear state, ready for new task | ## File Operations ### Atomic State Updates Never write directly to state file. Use atomic rename pattern: ```bash # Write to temp file echo "$new_state" > .worker-state/${worker_id}.json.tmp # Sync to disk sync .worker-state/${worker_id}.json.tmp # Atomic rename mv .worker-state/${worker_id}.json.tmp .worker-state/${worker_id}.json ``` ### Optimistic Concurrency Include `state_changed_at` as a version marker: 1. Read current state file 2. Compute new state 3. Before write, re-read and verify `state_changed_at` unchanged 4. If changed, retry from step 1 ### Lock File Pattern For operations needing exclusive access: ```bash # Acquire lock (atomic create with O_EXCL) if ! (set -C; echo "$$" > .worker-state/${worker_id}.lock) 2>/dev/null; then # Lock held - check if stale lock_pid=$(cat .worker-state/${worker_id}.lock) if ! kill -0 "$lock_pid" 2>/dev/null; then # Process dead, steal lock rm -f .worker-state/${worker_id}.lock echo "$$" > .worker-state/${worker_id}.lock else exit 1 # Lock held by live process fi fi # ... do work ... # Release lock rm -f .worker-state/${worker_id}.lock ``` ## Stale Worker Detection Workers must update `last_heartbeat` during active work (every 30-60 seconds). Detection algorithm (run by orchestrator or patrol agent): ``` for each worker file in .worker-state/*.json: if state in (ASSIGNED, WORKING, IN_REVIEW): if (now - last_heartbeat) > STALE_TIMEOUT: if pid is set and process not running: transition to STALE (worker crashed) elif (now - last_heartbeat) > DEAD_TIMEOUT: transition to STALE (presumed dead) ``` ### Timeout Values | Timeout | Duration | Description | |---------|----------|-------------| | `STALE_TIMEOUT` | 5 minutes | Mark as stale, eligible for intervention | | `DEAD_TIMEOUT` | 15 minutes | Presume dead, ready for reassignment | | `REVIEW_TIMEOUT` | 1 hour | Review taking too long, alert human | ## Directory Structure ``` .worker-state/ ├── worker-auth.json # Worker state file ├── worker-auth.lock # Lock file (when held) ├── worker-refactor.json ├── messages/ │ ├── orchestrator.jsonl # Orchestrator → workers │ ├── worker-auth.jsonl # Worker → orchestrator │ └── worker-refactor.jsonl └── reviews/ ├── worker-auth.json # Review state (review-gate) └── worker-refactor.json ``` ## Integration Points ### With review-gate (skills-byq) When worker submits PR (WORKING → IN_REVIEW): 1. Worker writes review request to `reviews/{worker-id}.json` 2. review-gate Stop hook checks review state 3. On approval, orchestrator transitions to APPROVED ### With message passing (skills-ms5) Workers append to their outbox: `.worker-state/messages/{worker-id}.jsonl` Orchestrator appends to: `.worker-state/messages/orchestrator.jsonl` Message format: ```json {"ts": "2026-01-10T14:00:00Z", "from": "worker-auth", "type": "done", "task": "skills-abc"} ``` ### With branch isolation (skills-roq) - Branch naming: `{worker-id}/{task-id}` - Mandatory rebase before IN_REVIEW → APPROVED - Merge to main only from APPROVED state ## Open Questions 1. **Heartbeat mechanism**: Should workers write to their state file, a separate heartbeat file, or append to message log? 2. **PR creation**: Who creates the PR - worker agent or orchestrator? 3. **Review source**: Human review, AI review (review-gate), or both? 4. **Recycle vs terminate**: Should COMPLETED workers return to IDLE or be terminated? ## References - Temporal.io: Event-sourced durable execution - LangGraph: Graph-based state machines for agents - Airflow: DAG-based task lifecycle (None → Scheduled → Running → Success) - OpenHands: Event-sourced (Observation, Thought, Action) loop