skills/docs/design/worker-state-machine.md

# Worker State Machine Design

**Status**: Draft
**Bead**: skills-4oj
**Epic**: skills-s6y (Multi-agent orchestration: Lego brick architecture)

## Overview

This document defines the state machine for background worker agents in the multi-agent coordination system. Workers are AI coding agents that operate on separate git branches, coordinated through local filesystem state.

## Design Principles

1. **File-based, no database** - State in JSON files, messages in append-only JSONL
2. **Minimal states** - Only what's needed, resist over-engineering
3. **Atomic updates** - Write to temp file, fsync, atomic rename
4. **Crash recovery** - Detect stale workers via timestamp + process checks
5. **Cross-agent compatible** - Any agent that reads/writes files can participate

## State Diagram

```
                    ┌─────────────────────────────────────────────┐
                    │                                             │
                    ▼                                             │
┌──────────┐   assign    ┌──────────┐   start    ┌─────────┐     │
│   IDLE   │ ──────────► │ ASSIGNED │ ─────────► │ WORKING │ ◄───┘
└──────────┘             └──────────┘            └────┬────┘   changes
     ▲                                                │        requested
     │                                                │
     │ reset                              submit_pr   │
     │                                                ▼
┌──────────┐   timeout   ┌──────────┐   review   ┌─────────────┐
│  FAILED  │ ◄────────── │  STALE   │ ◄───────── │ IN_REVIEW   │
└──────────┘             └──────────┘            └──────┬──────┘
     │                                                  │
     │                                         approve  │
     │                                                  ▼
     │                   ┌──────────┐   merge    ┌─────────────┐
     └───────────────────│ COMPLETED│ ◄───────── │  APPROVED   │
           retry         └──────────┘            └─────────────┘
```

## States

| State | Description | Entry Condition | Exit Condition |
|-------|-------------|-----------------|----------------|
| **IDLE** | Worker available, no task | Initial / after reset | Task assigned |
| **ASSIGNED** | Task claimed, preparing | Orchestrator assigns task | Branch created, work starts |
| **WORKING** | Actively coding/testing | `git checkout -b` succeeds | PR submitted or failure |
| **IN_REVIEW** | PR open, awaiting review | PR created | Approved, changes requested, or timeout |
| **APPROVED** | Review passed, ready to merge | Reviewer approves | Merge succeeds or conflicts |
| **COMPLETED** | Work merged to main | Merge succeeds | (terminal) or recycle to IDLE |
| **STALE** | No progress detected | Heartbeat timeout | Manual intervention or auto-reset |
| **FAILED** | Unrecoverable error | Max retries exceeded | Manual reset or reassign |

## State File Schema

Location: `.worker-state/{worker-id}.json`

```json
{
  "worker_id": "worker-auth",
  "state": "WORKING",
  "task_id": "skills-abc",
  "branch": "worker-auth/skills-abc",
  "assigned_at": "2026-01-10T14:00:00Z",
  "state_changed_at": "2026-01-10T14:05:00Z",
  "last_heartbeat": "2026-01-10T14:32:00Z",
  "pid": 12345,
  "attempt": 1,
  "max_attempts": 3,
  "last_error": null,
  "pr_url": null,
  "review_state": null
}
```

### Field Definitions

| Field | Type | Description |
|-------|------|-------------|
| `worker_id` | string | Unique worker identifier (kebab-case) |
| `state` | enum | Current state (see States table) |
| `task_id` | string | Bead ID of assigned task |
| `branch` | string | Git branch name: `{worker_id}/{task_id}` |
| `assigned_at` | ISO8601 | When task was assigned |
| `state_changed_at` | ISO8601 | Last state transition timestamp |
| `last_heartbeat` | ISO8601 | Last activity timestamp |
| `pid` | int | Process ID (for crash detection) |
| `attempt` | int | Current attempt number (1-indexed) |
| `max_attempts` | int | Max retries before FAILED |
| `last_error` | string? | Error message if failed |
| `pr_url` | string? | Pull request URL when in IN_REVIEW |
| `review_state` | enum? | `pending`, `approved`, `changes_requested` |

## Transitions

### Valid Transitions

```
IDLE         → ASSIGNED       (assign_task)
ASSIGNED     → WORKING        (start_work)
ASSIGNED     → FAILED         (setup_failed)
WORKING      → IN_REVIEW      (submit_pr)
WORKING      → FAILED         (work_failed, max_attempts)
WORKING      → STALE          (heartbeat_timeout)
IN_REVIEW    → APPROVED       (review_approved)
IN_REVIEW    → WORKING        (changes_requested)
IN_REVIEW    → STALE          (review_timeout)
APPROVED     → COMPLETED      (merge_success)
APPROVED     → WORKING        (merge_conflict → rebase required)
STALE        → WORKING        (worker_recovered)
STALE        → FAILED         (timeout_exceeded)
FAILED       → IDLE           (reset/retry)
COMPLETED    → IDLE           (recycle)
```

### Transition Commands

Each transition is triggered by a command or detected condition:

| Command | From | To | Action |
|---------|------|----|----|
| `worker assign <id> <task>` | IDLE | ASSIGNED | Write state file, record task |
| `worker start <id>` | ASSIGNED | WORKING | Create branch, start heartbeat |
| `worker submit <id>` | WORKING | IN_REVIEW | Push branch, create PR |
| `worker approve <id>` | IN_REVIEW | APPROVED | Record approval |
| `worker merge <id>` | APPROVED | COMPLETED | Merge PR, delete branch |
| `worker reset <id>` | FAILED | IDLE | Clear state, ready for new task |

## File Operations

### Atomic State Updates

Never write directly to state file. Use atomic rename pattern:

```bash
# Write to temp file
echo "$new_state" > .worker-state/${worker_id}.json.tmp

# Sync to disk
sync .worker-state/${worker_id}.json.tmp

# Atomic rename
mv .worker-state/${worker_id}.json.tmp .worker-state/${worker_id}.json
```

### Optimistic Concurrency

Include `state_changed_at` as a version marker:

1. Read current state file
2. Compute new state
3. Before write, re-read and verify `state_changed_at` unchanged
4. If changed, retry from step 1

### Lock File Pattern

For operations needing exclusive access:

```bash
# Acquire lock (atomic create with O_EXCL)
if ! (set -C; echo "$$" > .worker-state/${worker_id}.lock) 2>/dev/null; then
    # Lock held - check if stale
    lock_pid=$(cat .worker-state/${worker_id}.lock)
    if ! kill -0 "$lock_pid" 2>/dev/null; then
        # Process dead, steal lock
        rm -f .worker-state/${worker_id}.lock
        echo "$$" > .worker-state/${worker_id}.lock
    else
        exit 1  # Lock held by live process
    fi
fi

# ... do work ...

# Release lock
rm -f .worker-state/${worker_id}.lock
```

## Stale Worker Detection

Workers must update `last_heartbeat` during active work (every 30-60 seconds).

Detection algorithm (run by orchestrator or patrol agent):

```
for each worker file in .worker-state/*.json:
    if state in (ASSIGNED, WORKING, IN_REVIEW):
        if (now - last_heartbeat) > STALE_TIMEOUT:
            if pid is set and process not running:
                transition to STALE (worker crashed)
            elif (now - last_heartbeat) > DEAD_TIMEOUT:
                transition to STALE (presumed dead)
```

### Timeout Values

| Timeout | Duration | Description |
|---------|----------|-------------|
| `STALE_TIMEOUT` | 5 minutes | Mark as stale, eligible for intervention |
| `DEAD_TIMEOUT` | 15 minutes | Presume dead, ready for reassignment |
| `REVIEW_TIMEOUT` | 1 hour | Review taking too long, alert human |

## Directory Structure

```
.worker-state/
├── worker-auth.json          # Worker state file
├── worker-auth.lock          # Lock file (when held)
├── worker-refactor.json
├── messages/
│   ├── orchestrator.jsonl    # Orchestrator → workers
│   ├── worker-auth.jsonl     # Worker → orchestrator
│   └── worker-refactor.jsonl
└── reviews/
    ├── worker-auth.json      # Review state (review-gate)
    └── worker-refactor.json
```

## Integration Points

### With review-gate (skills-byq)

When worker submits PR (WORKING → IN_REVIEW):
1. Worker writes review request to `reviews/{worker-id}.json`
2. review-gate Stop hook checks review state
3. On approval, orchestrator transitions to APPROVED

### With message passing (skills-ms5)

Workers append to their outbox: `.worker-state/messages/{worker-id}.jsonl`
Orchestrator appends to: `.worker-state/messages/orchestrator.jsonl`

Message format:
```json
{"ts": "2026-01-10T14:00:00Z", "from": "worker-auth", "type": "done", "task": "skills-abc"}
```

### With branch isolation (skills-roq)

- Branch naming: `{worker-id}/{task-id}`
- Mandatory rebase before IN_REVIEW → APPROVED
- Merge to main only from APPROVED state

## Open Questions

1. **Heartbeat mechanism**: Should workers write to their state file, a separate heartbeat file, or append to message log?
2. **PR creation**: Who creates the PR - worker agent or orchestrator?
3. **Review source**: Human review, AI review (review-gate), or both?
4. **Recycle vs terminate**: Should COMPLETED workers return to IDLE or be terminated?

## References

- Temporal.io: Event-sourced durable execution
- LangGraph: Graph-based state machines for agents
- Airflow: DAG-based task lifecycle (None → Scheduled → Running → Success)
- OpenHands: Event-sourced (Observation, Thought, Action) loop