skills/worker-orchestration-aar-2026-01-13.md
dan 461c5ac148 fix(worker): improve spawn reliability and add noFetch flag
- Change default base branch from origin/integration to main
- Add --noFetch flag to skip git fetch (for offline/sandbox use)
- Add try/except with rollback on spawn failure
- Improve error message for missing review-gate
- Add Codex auth.json symlink to use-skills.sh
- Include worker orchestration AAR from 2026-01-13

Addresses pain points from worker-orchestration-aar-2026-01-13.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-15 09:29:36 -08:00

Worker Orchestration AAR (2026-01-13)

Goal

Parallelize three bd issues using worker spawn and background worker agents.

Scope

  • Issues: talu-gsga, talu-jid5, talu-w8oq
  • Base branch: main
  • Worker model: sonnet-4.5

Environment

  • Repo: /home/dan/proj/talu
  • Network: restricted sandbox (fetch requires escalation)
  • Tools: worker, bd, git

What Happened

  1. Ran worker spawn with positional args; the command failed because spawn does not accept non-option arguments.
  2. Retried with -t and -d flags; worker attempted git fetch and failed due to network restriction.
  3. Retried with -f main; fetch still attempted and failed.
  4. Partial worktrees and branches were created without worker registry entries.
  5. Manually removed worktrees and deleted branches.
  6. Re-ran worker spawn with network escalation; workers created successfully.
  7. review-gate was not found, so review integration was disabled.
  8. Rendered worker prompts and launched background workers.
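
Steps 4 and 5 amount to finding worktrees that git knows about but the worker registry does not. The following is a minimal sketch of that check, not the worker tool's own logic. It assumes the registry can be read as a set of task IDs and that each worker branch is named after its task ID; the AAR confirms neither. `parse_worktrees` and `find_orphans` are hypothetical names. The input is the text printed by `git worktree list --porcelain`.

```python
from typing import Dict, Iterable, List, Tuple


def parse_worktrees(porcelain: str) -> List[Dict[str, str]]:
    """Split `git worktree list --porcelain` output into one dict per worktree."""
    entries: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in porcelain.splitlines():
        if not line.strip():
            # A blank line ends the current worktree record.
            if current:
                entries.append(current)
                current = {}
            continue
        key, _, value = line.partition(" ")
        current[key] = value
    if current:
        entries.append(current)
    return entries


def find_orphans(porcelain: str, registered_ids: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (path, branch) for linked worktrees with no registry entry.

    Assumption: branch name == task ID. Adjust if the real naming differs.
    """
    known = set(registered_ids)
    orphans: List[Tuple[str, str]] = []
    # git lists the main worktree first; only linked worktrees are candidates.
    for entry in parse_worktrees(porcelain)[1:]:
        ref = entry.get("branch", "")
        if not ref.startswith("refs/heads/"):
            continue  # detached HEAD: no branch to clean up
        branch = ref[len("refs/heads/"):]
        if branch not in known:
            orphans.append((entry["worktree"], branch))
    return orphans
```

Each returned pair is only a candidate for review. After confirming it is a leftover, it could be removed with, for example, `git worktree remove` and `git branch -D`.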

What Went Well

  • Once network access was escalated, worker spawn created worktrees and branches reliably.
  • Prompt rendering and background worker launch were straightforward.

Pain Points

  • worker spawn always attempts git fetch, even when --fromBranch is local.
  • Default base branch is origin/integration, which is not present in this repo.
  • Spawn failures left behind branches and worktrees without worker registry state.
  • Missing review-gate produces warnings without guidance on setup.
  • Network access requirements are easy to miss during first-time use.
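
The first two pain points come down to two decisions made before any worktree is created: whether a fetch is needed at all, and which base branch to use. The following is a rough sketch of both decisions as pure functions, under the assumption that a base counts as remote only when its first path component is a configured remote name. `needs_fetch` and `pick_base` are hypothetical helpers, not part of the worker CLI.

```python
from typing import Callable, Iterable, Optional


def needs_fetch(base: str, remotes: Iterable[str]) -> bool:
    """True when `base` looks like a remote-tracking ref such as origin/main.

    Assumption: "<remote>/<branch>" means remote. A local branch that happens
    to be named "origin/foo" would be misclassified by this heuristic.
    """
    head, sep, rest = base.partition("/")
    return bool(sep and rest) and head in set(remotes)


def pick_base(candidates: Iterable[str], exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate ref that exists locally, or None."""
    for name in candidates:
        if exists(name):
            return name
    return None
```

In practice `exists` could wrap `git rev-parse --verify --quiet <name>` and check its exit status. It is passed in as a callable here so the selection logic can be exercised without a repository.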

Impact

  • Time lost to retries and cleanup before workers could start.
  • Failure modes were non-obvious, and recovery required manual steps.

Observed Errors

  • spawn does not expect non-option arguments at "talu-gsga"
  • fatal: not a valid object name: 'origin/integration'
  • ssh: connect to host 192.168.1.108 port 2222: failure
  • WARN: enableReview: failed for <id>: review-gate not found
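
The four messages above fall into distinct categories (usage, missing branch, network, missing review-gate), and that distinction is what a clearer error report would need to surface. The following is a small sketch of such a classifier, keyed only on the strings observed in this session. The category labels and the `classify` function are illustrative assumptions, and other environments may produce different wording.

```python
import re

# Patterns are taken from the messages observed in this AAR; labels are made up here.
_PATTERNS = [
    ("usage", re.compile(r"does not expect non-option arguments")),
    ("branch-not-found", re.compile(r"not a valid object name")),
    ("network", re.compile(r"^ssh: connect to host ")),
    ("review-gate-missing", re.compile(r"review-gate not found")),
]


def classify(message: str) -> str:
    """Map one error line to a coarse category, or "unknown" if nothing matches."""
    text = message.strip()
    for label, pattern in _PATTERNS:
        if pattern.search(text):
            return label
    return "unknown"
```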

Recommendations

  1. Allow worker spawn to skip git fetch when the base branch is local.
  2. Make the default base branch configurable or auto-detect a local main branch.
  3. Roll back branch/worktree on spawn failure to avoid manual cleanup.
  4. Improve error messaging to distinguish network failures from branch-not-found errors.
  5. Provide setup guidance when review-gate is missing.
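
Recommendations 1 and 3 can be illustrated together: make the fetch optional, and undo the branch and worktree if a later step fails. The sketch below shows one way to structure this and is not the worker tool's implementation. `spawn_worktree`, its parameters, and the `register` callback (standing in for the worker registry write) are all hypothetical. The git subcommands themselves (`fetch`, `worktree add -b`, `worktree remove --force`, `branch -D`) are standard.

```python
import subprocess
from typing import Callable, Optional, Sequence

Runner = Callable[[Sequence[str]], None]


def git(args: Sequence[str]) -> None:
    """Run a git subcommand and raise CalledProcessError on failure."""
    subprocess.run(["git", *args], check=True)


def spawn_worktree(
    task_id: str,
    path: str,
    base: str = "main",
    fetch: bool = False,
    run: Runner = git,
    register: Optional[Callable[[str], None]] = None,  # hypothetical registry hook
) -> None:
    if fetch:
        # Nothing has been created yet, so a fetch failure needs no rollback.
        run(["fetch", "origin"])
    try:
        run(["worktree", "add", "-b", task_id, path, base])
        if register is not None:
            register(task_id)
    except Exception:
        # Best-effort rollback so no branch or worktree outlives a failed spawn.
        for cleanup in (["worktree", "remove", "--force", path], ["branch", "-D", task_id]):
            try:
                run(cleanup)
            except Exception:
                pass  # the item may never have been created
        raise
```

Passing the runner in as a parameter keeps the rollback order testable without touching a real repository. The original exception is re-raised so the caller still sees why the spawn failed.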

Questions for Worker Team

  • Can worker spawn be configured to avoid network fetches?
  • Is there a way to set a global default base branch (e.g., main)?