Multi-Agent Footguns, Patterns, and Emerging Ideas

Status: Research synthesis
Date: 2026-01-10
Sources: HN discussions, Reddit, practitioner blogs, orch consensus

Footguns: Lessons Learned the Hard Way

Git & Branch Chaos

| Footgun | Description | Mitigation |
|---|---|---|
| Force-resolve conflicts | Agents rebase improperly, rewrite history, break CI | No direct git access for agents; orchestrator owns git operations |
| Stale branches | Agent works on outdated branch for hours | Frequent auto-rebase; version check before major edits |
| Recovery nightmare | Broken git state is hard to recover | Git bundles for checkpoints (SkillFS pattern); worktree isolation |
| Branch naming confusion | worker-id/task-id becomes misleading on reassignment | Use type/task-id; worker identity in commit author |

State & Database Issues

| Footgun | Description | Mitigation |
|---|---|---|
| Shared DB pollution | Agents debugging against mutated state, heisenbugs | Ephemeral namespaced DBs per branch; schema prefixes |
| Port conflicts | Multiple web servers on same port | Auto-increment ports; orchestrator manages allocation |
| Service duplication | 10 agents need 10 PostgreSQL/Redis instances | Container-per-worktree; or accept serialization |
| Feature flag races | Agents toggle flags in parallel | Namespace flags per agent/branch |

Coordination Failures

| Footgun | Description | Mitigation |
|---|---|---|
| State divergence | Each agent has different snapshot of reality | Single source of truth artifact; frequent rebase |
| Silent duplication | Two agents "fix" same bug differently | Central task ledger with explicit states; idempotent task IDs |
| Dependency deadlocks | A waits on B waits on A | Event-driven async; bounded time limits; no sync waits |
| Role collapse | Planner writes code; tester refactors | Narrow role boundaries; tool-level constraints |
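
The "central task ledger with idempotent task IDs" mitigation can be sketched as follows. This is a minimal illustration, not the project's actual schema: the `tasks` table, the `task_id` derivation from a repo name and issue key, and the `ASSIGNED` default state are all assumptions made for the example.

```python
import hashlib
import sqlite3


def task_id(repo: str, issue: str) -> str:
    """Derive a deterministic ID so the same work item always maps to one row."""
    key = f"{repo.strip().lower()}:{issue.strip().lower()}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Open the ledger and create the (hypothetical) tasks table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id TEXT PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'ASSIGNED')"
    )
    return conn


def register_task(conn: sqlite3.Connection, repo: str, issue: str, title: str) -> bool:
    """Insert the task once; return False if it was already in the ledger."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO tasks (id, title) VALUES (?, ?)",
            (task_id(repo, issue), title),
        )
    return cur.rowcount == 1
```

Because the ID is derived from the work item rather than generated per request, a second agent that tries to register the same bug gets `False` back and can skip it instead of producing a competing fix.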

Human Bottlenecks

| Footgun | Description | Mitigation |
|---|---|---|
| Review overload | 10 agents = 10 partial PRs to reconcile | Review funnel: worker → arbiter agent → single synthesized PR |
| Context switching | Human juggling parallel agent outputs | Size limits per PR; "one story per PR" |
| Morale drain | Endless nit-picking, people disable agents | Pre-review by lint/style agents; humans see substantive deltas only |

Agent-Specific Issues

| Footgun | Description | Mitigation |
|---|---|---|
| Hallucinated packages | 30% of suggested packages don't exist | Validate imports against known registries |
| Temporary fixes | Works in session, breaks in Docker | Require full env rebuild as acceptance test |
| Skill atrophy | Developers can't code without AI | Deliberate practice; understand what AI generates |
| Test/impl conspiracy | Brittle tests + brittle code pass together | Separate spec tests from impl tests; mutation testing |

Resource & Cost Issues

| Footgun | Description | Mitigation |
|---|---|---|
| Token blowups | Parallel agents saturate context/API limits | Hard budgets per agent; limit context sizes |
| Credit drain | AI fixing its own mistakes in loops | Circuit breakers; attempt limits |
| Timeout misreads | Rate limits interpreted as semantic failures | Structured error channels; retry with idempotency |
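
One way to combine "hard budgets per agent" with "attempt limits" is a small guard object that every model call must pass through. The sketch below is illustrative only; `AgentBudget`, `BudgetExceeded`, and the specific limits are hypothetical names and values, not part of the existing design.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when an agent has used up its token or attempt allowance."""


@dataclass
class AgentBudget:
    max_tokens: int
    max_attempts: int
    tokens_used: int = 0
    attempts: int = 0

    def charge(self, tokens: int) -> None:
        """Record one attempt and its token cost, or refuse if over budget."""
        if self.attempts + 1 > self.max_attempts:
            raise BudgetExceeded("attempt limit reached")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        # Only mutate state once both checks have passed.
        self.attempts += 1
        self.tokens_used += tokens
```

An orchestrator would call `charge()` before each model request and treat `BudgetExceeded` as a signal to stop the loop and escalate, which acts as a simple circuit breaker for self-repair loops.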

Emerging Patterns (2026)

The "Rule of 4"

Research suggests effective team sizes are limited to about 3-4 agents. Beyond that, communication overhead grows super-linearly (reported exponent 1.724), so the cost of coordination outpaces the value each extra agent adds.

Implication: Don't build 10-agent swarms. Build 3-4 specialized agents with clear boundaries.
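
To get a feel for what a super-linear exponent means, the sketch below assumes overhead scales as `n ** 1.724` in the number of agents `n`. That functional form is an assumption made for illustration; the source gives only the exponent.

```python
EXPONENT = 1.724  # exponent quoted above; the n ** EXPONENT model is an assumption


def relative_overhead(agents: int) -> float:
    """Total communication overhead relative to a single agent."""
    return agents ** EXPONENT


def overhead_per_agent(agents: int) -> float:
    """Overhead each agent carries; grows with team size when EXPONENT > 1."""
    return relative_overhead(agents) / agents


for n in (2, 4, 10):
    print(n, round(relative_overhead(n), 1), round(overhead_per_agent(n), 2))
```

Under this model the per-agent overhead keeps rising as the team grows, which is the intuition behind capping teams at a handful of agents.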

Spec-Driven Development

Adopted by Kiro, Tessl, GitHub Spec Kit:

  • requirements.md - what to build
  • design.md - how to build it
  • tasks.md - decomposed work items

Agents work from specs, not vague prompts. Specs are versioned; agents echo which version they used.
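
A minimal sketch of "agents echo which version they used" follows. The `TaskSpec` and `AgentResult` types and their fields are hypothetical; they only illustrate how an orchestrator could detect work done against an outdated spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    version: int
    requirements: str       # contents of requirements.md
    design: str             # contents of design.md
    tasks: tuple[str, ...]  # decomposed items from tasks.md


@dataclass(frozen=True)
class AgentResult:
    task_id: str
    spec_version: int  # echoed back by the agent
    summary: str


def is_stale(spec: TaskSpec, result: AgentResult) -> bool:
    """True when the result was produced against an older spec version."""
    return result.task_id == spec.task_id and result.spec_version < spec.version
```

A stale result would then be rejected or re-run rather than reviewed, so reviewers never evaluate work against requirements that have since changed.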

Layered Coordination (Not Monolithic)

Instead of one complex orchestrator, compose independent layers:

  1. Configuration management
  2. Issue tracking (JSONL, merge-friendly)
  3. Atomic locking (PostgreSQL advisory locks)
  4. Filesystem isolation (git worktrees)
  5. Validation gates
  6. Enforcement rules
  7. Session protocols

Each layer independently useful; failures isolated.

PostgreSQL Advisory Locks for Claims

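Novel insight: Session-level advisory locks are released automatically when the connection closes, including on a crash, so no orphaned locks are left behind. They take roughly 1 ms to acquire and involve no table writes, which makes them an elegant fit for distributed claim races.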

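-- Try to claim the task; returns true if acquired, false if another session holds it.
-- task_id_hash stands for a bigint key derived from the task ID.
SELECT pg_try_advisory_lock(task_id_hash);
-- Work...
SELECT pg_advisory_unlock(task_id_hash);
-- Or: connection dies → auto-released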

Git Bundles for Checkpoints (SkillFS)

Every agent sandbox is a git repo. Session ends → git bundle stored. New session → restore from bundle, continue where left off. Complete audit trail via git log.

Hierarchical Over Flat Swarms

Instead of 100-agent flat swarms:

  • Nested coordination structures
  • Partition the communication graph
  • Supervisor per sub-team
  • Only supervisors talk to each other
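
The benefit of partitioning the communication graph can be shown with simple counting. The sketch below assumes evenly sized teams, full communication inside each team, and a full mesh among supervisors; these are simplifying assumptions, not a description of any specific system.

```python
def flat_links(agents: int) -> int:
    """Pairwise channels when every agent may talk to every other agent."""
    return agents * (agents - 1) // 2


def hierarchical_links(agents: int, team_size: int) -> int:
    """Channels when agents form teams of `team_size` (supervisor included),
    talk freely inside a team, and only supervisors talk across teams."""
    if agents % team_size != 0:
        raise ValueError("sketch assumes agents divide evenly into teams")
    teams = agents // team_size
    return teams * flat_links(team_size) + flat_links(teams)


print(flat_links(100))             # 4950
print(hierarchical_links(100, 5))  # 390
```

With 100 agents, grouping into 20 teams of 5 cuts the number of possible channels by more than a factor of ten under these assumptions.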

Plan-and-Execute Cost Pattern

An expensive model creates the strategy; cheap models execute the steps. This can reduce costs by as much as 90%.

Orchestrator (Claude Opus) → Plan
Workers (Claude Haiku) → Execute steps
Reviewer (Claude Sonnet) → Validate

Bounded Autonomy Spectrum

Progressive autonomy based on risk:

  1. Human in the loop - approve each action
  2. Human on the loop - monitor, intervene if needed
  3. Human out of the loop - fully autonomous

Match to task complexity and outcome criticality.
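
As a rough sketch, the mapping from risk to autonomy level could be a small policy function. The 1-5 scoring scale and the thresholds below are hypothetical and would need tuning for a real project.

```python
from enum import Enum


class Autonomy(Enum):
    HUMAN_IN_THE_LOOP = 1      # approve each action
    HUMAN_ON_THE_LOOP = 2      # monitor, intervene if needed
    HUMAN_OUT_OF_THE_LOOP = 3  # fully autonomous


def autonomy_for(complexity: int, criticality: int) -> Autonomy:
    """Pick a level from two 1-5 scores (hypothetical scale and thresholds)."""
    risk = max(complexity, criticality)  # the riskier dimension decides
    if risk >= 4:
        return Autonomy.HUMAN_IN_THE_LOOP
    if risk >= 2:
        return Autonomy.HUMAN_ON_THE_LOOP
    return Autonomy.HUMAN_OUT_OF_THE_LOOP
```

Taking the maximum of the two scores means a trivial change to a critical system still requires approval.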

Best Practices Synthesis

From HN Discussions

  1. Well-scoped tasks with tight contracts - Not vague prompts
  2. Automated testing gates - Agents must pass before review
  3. 2-3 agents realistic - Not 10 parallel
  4. Exclusive ownership per module - One writer per concern
  5. Short-lived branches - Frequent merge to prevent drift

From orch Consensus

  1. Treat agents as untrusted workers - Not peers with full access
  2. Machine-readable contracts - JSON schema between roles
  3. Per-agent logs with correlation IDs - Distributed systems observability
  4. Guardrail agents - Security/policy checks on every diff
  5. Versioned task specs - Bump version → re-run affected agents

From Practitioner Blogs

  1. Coordination ≠ isolation - Advisory locks (who works on what) + worktrees (how they work)
  2. JSONL for issues - One per line, deterministic merge rules
  3. Session protocols - Explicit start/close procedures
  4. Modular rules with includes - Template configuration
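
The "JSONL for issues" idea with deterministic merge rules can be sketched as below. The rule used here (union by `id`, newest `updated_at` wins, ties broken by serialized content) and the field names are assumptions; timestamps are assumed to be ISO 8601 strings in a single timezone so that string comparison matches time order.

```python
import json


def merge_issues(ours: str, theirs: str) -> str:
    """Merge two JSONL issue files: union by id, newest `updated_at` wins."""
    merged = {}
    for text in (ours, theirs):
        for line in text.splitlines():
            if not line.strip():
                continue  # tolerate blank lines
            issue = json.loads(line)
            # Rank by timestamp, then by canonical JSON so ties resolve the
            # same way regardless of which side is "ours".
            rank = (issue["updated_at"], json.dumps(issue, sort_keys=True))
            current = merged.get(issue["id"])
            if current is None or rank > current[0]:
                merged[issue["id"]] = (rank, issue)
    return "".join(
        json.dumps(merged[key][1], sort_keys=True) + "\n" for key in sorted(merged)
    )
```

Because the output is sorted by `id` and the winner does not depend on argument order, two branches that merge the same pair of files produce byte-identical results.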

How This Applies to Our Design

Already Covered

| Pattern | Our Design |
|---|---|
| SQLite for coordination | bus.db with transactions |
| Git worktrees | branch-per-worker.md |
| State machine | worker-state-machine.md |
| Heartbeats/liveness | 10s interval in message-passing |
| Claim-check pattern | SQLite transactions |
| Task serialization | No uncommitted dependencies |
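
A claim implemented with SQLite transactions could look like the sketch below. The `tasks` table, its columns, and the `ASSIGNED`/`WORKING` state names are assumptions for the example and may not match the actual bus.db schema.

```python
import sqlite3


def claim_task(conn: sqlite3.Connection, task_id: str, worker: str) -> bool:
    """Atomically move a task from ASSIGNED to WORKING for exactly one worker."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE tasks SET state = 'WORKING', owner = ?"
            " WHERE id = ? AND state = 'ASSIGNED'",
            (worker, task_id),
        )
    # The WHERE guard means only the first claimant changes a row.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, state TEXT NOT NULL, owner TEXT)")
conn.execute("INSERT INTO tasks (id, state) VALUES ('t-1', 'ASSIGNED')")
conn.commit()
print(claim_task(conn, "t-1", "worker-a"))  # True
print(claim_task(conn, "t-1", "worker-b"))  # False
```

Unlike a PostgreSQL advisory lock, a claim recorded this way is not released when the worker crashes, so it needs to be paired with the heartbeat check to reclaim tasks from dead workers.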

Should Add

| Pattern | Gap | Action |
|---|---|---|
| Spec-driven tasks | Tasks are just titles | Add structured task specs (requirements, design, acceptance) |
| Role boundaries | Not enforced | Add tool-level constraints per agent type |
| Review funnel | Missing arbiter | Add synthesis step before human review |
| Versioned specs | Not tracked | Add version field to task assignments |
| Cost budgets | Not implemented | Add token/time budgets per agent |
| Correlation IDs | Partial (correlation_id) | Ensure end-to-end tracing |

Validate Our Decisions

| Decision | Validation |
|---|---|
| SQLite over JSONL | Confirmed - JSONL for issues only, SQLite for coordination |
| Orchestrator creates branches | Confirmed - reduces agent setup, enforces policy |
| 3-4 agents max | Aligns with "Rule of 4" research |
| Mandatory rebase | Confirmed - prevents stale branch drift |
| Escalate semantic conflicts | Confirmed - agents hallucinate resolutions |

Open Questions Surfaced

  1. PostgreSQL advisory locks vs SQLite? - Do we need Postgres, or is SQLite sufficient?
  2. Git bundles for checkpoints? - Should we adopt SkillFS pattern?
  3. Spec files per task? - How structured should task specs be?
  4. Arbiter/synthesis agent? - Add to architecture before human review?
  5. Token budgets? - How to enforce across different agent types?
