# Multi-Agent Footguns, Patterns, and Emerging Ideas
- **Status**: Research synthesis
- **Date**: 2026-01-10
- **Sources**: HN discussions, Reddit, practitioner blogs, orch consensus

## Footguns: Lessons Learned the Hard Way
### Git & Branch Chaos
| Footgun | Description | Mitigation |
|---------|-------------|------------|
| **Force-resolve conflicts** | Agents rebase improperly, rewrite history, break CI | No direct git access for agents; orchestrator owns git operations |
| **Stale branches** | Agent works on outdated branch for hours | Frequent auto-rebase; version check before major edits |
| **Recovery nightmare** | Broken git state is hard to recover | Git bundles for checkpoints (SkillFS pattern); worktree isolation |
| **Branch naming confusion** | `worker-id/task-id` becomes misleading on reassignment | Use `type/task-id`; worker identity in commit author |
### State & Database Issues
| Footgun | Description | Mitigation |
|---------|-------------|------------|
| **Shared DB pollution** | Agents debug against state mutated by other agents, producing heisenbugs | Ephemeral namespaced DBs per branch; schema prefixes |
| **Port conflicts** | Multiple web servers on same port | Auto-increment ports; orchestrator manages allocation |
| **Service duplication** | 10 agents need 10 PostgreSQL/Redis instances | Container-per-worktree; or accept serialization |
| **Feature flag races** | Agents toggle flags in parallel | Namespace flags per agent/branch |
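
The "orchestrator manages allocation" mitigation for port conflicts can be sketched as a small in-memory allocator. This is a minimal illustration, not part of the design: the class name `PortAllocator` and the port range 3000-3099 are assumptions, and the sketch does not check whether the OS already has a port bound.

```python
class PortAllocator:
    """Hands out one port per worker from a fixed range (hypothetical range)."""

    def __init__(self, first: int = 3000, last: int = 3099):
        # Stored in descending order so pop() returns the lowest free port.
        self._free = list(range(last, first - 1, -1))
        self._by_worker: dict[str, int] = {}

    def allocate(self, worker_id: str) -> int:
        # Idempotent: asking twice for the same worker returns the same port.
        if worker_id in self._by_worker:
            return self._by_worker[worker_id]
        if not self._free:
            raise RuntimeError("no free ports in range")
        port = self._free.pop()
        self._by_worker[worker_id] = port
        return port

    def release(self, worker_id: str) -> None:
        # Releasing an unknown worker is a no-op.
        port = self._by_worker.pop(worker_id, None)
        if port is not None:
            self._free.append(port)
```

Keeping allocation in the orchestrator means workers never guess at ports; they receive one along with their task assignment.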
### Coordination Failures
| Footgun | Description | Mitigation |
|---------|-------------|------------|
| **State divergence** | Each agent has different snapshot of reality | Single source of truth artifact; frequent rebase |
| **Silent duplication** | Two agents "fix" same bug differently | Central task ledger with explicit states; idempotent task IDs |
| **Dependency deadlocks** | A waits on B waits on A | Event-driven async; bounded time limits; no sync waits |
| **Role collapse** | Planner writes code; tester refactors | Narrow role boundaries; tool-level constraints |
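
A central task ledger with explicit states and idempotent task IDs might look like the following sketch, using Python's standard `sqlite3` module. The table layout, the state names `OPEN` and `CLAIMED`, and the function names are illustrative assumptions, not the project's actual schema.

```python
import sqlite3


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode,
    # so each statement below is its own atomic transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " task_id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL DEFAULT 'OPEN',"
        " owner TEXT)"
    )
    return conn


def add_task(conn: sqlite3.Connection, task_id: str) -> bool:
    # Idempotent: re-submitting an existing task ID changes nothing.
    cur = conn.execute(
        "INSERT OR IGNORE INTO tasks (task_id) VALUES (?)", (task_id,)
    )
    return cur.rowcount == 1


def claim_task(conn: sqlite3.Connection, task_id: str, worker: str) -> bool:
    # The WHERE clause makes the claim a compare-and-set:
    # only one worker can move the task out of OPEN.
    cur = conn.execute(
        "UPDATE tasks SET state = 'CLAIMED', owner = ?"
        " WHERE task_id = ? AND state = 'OPEN'",
        (worker, task_id),
    )
    return cur.rowcount == 1
```

A second agent that tries to "fix" the same bug gets `False` back from `claim_task` and can move on, rather than silently duplicating the work.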
### Human Bottlenecks
| Footgun | Description | Mitigation |
|---------|-------------|------------|
| **Review overload** | 10 agents = 10 partial PRs to reconcile | Review funnel: worker → arbiter agent → single synthesized PR |
| **Context switching** | Human juggling parallel agent outputs | Size limits per PR; "one story per PR" |
| **Morale drain** | Endless nit-picking wears reviewers down until they disable agents | Pre-review by lint/style agents; humans see substantive deltas only |
### Agent-Specific Issues
| Footgun | Description | Mitigation |
|---------|-------------|------------|
| **Hallucinated packages** | A reported ~30% of suggested packages don't exist | Validate imports against known registries |
| **Temporary fixes** | Works in session, breaks in Docker | Require full env rebuild as acceptance test |
| **Skill atrophy** | Developers can't code without AI | Deliberate practice; understand what AI generates |
| **Test/impl conspiracy** | Brittle tests + brittle code pass together | Separate spec tests from impl tests; mutation testing |
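
One way to approach "validate imports against known registries" for Python code is to extract top-level import names and compare them with an allowlist. The sketch below assumes the allowlist (`known_packages`) comes from a lockfile or registry lookup done elsewhere; that lookup, and the function names, are hypothetical.

```python
import ast
import sys


def imported_roots(source: str) -> set[str]:
    """Return the top-level module names imported by the given source."""
    roots: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                roots.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they refer to local code.
            if node.level == 0 and node.module:
                roots.add(node.module.split(".")[0])
    return roots


def unknown_imports(source: str, known_packages: set[str]) -> set[str]:
    """Imports that are neither allowlisted nor part of the standard library."""
    # sys.stdlib_module_names exists on Python 3.10+; fall back to empty.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return imported_roots(source) - set(known_packages) - set(stdlib)
```

Note that import names and distribution names can differ (for example `yaml` is installed as `PyYAML`), so a real check needs a mapping between the two.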
### Resource & Cost Issues
| Footgun | Description | Mitigation |
|---------|-------------|------------|
| **Token blowups** | Parallel agents saturate context/API limits | Hard budgets per agent; limit context sizes |
| **Credit drain** | AI fixing its own mistakes in loops | Circuit breakers; attempt limits |
| **Timeout misreads** | Rate limits interpreted as semantic failures | Structured error channels; retry with idempotency |
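
The mitigations above (hard budgets, attempt limits, structured error channels) can be combined in a small decision function. This is a sketch under assumptions: the names `ErrorKind`, `AttemptBudget`, and `next_action`, and the three outcomes `retry`, `fail`, and `escalate`, are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"  # rate limit or timeout: retrying is reasonable
    SEMANTIC = "semantic"    # the task itself failed: retrying will not help


@dataclass
class AttemptBudget:
    max_attempts: int
    max_tokens: int
    attempts: int = 0
    tokens: int = 0

    def record(self, tokens_used: int) -> None:
        self.attempts += 1
        self.tokens += tokens_used

    def exhausted(self) -> bool:
        return self.attempts >= self.max_attempts or self.tokens >= self.max_tokens


def next_action(kind: ErrorKind, budget: AttemptBudget) -> str:
    # Circuit breaker: once the budget is spent, stop looping and escalate.
    if budget.exhausted():
        return "escalate"
    # Transient errors are retried; semantic failures are reported as-is.
    if kind is ErrorKind.TRANSIENT:
        return "retry"
    return "fail"
```

Carrying the error kind explicitly keeps a rate limit from being read as "the approach is wrong", and the budget check bounds how long an agent can spend fixing its own mistakes.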
## Emerging Patterns (2026)
### The "Rule of 4"
Research cited under Sources suggests effective team sizes top out at ~3-4 agents. Beyond this, communication overhead grows super-linearly (reported exponent 1.724), so the cost of coordination outpaces the value added.
**Implication**: Don't build 10-agent swarms. Build 3-4 specialized agents with clear boundaries.
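
To get a feel for the quoted exponent, the sketch below assumes overhead scales as `n ** 1.724` relative to a single agent. That functional form is an assumption made here for illustration; the underlying research may define the relationship differently.

```python
EXPONENT = 1.724  # exponent quoted above; the power-law form is an assumption


def relative_overhead(n_agents: int) -> float:
    """Coordination overhead relative to one agent, assuming overhead ~ n^1.724."""
    return n_agents ** EXPONENT


for n in (2, 4, 10):
    total = relative_overhead(n)
    # Per-agent overhead rises with team size when the exponent exceeds 1.
    print(f"{n} agents: total {total:.1f}x, per agent {total / n:.2f}x")
```

Under this assumption, four agents carry roughly 11x the overhead of one, and ten agents roughly 53x, so going from 4 to 10 agents multiplies headcount by 2.5 but overhead by almost 5.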
### Spec-Driven Development
Adopted by Kiro, Tessl, GitHub Spec Kit:
- `requirements.md` - what to build
- `design.md` - how to build it
- `tasks.md` - decomposed work items
Agents work from specs, not vague prompts. Specs are versioned; agents echo which version they used.
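
The "agents echo which version they used" idea can be sketched with two small records and a staleness check. The field and type names below (`TaskSpec`, `AgentResult`, `spec_version`) are hypothetical and not taken from Kiro, Tessl, or GitHub Spec Kit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    version: int
    requirements: str        # what to build
    design: str              # how to build it
    tasks: tuple[str, ...]   # decomposed work items


@dataclass(frozen=True)
class AgentResult:
    task_id: str
    spec_version: int  # the spec version the agent worked from
    summary: str


def is_current(result: AgentResult, spec: TaskSpec) -> bool:
    """True if the result was produced against the current spec version."""
    return result.task_id == spec.task_id and result.spec_version == spec.version
```

A result that fails `is_current` was built against an outdated spec and is a candidate for a re-run rather than review.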
### Layered Coordination (Not Monolithic)
Instead of one complex orchestrator, compose independent layers:
1. Configuration management
2. Issue tracking (JSONL, merge-friendly)
3. Atomic locking (PostgreSQL advisory locks)
4. Filesystem isolation (git worktrees)
5. Validation gates
6. Enforcement rules
7. Session protocols
Each layer independently useful; failures isolated.
### PostgreSQL Advisory Locks for Claims
Novel insight: session-level advisory locks auto-release when the holding connection ends, including on crash (no orphaned locks), reportedly operate in ~1ms, and involve no table writes. An elegant solution for distributed claim races.
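```sql
-- task_id_hash is a placeholder for a bigint key derived from the task ID.
-- Returns true if the lock was acquired, false if another session holds it.
SELECT pg_try_advisory_lock(task_id_hash);
-- Work...
SELECT pg_advisory_unlock(task_id_hash);
-- Or: connection dies → session-level lock is auto-released
```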
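
The single-argument advisory lock functions take a `bigint` key, so a string task ID has to be mapped to a signed 64-bit integer first. One possible mapping, sketched below, truncates a SHA-256 digest; the function name `advisory_lock_key` is hypothetical, and any stable hash would serve.

```python
import hashlib


def advisory_lock_key(task_id: str) -> int:
    """Map a task ID to a stable signed 64-bit integer (fits a Postgres bigint)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Take the first 8 bytes and interpret them as a signed integer.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)
```

Python's built-in `hash()` is unsuitable here because string hashes are randomized per process by default, so two workers would compute different keys for the same task. Distinct task IDs can in principle collide on the same 64-bit key, which would make one claim block another unrelated one.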
### Git Bundles for Checkpoints (SkillFS)
Every agent sandbox is a git repo. Session ends → git bundle stored. New session → restore from bundle, continue where left off. Complete audit trail via `git log`.
### Hierarchical Over Flat Swarms
Instead of 100-agent flat swarms:
- Nested coordination structures
- Partition the communication graph
- Supervisor per sub-team
- Only supervisors talk to each other
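
The effect of partitioning the communication graph can be estimated by counting possible pairwise links. The sketch assumes a simple model: every pair in a flat swarm may talk, while in the hierarchy agents talk only within their team and supervisors talk only to each other. Function names are illustrative.

```python
def flat_links(n_agents: int) -> int:
    """Pairwise links when every agent may talk to every other agent."""
    return n_agents * (n_agents - 1) // 2


def hierarchical_links(n_teams: int, team_size: int) -> int:
    """Links when teams are fully connected internally and only supervisors
    (one per team, counted in team_size) are connected across teams."""
    within_teams = n_teams * flat_links(team_size)
    between_supervisors = flat_links(n_teams)
    return within_teams + between_supervisors


# 100 agents flat vs. 20 teams of 5 (1 supervisor + 4 workers each).
print(flat_links(100), hierarchical_links(20, 5))
```

Under this model, 100 agents in a flat swarm have 4950 possible links, while 20 teams of 5 have 390.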
### Plan-and-Execute Cost Pattern
Expensive model creates strategy; cheap models execute steps. Can reportedly reduce costs by up to 90%.
```
Orchestrator (Claude Opus) → Plan
Workers (Claude Haiku) → Execute steps
Reviewer (Claude Sonnet) → Validate
```
### Bounded Autonomy Spectrum
Progressive autonomy based on risk:
1. **Human in the loop** - approve each action
2. **Human on the loop** - monitor, intervene if needed
3. **Human out of the loop** - fully autonomous
Match to task complexity and outcome criticality.
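
Matching autonomy to complexity and criticality could be encoded as a small policy function. The three-level `low`/`medium`/`high` scale and the rule "the riskier of the two dimensions decides" are assumptions made for this sketch, not a recommendation from the sources.

```python
from enum import Enum


class Autonomy(Enum):
    IN_THE_LOOP = "human approves each action"
    ON_THE_LOOP = "human monitors, intervenes if needed"
    OUT_OF_THE_LOOP = "fully autonomous"


_LEVELS = {"low": 0, "medium": 1, "high": 2}


def autonomy_for(complexity: str, criticality: str) -> Autonomy:
    """Pick an autonomy level; the riskier of the two dimensions decides."""
    risk = max(_LEVELS[complexity], _LEVELS[criticality])
    by_risk = [Autonomy.OUT_OF_THE_LOOP, Autonomy.ON_THE_LOOP, Autonomy.IN_THE_LOOP]
    return by_risk[risk]
```

For example, a low-complexity change to a high-criticality system still maps to `IN_THE_LOOP` under this rule.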
## Best Practices Synthesis
### From HN Discussions
1. **Well-scoped tasks with tight contracts** - Not vague prompts
2. **Automated testing gates** - Agents must pass before review
3. **2-3 agents realistic** - Not 10 parallel
4. **Exclusive ownership per module** - One writer per concern
5. **Short-lived branches** - Frequent merge to prevent drift
### From orch Consensus
1. **Treat agents as untrusted workers** - Not peers with full access
2. **Machine-readable contracts** - JSON schema between roles
3. **Per-agent logs with correlation IDs** - Distributed systems observability
4. **Guardrail agents** - Security/policy checks on every diff
5. **Versioned task specs** - Bump version → re-run affected agents
### From Practitioner Blogs
1. **Coordination ≠ isolation** - Advisory locks (who works on what) + worktrees (how they work)
2. **JSONL for issues** - One per line, deterministic merge rules
3. **Session protocols** - Explicit start/close procedures
4. **Modular rules with includes** - Template configuration
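
A deterministic merge rule for JSONL issue files might look like the sketch below: one record per issue ID, the most recently updated record wins, and output is sorted by ID. The field names `id` and `updated_at` are assumptions, and timestamps are assumed to be strings in one consistent ISO 8601 format so that string comparison orders them correctly.

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def _rank(record: dict) -> tuple[str, str]:
    # Newer updated_at wins; the canonical serialization breaks ties so the
    # result does not depend on which side is "ours" and which is "theirs".
    return (record["updated_at"], json.dumps(record, sort_keys=True))


def merge_issues(ours: str, theirs: str) -> str:
    """Merge two JSONL issue files into one, keeping one record per issue ID."""
    merged: dict[str, dict] = {}
    for record in parse_jsonl(ours) + parse_jsonl(theirs):
        current = merged.get(record["id"])
        if current is None or _rank(record) > _rank(current):
            merged[record["id"]] = record
    lines = [json.dumps(merged[key], sort_keys=True) for key in sorted(merged)]
    return "\n".join(lines) + "\n"
```

Because the rule is symmetric, `merge_issues(a, b)` and `merge_issues(b, a)` produce identical output, which is what makes the format merge-friendly across branches.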
## How This Applies to Our Design
### Already Covered
| Pattern | Our Design |
|---------|------------|
| SQLite for coordination | ✅ bus.db with transactions |
| Git worktrees | ✅ branch-per-worker.md |
| State machine | ✅ worker-state-machine.md |
| Heartbeats/liveness | ✅ 10s interval in message-passing |
| Claim-check pattern | ✅ SQLite transactions |
| Task serialization | ✅ No uncommitted dependencies |
### Should Add
| Pattern | Gap | Action |
|---------|-----|--------|
| Spec-driven tasks | Tasks are just titles | Add structured task specs (requirements, design, acceptance) |
| Role boundaries | Not enforced | Add tool-level constraints per agent type |
| Review funnel | Missing arbiter | Add synthesis step before human review |
| Versioned specs | Not tracked | Add version field to task assignments |
| Cost budgets | Not implemented | Add token/time budgets per agent |
| Correlation IDs | Partial (correlation_id) | Ensure end-to-end tracing |
### Validate Our Decisions
| Decision | Validation |
|----------|------------|
| SQLite over JSONL | ✅ Confirmed - JSONL for issues only, SQLite for coordination |
| Orchestrator creates branches | ✅ Confirmed - reduces agent setup, enforces policy |
| 3-4 agents max | ✅ Aligns with "Rule of 4" research |
| Mandatory rebase | ✅ Confirmed - prevents stale branch drift |
| Escalate semantic conflicts | ✅ Confirmed - agents hallucinate resolutions |
## Open Questions Surfaced
1. **PostgreSQL advisory locks vs SQLite?** - Do we need Postgres, or is SQLite sufficient?
2. **Git bundles for checkpoints?** - Should we adopt SkillFS pattern?
3. **Spec files per task?** - How structured should task specs be?
4. **Arbiter/synthesis agent?** - Add to architecture before human review?
5. **Token budgets?** - How to enforce across different agent types?
## Sources
### HN Discussions
- [Superset: 10 Parallel Coding Agents](https://news.ycombinator.com/item?id=46368739)
- [Desktop App for Parallel Agentic Dev](https://news.ycombinator.com/item?id=46027947)
- [SkillFS: Git-backed Sandboxes](https://news.ycombinator.com/item?id=46543093)
- [Zenflow: Agent Orchestration](https://news.ycombinator.com/item?id=46290617)
- [Git Worktree for Parallel Dev](https://news.ycombinator.com/item?id=46510462)
### Blogs & Articles
- [Building a Multi-Agent Development Workflow](https://itsgg.com/blog/2026/01/08/building-a-multi-agent-development-workflow/)
- [The Real Struggle with AI Coding Agents](https://www.smiansh.com/blogs/the-real-struggle-with-ai-coding-agents-and-how-to-overcome-it/)
- [Why AI Coding Tools Don't Work For Me](https://blog.miguelgrinberg.com/post/why-generative-ai-coding-tools-and-agents-do-not-work-for-me)
- [Microsoft: Multi-Agent Systems at Scale](https://devblogs.microsoft.com/ise/multi-agent-systems-at-scale/)
- [LangChain: How and When to Build Multi-Agent](https://blog.langchain.com/how-and-when-to-build-multi-agent-systems/)
### Research & Analysis
- [VentureBeat: More Agents Isn't Better](https://venturebeat.com/orchestration/research-shows-more-agents-isnt-a-reliable-path-to-better-enterprise-ai)
- [Deloitte: AI Agent Orchestration](https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2026/ai-agent-orchestration.html)
- [10 Things Developers Want from Agentic IDEs](https://redmonk.com/kholterhoff/2025/12/22/10-things-developers-want-from-their-agentic-ides-in-2025/)