skills/docs/design/message-passing-comparison.md
dan 1c66d019bd feat: add worker CLI scaffold in Nim
2026-01-10 18:47:47 -08:00


Message Passing: Design Comparison

Purpose: Compare our design decisions against Beads and Tissue to validate approach and identify gaps.

Summary Comparison

| Decision | Our Design (v2) | Beads | Tissue |
|---|---|---|---|
| Primary storage | SQLite (WAL mode) | SQLite + JSONL export | JSONL (append-only) |
| Cache/index | N/A (SQLite is primary) | SQLite is primary | SQLite (derived) |
| Write locking | SQLite BEGIN IMMEDIATE | SQLite BEGIN IMMEDIATE | None (git merge) |
| Concurrency model | SQLite transactions | Optimistic (hash IDs) + SQLite txn | Optimistic (git merge) |
| Crash safety | SQLite atomic commit | SQLite transactions | Git (implicit) |
| Heartbeats | Yes (10s interval) | No (daemon only) | No |
| Liveness detection | SQL query on heartbeat timestamps | Not documented | Not documented |
| Large payloads | Blob storage (>4KB) | Compaction/summarization | Not addressed |
| Coordination | Polling + claim-check | bd ready queries | tissue ready queries |
| Message schema | Explicit (id, ts, from, type, payload) | Implicit (issue events) | Implicit (issue events) |
| Human debugging | JSONL export (read-only) | JSONL in git | JSONL primary |

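Decision (2026-01-10): After an orch consensus round with 3 models, we aligned with Beads' approach (SQLite primary) over Tissue's (JSONL primary). The Detailed Analysis below still describes the earlier flock-on-JSONL design that this decision replaces. Key factors: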

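  • Payloads of 1-50 KB exceed the ~4 KB range in which appends are usually treated as atomic (POSIX guarantees atomicity only for pipe writes up to PIPE_BUF, which is 4096 bytes on Linux)
  • A crash mid-write still corrupts the log, even when the writer holds a flock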
  • SQLite transactions provide true atomicity
  • JSONL export preserves human debugging (tail -f)
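To make the atomicity argument concrete, here is a minimal sketch of a transactional write in Python's standard `sqlite3` module. The `messages` table, its columns, and the `open_bus`/`send` helpers are hypothetical names chosen for illustration, not the worker CLI's actual schema.

```python
import json
import sqlite3
import time
import uuid


def open_bus(path: str) -> sqlite3.Connection:
    """Open the bus database in WAL mode. Table and column names are hypothetical."""
    # isolation_level=None disables sqlite3's implicit transactions, so the
    # BEGIN IMMEDIATE / COMMIT below are the only transaction boundaries.
    conn = sqlite3.connect(path, isolation_level=None, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id TEXT PRIMARY KEY, ts REAL NOT NULL, sender TEXT NOT NULL,"
        " type TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return conn


def send(conn: sqlite3.Connection, sender: str, msg_type: str, payload) -> str:
    """Insert one message atomically; a crash before COMMIT leaves no partial row."""
    msg_id = uuid.uuid4().hex
    # BEGIN IMMEDIATE takes the write lock up front instead of on first write.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "INSERT INTO messages (id, ts, sender, type, payload)"
            " VALUES (?, ?, ?, ?, ?)",
            (msg_id, time.time(), sender, msg_type, json.dumps(payload)),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return msg_id
```

Payload size does not change the guarantee here: a 50 KB payload is either fully committed or absent after a crash.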

Detailed Analysis

Where We Align

1. JSONL as Source of Truth

All three systems use append-only JSONL as the authoritative store. This is the right call:

  • Git-friendly (merges cleanly)
  • Human-readable (debuggable with cat | jq)
  • Simple to implement

2. SQLite as Derived Cache

All three use SQLite for queries, not as primary storage:

  • Beads: Always-on cache with dirty tracking
  • Tissue: Derived index, gitignored
  • Ours: Phase 2 optimization

3. Pull-Based Coordination

All use polling/queries rather than push events:

  • bd ready / tissue ready / our poll() function
  • Simpler than event-driven, works across process boundaries

Where We Diverge

1. Write Locking Strategy

| System | Approach | Trade-off |
|---|---|---|
| Ours | flock on JSONL file | Simple, prevents interleaving, works locally |
| Beads | SQLite BEGIN IMMEDIATE | Stronger guarantees, more complex |
| Tissue | None (trust git merge) | Simplest, but can corrupt JSONL mid-write |

Our rationale: flock is simpler than SQLite transactions and safer than trusting git merge for mid-write crashes. Tissue's approach assumes writes complete atomically, which isn't guaranteed for large JSON lines.
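The flock approach can be sketched with Python's `fcntl` module (Unix only). The `append_locked` helper and the log path are hypothetical. The lock prevents interleaved lines from concurrent writers, but it does not make the write itself atomic if the process dies mid-write.

```python
import fcntl
import json
import os


def append_locked(log_path: str, record: dict) -> None:
    """Append one JSON line while holding an exclusive advisory lock."""
    line = json.dumps(record, separators=(",", ":")) + "\n"
    with open(log_path, "a", encoding="utf-8") as log:
        # Blocks until no other cooperating process holds the lock.
        fcntl.flock(log, fcntl.LOCK_EX)
        try:
            log.write(line)
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before unlocking
        finally:
            fcntl.flock(log, fcntl.LOCK_UN)
```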

2. Crash Safety

| System | Approach |
|---|---|
| Ours | Write to staging → validate → append under lock → delete staging |
| Beads | SQLite transactions (rollback on failure) |
| Tissue | Git recovery (implicit) |

Our rationale: Staging directory adds explicit crash recovery without SQLite complexity. If agent dies mid-write, staged file is recovered on restart.
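The staging sequence can be sketched as follows. The function names and directory layout are hypothetical, and dropping unparseable staged files during recovery is an assumed policy rather than something the design specifies.

```python
import fcntl
import json
import os
import uuid


def _commit_staged(log_path: str, staged_path: str) -> None:
    """Validate a staged file, append it to the log under lock, then delete it."""
    # Validate: json.load raises on a truncated or malformed staged file.
    with open(staged_path, encoding="utf-8") as f:
        record = json.load(f)
    line = json.dumps(record, separators=(",", ":")) + "\n"
    with open(log_path, "a", encoding="utf-8") as log:
        fcntl.flock(log, fcntl.LOCK_EX)
        try:
            log.write(line)
            log.flush()
            os.fsync(log.fileno())
        finally:
            fcntl.flock(log, fcntl.LOCK_UN)
    os.remove(staged_path)


def staged_append(log_path: str, staging_dir: str, record: dict) -> None:
    """Write to staging first, so a crash leaves a recoverable file behind."""
    os.makedirs(staging_dir, exist_ok=True)
    staged_path = os.path.join(staging_dir, uuid.uuid4().hex + ".json")
    with open(staged_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())
    _commit_staged(log_path, staged_path)


def recover(log_path: str, staging_dir: str) -> int:
    """On restart, commit complete staged files and drop incomplete ones."""
    if not os.path.isdir(staging_dir):
        return 0
    recovered = 0
    for name in sorted(os.listdir(staging_dir)):
        path = os.path.join(staging_dir, name)
        try:
            _commit_staged(log_path, path)
            recovered += 1
        except json.JSONDecodeError:
            os.remove(path)  # assumed policy: discard half-written staging files
    return recovered
```

One limitation of this sketch: a crash after the append but before the staged file is deleted makes `recover` append the same record twice, so readers would need to deduplicate by message ID.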

3. Heartbeats / Liveness

| System | Approach |
|---|---|
| Ours | Mandatory heartbeats every 10s, timeout detection |
| Beads | Background daemon (no explicit heartbeats) |
| Tissue | None |

Our rationale: LLM API calls can hang indefinitely. Without heartbeats, a stuck agent blocks tasks forever. Beads/Tissue are issue trackers, not real-time coordination systems.
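Timeout detection can be sketched as a single query over heartbeat timestamps. The `heartbeats` table and helper names are hypothetical, and the 30-second timeout (three missed beats) is an assumption; the document fixes only the 10-second interval.

```python
import sqlite3
import time

HEARTBEAT_INTERVAL_S = 10                 # interval stated in the design
TIMEOUT_S = 3 * HEARTBEAT_INTERVAL_S      # assumption: three missed beats


def init_heartbeats(conn: sqlite3.Connection) -> None:
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS heartbeats ("
            " worker_id TEXT PRIMARY KEY, last_seen REAL NOT NULL)"
        )


def record_heartbeat(conn: sqlite3.Connection, worker_id: str, now=None) -> None:
    """Upsert the worker's latest heartbeat timestamp."""
    now = time.time() if now is None else now
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO heartbeats (worker_id, last_seen)"
            " VALUES (?, ?)",
            (worker_id, now),
        )


def stale_workers(conn: sqlite3.Connection, now=None, timeout_s=TIMEOUT_S):
    """Return IDs of workers whose last heartbeat is older than the timeout."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT worker_id FROM heartbeats WHERE last_seen < ?"
        " ORDER BY worker_id",
        (now - timeout_s,),
    )
    return [r[0] for r in rows]
```

A supervisor could run `stale_workers` on each poll and release the tasks of any worker it returns.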

4. Large Payload Handling

| System | Approach |
|---|---|
| Ours | Blob storage with content-addressable hashing |
| Beads | Compaction (summarize old tasks) |
| Tissue | Not addressed |

Our rationale: Code diffs and agent outputs can be large. Blob storage keeps the log scannable. Beads' compaction is for context windows, not payload size.
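A content-addressable blob store can be sketched as below. The 4096-byte threshold follows the ">4KB" entry in the summary table. The SHA-256 choice, the `blobs/` layout, and the reference format are assumptions.

```python
import hashlib
import os

BLOB_THRESHOLD = 4096  # bytes; mirrors the ">4KB" cutoff in the summary table


def store_payload(payload: bytes, blob_dir: str) -> dict:
    """Inline small payloads; write large ones to a file named by their hash."""
    if len(payload) <= BLOB_THRESHOLD:
        return {"inline": payload.decode("utf-8")}
    digest = hashlib.sha256(payload).hexdigest()
    os.makedirs(blob_dir, exist_ok=True)
    final_path = os.path.join(blob_dir, digest)
    if not os.path.exists(final_path):  # identical content is stored once
        tmp_path = f"{final_path}.{os.getpid()}.tmp"
        with open(tmp_path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, final_path)  # atomic rename into place
    return {"blob": digest, "size": len(payload)}


def load_payload(ref: dict, blob_dir: str) -> bytes:
    """Resolve a reference produced by store_payload, verifying blob integrity."""
    if "inline" in ref:
        return ref["inline"].encode("utf-8")
    with open(os.path.join(blob_dir, ref["blob"]), "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() != ref["blob"]:
        raise ValueError("blob content does not match its hash")
    return data
```

The message log then carries only the short reference, which keeps each line small enough to scan by eye.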

5. Message Schema

| System | Schema Type |
|---|---|
| Ours | Explicit message schema (id, ts, from, to, type, payload) |
| Beads | Issue-centric (tasks with dependencies, audit trail) |
| Tissue | Issue-centric (similar to Beads) |

Our rationale: We need general message passing (state changes, heartbeats, claims), not just issue tracking. Beads/Tissue are issue trackers first; we're building coordination primitives.
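The explicit schema can be sketched as a small dataclass. The field names come from the table above. The types and the optional `to` field are assumptions, and `from` is stored as `sender` in Python only because `from` is a reserved word.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Message:
    id: str
    ts: float
    sender: str          # serialized as "from"
    to: Optional[str]    # assumption: None means broadcast
    type: str
    payload: Any

    def to_json(self) -> str:
        """Serialize to one JSON line using the schema's field names."""
        return json.dumps(
            {
                "id": self.id,
                "ts": self.ts,
                "from": self.sender,
                "to": self.to,
                "type": self.type,
                "payload": self.payload,
            },
            sort_keys=True,
        )

    @classmethod
    def from_json(cls, line: str) -> "Message":
        d = json.loads(line)
        return cls(
            id=d["id"],
            ts=d["ts"],
            sender=d["from"],
            to=d.get("to"),
            type=d["type"],
            payload=d["payload"],
        )
```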

Gaps in Our Design (Learned from Beads)

1. Hash-Based IDs for Merge Safety

Beads uses hash-based IDs (e.g., bd-a1b2) to prevent merge collisions. We should consider this for message IDs if multiple agents might create messages offline and merge later.
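One way to derive such IDs is to hash the message's identifying fields and keep a short prefix of the digest. This is an illustrative sketch, not Beads' actual algorithm. The `msg` prefix, the hashed fields, and the 8-character length are all assumptions.

```python
import hashlib
import json


def message_id(sender: str, ts: float, payload, prefix: str = "msg", length: int = 8) -> str:
    """Derive a short, deterministic ID from message content."""
    # Canonical JSON so the same logical content always hashes identically.
    canonical = json.dumps([sender, ts, payload], sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:length]}"
```

Because no central counter is involved, two agents working offline cannot hand out the same ID for different messages except by hash collision, and a longer `length` makes that less likely.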

2. Dirty Tracking for Incremental Export

Beads tracks "dirty" issues for efficient JSONL export. When we add SQLite cache, we should track which messages need re-export rather than full rescans.
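Dirty tracking can be sketched with a flag column that is set on every write and cleared after export. The table layout and helper names are hypothetical and do not reflect how Beads implements it.

```python
import json
import sqlite3


def init_messages(conn: sqlite3.Connection) -> None:
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            " id TEXT PRIMARY KEY, payload TEXT NOT NULL,"
            " dirty INTEGER NOT NULL DEFAULT 1)"
        )


def upsert(conn: sqlite3.Connection, msg_id: str, payload: str) -> None:
    """Insert or update a message and mark it as needing export."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO messages (id, payload, dirty) VALUES (?, ?, 1)",
            (msg_id, payload),
        )


def export_dirty(conn: sqlite3.Connection, out_path: str) -> int:
    """Append only changed rows to the JSONL export, then clear their flags."""
    rows = conn.execute(
        "SELECT id, payload FROM messages WHERE dirty = 1 ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    with open(out_path, "a", encoding="utf-8") as out:
        for msg_id, payload in rows:
            out.write(json.dumps({"id": msg_id, "payload": payload}) + "\n")
    # Clear flags only for the rows actually exported.
    with conn:
        conn.executemany(
            "UPDATE messages SET dirty = 0 WHERE id = ?", [(r[0],) for r in rows]
        )
    return len(rows)
```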

3. File Hash Validation

Beads stores JSONL file hash to detect external modifications. We could add this to detect corruption or manual edits.
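A startup check along these lines might look like the sketch below. The `.sha256` sidecar file and the function names are assumptions. Where Beads keeps its hash is not described here.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large logs are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def save_hash(log_path: str) -> str:
    """Record the log's current hash in a sidecar file after a trusted write."""
    digest = file_sha256(log_path)
    with open(log_path + ".sha256", "w", encoding="utf-8") as f:
        f.write(digest)
    return digest


def is_unmodified(log_path: str) -> bool:
    """Return True only if the log still matches its recorded hash."""
    try:
        with open(log_path + ".sha256", encoding="utf-8") as f:
            expected = f.read().strip()
    except FileNotFoundError:
        return False  # no recorded hash: treat as unverified
    return file_sha256(log_path) == expected
```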

Gaps in Our Design (Learned from Tissue)

1. FTS5 Full-Text Search

Tissue's SQLite cache includes FTS5 for searching issue content. Useful for "find messages mentioning X" queries in Phase 2.
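A "find messages mentioning X" query could be sketched with an FTS5 virtual table. The example assumes the linked SQLite library was built with FTS5 enabled, and the `messages_fts` table and helper names are hypothetical.

```python
import sqlite3


def build_index(conn: sqlite3.Connection) -> None:
    """Create a full-text index; the id column is stored but not tokenized."""
    with conn:
        conn.execute(
            "CREATE VIRTUAL TABLE IF NOT EXISTS messages_fts"
            " USING fts5(id UNINDEXED, body)"
        )


def index_message(conn: sqlite3.Connection, msg_id: str, body: str) -> None:
    with conn:
        conn.execute(
            "INSERT INTO messages_fts (id, body) VALUES (?, ?)", (msg_id, body)
        )


def search(conn: sqlite3.Connection, query: str):
    """Return message IDs matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT id FROM messages_fts WHERE messages_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [r[0] for r in rows]
```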

2. Simpler Concurrency (Maybe)

Tissue trusts git merge without explicit locking. For single-machine scenarios with small writes, this might be sufficient. We could offer a "simple mode" without flock for low-contention cases.

Validation Verdict

Our design is more complex than Tissue but simpler than Beads, which matches our use case:

  • Tissue: Issue tracker, optimizes for git collaboration
  • Beads: Full workflow engine with daemon, RPC, recipes
  • Ours: Coordination primitives for multi-agent coding

The key additions we make (heartbeats, blob storage, staging directory) are justified by our real-time coordination requirements that issue trackers don't have.

Recommended follow-ups, drawn from the gaps above:

  1. Add hash-based message IDs - Prevent merge collisions if agents work offline
  2. Add file hash validation - Detect log corruption on startup
  3. Document "simple mode" - No flock for single-agent or low-contention scenarios
  4. Plan for FTS5 - Add to Phase 2 SQLite cache design

References