skills/docs/design/message-passing-comparison.md
dan 1c66d019bd feat: add worker CLI scaffold in Nim
Multi-agent coordination CLI with SQLite message bus:
- State machine: ASSIGNED -> WORKING -> IN_REVIEW -> APPROVED -> COMPLETED
- Commands: spawn, start, done, approve, merge, cancel, fail, heartbeat
- SQLite WAL mode, dedicated heartbeat thread, channel-based IPC
- cligen for CLI, tiny_sqlite for DB, ORC memory management

Design docs for branch-per-worker, state machine, message passing,
and human observability patterns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 18:47:47 -08:00

# Message Passing: Design Comparison
**Purpose**: Compare our design decisions against Beads and Tissue to validate our approach and identify gaps.
## Summary Comparison
| Decision | Our Design (v2) | Beads | Tissue |
|----------|-----------------|-------|--------|
| **Primary storage** | SQLite (WAL mode) | SQLite + JSONL export | JSONL (append-only) |
| **Cache/index** | N/A (SQLite is primary) | SQLite is primary | SQLite (derived) |
| **Write locking** | SQLite BEGIN IMMEDIATE | SQLite BEGIN IMMEDIATE | None (git merge) |
| **Concurrency model** | SQLite transactions | Optimistic (hash IDs) + SQLite txn | Optimistic (git merge) |
| **Crash safety** | SQLite atomic commit | SQLite transactions | Git (implicit) |
| **Heartbeats** | Yes (10s interval) | No (daemon only) | No |
| **Liveness detection** | SQL query on heartbeat timestamps | Not documented | Not documented |
| **Large payloads** | Blob storage (>4KB) | Compaction/summarization | Not addressed |
| **Coordination** | Polling + claim-check | `bd ready` queries | `tissue ready` queries |
| **Message schema** | Explicit (id, ts, from, type, payload) | Implicit (issue events) | Implicit (issue events) |
| **Human debugging** | JSONL export (read-only) | JSONL in git | JSONL primary |
**Decision (2026-01-10)**: After orch consensus with 3 models, we aligned with Beads' approach (SQLite primary) over Tissue's (JSONL primary). Key factors:
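- Payloads of 1-50 KB exceed the ~4 KB atomic-write limit (`PIPE_BUF` on Linux, which POSIX guarantees only for pipes and FIFOs), so large appends to a regular file cannot be assumed atomic
- flock only prevents interleaving between writers; a crash mid-write still leaves a partial line in the log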
- SQLite transactions provide true atomicity
- JSONL export preserves human debugging (`tail -f`)
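
The decision above can be illustrated with a minimal sketch, assuming Python's standard `sqlite3` module rather than the Nim/tiny_sqlite stack the project uses. The table layout follows the message schema in the summary table; the column name `sender` (standing in for `from`, which is an SQL keyword) and the helper names `open_bus`, `send`, and `export_jsonl` are hypothetical.

```python
import json
import sqlite3
import time
import uuid


def open_bus(path: str) -> sqlite3.Connection:
    # isolation_level=None turns off the module's implicit transactions,
    # so BEGIN IMMEDIATE below is issued exactly when we ask for it.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id TEXT PRIMARY KEY,"
        " ts REAL NOT NULL,"
        " sender TEXT NOT NULL,"
        " type TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return conn


def send(conn: sqlite3.Connection, sender: str, msg_type: str, payload: dict) -> str:
    msg_id = uuid.uuid4().hex
    # BEGIN IMMEDIATE takes the write lock up front, so the insert either
    # commits as a whole or is rolled back; no partial message is visible.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "INSERT INTO messages (id, ts, sender, type, payload) VALUES (?, ?, ?, ?, ?)",
            (msg_id, time.time(), sender, msg_type, json.dumps(payload)),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return msg_id


def export_jsonl(conn: sqlite3.Connection, out_path: str) -> int:
    # Read-only export for human debugging; SQLite stays the source of truth.
    rows = conn.execute(
        "SELECT id, ts, sender, type, payload FROM messages ORDER BY ts, id"
    ).fetchall()
    with open(out_path, "w", encoding="utf-8") as out:
        for msg_id, ts, sender, msg_type, payload in rows:
            record = {"id": msg_id, "ts": ts, "from": sender,
                      "type": msg_type, "payload": json.loads(payload)}
            out.write(json.dumps(record) + "\n")
    return len(rows)
```

The export here rewrites the whole file on each call; a real exporter that supports `tail -f` would more likely append new rows only.
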
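## Detailed Analysis
The subsections below describe the earlier JSONL-primary design (flock plus staging directory). Where they conflict with the summary table, the table and the 2026-01-10 decision reflect the current v2 design.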
### Where We Align
**1. JSONL as Source of Truth**
All three systems use append-only JSONL as the authoritative store. This is the right call:
- Git-friendly (merges cleanly)
- Human-readable (debuggable with `cat | jq`)
- Simple to implement
**2. SQLite as Derived Cache**
All three use SQLite for queries, not as primary storage:
- Beads: Always-on cache with dirty tracking
- Tissue: Derived index, gitignored
- Ours: Phase 2 optimization
**3. Pull-Based Coordination**
All use polling/queries rather than push events:
- `bd ready` / `tissue ready` / our `poll()` function
- Simpler than event-driven, works across process boundaries
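
A pull-based `poll()` might look like the following sketch. It assumes a SQLite-backed message table with a monotonically increasing `seq` column used as a cursor; the table layout and the `init_bus` helper are hypothetical, and only the `poll()` name comes from the text above.

```python
import sqlite3


def init_bus(conn: sqlite3.Connection) -> None:
    # Hypothetical minimal table; seq gives each message a stable order.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " type TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )


def poll(conn: sqlite3.Connection, after_seq: int = 0, limit: int = 100):
    """Return (rows, next_cursor) for messages newer than after_seq."""
    rows = conn.execute(
        "SELECT seq, type, payload FROM messages"
        " WHERE seq > ? ORDER BY seq LIMIT ?",
        (after_seq, limit),
    ).fetchall()
    # If nothing new arrived, the cursor stays where it was.
    next_cursor = rows[-1][0] if rows else after_seq
    return rows, next_cursor
```

Each consumer keeps its own cursor and calls `poll()` on a timer, which is what lets this work across process boundaries without any push channel.
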
### Where We Diverge
**1. Write Locking Strategy**
| System | Approach | Trade-off |
|--------|----------|-----------|
| **Ours** | flock on JSONL file | Simple, prevents interleaving, works locally |
| **Beads** | SQLite BEGIN IMMEDIATE | Stronger guarantees, more complex |
| **Tissue** | None (trust git merge) | Simplest, but can corrupt JSONL mid-write |
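**Our rationale**: flock is simpler than SQLite transactions and, unlike trusting git merge, it prevents concurrent writers from interleaving partial lines. Tissue's approach assumes each write completes atomically, which isn't guaranteed for large JSON lines. flock alone does not protect against a crash mid-write; that is handled separately (see Crash Safety).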
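
As a minimal sketch of the flock approach, assuming a Unix host (Python's `fcntl` module is not available on Windows); the `append_locked` name is hypothetical.

```python
import fcntl
import json
import os


def append_locked(log_path: str, message: dict) -> None:
    line = json.dumps(message, separators=(",", ":")) + "\n"
    with open(log_path, "a", encoding="utf-8") as log:
        # Exclusive advisory lock: other writers using flock wait here,
        # so two lines can never interleave.
        fcntl.flock(log.fileno(), fcntl.LOCK_EX)
        try:
            log.write(line)
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before unlocking
        finally:
            fcntl.flock(log.fileno(), fcntl.LOCK_UN)
```

The lock is advisory, so it only helps if every writer goes through this path. It also does nothing for a process that dies between `write` and `fsync`: the log can still end in a torn line.
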
**2. Crash Safety**
| System | Approach |
|--------|----------|
| **Ours** | Write to staging → validate → append under lock → delete staging |
| **Beads** | SQLite transactions (rollback on failure) |
| **Tissue** | Git recovery (implicit) |
**Our rationale**: A staging directory adds explicit crash recovery without SQLite complexity. If an agent dies mid-write, the staged file is recovered on restart.
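
The stage, validate, append, delete sequence and its recovery path could be sketched as follows. This assumes every message carries an `id` that is safe to use as a filename; the function names are hypothetical. Recovery deduplicates by `id`, because a crash between the append and the delete would otherwise replay a message that is already in the log.

```python
import fcntl
import json
import os


def _append_line(log_path: str, line: str) -> None:
    with open(log_path, "a", encoding="utf-8") as log:
        fcntl.flock(log.fileno(), fcntl.LOCK_EX)
        try:
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())
        finally:
            fcntl.flock(log.fileno(), fcntl.LOCK_UN)


def _logged_ids(log_path: str) -> set:
    ids = set()
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                try:
                    ids.add(json.loads(line)["id"])
                except (ValueError, KeyError, TypeError):
                    continue  # torn or unrecognised line; ignore it here
    return ids


def write_message(log_path: str, staging_dir: str, message: dict) -> None:
    os.makedirs(staging_dir, exist_ok=True)
    staged = os.path.join(staging_dir, f"{message['id']}.json")
    with open(staged, "w", encoding="utf-8") as f:  # 1. stage
        json.dump(message, f)
        f.flush()
        os.fsync(f.fileno())
    with open(staged, encoding="utf-8") as f:       # 2. validate
        validated = json.load(f)
    _append_line(log_path, json.dumps(validated))   # 3. append under lock
    os.remove(staged)                               # 4. delete staging file


def recover(log_path: str, staging_dir: str) -> int:
    """Replay staged messages left behind by a crash; return how many were appended."""
    if not os.path.isdir(staging_dir):
        return 0
    seen = _logged_ids(log_path)
    recovered = 0
    for name in sorted(os.listdir(staging_dir)):
        path = os.path.join(staging_dir, name)
        try:
            with open(path, encoding="utf-8") as f:
                message = json.load(f)
            msg_id = message["id"]
        except (ValueError, KeyError, TypeError):
            os.remove(path)  # the staged file itself was torn; nothing to replay
            continue
        if msg_id not in seen:
            _append_line(log_path, json.dumps(message))
            seen.add(msg_id)
            recovered += 1
        os.remove(path)
    return recovered
```

This sketch does not repair a torn final line in the log itself; it only skips such lines when collecting ids.
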
**3. Heartbeats / Liveness**
| System | Approach |
|--------|----------|
| **Ours** | Mandatory heartbeats every 10s, timeout detection |
| **Beads** | Background daemon (no explicit heartbeats) |
| **Tissue** | None |
**Our rationale**: LLM API calls can hang indefinitely. Without heartbeats, a stuck agent blocks tasks forever. Beads/Tissue are issue trackers, not real-time coordination systems.
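
Heartbeat recording and timeout detection can be sketched as below, using the SQL-query liveness check listed in the summary table. The 10 s interval is from this document; the 30 s timeout (three missed beats), the `heartbeats` table, and the function names are assumptions.

```python
import sqlite3
import time

HEARTBEAT_INTERVAL_S = 10             # interval stated in this document
TIMEOUT_S = 3 * HEARTBEAT_INTERVAL_S  # assumption: three missed beats means stuck


def init_heartbeats(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS heartbeats ("
        " agent TEXT PRIMARY KEY,"
        " ts REAL NOT NULL)"
    )


def beat(conn: sqlite3.Connection, agent: str, now=None) -> None:
    ts = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO heartbeats (agent, ts) VALUES (?, ?)",
            (agent, ts),
        )


def stale_agents(conn: sqlite3.Connection, now=None, timeout_s: float = TIMEOUT_S) -> list:
    # Any agent whose last beat is older than the cutoff is considered stuck.
    cutoff = (time.time() if now is None else now) - timeout_s
    rows = conn.execute(
        "SELECT agent FROM heartbeats WHERE ts < ? ORDER BY agent", (cutoff,)
    ).fetchall()
    return [row[0] for row in rows]
```

A supervisor would call `stale_agents()` periodically and release or reassign the tasks held by any agent it returns. The agent should call `beat()` from a thread that is independent of the blocking LLM call, otherwise a hung call also stops the heartbeat for the wrong reason.
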
**4. Large Payload Handling**
| System | Approach |
|--------|----------|
| **Ours** | Blob storage with content-addressable hashing |
| **Beads** | Compaction (summarize old tasks) |
| **Tissue** | Not addressed |
**Our rationale**: Code diffs and agent outputs can be large. Blob storage keeps the log scannable. Beads' compaction is for context windows, not payload size.
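
A content-addressable blob store along these lines might look like the sketch below. The 4 KB threshold comes from the summary table; the choice of SHA-256, the `{"inline": ...}` / `{"blob": ...}` reference shapes, and the function names are assumptions.

```python
import hashlib
import os
import tempfile

BLOB_THRESHOLD_BYTES = 4096  # ">4KB" threshold from the summary table


def store_payload(payload: str, blob_dir: str) -> dict:
    """Return an inline payload or a reference to a blob named by its hash."""
    data = payload.encode("utf-8")
    if len(data) <= BLOB_THRESHOLD_BYTES:
        return {"inline": payload}
    digest = hashlib.sha256(data).hexdigest()
    os.makedirs(blob_dir, exist_ok=True)
    final_path = os.path.join(blob_dir, digest)
    if not os.path.exists(final_path):
        # Write to a temp file, then rename: readers never see a partial blob.
        fd, tmp_path = tempfile.mkstemp(dir=blob_dir)
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, final_path)
    return {"blob": digest, "size": len(data)}


def load_payload(ref: dict, blob_dir: str) -> str:
    if "inline" in ref:
        return ref["inline"]
    with open(os.path.join(blob_dir, ref["blob"]), "rb") as f:
        data = f.read()
    # The filename is the hash, so corruption is detectable on read.
    if hashlib.sha256(data).hexdigest() != ref["blob"]:
        raise ValueError("blob content does not match its hash")
    return data.decode("utf-8")
```

Because the name is derived from the content, storing the same diff twice costs one file, and the log line only carries a 64-character reference.
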
**5. Message Schema**
| System | Schema Type |
|--------|-------------|
| **Ours** | Explicit message schema (id, ts, from, to, type, payload) |
| **Beads** | Issue-centric (tasks with dependencies, audit trail) |
| **Tissue** | Issue-centric (similar to Beads) |
**Our rationale**: We need general message passing (state changes, heartbeats, claims), not just issue tracking. Beads/Tissue are issue trackers first; we're building coordination primitives.
### Gaps in Our Design (Learned from Beads)
**1. Hash-Based IDs for Merge Safety**
Beads uses hash-based IDs (e.g., `bd-a1b2`) to prevent merge collisions. We should consider this for message IDs if multiple agents might create messages offline and merge later.
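
One way to derive such IDs is to hash a canonical form of the message, as in this sketch. It is not Beads' actual algorithm; the `msg` prefix, the 8-character length, and the hashed fields are assumptions.

```python
import hashlib
import json


def message_id(sender: str, ts: float, payload: dict,
               prefix: str = "msg", length: int = 8) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same content
    # always hashes to the same ID regardless of dict ordering.
    canonical = json.dumps(
        {"from": sender, "ts": ts, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:length]}"
```

Two agents working offline cannot hand out the same ID unless they produce the same content at the same timestamp. Short prefixes trade readability against collision risk, so the length should grow with the expected number of messages.
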
**2. Dirty Tracking for Incremental Export**
Beads tracks "dirty" issues for efficient JSONL export. When we add the SQLite cache, we should track which messages need re-export rather than doing full rescans.
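
Dirty tracking can be as small as one flag column, as in this sketch. It is an illustration of the idea, not Beads' implementation; the table layout and function names are hypothetical.

```python
import json
import sqlite3


def init_cache(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " dirty INTEGER NOT NULL DEFAULT 1)"
    )


def upsert(conn: sqlite3.Connection, msg_id: str, payload: dict) -> None:
    # Every insert or change marks the row as needing export.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO messages (id, payload, dirty) VALUES (?, ?, 1)",
            (msg_id, json.dumps(payload)),
        )


def export_dirty(conn: sqlite3.Connection, out_path: str) -> int:
    """Append only the rows changed since the last export; return the count."""
    rows = conn.execute(
        "SELECT id, payload FROM messages WHERE dirty = 1 ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    with open(out_path, "a", encoding="utf-8") as out:
        for msg_id, payload in rows:
            out.write(json.dumps({"id": msg_id, "payload": json.loads(payload)}) + "\n")
    # Clear the flag only after the lines are written.
    with conn:
        conn.executemany(
            "UPDATE messages SET dirty = 0 WHERE id = ?",
            [(msg_id,) for msg_id, _ in rows],
        )
    return len(rows)
```

Since the export appends, a message that changes twice appears twice in the file, and readers would treat the last line per `id` as current.
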
**3. File Hash Validation**
Beads stores a hash of the JSONL file to detect external modifications. We could add the same check to detect corruption or manual edits.
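
A file hash check could be sketched as follows, assuming SHA-256 and a sidecar file holding the last known hash. Both choices and all function names are assumptions, not a description of how Beads stores its hash.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large logs do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_hash(log_path: str, hash_path: str) -> None:
    with open(hash_path, "w", encoding="utf-8") as f:
        f.write(file_sha256(log_path))


def is_unmodified(log_path: str, hash_path: str) -> bool:
    with open(hash_path, encoding="utf-8") as f:
        expected = f.read().strip()
    return file_sha256(log_path) == expected
```

The writer would call `record_hash()` after each of its own appends and `is_unmodified()` on startup; a mismatch means something other than the writer touched the log.
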
### Gaps in Our Design (Learned from Tissue)
**1. FTS5 Full-Text Search**
Tissue's SQLite cache includes FTS5 for searching issue content. Useful for "find messages mentioning X" queries in Phase 2.
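
A "find messages mentioning X" query with FTS5 might look like this sketch. It assumes the SQLite library linked into Python was built with FTS5, which is common but not guaranteed; the `messages_fts` table and function names are hypothetical and do not reflect Tissue's schema.

```python
import sqlite3


def build_index(conn: sqlite3.Connection) -> None:
    # id is stored but not tokenised; body is the searchable text.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS messages_fts"
        " USING fts5(id UNINDEXED, body)"
    )


def index_message(conn: sqlite3.Connection, msg_id: str, body: str) -> None:
    with conn:
        conn.execute(
            "INSERT INTO messages_fts (id, body) VALUES (?, ?)", (msg_id, body)
        )


def search(conn: sqlite3.Connection, query: str) -> list:
    # MATCH runs the full-text query; rank orders best matches first.
    rows = conn.execute(
        "SELECT id FROM messages_fts WHERE messages_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the index is derived, it can be dropped and rebuilt from the primary store at any time.
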
**2. Simpler Concurrency (Maybe)**
Tissue trusts git merge without explicit locking. For single-machine scenarios with small writes, this might be sufficient. We could offer a "simple mode" without flock for low-contention cases.
## Validation Verdict
Our design is **more complex than Tissue but simpler than Beads**, which matches our use case:
- **Tissue**: Issue tracker, optimizes for git collaboration
- **Beads**: Full workflow engine with daemon, RPC, recipes
- **Ours**: Coordination primitives for multi-agent coding
The key additions we make (heartbeats, blob storage, staging directory) are justified by our real-time coordination requirements that issue trackers don't have.
## Recommended Updates to Design
1. **Add hash-based message IDs** - Prevent merge collisions if agents work offline
2. **Add file hash validation** - Detect log corruption on startup
3. **Document "simple mode"** - No flock for single-agent or low-contention scenarios
4. **Plan for FTS5** - Add to Phase 2 SQLite cache design
## References
- Beads source: https://github.com/steveyegge/beads
- Tissue source: https://github.com/evil-mind-evil-sword/tissue
- Our design: docs/design/message-passing-layer.md