# Message Passing: Design Comparison

**Purpose**: Compare our design decisions against Beads and Tissue to validate approach and identify gaps.

## Summary Comparison

| Decision | Our Design (v2) | Beads | Tissue |
|----------|-----------------|-------|--------|
| **Primary storage** | SQLite (WAL mode) | SQLite + JSONL export | JSONL (append-only) |
| **Cache/index** | N/A (SQLite is primary) | SQLite is primary | SQLite (derived) |
| **Write locking** | SQLite BEGIN IMMEDIATE | SQLite BEGIN IMMEDIATE | None (git merge) |
| **Concurrency model** | SQLite transactions | Optimistic (hash IDs) + SQLite txn | Optimistic (git merge) |
| **Crash safety** | SQLite atomic commit | SQLite transactions | Git (implicit) |
| **Heartbeats** | Yes (10s interval) | No (daemon only) | No |
| **Liveness detection** | SQL query on heartbeat timestamps | Not documented | Not documented |
| **Large payloads** | Blob storage (>4KB) | Compaction/summarization | Not addressed |
| **Coordination** | Polling + claim-check | `bd ready` queries | `tissue ready` queries |
| **Message schema** | Explicit (id, ts, from, type, payload) | Implicit (issue events) | Implicit (issue events) |
| **Human debugging** | JSONL export (read-only) | JSONL in git | JSONL primary |

**Decision (2026-01-10)**: After orch consensus with 3 models, we aligned with Beads' approach (SQLite primary) over Tissue's (JSONL primary). Key factors:
- Payloads of 1-50 KB exceed POSIX atomic write guarantees (~4 KB)
- A crash mid-write still corrupts the log, even with flock
- SQLite transactions provide true atomicity
- JSONL export preserves human debugging (`tail -f`)

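
To make the v2 write path concrete, here is a minimal sketch of appending one message inside a `BEGIN IMMEDIATE` transaction on a WAL-mode database. The table layout and the function names `open_bus` and `send` are assumptions made for illustration; they are not taken from the actual implementation.

```python
import json
import sqlite3
import time
import uuid


def open_bus(path):
    """Open (or create) a message bus database in WAL mode."""
    # isolation_level=None disables implicit transactions so that
    # BEGIN IMMEDIATE / COMMIT can be issued explicitly.
    conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    # Hypothetical schema; "sender" stands in for the "from" field.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id TEXT PRIMARY KEY,"
        " ts REAL NOT NULL,"
        " sender TEXT NOT NULL,"
        " type TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return conn


def send(conn, sender, msg_type, payload):
    """Insert one message atomically and return its id."""
    msg_id = uuid.uuid4().hex
    # BEGIN IMMEDIATE takes the write lock up front, so a concurrent
    # writer waits (up to the connection timeout) instead of failing
    # halfway through the transaction.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "INSERT INTO messages (id, ts, sender, type, payload)"
            " VALUES (?, ?, ?, ?, ?)",
            (msg_id, time.time(), sender, msg_type, json.dumps(payload)),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return msg_id
```

Either the whole row is committed or nothing is, regardless of payload size, which is the property the flock-on-JSONL approach could not provide.
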
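## Detailed Analysis

The analysis below describes the JSONL-primary variant of our design (flock, staging directory, SQLite as a Phase 2 cache), not the SQLite-primary v2 shown in the summary table and decision above.
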
### Where We Align

**1. JSONL as Source of Truth**
All three systems use append-only JSONL as the authoritative store. This is the right call:
- Git-friendly (merges cleanly)
- Human-readable (debuggable with `cat | jq`)
- Simple to implement

**2. SQLite as Derived Cache**
All three use SQLite for queries, not as primary storage:
- Beads: Always-on cache with dirty tracking
- Tissue: Derived index, gitignored
- Ours: Phase 2 optimization

**3. Pull-Based Coordination**
All use polling/queries rather than push events:
- `bd ready` / `tissue ready` / our `poll()` function
- Simpler than event-driven, works across process boundaries

### Where We Diverge

**1. Write Locking Strategy**

| System | Approach | Trade-off |
|--------|----------|-----------|
| **Ours** | flock on JSONL file | Simple, prevents interleaving, works locally |
| **Beads** | SQLite BEGIN IMMEDIATE | Stronger guarantees, more complex |
| **Tissue** | None (trust git merge) | Simplest, but can corrupt JSONL mid-write |

**Our rationale**: flock is simpler than SQLite transactions and safer than trusting git merge for mid-write crashes. Tissue's approach assumes writes complete atomically, which isn't guaranteed for large JSON lines.

**2. Crash Safety**

| System | Approach |
|--------|----------|
| **Ours** | Write to staging → validate → append under lock → delete staging |
| **Beads** | SQLite transactions (rollback on failure) |
| **Tissue** | Git recovery (implicit) |

**Our rationale**: A staging directory adds explicit crash recovery without SQLite complexity. If an agent dies mid-write, the staged file is recovered on restart.

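
The staging sequence (write to staging, validate, append under lock, delete staging) could look roughly like the following. This is a sketch under stated assumptions: it is Unix-only because it uses `fcntl.flock`, and the names `append_message`, `commit_staged`, and `recover` are invented for illustration.

```python
import fcntl
import json
import os
import uuid


def commit_staged(log_path, staged_path):
    """Validate a staged file, append it to the log under flock, delete it."""
    # Validate: a partially written staged file fails to parse and
    # raises ValueError (json.JSONDecodeError is a subclass).
    with open(staged_path, encoding="utf-8") as f:
        line = json.dumps(json.loads(f.read()))
    # Append under an exclusive lock so concurrent writers cannot
    # interleave their lines.
    with open(log_path, "a", encoding="utf-8") as log:
        fcntl.flock(log, fcntl.LOCK_EX)
        try:
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())
        finally:
            fcntl.flock(log, fcntl.LOCK_UN)
    os.remove(staged_path)


def append_message(log_path, staging_dir, message):
    """Stage a message on disk, then commit it to the log."""
    os.makedirs(staging_dir, exist_ok=True)
    staged_path = os.path.join(staging_dir, uuid.uuid4().hex + ".json")
    with open(staged_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(message))
        f.flush()
        os.fsync(f.fileno())
    commit_staged(log_path, staged_path)


def recover(log_path, staging_dir):
    """On restart, commit complete staged files and drop corrupt ones."""
    if not os.path.isdir(staging_dir):
        return 0
    recovered = 0
    for name in sorted(os.listdir(staging_dir)):
        staged_path = os.path.join(staging_dir, name)
        try:
            commit_staged(log_path, staged_path)
            recovered += 1
        except ValueError:
            os.remove(staged_path)
    return recovered
```

One limitation of this sketch: a crash after the append but before the staged file is deleted makes `recover` append the same message twice, so readers would need to deduplicate by message id.
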
**3. Heartbeats / Liveness**

| System | Approach |
|--------|----------|
| **Ours** | Mandatory heartbeats every 10s, timeout detection |
| **Beads** | Background daemon (no explicit heartbeats) |
| **Tissue** | None |

**Our rationale**: LLM API calls can hang indefinitely. Without heartbeats, a stuck agent blocks tasks forever. Beads/Tissue are issue trackers, not real-time coordination systems.

**4. Large Payload Handling**

| System | Approach |
|--------|----------|
| **Ours** | Blob storage with content-addressable hashing |
| **Beads** | Compaction (summarize old tasks) |
| **Tissue** | Not addressed |

**Our rationale**: Code diffs and agent outputs can be large. Blob storage keeps the log scannable. Beads' compaction is for context windows, not payload size.

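
One way content-addressable blob storage could work is sketched below: payloads above the 4 KB threshold are written to a file named after their SHA-256 digest, and the message carries only the reference. The reference format (`inline` / `blob` keys, `sha256:` prefix) and the function names are assumptions for illustration.

```python
import hashlib
import os

BLOB_THRESHOLD = 4096  # bytes; matches the ">4KB" rule in the summary table


def store_payload(blob_dir, text):
    """Return an inline reference for small payloads, a blob reference otherwise."""
    data = text.encode("utf-8")
    if len(data) <= BLOB_THRESHOLD:
        return {"inline": text}
    digest = hashlib.sha256(data).hexdigest()
    os.makedirs(blob_dir, exist_ok=True)
    path = os.path.join(blob_dir, digest)
    if not os.path.exists(path):
        # Write to a temp name and rename, so a crash never leaves a
        # truncated file under the final content-addressed name.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)
    return {"blob": "sha256:" + digest}


def load_payload(blob_dir, ref):
    """Resolve a reference produced by store_payload back to text."""
    if "inline" in ref:
        return ref["inline"]
    digest = ref["blob"].split(":", 1)[1]
    with open(os.path.join(blob_dir, digest), "rb") as f:
        data = f.read()
    # The file name is the hash of its content, so corruption is detectable.
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("blob content does not match its digest")
    return data.decode("utf-8")
```

Identical payloads map to the same file, so repeated diffs or outputs are stored once, and each log entry stays small enough to scan.
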
**5. Message Schema**

| System | Schema Type |
|--------|-------------|
| **Ours** | Explicit message schema (id, ts, from, to, type, payload) |
| **Beads** | Issue-centric (tasks with dependencies, audit trail) |
| **Tissue** | Issue-centric (similar to Beads) |

**Our rationale**: We need general message passing (state changes, heartbeats, claims), not just issue tracking. Beads/Tissue are issue trackers first; we're building coordination primitives.

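
For illustration, the explicit schema (id, ts, from, to, type, payload) might map onto a small record type like the one below. The field types, the defaults, and the rename of `from` to `sender` (because `from` is a Python keyword) are assumptions, not the project's actual definition.

```python
import json
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Message:
    sender: str                  # serialized as "from"
    type: str                    # e.g. "state_change", "heartbeat", "claim"
    payload: Any = None
    to: Optional[str] = None     # None is treated here as a broadcast
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    ts: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize to one JSON line using the schema's field names."""
        return json.dumps(
            {
                "id": self.id,
                "ts": self.ts,
                "from": self.sender,
                "to": self.to,
                "type": self.type,
                "payload": self.payload,
            },
            sort_keys=True,
        )

    @classmethod
    def from_json(cls, line: str) -> "Message":
        """Parse a line produced by to_json."""
        d = json.loads(line)
        return cls(
            sender=d["from"],
            type=d["type"],
            payload=d.get("payload"),
            to=d.get("to"),
            id=d["id"],
            ts=d["ts"],
        )
```

The same envelope carries state changes, heartbeats, and claims; only `type` and `payload` differ, which is the generality an issue-centric schema does not offer.
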
### Gaps in Our Design (Learned from Beads)

**1. Hash-Based IDs for Merge Safety**
Beads uses hash-based IDs (e.g., `bd-a1b2`) to prevent merge collisions. We should consider this for message IDs if multiple agents might create messages offline and merge later.

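
A hash-based message ID could be derived from the message content, so that agents working offline need no shared counter. The sketch below is one possible scheme and is not how Beads computes its IDs; the `msg` prefix, the hashed fields, and the 8-character length are all assumptions.

```python
import hashlib
import json


def message_id(sender, ts, msg_type, payload, prefix="msg", length=8):
    """Derive a short, deterministic id from the message content."""
    # Canonical JSON (sorted keys, fixed separators) so the same
    # content always hashes to the same id.
    canonical = json.dumps(
        [sender, ts, msg_type, payload],
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:length]}"
```

Two agents creating different messages get different IDs without coordinating. Shorter prefixes read better but raise the collision probability, so the length is a trade-off to pick based on expected message volume.
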
**2. Dirty Tracking for Incremental Export**
Beads tracks "dirty" issues for efficient JSONL export. When we add the SQLite cache, we should track which messages need re-export rather than doing full rescans.

**3. File Hash Validation**
Beads stores the JSONL file's hash to detect external modifications. We could add this to detect corruption or manual edits.

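
File hash validation can be sketched as follows: record the log's SHA-256 after each trusted write, then compare on startup. Storing the hash in a sidecar file and the names `record_hash` and `log_is_unmodified` are assumptions for illustration; Beads may store its hash elsewhere.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Hash a file in chunks so large logs are not loaded into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def record_hash(log_path, hash_path):
    """Store the current hash of the log in a sidecar file."""
    with open(hash_path, "w", encoding="ascii") as f:
        f.write(file_sha256(log_path))


def log_is_unmodified(log_path, hash_path):
    """Return True if the log still matches the recorded hash."""
    try:
        with open(hash_path, encoding="ascii") as f:
            expected = f.read().strip()
    except FileNotFoundError:
        # No recorded hash means the log cannot be vouched for.
        return False
    return file_sha256(log_path) == expected
```

A mismatch only signals that something changed outside the normal write path; deciding whether that was corruption or a deliberate manual edit is left to the caller.
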
### Gaps in Our Design (Learned from Tissue)

**1. FTS5 Full-Text Search**
Tissue's SQLite cache includes FTS5 for searching issue content. Useful for "find messages mentioning X" queries in Phase 2.

**2. Simpler Concurrency (Maybe)**
Tissue trusts git merge without explicit locking. For single-machine scenarios with small writes, this might be sufficient. We could offer a "simple mode" without flock for low-contention cases.

## Validation Verdict

Our design is **more complex than Tissue but simpler than Beads**, which matches our use case:

- **Tissue**: Issue tracker, optimizes for git collaboration
- **Beads**: Full workflow engine with daemon, RPC, recipes
- **Ours**: Coordination primitives for multi-agent coding

The key additions we make (heartbeats, blob storage, staging directory) are justified by our real-time coordination requirements that issue trackers don't have.

## Recommended Updates to Design

1. **Add hash-based message IDs** - Prevent merge collisions if agents work offline
2. **Add file hash validation** - Detect log corruption on startup
3. **Document "simple mode"** - No flock for single-agent or low-contention scenarios
4. **Plan for FTS5** - Add to Phase 2 SQLite cache design

## References

- Beads source: https://github.com/steveyegge/beads
- Tissue source: https://github.com/evil-mind-evil-sword/tissue
- Our design: docs/design/message-passing-layer.md