huskies: create 477_spike_distributed_build_agents_via_bft_crdts_over_websocket

dave
2026-04-04 20:13:31 +00:00
parent 96fec31bb7
commit 78b3f4c165
@@ -0,0 +1,54 @@
---
name: "Distributed build agents via BFT CRDTs over WebSocket"
---
# Spike 477: Distributed build agents via BFT CRDTs over WebSocket
## Question
Can the existing BFT JSON CRDT Rust crate (to be placed in `crates/`) serve as the state backend for distributing pipeline work across multiple machines?
## Goal
Replace or augment the filesystem-based pipeline state with a CRDT document synced over WebSocket between nodes. Each node (a Docker container on a different laptop) sees the full pipeline state and self-assigns work autonomously. There is no central scheduler.
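One way nodes could self-assign work without a scheduler is for each node to write a claim into the shared state and for every node to resolve competing claims with the same deterministic rule. The following is a minimal sketch in plain Rust, not the BFT CRDT crate's API (which is not shown here). `Claim`, `Claims`, `winner`, `merge_claims`, `should_run`, and the tie-break order (earliest timestamp, then smallest node ID) are all hypothetical names and choices made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical claim record a node writes into the shared document.
/// Field order matters: the derived `Ord` compares `timestamp_ms` first,
/// then `node_id`, which gives the tie-break used below.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Claim {
    /// Claim time in milliseconds (wall clock or logical clock; an assumption).
    pub timestamp_ms: u64,
    /// Unique node identifier, used to break timestamp ties.
    pub node_id: String,
}

/// Claims keyed by story ID (hypothetical layout).
pub type Claims = HashMap<String, Claim>;

/// Pick the winner of two competing claims: the smaller
/// (timestamp, node ID) pair wins on every node, regardless of arrival order.
pub fn winner(a: &Claim, b: &Claim) -> Claim {
    if a <= b { a.clone() } else { b.clone() }
}

/// Merge a remote claim map into the local one. Because `winner` is a
/// minimum, the merge is commutative, associative, and idempotent.
pub fn merge_claims(local: &mut Claims, remote: &Claims) {
    for (story, theirs) in remote {
        let resolved = match local.get(story) {
            Some(ours) => winner(ours, theirs),
            None => theirs.clone(),
        };
        local.insert(story.clone(), resolved);
    }
}

/// A node should run the coder only if its own claim survived the merge.
pub fn should_run(claims: &Claims, story: &str, node_id: &str) -> bool {
    claims.get(story).map_or(false, |c| c.node_id == node_id)
}
```

Under this rule all nodes converge on the same owner once they have exchanged claims, but a node may still start work before it sees a competing claim, so the losing node would need to abort its coder after the merge. Whether timestamps from other nodes can be trusted in a Byzantine setting is left open here.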
## Key Questions
1. **CRDT integration**: The BFT CRDT crate goes in `crates/`. How does it map to the current pipeline state model (stories in stage directories, agent assignments, retry counts)? Does it replace `.huskies/work/` or layer on top?
2. **Work claiming**: Two nodes see a story enter the `current` stage simultaneously. Design a CRDT-native claim mechanism (e.g. node ID + timestamp in the CRDT doc) so exactly one node runs the coder. What happens on conflict?
3. **WebSocket transport**: Each node runs `huskies` and connects to peers via WebSocket. Node discovery: static config (`peers = ["ws://laptop-2:3001"]`), mDNS, or rendezvous? What's simplest for a home LAN setup?
4. **Node modes**: Single binary with a flag — `huskies /workspace` (current full mode with chat/web UI) vs `huskies agent --peers ws://host:3001` (build agent mode: syncs state, runs coders, no chat UI). What's the minimum viable agent mode?
5. **Git coordination**: Each node clones/fetches from Gitea independently. Worktrees are local to each machine. The agent pushes its feature branch when done, and the master node handles the merge. Are there any issues with concurrent pushes to the same branch?
6. **Offline/reconnect**: Laptop closes lid mid-work. CRDT merges state on reconnect, but what about the interrupted Claude Code process? Timeout + reclaim by another node?
7. **Security**: WebSocket auth between nodes (shared secret, mTLS, or token). Prevent unauthorised nodes from joining the mesh.
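For question 6, one possible shape of "timeout + reclaim" is a lease: the node doing the work periodically refreshes a heartbeat in the shared state, and another node may take over once the heartbeat is older than a timeout. The sketch below is plain Rust under that assumption. `Lease`, `is_expired`, `try_reclaim`, and the millisecond fields are hypothetical and not part of any existing crate.

```rust
/// Hypothetical lease stored next to a story's claim. The holder
/// refreshes `heartbeat_ms` while its coder process is running.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Lease {
    /// Node currently holding the story.
    pub node_id: String,
    /// Time of the holder's last heartbeat, in milliseconds.
    pub heartbeat_ms: u64,
}

/// True when the holder has not refreshed the lease within `timeout_ms`.
/// `saturating_sub` treats a heartbeat that appears to be in the future
/// (clock skew) as fresh instead of underflowing.
pub fn is_expired(lease: &Lease, now_ms: u64, timeout_ms: u64) -> bool {
    now_ms.saturating_sub(lease.heartbeat_ms) > timeout_ms
}

/// Return a new lease owned by `me` if the current one belongs to
/// another node and has expired; otherwise return `None`.
pub fn try_reclaim(lease: &Lease, me: &str, now_ms: u64, timeout_ms: u64) -> Option<Lease> {
    if lease.node_id != me && is_expired(lease, now_ms, timeout_ms) {
        Some(Lease { node_id: me.to_string(), heartbeat_ms: now_ms })
    } else {
        None
    }
}
```

A closed laptop stops refreshing its heartbeat, so a peer calling `try_reclaim` after the timeout would take over the story. Two peers could reclaim at the same moment, so competing reclaims still need a deterministic tie-break, and the original node must check on reconnect whether it still holds the lease before pushing any work.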
## Reference
- BFT JSON CRDT paper: https://jzhao.xyz/posts/bft-json-crdt
- User has a working Rust implementation ready to integrate
## Hypothesis
- TBD
## Timebox
- TBD
## Investigation Plan
- TBD
## Findings
- TBD
## Recommendation
- TBD