From 3ab0410a821441c0e483036d8453d8093a3d225f Mon Sep 17 00:00:00 2001
From: dave
Date: Sat, 4 Apr 2026 20:44:25 +0000
Subject: [PATCH] huskies: create 477_story_crdt_state_backend_replacing_filesystem_pipeline_state

---
 ...ild_agents_via_bft_crdts_over_websocket.md | 61 -------------------
 ...end_replacing_filesystem_pipeline_state.md | 23 +++++++
 2 files changed, 23 insertions(+), 61 deletions(-)
 delete mode 100644 .huskies/work/1_backlog/477_spike_distributed_build_agents_via_bft_crdts_over_websocket.md
 create mode 100644 .huskies/work/1_backlog/477_story_crdt_state_backend_replacing_filesystem_pipeline_state.md

diff --git a/.huskies/work/1_backlog/477_spike_distributed_build_agents_via_bft_crdts_over_websocket.md b/.huskies/work/1_backlog/477_spike_distributed_build_agents_via_bft_crdts_over_websocket.md
deleted file mode 100644
index a31a782d..00000000
--- a/.huskies/work/1_backlog/477_spike_distributed_build_agents_via_bft_crdts_over_websocket.md
+++ /dev/null
@@ -1,61 +0,0 @@
----
-name: "Distributed build agents via BFT CRDTs over WebSocket"
-agent: "coder-opus"
----
-
-# Spike 477: Distributed build agents via BFT CRDTs over WebSocket
-
-## Question
-
-Investigate integrating the existing BFT JSON CRDT Rust crate (in crates/) as the state backend for distributing pipeline work across multiple machines.
-
-## Goal
-
-Replace the filesystem-based pipeline state (`.huskies/work/` stage directories) with a CRDT document synced over WebSocket between nodes. The CRDT becomes the single source of truth for pipeline metadata (stage, agent assignments, retry counts, blocked flags). Each node (Docker container on a different laptop) sees the full pipeline state and self-assigns work autonomously. No central scheduler.
-
-Story markdown files (content/AC), worktrees, and config files remain on the filesystem. Only the pipeline orchestration state moves into the CRDT.
-
-## Key Questions
-
-1. **CRDT integration**: The BFT CRDT crate goes in `crates/`. How does the CRDT document schema map to the current pipeline state model (stages, agent assignments, retry counts, blocked flags)? The CRDT replaces `.huskies/work/` stage directories as the source of truth for pipeline state.
-
-2. **Work claiming**: Two nodes see a story enter current simultaneously. Each writes a claim (node ID) into the CRDT doc. The CRDT merges deterministically — one node wins, losing nodes see the merged state and kill their Claude Code process. Design the merge rule (lowest node ID? earliest timestamp?) and the "loser stops work" mechanism. Worst case: a few seconds of wasted API time on the losing node before merge propagates.
-
-3. **WebSocket transport**: Each node runs `huskies` and connects to a known rendezvous point via WebSocket. Static config in `project.toml` (e.g. `rendezvous = "ws://server:3001"`).
-
-4. **Node modes**: Single binary with a flag — `huskies /workspace` (current full mode with chat/web UI) vs `huskies agent --peers ws://host:3001` (build agent mode: syncs state, runs coders, no chat UI). What's the minimum viable agent mode?
-
-5. **Git coordination**: Each node clones/fetches from Gitea independently. Worktrees are local per-machine. Agent pushes feature branch when done, master node handles merge. Any issues with concurrent pushes to same branch?
-
-6. **Offline/reconnect**: Laptop closes lid mid-work. CRDT merges state on reconnect, but what about the interrupted Claude Code process? Timeout + reclaim by another node?
-
-7. **Security**: Each node has a keypair. Trusted nodes are defined by a list of known public keys. Nodes authenticate on WebSocket connect by signing a challenge with their private key. The CRDT node ID is derived from the public key, giving cryptographic identity for both auth and claim resolution.
-
-## Reference
-
-- BFT JSON CRDT paper: https://jzhao.xyz/posts/bft-json-crdt
-- Rust crate at `crates/bft-json-crdt/` — Ed25519 keypairs, causal dependencies, JSON-native values
-- Auth comes free: every CRDT op is signed by the author's Ed25519 key. `AuthorId` = public key.
-- Causal dependency tracking built in: ops with unmet deps are queued until deps arrive (handles network partitions)
-- Performance is fine for pipeline state (low op volume). BFT mode ~20x slower than basic but pipeline does maybe a few ops/minute.
-- Needs: persistence layer (state survives restarts), WebSocket transport (serialize SignedOps over WS)
-
-## Hypothesis
-
-- TBD
-
-## Timebox
-
-- TBD
-
-## Investigation Plan
-
-- TBD
-
-## Findings
-
-- TBD
-
-## Recommendation
-
-- TBD
diff --git a/.huskies/work/1_backlog/477_story_crdt_state_backend_replacing_filesystem_pipeline_state.md b/.huskies/work/1_backlog/477_story_crdt_state_backend_replacing_filesystem_pipeline_state.md
new file mode 100644
index 00000000..68d3f332
--- /dev/null
+++ b/.huskies/work/1_backlog/477_story_crdt_state_backend_replacing_filesystem_pipeline_state.md
@@ -0,0 +1,23 @@
+---
+name: "CRDT state backend replacing filesystem pipeline state"
+---
+
+# Story 477: CRDT state backend replacing filesystem pipeline state
+
+## User Story
+
+As a developer, I want the pipeline state (stages, agent assignments, retry counts, blocked flags) stored in a BFT JSON CRDT document backed by SQLite instead of filesystem directories, so the state model is ready for multi-node distribution.
+
+## Acceptance Criteria
+
+- [ ] Pipeline state (which stage a story is in, agent assignments, retry counts, blocked flags) stored in CRDT doc, not .huskies/work/ directories
+- [ ] CRDT state persisted to SQLite so it survives container restarts
+- [ ] BFT CRDT crate integrated from crates/bft-json-crdt/
+- [ ] Story markdown files (content/AC), worktrees, and config remain on filesystem
+- [ ] All existing pipeline operations (move story, start agent, block, unblock) work against the CRDT backend
+- [ ] Single-node behaviour unchanged from the user's perspective
+- [ ] Rollback commit: 5561b9c6
+
+## Out of Scope
+
+- TBD
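Note on the work-claiming question from the deleted spike. It still applies to the multi-node goal of story 477. The spike leaves the merge rule open: lowest node ID or earliest timestamp. The Rust sketch below illustrates only the lowest-node-ID option.

It assumes node IDs are 32-byte Ed25519 public keys, as the spike's reference notes suggest (`AuthorId` = public key). `NodeId`, `Claim`, `claim`, `merge` and `lost` are hypothetical names, and the sketch does not use the `bft-json-crdt` crate's API. Because the merge keeps the minimum ID, it is commutative, associative and idempotent. Every node therefore converges on the same holder whatever order the claims arrive in, and each node can decide locally whether it lost.

```rust
/// Hypothetical node identifier: a node's 32-byte Ed25519 public key.
/// Assumption: follows the spike's note that the CRDT `AuthorId` is the public key.
type NodeId = [u8; 32];

/// Hypothetical per-story claim register. It holds the lowest node ID that
/// has claimed the story so far; `None` means the story is unclaimed.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct Claim {
    holder: Option<NodeId>,
}

impl Claim {
    /// Record a local claim by merging a single-node claim into this replica.
    fn claim(&mut self, node: NodeId) {
        self.merge(&Claim { holder: Some(node) });
    }

    /// Merge a remote replica's claim. Keeping the minimum ID is commutative,
    /// associative and idempotent, so all replicas converge on one holder.
    fn merge(&mut self, other: &Claim) {
        self.holder = match (self.holder, other.holder) {
            (Some(a), Some(b)) => Some(a.min(b)),
            // At most one side is Some here; keep whichever claim exists.
            (a, b) => a.or(b),
        };
    }

    /// "Loser stops work": after a merge, a node that is not the holder
    /// should stop its agent process.
    fn lost(&self, me: &NodeId) -> bool {
        matches!(self.holder, Some(h) if h != *me)
    }
}

fn main() {
    let node_a: NodeId = [1; 32];
    let node_b: NodeId = [2; 32];

    // Both nodes claim the same story before seeing each other's claim.
    let mut on_a = Claim::default();
    on_a.claim(node_a);
    let mut on_b = Claim::default();
    on_b.claim(node_b);

    // Each node receives the other's claim; the merge order differs per node.
    let (snap_a, snap_b) = (on_a, on_b);
    on_a.merge(&snap_b);
    on_b.merge(&snap_a);

    // Both replicas agree on the holder, and only node B stops work.
    assert_eq!(on_a, on_b);
    // prints "A lost: false, B lost: true"
    println!("A lost: {}, B lost: {}", on_a.lost(&node_a), on_b.lost(&node_b));
}
```

With this rule, contention always resolves in favour of the same node (the one with the smallest key). An earliest-timestamp rule would instead depend on how closely node clocks agree. Choosing between these trade-offs is the design decision the spike asks for.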