Adds the foundational capability to clear a story from the running
server's in-memory CRDT state without restarting the process. This is
story 521, motivated by the 2026-04-09 incident where stories 478 and
503 kept resurrecting from the in-memory CRDT after every SQLite
delete / worktree removal / timers.json clear. The only previous
remedy was a full Docker restart.
Changes:
- server/src/crdt_state.rs: new `pub fn evict_item(story_id: &str)`.
Looks up the item's CRDT OpId via the visible-index map, calls the
bft-json-crdt list `delete()` primitive to construct a tombstone op,
runs it through the existing `apply_and_persist` machinery (which
signs, applies to the in-memory CRDT, and queues for persistence to
crdt_ops), rebuilds the story_id → visible_index map, and drops the
in-memory CONTENT_STORE entry. The tombstone survives a restart
because it's persisted as a real CRDT op. A sketch of this path
follows the change list.
- server/src/http/mcp/story_tools.rs: new `tool_purge_story` MCP
handler that takes a story_id and calls evict_item. Deliberately
minimal — does NOT touch agents, worktrees, pipeline_items shadow
table, timers.json, or filesystem shadows. Compose with stop_agent,
remove_worktree, etc. for a full purge. Story 514 (delete_story
full cleanup) is the future "do it all" tool.
- server/src/http/mcp/mod.rs: registers the `purge_story` tool in the
tools list and dispatch table.
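A minimal sketch of the eviction path, assuming the module-level state
and helpers described above (VISIBLE_INDEX, CONTENT_STORE,
apply_and_persist); names and signatures are illustrative, not the
real API:

    // Sketch only: the statics, error handling, and list handle are
    // assumptions drawn from the change description above.
    pub fn evict_item(story_id: &str) -> anyhow::Result<()> {
        // Resolve the story's visible index, then its CRDT OpId.
        let idx = VISIBLE_INDEX
            .lock()
            .unwrap()
            .get(story_id)
            .copied()
            .ok_or_else(|| anyhow::anyhow!("unknown story_id"))?;
        let mut list = CRDT_LIST.lock().unwrap();
        let op_id = list.op_id_at(idx);
        // delete() constructs a tombstone op; apply_and_persist signs
        // it, applies it to the in-memory CRDT, and queues it for the
        // crdt_ops table, so the tombstone survives restarts.
        let tombstone = list.delete(op_id);
        drop(list);
        apply_and_persist(tombstone)?;
        // Rebuild story_id -> visible_index and drop cached content.
        rebuild_visible_index();
        CONTENT_STORE.lock().unwrap().remove(story_id);
        Ok(())
    }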
Usage:
mcp__huskies__purge_story story_id="<full_story_id>"
Returns a string confirming the eviction. The story will no longer
appear in get_pipeline_status, list_agents, or any other API that
reads from the in-memory CRDT view, and on the next server restart
the persisted tombstone op will keep it from being reconstructed.
This is a prerequisite for story 514 (delete_story full cleanup) and
useful for any "kill it with fire" operator need.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CRDT Lamport seq is per-author and per-field, not globally
monotonic. Replaying by `seq ASC` causes field-update ops (which
have low per-field seq counters like 1, 2, 3) to be applied
BEFORE the list-insert ops they reference (which have higher
per-list seq counters like N for the Nth item ever inserted).
The field updates fail with ErrPathMismatch because the target
item doesn't exist yet, the field counter is never advanced,
and subsequent writes silently lose state.
Concretely on 2026-04-09 we observed: post-restart writes were
being persisted at seq=1,2,3,4,5,6,7 even though pre-restart
seq had reached 492. On the next replay, those low-seq field
updates would be applied before their seq=485+ creation ops,
silently dropping the updates. This was the load-bearing
"why does state keep flapping" bug today.
Fix: replay by `rowid ASC` (SQLite insertion order) instead.
Rowid preserves the causal order ops were originally applied
in, so field updates always come after the item insert they
reference.
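As a hedged sqlx sketch of the replay loop (the crdt_ops table name
comes from this work; the op_blob column and the helper functions are
assumptions):

    // Replay in SQLite insertion order, which matches the causal
    // order the ops were originally applied in.
    let blobs: Vec<Vec<u8>> = sqlx::query_scalar(
        // was: "SELECT op_blob FROM crdt_ops ORDER BY seq ASC"
        "SELECT op_blob FROM crdt_ops ORDER BY rowid ASC",
    )
    .fetch_all(&pool)
    .await?;
    for blob in blobs {
        // decode_signed_op / apply_op are hypothetical helpers.
        apply_op(decode_signed_op(&blob)?)?;
    }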
Adds a regression test that constructs the exact scenario:
inserts a story (op gets seq=6), updates its stage (op gets
seq=1 because the field counter starts at 0), persists both ops
in causal order, then replays twice: once by seq ASC (reproduces
the bug: the stage update is lost) and once by rowid ASC (the
fix: the stage update is preserved).
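Compressed, the test looks roughly like this (fixtures, constructors,
and the replay helper are stand-ins, not the real test API):

    #[tokio::test]
    async fn field_update_survives_rowid_replay() {
        let pool = test_pool().await;             // hypothetical fixture
        let mut doc = test_doc();                 // hypothetical fixture
        let insert = doc.insert_story("s");       // per-list seq = 6
        let update = doc.set_stage("s", "done");  // per-field seq = 1
        persist(&pool, &insert).await;            // rowid 1
        persist(&pool, &update).await;            // rowid 2
        // seq ASC applies the update (seq=1) before the insert (seq=6):
        // ErrPathMismatch, and the stage update is silently lost.
        let broken = replay(&pool, "ORDER BY seq ASC").await;
        assert_eq!(broken.stage("s"), None);
        // rowid ASC replays in causal order: the stage update survives.
        let fixed = replay(&pool, "ORDER BY rowid ASC").await;
        assert_eq!(fixed.stage("s"), Some("done".to_string()));
    }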
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Manual squash-merge of feature/story-478_… into master after the in-pipeline
mergemaster runs failed silently. The 478 agent did substantial real work
across multiple respawn cycles before being interrupted; commits on the
feature branch were intact and verified high-quality but never merged via
the normal pipeline path due to compounding bugs:
- The first mergemaster attempt ran ($0.82 in tokens) and exited "Done"
cleanly but didn't push anything to master — likely the worktree was
briefly on master rather than the feature branch when the merge_agent_work
MCP tool ran, so it found nothing to merge.
- Subsequent timer fires defaulted to spawning coders instead of resuming
mergemaster, burning more tokens for no progress.
- Bug 510 (split-brain shadows yanking done stories back to current) and
bug 501 (timers don't cancel on stop/completion) compounded the cost.
What this commit lands:
- server/src/crdt_sync.rs (new, ~518 lines): GET /crdt-sync WebSocket
handler that subscribes to locally-applied SignedOps and streams them as
binary frames. Per-peer bounded queue (256 ops) drops slow peers;
the fan-out shape is sketched after this list.
- server/src/crdt_state.rs: new public functions subscribe_ops(),
all_ops_json(), apply_remote_op() backing the sync handler. Adds the
CRDT_OP_TX broadcast channel (capacity 1024).
- server/src/main.rs: wires up the sync subsystem at startup.
- server/src/http/mod.rs: registers the new endpoint.
- server/src/config.rs: adds optional rendezvous field for outbound peers.
- server/src/worktree.rs: minor changes from the original branch.
- server/Cargo.toml: cfg lint suppression for CrdtNode derive.
- crates/bft-json-crdt/src/debug.rs: fix unused-variable warnings.
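The fan-out shape referenced in the crdt_sync.rs item, as a sketch
(CRDT_OP_TX, SignedOp, and the capacities come from this message; the
task structure is illustrative):

    use tokio::sync::{broadcast, mpsc};

    // crdt_state.rs side: CRDT_OP_TX is a broadcast::Sender<SignedOp>
    // with capacity 1024; every locally-applied op is published to it.
    pub fn subscribe_ops() -> broadcast::Receiver<SignedOp> {
        CRDT_OP_TX.subscribe()
    }

    // crdt_sync.rs side: one forwarding task per peer. The peer queue
    // is a bounded mpsc (256); a full queue means the peer is too
    // slow, so we drop the peer rather than buffer without bound.
    async fn fan_out(
        mut ops: broadcast::Receiver<SignedOp>,
        peer: mpsc::Sender<SignedOp>, // bounded to 256
    ) {
        loop {
            match ops.recv().await {
                Ok(op) => {
                    if peer.try_send(op).is_err() {
                        break; // slow or gone: disconnect this peer
                    }
                }
                // Lagged the 1024-op ring buffer: skip ahead.
                Err(broadcast::error::RecvError::Lagged(_)) => continue,
                Err(broadcast::error::RecvError::Closed) => break,
            }
        }
    }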
Resolved a trivial test-mod merge conflict in crdt_state.rs (both 478 and
503 added new tests at the end of the test module — kept both sets).
Note: this is the squash of the original 478 work that the user explicitly
authorized landing. The earlier rogue commit ac9f3ecf — which added a
DIFFERENT, broken implementation of the same feature directly to master
under the user's identity without consent — was reverted earlier in this
session. The forensic tags rogue-commit-2026-04-09-ac9f3ecf and
pre-502-reset-2026-04-09 still exist for incident audit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CRDT layer for pipeline state, backed by SQLite. Integrates the
BFT JSON CRDT crate with SQLite persistence via sqlx. Ops are
persisted and replayed on startup. Node identity is an Ed25519
keypair.
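A minimal sketch of the identity piece (an ed25519-dalek-style API;
key storage and op framing are assumptions):

    use ed25519_dalek::{Signer, SigningKey};
    use rand::rngs::OsRng;

    // Real code would load a persisted keypair; generating one here
    // just illustrates the shape of the node identity.
    fn node_identity() -> SigningKey {
        SigningKey::generate(&mut OsRng)
    }

    // Ops are signed by the node key before being persisted.
    fn sign_op(key: &SigningKey, op: &[u8]) -> ed25519_dalek::Signature {
        key.sign(op)
    }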
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>