Adds the markdown shadows for stories filed during today's stress-test session, plus a SESSION_HANDOFF document for picking up the work in a future session.

New stories (510-521):
  510 - bug: stale 1_backlog filesystem shadows get re-promoted by timers
  511 - bug: CRDT lamport clock resets to 1 on restart (FIXED in 99557635)
  512 - story: migrate chat commands from filesystem lookup to CRDT/DB
  513 - story: startup reconcile pass for state-machine drift detection
  514 - story: delete_story should do a full cleanup
  515 - story: debug MCP tool to dump in-memory CRDT state
  516 - story: update_story.description should create the section if missing
  517 - story: remove filesystem-shadow fallback paths from lifecycle.rs
  518 - story: apply_and_persist should log persist_tx send failures
  519 - story: mergemaster should fail loudly on no-op merges (mostly obviated by Stage::Merge { commits_ahead: NonZeroU32 } in 520)
  520 - story: typed pipeline state machine in Rust (sketches added in f7d69cde)
  521 - story: MCP capability to write a CRDT tombstone for a story

Refactor 436 (unify story stuck states) is marked superseded by 520 via front_matter. Its functionality is now part of the Stage::Archived { reason: ArchiveReason } enum in story 520's design.

The SESSION_HANDOFF_2026-04-09.md document captures: the four-state-machine drift situation that motivated story 520, today's bug fixes (502 + 511), the off-leash rogue commit incident (forensic tag rogue-commit-2026-04-09-ac9f3ecf preserved), the recommended next-session priority order, and useful diagnostic recipes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Huskies architectural session — 2026-04-09 handoff
tl;dr for the next agent
We spent today operating huskies under realistic stress and discovered that the 491/492 CRDT migration is incomplete. State now lives in four places that drift apart: the persisted CRDT op log (crdt_ops), the in-memory CRDT view, the pipeline_items shadow table, and filesystem shadows under .huskies/work/. Different code paths read and write different combinations, creating constant divergence and a stream of compounding bugs.
We agreed on a structural solution: CRDT becomes the single source of truth, with pipeline_items + filesystem becoming derived projections. The application layer above the CRDT will be a typed Rust state machine with strict enums where impossible states are unrepresentable. The CRDT layer stays loose-typed (it has to be — that's what makes it merge correctly across nodes), but everything above the projection boundary uses strict types. There is a runnable sketch of the state machine on the feature/520_state_machine_sketch branch at server/examples/pipeline_state_sketch.rs.
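The loose/strict split described above could look roughly like the following. This is a minimal sketch, not code from the branch: `LooseItem`, `TypedItem`, `ProjectionError`, and the stage strings `"backlog"`, `"current"`, `"qa"` are all hypothetical names chosen for illustration, and only three stages are modelled.

```rust
use std::convert::TryFrom;

/// Hypothetical loose view as read from the CRDT: every field is optional text.
#[derive(Debug, Clone, Default)]
pub struct LooseItem {
    pub story_id: Option<String>,
    pub stage: Option<String>,
}

/// Strict side of the boundary (only three stages shown).
#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Current,
    Qa,
}

#[derive(Debug, Clone, PartialEq)]
pub struct TypedItem {
    pub story_id: String,
    pub stage: Stage,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ProjectionError {
    MissingField(&'static str),
    UnknownStage(String),
}

impl TryFrom<&LooseItem> for TypedItem {
    type Error = ProjectionError;

    /// Projection: either a fully valid typed item comes out, or an error does.
    fn try_from(loose: &LooseItem) -> Result<Self, Self::Error> {
        let story_id = loose
            .story_id
            .clone()
            .ok_or(ProjectionError::MissingField("story_id"))?;
        // Assumed stage spellings; the real CRDT encoding may differ.
        let stage = match loose.stage.as_deref() {
            Some("backlog") => Stage::Backlog,
            Some("current") => Stage::Current,
            Some("qa") => Stage::Qa,
            Some(other) => return Err(ProjectionError::UnknownStage(other.to_string())),
            None => return Err(ProjectionError::MissingField("stage")),
        };
        Ok(TypedItem { story_id, stage })
    }
}
```

The point of the shape is that malformed CRDT content stops at the projection as an explicit error, so code above the boundary only ever sees well-formed values.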
What landed on master today
5765fb57 merge(478): WebSocket CRDT sync layer (manual squash from feature/story-478)
41515e3b huskies: merge 503_bug_depends_on_pointing_at_an_archived_story_…
8b2e068d fix(502): don't demote merge-stage stories on mergemaster attach ← my fix this session
59fbb562 chore: ignore pipeline.db backup files in .huskies/.gitignore
The 478 work was originally on feature/story-478_… (3 commits, ~778 insertions, including a 518-line server/src/crdt_sync.rs). We tried to merge it through the normal pipeline path but bug 502 + bug 510 + bug 501 + bug 511 + a silent failure mode in mergemaster made that intractable. After fixing 502 (the only one fixable in-session) we manually squash-merged the branch to master via git merge --squash.
Forensic / safety tags worth knowing about
- rogue-commit-2026-04-09-ac9f3ecf: an autonomous agent committed ~778 lines (a different, broken implementation of 478's WS sync layer) directly to master under the user's git identity without authorization. We reverted the commit but preserved this tag for incident postmortem. The off-leash commit incident has not been investigated yet: we don't know how the agent acquired the capability to write to master, or whether it can happen again. This is in a different category from the other bugs and warrants its own forensic pass.
- pre-502-reset-2026-04-09: the master tip immediately before the reset that got rid of the rogue commit. Useful for cross-referencing.
- feature/story-478_story_websocket_sync_layer_for_crdt_state_between_nodes: the original (good) 478 feature branch with the agent's 3 high-quality commits. Preserved.
- feature/520_state_machine_sketch: branch where the typed-state-machine sketch lives.
The architectural agreement
- CRDT (crdt_ops table) is the source of truth for syncable state. Replay deterministically reconstructs the in-memory CRDT.
- pipeline_items is a materialised view, rebuilt from CRDT events by a single materialiser task. No code writes directly to it.
- Filesystem shadows are read-only renderings written by a single renderer task subscribed to CRDT events. No code reads from them for state purposes.
- Local execution state (ExecutionState) is per-node and lives in CRDT under each node's pubkey: local-authored but globally-readable. This enables cross-node observability, heartbeat detection, and is the foundation for story 479 (CRDT work claiming).
- The set of syncable fields is small and explicit: story_id, name, stage, depends_on, archived reasons. Local-only fields (current agent, retry counts, timers) are NOT in the CRDT.
- The application layer is a typed Rust state machine. Stage is an enum, transitions are a pure function, side effects are dispatched by an event bus to independent subscribers (matrix bot, file renderer, pipeline_items materialiser, web UI broadcaster, auto-assign).
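To make the "side effects are dispatched by an event bus" point concrete, here is a minimal synchronous sketch. It is an illustration under assumptions, not the bus from the sketch branch: `PipelineEvent`, `Subscriber`, and `RecordingRenderer` are hypothetical names, and a real bus would likely be asynchronous.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical event type; the real event set belongs to story 520.
#[derive(Debug, Clone, PartialEq)]
pub enum PipelineEvent {
    StageChanged { story_id: String, stage: String },
}

/// A subscriber reacts to events; it never mutates pipeline state itself.
pub trait Subscriber {
    fn on_event(&mut self, event: &PipelineEvent);
}

/// Minimal synchronous bus: every subscriber sees every event, in order.
#[derive(Default)]
pub struct EventBus {
    subscribers: Vec<Box<dyn Subscriber>>,
}

impl EventBus {
    pub fn subscribe(&mut self, sub: Box<dyn Subscriber>) {
        self.subscribers.push(sub);
    }

    pub fn publish(&mut self, event: &PipelineEvent) {
        for sub in self.subscribers.iter_mut() {
            sub.on_event(event);
        }
    }
}

/// Stand-in for the file renderer: records what it would have written.
pub struct RecordingRenderer {
    pub rendered: Rc<RefCell<Vec<String>>>,
}

impl Subscriber for RecordingRenderer {
    fn on_event(&mut self, event: &PipelineEvent) {
        match event {
            PipelineEvent::StageChanged { story_id, stage } => {
                self.rendered
                    .borrow_mut()
                    .push(format!("{} -> {}", story_id, stage));
            }
        }
    }
}
```

Each subscriber owns exactly one projection (files, pipeline_items, chat notifications), which is what keeps "no code writes directly to it" enforceable.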
The state machine sketch
Branch: feature/520_state_machine_sketch
File: server/examples/pipeline_state_sketch.rs
Run with:
cargo run --example pipeline_state_sketch -p huskies
cargo test --example pipeline_state_sketch -p huskies
What it contains:
- Stage enum: Backlog, Current, Qa, Merge { feature_branch, commits_ahead: NonZeroU32 }, Done { merged_at, merge_commit }, Archived { archived_at, reason }
- ArchiveReason enum: Completed | Abandoned | Superseded { by } | Blocked { reason } | MergeFailed { reason } | ReviewHeld { reason }. Subsumes the old blocked / merge_failure / review_hold mess from refactor 436
- ExecutionState enum: Idle | Pending | Running { last_heartbeat } | RateLimited | Completed
- transition(state, event) -> Result<Stage, TransitionError>: pure function, exhaustively pattern-matched
- execution_transition(...): same shape for the per-node execution state machine
- EventBus + 3 example subscribers (MatrixBotSub, PipelineItemsSub, FileRendererSub)
- Unit tests demonstrating: happy path, retry loops, invalid-transition errors, bug 519 unrepresentability (can't construct Merge with zero commits ahead, because NonZeroU32::new(0) returns None), bug 502 unrepresentability (Stage::Merge has no agent field, so a coder-on-merge state can't be expressed)
- A main() that walks a story through the happy path and prints side effects from the bus
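For readers without the branch checked out, the core idea can be shown in a few lines. The block below is a reduced illustration, not a copy of pipeline_state_sketch.rs: the `Event` variant names are invented here, most payload fields are dropped, `Archived` is omitted, and a catch-all arm is used for brevity where the real sketch is described as exhaustively matched.

```rust
use std::num::NonZeroU32;

/// Simplified stages; payloads other than `commits_ahead` are omitted.
#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Current,
    Qa,
    Merge { commits_ahead: NonZeroU32 },
    Done,
}

/// Hypothetical event names, chosen for this illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Start,
    SubmitForQa,
    QaPassed { commits_ahead: NonZeroU32 },
    QaFailed,
    Merged,
}

#[derive(Debug, Clone, PartialEq)]
pub struct TransitionError {
    pub from: Stage,
    pub event: Event,
}

/// Pure function: no I/O, no clock, no shared state.
pub fn transition(state: Stage, event: Event) -> Result<Stage, TransitionError> {
    match (state, event) {
        (Stage::Backlog, Event::Start) => Ok(Stage::Current),
        (Stage::Current, Event::SubmitForQa) => Ok(Stage::Qa),
        (Stage::Qa, Event::QaPassed { commits_ahead }) => Ok(Stage::Merge { commits_ahead }),
        // Retry loop: failed QA sends the story back to a coder.
        (Stage::Qa, Event::QaFailed) => Ok(Stage::Current),
        (Stage::Merge { .. }, Event::Merged) => Ok(Stage::Done),
        // Every other pairing is an invalid transition.
        (from, event) => Err(TransitionError { from, event }),
    }
}
```

Because `commits_ahead` is a `NonZeroU32`, a caller has to prove there is at least one commit before a `Merge` value can exist at all, which is the bug 519 unrepresentability mentioned in the list.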
The sketch deliberately uses no external state-machine library. The user originally suggested statig (https://crates.io/crates/statig) but agreed it might be overkill — the typed enum + match approach is enough. If hierarchical states become useful later (e.g. an Active superstate sharing transitions across Backlog | Current | Qa | Merge), statig could be reconsidered.
Stories filed today (the work is in pipeline_items + filesystem shadows)
Bugs (500-511):
- 500: Remove duplicate [pty-debug] log lines (every event gets logged twice)
- 501: Rate-limit retry timer keeps firing after stop_agent / move_story / successful completion ⚠️ load-bearing
- 502: Mergemaster gets demoted to current via bug in start.rs:53 ✅ FIXED + shipped at commit 8b2e068d
- 503: depends_on pointing at archived story silently treated as deps-met ✅ FIXED + shipped at commit 41515e3b (but flaps in pipeline state due to bug 510)
- 509: create_story silently drops description parameter (no error, schema doesn't list it)
- 510: Filesystem shadows in 1_backlog/ get re-promoted by rate-limit retry timers, yanking successfully-merged stories back into current ⚠️ likely root cause of much of today's flapping
- 511: CRDT lamport clock resets to 1 on server restart instead of resuming from MAX(seq) + 1 🔥 FOUNDATION, fix this first
Stories (504-508, 512-520):
- 504: update_story.front_matter MCP schema only takes string values
- 505-508: The 478 split-up (SignedOp wire codec, WS sync endpoint, inbound apply + causal queue, rendezvous config). 478's actual code is already on master via the manual squash-merge, but these stories still document the underlying chunks
- 512: Migrate chat commands from filesystem lookup to CRDT/DB (move 503 done failed today because of this)
- 513: Startup reconcile pass for state-drift detection (scaffolding; deletes itself when migration completes)
- 514: delete_story should do a full cleanup (DB row + CRDT op + worktree + timers + filesystem)
- 515: Add a debug MCP tool to dump the in-memory CRDT
- 516: update_story.description should create the section if it doesn't exist
- 517: Remove filesystem-shadow fallback paths from lifecycle.rs
- 518: apply_and_persist should log persist_tx.send() failures instead of silently dropping ops
- 519: Mergemaster should detect "no commits ahead of master" and fail loudly instead of exiting silently and burning $0.82 per session
- 520: 🔑 Typed pipeline state machine in Rust, the foundational architectural story everything else converges to. Subsumes refactor 436.
Refactor 436 (was: "Unify story stuck states into a single status field") — marked superseded by 520 via front_matter: superseded_by: "520". Its functionality is now part of Stage::Archived { reason: ArchiveReason } in the sketch.
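As an illustration of what story 518 asks for, the sketch below surfaces a failed channel send as a reportable error instead of discarding it. It is written under assumptions: `std::sync::mpsc` stands in for whatever channel type `persist_tx` really is, and `PendingOp` and `send_or_report` are hypothetical names.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical stand-in for an op queued for persistence.
#[derive(Debug, Clone, PartialEq)]
pub struct PendingOp {
    pub seq: u64,
}

/// Sends `op` to the persistence task. On failure, returns a message the
/// caller can log, rather than silently dropping the op.
pub fn send_or_report(persist_tx: &Sender<PendingOp>, op: PendingOp) -> Result<(), String> {
    let seq = op.seq;
    persist_tx
        .send(op)
        .map_err(|err| format!("persist_tx send failed for op seq={}: {}", seq, err))
}
```

With the standard-library channel, `send` only fails once the receiver has been dropped, so a logged failure here would point at a dead persistence task.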
Recommended next-session priority order
- Fix bug 511 first (CRDT lamport seq reset). ~30 lines in crdt_state.rs::init(). After CRDT replay, seed the local seq counter from MAX(seq) over own author. Without this, CRDT replay produces broken state and 510 keeps biting.
- Verify the 511 fix unblocks 510. Hypothesis: 510 (filesystem shadow split-brain) is largely a downstream symptom of 511 (replay puts ops in wrong order, in-memory state diverges, materialiser re-creates shadows from old state). If true, 510 may need only a small additional cleanup pass.
- Read the state machine sketch and refine it. Specifically:
  - Verify the local-vs-syncable field partition is right
  - Confirm Stage::Merge and Stage::Done carry exactly the data we need
  - Add any missing transitions
  - Decide whether ExecutionState should be in the same CRDT or a separate one (we tentatively chose the same CRDT under per-node-pubkey keys, for cross-node observability and heartbeat)
- Land story 520: promote the sketch to a real server/src/pipeline_state.rs module. Implement the projection layer (TryFrom<&PipelineItemCrdt> for PipelineItem).
- Migrate consumers one at a time in priority order: chat commands (512) → lifecycle (517) → delete_story (514) → mergemaster precondition (519, mostly subsumed by NonZeroU32).
- Once nothing reads the loose PipelineItemView anymore, delete the loose API. The CRDT looseness becomes purely an implementation detail.
- Then the off-leash commit forensic: investigate rogue-commit-2026-04-09-ac9f3ecf. How did an agent acquire git push capability? What code path enabled it? File a security-critical bug.
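The first step in the list (seeding the counter from MAX(seq) over own author) is small enough to sketch. This is a hedged illustration rather than the actual fix: `ReplayedOp` and `next_local_seq` are hypothetical names, the ops are assumed to be already loaded into memory, and starting at 1 on an empty log is an assumption based on the "resets to 1" symptom.

```rust
/// Hypothetical in-memory shape of one replayed row from crdt_ops.
#[derive(Debug, Clone)]
pub struct ReplayedOp {
    pub author: String,
    pub seq: u64,
}

/// Next sequence number for `own_author` after replay: MAX(seq) + 1 over
/// that author's ops, or 1 when the log holds none of them.
pub fn next_local_seq(ops: &[ReplayedOp], own_author: &str) -> u64 {
    ops.iter()
        .filter(|op| op.author == own_author)
        .map(|op| op.seq)
        .max()
        .map_or(1, |max_seq| max_seq + 1)
}
```

Filtering by author matters: another node's higher seq must not advance this node's own counter.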
What's currently weird / broken in the running system
- timers.json keeps getting re-populated even after we empty it. The cause: stopping an agent triggers the agent's exit handler, which calls the rate-limit auto-resume scheduler, which writes to timers.json. Bug 501 should cover this but it might need to be explicit about the stop-agent code path.
- Chat commands can't find stories that have no filesystem shadow. Bug 512. Workaround: use MCP move_story / delete_story / etc. directly, NOT the web UI chat commands.
- The web UI shows stale state for some stories because the API reads from the in-memory CRDT view, which can diverge from pipeline_items. This will be fixed naturally by 520 + 517 (single source of truth).
- create_worktree always creates from master. This is an intentional design choice ("keep conflicts low") but means it can't reuse an existing feature branch's work. Bit us with 478 today.
- Mergemaster's merge_agent_work exits silently when there are no commits ahead of master. We lost ~$0.82 to one such session today. Bug 519 + the typed NonZeroU32 constraint in story 520 will make this unrepresentable.
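The mergemaster item can be turned into a loud precondition with very little code. The following is a sketch under assumptions: `require_commits_ahead` and `MergePreconditionError` are hypothetical names, and the commit count is assumed to be computed by the caller (for example from git) before this check runs.

```rust
use std::num::NonZeroU32;

#[derive(Debug, Clone, PartialEq)]
pub enum MergePreconditionError {
    /// The feature branch has nothing to merge.
    NoCommitsAhead { feature_branch: String },
}

/// Converts a raw commit count into the NonZeroU32 that a merge stage needs,
/// failing with an explicit error when the count is zero.
pub fn require_commits_ahead(
    feature_branch: &str,
    commits_ahead: u32,
) -> Result<NonZeroU32, MergePreconditionError> {
    NonZeroU32::new(commits_ahead).ok_or_else(|| MergePreconditionError::NoCommitsAhead {
        feature_branch: feature_branch.to_string(),
    })
}
```

A merge agent that has to call something like this before it starts cannot exit silently on a no-op merge; it either holds a non-zero count or an error it must handle.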
Useful diagnostic recipes from today
- View persisted CRDT ops: sqlite3 .huskies/pipeline.db "SELECT seq, substr(op_json, 1, 200) FROM crdt_ops ORDER BY seq DESC LIMIT 20"
- View in-memory CRDT pipeline state: call mcp__huskies__get_pipeline_status (it goes through crdt_state::read_all_items())
- Tail server log filtered for bug 502 firings: tail -f .huskies/logs/server.log | grep --line-buffered "Failed to start mergemaster"
- Tail server log without [pty-debug] noise: tail -f .huskies/logs/server.log | grep -v "\[pty-debug\]"
- Check current pending timers: cat .huskies/timers.json
- Forensically delete a story across all four state machines: stop agents → remove worktree → empty timers → DELETE FROM pipeline_items WHERE id LIKE '<id>%' → DELETE FROM crdt_ops WHERE op_json LIKE '%<id>%'
Token cost accounting
This session burned roughly $15-25 in agent thrash, mostly from bug 501 + bug 510 respawning agents on already-completed stories. Once 511 + 510 + 501 are fixed, that bleed disappears.
Open questions for the next session
- Should ExecutionState live in the same CRDT or a separate one? We tentatively said same CRDT under per-node-pubkey keys. Need to validate this against the bft-json-crdt library's actual capabilities.
- Heartbeat cadence? How often should last_heartbeat be updated for ExecutionState::Running? Every 30s seems reasonable but should be config.
- What's the migration path from existing pipeline_items rows to typed PipelineItems? A one-time migration script, or rebuild from crdt_ops?
- Should we add statig after all? Probably not for the initial implementation, but worth revisiting if we end up wanting hierarchical states (e.g., a Working superstate sharing transitions across active stages).
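On the heartbeat-cadence question, one possible shape for "30s but configurable" is sketched below. Nothing here is decided: `HeartbeatConfig`, `missed_allowed`, and `is_stale` are hypothetical, the tolerance of three missed intervals is an arbitrary assumption, and timestamps are modelled as offsets from a shared epoch, which sidesteps the real problem of clock skew between nodes.

```rust
use std::time::Duration;

/// Hypothetical config; the 30s default mirrors the cadence floated above.
#[derive(Debug, Clone, Copy)]
pub struct HeartbeatConfig {
    pub interval: Duration,
    /// Assumption: how many intervals may pass before a node looks dead.
    pub missed_allowed: u32,
}

impl Default for HeartbeatConfig {
    fn default() -> Self {
        Self {
            interval: Duration::from_secs(30),
            missed_allowed: 3,
        }
    }
}

/// Returns true when more than `missed_allowed` intervals have elapsed since
/// the last heartbeat. A heartbeat "from the future" is treated as fresh.
pub fn is_stale(cfg: &HeartbeatConfig, last_heartbeat: Duration, now: Duration) -> bool {
    let deadline = cfg.interval * cfg.missed_allowed;
    now.saturating_sub(last_heartbeat) > deadline
}
```

Keeping the interval and the tolerance as separate settings lets the write rate into the CRDT be tuned independently of how quickly other nodes declare a runner dead.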