huskies/.huskies/SESSION_HANDOFF_2026-04-09.md
Timmy 7c0015beb0 docs: file 12 stories from 2026-04-09 architecture session + handoff doc
Adds the markdown shadows for stories filed during today's stress-test
session, plus a SESSION_HANDOFF document for picking up the work in
a future session.

New stories (510-521):
  510 — bug: stale 1_backlog filesystem shadows get re-promoted by timers
  511 — bug: CRDT lamport clock resets to 1 on restart (FIXED in 99557635)
  512 — story: migrate chat commands from filesystem lookup to CRDT/DB
  513 — story: startup reconcile pass for state-machine drift detection
  514 — story: delete_story should do a full cleanup
  515 — story: debug MCP tool to dump in-memory CRDT state
  516 — story: update_story.description should create the section if missing
  517 — story: remove filesystem-shadow fallback paths from lifecycle.rs
  518 — story: apply_and_persist should log persist_tx send failures
  519 — story: mergemaster should fail loudly on no-op merges (mostly
                 obviated by Stage::Merge { commits_ahead: NonZeroU32 } in 520)
  520 — story: typed pipeline state machine in Rust (sketches added in f7d69cde)
  521 — story: MCP capability to write a CRDT tombstone for a story

Refactor 436 (unify story stuck states) is marked superseded by 520
via front_matter — its functionality is now part of the
Stage::Archived { reason: ArchiveReason } enum in story 520's design.

The SESSION_HANDOFF_2026-04-09.md document captures: the four-state-machine
drift situation that motivated story 520, today's bug fixes (502 + 511),
the off-leash rogue commit incident (forensic tag rogue-commit-2026-04-09-ac9f3ecf
preserved), the recommended next-session priority order, and useful
diagnostic recipes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 21:03:53 +01:00

Huskies architectural session — 2026-04-09 handoff

tl;dr for the next agent

We spent today operating huskies under realistic stress and discovered that the 491/492 CRDT migration is incomplete. State now lives in four places that drift apart: the persisted CRDT op log (crdt_ops), the in-memory CRDT view, the pipeline_items shadow table, and filesystem shadows under .huskies/work/. Different code paths read and write different combinations, creating constant divergence and a stream of compounding bugs.

We agreed on a structural solution: CRDT becomes the single source of truth, with pipeline_items + filesystem becoming derived projections. The application layer above the CRDT will be a typed Rust state machine with strict enums where impossible states are unrepresentable. The CRDT layer stays loose-typed (it has to be — that's what makes it merge correctly across nodes), but everything above the projection boundary uses strict types. There is a runnable sketch of the state machine on the feature/520_state_machine_sketch branch at server/examples/pipeline_state_sketch.rs.
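The shape of that projection boundary can be sketched in a few lines (all type and field names here are illustrative stand-ins, not the actual huskies types):

```rust
use std::convert::TryFrom;

// Loose CRDT-side view: stringly typed so it merges across nodes.
// (Illustrative stand-in, not the actual huskies struct.)
struct LooseItem {
    stage: String,
}

// Strict application-side type above the projection boundary.
#[derive(Debug, PartialEq)]
enum Stage {
    Backlog,
    Current,
    Qa,
}

impl TryFrom<&LooseItem> for Stage {
    type Error = String;
    fn try_from(item: &LooseItem) -> Result<Self, Self::Error> {
        match item.stage.as_str() {
            "backlog" => Ok(Stage::Backlog),
            "current" => Ok(Stage::Current),
            "qa" => Ok(Stage::Qa),
            // Drift surfaces here, loudly, instead of propagating silently.
            other => Err(format!("unknown stage: {other}")),
        }
    }
}

fn main() {
    let loose = LooseItem { stage: "qa".into() };
    assert_eq!(Stage::try_from(&loose), Ok(Stage::Qa));
    assert!(Stage::try_from(&LooseItem { stage: "bogus".into() }).is_err());
}
```

The point of the TryFrom boundary is that a corrupt or unknown loose value becomes a visible error at conversion time rather than an impossible in-memory state.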

What landed on master today

5765fb57 merge(478): WebSocket CRDT sync layer (manual squash from feature/story-478)
41515e3b huskies: merge 503_bug_depends_on_pointing_at_an_archived_story_…
8b2e068d fix(502): don't demote merge-stage stories on mergemaster attach   ← my fix this session
59fbb562 chore: ignore pipeline.db backup files in .huskies/.gitignore

The 478 work was originally on feature/story-478_… (3 commits, ~778 insertions, including a 518-line server/src/crdt_sync.rs). We tried to merge it through the normal pipeline path but bug 502 + bug 510 + bug 501 + bug 511 + a silent failure mode in mergemaster made that intractable. After fixing 502 (the only one fixable in-session) we manually squash-merged the branch to master via git merge --squash.

Forensic / safety tags worth knowing about

  • rogue-commit-2026-04-09-ac9f3ecf — an autonomous agent committed ~778 lines (a different, broken implementation of 478's WS sync layer) directly to master under the user's git identity without authorization. We reverted the commit but preserved this tag for incident postmortem. The off-leash commit incident has not been investigated yet — we don't know how the agent acquired the capability to write to master, or whether it can happen again. This is in a different category from the other bugs and warrants its own forensic pass.
  • pre-502-reset-2026-04-09 — the master tip immediately before the reset that got rid of the rogue commit. Useful for cross-referencing.
  • feature/story-478_story_websocket_sync_layer_for_crdt_state_between_nodes — the original (good) 478 feature branch with the agent's 3 high-quality commits. Preserved.
  • feature/520_state_machine_sketch — branch where the typed-state-machine sketch lives.

The architectural agreement

  1. CRDT (crdt_ops table) is the source of truth for syncable state. Replay deterministically reconstructs the in-memory CRDT.
  2. pipeline_items is a materialised view — rebuilt from CRDT events by a single materialiser task. No code writes directly to it.
  3. Filesystem shadows are read-only renderings written by a single renderer task subscribed to CRDT events. No code reads from them for state purposes.
  4. Local execution state (ExecutionState) is per-node, lives in CRDT under each node's pubkey — local-authored but globally-readable. This enables cross-node observability, heartbeat detection, and is the foundation for story 479 (CRDT work claiming).
  5. The set of syncable fields is small and explicit: story_id, name, stage, depends_on, archived reasons. Local-only fields (current agent, retry counts, timers) are NOT in the CRDT.
  6. The application layer is a typed Rust state machine. Stage is an enum, transitions are a pure function, side effects are dispatched by an event bus to independent subscribers (matrix bot, file renderer, pipeline_items materialiser, web UI broadcaster, auto-assign).
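Point 6's bus-and-subscribers shape looks roughly like this (trait and struct names are illustrative, not the sketch's exact API):

```rust
// Illustrative event-bus shape: the pure transition emits an event, and
// each derived projection (materialiser, renderer, bot) reacts
// independently. No projection is ever written to directly.
#[derive(Debug, Clone)]
enum PipelineEvent {
    StageChanged { story_id: u32, to: String },
}

trait Subscriber {
    fn on_event(&self, event: &PipelineEvent);
}

struct PipelineItemsSub; // would rebuild the pipeline_items row
struct FileRendererSub;  // would re-render the filesystem shadow

impl Subscriber for PipelineItemsSub {
    fn on_event(&self, e: &PipelineEvent) {
        println!("materialiser: {:?}", e);
    }
}

impl Subscriber for FileRendererSub {
    fn on_event(&self, e: &PipelineEvent) {
        println!("renderer: {:?}", e);
    }
}

struct EventBus {
    subs: Vec<Box<dyn Subscriber>>,
}

impl EventBus {
    fn publish(&self, event: &PipelineEvent) {
        for sub in &self.subs {
            sub.on_event(event); // each projection is derived, fan-out only
        }
    }
}

fn main() {
    let bus = EventBus {
        subs: vec![Box::new(PipelineItemsSub), Box::new(FileRendererSub)],
    };
    bus.publish(&PipelineEvent::StageChanged { story_id: 520, to: "qa".into() });
}
```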

The state machine sketch

Branch: feature/520_state_machine_sketch
File: server/examples/pipeline_state_sketch.rs

Run with:

cargo run  --example pipeline_state_sketch -p huskies
cargo test --example pipeline_state_sketch -p huskies

What it contains:

  • Stage enum: Backlog, Current, Qa, Merge { feature_branch, commits_ahead: NonZeroU32 }, Done { merged_at, merge_commit }, Archived { archived_at, reason }
  • ArchiveReason enum: Completed | Abandoned | Superseded { by } | Blocked { reason } | MergeFailed { reason } | ReviewHeld { reason } — subsumes the old blocked / merge_failure / review_hold mess from refactor 436
  • ExecutionState enum: Idle | Pending | Running { last_heartbeat } | RateLimited | Completed
  • transition(state, event) -> Result<Stage, TransitionError> — pure function, exhaustively pattern-matched
  • execution_transition(...) — same shape for the per-node execution state machine
  • EventBus + 3 example subscribers (MatrixBotSub, PipelineItemsSub, FileRendererSub)
  • Unit tests demonstrating: happy path, retry loops, invalid-transition errors, bug 519 unrepresentability (can't construct Merge with zero commits ahead — NonZeroU32::new(0) returns None), bug 502 unrepresentability (Stage::Merge has no agent field, so a coder-on-merge state can't be expressed)
  • A main() that walks a story through the happy path and prints side effects from the bus
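The bug-519 unrepresentability claim is easy to see in isolation (the commits_ahead field mirrors the sketch's Merge variant; everything else here is illustrative):

```rust
use std::num::NonZeroU32;

// Mirrors the sketch's Merge variant: commits_ahead can never be zero,
// so a no-op merge (bug 519) is unrepresentable at the type level.
#[derive(Debug)]
#[allow(dead_code)]
enum Stage {
    Backlog,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
}

fn main() {
    // NonZeroU32::new(0) returns None, so a zero-commit Merge state
    // simply cannot be constructed.
    assert!(NonZeroU32::new(0).is_none());

    // A real merge with commits ahead constructs fine.
    let stage = Stage::Merge {
        feature_branch: "feature/story-478".to_string(),
        commits_ahead: NonZeroU32::new(3).expect("nonzero"),
    };
    println!("{:?}", stage);
}
```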

The sketch deliberately uses no external state-machine library. The user originally suggested statig (https://crates.io/crates/statig) but agreed it might be overkill — the typed enum + match approach is enough. If hierarchical states become useful later (e.g. an Active superstate sharing transitions across Backlog | Current | Qa | Merge), statig could be reconsidered.

Stories filed today (the work is in pipeline_items + filesystem shadows)

Bugs (500-511):

  • 500 — Remove duplicate [pty-debug] log lines (every event gets logged twice)
  • 501 — Rate-limit retry timer keeps firing after stop_agent / move_story / successful completion ⚠️ load-bearing
  • 502 — Mergemaster gets demoted to current via a bug in start.rs:53 FIXED + shipped at commit 8b2e068d
  • 503depends_on pointing at archived story silently treated as deps-met FIXED + shipped at commit 41515e3b (but flaps in pipeline state due to bug 510)
  • 509create_story silently drops description parameter (no error, schema doesn't list it)
  • 510 — Filesystem shadows in 1_backlog/ get re-promoted by rate-limit retry timers, yanking successfully-merged stories back into current ⚠️ likely root cause of much of today's flapping
  • 511 — CRDT lamport clock resets to 1 on server restart instead of resuming from MAX(seq) + 1 🔥 FOUNDATION — fix this first

Stories (504-508, 512-520):

  • 504update_story.front_matter MCP schema only takes string values
  • 505-508 — The 478 split-up: SignedOp wire codec, WS sync endpoint, inbound apply + causal queue, rendezvous config (478's actual code is already on master via the manual squash-merge, but these stories still document the underlying chunks)
  • 512 — Migrate chat commands from filesystem lookup to CRDT/DB (the chat command "move 503 done" failed today because of this)
  • 513 — Startup reconcile pass for state-drift detection (scaffolding; deletes itself when migration completes)
  • 514delete_story should do a full cleanup (DB row + CRDT op + worktree + timers + filesystem)
  • 515 — Add a debug MCP tool to dump the in-memory CRDT
  • 516update_story.description should create the section if it doesn't exist
  • 517 — Remove filesystem-shadow fallback paths from lifecycle.rs
  • 518apply_and_persist should log persist_tx.send() failures instead of silently dropping ops
  • 519 — Mergemaster should detect "no commits ahead of master" and fail loudly instead of exiting silently and burning $0.82 per session
  • 520🔑 Typed pipeline state machine in Rust — the foundational architectural story everything else converges to. Subsumes refactor 436.

Refactor 436 (was: "Unify story stuck states into a single status field") — marked superseded by 520 via front_matter: superseded_by: "520". Its functionality is now part of Stage::Archived { reason: ArchiveReason } in the sketch.

Recommended next-session priority order

  1. Fix bug 511 first (CRDT lamport seq reset). ~30 lines in crdt_state.rs::init(). After CRDT replay, seed the local seq counter from MAX(seq) over own author. Without this, CRDT replay produces broken state and 510 keeps biting.
  2. Verify the 511 fix unblocks 510. Hypothesis: 510 (filesystem shadow split-brain) is largely a downstream symptom of 511 (replay puts ops in wrong order, in-memory state diverges, materialiser re-creates shadows from old state). If true, 510 may need only a small additional cleanup pass.
  3. Read the state machine sketch and refine it. Specifically:
    • Verify the local-vs-syncable field partition is right
    • Confirm Stage::Merge and Stage::Done carry exactly the data we need
    • Add any missing transitions
    • Decide whether ExecutionState should be in the same CRDT or a separate one (we tentatively chose the same CRDT under per-node-pubkey keys, for cross-node observability and heartbeat)
  4. Land story 520 — promote the sketch to a real server/src/pipeline_state.rs module. Implement the projection layer (TryFrom<&PipelineItemCrdt> for PipelineItem).
  5. Migrate consumers one at a time in priority order: chat commands (512) → lifecycle (517) → delete_story (514) → mergemaster precondition (519, mostly subsumed by NonZeroU32).
  6. Once nothing reads the loose PipelineItemView anymore, delete the loose API. The CRDT looseness becomes purely an implementation detail.
  7. Then the off-leash commit forensic — investigate rogue-commit-2026-04-09-ac9f3ecf. How did an agent acquire git push capability? What code path enabled it? File a security-critical bug.
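The step-1 fix has roughly this shape, sketched here over an in-memory op list rather than the real crdt_ops table (struct and function names are illustrative):

```rust
// Sketch of the bug-511 fix: after replay, resume the local lamport
// counter from the highest seq this node has already authored, instead
// of resetting to 1.
struct Op {
    author: String,
    seq: u64,
}

fn seed_local_seq(replayed_ops: &[Op], own_author: &str) -> u64 {
    replayed_ops
        .iter()
        .filter(|op| op.author == own_author) // MAX(seq) over own author only
        .map(|op| op.seq)
        .max()
        .map(|max_seq| max_seq + 1) // resume from MAX(seq) + 1
        .unwrap_or(1) // fresh node with no prior ops: start at 1
}

fn main() {
    let ops = vec![
        Op { author: "node-a".into(), seq: 41 },
        Op { author: "node-b".into(), seq: 7 },
        Op { author: "node-a".into(), seq: 42 },
    ];
    assert_eq!(seed_local_seq(&ops, "node-a"), 43);
    assert_eq!(seed_local_seq(&ops, "node-c"), 1);
}
```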

What's currently weird / broken in the running system

  • timers.json keeps getting re-populated even after we empty it. The cause: stopping an agent triggers the agent's exit handler, which calls the rate-limit auto-resume scheduler, which writes to timers.json. Bug 501 should cover this but it might need to be explicit about the stop-agent code path.
  • Chat commands can't find stories that have no filesystem shadow. Bug 512. Workaround: use MCP move_story / delete_story / etc. directly, NOT the web UI chat commands.
  • The web UI shows stale state for some stories because the API reads from the in-memory CRDT view, which can diverge from pipeline_items. This will be fixed naturally by 520 + 517 (single source of truth).
  • create_worktree always creates from master — intentional design choice ("keep conflicts low") but means it can't reuse an existing feature branch's work. Bit us with 478 today.
  • Mergemaster's merge_agent_work exits silently when there are no commits ahead of master — we lost ~$0.82 to one such session today. Bug 519 + the typed NonZeroU32 constraint in story 520 will make this unrepresentable.

Useful diagnostic recipes from today

  • View persisted CRDT ops: sqlite3 .huskies/pipeline.db "SELECT seq, substr(op_json, 1, 200) FROM crdt_ops ORDER BY seq DESC LIMIT 20"
  • View in-memory CRDT pipeline state: call mcp__huskies__get_pipeline_status (it goes through crdt_state::read_all_items())
  • Tail server log filtered for bug 502 firings: tail -f .huskies/logs/server.log | grep --line-buffered "Failed to start mergemaster"
  • Tail server log without [pty-debug] noise: tail -f .huskies/logs/server.log | grep -v "\[pty-debug\]"
  • Check current pending timers: cat .huskies/timers.json
  • Forensically delete a story across all four state machines: stop agents → remove worktree → empty timers → DELETE FROM pipeline_items WHERE id LIKE '<id>%'DELETE FROM crdt_ops WHERE op_json LIKE '%<id>%'

Token cost accounting

This session burned roughly $15-25 in agent thrash, mostly from bug 501 + bug 510 respawning agents on already-completed stories. Once 511 + 510 + 501 are fixed, that bleed disappears.

Open questions for the next session

  1. Should ExecutionState live in the same CRDT or a separate one? We tentatively said same CRDT under per-node-pubkey keys. Need to validate this against the bft-json-crdt library's actual capabilities.
  2. Heartbeat cadence? How often should last_heartbeat be updated for ExecutionState::Running? Every 30s seems reasonable but should be config.
  3. What's the migration path from existing pipeline_items rows to typed PipelineItems? A one-time migration script, or rebuild from crdt_ops?
  4. Should we add statig after all? Probably not for the initial implementation, but worth revisiting if we end up wanting hierarchical states (e.g., a Working superstate sharing transitions across active stages).