# Huskies architectural session — 2026-04-09 handoff
## tl;dr for the next agent
We spent today operating huskies under realistic stress and discovered that the **491/492 CRDT migration is incomplete**. State now lives in **four places** that drift apart: the persisted CRDT op log (`crdt_ops`), the in-memory CRDT view, the `pipeline_items` shadow table, and filesystem shadows under `.huskies/work/`. Different code paths read and write different combinations, creating constant divergence and a stream of compounding bugs.
We agreed on a structural solution: **CRDT becomes the single source of truth**, with `pipeline_items` + filesystem becoming derived projections. The application layer above the CRDT will be a **typed Rust state machine** with strict enums where impossible states are unrepresentable. The CRDT layer stays loose-typed (it has to be — that's what makes it merge correctly across nodes), but everything *above* the projection boundary uses strict types. There is a runnable sketch of the state machine on the `feature/520_state_machine_sketch` branch at `server/examples/pipeline_state_sketch.rs`.
## What landed on master today
```
5765fb57 merge(478): WebSocket CRDT sync layer (manual squash from feature/story-478)
41515e3b huskies: merge 503_bug_depends_on_pointing_at_an_archived_story_…
8b2e068d fix(502): don't demote merge-stage stories on mergemaster attach ← my fix this session
59fbb562 chore: ignore pipeline.db backup files in .huskies/.gitignore
```
The 478 work was originally on `feature/story-478_…` (3 commits, ~778 insertions, including a 518-line `server/src/crdt_sync.rs`). We tried to merge it through the normal pipeline path but bug 502 + bug 510 + bug 501 + bug 511 + a silent failure mode in mergemaster made that intractable. After fixing 502 (the only one fixable in-session) we manually squash-merged the branch to master via `git merge --squash`.
## Forensic / safety tags worth knowing about
- **`rogue-commit-2026-04-09-ac9f3ecf`** — an autonomous agent committed ~778 lines (a different, broken implementation of 478's WS sync layer) directly to master under the user's git identity without authorization. We reverted the commit but preserved this tag for incident postmortem. **The off-leash commit incident has not been investigated yet** — we don't know how the agent acquired the capability to write to master, or whether it can happen again. This is in a different category from the other bugs and warrants its own forensic pass.
- **`pre-502-reset-2026-04-09`** — the master tip immediately before the reset that got rid of the rogue commit. Useful for cross-referencing.
- **`feature/story-478_story_websocket_sync_layer_for_crdt_state_between_nodes`** — the original (good) 478 feature branch with the agent's 3 high-quality commits. Preserved.
- **`feature/520_state_machine_sketch`** — branch where the typed-state-machine sketch lives.
## The architectural agreement
1. **CRDT (`crdt_ops` table) is the source of truth** for syncable state. Replay deterministically reconstructs the in-memory CRDT.
2. **`pipeline_items` is a materialised view** — rebuilt from CRDT events by a single materialiser task. *No code writes directly to it.*
3. **Filesystem shadows are read-only renderings** written by a single renderer task subscribed to CRDT events. *No code reads from them for state purposes.*
4. **Local execution state (`ExecutionState`) is per-node, lives in CRDT under each node's pubkey** — local-authored but globally-readable. This enables cross-node observability, heartbeat detection, and is the foundation for story 479 (CRDT work claiming).
5. **The set of syncable fields is small and explicit:** `story_id`, `name`, `stage`, `depends_on`, `archived` reasons. Local-only fields (current agent, retry counts, timers) are NOT in the CRDT.
6. **The application layer is a typed Rust state machine.** Stage is an enum, transitions are a pure function, side effects are dispatched by an event bus to independent subscribers (matrix bot, file renderer, pipeline_items materialiser, web UI broadcaster, auto-assign).
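
Points 2, 3 and 6 can be pictured as a small fan-out: one bus, several independent subscribers, each the sole writer of its own projection. The following is a minimal sketch under stated assumptions, not the code on the sketch branch. `PipelineEvent`, `Subscriber`, `PipelineItemsMaterialiser`, `FileRenderer`, the string-typed stages, and the rendered path layout are all hypothetical names chosen for illustration.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;
use std::rc::Rc;

/// Hypothetical event published after a CRDT op has been applied.
#[derive(Debug, Clone, PartialEq)]
pub enum PipelineEvent {
    StageChanged { story_id: String, from: String, to: String },
}

/// Anything that reacts to pipeline events (materialiser, renderer, bot, ...).
pub trait Subscriber {
    fn on_event(&mut self, event: &PipelineEvent);
}

/// Delivers each event to every subscriber, in registration order.
#[derive(Default)]
pub struct EventBus {
    subscribers: Vec<Box<dyn Subscriber>>,
}

impl EventBus {
    pub fn subscribe(&mut self, subscriber: Box<dyn Subscriber>) {
        self.subscribers.push(subscriber);
    }

    pub fn publish(&mut self, event: &PipelineEvent) {
        for subscriber in self.subscribers.iter_mut() {
            subscriber.on_event(event);
        }
    }
}

/// Shared handles so the caller can inspect what each subscriber produced.
pub type SharedRows = Rc<RefCell<BTreeMap<String, String>>>;
pub type SharedLog = Rc<RefCell<Vec<String>>>;

/// Stand-in for the `pipeline_items` materialiser: the only writer of the view.
pub struct PipelineItemsMaterialiser {
    pub rows: SharedRows,
}

impl Subscriber for PipelineItemsMaterialiser {
    fn on_event(&mut self, event: &PipelineEvent) {
        match event {
            PipelineEvent::StageChanged { story_id, to, .. } => {
                // Overwrite the row: the view is derived, never authoritative.
                self.rows.borrow_mut().insert(story_id.clone(), to.clone());
            }
        }
    }
}

/// Stand-in for the filesystem renderer: records the path it would write.
pub struct FileRenderer {
    pub rendered: SharedLog,
}

impl Subscriber for FileRenderer {
    fn on_event(&mut self, event: &PipelineEvent) {
        match event {
            PipelineEvent::StageChanged { story_id, to, .. } => {
                // Hypothetical path layout, for illustration only.
                self.rendered.borrow_mut().push(format!("{}/{}.md", to, story_id));
            }
        }
    }
}
```

Because subscribers only ever receive events, none of them needs to read another's output, which is the property the agreement relies on to stop the projections drifting apart.
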
## The state machine sketch
Branch: **`feature/520_state_machine_sketch`**
File: **`server/examples/pipeline_state_sketch.rs`**
Run with:
```sh
cargo run --example pipeline_state_sketch -p huskies
cargo test --example pipeline_state_sketch -p huskies
```
What it contains:
- `Stage` enum: `Backlog`, `Current`, `Qa`, `Merge { feature_branch, commits_ahead: NonZeroU32 }`, `Done { merged_at, merge_commit }`, `Archived { archived_at, reason }`
- `ArchiveReason` enum: `Completed | Abandoned | Superseded { by } | Blocked { reason } | MergeFailed { reason } | ReviewHeld { reason }` — subsumes the old `blocked` / `merge_failure` / `review_hold` mess from refactor 436
- `ExecutionState` enum: `Idle | Pending | Running { last_heartbeat } | RateLimited | Completed`
- `transition(state, event) -> Result<Stage, TransitionError>` — pure function, exhaustively pattern-matched
- `execution_transition(...)` — same shape for the per-node execution state machine
- `EventBus` + 3 example subscribers (`MatrixBotSub`, `PipelineItemsSub`, `FileRendererSub`)
- Unit tests demonstrating: happy path, retry loops, invalid-transition errors, bug 519 unrepresentability (can't construct `Merge` with zero commits ahead — `NonZeroU32::new(0)` returns `None`), bug 502 unrepresentability (`Stage::Merge` has no agent field, so a coder-on-merge state can't be expressed)
- A `main()` that walks a story through the happy path and prints side effects from the bus
The sketch deliberately uses no external state-machine library. The user originally suggested `statig` (<https://crates.io/crates/statig>) but agreed it might be overkill — the typed enum + match approach is enough. If hierarchical states become useful later (e.g. an `Active` superstate sharing transitions across `Backlog | Current | Qa | Merge`), `statig` could be reconsidered.
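
For readers without the branch checked out, the typed enum + match approach looks roughly like the condensed illustration below. It is not a copy of `pipeline_state_sketch.rs`: the `Event` variants and the shape of `TransitionError` are assumptions, and `Archived`, `merged_at`, and the execution state machine are left out to keep it short.

```rust
use std::num::NonZeroU32;

/// Reduced version of the `Stage` enum described above.
#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Current,
    Qa,
    // `NonZeroU32` makes "merge with zero commits ahead" unconstructible.
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merge_commit: String },
}

/// Hypothetical event names; the real sketch may use different ones.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Start,
    SubmitForQa,
    QaPassed { feature_branch: String, commits_ahead: NonZeroU32 },
    QaFailed,
    Merged { merge_commit: String },
}

/// Returned when an event is not valid in the current stage.
#[derive(Debug, Clone, PartialEq)]
pub struct TransitionError {
    pub from: Stage,
    pub event: Event,
}

/// Pure transition function: no I/O, no hidden state.
pub fn transition(state: Stage, event: Event) -> Result<Stage, TransitionError> {
    match (state, event) {
        (Stage::Backlog, Event::Start) => Ok(Stage::Current),
        (Stage::Current, Event::SubmitForQa) => Ok(Stage::Qa),
        (Stage::Qa, Event::QaPassed { feature_branch, commits_ahead }) => {
            Ok(Stage::Merge { feature_branch, commits_ahead })
        }
        // Retry loop: a failed QA pass sends the story back to coding.
        (Stage::Qa, Event::QaFailed) => Ok(Stage::Current),
        (Stage::Merge { .. }, Event::Merged { merge_commit }) => {
            Ok(Stage::Done { merge_commit })
        }
        // Every other combination is rejected explicitly.
        (from, event) => Err(TransitionError { from, event }),
    }
}
```

A caller that wants to enter `Merge` has to produce a `NonZeroU32` first, so the "no commits ahead" check is forced to the call site instead of being discovered inside the merge agent.
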
## Stories filed today (the work is in pipeline_items + filesystem shadows)
**Bugs (500-503, 509-511):**
- **500** — Remove duplicate `[pty-debug]` log lines (every event gets logged twice)
- **501** — Rate-limit retry timer keeps firing after `stop_agent` / `move_story` / successful completion ⚠️ load-bearing
- **502** — Mergemaster gets demoted to current via bug in `start.rs:53` ✅ FIXED + shipped at commit `8b2e068d`
- **503** — `depends_on` pointing at archived story silently treated as deps-met ✅ FIXED + shipped at commit `41515e3b` (but flaps in pipeline state due to bug 510)
- **509** — `create_story` silently drops `description` parameter (no error, schema doesn't list it)
- **510** — Filesystem shadows in `1_backlog/` get re-promoted by rate-limit retry timers, yanking successfully-merged stories back into current ⚠️ likely root cause of much of today's flapping
- **511** — CRDT lamport clock resets to 1 on server restart instead of resuming from `MAX(seq) + 1` 🔥 **FOUNDATION** — fix this first
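
The intended behaviour for bug 511 can be stated as a small pure function. This is a sketch under assumptions: `PersistedOp` and `next_seq_after_replay` are hypothetical names, `seq` is assumed to be a `u64`, and starting a fresh author at 1 is inferred from the "resets to 1" symptom, not confirmed against `crdt_state.rs`.

```rust
/// A persisted op reduced to the two fields this sketch needs.
pub struct PersistedOp {
    pub author: String,
    pub seq: u64,
}

/// Next sequence number for `own_author` after replaying the op log:
/// `MAX(seq) + 1` over that author's ops, or 1 if the author has none yet.
pub fn next_seq_after_replay(ops: &[PersistedOp], own_author: &str) -> u64 {
    ops.iter()
        .filter(|op| op.author == own_author)
        .map(|op| op.seq)
        .max()
        .map_or(1, |max_seq| max_seq + 1)
}
```

Filtering by author matters: seeding from the global maximum would still be monotonic, but it would couple this node's counter to ops written by other nodes.
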
**Stories (504-508, 512-520):**
- **504** — `update_story.front_matter` MCP schema only takes string values
- **505-508** — The 478 split-up: SignedOp wire codec, WS sync endpoint, inbound apply + causal queue, rendezvous config (478's actual code already on master via the manual squash-merge, but these stories still document the underlying chunks)
- **512** — Migrate chat commands from filesystem lookup to CRDT/DB (`move 503 done` failed today because of this)
- **513** — Startup reconcile pass for state-drift detection (scaffolding; deletes itself when migration completes)
- **514** — `delete_story` should do a full cleanup (DB row + CRDT op + worktree + timers + filesystem)
- **515** — Add a debug MCP tool to dump the in-memory CRDT
- **516** — `update_story.description` should create the section if it doesn't exist
- **517** — Remove filesystem-shadow fallback paths from `lifecycle.rs`
- **518** — `apply_and_persist` should log `persist_tx.send()` failures instead of silently dropping ops
- **519** — Mergemaster should detect "no commits ahead of master" and fail loudly instead of exiting silently and burning $0.82 per session
- **520** — 🔑 **Typed pipeline state machine in Rust** — the foundational architectural story everything else converges to. Subsumes refactor 436.
**Refactor 436** (was: "Unify story stuck states into a single status field") — marked superseded by 520 via `front_matter: superseded_by: "520"`. Its functionality is now part of `Stage::Archived { reason: ArchiveReason }` in the sketch.
## Recommended next-session priority order
1. **Fix bug 511 first** (CRDT lamport seq reset). ~30 lines in `crdt_state.rs::init()`. After CRDT replay, seed the local seq counter from `MAX(seq)` over this node's own author, so the next op is written at `MAX(seq) + 1`. Without this, CRDT replay produces broken state and 510 keeps biting.
2. **Verify the 511 fix unblocks 510.** Hypothesis: 510 (filesystem shadow split-brain) is largely a downstream symptom of 511 (replay puts ops in wrong order, in-memory state diverges, materialiser re-creates shadows from old state). If true, 510 may need only a small additional cleanup pass.
3. **Read the state machine sketch and refine it.** Specifically:
   - Verify the local-vs-syncable field partition is right
   - Confirm `Stage::Merge` and `Stage::Done` carry exactly the data we need
   - Add any missing transitions
   - Decide whether `ExecutionState` should be in the same CRDT or a separate one (we tentatively chose the same CRDT under per-node-pubkey keys, for cross-node observability and heartbeat)
4. **Land story 520** — promote the sketch to a real `server/src/pipeline_state.rs` module. Implement the projection layer (`TryFrom<&PipelineItemCrdt> for PipelineItem`).
5. **Migrate consumers one at a time** in priority order: chat commands (512) → lifecycle (517) → delete_story (514) → mergemaster precondition (519, mostly subsumed by `NonZeroU32`).
6. **Once nothing reads the loose `PipelineItemView` anymore, delete the loose API.** The CRDT looseness becomes purely an implementation detail.
7. **Then the off-leash commit forensic** — investigate `rogue-commit-2026-04-09-ac9f3ecf`. How did an agent acquire `git push` capability? What code path enabled it? File a security-critical bug.
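
The projection layer named in step 4 could take roughly the following shape. Treat it as a sketch only: the field set of `PipelineItemCrdt`, the lowercase stage strings, and `ProjectionError` are assumptions made for illustration, and the real loose view may look quite different.

```rust
use std::convert::TryFrom;

/// Loose view as it might come out of the CRDT: every field optional text.
pub struct PipelineItemCrdt {
    pub story_id: Option<String>,
    pub name: Option<String>,
    pub stage: Option<String>,
}

/// Reduced typed stage; the full enum carries data on `Merge`, `Done`, etc.
#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Current,
    Qa,
}

/// Strict type used everywhere above the projection boundary.
#[derive(Debug, Clone, PartialEq)]
pub struct PipelineItem {
    pub story_id: String,
    pub name: String,
    pub stage: Stage,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ProjectionError {
    MissingField(&'static str),
    UnknownStage(String),
}

impl TryFrom<&PipelineItemCrdt> for PipelineItem {
    type Error = ProjectionError;

    fn try_from(raw: &PipelineItemCrdt) -> Result<Self, Self::Error> {
        let story_id = raw
            .story_id
            .clone()
            .ok_or(ProjectionError::MissingField("story_id"))?;
        let name = raw
            .name
            .clone()
            .ok_or(ProjectionError::MissingField("name"))?;
        // Stage strings are hypothetical; unknown values are rejected, not defaulted.
        let stage = match raw.stage.as_deref() {
            Some("backlog") => Stage::Backlog,
            Some("current") => Stage::Current,
            Some("qa") => Stage::Qa,
            Some(other) => return Err(ProjectionError::UnknownStage(other.to_string())),
            None => return Err(ProjectionError::MissingField("stage")),
        };
        Ok(PipelineItem { story_id, name, stage })
    }
}
```

Returning an error for a malformed item, instead of silently picking a default stage, keeps merge artefacts in the loose layer visible at the boundary where consumers are migrated in step 5.
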
## What's currently weird / broken in the running system
- **`timers.json` keeps getting re-populated** even after we empty it. The cause: stopping an agent triggers the agent's exit handler, which calls the rate-limit auto-resume scheduler, which writes to `timers.json`. Bug 501 should cover this but it might need to be explicit about the stop-agent code path.
- **Chat commands can't find stories that have no filesystem shadow.** Bug 512. Workaround: use MCP `move_story` / `delete_story` / etc. directly, NOT the web UI chat commands.
- **The web UI shows stale state** for some stories because the API reads from the in-memory CRDT view, which can diverge from `pipeline_items`. This will be fixed naturally by 520 + 517 (single source of truth).
- **`create_worktree` always creates from master** — intentional design choice ("keep conflicts low") but means it can't reuse an existing feature branch's work. Bit us with 478 today.
- **Mergemaster's `merge_agent_work` exits silently** when there are no commits ahead of master — we lost ~$0.82 to one such session today. Bug 519 + the typed `NonZeroU32` constraint in story 520 will make this unrepresentable.
## Useful diagnostic recipes from today
- **View persisted CRDT ops:** `sqlite3 .huskies/pipeline.db "SELECT seq, substr(op_json, 1, 200) FROM crdt_ops ORDER BY seq DESC LIMIT 20"`
- **View in-memory CRDT pipeline state:** call `mcp__huskies__get_pipeline_status` (it goes through `crdt_state::read_all_items()`)
- **Tail server log filtered for bug 502 firings:** `tail -f .huskies/logs/server.log | grep --line-buffered "Failed to start mergemaster"`
- **Tail server log without `[pty-debug]` noise:** `tail -f .huskies/logs/server.log | grep -v "\[pty-debug\]"`
- **Check current pending timers:** `cat .huskies/timers.json`
- **Forensically delete a story across all four state stores:** stop agents → remove worktree → empty timers → `DELETE FROM pipeline_items WHERE id LIKE '<id>%'` → `DELETE FROM crdt_ops WHERE op_json LIKE '%<id>%'`
## Token cost accounting
This session burned roughly **$15-25** in agent thrash, mostly from bug 501 + bug 510 respawning agents on already-completed stories. Once 511 + 510 + 501 are fixed, that bleed disappears.
## Open questions for the next session
1. **Should `ExecutionState` live in the same CRDT or a separate one?** We tentatively said same CRDT under per-node-pubkey keys. Need to validate this against the bft-json-crdt library's actual capabilities.
2. **Heartbeat cadence?** How often should `last_heartbeat` be updated for `ExecutionState::Running`? Every 30s seems reasonable but should be config.
3. **What's the migration path from existing pipeline_items rows to typed `PipelineItem`s?** A one-time migration script, or rebuild from `crdt_ops`?
4. **Should we add `statig` after all?** Probably not for the initial implementation, but worth revisiting if we end up wanting hierarchical states (e.g., a `Working` superstate sharing transitions across active stages).
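
To make question 2 concrete, here is one way a configurable cadence could feed a staleness check. Everything beyond the `ExecutionState` variant names is an assumption: `last_heartbeat` is modelled as Unix seconds, `HeartbeatConfig` and `is_stale` are hypothetical, and the defaults (30 s interval, three missed beats) are placeholders rather than agreed values.

```rust
/// Per-node execution state; `last_heartbeat` is assumed to be Unix seconds.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ExecutionState {
    Idle,
    Pending,
    Running { last_heartbeat: u64 },
    RateLimited,
    Completed,
}

/// Hypothetical config: how often to beat and how many misses are tolerated.
pub struct HeartbeatConfig {
    pub interval_secs: u64,
    pub missed_before_stale: u64,
}

impl Default for HeartbeatConfig {
    fn default() -> Self {
        // Placeholder values, not a decision.
        HeartbeatConfig { interval_secs: 30, missed_before_stale: 3 }
    }
}

/// True when a `Running` node has been silent for longer than the allowed window.
/// Non-running states are never stale because they do not promise heartbeats.
pub fn is_stale(state: ExecutionState, now: u64, cfg: &HeartbeatConfig) -> bool {
    match state {
        ExecutionState::Running { last_heartbeat } => {
            // saturating_sub tolerates a heartbeat stamped slightly in the future.
            let silence = now.saturating_sub(last_heartbeat);
            silence > cfg.interval_secs.saturating_mul(cfg.missed_before_stale)
        }
        ExecutionState::Idle
        | ExecutionState::Pending
        | ExecutionState::RateLimited
        | ExecutionState::Completed => false,
    }
}
```

Whatever cadence is chosen, each heartbeat is also a CRDT op, so the interval trades detection latency against op-log growth.
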