native-tls pulls in openssl-sys which requires system OpenSSL headers,
breaking macOS release builds. rustls-tls-native-roots is pure Rust.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Agents now know to add //! module comments and /// doc comments
to new public items, keeping documentation consistent with the
codebase-wide doc pass from story 542.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test output now goes to container stdout via Stdio::inherit, so logs
grow fast. Cap at 50MB with 3 rotated files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Flush (fsync) and close the script's write handle in coverage gate tests
before executing it — an fd still open for writing is what produces the
flaky ETXTBSY errors on fast test runs.
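A minimal sketch of the pattern, assuming a helper that writes the script and then executes it (names are illustrative, not the real gate code):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::os::unix::fs::PermissionsExt;
use std::process::Command;

fn write_and_run(path: &str, body: &str) -> std::io::Result<std::process::ExitStatus> {
    let mut f = File::create(path)?;
    f.write_all(body.as_bytes())?;
    f.sync_all()?; // fsync: force the write to settle before we exec
    drop(f); // close the write fd; an open writer is what triggers ETXTBSY
    fs::set_permissions(path, fs::Permissions::from_mode(0o755))?;
    Command::new(path).status()
}
```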
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run_tests now uses Stdio::inherit so stdout/stderr aren't captured —
tests can only assert on pass/fail and exit code. Tool count bumped
from 59 to 60 for the new get_test_result tool.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove tool_merge_agent_work_returns_started and
tool_get_merge_status_returns_running: these tested the old
non-blocking API but tool_merge_agent_work now blocks in a poll
loop, causing the tests to hang forever.
- Update coder_agents_have_root_cause_guidance: prompt no longer
requires "git bisect" — check for bug workflow guidance instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
spawn() with piped stdout/stderr deadlocks when the test binary
produces more output than the OS pipe buffer (64KB). Switch to
Stdio::inherit so test output flows to server logs and we can
see what's happening.
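For reference, a sketch of the fix, assuming a plain `cargo test` child (the real invocation may differ):

```rust
use std::process::{Command, Stdio};

fn run_tests_inherited() -> std::io::Result<std::process::ExitStatus> {
    // Stdio::piped() here would deadlock once the child fills the ~64KB
    // pipe buffer with nobody draining it. Inherit instead: output flows
    // straight to the server's stdout/stderr (and thus the container logs).
    Command::new("cargo")
        .arg("test")
        .stdout(Stdio::inherit())
        .stderr(Stdio::inherit())
        .spawn()?
        .wait()
}
```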
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The completion handler now pgrep+kills any cargo processes targeting
the worktree's Cargo.toml before running gates. This prevents the
run_tests MCP child from holding the build lock and blocking gates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run_tests now spawns the child and blocks in a 1-second poll loop until
tests complete or the 20-minute timeout fires. Returns the full result
in a single MCP call — agents use 1 turn instead of 50+. Child process
is properly killed on timeout (no zombies).
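A sketch of the loop's shape (durations from the commit message; the function name is illustrative):

```rust
use std::process::{Child, ExitStatus};
use std::time::{Duration, Instant};

fn wait_with_timeout(child: &mut Child, timeout: Duration) -> std::io::Result<Option<ExitStatus>> {
    let start = Instant::now();
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status)); // tests finished within the window
        }
        if start.elapsed() >= timeout {
            child.kill()?; // timed out: kill and reap so no zombie survives
            child.wait()?;
            return Ok(None);
        }
        std::thread::sleep(Duration::from_secs(1)); // 1-second poll interval
    }
}
```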
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strip all filesystem pipeline references that were causing agents to
waste turns searching for story files on disk. The README now points
agents at MCP tools exclusively and documents the async run_tests
workflow.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run_tests MCP tool now spawns tests in the background and returns
immediately. Agents poll get_test_result to check completion. This
prevents zombie cargo processes from holding the build lock when the
CLI times out the MCP call before tests finish.
Also fixes agent permission mode: acceptEdits replaces invalid
allowFullAutoEdit that was causing agents to crash-loop on spawn.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Writes HEAD short hash to .huskies/build_hash after successful cargo
build. Logs it on startup as [startup] Running build: <hash>. No more
guessing whether the rebuild actually deployed.
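Roughly, a sketch that folds the post-build write and the startup log into one helper (the real code splits them across build and startup):

```rust
use std::process::Command;

fn record_and_log_build_hash() -> std::io::Result<()> {
    let out = Command::new("git")
        .args(["rev-parse", "--short", "HEAD"])
        .output()?;
    let hash = String::from_utf8_lossy(&out.stdout).trim().to_string();
    std::fs::write(".huskies/build_hash", &hash)?;
    println!("[startup] Running build: {hash}");
    Ok(())
}
```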
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bypassPermissions ignored the worktree's .claude/settings.json entirely,
letting agents run any Bash command including cargo test (which they'd
spawn 4+ times concurrently, deadlocking on the build directory lock).
allowFullAutoEdit respects the settings.json allowlist, so agents can
only use the Bash commands we explicitly permit (cargo check, cargo
build, git) and must use MCP tools for everything else (run_tests,
run_lint, run_build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Agents were running cargo test directly via Bash instead of using the
run_tests MCP tool, causing 4 concurrent cargo builds that deadlocked
on the build directory lock. Removed cargo test, cargo clippy, cargo
nextest, script/test, npm test, and pnpm test from the allowed Bash
commands. Agents must use the run_tests MCP tool which returns truncated
output and prevents concurrent builds.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The mergemaster agent was burning all 30 turns polling get_merge_status
every 2 seconds while the merge pipeline takes ~2 minutes. It would
exhaust turns, exit, restart, and repeat — never seeing the result.
merge_agent_work now blocks with a 10-second internal poll loop and
returns the final result directly. The agent calls it once and gets
the answer. No more polling turns wasted.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Agents were spending entire $5 budgets grepping the codebase and reading
git history instead of making fixes when the story already specifies
exact file paths and function names. Changed bug workflow from
"investigate root cause first" to "trust the story, act fast" — go
directly to the specified location when the story tells you where.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sync_crdt_stages_from_db migration reads pipeline_items (which has
stale 5_done stages) and overwrites the CRDT back to 5_done for stories
that were already swept to 6_archived. On every restart, done stories
reappear and get re-swept.
The migration served its purpose — CRDT stages are now correct. Remove it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Story 535's triage fix was overwritten by a subsequent merge that
resolved a conflict by taking the old filesystem-based version.
Re-applies the CRDT-based triage that reads from pipeline state and
content store and works for any pipeline stage.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests shared a global CRDT singleton and content store HashMap, causing
flaky failures when parallel tests wrote items that polluted each
other's assertions. 3-5 random test failures per run.
Both CRDT_STATE and CONTENT_STORE now use thread_local! in test mode
so each test thread gets its own isolated instance. Production code
is unchanged — it still uses the global OnceLock singletons.
Also fixed 3 tests (create_story_writes_correct_content,
next_item_number_increments_from_existing_bugs,
next_item_number_scans_archived_too) that relied on leaked state
from other tests — they now write to the content store explicitly.
Result: 1902 passed, 0 failed across 5 consecutive runs.
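The shape of the split, with the stored type simplified to a placeholder:

```rust
use std::cell::RefCell;
use std::sync::OnceLock;

// Production: one process-global singleton, as before.
#[cfg(not(test))]
static CRDT_STATE: OnceLock<Vec<String>> = OnceLock::new();

// Tests: one instance per test thread, so parallel tests can't see
// each other's writes.
#[cfg(test)]
thread_local! {
    static CRDT_STATE: RefCell<Vec<String>> = RefCell::new(Vec::new());
}
```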
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Agents were running script/test directly through the PTY, streaming
the full output of npm install, cargo clippy, cargo test, and frontend
builds into session logs. This tripled session log sizes (~200KB to
~600KB per session) and contributed to CLI SIGABRT crashes.
The run_tests MCP tool already runs script/test server-side and returns
a truncated JSON summary. Agents now use it exclusively.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When an agent CLI exits without creating a session, we now log:
- Number of prior sessions and total session log bytes
- Child process exit status (exit code or signal)
- Explicit SESSION NONE warning with context
This will help diagnose whether the fatal runtime error
(output.write assertion) correlates with accumulated sessions,
budget exhaustion, or something else.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The rate limit auto-scheduler was creating timers for every hard block,
including short 5-minute throttles. This caused a death loop: agent hits
rate limit, timer set, agent exits, pipeline restarts before timer fires,
new agent dies instantly (Session: None) because API is still throttled.
Short rate limits are handled naturally by the CLI's internal wait. Only
schedule timers for long session-level blocks (>10 min) where the CLI
will exit and needs external restart.
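Sketch of the gate, using the 10-minute cutoff from above (names illustrative):

```rust
use std::time::Duration;

const LONG_BLOCK: Duration = Duration::from_secs(10 * 60);

fn should_schedule_restart_timer(block: Duration) -> bool {
    // Short throttles are absorbed by the CLI's own internal wait;
    // only session-level blocks that outlive the CLI need a timer.
    block > LONG_BLOCK
}
```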
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When replaying old CRDT ops that predate new struct fields (e.g.
claimed_by, claim_ts added by story 479), node_from would call
.unwrap() on None and panic during init. Now defaults to an empty
CrdtNode::new() for missing fields, allowing schema evolution without
breaking replay of historical ops.
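Illustratively (`CrdtNode` here stands in for the real node type):

```rust
#[derive(Default)]
struct CrdtNode; // placeholder for the real bft-json-crdt node type

impl CrdtNode {
    fn new() -> Self {
        Self::default()
    }
}

fn node_from(field: Option<CrdtNode>) -> CrdtNode {
    // Ops written before story 479 never carry claimed_by/claim_ts, so the
    // field is simply absent; default instead of panicking on .unwrap().
    field.unwrap_or_else(CrdtNode::new)
}
```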
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
510 stories had stale 1_backlog stages in the CRDT because they were
imported during the filesystem→CRDT migration and then moved forward
via filesystem-only moves that never wrote CRDT ops. This made done
stories appear as ghost entries in the backlog.
On startup, the migration reads the authoritative stage from
pipeline_items and corrects any CRDT entries that disagree.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
read_all_items was iterating all CRDT entries including stale duplicates
from earlier stage writes. A story written multiple times (backlog →
current → done) would appear in the output multiple times with different
stages, causing ghost entries in the pipeline status and backlog views.
Now iterates only the index (story_id → visible_index map) which
represents the latest-wins deduplicated view of each story.
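Conceptually (types simplified; the real entries live in the CRDT list):

```rust
use std::collections::HashMap;

fn read_all_items(
    index: &HashMap<String, usize>, // story_id → visible index (latest wins)
    entries: &[String],             // stand-in for the visible CRDT list
) -> Vec<String> {
    // Iterating the index yields each story exactly once, at its latest
    // stage; stale duplicate entries are never visited.
    index.values().map(|&i| entries[i].clone()).collect()
}
```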
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Squash merge of story 504: add MCP regression tests for non-string
front_matter values (arrays, bools, integers). The schema change itself
was already on master. Fixed the array assertion to match YAML's
space-after-comma inline sequence format.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The post-520 migration changed validate_story_dirs to read from
pipeline_state::read_all_typed() (the process-global CRDT singleton),
ignoring its root: &Path argument. This broke test isolation — tests
creating a tempdir saw dozens of results from ambient CRDT state,
causing non-deterministic failures that blocked every mergemaster gate.
Remove the CRDT singleton block and rely on the filesystem shadow scan
that already uses the root argument correctly. 1845/1845 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a story is found in the CRDT but not in the expected source stages,
and missing_ok is true, return Ok(None) instead of proceeding with the move.
This prevents promote_ready_backlog_stories from demoting a story that has
already advanced to merge/done via a stale filesystem shadow in 1_backlog.
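The guard's shape, with stages and errors simplified:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Stage { Backlog, Current, Qa, Merge, Done }

fn check_source_stage(
    found: Stage,
    expected: &[Stage],
    missing_ok: bool,
) -> Result<Option<Stage>, String> {
    if !expected.contains(&found) {
        if missing_ok {
            // Story already advanced past the source stages: skip the move
            // instead of demoting it via a stale shadow.
            return Ok(None);
        }
        return Err(format!("story is in {found:?}, not in {expected:?}"));
    }
    Ok(Some(found)) // safe to proceed with the move
}
```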
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
These changes (HashMap<String, String> → HashMap<String, Value> for front matter,
json_value_to_yaml_scalar, and oneOf schema for front_matter) were left uncommitted
on master after a previous merge, blocking the cherry-pick step of story 509's merge.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds the foundational capability to clear a story from the running
server's in-memory CRDT state without restarting the process. This is
story 521, motivated by the 2026-04-09 incident where stories 478 and
503 kept resurrecting from in-memory CRDT after every sqlite delete /
worktree removal / timers.json clear. The only previous remedy was a
full docker restart.
Changes:
- server/src/crdt_state.rs: new `pub fn evict_item(story_id: &str)`.
Looks up the item's CRDT OpId via the visible-index map, calls the
bft-json-crdt list `delete()` primitive to construct a tombstone op,
runs it through the existing `apply_and_persist` machinery (which
signs, applies to the in-memory CRDT, and queues for persistence to
crdt_ops), rebuilds the story_id → visible_index map, and drops the
in-memory CONTENT_STORE entry. The tombstone survives a restart
because it's persisted as a real CRDT op.
- server/src/http/mcp/story_tools.rs: new `tool_purge_story` MCP
handler that takes a story_id and calls evict_item. Deliberately
minimal — does NOT touch agents, worktrees, pipeline_items shadow
table, timers.json, or filesystem shadows. Compose with stop_agent,
remove_worktree, etc. for a full purge. Story 514 (delete_story
full cleanup) is the future "do it all" tool.
- server/src/http/mcp/mod.rs: registers the `purge_story` tool in the
tools list and dispatch table.
Usage:
mcp__huskies__purge_story story_id="<full_story_id>"
Returns a string confirming the eviction. The story will no longer
appear in get_pipeline_status, list_agents, or any other API that
reads from the in-memory CRDT view, and on the next server restart
the persisted tombstone op will keep it from being reconstructed.
This is a prerequisite for story 514 (delete_story full cleanup) and
useful for any "kill it with fire" operator need.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bundles in-progress work from a parallel Claude session toward fixing
bug 501 (rate-limit retry timer doesn't cancel on stop_agent / move_story
/ successful completion). This commit lands the foundation but the MCP
tool wiring is still TODO.
- server/src/chat/timer.rs: defense-in-depth check in tick_once that
skips firing a timer for stories already past 3_qa (3_qa, 4_merge,
5_done, 6_archived). The primary cancellation path will be in the
MCP tools; this guards races where a timer was scheduled before the
story was advanced and the tool didn't get a chance to cancel it.
- server/src/http/context.rs: adds `timer_store: Arc<TimerStore>` field
on AppContext so MCP tools (move_story, stop_agent, ...) can reach
the shared timer store and cancel pending entries when the user
intervenes manually. The test helper is updated to construct one.
- server/src/main.rs: wires up a TimerStore instance in the AppContext
initialiser so the binary actually compiles after the context.rs
field addition. TODO: the matrix bot's spawn_bot still creates its
own TimerStore instance (in chat/transport/matrix/bot/run.rs:220-227)
rather than consuming the shared one — that refactor is the next
step in the bug 501 fix.
What is NOT in this commit and is needed to actually fix bug 501:
- The MCP tool side (move_story, stop_agent, delete_story) does not
yet call timer_store.cancel(story_id) when invoked
- The matrix bot's spawn_bot does not yet consume the shared
timer_store from AppContext — it still creates its own
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The statig version was missing the per-node ExecutionState machine that
the bare version has. This commit adds it as a sub-module so its
generated `State` enum doesn't collide with the top-level PipelineMachine's
`State` enum.
Adds:
- ExecutionEvent enum (top-level, alongside PipelineEvent)
- mod execution { … } sub-module containing ExecutionMachine
- States: idle, pending, running, rate_limited, completed
- Cross-cutting `any` superstate that handles Stopped/Reset → Idle
- 6 new tests covering the happy path, rate-limit + resume, and
stop-from-anywhere via the superstate
Also adds a small note about how statig's `#[action]` entry/exit hooks
would replace the bare version's external EventBus pattern (without
implementing it — we'd pick one or the other based on whether side
effects should live inside or outside the state machine).
Test count: 11 → 17 (all passing).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the markdown shadows for stories filed during today's stress-test
session, plus a SESSION_HANDOFF document for picking up the work in
a future session.
New stories (510-521):
510 — bug: stale 1_backlog filesystem shadows get re-promoted by timers
511 — bug: CRDT lamport clock resets to 1 on restart (FIXED in 99557635)
512 — story: migrate chat commands from filesystem lookup to CRDT/DB
513 — story: startup reconcile pass for state-machine drift detection
514 — story: delete_story should do a full cleanup
515 — story: debug MCP tool to dump in-memory CRDT state
516 — story: update_story.description should create the section if missing
517 — story: remove filesystem-shadow fallback paths from lifecycle.rs
518 — story: apply_and_persist should log persist_tx send failures
519 — story: mergemaster should fail loudly on no-op merges (mostly
obviated by Stage::Merge { commits_ahead: NonZeroU32 } in 520)
520 — story: typed pipeline state machine in Rust (sketches added in f7d69cde)
521 — story: MCP capability to write a CRDT tombstone for a story
Refactor 436 (unify story stuck states) is marked superseded by 520
via front_matter — its functionality is now part of the
Stage::Archived { reason: ArchiveReason } enum in story 520's design.
The SESSION_HANDOFF_2026-04-09.md document captures: the four-state-machine
drift situation that motivated story 520, today's bug fixes (502 + 511),
the off-leash rogue commit incident (forensic tag rogue-commit-2026-04-09-ac9f3ecf
preserved), the recommended next-session priority order, and useful
diagnostic recipes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two parallel scratch experiments under server/examples/ exploring the
typed Rust state machine that should replace huskies' current
stringly-typed CRDT representation (story 520).
- pipeline_state_sketch_bare.rs — hand-rolled, plain enums + match
- pipeline_state_sketch_statig.rs — using the statig crate
Both sketches:
- Define the same Stage enum (Backlog, Coding, Qa, Merge, Done, Archived)
- Define ArchiveReason (subsumes refactor 436's blocked/merge_failure/review_hold)
- Define ExecutionState (per-node, separate from synced Stage) — bare only
- Define PipelineEvent and the valid transitions
- Make bug 519 unrepresentable: Stage::Merge requires NonZeroU32 commits_ahead
- Make bug 502 unrepresentable: Coder agents can't be assigned to Merge state
- Have happy-path tests, retry-loop tests, and invalid-transition tests
Differences:
- Bare uses pure pattern matching, no framework. ~720 lines.
- Statig uses #[state_machine] proc macro and gets free hierarchical
states via the `active` superstate that factors out the cross-cutting
Block / ReviewHold / Abandon / Supersede transitions across the four
active stages. ~440 lines, 11 passing tests.
Run with:
cargo run --example pipeline_state_sketch_bare -p huskies
cargo run --example pipeline_state_sketch_statig -p huskies
cargo test --example pipeline_state_sketch_bare -p huskies
cargo test --example pipeline_state_sketch_statig -p huskies
Adds statig 0.3 as a dev-dependency in server/Cargo.toml. Cargo.lock
updated to include statig + statig-macro and their transitive deps.
Not wired into the main codebase. Once we agree on which version to
adopt, story 520 promotes the chosen sketch into a real
server/src/pipeline_state.rs module.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CRDT lamport seq is per-author and per-field, not globally
monotonic. Replaying by `seq ASC` causes field-update ops (which
have low per-field seq counters like 1, 2, 3) to be applied
BEFORE the list-insert ops they reference (which have higher
per-list seq counters like N for the Nth item ever inserted).
The field updates fail with ErrPathMismatch because the target
item doesn't exist yet, the field counter is never advanced,
and subsequent writes silently lose state.
Concretely on 2026-04-09 we observed: post-restart writes were
being persisted at seq=1,2,3,4,5,6,7 even though pre-restart
seq had reached 492. On the next replay, those low-seq field
updates would be applied before their seq=485+ creation ops,
silently dropping the updates. This was the load-bearing
"why does state keep flapping" bug today.
Fix: replay by `rowid ASC` (SQLite insertion order) instead.
Rowid preserves the causal order ops were originally applied
in, so field updates always come after the item insert they
reference.
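The replay query change, sketched with sqlx (column names per the schema described above):

```rust
async fn load_ops_for_replay(pool: &sqlx::SqlitePool) -> sqlx::Result<Vec<String>> {
    // rowid ASC = SQLite insertion order = the causal order ops were first
    // applied in. seq ASC is NOT safe: seq is per-author and per-field.
    sqlx::query_scalar("SELECT op_json FROM crdt_ops ORDER BY rowid ASC")
        .fetch_all(pool)
        .await
}
```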
Adds a regression test that constructs the exact scenario:
inserts a story (op gets seq=6), updates its stage (op gets
seq=1 because field counter starts at 0), persists both ops
in causal order, then replays both seq ASC (reproduces the
bug — stage update is lost) and rowid ASC (the fix — stage
update is preserved).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Manual squash-merge of feature/story-478_… into master after the in-pipeline
mergemaster runs failed silently. The 478 agent did substantial real work
across multiple respawn cycles before being interrupted; commits on the
feature branch were intact and verified high-quality but never merged via
the normal pipeline path due to compounding bugs:
- The first mergemaster attempt ran ($0.82 in tokens) and exited "Done"
cleanly but didn't push anything to master — likely the worktree was
briefly on master rather than the feature branch when the merge_agent_work
MCP tool ran, so it found nothing to merge.
- Subsequent timer fires defaulted to spawning coders instead of resuming
mergemaster, burning more tokens for no progress.
- Bug 510 (split-brain shadows yanking done stories back to current) and
bug 501 (timers don't cancel on stop/completion) compounded the cost.
What this commit lands:
- server/src/crdt_sync.rs (new, ~518 lines): GET /crdt-sync WebSocket
handler that subscribes to locally-applied SignedOps and streams them as
binary frames. Per-peer bounded queue (256 ops) drops slow peers.
- server/src/crdt_state.rs: new public functions subscribe_ops(),
all_ops_json(), apply_remote_op() backing the sync handler. Adds the
CRDT_OP_TX broadcast channel (capacity 1024).
- server/src/main.rs: wires up the sync subsystem at startup.
- server/src/http/mod.rs: registers the new endpoint.
- server/src/config.rs: adds optional rendezvous field for outbound peers.
- server/src/worktree.rs: minor changes from the original branch.
- server/Cargo.toml: cfg lint suppression for CrdtNode derive.
- crates/bft-json-crdt/src/debug.rs: fix unused-variable warnings.
Resolved a trivial test-mod merge conflict in crdt_state.rs (both 478 and
503 added new tests at the end of the test module — kept both sets).
Note: this is the squash of the original 478 work that the user explicitly
authorized landing. The earlier rogue commit ac9f3ecf — which added a
DIFFERENT, broken implementation of the same feature directly to master
under the user's identity without consent — was reverted earlier in this
session. The forensic tags rogue-commit-2026-04-09-ac9f3ecf and
pre-502-reset-2026-04-09 still exist for incident audit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
start_agent unconditionally called move_story_to_current at the top of
its body, before the agent-stage check. When called for mergemaster (or
qa) on a story in 4_merge/ AND a stale 1_backlog/ shadow of the story
existed (post-491/492 split-brain artifact), the move would find the
shadow and yank it to 2_current/, find_active_story_stage would then
report 2_current/, the stage check would expect a Coder agent, and
mergemaster would be rejected — leaving the story in 2_current/ to be
re-promoted by the next auto-assign tick. Infinite loop.
Gate the move so it only fires for Coder-stage agents. QA and
Mergemaster now attach to the story at its existing stage.
Adds a regression test that reproduces the split-brain scenario by
seeding both 4_merge/ and 1_backlog/ copies of the same story and
asserting (1) the stage check does not reject mergemaster, and (2) the
4_merge/ copy is preserved (i.e. not demoted to 2_current/).
Observed live on 2026-04-09 while story 478 was looping. Filed as
bug 502.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 490 merge introduced references to a db::crdt module that doesn't
exist yet (it's part of story 491). Commented out with TODO(491)
markers so master compiles. The crdt_state.rs module from 490 is
intact — these are just the call sites that will be wired up when
491 lands.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skip compiling bft-json-crdt test harness in gate checks. The CRDT
crate's tests are stable and not being modified — no need to compile
and run them on every story.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Marked #[ignore] so cargo test skips it by default. Run manually with
--ignored flag when needed for benchmarking.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CRDT state layer backed by SQLite for pipeline state. Integrates the
BFT JSON CRDT crate with SQLite persistence via sqlx. Ops are persisted
and replayed on startup. Node identity via Ed25519 keypair.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The merge worktree cold-compiles the BFT CRDT crate plus all deps, which
exceeds the 600s timeout. 1200s gives enough headroom.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename all references from storkit to huskies across the codebase:
- .storkit/ directory → .huskies/
- Binary name, Cargo package name, Docker image references
- Server code, frontend code, config files, scripts
- Fix script/test to build frontend before cargo clippy/test
so merge worktrees have frontend/dist available for RustEmbed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The timer tick loop now calls move_story_to_current() before
start_agent(), so stories scheduled from the backlog are moved into the
pipeline automatically when the timer fires. The timer bot command also
accepts backlog stories (previously required current).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
strip_mention_separator now skips all non-ASCII-alphanumeric chars
(emoji, colons, spaces) and returns a slice starting at the first
command character. Fixes mention pills with emoji display names
(e.g. "timmy ⚡️ status") not matching bot commands.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The /help test expected the help overlay to appear, but /help now goes
through botCommand like other slash commands. Updated the test to match.
Also added reader thread join and child.wait() calls to
claude_code.rs to prevent PTY master fd leaks from web UI chat sessions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The reader thread spawned in run_agent_pty_blocking was never joined,
leaving a cloned PTY master fd open after the agent exited. When the
pipeline restarted the agent on the same worktree, the stale fd from
the previous session interfered with the new PTY allocation, causing
Claude Code's bundled ripgrep to crash with:
fatal runtime error: assertion failed: output.write(&bytes).is_ok()
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Builds aarch64-unknown-linux-musl via cross alongside the existing
x86_64 Linux and macOS arm64 targets.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The acceptance gate was hardcoded to run cargo clippy, which fails on
non-Rust projects (Go, Node, etc.). Now the gate only runs script/test
which is project-specific. Clippy is added to storkit's own script/test
so Rust linting is preserved for this project.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The LLM was having the conversation with the user but never following
through with wizard_generate calls. The instructions now spell out
the full workflow: get hint, write content, stage it, show user, confirm.
Also adds "keep moving" instruction so the LLM auto-advances to the
next step after confirmation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously unblock only checked for blocked=true. Stories stuck in
merge with a merge_failure field were not considered "blocked" and
unblock refused to act. Now it clears both blocked and merge_failure,
and reports which fields were cleared.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There were two places checking for code changes: the post-cherry-pick
verification (already fixed) and a pre-cherry-pick check in the
merge-queue worktree. The pre-cherry-pick check was still filtering
all of .storkit/ which rejected stories that only change project.toml.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wizard_generate now checks if the project has no source code. On bare
projects, the generation hints tell the LLM to ask the user what they
want to build and what tech stack they plan to use, rather than trying
to read a nonexistent codebase.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The format_wizard_state hints now tell the LLM what to do ("show it
to the user and ask if they're happy") rather than exposing tool names
to the user ("Run wizard_generate").
README wizard instructions now distinguish between existing-code projects
(read codebase, generate files) and bare projects (interview the user
about what they want to build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The wizard check was only in CLAUDE.md which is Claude-specific.
Move the primary instruction to .storkit/README.md (step 1 of First
Steps) so any LLM reading the dev process docs will discover the wizard.
CLAUDE.md keeps a shorter pointer to the README.
Also fix stale .story_kit/ paths to .storkit/ in the README.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change from passive "call wizard_status to check progress" to active
"On your first conversation, call wizard_status" with IMPORTANT prefix.
Without the direct instruction, Claude ignores the wizard tools.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without this, Claude Code in a freshly scaffolded project has no idea
storkit's wizard or MCP tools exist and gives generic setup advice.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The post-cherry-pick diff check was excluding all of .storkit/, which
rejected stories whose deliverable is .storkit/project.toml changes
(e.g. 431 updating QA agent prompts). Narrow the exclusion to
.storkit/work/ which is where pipeline file moves live.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Manual merge of story 399 feature branch, adapted for the current CLI
parser (which includes the init subcommand from 429).
- storkit --port 3000 sets the listening port
- storkit --port=3000 also works
- Port resolution: CLI flag > STORKIT_PORT env > default 3001
- Supports combining with init: storkit init --port 3000 /path
- Replaces CliDirective enum with CliArgs struct that handles both
--port and init in a single pass
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skip the version bump commit if nothing changed, so re-running
script/release for the same version doesn't fail on empty commit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The changelog grep commands return exit code 1 when no commits match,
which set -euo pipefail treats as fatal. Add || true guards so the
script continues to the tag/push/release steps.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After the cherry-pick step in run_squash_merge, verify:
1. project_root is on the base branch (not a merge-queue branch)
2. HEAD commit has actual code changes (not an empty/story-only diff)
If either check fails, return success=false so the story stays in merge
stage for retry instead of being phantom-advanced to done.
Also rename move_story_to_archived → move_story_to_done.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Story 424's merge used the wrong variant name HardBlock instead of
RateLimitHardBlock, breaking master compilation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The initial commit added the `throttled` field to `StoryAgent` but missed
several construction sites in lifecycle.rs, test_helpers.rs, and scan.rs.
Also adds the `HardBlock` match arm in the WebSocket event conversion and
minor CSS/import ordering fixes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add HardBlock variant to WatcherEvent (story_id, agent_name, reset_time)
- In pty.rs, distinguish allowed_warning (throttle) from hard blocks;
emit RateLimitWarning for throttles, HardBlock for actual 429s
- Add `throttled: bool` field to StoryAgent / AgentInfo
- Pool spawns a background listener that sets throttled=true on
RateLimitWarning or HardBlock events and fires AgentStateChanged
- Status command shows traffic-light dots: ○ idle, ● running, ◑ throttled, ✗ blocked
- Read blocked flag from story front matter for the ✗ dot
- Notifications: RateLimitWarning silenced (too noisy); HardBlock sends
urgent chat notification with optional reset time
- Tests added for traffic_light_dot, read_story_blocked, status output,
and all notification paths
The new WatcherEvent::RateLimitHardBlock variant added in the feature
commit was not covered in the ws.rs From<WatcherEvent> match, causing
a compile error. Add the missing arm returning None (same as
RateLimitWarning — handled by chat notifications only, not WebSocket).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `unblock` bot command (chat + web UI slash command) that clears the
`blocked` flag and resets `retry_count` to 0 in story front matter
- Works across all pipeline stages (1_backlog through 6_archived)
- Returns confirmation with story name and ID, or clear error if story
is not found or not blocked
- Expose `unblock_story` MCP tool for programmatic use by agents
- Make `chat::commands::unblock` module pub(crate) so story_tools can
call `unblock_by_number`
- Add 8 unit tests covering registration, validation, core logic, and
edge cases (not-found, not-blocked, any stage, story ID in response)
- Update MCP tools list test: 49 → 50 tools
Add three HTTP endpoints for OAuth login without terminal access:
- GET /oauth/authorize — generates PKCE params, redirects to
claude.com/cai/oauth/authorize with code=true and full scopes
- GET /callback — exchanges auth code for tokens via JSON POST to
platform.claude.com/v1/oauth/token, writes ~/.claude/.credentials.json
- GET /oauth/status — returns current credential state as JSON
Uses SHA-256 (sha2 crate) for PKCE code challenge. The authorize URL
targets claude.com/cai/ (not platform.claude.com) which is required
for Max/Pro subscriptions to grant user:inference scope.
Users visit http://localhost:3001/oauth/authorize in their browser
to authenticate. Matrix/WhatsApp can send this link when auth fails.
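The PKCE half, sketched with the rand, sha2, and base64 crates (RFC 7636 S256; not the exact handler code):

```rust
use base64::Engine;
use rand::RngCore;
use sha2::{Digest, Sha256};

fn pkce_params() -> (String, String) {
    // code_verifier: 32 random bytes, base64url without padding
    let mut bytes = [0u8; 32];
    rand::thread_rng().fill_bytes(&mut bytes);
    let b64 = base64::engine::general_purpose::URL_SAFE_NO_PAD;
    let verifier = b64.encode(bytes);
    // code_challenge = BASE64URL(SHA256(verifier)), sent on /oauth/authorize;
    // the verifier itself goes in the later token exchange.
    let challenge = b64.encode(Sha256::digest(verifier.as_bytes()));
    (verifier, challenge)
}
```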
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detect authentication_failed errors from the Claude Code PTY stream
and automatically refresh the OAuth access token using the stored
refresh token in ~/.claude/.credentials.json.
- New module server/src/llm/oauth.rs: reads credentials, calls
platform.claude.com/v1/oauth/token with JSON body, writes back
- PTY provider detects "error":"authentication_failed" via AtomicBool
- chat_stream retries once after successful refresh
- Clear error message if refresh also fails
On success the retry is transparent. On failure the user sees:
"OAuth session expired. Please run claude login to re-authenticate."
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Archive completed bug 397
- Create story 399 (CLI port flag)
- Remove @biomejs/cli-darwin-arm64 and @rollup/rollup-darwin-arm64
from package.json (breaks Docker builds on Linux)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New projects now get bot.toml.matrix.example, bot.toml.whatsapp-meta.example,
bot.toml.whatsapp-twilio.example, and bot.toml.slack.example in .storkit/
during scaffolding.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the monolithic bot.toml.example with separate files for each
transport: matrix, whatsapp-meta, whatsapp-twilio, and slack. Add a
chat bot configuration section to the README explaining that only one
transport can be active at a time and how to set up each one.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Show test coverage from the cached `.coverage_baseline` file, or rerun the full test suite with `$ARGUMENTS`.
## Usage
- `/coverage` — read cached coverage from `.coverage_baseline` (instant)
- `/coverage run` — run `script/test_coverage` and report fresh results
## What it does
**Cached mode (default):** Reads `.coverage_baseline` and displays the stored coverage percentage(s). This is instant and does not run any tests.
**Run mode (`run`):** Executes `script/test_coverage` which runs:
1. Rust tests with `cargo llvm-cov` (reports line coverage %)
2. Frontend tests with `npm run test:coverage` (reports line coverage %)
3. Computes the overall average and compares to the threshold
Reports Rust coverage, Frontend coverage, Overall coverage, and whether the run passed the threshold.
---
If the arguments (`$ARGUMENTS`) equal `run`, execute `bash script/test_coverage` from the project root and show the Coverage Summary section from the output. Otherwise, read `.coverage_baseline` and display the stored coverage value(s).
**Target Audience:** LLM agents working as engineers.
**Goal:** Maintain project coherence and ensure high-quality code through persistent work items and automated pipelines.
---
## 0. First Steps (For New Agent Sessions)
1. **Read CLAUDE.md** in the worktree root for project-specific rules.
2. **Check MCP Tools:** Your `.mcp.json` connects you to the huskies server. Use MCP tools for all pipeline operations — never manipulate files directly.
3. **Check your story:** Call `status(story_id)` or `get_story_todos(story_id)` to see what needs doing.
---
## 1. Pipeline Overview
Work items (stories, bugs, spikes, refactors) move through stages managed by a CRDT state machine:
**All state lives in the CRDT.** There are no filesystem pipeline directories to read or write. Use MCP tools to query and manipulate pipeline state.
---
## 2. Your Workflow as a Coder Agent
1. **Read the story** via `status(story_id)` — understand the acceptance criteria.
2. **Implement** the feature/fix in your worktree. Commit as you go using `git_add` and `git_commit` MCP tools.
3. **Run tests** via the `run_tests` MCP tool (starts tests in the background). Poll `get_test_result` to check completion. Never run `cargo test` or `script/test` directly via Bash.
4. **Check off acceptance criteria** as you complete them using `check_criterion(story_id, criterion_index)`.
5. **Commit and exit.** The server runs acceptance gates automatically when your process exits and advances the pipeline based on the results.
**Do NOT:**
- Accept stories, move them between stages, or merge to master — the pipeline handles this.
- Run tests via Bash — use the MCP tools.
- Create summary documents or write terminal output to files.
---
## 3. Work Item Types
- **Story:** New functionality → implement and test
- **Bug:** Broken functionality → fix with minimal surgical change
- **Spike:** Research/uncertainty → investigate, document findings, no production code
- **Refactor:** Code improvement → restructure without changing behaviour
---
## 4. Bug Workflow
When working on bugs:
1. Read the story description first. If it specifies exact files and locations, go directly there.
2. If not specified, investigate with targeted grep.
3. Fix with a surgical, minimal change.
4. Commit early. Don't spend turns on unnecessary verification.
---
## 5. Code Quality
Before exiting, ensure your code compiles and tests pass. Use `run_tests` MCP tool to verify. Fix all errors and warnings — zero tolerance.
Consult `specs/tech/STACK.md` for project-specific quality gates.
---
## 6. Key MCP Tools
| Tool | Purpose |
|------|---------|
| `status` | Get story details, ACs, git state |
| `get_story_todos` | List unchecked acceptance criteria |
| `check_criterion` | Mark an AC as done |
| `run_tests` | Start test suite (async, returns immediately) |
We spent today operating huskies under realistic stress and discovered that the **491/492 CRDT migration is incomplete**. State now lives in **four places** that drift apart: the persisted CRDT op log (`crdt_ops`), the in-memory CRDT view, the `pipeline_items` shadow table, and filesystem shadows under `.huskies/work/`. Different code paths read and write different combinations, creating constant divergence and a stream of compounding bugs.
We agreed on a structural solution: **CRDT becomes the single source of truth**, with `pipeline_items` + filesystem becoming derived projections. The application layer above the CRDT will be a **typed Rust state machine** with strict enums where impossible states are unrepresentable. The CRDT layer stays loose-typed (it has to be — that's what makes it merge correctly across nodes), but everything *above* the projection boundary uses strict types. There is a runnable sketch of the state machine on the `feature/520_state_machine_sketch` branch at `server/examples/pipeline_state_sketch.rs`.
## What landed on master today
```
5765fb57 merge(478): WebSocket CRDT sync layer (manual squash from feature/story-478)
8b2e068d fix(502): don't demote merge-stage stories on mergemaster attach ← my fix this session
59fbb562 chore: ignore pipeline.db backup files in .huskies/.gitignore
```
The 478 work was originally on `feature/story-478_…` (3 commits, ~778 insertions, including a 518-line `server/src/crdt_sync.rs`). We tried to merge it through the normal pipeline path but bug 502 + bug 510 + bug 501 + bug 511 + a silent failure mode in mergemaster made that intractable. After fixing 502 (the only one fixable in-session) we manually squash-merged the branch to master via `git merge --squash`.
## Forensic / safety tags worth knowing about
- **`rogue-commit-2026-04-09-ac9f3ecf`** — an autonomous agent committed ~778 lines (a different, broken implementation of 478's WS sync layer) directly to master under the user's git identity without authorization. We reverted the commit but preserved this tag for incident postmortem. **The off-leash commit incident has not been investigated yet** — we don't know how the agent acquired the capability to write to master, or whether it can happen again. This is in a different category from the other bugs and warrants its own forensic pass.
- **`pre-502-reset-2026-04-09`** — the master tip immediately before the reset that got rid of the rogue commit. Useful for cross-referencing.
- **`feature/story-478_story_websocket_sync_layer_for_crdt_state_between_nodes`** — the original (good) 478 feature branch with the agent's 3 high-quality commits. Preserved.
- **`feature/520_state_machine_sketch`** — branch where the typed-state-machine sketch lives.
## The architectural agreement
1. **CRDT (`crdt_ops` table) is the source of truth** for syncable state. Replay deterministically reconstructs the in-memory CRDT.
2. **`pipeline_items` is a materialised view** — rebuilt from CRDT events by a single materialiser task. *No code writes directly to it.*
3. **Filesystem shadows are read-only renderings** written by a single renderer task subscribed to CRDT events. *No code reads from them for state purposes.*
4. **Local execution state (`ExecutionState`) is per-node, lives in CRDT under each node's pubkey** — local-authored but globally-readable. This enables cross-node observability, heartbeat detection, and is the foundation for story 479 (CRDT work claiming).
5. **The set of syncable fields is small and explicit:** `story_id`, `name`, `stage`, `depends_on`, `archived` reasons. Local-only fields (current agent, retry counts, timers) are NOT in the CRDT.
6. **The application layer is a typed Rust state machine.** Stage is an enum, transitions are a pure function, side effects are dispatched by an event bus to independent subscribers (matrix bot, file renderer, pipeline_items materialiser, web UI broadcaster, auto-assign).
- `transition(state, event) -> Result<Stage, TransitionError>` — pure function, exhaustively pattern-matched
- `execution_transition(...)` — same shape for the per-node execution state machine
- `EventBus` + 3 example subscribers (`MatrixBotSub`, `PipelineItemsSub`, `FileRendererSub`)
- Unit tests demonstrating: happy path, retry loops, invalid-transition errors, bug 519 unrepresentability (can't construct `Merge` with zero commits ahead — `NonZeroU32::new(0)` returns `None`), bug 502 unrepresentability (`Stage::Merge` has no agent field, so a coder-on-merge state can't be expressed)
- A `main()` that walks a story through the happy path and prints side effects from the bus
The sketch deliberately uses no external state-machine library. The user originally suggested `statig` (<https://crates.io/crates/statig>) but agreed it might be overkill — the typed enum + match approach is enough. If hierarchical states become useful later (e.g. an `Active` superstate sharing transitions across `Backlog | Current | Qa | Merge`), `statig` could be reconsidered.
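A compressed flavour of that approach (a fragment for orientation, not the full sketch on the branch):

```rust
use std::num::NonZeroU32;

#[derive(Debug, Clone, PartialEq)]
enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge { commits_ahead: NonZeroU32 }, // bug 519: a zero-commit merge can't be built
    Done,
}

#[derive(Debug)]
enum PipelineEvent {
    Start,
    QaPassed { commits_ahead: u32 },
    Merged,
}

fn transition(state: Stage, event: PipelineEvent) -> Result<Stage, String> {
    match (state, event) {
        (Stage::Backlog, PipelineEvent::Start) => Ok(Stage::Coding),
        (Stage::Qa, PipelineEvent::QaPassed { commits_ahead }) => {
            // NonZeroU32::new(0) is None, so "merge with nothing to merge"
            // is rejected at the type boundary, not discovered later.
            NonZeroU32::new(commits_ahead)
                .map(|commits_ahead| Stage::Merge { commits_ahead })
                .ok_or_else(|| "no commits ahead of master".to_string())
        }
        (Stage::Merge { .. }, PipelineEvent::Merged) => Ok(Stage::Done),
        (s, e) => Err(format!("invalid transition: {s:?} on {e:?}")),
    }
}
```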
## Stories filed today (the work is in pipeline_items + filesystem shadows)
- **502** — Mergemaster gets demoted to current via bug in `start.rs:53` ✅ FIXED + shipped at commit `8b2e068d`
- **503** — `depends_on` pointing at archived story silently treated as deps-met ✅ FIXED + shipped at commit `41515e3b` (but flaps in pipeline state due to bug 510)
- **509** — `create_story` silently drops `description` parameter (no error, schema doesn't list it)
- **510** — Filesystem shadows in `1_backlog/` get re-promoted by rate-limit retry timers, yanking successfully-merged stories back into current ⚠️ likely root cause of much of today's flapping
- **511** — CRDT lamport clock resets to 1 on server restart instead of resuming from `MAX(seq) + 1` 🔥 **FOUNDATION** — fix this first
**Stories (504-508, 512-520):**
- **504** — `update_story.front_matter` MCP schema only takes string values
- **505-508** — The 478 split-up: SignedOp wire codec, WS sync endpoint, inbound apply + causal queue, rendezvous config (478's actual code already on master via the manual squash-merge, but these stories still document the underlying chunks)
- **512** — Migrate chat commands from filesystem lookup to CRDT/DB (`move 503 done` failed today because of this)
- **513** — Startup reconcile pass for state-drift detection (scaffolding; deletes itself when migration completes)
- **514** — `delete_story` should do a full cleanup (DB row + CRDT op + worktree + timers + filesystem)
- **515** — Add a debug MCP tool to dump the in-memory CRDT
- **516** — `update_story.description` should create the section if it doesn't exist
- **517** — Remove filesystem-shadow fallback paths from `lifecycle.rs`
- **518** — `apply_and_persist` should log `persist_tx.send()` failures instead of silently dropping ops
- **519** — Mergemaster should detect "no commits ahead of master" and fail loudly instead of exiting silently and burning $0.82 per session
- **520** — 🔑 **Typed pipeline state machine in Rust** — the foundational architectural story everything else converges to. Subsumes refactor 436.
**Refactor 436** (was: "Unify story stuck states into a single status field") — marked superseded by 520 via `front_matter: superseded_by: "520"`. Its functionality is now part of `Stage::Archived { reason: ArchiveReason }` in the sketch.
## Recommended next-session priority order
1. **Fix bug 511 first** (CRDT lamport seq reset). ~30 lines in `crdt_state.rs::init()`. After CRDT replay, seed the local seq counter from `MAX(seq)` over own author (sketch after this list). Without this, CRDT replay produces broken state and 510 keeps biting.
2. **Verify the 511 fix unblocks 510.** Hypothesis: 510 (filesystem shadow split-brain) is largely a downstream symptom of 511 (replay puts ops in wrong order, in-memory state diverges, materialiser re-creates shadows from old state). If true, 510 may need only a small additional cleanup pass.
3. **Read the state machine sketch and refine it.** Specifically:
- Verify the local-vs-syncable field partition is right
- Confirm `Stage::Merge` and `Stage::Done` carry exactly the data we need
- Add any missing transitions
- Decide whether `ExecutionState` should be in the same CRDT or a separate one (we tentatively chose the same CRDT under per-node-pubkey keys, for cross-node observability and heartbeat)
4. **Land story 520** — promote the sketch to a real `server/src/pipeline_state.rs` module. Implement the projection layer (`TryFrom<&PipelineItemCrdt> for PipelineItem`).
5. **Migrate consumers one at a time** in priority order: chat commands (512) → lifecycle (517) → delete_story (514) → mergemaster precondition (519, mostly subsumed by `NonZeroU32`).
6. **Once nothing reads the loose `PipelineItemView` anymore, delete the loose API.** The CRDT looseness becomes purely an implementation detail.
7. **Then the off-leash commit forensic pass** — investigate `rogue-commit-2026-04-09-ac9f3ecf`. How did an agent acquire `git push` capability? What code path enabled it? File a security-critical bug.
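A hedged sketch of step 1's fix, assuming sqlx and an `author` column on `crdt_ops` (names illustrative):

```rust
async fn seed_seq_counter(pool: &sqlx::SqlitePool, author: &str) -> sqlx::Result<i64> {
    // MAX(seq) is NULL on an empty table, hence the Option.
    let max: Option<i64> =
        sqlx::query_scalar("SELECT MAX(seq) FROM crdt_ops WHERE author = ?")
            .bind(author)
            .fetch_one(pool)
            .await?;
    Ok(max.unwrap_or(0) + 1) // resume at MAX(seq) + 1 instead of resetting to 1
}
```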
## What's currently weird / broken in the running system
- **`timers.json` keeps getting re-populated** even after we empty it. The cause: stopping an agent triggers the agent's exit handler, which calls the rate-limit auto-resume scheduler, which writes to `timers.json`. Bug 501 should cover this but it might need to be explicit about the stop-agent code path.
- **Chat commands can't find stories that have no filesystem shadow.** Bug 512. Workaround: use MCP `move_story` / `delete_story` / etc. directly, NOT the web UI chat commands.
- **The web UI shows stale state** for some stories because the API reads from the in-memory CRDT view, which can diverge from `pipeline_items`. This will be fixed naturally by 520 + 517 (single source of truth).
- **`create_worktree` always creates from master** — intentional design choice ("keep conflicts low") but means it can't reuse an existing feature branch's work. Bit us with 478 today.
- **Mergemaster's `merge_agent_work` exits silently** when there are no commits ahead of master — we lost ~$0.82 to one such session today. Bug 519 + the typed `NonZeroU32` constraint in story 520 will make this unrepresentable.
## Useful diagnostic recipes from today
- **View persisted CRDT ops:** `sqlite3 .huskies/pipeline.db "SELECT seq, substr(op_json, 1, 200) FROM crdt_ops ORDER BY seq DESC LIMIT 20"`
- **Tail server log filtered for bug 502 firings:** `tail -f .huskies/logs/server.log | grep --line-buffered "Failed to start mergemaster"`
- **Tail server log without `[pty-debug]` noise:** `tail -f .huskies/logs/server.log | grep -v "\[pty-debug\]"`
- **Check current pending timers:** `cat .huskies/timers.json`
- **Forensically delete a story across all four state machines:** stop agents → remove worktree → empty timers → `DELETE FROM pipeline_items WHERE id LIKE '<id>%'` → `DELETE FROM crdt_ops WHERE op_json LIKE '%<id>%'`
## Token cost accounting
This session burned roughly **$15-25** in agent thrash, mostly from bug 501 + bug 510 respawning agents on already-completed stories. Once 511 + 510 + 501 are fixed, that bleed disappears.
## Open questions for the next session
1. **Should `ExecutionState` live in the same CRDT or a separate one?** We tentatively said same CRDT under per-node-pubkey keys. Need to validate this against the bft-json-crdt library's actual capabilities.
2. **Heartbeat cadence?** How often should `last_heartbeat` be updated for `ExecutionState::Running`? Every 30s seems reasonable but should be config.
3. **What's the migration path from existing pipeline_items rows to typed `PipelineItem`s?** A one-time migration script, or rebuild from `crdt_ops`?
4. **Should we add `statig` after all?** Probably not for the initial implementation, but worth revisiting if we end up wanting hierarchical states (e.g., a `Working` superstate sharing transitions across active stages).
role="Full-stack engineer. Implements features across all components."
model="sonnet"
max_turns=50
max_budget_usd=5.00
prompt="You are working in a git worktree on story {{story_id}}. Read CLAUDE.md first, then .story_kit/README.md to understand the dev process. The story details are in your prompt above. Follow the SDTW process through implementation and verification (Steps 1-3). The worktree and feature branch already exist - do not create them. Check .mcp.json for MCP tools. Do NOT accept the story or merge - commit your work and stop. If the user asks to review your changes, tell them to run: cd \"{{worktree_path}}\" && git difftool {{base_branch}}...HEAD\n\nIMPORTANT: Commit all your work before your process exits. The server will automatically run acceptance gates when your process exits and advance the pipeline based on the results. To verify before committing, use the run_tests MCP tool (it starts tests in the background — poll get_test_result to check completion) — never run script/test or cargo test directly via Bash.\n\n## Acceptance Criteria Tracking\nAs you complete each acceptance criterion, call the check_criterion MCP tool (story_id, criterion_index) to mark it done. Index 0 is the first unchecked criterion, 1 is the second, etc. Do this as you go — not all at once at the end.\n\n## Bug Workflow: Trust the Story, Act Fast\nWhen working on bugs:\n1. READ THE STORY DESCRIPTION FIRST. If it specifies exact files, functions, and line numbers — go directly there and make the fix. Do NOT explore git history, grep the whole codebase, or re-investigate the root cause when the story already tells you what to do.\n2. If the story does NOT specify the exact location, THEN investigate: use targeted grep to find the relevant code.\n3. Fix with a surgical, minimal change. Do NOT add new abstractions or workarounds.\n4. Commit early. If you've made the fix and tests pass, commit and exit. Do not spend turns verifying that master also has the same failures — that wastes budget.\n5. Write commit messages that explain what broke and why."
system_prompt="You are a full-stack engineer working autonomously in a git worktree. Follow the Story-Driven Test Workflow strictly. Use the run_tests MCP tool to verify your changes pass — it starts tests in the background, then poll get_test_result to check completion. Never run script/test or cargo test directly via Bash. As you complete each acceptance criterion, call check_criterion MCP tool to mark it done. Add //! module-level doc comments to any new modules and /// doc comments to any new public functions, structs, or enums. Commit all your work before finishing - use a descriptive commit message. Do not accept stories, move them to archived, or merge to master - a human will do that. Do not coordinate with other agents - focus on your assigned story. The server automatically runs acceptance gates when your process exits. For bugs, trust the story description — if it specifies exact files and functions, go directly there. Do not explore git history or grep the whole codebase when the story already tells you where to look. Make surgical fixes, commit early."
[[agent]]
name="coder-2"
stage="coder"
role="Full-stack engineer. Implements features across all components."
model="sonnet"
max_turns=50
max_budget_usd=5.00
prompt="You are working in a git worktree on story {{story_id}}. Read CLAUDE.md first, then .story_kit/README.md to understand the dev process. The story details are in your prompt above. Follow the SDTW process through implementation and verification (Steps 1-3). The worktree and feature branch already exist - do not create them. Check .mcp.json for MCP tools. Do NOT accept the story or merge - commit your work and stop. If the user asks to review your changes, tell them to run: cd \"{{worktree_path}}\" && git difftool {{base_branch}}...HEAD\n\nIMPORTANT: Commit all your work before your process exits. The server will automatically run acceptance gates when your process exits and advance the pipeline based on the results. To verify before committing, use the run_tests MCP tool (it starts tests in the background — poll get_test_result to check completion) — never run script/test or cargo test directly via Bash.\n\n## Acceptance Criteria Tracking\nAs you complete each acceptance criterion, call the check_criterion MCP tool (story_id, criterion_index) to mark it done. Index 0 is the first unchecked criterion, 1 is the second, etc. Do this as you go — not all at once at the end.\n\n## Bug Workflow: Trust the Story, Act Fast\nWhen working on bugs:\n1. READ THE STORY DESCRIPTION FIRST. If it specifies exact files, functions, and line numbers — go directly there and make the fix. Do NOT explore git history, grep the whole codebase, or re-investigate the root cause when the story already tells you what to do.\n2. If the story does NOT specify the exact location, THEN investigate: use targeted grep to find the relevant code.\n3. Fix with a surgical, minimal change. Do NOT add new abstractions or workarounds.\n4. Commit early. If you've made the fix and tests pass, commit and exit. Do not spend turns verifying that master also has the same failures — that wastes budget.\n5. Write commit messages that explain what broke and why."
system_prompt="You are a full-stack engineer working autonomously in a git worktree. Follow the Story-Driven Test Workflow strictly. Use the run_tests MCP tool to verify your changes pass — it starts tests in the background, then poll get_test_result to check completion. Never run script/test or cargo test directly via Bash. As you complete each acceptance criterion, call check_criterion MCP tool to mark it done. Add //! module-level doc comments to any new modules and /// doc comments to any new public functions, structs, or enums. Commit all your work before finishing - use a descriptive commit message. Do not accept stories, move them to archived, or merge to master - a human will do that. Do not coordinate with other agents - focus on your assigned story. The server automatically runs acceptance gates when your process exits. For bugs, trust the story description — if it specifies exact files and functions, go directly there. Do not explore git history or grep the whole codebase when the story already tells you where to look. Make surgical fixes, commit early."
[[agent]]
name="coder-3"
stage="coder"
role="Full-stack engineer. Implements features across all components."
model="sonnet"
max_turns=50
max_budget_usd=5.00
prompt="You are working in a git worktree on story {{story_id}}. Read CLAUDE.md first, then .story_kit/README.md to understand the dev process. The story details are in your prompt above. Follow the SDTW process through implementation and verification (Steps 1-3). The worktree and feature branch already exist - do not create them. Check .mcp.json for MCP tools. Do NOT accept the story or merge - commit your work and stop. If the user asks to review your changes, tell them to run: cd \"{{worktree_path}}\" && git difftool {{base_branch}}...HEAD\n\nIMPORTANT: Commit all your work before your process exits. The server will automatically run acceptance gates when your process exits and advance the pipeline based on the results. To verify before committing, use the run_tests MCP tool (it starts tests in the background — poll get_test_result to check completion) — never run script/test or cargo test directly via Bash.\n\n## Acceptance Criteria Tracking\nAs you complete each acceptance criterion, call the check_criterion MCP tool (story_id, criterion_index) to mark it done. Index 0 is the first unchecked criterion, 1 is the second, etc. Do this as you go — not all at once at the end.\n\n## Bug Workflow: Trust the Story, Act Fast\nWhen working on bugs:\n1. READ THE STORY DESCRIPTION FIRST. If it specifies exact files, functions, and line numbers — go directly there and make the fix. Do NOT explore git history, grep the whole codebase, or re-investigate the root cause when the story already tells you what to do.\n2. If the story does NOT specify the exact location, THEN investigate: use targeted grep to find the relevant code.\n3. Fix with a surgical, minimal change. Do NOT add new abstractions or workarounds.\n4. Commit early. If you've made the fix and tests pass, commit and exit. Do not spend turns verifying that master also has the same failures — that wastes budget.\n5. Write commit messages that explain what broke and why."
system_prompt="You are a full-stack engineer working autonomously in a git worktree. Follow the Story-Driven Test Workflow strictly. Use the run_tests MCP tool to verify your changes pass — it starts tests in the background, then poll get_test_result to check completion. Never run script/test or cargo test directly via Bash. As you complete each acceptance criterion, call check_criterion MCP tool to mark it done. Add //! module-level doc comments to any new modules and /// doc comments to any new public functions, structs, or enums. Commit all your work before finishing - use a descriptive commit message. Do not accept stories, move them to archived, or merge to master - a human will do that. Do not coordinate with other agents - focus on your assigned story. The server automatically runs acceptance gates when your process exits. For bugs, trust the story description — if it specifies exact files and functions, go directly there. Do not explore git history or grep the whole codebase when the story already tells you where to look. Make surgical fixes, commit early."
[[agent]]
name="qa-2"
stage="qa"
role="Reviews coder work in worktrees: runs quality gates, verifies acceptance criteria, and reports findings."
model="sonnet"
max_turns=40
max_budget_usd=4.00
prompt="""You are the QA agent for story {{story_id}}. Your job is to verify the coder's work satisfies the story's acceptance criteria and produce a structured QA report.
Read CLAUDE.md first, then .story_kit/README.md to understand the dev process.
## Your Workflow
### 0. Read the Story
- Read the story file at `.huskies/work/3_qa/{{story_id}}.md`
- Extract every acceptance criterion (the `- [ ]` checkbox lines)
- Keep this list in mind for Step 3
### 1. Deterministic Gates (Prerequisites)
Run these first — if any fail, reject immediately without proceeding to AC review:
- Call the `run_tests` MCP tool to start tests, then poll `get_test_result` until complete — all gates must pass (0 lint errors/warnings, all tests green, frontend build clean if applicable). Do NOT run script/test via Bash.
### 2. Code Change Review
- Run `git diff master...HEAD --stat` to see what files changed
- Run `git diff master...HEAD` to review the actual changes
- Flag any incomplete implementations:
- `todo!()`, `unimplemented!()`, `panic!()` used as stubs
- Placeholder strings like "TODO", "FIXME", "notimplemented"
- Empty match arms or arms that just return `Default::default()`
- Hardcoded values where real logic is expected
- Note any obvious coding mistakes (unused imports, dead code, unhandled errors)
### 3. Acceptance Criteria Review
For each AC extracted in Step 0:
- Review the diff and test files to determine if the code addresses this AC
- PASS: describe specifically how the code addresses it (which file/function/test)
- FAIL: explain exactly what is missing or incorrect
An AC fails if:
- No code change or test relates to it
- The implementation is stubbed out (todo!/unimplemented!)
- A test exists but doesn't actually assert the behaviour described
### 4. Manual Testing Support (only if all gates PASS and all ACs PASS)
- Build: run `script/build` and note success/failure
- If build succeeds: find a free port (try 3010-3020), set `HUSKIES_PORT=<port>` and start the server with `script/server`
- Generate a testing plan including:
- URL to visit in the browser
- Things to check in the UI
- curl commands to exercise relevant API endpoints
- Stop the test server when done: send SIGTERM to the `script/server` process (e.g. `kill <pid>`)
### 5. Produce Structured Report and Verdict
Print your QA report to stdout. Then call `approve_qa` or `reject_qa` via the MCP tool based on the overall result. Use this format:
```
## QA Report for {{story_id}}
### Code Quality
- run_tests MCP tool: PASS/FAIL (details)
- Incomplete implementations: (list any todo!/unimplemented!/stubs found, or "None")
- Other code review findings: (list any issues found, or "None")
### Acceptance Criteria Review
- AC: <criterion text>
Result: PASS/FAIL
Evidence: <how the code addresses it, or what is missing>
(repeat for each AC)
### Manual Testing Plan
- Server URL: http://localhost:PORT (or "Skipped — gate/AC failure" or "Build failed")
- Pages to visit: (list, or "N/A")
- Things to check: (list, or "N/A")
- curl commands: (list, or "N/A")
### Overall: PASS/FAIL
Reason: (summary of why it passed or the primary reason it failed)
```
After printing the report:
- If Overall is PASS: call `approve_qa(story_id='{{story_id}}')` via MCP
- If Overall is FAIL: call `reject_qa(story_id='{{story_id}}', notes='<concise reason>')` via MCP so the coder knows exactly what to fix
## Rules
- Do NOT modify any code — read-only review only
- Gates must pass before AC review — a gate failure is an automatic reject
- If any AC is not met, the overall result is FAIL
- Always call approve_qa or reject_qa — never leave the story without a verdict"""
system_prompt="You are a QA agent. Your job is read-only: run quality gates, verify each acceptance criterion against the diff, and produce a structured QA report. Always call approve_qa or reject_qa via MCP to record your verdict. Do not modify code."
[[agent]]
name="coder-opus"
stage="coder"
role="Senior full-stack engineer for complex tasks. Implements features across all components."
model="opus"
max_turns=80
max_budget_usd=20.00
prompt="You are working in a git worktree on story {{story_id}}. Read CLAUDE.md first, then .story_kit/README.md to understand the dev process. The story details are in your prompt above. Follow the SDTW process through implementation and verification (Steps 1-3). The worktree and feature branch already exist - do not create them. Check .mcp.json for MCP tools. Do NOT accept the story or merge - commit your work and stop. If the user asks to review your changes, tell them to run: cd \"{{worktree_path}}\" && git difftool {{base_branch}}...HEAD\n\nIMPORTANT: Commit all your work before your process exits. The server will automatically run acceptance gates when your process exits and advance the pipeline based on the results. To verify before committing, use the run_tests MCP tool (it starts tests in the background — poll get_test_result to check completion) — never run script/test or cargo test directly via Bash.\n\n## Acceptance Criteria Tracking\nAs you complete each acceptance criterion, call the check_criterion MCP tool (story_id, criterion_index) to mark it done. Index 0 is the first unchecked criterion, 1 is the second, etc. Do this as you go — not all at once at the end.\n\n## Bug Workflow: Trust the Story, Act Fast\nWhen working on bugs:\n1. READ THE STORY DESCRIPTION FIRST. If it specifies exact files, functions, and line numbers — go directly there and make the fix. Do NOT explore git history, grep the whole codebase, or re-investigate the root cause when the story already tells you what to do.\n2. If the story does NOT specify the exact location, THEN investigate: use targeted grep to find the relevant code.\n3. Fix with a surgical, minimal change. Do NOT add new abstractions or workarounds.\n4. Commit early. If you've made the fix and tests pass, commit and exit. Do not spend turns verifying that master also has the same failures — that wastes budget.\n5. Write commit messages that explain what broke and why."
system_prompt="You are a senior full-stack engineer working autonomously in a git worktree. You handle complex tasks requiring deep architectural understanding. Follow the Story-Driven Test Workflow strictly. Use the run_tests MCP tool to verify your changes pass — it starts tests in the background, then poll get_test_result to check completion. Never run script/test or cargo test directly via Bash. As you complete each acceptance criterion, call check_criterion MCP tool to mark it done. Add //! module-level doc comments to any new modules and /// doc comments to any new public functions, structs, or enums. Commit all your work before finishing - use a descriptive commit message. Do not accept stories, move them to archived, or merge to master - a human will do that. Do not coordinate with other agents - focus on your assigned story. The server automatically runs acceptance gates when your process exits. For bugs, trust the story description — if it specifies exact files and functions, go directly there. Do not explore git history or grep the whole codebase when the story already tells you where to look. Make surgical fixes, commit early."
[[agent]]
name="qa"
stage="qa"
role="Reviews coder work in worktrees: runs quality gates, verifies acceptance criteria, and reports findings."
model="sonnet"
max_turns=40
max_budget_usd=4.00
prompt="""You are the QA agent for story {{story_id}}. Your job is to verify the coder's work satisfies the story's acceptance criteria and produce a structured QA report.
Read CLAUDE.md first, then .story_kit/README.md to understand the dev process.
## Your Workflow
### 0. Read the Story
- Read the story file at `.huskies/work/3_qa/{{story_id}}.md`
- Extract every acceptance criterion (the `- [ ]` checkbox lines)
- Keep this list in mind for Step 3
### 1. Deterministic Gates (Prerequisites)
Run these first — if any fail, reject immediately without proceeding to AC review:
- Call the `run_tests` MCP tool to start tests, then poll `get_test_result` until complete — all gates must pass (0 lint errors/warnings, all tests green, frontend build clean if applicable). Do NOT run script/test via Bash.
### 2. Code Change Review
- Run `git diff master...HEAD --stat` to see what files changed
- Run `git diff master...HEAD` to review the actual changes
- Flag any incomplete implementations:
- `todo!()`, `unimplemented!()`, `panic!()` used as stubs
- Placeholder strings like "TODO", "FIXME", "notimplemented"
- Empty match arms or arms that just return `Default::default()`
- Hardcoded values where real logic is expected
- Note any obvious coding mistakes (unused imports, dead code, unhandled errors)
### 3. Acceptance Criteria Review
For each AC extracted in Step 0:
- Review the diff and test files to determine if the code addresses this AC
- PASS: describe specifically how the code addresses it (which file/function/test)
- FAIL: explain exactly what is missing or incorrect
An AC fails if:
- No code change or test relates to it
- The implementation is stubbed out (todo!/unimplemented!)
- A test exists but doesn't actually assert the behaviour described
### 4. Manual Testing Support (only if all gates PASS and all ACs PASS)
- Build: run `script/build` and note success/failure
- If build succeeds: find a free port (try 3010-3020), set `HUSKIES_PORT=<port>` and start the server with `script/server`
- Generate a testing plan including:
- URL to visit in the browser
- Things to check in the UI
- curl commands to exercise relevant API endpoints
- Stop the test server when done: send SIGTERM to the `script/server` process (e.g. `kill <pid>`)
### 5. Produce Structured Report and Verdict
Print your QA report to stdout. Then call `approve_qa` or `reject_qa` via the MCP tool based on the overall result. Use this format:
```
## QA Report for {{story_id}}
### Code Quality
- run_tests MCP tool: PASS/FAIL (details)
- Incomplete implementations: (list any todo!/unimplemented!/stubs found, or "None")
- Other code review findings: (list any issues found, or "None")
### Acceptance Criteria Review
- AC: <criterion text>
Result: PASS/FAIL
Evidence: <how the code addresses it, or what is missing>
(repeat for each AC)
### Manual Testing Plan
- Server URL: http://localhost:PORT (or "Skipped — gate/AC failure" or "Build failed")
- Pages to visit: (list, or "N/A")
- Things to check: (list, or "N/A")
- curl commands: (list, or "N/A")
### Overall: PASS/FAIL
Reason: (summary of why it passed or the primary reason it failed)
```
After printing the report:
- If Overall is PASS: call `approve_qa(story_id='{{story_id}}')` via MCP
- If Overall is FAIL: call `reject_qa(story_id='{{story_id}}', notes='<concise reason>')` via MCP so the coder knows exactly what to fix
## Rules
- Do NOT modify any code — read-only review only
- Gates must pass before AC review — a gate failure is an automatic reject
- If any AC is not met, the overall result is FAIL
- Always call approve_qa or reject_qa — never leave the story without a verdict"""
system_prompt="You are a QA agent. Your job is read-only: run quality gates, verify each acceptance criterion against the diff, and produce a structured QA report. Always call approve_qa or reject_qa via MCP to record your verdict. Do not modify code."
[[agent]]
name="mergemaster"
stage="mergemaster"
role="Merges completed coder work into master, runs quality gates, archives stories, and cleans up worktrees."
model="sonnet"
max_turns=30
max_budget_usd=5.00
prompt="""You are the mergemaster agent for story {{story_id}}. Your job is to merge the completed coder work into master.
Read CLAUDE.md first, then .story_kit/README.md to understand the dev process.
## Your Workflow
1. Call merge_agent_work(story_id='{{story_id}}') — this blocks until the merge completes and returns the result. Do NOT poll get_merge_status.
2. Review the result: check success, had_conflicts, conflicts_resolved, gates_passed, and gate_output
3. If merge succeeded and gates passed: report success to the human
4. If conflicts were auto-resolved (conflicts_resolved=true) and gates passed: report success, noting which conflicts were resolved
5. If conflicts could not be auto-resolved: **resolve them yourself** in the merge worktree (see below)
6. If merge failed for any other reason: call report_merge_failure(story_id='{{story_id}}', reason='<details>') and report to the human
7. If gates failed after merge: attempt to fix the issues yourself in the merge worktree, then re-trigger merge_agent_work. After 3 fix attempts, call report_merge_failure and stop.
## Resolving Complex Conflicts Yourself
When the auto-resolver fails, you have access to the merge worktree at `.story_kit/merge_workspace/`. Go in there and resolve the conflicts manually:
1. Run `git diff --name-only --diff-filter=U` in the merge worktree to list conflicted files
2. **Build context before touching code.** Run `git log --oneline master...HEAD` on the feature branch to see its commits. Then run `git log --oneline --since="$(git log -1 --format=%ci <feature-branch-base-commit>)" master` to see what landed on master since the branch was created. Read the story files in `.story_kit/work/` for any recently merged stories that touch the same files — this tells you WHY master changed and what must be preserved.
3. Read each conflicted file and understand both sides of the conflict
4. **Understand intent, not just syntax.** The feature branch may be behind master — master's version of shared infrastructure is almost always correct. The feature branch's contribution is the NEW functionality it adds. Your job is to integrate the new into master's structure, not pick one side.
5. Resolve by integrating the feature's new functionality into master's code structure
6. Stage resolved files with `git add`
7. Call the `run_tests` MCP tool to start tests, then poll `get_test_result` until complete
8. If it compiles, commit and re-trigger merge_agent_work
### Common conflict patterns in this project:
**Story file rename/rename conflicts:** Both branches moved the story .md file to different pipeline directories. Resolution: `git rm` both sides — story files in `work/2_current/`, `work/3_qa/`, `work/4_merge/` are gitignored and don't need to be committed.
**bot.rs tokio::select! conflicts:** Master has a `tokio::select!` loop in `handle_message()` that handles permission forwarding (story 275). Feature branches created before story 275 have a simpler direct `provider.chat_stream().await` call. Resolution: KEEP master's tokio::select! loop. Integrate only the feature's new logic (e.g. typing indicators, new callbacks) into the existing loop structure. Do NOT replace the loop with the old direct call.
**Duplicate functions/imports:** The auto-resolver keeps both sides, producing duplicates. Resolution: keep one copy (prefer master's version), delete the duplicate.
**Formatting-only conflicts:** Both sides reformatted the same code differently. Resolution: pick either side (prefer master).
## Fixing Gate Failures
If quality gates fail, attempt to fix issues yourself in the merge worktree. Use the run_tests MCP tool (then poll get_test_result) to verify — do not run script/test via Bash.
**Fix yourself:**
- Trivial formatting issues that block compilation or linting
**Report to human without attempting a fix:**
- Logic errors or incorrect business logic
- Missing function implementations
- Architectural changes required
- Non-trivial refactoring needed
**Max retry limit:** If gates still fail after 3 fix attempts, call report_merge_failure to record the failure, then stop immediately and report the full gate output to the human.
## CRITICAL Rules
- NEVER manually move story files between pipeline stages (e.g. from 4_merge/ to 5_done/)
- NEVER call accept_story — only merge_agent_work can move stories to done after a successful merge
- When merge fails after exhausting your fix attempts, ALWAYS call report_merge_failure
- Report conflict resolution outcomes clearly
- Report gate failures with full output so the human can act if needed
- The server automatically runs acceptance gates when your process exits"""
system_prompt="You are the mergemaster agent. Your primary job is to merge feature branches to master. First try the merge_agent_work MCP tool. If the auto-resolver fails on complex conflicts, resolve them yourself in the merge worktree — you are an opus-class agent capable of understanding both sides of a conflict and producing correct merged code. Common patterns: keep master's tokio::select! permission loop in bot.rs, discard story file rename conflicts (gitignored), remove duplicate definitions. After resolving, verify compilation before re-triggering merge. CRITICAL: Never manually move story files or call accept_story. After 3 failed fix attempts, call report_merge_failure and stop."
@@ -118,8 +118,8 @@ To support both Remote and Local models, the system implements a `ModelProvider`
 Multiple instances can run simultaneously in different worktrees. To avoid port conflicts:
-- **Backend:** Set `STORKIT_PORT` to a unique port (default is 3001). Example: `STORKIT_PORT=3002 cargo run`
-- **Frontend:** Run `npm run dev` from `frontend/`. It auto-selects the next unused port. It reads `STORKIT_PORT` to know which backend to talk to, so export it before running: `export STORKIT_PORT=3002 && cd frontend && npm run dev`
+- **Backend:** Set `HUSKIES_PORT` to a unique port (default is 3001). Example: `HUSKIES_PORT=3002 cargo run`
+- **Frontend:** Run `npm run dev` from `frontend/`. It auto-selects the next unused port. It reads `HUSKIES_PORT` to know which backend to talk to, so export it before running: `export HUSKIES_PORT=3002 && cd frontend && npm run dev`
 When running in a worktree, use a port that won't conflict with the main instance (3001). Ports 3002+ are good choices.
# Story 388: WhatsApp webhook HMAC signature verification
## User Story
As a bot operator, I want incoming WhatsApp webhook requests to be cryptographically verified, so that forged requests from unauthorized sources are rejected.
## Acceptance Criteria
- [ ] Meta webhooks: validate X-Hub-Signature-256 HMAC-SHA256 header using the app secret before processing
- [ ] Twilio webhooks: validate request signature using the auth token before processing
- [ ] Requests with missing or invalid signatures are rejected with 403 Forbidden
- [ ] Verification is fail-closed: if signature checking is configured, unsigned requests are rejected
- [ ] Existing bot.toml config is extended with any needed secrets (e.g. Meta app_secret for HMAC verification)
- [ ] MUST use audited crypto crates (hmac, sha2, sha1, base64) — no hand-rolled cryptographic primitives
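**Sketch:** a minimal version of the Meta-side check, assuming the audited `hmac`, `sha2`, and `hex` crates and Meta's `sha256=<hex digest>` header encoding; the function name and wiring are illustrative, not the actual handler:
```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

/// Verify Meta's `X-Hub-Signature-256` header ("sha256=<hex>") against the raw
/// request body. Fail-closed: a missing prefix, bad hex, or mismatch all reject.
fn verify_meta_signature(app_secret: &[u8], raw_body: &[u8], header: &str) -> bool {
    let Some(hex_sig) = header.strip_prefix("sha256=") else { return false };
    let Ok(expected) = hex::decode(hex_sig) else { return false };
    let mut mac = HmacSha256::new_from_slice(app_secret)
        .expect("HMAC-SHA256 accepts keys of any length");
    mac.update(raw_body);
    // verify_slice does a constant-time comparison, so no timing side channel
    mac.verify_slice(&expected).is_ok()
}
```
Twilio's variant is analogous but signs the full URL plus sorted POST parameters with HMAC-SHA1 and base64-encodes the result.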
name: "Fly.io Machines API integration for multi-tenant huskies SaaS"
---
# Spike 408: Fly.io Machines API integration for multi-tenant huskies SaaS
## Question
Can we build a working Rust integration that creates and manages per-tenant Fly.io Machines, attaches volumes, injects Claude credentials, and proxies JWT-authenticated HTTP/WebSocket traffic to the right machine?
## Hypothesis
A thin Rust service using `reqwest` for the Machines API and `axum` for the reverse proxy is sufficient. No heavyweight orchestration framework needed.
## Prerequisites
- Fly.io account with API token (set `FLY_API_TOKEN` env var)
- Spike 407 findings reviewed
## Timebox
4 hours
## Investigation Plan
- [ ] Create a minimal Rust crate in `spikes/fly_machines/` — do not touch production code
- [ ] Implement machine lifecycle: create, start, stop, destroy via Fly Machines REST API using `reqwest`
- [ ] Test attaching a persistent volume to a machine and verify it persists across stop/start
- [ ] Test secret injection — pass a dummy `credentials.json` as a Fly secret and verify it's readable inside the machine
- [ ] Sketch the auth proxy: JWT validation → machine lookup → reverse proxy to machine's private IP; verify WebSocket proxying works
- [ ] Measure actual cold start time for a minimal huskies container image
- [ ] Document any API quirks, rate limits, or sharp edges discovered during testing
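**Sketch:** a possible starting point for the machine-lifecycle item, assuming the public `api.machines.dev` REST endpoint; the payload shape and image name are placeholders to confirm against Fly's docs during the spike:
```rust
use reqwest::Client;
use serde_json::{json, Value};

/// Create a machine in the given Fly app. FLY_API_TOKEN comes from the
/// environment per the spike prerequisites; image and guest sizing are
/// placeholders for whatever the minimal huskies container image ends up being.
async fn create_machine(client: &Client, app: &str) -> reqwest::Result<Value> {
    let token = std::env::var("FLY_API_TOKEN").expect("FLY_API_TOKEN not set");
    client
        .post(format!("https://api.machines.dev/v1/apps/{app}/machines"))
        .bearer_auth(token)
        .json(&json!({
            "config": {
                "image": "registry.fly.io/huskies:latest",
                "guest": { "cpu_kind": "shared", "cpus": 1, "memory_mb": 512 }
            }
        }))
        .send()
        .await?
        .error_for_status()?
        .json()
        .await
}
```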
name: "Multi-account OAuth token rotation on rate limit"
---
# Story 411: Multi-account OAuth token rotation on rate limit
## User Story
As a huskies user with multiple Claude Max subscriptions, I want the system to automatically rotate to a different account when one gets rate limited, so that agents and chat don't stall out waiting for limits to reset.
## Acceptance Criteria
- [ ] OAuth login flow stores credentials per-account (keyed by email), not overwriting previous accounts
- [ ] GET /oauth/status returns all stored accounts and their status (active, rate-limited, expired)
- [ ] When the active account hits a rate limit, huskies automatically swaps to the next available account's refresh token, refreshes, and retries
- [ ] The bot sends a notification in Matrix/WhatsApp when it swaps accounts
- [ ] If all accounts are rate limited, the bot surfaces a clear message with the time until the earliest reset
- [ ] A new /oauth/authorize login adds to the account pool rather than replacing the current credentials
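**Sketch:** one possible shape for the rotation decision; every name here is hypothetical, since the real account store comes out of the /oauth/status work:
```rust
use std::time::SystemTime;

/// Hypothetical per-account record; field names are illustrative.
struct Account {
    email: String,
    refresh_token: String,
    rate_limited_until: Option<SystemTime>,
}

/// Pick the next usable account; if every account is limited, return the
/// earliest reset time so the bot can surface it to the user.
fn next_account(accounts: &[Account]) -> Result<&Account, Option<SystemTime>> {
    let now = SystemTime::now();
    if let Some(acct) = accounts
        .iter()
        .find(|a| a.rate_limited_until.map_or(true, |t| t <= now))
    {
        return Ok(acct);
    }
    Err(accounts.iter().filter_map(|a| a.rate_limited_until).min())
}
```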
name: "Recheck bot command to re-run gates without restarting agent"
---
# Story 412: Recheck bot command to re-run gates without restarting agent
## User Story
As a user, I want to send `recheck <number>` to the bot to re-run acceptance gates on an existing worktree without spawning a new agent, so that I can unblock stories that failed due to environment issues without wasting agent turns.
## Acceptance Criteria
- [ ] recheck command is registered in chat/commands/mod.rs and appears in help output
- [ ] `recheck <number>` runs run_acceptance_gates on the story's existing worktree
- [ ] If gates pass, the story advances through the pipeline (same as if a coder completed successfully)
- [ ] If gates fail, the error output is returned to the user (not silently retried)
- [ ] If no worktree exists for the story, returns a clear error
- [ ] Does not spawn a new agent or increment retry_count
- [ ] Works from all transports (Matrix, WhatsApp, Slack)
name: "Unblock command handles all stuck states not just blocked flag"
---
# Story 435: Unblock command handles all stuck states not just blocked flag
## User Story
As a project owner, I want the unblock command to clear any stuck state on a story — not just the blocked flag — so that I have a single command to unstick stories regardless of why they're stuck.
## Acceptance Criteria
- [ ] Unblock clears merge_failure field in addition to blocked flag
- [ ] Unblock clears review_hold field
- [ ] Unblock reports which fields were cleared in the confirmation message
- [ ] Unblock works on stories in any pipeline stage (backlog, current, qa, merge, done)
- [ ] If no stuck state is found (no blocked, merge_failure, or review_hold), returns a clear message saying so
name: "Unify story stuck states into a single status field"
status: "superseded"
superseded_by: 520
---
# Refactor 436: Unify story stuck states into a single status field
## Current State
- TBD
## Desired State
Replace the separate blocked, merge_failure, and review_hold front matter fields with a single status field (e.g. status: blocked, status: merge_failure, status: review_hold). Simplifies the unblock command, auto-assign checks, and pipeline advance logic.
## Acceptance Criteria
- [ ] Replace blocked: true, merge_failure: string, and review_hold: true with a single status: field in story front matter
- [ ] Auto-assign checks a single field instead of three separate ones
- [ ] Pipeline advance and lifecycle code reads/writes the unified status field
- [ ] Unblock command clears the status field regardless of which stuck state it was
- [ ] retry_count remains a separate field (it's a counter, not a state)
- [ ] Migration: existing stories with old fields are handled gracefully on read
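**Sketch:** the unified field and the graceful-migration read; names are illustrative, and the merge_failure reason string would need a companion detail field (a design call for this refactor):
```rust
use serde::{Deserialize, Serialize};

/// Unified stuck-state field replacing blocked / merge_failure / review_hold.
#[derive(Serialize, Deserialize, Debug, Clone, Copy, PartialEq)]
#[serde(rename_all = "snake_case")]
enum StuckStatus {
    Blocked,
    MergeFailure,
    ReviewHold,
}

/// Migration shim for the graceful-read AC: derive the unified status from the
/// three legacy front matter fields when the new one is absent.
fn from_legacy(blocked: bool, merge_failure: Option<&str>, review_hold: bool) -> Option<StuckStatus> {
    if merge_failure.is_some() {
        Some(StuckStatus::MergeFailure)
    } else if blocked {
        Some(StuckStatus::Blocked)
    } else if review_hold {
        Some(StuckStatus::ReviewHold)
    } else {
        None
    }
}
```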
name: "Build agent mode with CRDT-based work claiming"
agent: coder-opus
depends_on: [478]
---
# Story 479: Build agent mode with CRDT-based work claiming
## User Story
As a user with multiple laptops, I want to run huskies in build agent mode so it connects to the mesh, syncs state, and autonomously picks up and runs coding work.
## Acceptance Criteria
- [ ] New CLI mode: huskies agent --rendezvous ws://host:3001
- [ ] Agent mode: syncs CRDT state, runs coders, no web UI or chat interface
- [ ] Work claiming via CRDT: node writes claim (node ID) to CRDT doc, merge resolves conflicts deterministically, losing node stops work
- [ ] Agent picks up stories in current stage and runs Claude Code locally
- [ ] Agent pushes feature branch to Gitea when done, reports completion via CRDT
- [ ] Handles offline/reconnect: CRDT merges on reconnect, interrupted work is reclaimed after timeout
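**Sketch:** the deterministic-claim rule in miniature. After a CRDT merge every node evaluates the same rule over the merged claim set, so the loser stands down without coordination; the lowest-node-ID tie-break is an assumption:
```rust
/// After a CRDT merge, several nodes may have written claims for the same
/// story. Picking the lexicographically smallest node ID is one deterministic
/// rule: every replica computes the same winner from the same merged set.
fn winning_claim(claims: &[String]) -> Option<&String> {
    claims.iter().min()
}

/// A node keeps working only if its own claim won the merge.
fn should_keep_working(my_node_id: &str, claims: &[String]) -> bool {
    winning_claim(claims).map(String::as_str) == Some(my_node_id)
}
```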
name: "create_worktree deletes all files from main branch git index"
---
# Bug 486: create_worktree deletes all files from main branch git index
## Description
On the reclaimer project, the create_worktree operation for story 34 produced a commit (cea0c48) with message "huskies: create 34_story_drawer_open_pushes_main_view_aside_with_animation" that removed 76 files (9853 deletions) from the main branch git index. All files remained on disk — nothing was lost — but every tracked file became untracked. Static analysis of the watcher code (watcher.rs:152-185) shows git add -A .huskies/work/ with correct current_dir, which should only affect files under .huskies/work/. No other code path in the server runs git add without a pathspec on the main branch. Root cause is unknown — may be related to the storkit→huskies migration leaving the git index in an inconsistent state, or a race condition during first-time scaffold on a project that previously used .storkit/.
## How to Reproduce
1. Uncertain — may require a fresh huskies setup on a project that previously used storkit.
2. Create a story via the bot or MCP tool.
3. Observe that the auto-commit for story creation removes all tracked files from the git index.
## Actual Result
The commit for story creation deleted 76 files (9853 deletions) from the git index while leaving them on disk.
## Expected Result
The commit for story creation should only add the new story markdown file under .huskies/work/1_backlog/. No other files should be affected.
name: "Stale merge job lock prevents new merges after agent dies"
---
# Bug 498: Stale merge job lock prevents new merges after agent dies
## Description
When the mergemaster agent is killed or stops while a merge is in progress, the in-memory `merge_jobs` map retains a `Running` status entry for that story. Subsequent attempts to call `merge_agent_work` get "Merge already in progress" and fail. The lock is never cleaned up.
This causes the mergemaster to loop: spawn, try merge, get "already in progress", waste turns, exit, respawn. The merge never completes.
The fix: clear the merge job entry when the mergemaster agent exits (whether cleanly or via kill/stop).
## How to Reproduce
1. Start mergemaster on a story in merge
2. Kill/stop the mergemaster agent before merge completes
3. Try to merge_agent_work again for the same story
4. Get "Merge already in progress" error
## Actual Result
Stale Running entry in merge_jobs map blocks all future merge attempts until server restart.
## Expected Result
Merge job lock is cleaned up when the agent exits, allowing retry.
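**Sketch:** the proposed fix, with the map type and the exit-hook name assumed (the real types live in the merge job module):
```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(PartialEq)]
enum MergeStatus { Running, Done, Failed }

/// Hypothetical agent-exit hook: drop a stale Running entry so retries work
/// whether the mergemaster exited cleanly or was killed.
fn clear_stale_merge_lock(jobs: &Mutex<HashMap<String, MergeStatus>>, story_id: &str) {
    let mut jobs = jobs.lock().unwrap();
    if jobs.get(story_id) == Some(&MergeStatus::Running) {
        jobs.remove(story_id);
        eprintln!("[merge] cleared stale Running lock for {story_id}");
    }
}
```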
name: "Stale 1_backlog filesystem shadows get re-promoted by rate-limit retry timers, yanking successfully-merged stories back into current"
---
# Bug 510: Stale 1_backlog filesystem shadows get re-promoted by rate-limit retry timers, yanking successfully-merged stories back into current
## Description
After a story successfully completes the entire pipeline — coder runs, gates pass, mergemaster squashes the feature branch to master, lifecycle moves the story from `4_merge/` to `5_done/` — a stale filesystem shadow of the story's markdown file remains in `.huskies/work/1_backlog/`. This shadow is a leftover from the 491/492 migration: story state moved to the database as the source of truth, but the lifecycle move logic in `lifecycle.rs` is still operating on the filesystem and doesn't fully clean up after successful pipeline completions.
When a rate-limit retry timer subsequently fires for that story (rate limits get scheduled by story 496's auto-retry whenever an agent is hard-blocked, and bug 501 means those timers aren't cancelled on successful completion either), the timer fire path calls `move_story_to_current()`, which uses the **filesystem-only** `move_item` helper. That helper finds the stale `1_backlog/` shadow and "moves" it to `2_current/` — even though the story is correctly in `5_done` in the database.
Net effect: a fully-merged, archived-to-done story suddenly reappears in `current` with a fresh coder spawned on it. The matrix bot sends `Done → Current` notifications. The agent burns tokens working on a story whose work has already shipped to master. The user sees the story flapping and assumes the merge didn't actually happen.
**Observed live on 2026-04-09 against story 503:**
```
18:31:32 [lifecycle] Moved '503_…' from work/4_merge/ to work/5_done/
18:32:21 [lifecycle] Moved '503_…' from work/1_backlog/ to work/2_current/ ← stale shadow!
18:32:21 [auto-assign] Assigning 'coder-1' to '503_…' in 2_current/
```
The merge to master persisted (commit `41515e3b` is on master). Only the *pipeline state* got corrupted by the stale shadow being re-promoted.
This is **distinct from bug 501** (which is about manual `stop_agent` not cancelling timers) but compounds it: 501 is about user-initiated stops, this is about successful pipeline completions. Both share a root cause — the rate-limit retry timer system has no notion of "this story has moved on, cancel any pending retries" — but the *consequences* of this bug are worse because the timer fires successfully and re-creates work that shouldn't exist.
Also distinct from bug 502 (mergemaster stage-mismatch) which has been fixed.
The deeper architectural problem this exposes: **`lifecycle.rs::move_item` and `move_story_to_current` are still on the legacy filesystem path** while the rest of the pipeline (491/492) has moved to DB-as-source-of-truth. The filesystem shadows in `.huskies/work/N_stage/` are supposed to be a *materialized rendering* of the DB state, not a parallel source of truth — but `move_item` treats them as authoritative.
## How to Reproduce
1. Take any story through the full pipeline successfully — coder runs, gates pass, mergemaster squashes to master, story moves to `5_done`.
2. While the story was in flight, ensure at least one coder run hit a hard rate limit (so a retry timer was scheduled). Bug 501 means that timer survives the successful completion.
3. Verify post-completion state:
- `SELECT stage FROM pipeline_items WHERE id = 'N_story_X';` returns `5_done` ✓
- `ls .huskies/work/1_backlog/N_story_X.md` shows the file STILL EXISTS (the stale shadow)
- `cat .huskies/timers.json` shows a pending entry for `N_story_X` with a future `scheduled_at`
4. Wait for the timer to fire (default ~5 minutes after the last rate-limit hit).
## Actual Result
When the timer fires:
- The `[timer] Timer fired` log line appears for the already-done story
- `move_story_to_current` is called and finds the stale `1_backlog/N_story_X.md` shadow
- Lifecycle log: `[lifecycle] Moved 'N_…' from work/1_backlog/ to work/2_current/`
- Auto-assign sees the story in `2_current/` and spawns a coder
- Matrix bot sends `Done → Current` (and then later `Current → Current` etc.) stage notifications, spamming the room
- The new coder works on a story whose work is already shipped on master, burning tokens
- The story is now visible in BOTH `5_done` (via DB) AND `2_current` (via filesystem shadow), depending on which view the consumer reads
- The actual master commit is unaffected — the merge that already landed is still there. Only the *pipeline state* is corrupted.
## Expected Result
Successful pipeline completions must fully clean up the story's filesystem shadows. After `move_story_to_done` runs, `.huskies/work/1_backlog/N_story_X.md` (and any other stage shadow) for that story must not exist.
Additionally — and this is the more general fix — the rate-limit retry timer system must cancel any pending timers for a story when that story successfully completes the pipeline. This is a sibling fix to bug 501 (which is about cancelling on manual stop): both manual stop and successful completion should mean "no more retries".
The deepest fix is to migrate `lifecycle.rs::move_item` off the filesystem path and onto the DB path so the shadow files can be torn down entirely (or made strictly read-only renderings). That's a larger change that probably wants its own story, not a bug fix.
## Acceptance Criteria
- [ ] After a story moves to 5_done via the normal pipeline path (mergemaster success), the filesystem shadow at .huskies/work/1_backlog/N_story_X.md is removed (and any other stage shadows are also removed)
- [ ] When a story moves to 5_done, any pending rate-limit retry timer for that story is cancelled (the entry is removed from timers.json before the file is persisted)
- [ ] Regression test: simulate the full repro sequence — run a story through the pipeline with a mid-flight rate limit, complete the merge, fast-forward to the timer fire, assert (a) the story stays in 5_done, (b) no agent is spawned, (c) no Done→Current notification fires
- [ ] No regression in bug 501's fix for manual-stop timer cancellation
- [ ] Filesystem shadow cleanup is symmetric — also runs on delete_story, move_story to backlog, etc., not just the done path
- [ ] The matrix bot does not spam Done→Current notifications for stories whose work has actually completed
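**Sketch:** the timer half of the fix, cancel-on-done against timers.json; the flat-array entry shape is assumed from the repro steps above:
```rust
use serde::{Deserialize, Serialize};
use std::path::Path;

/// Assumed shape of a timers.json entry; only the fields this fix needs.
#[derive(Serialize, Deserialize)]
struct TimerEntry {
    story_id: String,
    scheduled_at: String,
}

/// Drop all pending retry timers for a story. Per the symmetry AC, call this
/// from move_story_to_done, delete_story, and moves back to backlog alike.
fn cancel_timers_for(story_id: &str, timers_path: &Path) -> anyhow::Result<()> {
    let raw = std::fs::read_to_string(timers_path)?;
    let mut timers: Vec<TimerEntry> = serde_json::from_str(&raw)?;
    let before = timers.len();
    timers.retain(|t| t.story_id != story_id);
    if timers.len() != before {
        std::fs::write(timers_path, serde_json::to_string_pretty(&timers)?)?;
        eprintln!("[timer] cancelled {} pending retries for {story_id}", before - timers.len());
    }
    Ok(())
}
```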
name: "CRDT lamport clock (inner.seq) resets to 1 on server restart instead of resuming from max(own_author_seq) + 1"
---
# Bug 511: CRDT lamport clock (inner.seq) resets to 1 on server restart instead of resuming from max(own_author_seq) + 1
## Description
When the huskies server restarts (e.g. via `rebuild_and_restart`), the local node's CRDT lamport clock — `inner.seq` on each `SignedOp` — appears to reset to 1 instead of resuming from `MAX(seq) + 1` for the local author's own previously-persisted ops.
**Discovered live on 2026-04-09** while inspecting the `crdt_ops` table after a `rebuild_and_restart`. Pre-restart ops were at seqs 485-492 (creation ops for stories 503-510). Post-restart ops were being persisted at seqs 1, 2, 3, 4, 5, 6, 7 — visible by sorting `crdt_ops` by `created_at DESC`:
```
created_at | seq
2026-04-09T18:49:56 → seq 7 ← post-restart
2026-04-09T18:38:22 → seq 6 ← post-restart
2026-04-09T18:37:45 → seq 492 ← pre-restart, last write before restart
2026-04-09T18:31:32 → seq 4 ← post-restart
2026-04-09T18:27:04 → seq 3 ← post-restart
```
So the local node, which had reached seq=492 before the restart, started writing new ops at seq=1 after the restart and is now climbing from there. This means **new ops have lower seqs than existing ops from the same author**.
## Why this matters
In a BFT JSON CRDT, `inner.seq` is the local lamport clock used for causality tracking. The library assumes per-author seqs are monotonically increasing — newer ops from the same author have higher seqs than older ops. Several things break when this invariant is violated:
1. **Causality / ordering on remote replay.** When a peer (or this same node after another restart) replays the persisted ops in `seq` order, the post-restart ops will be applied *before* the pre-restart ops, even though they happened later. This can produce a non-deterministic state and can cause field updates to "go backwards" — e.g. a story that was moved current → done pre-restart, then nothing post-restart, would correctly end at "done"; but if you also did a post-restart action, the seq ordering would re-play it in the wrong order.
2. **Op ID collisions.** The op id is a hash of the op contents (including author, seq, content). If a post-restart op happens to be structurally identical to a pre-restart op (e.g. "set stage to 1_backlog" with the same author and same seq=3), the op ids could collide. The persistence path uses `INSERT INTO crdt_ops ... ON CONFLICT(op_id) DO NOTHING`, which would *silently drop* the new op. (We have not yet observed this happen, but it's a latent risk.)
3. **Sync between nodes will desync.** Once the WebSocket sync layer (story 478, just merged) is exchanging ops between nodes, a restart on one node will produce ops with seqs that look "old" to the other node, and the receiving node may de-dupe or mis-order them. This will manifest as silent state divergence in multi-node deployments, which is exactly what the sync layer is supposed to prevent.
4. **Today's pipeline state confusion.** The 8 stories I created in this session (503-510) are at seqs 485-492 in the persisted CRDT. Their post-restart lifecycle moves are at seqs 1-15. If we replay the CRDT from disk in seq order, the lifecycle moves will be applied to *empty* state before the creation ops have run, and will silently no-op (because they reference content indexes that don't exist yet). On the next restart after this state, the in-memory view will show the stories in their *creation* state, not their post-restart-lifecycle state — i.e. all 8 stories will appear "stuck at 1_backlog" again. **This may well be the cause of bug 510's split-brain symptom**.
## Where the bug lives
`server/src/crdt_state.rs::init()` (around lines 80-115) replays persisted ops to reconstruct state, then constructs a fresh `CrdtState { crdt, keypair, index, persist_tx }`. The `BaseCrdt::new(&keypair)` call constructs a fresh CRDT with a fresh internal seq counter. The replay re-applies ops via `crdt.apply(signed_op)` which presumably updates the doc but does NOT advance the local seq counter (because `apply()` is for *remote* ops).
After replay, the local seq counter is at 0 (or wherever the BFT CRDT library defaults). The next call to `apply_and_persist` produces an op with `inner.seq = 1` (or whatever the next-counter value is) — even though there are already ops at seq 485+ from this author in the persisted state.
The fix is to inspect `MAX(inner.seq)` for ops where `author == local_keypair.public()` after the replay, and seed the BFT CRDT's local seq counter from `max + 1`. The exact API for "seed the seq counter" depends on the bft-json-crdt library — may need a small upstream change if not already exposed.
## How to Reproduce
1. Start a fresh huskies server with an empty database. Verify `crdt_ops` is empty.
2. Create several stories via `create_story` or similar — observe ops being persisted at incrementing seqs (1, 2, 3, ...).
3. Note the highest seq via `sqlite3 .huskies/pipeline.db "SELECT MAX(seq) FROM crdt_ops;"` — call this N.
4. Stop the server and start it again (or `rebuild_and_restart`).
5. Create another story via `create_story`.
6. Query `SELECT seq, created_at FROM crdt_ops ORDER BY created_at DESC LIMIT 5;`
## Actual Result
The new op (just created in step 5) is persisted with `seq = 1` (or some small value), NOT `seq = N + 1`. The lamport clock has been reset.
Concretely on 2026-04-09 we observed seqs in `crdt_ops` ordered by `created_at` DESC of: 7, 6, 492, 4, 3 — i.e. post-restart writes were at seqs 3, 4, 6, 7 even though the highest pre-restart seq was 492.
## Expected Result
After restart, the local node's seq counter must resume from `MAX(inner.seq)` across all persisted ops where `author == local_keypair.public()`, plus 1. The next op written by the local node should have `seq = N + 1` where N is the previous local high-water mark.
Stated equivalently: `inner.seq` on the local author's ops must be monotonically increasing across the entire lifetime of the local node's keypair, not just within a single process invocation.
## Acceptance Criteria
- [ ] After a server restart, the next CRDT op written by the local node has seq = MAX(local_author_seq from crdt_ops) + 1, not 1
- [ ] Regression test: seed crdt_ops with an op at seq=100 by the local author, restart the CRDT subsystem (or call init() in a test harness), trigger a write_item, assert the new op has seq=101
- [ ] Regression test: a brand-new node (no pre-existing ops) still starts at seq=1 (no off-by-one introduced by the fix)
- [ ] Inter-node test: simulate two nodes A and B, A writes ops up to seq=50, A restarts, A writes a new op which should be seq=51, broadcast to B, assert B applies it in the correct causal position
- [ ] If the fix requires changes to bft-json-crdt itself (to expose a way to seed the local seq), the upstream change is documented in the bug body and either landed or vendored
- [ ] After this fix is in place, replay-on-restart for the existing data (8 stories in pipeline_items at seqs 485-492 with lifecycle moves at seqs 1-15) is verified to produce the correct in-memory state — OR the existing broken-seq data is migrated as part of the fix
name: "Migrate chat commands from filesystem lookup to CRDT/DB"
---
# Story 512: Migrate chat commands from filesystem lookup to CRDT/DB
## User Story
**Depends on story 520** (typed pipeline state machine). This story is best implemented as a *consumer* of the typed transition API, not against the loose `PipelineItemView`. Wait for 520 to land first, then migrate the chat command lookups to use the typed `find_story_by_number → Result<PipelineItem, _>` helper from the new module.
---
**Note:** content stuffed into user_story per bug 509 workaround.
## Context
All the slash-style chat commands in `server/src/chat/commands/{move_story,show,depends,unblock}.rs` and `server/src/chat/transport/matrix/{start,assign,delete}.rs` look up stories by **searching for `.huskies/work/*/N_*.md` filesystem files**. After the 491/492 migration moved story content out of the filesystem and into `pipeline_items` + CRDT, these commands silently fail with `"No story, bug, or spike with number {N} found"` for any story whose filesystem shadow doesn't exist — *even when the story is fully present in the DB and CRDT*.
## Real user story
As a user typing chat commands in the web UI or the matrix bot, I want move/show/depends/unblock/start/assign/delete to find any story that's in the pipeline regardless of whether its filesystem shadow exists, so the chat workflow stays usable post-migration.
## Observed 2026-04-09
Master commit `41515e3b` had 503's code, the in-memory CRDT view had 503 at stage='merge', the `pipeline_items` row existed (post my-sqlite-update at `5_done`), but `move 503 done` in the web UI returned **`No story, bug, or spike with number 503 found`** because no `.huskies/work/4_merge/503_*.md` file existed.
## Implementation note
The MCP `move_story` tool already does this correctly: it goes through `lifecycle::move_item` which checks `crdt_state::read_item(story_id)` first. The chat commands need to use the same lookup helper. The fix should consolidate all "find story by number" logic into one shared function used by every command.
## Acceptance Criteria
- [ ] All seven chat commands (move_story, show, depends, unblock, start, assign, delete) successfully find stories that exist in CRDT but have no filesystem shadow
- [ ] Backward compat: commands still work for stories with only filesystem shadows (during the migration window)
- [ ] A single shared `find_story_by_number` helper is introduced and used by every chat command
- [ ] Lookup priority order is documented and consistent: CRDT first, then pipeline_items, then filesystem fallback
- [ ] Regression test per command covering CRDT-only, filesystem-only, both-present, and not-found cases
- [ ] Observed repro from 2026-04-09 (move 503 done failing even though 503 was fully present in CRDT and pipeline_items) is the canonical regression case
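**Sketch:** the shared helper's shape; the three closures stand in for the real crdt_state, pipeline_items, and filesystem accessors, and the priority order is the one the AC documents:
```rust
/// Hypothetical unified lookup. Priority order per the AC: CRDT first, then
/// pipeline_items, then the legacy filesystem shadow (migration window only).
fn find_story_by_number<T>(
    number: u32,
    crdt_lookup: impl Fn(u32) -> Option<T>,
    db_lookup: impl Fn(u32) -> Option<T>,
    fs_lookup: impl Fn(u32) -> Option<T>,
) -> Option<T> {
    crdt_lookup(number)
        .or_else(|| db_lookup(number))
        .or_else(|| fs_lookup(number))
}
```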
name: "Startup reconcile pass that detects drift between CRDT, pipeline_items, and filesystem shadows"
---
# Story 513: Startup reconcile pass that detects drift between CRDT, pipeline_items, and filesystem shadows
## User Story
**Note:** content stuffed into user_story per bug 509 workaround.
---
## Context
Post-491/492, huskies has **four places state lives** that can drift apart:
1. `crdt_ops` table — the persisted CRDT operation log (intended source of truth)
2. In-memory CRDT view — `state.crdt.doc.items` reconstructed from `crdt_ops` on startup, mutated by `apply_and_persist` during runtime
3. `pipeline_items` table — a shadow / materialised view, written to as a shadow alongside CRDT writes
4. Filesystem shadows in `.huskies/work/N_stage/*.md` — legacy rendering, still written by some paths and read by others
There is currently **no reconcile pass** that detects drift between them. We've watched this drift bite repeatedly today: stories appear in some views and not others, lifecycle moves happen in one but not another, my direct sqlite UPDATE was invisible to the API, etc. Each individual view looks "fine" in isolation, but the drift only becomes visible when a user notices a story behaving inconsistently.
## Real user story
As a developer or operator running huskies, I want a startup reconcile pass that compares all four state sources and either reconciles them automatically (preferred) or logs structured warnings about the drift, so I can detect and diagnose state corruption before it causes user-visible bugs.
## Observed 2026-04-09
Throughout this session we observed: 478 in pipeline_items but missing from CRDT (after a direct sqlite insert), 503 in CRDT at stage=merge but pipeline_items at stage=5_done (after my UPDATE), filesystem shadows in `1_backlog/` for stories that were already in 5_done in the DB (bug 510), etc. None of these were detected by huskies itself — they were all only found by running ad-hoc `SELECT` queries during incident response.
## Scope
This is the *detection* story, not the *fix-the-drift* story. The reconcile pass should:
- Run at startup (after CRDT replay, before serving requests)
- Compare each story's stage across all four sources
- Emit structured log lines for each drift type (CRDT-only, FS-only, DB-only, stage mismatch, etc.)
- Optionally surface a count to the matrix bot startup announcement (e.g. "⚠️ 3 stories have CRDT/DB drift — see logs")
The actual *fix-the-drift* logic (what to do when drift is detected) is a separate, larger story.
## Acceptance Criteria
- [ ] At server startup, after CRDT replay, a reconcile_state() function runs that walks all four state sources and detects drift
- [ ] Each drift type is logged with a structured line: e.g. `[reconcile] DRIFT story=X crdt_stage=Y db_stage=Z fs_stage=W` (or `MISSING` for absent)
- [ ] If any drift is detected, the matrix bot startup announcement includes a count and a suggestion to check the server logs
- [ ] The reconcile pass completes in < 1 second for a typical pipeline (~100 stories) so it doesn't slow startup meaningfully
- [ ] Tests cover: no drift (clean state), CRDT-only story, DB-only story, FS-only story, stage mismatch between CRDT and DB
- [ ] Documentation in README.md explains the reconcile pass and what each drift type means
- [ ] The pass is opt-out via a config flag in case it produces noise during the migration window
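**Sketch:** the comparison itself is small once the four views are fetched; the stage-lookup plumbing is abstracted away and every name is hypothetical:
```rust
/// One story's stage as seen by each of the four state sources; None = absent.
struct StageView {
    story_id: String,
    crdt_ops: Option<String>, // replayed from the persisted op log
    crdt_mem: Option<String>, // in-memory CRDT view
    db: Option<String>,       // pipeline_items row
    fs: Option<String>,       // .huskies/work/N_stage shadow
}

/// Detection only, per the story scope: log each drift line, return the count
/// so the startup announcement can include it.
fn reconcile_state(views: &[StageView]) -> usize {
    let mut drift = 0;
    for v in views {
        let sources = [&v.crdt_ops, &v.crdt_mem, &v.db, &v.fs];
        let present: Vec<&String> = sources.iter().filter_map(|s| s.as_ref()).collect();
        let missing_somewhere = present.len() != sources.len();
        let mismatched = present.windows(2).any(|w| w[0] != w[1]);
        if missing_somewhere || mismatched {
            drift += 1;
            eprintln!(
                "[reconcile] DRIFT story={} crdt_ops={:?} crdt_mem={:?} db={:?} fs={:?}",
                v.story_id, v.crdt_ops, v.crdt_mem, v.db, v.fs
            );
        }
    }
    drift
}
```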
name: "delete_story should do a full cleanup (CRDT op + DB row + filesystem shadow + worktree + pending timers)"
---
# Story 514: delete_story should do a full cleanup (CRDT op + DB row + filesystem shadow + worktree + pending timers)
## User Story
**Depends on story 520** (typed pipeline state machine). With 520 in place, `delete_story` becomes a single typed transition (`* → Archived(Abandoned)` or a hard-delete CRDT op) followed by event subscribers that handle the worktree, timers, and filesystem cleanup. This story should be re-shaped as the consumer migration once 520 lands.
---
**Note:** content stuffed into user_story per bug 509 workaround.
## Context
The MCP `delete_story` tool currently only **removes the filesystem markdown** from `.huskies/work/N_stage/`. It does NOT:
- Remove the row from `pipeline_items`
- Write a CRDT delete op to `crdt_ops`
- Tear down the in-memory CRDT entry
- Remove the `.huskies/worktrees/N_…/` worktree
- Cancel any pending rate-limit retry timers in `.huskies/timers.json`
So after `delete_story`, the story keeps appearing in `get_pipeline_status` (because the in-memory CRDT still has it), the timer fires and re-spawns an agent, the agent runs in the still-existing worktree, and the user has no idea why the "deleted" story keeps coming back.
## Real user story
As a user calling `delete_story` (via MCP, web UI, or chat command), I want a complete tear-down of all state associated with that story across every layer, so the story is actually gone — no in-memory cache entries, no pending agents, no timers, no worktree, no shadow files, no future spawns.
## Observed 2026-04-09
Repeatedly throughout the session. The most concrete example was around 17:20: I called `delete_story 478_…`, the tool returned success, the markdown file at `.huskies/work/1_backlog/478_…md` was removed, but at 17:25:17 the rate-limit retry timer fired and **re-spawned a coder-1 on the deleted story** because the worktree still existed, the pipeline_items row still existed, and the timer entry still existed in `.huskies/timers.json`. We then had to do sqlite surgery + manual worktree removal + manual timers.json edit to actually kill 478.
## Implementation note
The current `delete_story` is on the legacy filesystem path. The fix needs to wrap it in a transaction that touches every layer:
1. Cancel any pending timers for this story_id (read timers.json, filter, write back)
2. Stop any running/pending agents for this story_id (call `agent_pool.stop_agent` for each)
3. Remove the worktree if it exists (`git worktree remove`)
4. Write a CRDT delete op (`apply_and_persist` with a delete op)
5. Wait for the persist task to confirm
6. Delete the row from `pipeline_items` directly (or trust the materialiser to drop it)
7. Remove the filesystem shadow
Each step should be best-effort with logging — partial failures should be visible, not silent.
## Acceptance Criteria
- [ ] delete_story returns success only when ALL of the following are true: no row in pipeline_items, no op in crdt_ops referencing the story_id (or a delete op present), no in-memory CRDT entry, no worktree directory, no timer entries, no filesystem shadow
- [ ] Each tear-down step has its own log line so partial failures are diagnosable
- [ ] If any tear-down step fails, the tool returns an error with which step failed and what was already torn down (so the user can finish the cleanup manually)
- [ ] After delete_story, the story does NOT appear in get_pipeline_status, the web UI, or list_agents
- [ ] After delete_story, no rate-limit retry timer can re-spawn an agent on the deleted story
- [ ] Regression test using the 2026-04-09 repro: schedule a rate-limit timer for the story, call delete_story, fast-forward 5 minutes, assert no agent spawned
name: "Add a debug MCP tool to dump the in-memory CRDT state for inspection"
---
# Story 515: Add a debug MCP tool to dump the in-memory CRDT state for inspection
## User Story
**Note:** content stuffed into user_story per bug 509 workaround.
---
## Context
When diagnosing CRDT/state issues today, we had no way to look at the **in-memory** CRDT state directly. The closest available views were:
- `get_pipeline_status` — gives a summarised pipeline-shaped view (active/backlog/done) but hides the raw item structure, the index map, the lamport clock state, etc.
- Querying `crdt_ops` directly via sqlite — gives the *persisted* state, which can diverge from the in-memory state (we saw this with bug 511, where post-restart writes use reset seq counters)
- `read_item(story_id)` in `crdt_state.rs` — exists, returns a `PipelineItemView`, but is not exposed via MCP or HTTP
The result: every time I needed to check the CRDT state, I was either inferring it from `get_pipeline_status` (lossy) or querying the persisted ops (lagging the in-memory state). Neither gave me the ground truth.
## Real user story
As a developer debugging huskies state issues, I want an MCP tool (or HTTP debug endpoint) that returns a structured dump of the in-memory CRDT state, so I can see exactly what the running server thinks is true without inferring from summaries.
## Suggested API
- Tool name: `mcp__huskies__dump_crdt`
- Args: optional `story_id` filter (single story) or no args (dump everything)
- Returns: JSON with one entry per item containing: `story_id`, all field values (`stage`, `name`, `agent`, `retry_count`, `blocked`, `depends_on`), the CRDT path/index bytes (for cross-referencing with `crdt_ops`), the local lamport seq counter, and the item's `is_deleted` flag
- Returns metadata: total item count, current local seq counter value, count of pending ops in `persist_tx` channel (if observable)
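A sketch of that response shape as serde types (field names mirror the bullets above; none of this is the real API yet):
```rust
use serde::Serialize;

// Illustrative dump format for `dump_crdt`; all names are assumptions.
#[derive(Serialize)]
struct CrdtItemDump {
    story_id: String,
    stage: String,
    name: String,
    agent: Option<String>,
    retry_count: u32,
    blocked: bool,
    depends_on: Vec<String>,
    crdt_index: String, // path/index bytes, hex-encoded, for crdt_ops cross-referencing
    local_seq: u64,     // local lamport seq counter at last write
    is_deleted: bool,
}

#[derive(Serialize)]
struct CrdtDump {
    items: Vec<CrdtItemDump>,
    total_items: usize,
    local_seq_counter: u64,
    pending_persist_ops: Option<usize>, // None if the channel depth isn't observable
}
```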
## Observed 2026-04-09
This story would have saved us significant debugging time. Specific examples:
- When 478 was missing from `get_pipeline_status` after the manual sqlite insert, we had to infer "the API reads from in-memory CRDT, not from pipeline_items" by looking at source code. A `dump_crdt 478_…` call would have returned "not found" immediately, confirming the same conclusion.
- When 503 was showing at stage=merge in the API but only had a creation op at stage=1_backlog in `crdt_ops`, we had to manually search for content-indexed update ops to figure out where the post-restart updates went. A dump tool showing the current in-memory state vs the persisted op count would have made the divergence obvious.
## Acceptance Criteria
- [ ] New MCP tool `dump_crdt` is registered and callable
- [ ] With no args, returns all items in the in-memory CRDT as a structured JSON list
- [ ] With a story_id arg, returns just that one item (or null if not found)
name: "update_story.description should create the ## Description section if it doesn't exist (instead of erroring)"
---
# Story 516: update_story.description should create the ## Description section if it doesn't exist (instead of erroring)
## User Story
**Note:** content stuffed into user_story per bug 509 workaround.
---
## Context
The MCP `update_story` tool's `description` parameter "replaces the `## Description` section content". If the section doesn't exist in the story file, the call **errors out** with `Section '## Description' not found in story file.`
This becomes a real problem when:
1. A story was created via `create_story` (which is buggy per 509 and writes a stub template with no `## Description` section)
2. The user later wants to add a description via `update_story`
3. The update fails with the cryptic "section not found" error
We hit this exact scenario today: after bug 509 dropped the descriptions of 6 stories (500, 504, 505, 506, 507, 508), I tried to recover them by calling `update_story` with `description=...` — and the call errored out because the stub template the buggy `create_story` had written had no `## Description` section. We had to fall back to stuffing everything into the `user_story` field.
## Real user story
As a user calling `update_story.description` on any story (regardless of how it was originally created), I want the call to either replace the existing `## Description` section OR create one if it doesn't exist, so I never have to think about the template structure.
## Implementation note
The simplest fix is in the `update_story_in_file` (or equivalent) function: when looking for the `## Description` section, if not found, **insert it** at a sensible location — probably between `## User Story` and `## Acceptance Criteria` — and then write the description content there.
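A hedged sketch of that upsert behaviour as standalone string manipulation (the real helper and its name may differ):
```rust
// Illustrative only. Replaces the body of `## Description` if present;
// otherwise inserts the section before `## Acceptance Criteria`, or appends
// at EOF as a last resort.
fn upsert_description(markdown: &str, description: &str) -> String {
    let header = "## Description";
    if let Some(start) = markdown.find(header) {
        let body_start = start + header.len();
        // The section runs until the next "## " heading or EOF.
        let end = markdown[body_start..]
            .find("\n## ")
            .map(|i| body_start + i)
            .unwrap_or(markdown.len());
        format!("{}{}\n{}\n{}", &markdown[..start], header, description, &markdown[end..])
    } else if let Some(anchor) = markdown.find("## Acceptance Criteria") {
        format!("{}{}\n{}\n\n{}", &markdown[..anchor], header, description, &markdown[anchor..])
    } else {
        format!("{}\n{}\n{}\n", markdown, header, description)
    }
}
```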
Related: this story partially covers the workaround for bug 509 (create_story drops description). If 509 is fixed first, the templates would always have a `## Description` section and this wouldn't matter. But this fix is still valuable for older stories created before 509 lands, AND for stories created via legacy paths that don't use the canonical template.
name: "Remove filesystem-shadow fallback paths from lifecycle.rs (finish the migration to CRDT-only)"
---
# Story 517: Remove filesystem-shadow fallback paths from lifecycle.rs (finish the migration to CRDT-only)
## User Story
**Depends on story 520** (typed pipeline state machine). Once 520 lands and consumers are migrated to the typed transition API, the lifecycle module no longer needs filesystem fallbacks — all state changes go through the typed `transition` function and the event bus. This story becomes the natural cleanup pass after 520 + 512 + 514 land.
---
**Note:** content stuffed into user_story per bug 509 workaround.
## Context
`server/src/agents/lifecycle.rs::move_item` (the helper that backs `move_story_to_current`, `move_story_to_done`, `move_story_to_merge`, etc.) has **three execution paths**:
1. **CRDT-first path** (the "happy" post-migration path) — calls `crdt_state::read_item(story_id)`, then `db::move_item_stage`, which writes a CRDT op and broadcasts events
2. **Content-store fallback** — if the story isn't in CRDT but exists in the db's content store, import it via `db::write_item_with_content`
3. **Filesystem fallback** — if neither, scan `.huskies/work/N_stage/` for a markdown file, import it to the DB
Paths 2 and 3 are **migration scaffolding**. They were necessary while stories existed only on disk and the CRDT was empty, but post-491/492 they should be unnecessary. Worse, they actively *cause* drift today:
- The filesystem fallback can re-import stale shadow files into the DB, undoing intentional deletes
- The path 3 search is blind to which stage a story "should" be in per the DB — it picks whatever stage dir has the file, which can promote stale shadows
- This is the mechanism that makes bug 510 (split-brain shadow promotion) possible
`move_story_to_current` is hardcoded to read from `["1_backlog"]`, which is also part of the same legacy filesystem assumption.
## Real user story
As a developer maintaining huskies, I want the lifecycle code to operate exclusively on the CRDT/DB and never touch filesystem shadows, so state drift is eliminated and the post-migration architecture is consistent.
## Implementation plan
1. Inventory every code path in `lifecycle.rs` that touches the filesystem under `.huskies/work/`
2. For each, determine whether it's a *read* (legacy fallback — can be removed if we're confident all stories are in CRDT now) or a *write* (legacy mirror — can be deferred to a separate filesystem-renderer task that derives state from CRDT)
3. Remove the read fallbacks
4. Move the writes to a downstream materialiser task that writes the filesystem shadows from CRDT events, so they're strictly read-only renderings (a sketch follows this list)
5. Run the bug-510 reconcile pass at startup (story TBD) before this lands, to ensure no story is stranded with only a filesystem shadow
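For step 4, an illustrative shape for the materialiser; `CrdtEvent`, the channel choice, and the paths are assumptions, not existing code:
```rust
use std::path::PathBuf;
use tokio::sync::broadcast;

// Hypothetical event type emitted by the CRDT layer.
#[derive(Clone, Debug)]
enum CrdtEvent {
    ItemChanged { story_id: String, stage: String, markdown: String },
    ItemDeleted { story_id: String },
}

// Derives the .huskies/work shadows from CRDT events, so lifecycle code
// never has to touch the filesystem itself.
async fn shadow_renderer(root: PathBuf, mut rx: broadcast::Receiver<CrdtEvent>) {
    while let Ok(event) = rx.recv().await {
        match event {
            CrdtEvent::ItemChanged { story_id, stage, markdown } => {
                // The shadow is a derived, read-only rendering of CRDT state.
                let path = root.join(&stage).join(format!("{story_id}.md"));
                if let Err(e) = tokio::fs::write(&path, markdown).await {
                    eprintln!("[renderer] failed to write shadow {path:?}: {e}");
                }
            }
            CrdtEvent::ItemDeleted { story_id } => {
                // Remove shadows for this story across all stage dirs (elided).
                let _ = story_id;
            }
        }
    }
}
```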
## Observed 2026-04-09
We watched the filesystem fallback paths cause harm multiple times today:
- Bug 510 split-brain: filesystem shadows in `1_backlog/` got re-promoted by timer fires after the DB had already moved the story to `5_done`
- The 478 worktree's `move_story_to_current` no-op'd because there was no `1_backlog` shadow — even though 478 was in `4_merge` per the DB (this was actually correct behaviour given the function's narrow `from = ["1_backlog"]`, but it surfaces how filesystem-bound the function is)
- Lifecycle moves were happening on the filesystem without writing CRDT ops (we initially mis-diagnosed this as "no transition ops in CRDT" before finding bug 511)
## Acceptance Criteria
- [ ] Inventory of every filesystem touch in lifecycle.rs is documented in the story body or a follow-up comment
- [ ] All read fallbacks in lifecycle.rs (paths 2 and 3 above) are removed
- [ ] All write paths in lifecycle.rs that mirror to the filesystem are moved to a separate materialiser task driven by CRDT events
- [ ] After the change, lifecycle.rs has zero direct std::fs:: calls under .huskies/work/
- [ ] move_story_to_current no longer hardcodes from=['1_backlog'] — it reads the source stage from CRDT
- [ ] Regression: the existing 'try filesystem fallback' tests are updated to test the new CRDT-only path instead of being deleted
- [ ] A pre-flight script verifies all existing stories are in CRDT before this change lands (so nothing gets stranded)
- [ ] Bug 510 (split-brain shadows) no longer reproduces after this change
name: "Log persist_tx.send failures in crdt_state.rs instead of silently dropping ops"
---
# Story 518: Log persist_tx.send failures in crdt_state.rs instead of silently dropping ops
## User Story
**Note:** content stuffed into user_story per bug 509 workaround.
---
## Context
`crdt_state.rs` hands every op to the persistence task over a channel, but ignores the result of the send:
```rust
let _ = state.persist_tx.send(signed.clone()); // ← fire-and-forget, error dropped
...
}
```
The `let _ = ...` discards the return value of `send()`. If the channel is closed (because the persistence task panicked, was shut down, or has dropped its receiver), the op is silently dropped from persistence — but the in-memory CRDT is already updated. The next restart will replay only the persisted ops, and the in-memory state will quietly diverge from the persisted state.
This is also a candidate cause of the state drift we've been chasing, and it's hard to rule out: there is no log line confirming whether the persist task is still alive or whether sends are succeeding.
## Real user story
As a developer or operator, I want any failure of `persist_tx.send()` to be logged immediately at WARN or ERROR level, so silent persistence loss is detectable instead of invisible.
## Observed 2026-04-09
Spent significant time investigating whether persist sends were silently failing. Eventually ruled it out empirically (ops WERE being persisted, just with reset seq counters per bug 511), but the diagnosis would have taken minutes instead of an hour if there had been a log line to check.
## Fix (small)
```rust
if let Err(e) = state.persist_tx.send(signed.clone()) {
crate::slog_error!(
"[crdt] Failed to send op to persist task: {e}; persist task may be dead. \
In-memory state is now ahead of persisted state."
);
}
```
Apply the same fix at every `let _ = state.persist_tx.send(...)` site in crdt_state.rs (there are at least 2 — one in apply_and_persist, one in apply_remote_op).
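One hedged way to avoid copy-pasting the block at both sites is a tiny wrapper (sketch; the channel is assumed to be an unbounded tokio mpsc, and `SignedOp` stands in for the real signed-op type):
```rust
use tokio::sync::mpsc::UnboundedSender;

// Hypothetical helper; `slog_error!` is the project's logging macro, used
// exactly as in the fix above.
fn send_to_persist(persist_tx: &UnboundedSender<SignedOp>, op: SignedOp) {
    if let Err(e) = persist_tx.send(op) {
        crate::slog_error!(
            "[crdt] Failed to send op to persist task: {e}; persist task may be dead. \
             In-memory state is now ahead of persisted state."
        );
    }
}
```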
## Acceptance Criteria
- [ ] Every call site of `state.persist_tx.send(...)` in crdt_state.rs logs at ERROR level on send failure
- [ ] The error message includes the channel error and a clear note that 'in-memory and persisted state may have diverged'
- [ ] Unit test: shut down the persist receiver (drop the rx end), call write_item, assert an error is logged
- [ ] No regression in the happy path (no extra log lines on success)
- [ ] Consider: also expose a counter / metric for persist send failures so it can be monitored without grepping logs
name: "mergemaster should detect no-commits-ahead-of-master and fail loudly instead of exiting silently"
---
# Story 519: mergemaster should detect no-commits-ahead-of-master and fail loudly instead of exiting silently
## User Story
**Depends on story 520** (typed pipeline state machine). Once 520 lands, this story largely *evaporates*: `Stage::Merge` is defined as `Merge { feature_branch: BranchName, commits_ahead: NonZeroU32 }`, so a merge state with zero commits ahead is **structurally unrepresentable**. The transition `Current → Merge` (or `Qa → Merge`) is required to provide a NonZeroU32 — the type system enforces it. This story remains useful as a *defensive runtime check* during the migration window before 520 lands; afterwards, it should be closed as redundant.
---
**Note:** content stuffed into user_story per bug 509 workaround.
## Context
When mergemaster runs on a story whose worktree has **zero commits ahead of master** (e.g. because `create_worktree` always creates from master and the original feature branch was never checked out into the worktree), it currently:
1. Spawns its claude session
2. Runs `merge_agent_work` MCP tool
3. Finds nothing to merge
4. Exits cleanly with `[agent:N:mergemaster] Done. Session: ...`
5. **Does not log any error or warning**
6. **Spends real money** on the empty session — we observed `cost=$0.82` for one such no-op run
The user has no signal that the merge didn't actually happen. The matrix bot fires a "QA → Merge" stage notification (because the story did move stages internally), then nothing — no `🎉 Merge → Done` notification follows. Master is unchanged.
## Real user story
As a user watching the pipeline, I want mergemaster to detect "this worktree has no commits ahead of master" *before* spending money on a Claude session, and fail loudly with a clear error so I know to investigate the upstream cause (probably the worktree got reset to master).
## Observed 2026-04-09
Around 18:31:51, mergemaster spawned for 478 in a worktree that had been reset to master by the orphan cleanup logic at 18:29:54. By the time mergemaster ran, the worktree was on master with zero commits ahead. It ran a session, spent $0.82, exited "Done", and didn't merge anything. We didn't notice for several minutes because the failure was completely silent. We had to manually `git log master..feature/story-478_…` to confirm there was no merge commit on master.
## Fix
In mergemaster's startup sequence (probably in advance.rs or wherever the mergemaster session is spawned), add a pre-flight check:
```rust
let commits_ahead = git_commits_ahead(worktree_path, "master")?;
if commits_ahead == 0 {
slog_error!(
"[mergemaster] worktree {worktree_path} has no commits ahead of master; \
refusing to spawn merge session. Likely cause: worktree was reset to \
master after the feature branch's commits were created. Investigate the \
worktree's git state before retrying."
);
return Err("no commits to merge".into());
}
```
This costs ~milliseconds (one git command) and saves the cost of an entire Claude session per false-positive.
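`git_commits_ahead` is a name assumed by the snippet above, not an existing helper; a minimal sketch shelling out to `git rev-list --count` (error handling simplified):
```rust
use std::path::Path;
use std::process::Command;

/// Counts commits on the worktree's HEAD that are not on `base`.
/// Sketch only: branch-name validation and richer errors are elided.
fn git_commits_ahead(worktree_path: &Path, base: &str) -> Result<u32, String> {
    let out = Command::new("git")
        .arg("-C").arg(worktree_path)
        .args(["rev-list", "--count", &format!("{base}..HEAD")])
        .output()
        .map_err(|e| format!("failed to run git: {e}"))?;
    if !out.status.success() {
        return Err(String::from_utf8_lossy(&out.stderr).into_owned());
    }
    String::from_utf8_lossy(&out.stdout)
        .trim()
        .parse::<u32>()
        .map_err(|e| format!("unexpected rev-list output: {e}"))
}
```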
## Acceptance Criteria
- [ ] Before mergemaster spawns its Claude session, it runs `git log master..HEAD --oneline` (or equivalent) on the worktree
- [ ] If the result is empty (zero commits ahead), mergemaster exits early with an ERROR log line and does NOT spawn the session
- [ ] The error message is specific enough that the user can diagnose the upstream cause (e.g. mentions 'worktree was reset' and suggests checking the worktree's branch)
- [ ] The matrix bot sends a clear failure notification (NOT a successful 🎉 emoji) when this happens
- [ ] The story does not advance to a 'done' state when mergemaster exits this way; it stays in 4_merge with a clear blocked status
- [ ] Regression test: create a worktree on master (no feature commits), invoke mergemaster, assert the early exit happens and no Claude session is spawned
- [ ] Cost saving observed in the 2026-04-09 incident ($0.82 per no-op session) is documented in the test as the motivation
name: "Typed pipeline state machine in Rust (foundation: replaces stringly-typed CRDT views with strict enums, subsumes 436)"
---
# Story 520: Typed pipeline state machine in Rust (foundation: replaces stringly-typed CRDT views with strict enums, subsumes 436)
## User Story
**Note:** content stuffed into user_story per bug 509 workaround.
---
## Context
Today huskies represents pipeline state as a loose JSON document inside the BFT JSON CRDT. Each story has fields like `stage: String`, `agent: String`, `retry_count: f64`, `blocked: bool`, `depends_on: String` (JSON-encoded list, double-encoded). This stringly-typed representation allows **many impossible states** to be representable in the data model:
- `stage = "9_invalid"` — typo, no compile error
- `stage = "5_done"` + `blocked = true` — a done story is blocked? what does that mean?
- `stage = "4_merge"` with no commits ahead of master — the silent mergemaster failure mode (today's story 478)
- A coder agent assigned to a story in `4_merge` — bug 502, the loop we fought all day today
- `retry_count = 3.7` — fractional retry counts (it's an f64 because that's what JSON CRDTs do)
- `agent = "coder-1"` AND `stage = "1_backlog"` — backlog story has an agent? sentinel encoding via empty string
Multiple bugs filed today (501, 502, 510, 511) exist *because* the type system can't enforce the pipeline invariants. **Patching individual symptoms forever is the wrong strategy.** The right strategy is to make impossible states unrepresentable at the Rust type level, using a typed state machine layered on top of the loose CRDT. The CRDT can stay loose at the persistence layer (it has to be — that's what makes it merge correctly across nodes), but every consumer above the CRDT operates on strict typed enums.
## Real user story
As a developer working on huskies, I want the pipeline state to be expressed as a strict Rust state machine where impossible states and impossible transitions are compile-time errors, so future bugs in this category become structural rather than runtime drift.
## Design
### Two enum hierarchies
**Synced state (CRDT-backed, converges across nodes):**
The execution state lives in the CRDT under **each node's pubkey**. Each node only writes to entries where `node_pubkey == self.pubkey`, so there's no merge conflict — concurrent writes from the same author follow LWW, concurrent writes from different authors target different entries entirely. All nodes can READ all execution states across the mesh.
**This per-node-keyed CRDT pattern enables:**
- **Cross-node observability** — matrix bot can show "node A is running coder-1 on story X, node B is rate-limited on story Y"
- **Heartbeat detection** — if a node hasn't updated its execution_state in N minutes, the entry is "stale" (laptop closed, process crashed, oom kill, etc.)
- **Foundation for story 479** (CRDT work claiming) — a node knows what other nodes are doing *before* claiming work
- **Stuck job recovery** — if node A's heartbeat dies mid-run, node B can see the stuck state and decide whether to take over
- **Crash forensics** — the last persisted ExecutionState before a crash is preserved in CRDT, accessible from any node
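A sketch of the per-node entry this could map to (every name below is an assumption drawn from the bullets above):
```rust
use std::time::SystemTime;

// Illustrative execution state, keyed in the CRDT by the node's pubkey.
#[derive(Clone, Debug)]
enum ExecutionState {
    Idle,
    Running { story_id: String, agent: String },
    RateLimited { story_id: String, retry_at: SystemTime },
}

#[derive(Clone, Debug)]
struct NodeExecution {
    node_pubkey: String,        // CRDT key: only this node writes this entry
    state: ExecutionState,
    last_heartbeat: SystemTime, // stale entry => node crashed / laptop closed
}
```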
### The transition function
```rust
fn transition(
state: PipelineItem,
event: PipelineEvent,
) -> Result<PipelineItem, TransitionError>
```
Pure function. Takes the current state and an event, returns either the new state or a TransitionError. The compiler enforces that the result of every transition is structurally valid — you can't construct a `Stage::Merge` without `commits_ahead: NonZeroU32`, you can't construct a `Stage::Done` without a `merge_commit: GitSha`, etc.
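A hedged sketch of the typed hierarchy, with variants and fields taken from the acceptance criteria below (the newtypes are placeholders, not the real code):
```rust
use std::num::NonZeroU32;
use std::time::SystemTime;

type GitSha = String;     // placeholder newtype
type StoryId = String;    // placeholder newtype
type BranchName = String; // placeholder newtype

#[derive(Clone, Debug)]
enum ArchiveReason {
    Abandoned,
    Blocked { on: Vec<StoryId> },
    MergeFailure { detail: String },
    ReviewHold,
}

#[derive(Clone, Debug)]
enum Stage {
    Backlog,
    Current,
    Qa,
    // Zero-commit merges are unrepresentable: NonZeroU32 cannot hold 0.
    Merge { feature_branch: BranchName, commits_ahead: NonZeroU32 },
    // The ACs say DateTime; std::time is used here to keep the sketch dependency-free.
    Done { merge_commit: GitSha, merged_at: SystemTime },
    Archived { reason: ArchiveReason },
}
```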
**The set of valid transitions is small** (roughly 10): for example `Backlog → Current`, `Current → Qa`, `Current → Merge`, `Qa → Merge`, `Merge → Done`, and `* → Archived(reason)`. The `transition` function matches on them exhaustively, so adding a stage variant is a compile error at every call site until it is handled.
### The event bus
Each subscriber is independent and concerns itself only with its own dispatch. Adding a new side effect = adding a new subscriber, not editing the transition function. **The "many things happen on state changes" complexity moves out of the state machine and into the bus consumers**, where each piece is testable in isolation.
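A minimal sketch of that wiring, assuming a tokio broadcast channel (none of this is existing code; `PipelineEvent` is a stand-in for the event type emitted by the transition layer):
```rust
use tokio::sync::broadcast;

#[derive(Clone, Debug)]
struct PipelineEvent; // e.g. { story_id, old_stage, new_stage, ... }

// One sender, N independent receivers: each side effect lives in its own task.
fn spawn_subscribers(bus: &broadcast::Sender<PipelineEvent>) {
    // matrix bot, filesystem renderer, pipeline_items materialiser,
    // auto-assign, web UI broadcaster: one independent task each.
    for name in ["matrix-bot", "fs-renderer", "materialiser", "auto-assign", "web-ui"] {
        let mut rx = bus.subscribe();
        tokio::spawn(async move {
            // A real subscriber would handle RecvError::Lagged instead of exiting.
            while let Ok(event) = rx.recv().await {
                // Each subscriber dispatches only on the events it cares about.
                println!("[{name}] {event:?}");
            }
        });
    }
}
```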
### Projection layer (loose CRDT ↔ typed Rust)
The bft-json-crdt JSON document is the persistence layer. The typed enums are the application layer. A projection function bridges them at one carefully-controlled boundary:
```rust
impl TryFrom<&PipelineItemCrdt> for PipelineItem {
    type Error = ProjectionError;
    fn try_from(raw: &PipelineItemCrdt) -> Result<Self, ProjectionError> {
        // parse stage string into Stage, depends_on JSON into Vec<StoryId>, ...
        todo!()
    }
}
```
When the CRDT contains data the typed layer can't parse (e.g. a stage value from a future huskies version, OR a merge that produces an inconsistent intermediate state), `try_from` returns a `ProjectionError`. The error surfaces to the caller — it doesn't silently propagate as garbage. The validation happens at exactly one point: the projection boundary.
## What this subsumes
**Story 436** ("Unify story stuck states into a single status field") is subsumed by the `Stage::Archived { reason: ArchiveReason }` variant. The unified status field IS the `ArchiveReason` enum. Story 436 is marked superseded by this story.
## What this enables (concrete bug eliminations)
- **Bug 502** becomes unrepresentable: there's no way to construct `Stage::Merge` with a Coder agent — Coder agents only attach to `Current` / `Qa` stages, and that constraint is in the type signature of the transition function.
- **Bug 510** becomes irrelevant: there's no "stage='1_backlog' filesystem shadow vs stage='5_done' DB" drift, because Stage is a typed enum with a single source of truth (the CRDT), and projections are derived deterministically.
- **Bug 519** (mergemaster silent on no-op merge) becomes unrepresentable: `Stage::Merge` requires `commits_ahead: NonZeroU32`. You can't construct a Merge state with zero commits.
- **Bug 511** (lamport seq reset) becomes detectable: the projection layer notices when CRDT data fails to parse cleanly and surfaces a ProjectionError instead of silently producing garbage in-memory state.
- **Story 479** (CRDT work claiming) has a clean foundation: ExecutionState gives every node visibility into what every other node is doing, including stale-heartbeat detection.
- **Future state machine bugs become compile errors**, not runtime drift.
## Implementation order
1. Define the `Stage`, `ArchiveReason`, `ExecutionState`, `PipelineItem` types in a new module (e.g. `server/src/pipeline_state.rs`).
2. Implement the projection layer (try_from / from for PipelineItemCrdt).
3. Implement the `transition` function with exhaustive valid transitions.
4. Implement the event bus.
5. Migrate consumers ONE AT A TIME — chat commands, lifecycle, API, auto-assign, matrix bot. Each migration is isolated; the compiler tells you when you've missed something.
6. Once nothing reads the loose `PipelineItemView` anymore, delete it.
7. Story 436 closes when this lands.
## Acceptance Criteria
- [ ] A new module (e.g. server/src/pipeline_state.rs) defines the Stage, ArchiveReason, ExecutionState, and PipelineItem types with the variants described in the design
- [ ] Stage::Merge has a NonZeroU32 commits_ahead field (so the bug 519 silent no-op merge is unrepresentable)
- [ ] Stage::Done has GitSha merge_commit and DateTime merged_at fields (so a 'done' story always has merge metadata)
- [ ] ArchiveReason enum subsumes the old blocked / merge_failure / review_hold front matter fields, with a sub-reason variant for each
- [ ] PipelineItem.depends_on is Vec<StoryId>, not String (no more JSON-as-string)
- [ ] ExecutionState lives in the CRDT under per-node-pubkey keys; each node only writes to its own subspace (validated by CRDT signature check)
- [ ] Last_heartbeat field is updated periodically by the running node so other nodes can detect stale entries
- [ ] A pure transition(state, event) -> Result<PipelineItem, TransitionError> function exists and is exhaustively pattern-matched
- [ ] Every valid transition listed in the design (~10) is implemented and unit-tested with both success and error cases
- [ ] The TryFrom<&PipelineItemCrdt> for PipelineItem projection function handles every currently-valid CRDT state and returns a structured ProjectionError for invalid ones (instead of silently propagating garbage)
- [ ] An event bus pattern is in place where matrix bot, filesystem renderer, pipeline_items materialiser, auto-assign, and web UI broadcaster are independent subscribers
- [ ] All call sites that previously read item.stage as a string or used the blocked / merge_failure / review_hold fields are migrated to the typed enum API
- [ ] Story 436 is closed as superseded by this story
- [ ] Bug 502 has a regression test that confirms the type system prevents the loop (the test should be a compile-fail test if possible)
- [ ] Bug 510 (filesystem shadow split-brain) no longer reproduces after this lands, because the typed state machine has a single source of truth
- [ ] Documentation in README.md or a new ARCHITECTURE.md explains the type hierarchy, the transition function, the event bus pattern, and the per-node ExecutionState convention
name: "MCP/HTTP capability to write a CRDT tombstone (delete op) for a story, to clear it from in-memory state"
---
# Story 521: MCP/HTTP capability to write a CRDT tombstone (delete op) for a story, to clear it from in-memory state
## User Story
**Note:** content stuffed into user_story per bug 509 workaround.
---
## Context
Today (2026-04-09) we discovered the hard way that **there is no way to remove a story from huskies's running in-memory state without restarting the server process**. The state machines that keep stories alive include:
1. The persisted CRDT op log (`crdt_ops` table) — direct sqlite DELETE works
2. The in-memory CRDT view (`CRDT_STATE` global in `server/src/crdt_state.rs`) — **no eviction API**
3. The in-memory content store (`CONTENT_STORE` in `server/src/db/mod.rs:46`) — has `delete_content()` but no MCP / HTTP exposure
4. The shadow `pipeline_items` table — direct sqlite DELETE works
5. Filesystem shadows under `.huskies/work/` — `find -delete` works
6. `timers.json` — direct file edit works
If a story gets into a bad state (split-brain, ghost row, runaway timer respawning it), we can scrub all the *persistent* layers (1, 4, 5, 6) but the *in-memory* layers (2, 3) keep regenerating it because some periodic code reads in-memory state and writes new ops based on what it sees. The only way to clear in-memory state today is `docker restart huskies`, which is heavy and disrupts the matrix bot, web UI, and any in-flight agents.
We need a **scoped, surgical capability** to write a CRDT tombstone op for a single story_id, which:
- Marks the in-memory item as `is_deleted = true`
- Persists the tombstone op to `crdt_ops` so future replays don't resurrect the story
- Removes the story from `CONTENT_STORE`
- Cleans up any pending `timers.json` entries for the story
- Cancels any running agents on the story
…and exposes it as an MCP tool (e.g. `mcp__huskies__purge_story`) and ideally an HTTP endpoint, so an operator can "kill it with fire" without restarting the server.
## Real user story
As a huskies operator, I want a single MCP/HTTP call that completely removes a story from every layer of state — persistent AND in-memory — so I never have to restart the entire server just to clean up one stuck story.
## Observed 2026-04-09
We spent the last hour of this session whack-a-moling stories 503 and 478. Even after:
- `DELETE FROM pipeline_items WHERE id LIKE '503%'` ✓
- `DELETE FROM crdt_ops WHERE op_json LIKE '%503_bug_depends_on%'` ✓
- `mcp stop_agent + remove_worktree` for the running coders ✓
- `find .huskies/work -name '503_*' -delete` ✓
- emptying `timers.json` (multiple times — kept getting re-populated) ✓
…503 kept reappearing in `current` with new agents being spawned. The root cause: the in-memory `CRDT_STATE` (loaded from `crdt_ops` at startup at 18:19) still had 503 and 478 as live items, and a periodic code path was reading `crdt_state::read_all_items()`, seeing them as live, and triggering the auto-assign / rate-limit-retry chain.
Final resolution: `docker restart huskies` to wipe the in-memory state. Worked, but it's a sledgehammer.
## Implementation note
The bft-json-crdt library appears to support per-item delete via the `is_deleted: bool` field on each CRDT item (visible in the persisted op JSON we inspected today). Writing a delete op should look something like:
```rust
crdt_state::apply_and_persist(&mut state, |s| {
s.crdt.doc.items[idx].delete() // or whatever the BFT JSON CRDT delete API is
})
```
The op gets signed, applied to the in-memory state (marking the item deleted), and persisted to crdt_ops via the existing channel. Then `read_all_items()` should filter out `is_deleted: true` entries (it may already do this — verify in `extract_item_view`).
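If `extract_item_view` doesn't already skip tombstones, the filter is a one-liner. Sketch only; the stand-in types below are not the real ones:
```rust
// Minimal stand-ins for the real crdt_state types.
struct CrdtItem { is_deleted: bool /* ...other CRDT fields... */ }
struct PipelineItemView; // placeholder

fn extract_item_view(_item: &CrdtItem) -> PipelineItemView { PipelineItemView }

// Tombstones stay in the op log for replay but drop out of every view.
fn read_all_items(items: &[CrdtItem]) -> Vec<PipelineItemView> {
    items
        .iter()
        .filter(|item| !item.is_deleted)
        .map(extract_item_view)
        .collect()
}
```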
## Why this is distinct from bug 514 (delete_story full cleanup)
Bug 514 is about making the existing `delete_story` MCP tool do a full cleanup across all the layers we know about. **This** story is specifically about acquiring the *capability* to write a CRDT tombstone — without that, bug 514 can't be implemented correctly because it has no way to clear in-memory state. So 521 is a prerequisite for 514.
It's also a prerequisite for properly handling the fix for bug 510 (split-brain shadows) — when the reconcile pass detects a stale story, it needs a way to actually evict it. That eviction is what this story provides.
## Acceptance Criteria
- [ ] A new MCP tool (e.g. `mcp__huskies__purge_story`) is registered and callable
- [ ] The tool takes a story_id and returns a structured result indicating which layers were cleared (CRDT op, content store, timers, agents, worktree, filesystem)
- [ ] The tool writes a signed CRDT tombstone op (is_deleted: true) for the item, applies it to the in-memory CRDT, and persists it to crdt_ops
- [ ] After the tool runs, `read_all_items()` does NOT return the purged story (verify the filter handles is_deleted)
- [ ] After the tool runs, `read_content(story_id)` returns None (CONTENT_STORE entry is removed)
- [ ] After the tool runs, `timers.json` has no entries for the story
- [ ] After the tool runs, no agents are running on the story (stop_agent is called for any active ones)
- [ ] After the tool runs, the worktree at `.huskies/worktrees/{story_id}/` is removed
- [ ] After the tool runs, the filesystem shadow at `.huskies/work/*/{story_id}.md` is removed
- [ ] Idempotent: calling purge_story twice on the same story_id is safe and doesn't error
- [ ] Bug 514 (delete_story full cleanup) is updated to use this purge capability internally
- [ ] Regression test: insert a story via the normal write path, call purge_story, restart the server, verify the story is still gone (i.e. the tombstone persisted correctly)
@@ -10,7 +10,7 @@ The `prompt_permission` MCP tool returns plain text ("Permission granted for '...
 ## How to Reproduce
-1. Start the storkit server and open the web UI
+1. Start the huskies server and open the web UI
 2. Chat with the claude-code-pty model
 3. Ask it to do something that requires a tool NOT in `.claude/settings.json` allow list (e.g. `wc -l /etc/hosts`, or WebFetch to a non-allowed domain)
@@ -6,7 +6,7 @@ name: "Retry limit for mergemaster and pipeline restarts"
 ## User Story
-As a developer using storkit, I want pipeline auto-restarts to have a configurable retry limit so that failing agents don't loop infinitely consuming CPU and API credits.
+As a developer using huskies, I want pipeline auto-restarts to have a configurable retry limit so that failing agents don't loop infinitely consuming CPU and API credits.