fix: add --all to cargo fmt in script/test and autoformat codebase

cargo fmt without --all fails with "Failed to find targets" in
workspace repos. This was blocking every story's gates. Also ran
cargo fmt --all to fix all existing formatting issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dave
2026-04-13 14:07:08 +00:00
parent ed2526ce41
commit 845b85e7a7
128 changed files with 3566 additions and 2395 deletions
@@ -1,70 +0,0 @@
---
name: "Stale 1_backlog filesystem shadows get re-promoted by rate-limit retry timers, yanking successfully-merged stories back into current"
---
# Bug 510: Stale 1_backlog filesystem shadows get re-promoted by rate-limit retry timers, yanking successfully-merged stories back into current
## Description
After a story successfully completes the entire pipeline — coder runs, gates pass, mergemaster squashes the feature branch to master, lifecycle moves the story from `4_merge/` to `5_done/` — a stale filesystem shadow of the story's markdown file remains in `.huskies/work/1_backlog/`. This shadow is a leftover from the 491/492 migration: story state moved to the database as the source of truth, but the lifecycle move logic in `lifecycle.rs` is still operating on the filesystem and doesn't fully clean up after successful pipeline completions.
When a rate-limit retry timer subsequently fires for that story (story 496's auto-retry schedules a timer whenever an agent is hard-blocked by a rate limit, and the same missing-cancellation gap behind bug 501 means those timers aren't cancelled on successful completion either), the timer fire path calls `move_story_to_current()`, which uses the **filesystem-only** `move_item` helper. That helper finds the stale `1_backlog/` shadow and "moves" it to `2_current/`, even though the story is correctly in `5_done` in the database.
Net effect: a fully-merged, archived-to-done story suddenly reappears in `current` with a fresh coder spawned on it. The matrix bot sends `Done → Current` notifications. The agent burns tokens working on a story whose work has already shipped to master. The user sees the story flapping and assumes the merge didn't actually happen.
**Observed live on 2026-04-09 against story 503:**
```
18:31:32 [lifecycle] Moved '503_…' from work/4_merge/ to work/5_done/
18:31:32 [bot] Sending stage notification: 🎉 #503 … — Merge → Done
18:32:21 [timer] Timer fired for story 503_…
18:32:21 [lifecycle] Moved '503_…' from work/1_backlog/ to work/2_current/ ← stale shadow!
18:32:21 [auto-assign] Assigning 'coder-1' to '503_…' in 2_current/
```
The merge to master persisted (commit `41515e3b` is on master). Only the *pipeline state* got corrupted by the stale shadow being re-promoted.
This is **distinct from bug 501** (which is about manual `stop_agent` not cancelling timers) but compounds it: 501 covers user-initiated stops, whereas this bug covers successful pipeline completions. Both share a root cause (the rate-limit retry timer system has no notion of "this story has moved on, cancel any pending retries"), but the *consequences* of this bug are worse because the timer fires successfully and re-creates work that shouldn't exist.
This bug is also distinct from bug 502 (mergemaster stage-mismatch), which has been fixed.
The deeper architectural problem this exposes: **`lifecycle.rs::move_item` and `move_story_to_current` are still on the legacy filesystem path** while the rest of the pipeline (491/492) has moved to DB-as-source-of-truth. The filesystem shadows in `.huskies/work/N_stage/` are supposed to be a *materialized rendering* of the DB state, not a parallel source of truth — but `move_item` treats them as authoritative.
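One way to picture a DB-authoritative promotion is the minimal sketch below. It is not the actual `lifecycle.rs` code: `Stage`, `StageDb`, `Promotion`, and `promote_to_current` are hypothetical names, and an in-memory map stands in for the real database. The idea is only that the promotion decision reads the DB stage and never consults the filesystem shadow.

```rust
use std::collections::HashMap;

/// Pipeline stages named in this report (any other stages are omitted here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Backlog,
    Current,
    Merge,
    Done,
}

/// Hypothetical stand-in for the database: story id -> authoritative stage.
pub type StageDb = HashMap<String, Stage>;

/// Outcome of a promotion attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum Promotion {
    /// The story was in backlog and is now in current.
    Promoted,
    /// The DB says the story is in another stage, so nothing was moved.
    Skipped(Stage),
    /// The DB has no record of the story.
    Unknown,
}

/// Promote a story to current only if the DB (not the filesystem) says it
/// is still in backlog.
pub fn promote_to_current(db: &mut StageDb, story_id: &str) -> Promotion {
    match db.get(story_id).copied() {
        Some(Stage::Backlog) => {
            db.insert(story_id.to_string(), Stage::Current);
            Promotion::Promoted
        }
        Some(other) => Promotion::Skipped(other),
        None => Promotion::Unknown,
    }
}
```

Under this assumption, a timer firing for a story the DB records as `Done` would yield `Promotion::Skipped(Stage::Done)` and leave the state untouched, regardless of what is left in `1_backlog/`.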
## How to Reproduce
1. Take any story through the full pipeline successfully — coder runs, gates pass, mergemaster squashes to master, story moves to `5_done`.
2. Ensure that at least one coder run hits a hard rate limit while the story is in flight, so that a retry timer is scheduled. Nothing cancels that timer when the story completes successfully (the same gap as bug 501).
3. Verify post-completion state:
   - `SELECT stage FROM pipeline_items WHERE id = 'N_story_X';` returns `5_done`
   - `ls .huskies/work/1_backlog/N_story_X.md` shows the file STILL EXISTS (the stale shadow)
   - `cat .huskies/timers.json` shows a pending entry for `N_story_X` with a future `scheduled_at`
4. Wait for the timer to fire (default ~5 minutes after the last rate-limit hit).
## Actual Result
When the timer fires:
- The `[timer] Timer fired` log line appears for the already-done story
- `move_story_to_current` is called and finds the stale `1_backlog/N_story_X.md` shadow
- Lifecycle log: `[lifecycle] Moved 'N_…' from work/1_backlog/ to work/2_current/`
- Auto-assign sees the story in `2_current/` and spawns a coder
- Matrix bot sends `Done → Current` (and then later `Current → Current` etc.) stage notifications, spamming the room
- The new coder works on a story whose work is already shipped on master, burning tokens
- The story is now visible in BOTH `5_done` (via DB) AND `2_current` (via filesystem shadow), depending on which view the consumer reads
- The actual master commit is unaffected — the merge that already landed is still there. Only the *pipeline state* is corrupted.
## Expected Result
Successful pipeline completions must fully clean up the story's filesystem shadows. After `move_story_to_done` runs, `.huskies/work/1_backlog/N_story_X.md` (and any other stage shadow) for that story must not exist.
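The cleanup described above could look roughly like the following sketch. The function name `remove_stage_shadows` and the `keep_stage` parameter are assumptions made for illustration, as is the layout of one `<story_id>.md` file per stage directory under the work directory; the real lifecycle code may differ.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Remove `<story_id>.md` from every stage directory under `work_dir`,
/// except the directory named by `keep_stage` (if any).
/// Returns the paths that were actually deleted, sorted.
pub fn remove_stage_shadows(
    work_dir: &Path,
    story_id: &str,
    keep_stage: Option<&str>,
) -> io::Result<Vec<PathBuf>> {
    let file_name = format!("{}.md", story_id);
    let mut removed = Vec::new();

    for entry in fs::read_dir(work_dir)? {
        let stage_dir = entry?.path();
        if !stage_dir.is_dir() {
            continue;
        }
        // Leave the stage the story legitimately lives in untouched.
        if let Some(keep) = keep_stage {
            if stage_dir.file_name().and_then(|n| n.to_str()) == Some(keep) {
                continue;
            }
        }
        let shadow = stage_dir.join(&file_name);
        match fs::remove_file(&shadow) {
            Ok(()) => removed.push(shadow),
            // No shadow in this stage: nothing to clean up.
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }

    removed.sort();
    Ok(removed)
}
```

Passing `None` for `keep_stage` removes the story's file from every stage directory; passing `Some("5_done")` would preserve a rendering in the done directory, if one is kept there.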
Additionally — and this is the more general fix — the rate-limit retry timer system must cancel any pending timers for a story when that story successfully completes the pipeline. This is a sibling fix to bug 501 (which is about cancelling on manual stop): both manual stop and successful completion should mean "no more retries".
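A minimal sketch of that cancellation step follows. The `RetryTimer` struct and `cancel_timers_for` function are hypothetical; only the `scheduled_at` field and the per-story entry are taken from the `timers.json` description in this report, and the `u64` timestamp type is an assumption. Persisting the result back to `timers.json` is left out.

```rust
/// Hypothetical in-memory form of one pending retry timer entry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RetryTimer {
    pub story_id: String,
    /// When the retry should fire (representation assumed for this sketch).
    pub scheduled_at: u64,
}

/// Drop every pending timer for `story_id` and return how many were removed.
/// Intended to run when a story completes (or is manually stopped), before
/// the timer list is written back to disk.
pub fn cancel_timers_for(timers: &mut Vec<RetryTimer>, story_id: &str) -> usize {
    let before = timers.len();
    timers.retain(|t| t.story_id != story_id);
    before - timers.len()
}
```

Calling the same helper from both the manual-stop path and the completion path would give the two cases identical "no more retries" behaviour.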
The deepest fix is to migrate `lifecycle.rs::move_item` off the filesystem path and onto the DB path so the shadow files can be torn down entirely (or made strictly read-only renderings). That's a larger change that probably wants its own story, not a bug fix.
## Acceptance Criteria
- [ ] After a story moves to `5_done` via the normal pipeline path (mergemaster success), the filesystem shadow at `.huskies/work/1_backlog/N_story_X.md` is removed (and any other stage shadows are also removed)
- [ ] When a story moves to `5_done`, any pending rate-limit retry timer for that story is cancelled (the entry is removed from `timers.json` before the file is persisted)
- [ ] Regression test: simulate the full repro sequence (run a story through the pipeline with a mid-flight rate limit, complete the merge, fast-forward to the timer fire) and assert that (a) the story stays in `5_done`, (b) no agent is spawned, (c) no Done→Current notification fires
- [ ] No regression in bug 501's fix for manual-stop timer cancellation
- [ ] Filesystem shadow cleanup is symmetric: it also runs on `delete_story`, move_story to backlog, etc., not just the done path
- [ ] The matrix bot does not spam Done→Current notifications for stories whose work has actually completed