The merge_jobs cleanup encoded the server's pid in the CRDT and checked
`kill(pid, 0)` to decide whether a "running" entry was stale. Two problems:
1. The cleanup runs *inside* the server, so checking whether the
server's own pid is alive is tautological — kill(self_pid, 0)
always succeeds.
2. `rebuild_and_restart` does an `execve()` re-exec, which keeps the
same pid. After re-exec, merge_jobs from the previous server
instance still encode "the current pid" — so the cleanup never
fires, and stories like 799/800 sit forever with status="running"
while no actual merge runs.
Switch to a per-process server-start-time captured lazily in a
`OnceLock<f64>` (reset by execve, so the new instance sees a fresh
boot-time). A merge_job's recorded start-time < current boot-time means
it came from a previous instance: stale, delete it.
Legacy pid-encoded entries decode to None and are also treated as stale.
MergeJob.pid → MergeJob.server_start_time. Tests updated.
The 13-file refactor pass (commits db00a5d4 through eca15b4e) introduced
~89 clippy errors and 38 cargo fmt issues — every agent in every worktree
hit them on script/test, burning their turn budget on cleanup before doing
real story work. This is the silent kill behind 644, 652, 655, 664, 667
all hitting watchdog limits this round.
Changes:
- cargo fmt --all across 37 files (formatting normalisation only)
- #![allow(unused_imports, dead_code)] on 24 split modules where the
python-script splitter imported liberally to be safe; tighter cleanup
per-import will happen as agents touch each module
- Removed truly-dead re-exports (cleanup_merge_workspace, slog_warn from
http/mcp/mod.rs, CliArgs/print_help from main.rs)
- Prefixed _auth_msg in crdt_sync/server.rs (handshake helper return is
bound but not consumed)
- Converted dangling /// doc block in crdt_sync/mod.rs to //! so it
attaches to the module
- Removed empty lines after doc comments in 4 spots (clippy lint)
All 2636 tests pass; clippy --all-targets -- -D warnings clean.
The 1353-line advance.rs is split into:
- mod.rs: impl AgentPool with run_pipeline_advance + start_mergemaster_or_block + tests (1244 lines)
- helpers.rs: spawn_pipeline_advance, resolve_qa_mode_from_store, write_review_hold_to_store, should_block_story (128 lines)
Tests stay co-located with run_pipeline_advance which they exercise.
No behaviour change. All 10 advance tests pass; full suite green.
cargo fmt without --all fails with "Failed to find targets" in
workspace repos. This was blocking every story's gates. Also ran
cargo fmt --all to fix all existing formatting issues.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The completion handler now pgrep+kills any cargo processes targeting
the worktree's Cargo.toml before running gates. This prevents the
run_tests MCP child from holding the build lock and blocking gates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename all references from storkit to huskies across the codebase:
- .storkit/ directory → .huskies/
- Binary name, Cargo package name, Docker image references
- Server code, frontend code, config files, scripts
- Fix script/test to build frontend before cargo clippy/test
so merge worktrees have frontend/dist available for RustEmbed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After the cherry-pick step in run_squash_merge, verify:
1. project_root is on the base branch (not a merge-queue branch)
2. HEAD commit has actual code changes (not an empty/story-only diff)
If either check fails, return success=false so the story stays in merge
stage for retry instead of being phantom-advanced to done.
Also rename move_story_to_archived → move_story_to_done.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>