The merge_jobs cleanup encoded the server's pid in the CRDT and checked
`kill(pid, 0)` to decide whether a "running" entry was stale. Two problems:
1. The cleanup runs *inside* the server, so checking whether the
server's own pid is alive is tautological — kill(self_pid, 0)
always succeeds.
2. `rebuild_and_restart` does an `execve()` re-exec, which keeps the
same pid. After re-exec, merge_jobs from the previous server
instance still encode "the current pid" — so the cleanup never
fires, and stories like 799/800 sit forever with status="running"
while no actual merge runs.
Switch to a per-process server-start-time captured lazily in a
`OnceLock<f64>` (reset by execve, so the new instance sees a fresh
boot-time). A merge_job's recorded start-time < current boot-time means
it came from a previous instance: stale, delete it.
Legacy pid-encoded entries decode to None and are also treated as stale.
MergeJob.pid → MergeJob.server_start_time. Tests updated.
claude CLI 2.1.97 strictly enforces that --include-partial-messages
requires --print/-p to be set. The resume path skipped -p when the
prompt was empty (which is the common case on respawns when there's
no fresh failure context to inject), so the spawned claude process
saw `--resume <sid> ... --include-partial-messages` without -p and
exited with code 1: "include-partial-messages requires --print and
--output-format=stream-json".
Net effect: every coder respawn with prior_sessions > 0 and empty
prompt was failing immediately, looking exactly like a rate-limit
(empty agent log, zero tool calls). 819 hit retry-limit (4/3) and
got marked blocked because of this — not because of any actual code
or rate-limit issue.
Fix: always pass `-p <prompt>` on resume, even with empty prompt.
770's HTTP→read-RPC migration replaced fetch-based agent calls with
rpcCall over WebSocket. setupTests.ts only mocks fetch though, so the
real jsdom WebSocket runs — and jsdom's WebSocket implementation
asynchronously fires its connection-failure error after ~9 seconds
(via internal Timeout._onTimeout). Tests that await component-mount
state (like App.test.tsx > calls getCurrentProject() on mount) hang
on that pending promise and time out at 10s.
This silently broke 1 test on master (App.test.tsx getCurrentProject
on mount) and cascaded into 16 failures in any worktree where the
test file count changed and timing shifted (804, 806).
Fix: replace the global WebSocket constructor with a stub that
immediately fires onerror + onclose on the next microtask. rpcCall
sees the error and rejects synchronously; components catch and continue
rendering with empty state. Tests pass without flake.
Verified locally: vitest run → 349/349 pass.
Tests that create temp repos call `git init` without specifying a branch.
Git then prints a 12-line "hint: Using 'master' as the name for the
initial branch..." block — every test, every run. The output drowns out
actual failures.
Set init.defaultBranch=master via GIT_CONFIG_COUNT/KEY/VALUE env vars in
script/test. This affects only subprocesses spawned by the test runner;
no change to the user's real git config.
Verified: cargo test --bin huskies emits 0 `hint:` lines after this
change, all 2732 tests still pass.