The merge_jobs cleanup encoded the server's pid in the CRDT and checked
`kill(pid, 0)` to decide whether a "running" entry was stale. Two problems:
1. The cleanup runs *inside* the server, so checking whether the
server's own pid is alive is tautological — kill(self_pid, 0)
always succeeds.
2. `rebuild_and_restart` does an `execve()` re-exec, which keeps the
same pid. After re-exec, merge_jobs from the previous server
instance still encode "the current pid" — so the cleanup never
fires, and stories like 799/800 sit forever with status="running"
while no actual merge runs.
Switch to a per-process server-start-time captured lazily in a
`OnceLock<f64>` (reset by execve, so the new instance sees a fresh
boot-time). A merge_job's recorded start-time < current boot-time means
it came from a previous instance: stale, delete it.
Legacy pid-encoded entries decode to None and are also treated as stale.
MergeJob.pid → MergeJob.server_start_time. Tests updated.
claude CLI 2.1.97 strictly enforces that --include-partial-messages
requires --print/-p to be set. The resume path skipped -p when the
prompt was empty (which is the common case on respawns when there's
no fresh failure context to inject), so the spawned claude process
saw `--resume <sid> ... --include-partial-messages` without -p and
exited with code 1: "include-partial-messages requires --print and
--output-format=stream-json".
Net effect: every coder respawn with prior_sessions > 0 and empty
prompt was failing immediately, looking exactly like a rate-limit
(empty agent log, zero tool calls). 819 hit retry-limit (4/3) and
got marked blocked because of this — not because of any actual code
or rate-limit issue.
Fix: always pass `-p <prompt>` on resume, even with empty prompt.
770's HTTP→read-RPC migration replaced fetch-based agent calls with
rpcCall over WebSocket. setupTests.ts only mocks fetch though, so the
real jsdom WebSocket runs — and jsdom's WebSocket implementation
asynchronously fires its connection-failure error after ~9 seconds
(via internal Timeout._onTimeout). Tests that await component-mount
state (like App.test.tsx > calls getCurrentProject() on mount) hang
on that pending promise and time out at 10s.
This silently broke 1 test on master (App.test.tsx getCurrentProject
on mount) and cascaded into 16 failures in any worktree where the
test file count changed and timing shifted (804, 806).
Fix: replace the global WebSocket constructor with a stub that
immediately fires onerror + onclose on the next microtask. rpcCall
sees the error and rejects synchronously; components catch and continue
rendering with empty state. Tests pass without flake.
Verified locally: vitest run → 349/349 pass.
Tests that create temp repos call `git init` without specifying a branch.
Git then prints a 12-line "hint: Using 'master' as the name for the
initial branch..." block — every test, every run. The output drowns out
actual failures.
Set init.defaultBranch=master via GIT_CONFIG_COUNT/KEY/VALUE env vars in
script/test. This affects only subprocesses spawned by the test runner;
no change to the user's real git config.
Verified: cargo test --bin huskies emits 0 `hint:` lines after this
change, all 2732 tests still pass.
`tool_run_tests` in `server/src/http/mcp/shell_tools/script.rs` is fully
blocking server-side: it spawns the test child, polls every 1s server-side
until exit (or `TEST_TIMEOUT_SECS = 1200s`), and returns the full
{passed, exit_code, output} directly. There is NO async/started-status
return path.
But two places told agents the wrong story:
1. `tools_list/system_tools.rs` description claimed "Returns immediately
with status: started. Poll get_test_result..." — agents read tool
descriptions for protocol semantics, so they followed this and burned
turns polling get_test_result.
2. `agents.toml` had been correctly saying it blocks, but my last commit
(776aad38) "fixed" it the wrong way based on a misread of the code.
Now both say: run_tests blocks server-side, returns the full result, do
not poll get_test_result. get_test_result remains for external observers
(UI checking on a job another caller started).
Reverts the prompt change in 776aad38 with the correct text.
Pre-f958f57e, run_tests blocked until completion. After that fix it became
a background-job starter, with get_test_result polling. The agent prompts
were never updated, so they still said "run_tests blocks until complete"
— and agents then waste turns polling.
Updated coder-1/2/3, coder-opus, and qa prompts to describe the actual
flow: run_tests is async, get_test_result blocks for up to 20s per call,
test suites typically take 1-5 minutes so expect a few polls.
Companion bug filed for bumping TEST_POLL_BLOCK_SECS so one poll covers
most test runs (root-cause fix; this commit is the prompt half).
Walked server/src/, frontend/src/, and crates/, extracting each module's
//! doc-comment to build a directory-level source map. One row per
directory + one row per top-level file.
Replaces the hand-written stopgap from 5d6757dd with content auto-derived
from the codebase, so it stays useful as decomposes happen — the
descriptions come from mod.rs, not from my recollection of where things
live.
Still a stopgap until 819 (auto-generated source-map-gen) lands and gets
wired into the agent spawn path, but the content is closer to what 819
will produce.
818 stripped the source map because it had stale paths. Empirically that
made coder agents far slower — they spent most of each session re-discovering
the codebase via Read/Grep before reaching any Edit, and ran out of turn
budget without committing.
Restoring a fresh source map keyed off current master. Uses directories
where possible so it stays useful through future decomposes, plus a
"Canonical examples" section pointing at the patterns to copy when adding
new CRDT collections, RPC handlers, services, chat commands, etc.
This is a stopgap until 819 (auto-generated source-map-gen) lands.