Bug 1101's reframed AC1: when a non-success merge runs, log the typed
GateFailureKind, the matched classifier-trigger substring (if any) and
~90 chars of surrounding context. Fires on every gate failure regardless
of routing, so the next fixup-loop bounce will tell us which substring is
fooling classify() into Fmt|Lint|SourceMapCheck on what's actually a Test
failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stop_agent had the same order-of-operations bug fixed in the watchdog:
status flipped to Failed before the claude process was verified gone,
opening the idempotency window that allowed a duplicate spawn to race
in alongside the surviving process.
Now follows the three-step protocol:
1. Read worktree path under a read-only lock (no mutation).
2. SIGKILL the worktree's process tree via process_kill and block
until verified gone — start_agent's Running/Pending whitelist
continues to reject duplicate spawns throughout.
3. Only then mutate the agent record, abort the task handle, and
drop the child_killers entry.
Falls back to the old portable_pty SIGHUP path (with a warning) when
no worktree was recorded, matching the watchdog's behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:
- `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
until every pid is verified gone. Returns survivors if not.
- `pids_matching(pattern)`: pgrep -f wrapper.
- `descendant_pids(root)`: recursive pgrep -P walker for process trees.
Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):
Before: check_agent_limits set status=Failed before the kill ran. The
kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
on Unix — claude-code ignores SIGHUP, so the process kept running while
the agent record was already marked terminated. The idempotency check
in `start_agent` whitelists Running/Pending, so the next auto-assign
pass spawned a fresh agent alongside the still-alive prior one. Two
claude PIDs sharing one session_id, racing on the same worktree.
After: status update is moved OUT of check_agent_limits and into the
caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
process-tree-in-the-worktree, with explicit verification that every pid
is gone. The idempotency window is closed.
The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.
`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The orientation bundle was 96 KB per coder spawn with 85 KB of that being
source-map.json — a static symbol listing that drowned out the workflow rules
in AGENT.md and likely explains why PLAN.md ceremony is being skipped (the
instruction is ~5% of the bundle, buried under a wall of symbols). Agents are
excellent at grep on demand, so the source map adds little value as a preloaded
cheat sheet. File stays on disk for the merge-time source-map-check doc-coverage
gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1018's merge_failure_block_subscriber counted every MergeFailure transition
toward the 3-strike block threshold, but mergemaster's recovery iterations
(squash → fail → fix → retry) emit multiple MergeFailure transitions while
making real progress. Story 997 was blocked at 10:59:46 while mergemaster
was still resolving conflicts and would have succeeded a minute later.
Fix: pass the AgentPool to the subscriber. When a mergemaster agent is in
the pool for the story, MergeFailure transitions are recovery iterations
in progress and do NOT increment the consecutive-failure counter. Block
only fires for the genuinely-stuck case (no recovery agent attached and N
consecutive failures accumulate).
Tests:
- mergemaster_running_suppresses_block: 3 failures with recovery_running=true
→ counter stays empty, story stays in MergeFailure
- no_mergemaster_still_blocks_at_threshold: 3 failures with recovery_running=false
→ blocks (1018 behaviour preserved)
All 2938 tests pass.
The progress-aware no-progress cap (3 consecutive byte-identical diffs)
doesn't catch the degenerate pattern where the agent keeps making
DIFFERENT file edits each session but never commits — every respawn
resets the no-progress counter, infinite loop, budget burns.
Adds ContentKey::CommitRecoveryTotalAttempts: an absolute counter that
increments on every commit-recovery respawn regardless of progress.
TOTAL_ATTEMPTS_CAP = 8; when hit, block with reason 'agent flapped — N
respawns without ever committing'.
Two caps now bound the recovery loop:
- NO_PROGRESS_CAP (3): catches stuck-agent (same diff repeatedly)
- TOTAL_ATTEMPTS_CAP (8): catches flapping-agent (different diffs, no commits)
Easy to tune the constant lower if we see runaway in practice.
All 2936 tests pass.
The existing commit-recovery path blocked stories on the 2nd consecutive
exit-without-commit. For long sweep refactors (e.g. story 997, the typed
retries payload migration), claude-code's session-length boundary
naturally terminates the coder mid-sweep before it can commit — even
though substantial file-edit progress is being made each session. The
old cap-of-1 misclassified normal mid-flight progress as 'agent declined
to commit'.
New behaviour:
- Each commit-recovery respawn captures a worktree-diff byte-length
fingerprint (git diff master | wc -c).
- If the fingerprint differs from the previous attempt the agent made
file-edit progress, the no-progress counter resets to 1.
- If the fingerprint is byte-identical (no new edits between exits),
increment the no-progress counter.
- Block only when the counter reaches NO_PROGRESS_CAP (3) — i.e. three
consecutive respawns where the agent did literally nothing.
Adds ContentKey::CommitRecoveryDiffFingerprint to store the prior
fingerprint. Updates the existing block-test to reflect the new cap
semantics; existing 'first respawn issued' test continues to pass.
All 2935 tests pass.
Temporary diagnostic added to reap_stale_merge_jobs to surface the t,
current_boot, and decoded values being compared on every reap pass.
Will revert once the disappearance bug is understood.
When deterministic-merge produces a clean git squash but the post-squash
compile fails (typical when master gained a Stage payload field after the
feature branch forked — e.g. story 1018 hit `error[E0063]: missing field
plan` after 1010's PlanState landed), the failure is morally a merge
conflict that git's diff3 missed: the conflicting literal lives in a
different file from the type definition that changed on master. Routing
it as GatesFailed left mergemaster idle and the story stuck.
Changes:
- gates.rs GateFailureKind::classify: detect rustc compile errors
(`error[E\d+]`) as Build instead of falling through to Test. Clippy
errors (`error[clippy::...]`) still classify as Lint.
- agents/merge/mod.rs: new MergeResult::to_merge_failure_kind() method.
GateFailure with failure_kind=Build maps to ConflictDetected (so the
existing 998 subscriber auto-spawns mergemaster). Other gate failures
stay GatesFailed.
- agents/pool/pipeline/merge/runner.rs: replace the inline match with a
call to the new method.
Tests: 6 new unit tests covering the classifier branch and every
to_merge_failure_kind arm. All 2932 tests pass.
A mid-merge server restart used to silently kill the merge: the
in-flight tokio task died with the process, reap_stale_merge_jobs ran
on the new boot, saw the Running entry from the previous boot, and
simply deleted it. Mergemaster polling `get_merge_status` then saw
"Merge job disappeared", treated it as a strike, and after three
restarts escalated the story to MergeFailureFinal — even though no
real merge failure ever happened (this is what trapped story 998
during the bug 1001 iteration cycle).
Reap now also fires a `WatcherEvent::WorkItem reassign` for the
cleared story so the auto-assign watcher loop re-runs
start_merge_agent_work on the fresh boot. The story is still in
4_merge/; the merge resumes automatically. The change is contained to
the reap path — start_merge_agent_work's own behaviour is unchanged.
Added regression test
reap_stale_merge_jobs_emits_reassign_watcher_event that asserts the
new event fires. Existing
reap_stale_merge_jobs_removes_old_running_entry_without_merge still
passes (the "without_merge" guarantee is about agent spawning, not
about absence of watcher events).
Also exposes AgentPool::watcher_tx() as pub(crate) so the merge
runner can fan out re-dispatch events.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MergeFailureFinal was unreachable from move_story: the only transitions
out were Freeze (→ Frozen) and a self-loop on MergemasterAttempted, so
once mergemaster exhausted its 3-retry budget the only way to get a
story coding again was to delete + recreate it.
The respawn budget is a mergemaster bookkeeping detail, not a hard
ceiling. A human operator inspecting a Final story can reasonably
decide the gate failure is fixable, so this adds the same
FixupRequested → Coding edge that already exists for plain
MergeFailure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>