Commit Graph

167 Commits

Author SHA1 Message Date
dave 977b954e98 huskies: merge 1051 2026-05-14 18:04:30 +00:00
Timmy 822fcdaf2b chore: cargo fmt after Rust 1.93 toolchain bump 2026-05-14 16:33:35 +01:00
dave ee20e54d40 huskies: merge 1036 2026-05-14 15:13:25 +00:00
dave cfccc2e73c huskies: merge 1044 2026-05-14 14:54:13 +00:00
Timmy 8e996e2bd3 fix(1025): gate auto-block counter on mergemaster presence
1018's merge_failure_block_subscriber counted every MergeFailure transition
toward the 3-strike block threshold, but mergemaster's recovery iterations
(squash → fail → fix → retry) emit multiple MergeFailure transitions while
making real progress. Story 997 was blocked at 10:59:46 while mergemaster
was still resolving conflicts and would have succeeded a minute later.

Fix: pass the AgentPool to the subscriber. When a mergemaster agent is in
the pool for the story, MergeFailure transitions are recovery iterations
in progress and do NOT increment the consecutive-failure counter. Block
only fires for the genuinely-stuck case (no recovery agent attached and N
consecutive failures accumulate).

Tests:
- mergemaster_running_suppresses_block: 3 failures with recovery_running=true
  → counter stays empty, story stays in MergeFailure
- no_mergemaster_still_blocks_at_threshold: 3 failures with recovery_running=false
  → blocks (1018 behaviour preserved)

All 2938 tests pass.
2026-05-14 12:13:37 +01:00
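The gating rule described in 8e996e2bd3 can be sketched roughly as follows. This is a minimal illustration, not the project's code: `FailureCounter`, `Outcome`, and `on_merge_failure` are hypothetical names, and the `recovery_running` flag stands in for the AgentPool lookup the commit mentions. Clearing the counter once the block fires is also an assumption.

```rust
use std::collections::HashMap;

/// Strike threshold; the commit message describes a 3-strike block.
const BLOCK_THRESHOLD: u32 = 3;

/// Decision for one MergeFailure transition (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// A recovery agent is attached, so the transition is not counted.
    Suppressed,
    /// Counted toward the threshold; carries the running total.
    Counted(u32),
    /// Threshold reached with no recovery agent attached.
    Blocked,
}

/// Consecutive-failure counters keyed by story id.
#[derive(Default)]
struct FailureCounter {
    consecutive: HashMap<String, u32>,
}

impl FailureCounter {
    /// `recovery_running` stands in for "a mergemaster agent is in the pool
    /// for this story".
    fn on_merge_failure(&mut self, story_id: &str, recovery_running: bool) -> Outcome {
        if recovery_running {
            // Recovery iteration in progress: leave the counter untouched.
            return Outcome::Suppressed;
        }
        let count = {
            let n = self.consecutive.entry(story_id.to_string()).or_insert(0);
            *n += 1;
            *n
        };
        if count >= BLOCK_THRESHOLD {
            // Assumption: the counter is cleared once the block fires.
            self.consecutive.remove(story_id);
            Outcome::Blocked
        } else {
            Outcome::Counted(count)
        }
    }
}
```

With this shape, three failures reported while `recovery_running` is true leave the map empty, which matches the first test named in the commit message.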
dave c7a7cb4281 huskies: merge 997 2026-05-14 11:06:27 +00:00
Timmy 0572af2193 feat: outer cap on commit-recovery respawns catches flapping agents
The progress-aware no-progress cap (3 consecutive byte-identical diffs)
doesn't catch the degenerate pattern where the agent keeps making
DIFFERENT file edits each session but never commits: every respawn
resets the no-progress counter, so the loop never ends and the budget
burns.

Adds ContentKey::CommitRecoveryTotalAttempts: an absolute counter that
increments on every commit-recovery respawn regardless of progress.
TOTAL_ATTEMPTS_CAP = 8; when hit, block with reason 'agent flapped — N
respawns without ever committing'.

Two caps now bound the recovery loop:
- NO_PROGRESS_CAP (3): catches stuck-agent (same diff repeatedly)
- TOTAL_ATTEMPTS_CAP (8): catches flapping-agent (different diffs, no commits)

Easy to tune the constant lower if we see runaway in practice.
All 2936 tests pass.
2026-05-14 11:34:17 +01:00
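How the two caps in 0572af2193 might interact is sketched below. Only the constant names and values come from the commit message; `RecoveryCaps`, `Decision`, and the `made_progress` flag (true when the worktree diff changed since the previous respawn) are hypothetical. The order in which the caps are checked is also an assumption.

```rust
const NO_PROGRESS_CAP: u32 = 3;
const TOTAL_ATTEMPTS_CAP: u32 = 8;

/// Outcome of one commit-recovery respawn (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Respawn,
    /// Same diff repeatedly: the stuck-agent case.
    BlockStuck,
    /// Different diffs but never a commit: the flapping-agent case.
    BlockFlapping { respawns: u32 },
}

#[derive(Debug, Default)]
struct RecoveryCaps {
    no_progress: u32,
    total_attempts: u32,
}

impl RecoveryCaps {
    fn on_respawn(&mut self, made_progress: bool) -> Decision {
        // The absolute counter increments regardless of progress.
        self.total_attempts += 1;
        // The no-progress counter resets to 1 whenever the diff changed.
        self.no_progress = if made_progress { 1 } else { self.no_progress + 1 };

        if self.no_progress >= NO_PROGRESS_CAP {
            Decision::BlockStuck
        } else if self.total_attempts >= TOTAL_ATTEMPTS_CAP {
            Decision::BlockFlapping { respawns: self.total_attempts }
        } else {
            Decision::Respawn
        }
    }
}
```

An agent that edits different files on every session never trips `NO_PROGRESS_CAP`, but in this sketch its eighth respawn returns `BlockFlapping`.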
Timmy bab337b289 feat: progress-aware commit-recovery cap (no longer block on 2nd attempt)
The existing commit-recovery path blocked stories on the 2nd consecutive
exit-without-commit. For long sweep refactors (e.g. story 997, the typed
retries payload migration), claude-code's session-length boundary
naturally terminates the coder mid-sweep before it can commit — even
though substantial file-edit progress is being made each session. The
old cap-of-1 misclassified normal mid-flight progress as 'agent declined
to commit'.

New behaviour:
- Each commit-recovery respawn captures a worktree-diff byte-length
  fingerprint (git diff master | wc -c).
- If the fingerprint differs from the previous attempt, the agent made
  file-edit progress and the no-progress counter resets to 1.
- If the fingerprint is byte-identical (no new edits between exits),
  increment the no-progress counter.
- Block only when the counter reaches NO_PROGRESS_CAP (3) — i.e. three
  consecutive respawns where the agent did literally nothing.

Adds ContentKey::CommitRecoveryDiffFingerprint to store the prior
fingerprint. Updates the existing block-test to reflect the new cap
semantics; existing 'first respawn issued' test continues to pass.

All 2935 tests pass.
2026-05-14 11:24:02 +01:00
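The fingerprint and counter rules from bab337b289 could look something like the sketch below. The function names are hypothetical; only `NO_PROGRESS_CAP` and the `git diff master | wc -c` recipe come from the commit message. Treating a missing prior fingerprint as progress is an assumption.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

const NO_PROGRESS_CAP: u32 = 3;

/// Byte length of `git diff master` run inside `worktree`,
/// mirroring `git diff master | wc -c`.
fn worktree_diff_fingerprint(worktree: &Path) -> io::Result<u64> {
    let out = Command::new("git")
        .args(["diff", "master"])
        .current_dir(worktree)
        .output()?;
    if !out.status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "git diff failed"));
    }
    Ok(out.stdout.len() as u64)
}

/// Returns the updated no-progress counter for one respawn.
/// `prior` is the fingerprint stored on the previous attempt, if any.
fn next_no_progress(prior: Option<u64>, current: u64, counter: u32) -> u32 {
    if prior == Some(current) {
        counter + 1 // identical fingerprint: no new edits between exits
    } else {
        1 // fingerprint changed (or first attempt): treat as progress
    }
}

/// Block only once the counter reaches the cap.
fn should_block(counter: u32) -> bool {
    counter >= NO_PROGRESS_CAP
}
```

One property worth keeping in mind with a length-only fingerprint: two different diffs of exactly the same byte length compare equal, so such an edit would be counted as no progress.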
Timmy 5e5c5a0e08 revert: remove temporary merge-reap diagnostic logging
Reverts the diagnostic introduced in 91b4e4ff. Will re-add when we
actively debug the disappearance bug again.
2026-05-14 10:57:37 +01:00
Timmy 91b4e4ff7c diag: log merge-reap values to debug disappearance bug
Temporary diagnostic added to reap_stale_merge_jobs to surface the t,
current_boot, and decoded values being compared on every reap pass.
Will revert once the disappearance bug is understood.
2026-05-14 10:42:16 +01:00
dave 309542cf2c huskies: merge 1018 2026-05-14 09:38:15 +00:00
Timmy 8b2ba1c810 fix: post-squash compile errors reclassify as semantic merge conflicts
When deterministic-merge produces a clean git squash but the post-squash
compile fails (typical when master gained a Stage payload field after the
feature branch forked — e.g. story 1018 hit `error[E0063]: missing field
plan` after 1010's PlanState landed), the failure is morally a merge
conflict that git's diff3 missed: the conflicting literal lives in a
different file from the type definition that changed on master. Routing
it as GatesFailed left mergemaster idle and the story stuck.

Changes:
- gates.rs GateFailureKind::classify: detect rustc compile errors
  (`error[E\d+]`) as Build instead of falling through to Test. Clippy
  errors (`error[clippy::...]`) still classify as Lint.
- agents/merge/mod.rs: new MergeResult::to_merge_failure_kind() method.
  GateFailure with failure_kind=Build maps to ConflictDetected (so the
  existing 998 subscriber auto-spawns mergemaster). Other gate failures
  stay GatesFailed.
- agents/pool/pipeline/merge/runner.rs: replace the inline match with a
  call to the new method.

Tests: 6 new unit tests covering the classifier branch and every
to_merge_failure_kind arm. All 2932 tests pass.
2026-05-14 10:18:33 +01:00
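A rough sketch of the classification and mapping described in 8b2ba1c810, using only the standard library. The enum variant names follow the commit message, but the bodies are guesses: the real classifier may use a regex and check in a different order, and `merge_failure_kind_for` is a hypothetical free function standing in for `MergeResult::to_merge_failure_kind()`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GateFailureKind {
    Build,
    Lint,
    Test,
}

#[derive(Debug, PartialEq, Eq)]
enum MergeFailureKind {
    ConflictDetected,
    GatesFailed,
}

/// True if `output` contains a rustc error code such as `error[E0063]`,
/// i.e. the pattern `error[E\d+]`.
fn has_rustc_error_code(output: &str) -> bool {
    output.match_indices("error[E").any(|(idx, m)| {
        let rest = &output[idx + m.len()..];
        // ASCII digits are one byte each, so the count is a valid byte offset.
        let digits = rest.chars().take_while(|c| c.is_ascii_digit()).count();
        digits > 0 && rest[digits..].starts_with(']')
    })
}

impl GateFailureKind {
    /// Assumed order: rustc errors first, then clippy, else Test.
    fn classify(output: &str) -> GateFailureKind {
        if has_rustc_error_code(output) {
            GateFailureKind::Build
        } else if output.contains("error[clippy::") {
            GateFailureKind::Lint
        } else {
            GateFailureKind::Test
        }
    }
}

/// Build failures are routed as conflicts so the conflict subscriber can
/// spawn mergemaster; every other gate failure stays GatesFailed.
fn merge_failure_kind_for(kind: GateFailureKind) -> MergeFailureKind {
    match kind {
        GateFailureKind::Build => MergeFailureKind::ConflictDetected,
        GateFailureKind::Lint | GateFailureKind::Test => MergeFailureKind::GatesFailed,
    }
}
```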
dave ebf58ef224 huskies: merge 1008 2026-05-14 08:46:16 +00:00
dave 13ab97a615 huskies: merge 1010 2026-05-14 08:12:56 +00:00
dave 52180bc402 huskies: merge 1017 2026-05-13 23:55:35 +00:00
dave 29e800da21 huskies: merge 1016 2026-05-13 23:51:07 +00:00
dave 5ed1438ab9 huskies: merge 1015 2026-05-13 23:39:17 +00:00
dave 4e007bb770 huskies: merge 1009 2026-05-13 22:55:05 +00:00
dave a5cd3a2152 huskies: merge 994 2026-05-13 22:38:51 +00:00
dave cd9021fedf huskies: merge 1006 2026-05-13 21:41:39 +00:00
Timmy 2758f744f2 fix: reap_stale_merge_jobs re-dispatches instead of just deleting
A mid-merge server restart used to silently kill the merge: the
in-flight tokio task died with the process, reap_stale_merge_jobs ran
on the new boot, saw the Running entry from the previous boot, and
simply deleted it. Mergemaster polling `get_merge_status` then saw
"Merge job disappeared", treated it as a strike, and after three
restarts escalated the story to MergeFailureFinal — even though no
real merge failure ever happened (this is what trapped story 998
during the bug 1001 iteration cycle).

Reap now also fires a `WatcherEvent::WorkItem reassign` for the
cleared story so the auto-assign watcher loop re-runs
start_merge_agent_work on the fresh boot. The story is still in
4_merge/; the merge resumes automatically. The change is contained to
the reap path — start_merge_agent_work's own behaviour is unchanged.

Added regression test
reap_stale_merge_jobs_emits_reassign_watcher_event that asserts the
new event fires. Existing
reap_stale_merge_jobs_removes_old_running_entry_without_merge still
passes (the "without_merge" guarantee is about agent spawning, not
about absence of watcher events).

Also exposes AgentPool::watcher_tx() as pub(crate) so the merge
runner can fan out re-dispatch events.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 21:28:10 +01:00
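The reap-then-re-dispatch behaviour in 2758f744f2 might be sketched as below. Everything here is a simplified stand-in: `MergeJob`, its `boot_id` field, the `WatcherEvent::Reassign` shape, and the use of a `std::sync::mpsc` channel in place of the project's watcher channel are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

/// Hypothetical stand-in for the watcher's reassign event.
#[derive(Debug, PartialEq, Eq)]
enum WatcherEvent {
    Reassign { story_id: String },
}

/// Hypothetical merge-job record: which boot started it, and whether it
/// was still marked Running.
struct MergeJob {
    boot_id: u64,
    running: bool,
}

/// Removes Running entries left over from a previous boot and emits one
/// reassign event per removed story so the merge can be restarted.
/// Returns the number of reaped jobs.
fn reap_stale_merge_jobs(
    jobs: &mut HashMap<String, MergeJob>,
    current_boot: u64,
    tx: &Sender<WatcherEvent>,
) -> usize {
    let mut stale: Vec<String> = jobs
        .iter()
        .filter(|(_, job)| job.running && job.boot_id != current_boot)
        .map(|(id, _)| id.clone())
        .collect();
    stale.sort(); // deterministic event order

    for story_id in &stale {
        jobs.remove(story_id);
        // A closed channel is ignored here; real code would likely log it.
        let _ = tx.send(WatcherEvent::Reassign { story_id: story_id.clone() });
    }
    stale.len()
}
```

The point of the sketch is the second step in the loop: deleting the entry alone is what previously made the merge job appear to vanish.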
dave bbdee1239b huskies: merge 998 2026-05-13 19:33:33 +00:00
dave caed894db9 huskies: merge 988 2026-05-13 17:28:52 +00:00
dave a078d3df7c huskies: merge 985 2026-05-13 16:52:19 +00:00
dave 580480094e huskies: merge 984 2026-05-13 16:47:51 +00:00
dave c3c9db3d8b huskies: merge 987 2026-05-13 16:30:31 +00:00
dave 430079ecbc huskies: merge 986 2026-05-13 16:01:51 +00:00
dave 91fbad568a huskies: merge 982 2026-05-13 15:34:41 +00:00
dave dcb43c465a huskies: merge 964 2026-05-13 14:56:08 +00:00
dave 14a39b6205 huskies: merge 980 2026-05-13 14:44:17 +00:00
dave 7854fbd78a huskies: merge 979 2026-05-13 14:14:00 +00:00
dave 4b18c01835 huskies: merge 973 2026-05-13 14:08:05 +00:00
dave e9a7468d8a huskies: merge 981 2026-05-13 14:01:02 +00:00
dave 77dc09668c huskies: merge 960 2026-05-13 13:24:15 +00:00
dave 93f774fcbb huskies: merge 967 2026-05-13 12:39:47 +00:00
dave 604fb55bd8 huskies: merge 959 2026-05-13 12:28:30 +00:00
dave 184c214c34 huskies: merge 962 2026-05-13 12:05:01 +00:00
dave 28338a8e8d huskies: merge 958 2026-05-13 11:52:51 +00:00
dave 8b53e20ca9 huskies: merge 961 2026-05-13 11:27:21 +00:00
dave 765d54fc4b huskies: merge 954 2026-05-13 09:35:51 +00:00
dave c228ae1640 fix: has_content_conflict_failure reads wrong CRDT key — auto-spawn mergemaster never fires
The function was calling `read_content(story_id)`, which returns the
story's *description* text (e.g. "Bug: Coder exits code 0 with
uncommitted work — force a commit-only respawn..."). It then scanned
that for "Merge conflict" / "CONFLICT (content):", which obviously
never matched, so the auto-spawn-mergemaster-on-content-conflict guard
in `pool/auto_assign/merge.rs` always saw `false` and skipped.

The actual gate output (where the merge runner stores the failure
message including conflict markers) lives at
`format!("{story_id}:gate_output")` — that's the key
`pipeline/advance/mod.rs:207` writes to. Read from there instead.

Witnessed: 954's merge hit a real `CONFLICT (content)` in
tests_regression.rs at 08:57:40, no mergemaster spawned, story stayed
in MergeFailure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 09:03:25 +00:00
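The key mix-up fixed in c228ae1640 is easy to see in a small sketch. The `HashMap` below is a hypothetical stand-in for the CRDT content store, and `gate_output_key` is an illustrative helper; the key format and the two search strings are taken from the commit message.

```rust
use std::collections::HashMap;

/// Key under which the merge runner stores gate output for a story.
fn gate_output_key(story_id: &str) -> String {
    format!("{story_id}:gate_output")
}

/// Scans the gate output (not the story description) for conflict markers.
fn has_content_conflict_failure(store: &HashMap<String, String>, story_id: &str) -> bool {
    store
        .get(&gate_output_key(story_id))
        .map_or(false, |output| {
            output.contains("Merge conflict") || output.contains("CONFLICT (content):")
        })
}
```

Reading `store.get(story_id)` instead would return the description text, which never contains those markers, so the check would always come back false.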
dave 6a015d6202 huskies: merge 953 2026-05-13 08:57:35 +00:00
dave 7491eec257 fmt: collapse warm-resume unwrap_or_else closure per rustfmt
The 5-line spread of `.unwrap_or_else(|| { ... })` in spawn.rs (from
the bd517f28 + 65416476 warm-resume work) doesn't match rustfmt's
preference for the short form. Was blocking every merge gate since
the warm-resume fix landed.
2026-05-13 08:41:57 +00:00
dave 65416476e3 warm-resume: drop "read PLAN.md" from the resume nudge
Follow-up to bd517f28. When --resume succeeds, claude-code restores the
full prior conversation — the agent already has its file reads, tool
results, and reasoning in context. Telling it to "read PLAN.md" forces
a redundant tool call to re-read a doc it wrote itself. PLAN.md is the
cold-start orientation doc (driven by AGENT.md); the resume -p prompt
should just be a continuation nudge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 08:28:01 +00:00
dave bd517f2857 fix(warm-resume): send non-empty -p prompt with --resume so watchdog respawns can actually warm

claude-code's --resume <session_id> requires either:
  a) a deferred-tool marker in the resumed session (i.e. the prior
     session paused mid-tool-call), or
  b) a non-empty -p prompt to continue the conversation with.

Watchdog-killed sessions have neither: the kill is asynchronous and
leaves no deferred-tool marker, and our harness was passing an empty
-p (because `resume_context_owned` is None for the common respawn
case). claude-code then aborts with:

  "Error: No deferred tool marker found in the resumed session.
   Either the session was not deferred, the marker is stale (tool
   already ran), or it exceeds the tail-scan window. Provide a
   prompt to continue the conversation."

The harness sees an aborted CLI with no session, prunes the recorded
session_id, and respawns cold — paying the full prompt-cache miss for
EVERY respawn. The new session_store logging (commit 0b50a624) made
this 100% legible: every warm spawn we observed went `mode=warm` →
crash → prune → `mode=cold` within a couple of seconds.

Fix: when resuming with no failure-context to send, default the -p
prompt to a brief "continue from PLAN.md" line. claude-code now has a
valid continuation message and warm-resume should actually work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 08:27:02 +00:00
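The fallback described in bd517f2857 amounts to never passing an empty `-p` alongside `--resume`. A minimal sketch, assuming a helper that assembles the argument list: `resume_prompt`, `resume_args`, and the exact wording of `DEFAULT_RESUME_PROMPT` are hypothetical, and only the `--resume` and `-p` flags come from the commit message.

```rust
/// Hypothetical wording; the commit only says "a brief continue line".
const DEFAULT_RESUME_PROMPT: &str = "Continue from PLAN.md.";

/// Picks the prompt to send with --resume: the failure context if there is
/// one, otherwise a short default so the prompt is never empty.
fn resume_prompt<'a>(resume_context: Option<&'a str>) -> &'a str {
    match resume_context {
        Some(ctx) if !ctx.trim().is_empty() => ctx,
        _ => DEFAULT_RESUME_PROMPT,
    }
}

/// Builds the resume-related arguments for the CLI invocation.
fn resume_args(session_id: &str, resume_context: Option<&str>) -> Vec<String> {
    vec![
        "--resume".to_string(),
        session_id.to_string(),
        "-p".to_string(),
        resume_prompt(resume_context).to_string(),
    ]
}
```

Treating a whitespace-only context the same as `None` is a choice made for this sketch, on the reasoning that it would be just as unusable as an empty prompt.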
dave 6e76b6a063 huskies: merge 930 2026-05-13 08:06:37 +00:00
dave a7840ea4b0 huskies: merge 946 2026-05-13 08:00:49 +00:00
dave 9ce5a8df0c huskies: merge 945 2026-05-13 06:09:34 +00:00
dave 3a8894ea8f obs: log warm/cold spawn mode at agent respawn decision point
Without this, the only way to tell whether a watchdog-respawn went warm
(--resume <session_id>) vs cold (fresh CLI invocation) was to read the
args list of the existing "Spawning claude with args:" log and check
whether --resume was present. That made it impossible to count
cold-paths or distinguish "supposed-to-be-warm but resume_failed
fallback" from "first session" without source-diving.

This adds one slog! per spawn, prefixed `[agent:{sid}:{name}] spawn
mode=warm|cold session_id=...`, so grep "spawn mode=" answers it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 05:44:46 +00:00
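For reference, the log line introduced in 3a8894ea8f could be produced by something like the sketch below. The function name is hypothetical, and printing `none` as the session id on a cold spawn is an assumption; the prefix and the `mode=warm|cold` field follow the format quoted in the commit message.

```rust
/// Formats the spawn-mode log line. A present session id means the spawn
/// resumes an existing session (warm); otherwise it is a fresh one (cold).
fn spawn_mode_line(sid: &str, name: &str, session_id: Option<&str>) -> String {
    let mode = if session_id.is_some() { "warm" } else { "cold" };
    format!(
        "[agent:{sid}:{name}] spawn mode={mode} session_id={}",
        session_id.unwrap_or("none") // assumption: placeholder for cold spawns
    )
}
```

Because every line carries the literal `spawn mode=`, a single grep over the logs is enough to count warm and cold spawns.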
dave 2f50e2198b huskies: merge 951 2026-05-13 04:34:06 +00:00