huskies

Author	SHA1	Message	Date
Timmy	26527e7dae	diag(1101): log classify verdict + matched trigger on merge gate failures Bug 1101's reframed AC1: when a non-success merge runs, log the typed GateFailureKind, the matched classifier-trigger substring (if any) and ~90 chars of surrounding context. Fires on every gate failure regardless of routing, so the next fixup-loop bounce will tell us which substring is fooling classify() into Fmt|Lint|SourceMapCheck on what's actually a Test failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 23:13:38 +01:00
dave	04a57e92c2	huskies: merge 1103 bug Rate-limit warning at session start sticks the `rate_limit_exit` flag, causing 1053's fast-path bypass to skip completion on clean session exits	2026-05-15 21:02:37 +00:00
dave	4216ced493	huskies: merge 1100 bug Multiple LLM agents can run concurrently on the same story (coder + mergemaster + others) — enforce one-agent-per-story invariant	2026-05-15 20:24:31 +00:00
dave	01e60a670c	huskies: merge 1091 refactor Migrate the merge-gate's stale-cargo kill path to `process_kill`	2026-05-15 11:50:03 +00:00
dave	c4010854a5	huskies: merge 1089 bug Stuck-agent detector blocks stories on legitimate exploration / debugging — uses too narrow a "progress" signal	2026-05-15 11:40:44 +00:00
dave	4aa76ce673	huskies: merge 1090 refactor Migrate `AgentPool::kill_all_children` and `kill_child_for_key` to `process_kill` so server shutdown and `stop_agent` actually kill claude	2026-05-15 11:16:16 +00:00
Timmy	b7df5cbe4e	fix(agents): kill-then-status reorder in stop_agent stop_agent had the same order-of-operations bug fixed in the watchdog: status flipped to Failed before the claude process was verified gone, opening the idempotency window that allowed a duplicate spawn to race in alongside the surviving process. Now follows the three-step protocol: 1. Read worktree path under a read-only lock (no mutation). 2. SIGKILL the worktree's process tree via process_kill and block until verified gone — start_agent's Running/Pending whitelist continues to reject duplicate spawns throughout. 3. Only then mutate the agent record, abort the task handle, and drop the child_killers entry. Falls back to the old portable_pty SIGHUP path (with a warning) when no worktree was recorded, matching the watchdog's behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 10:46:02 +01:00
Timmy	fe9804b32c	feat: add process_kill module + use it to fix watchdog double-spawn Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used across the server in place of the various ad-hoc kill paths that ignored their kill-effective return values. The module exposes three pieces: - `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s) until every pid is verified gone. Returns survivors if not. - `pids_matching(pattern)`: pgrep -f wrapper. - `descendant_pids(root)`: recursive pgrep -P walker for process trees. Wires the watchdog's limit-termination path through it, and reorders the protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15): Before: check_agent_limits set status=Failed before the kill ran. The kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP on Unix — claude-code ignores SIGHUP, so the process kept running while the agent record was already marked terminated. The idempotency check in `start_agent` whitelists Running/Pending, so the next auto-assign pass spawned a fresh agent alongside the still-alive prior one. Two claude PIDs sharing one session_id, racing on the same worktree. After: status update is moved OUT of check_agent_limits and into the caller AFTER the kill is verified. The kill itself is now SIGKILL-the- process-tree-in-the-worktree, with explicit verification that every pid is gone. The idempotency window is closed. The existing watchdog test suite (14 tests) still passes; 7 new tests cover the process_kill primitives directly. `agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key` still use the old portable_pty SIGHUP path — they have the same bug but in lower-impact code paths (shutdown, operator stop). They will be migrated under a separate story to keep this commit focused. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 10:36:33 +01:00
dave	e82602db77	huskies: merge 1086 story Pipeline+Status split — Step C: migrate auto-assign, subscribers, and lifecycle transitions to read Pipeline + Status	2026-05-15 08:26:39 +00:00
Timmy	d89940e85b	fix: drop source-map.json from agent orientation bundle The orientation bundle was 96 KB per coder spawn with 85 KB of that being source-map.json — a static symbol listing that drowned out the workflow rules in AGENT.md and likely explains why PLAN.md ceremony is being skipped (the instruction is ~5% of the bundle, buried under a wall of symbols). Agents are excellent at grep on demand, so the source map adds little value as a preloaded cheat sheet. File stays on disk for the merge-time source-map-check doc-coverage gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 07:48:18 +01:00
dave	bb6a6063e8	huskies: merge 1066	2026-05-14 23:45:53 +00:00
dave	374aa77f27	huskies: merge 1069	2026-05-14 23:29:32 +00:00
dave	23c3301903	huskies: merge 1065	2026-05-14 21:48:09 +00:00
dave	54d9737428	huskies: merge 1060	2026-05-14 19:31:04 +00:00
dave	96e227d8d4	huskies: merge 1053	2026-05-14 18:40:37 +00:00
dave	977b954e98	huskies: merge 1051	2026-05-14 18:04:30 +00:00
dave	8f99fede34	huskies: merge 1050	2026-05-14 17:32:14 +00:00
Timmy	822fcdaf2b	chore: cargo fmt after Rust 1.93 toolchain bump	2026-05-14 16:33:35 +01:00
dave	ee20e54d40	huskies: merge 1036	2026-05-14 15:13:25 +00:00
dave	cfccc2e73c	huskies: merge 1044	2026-05-14 14:54:13 +00:00
Timmy	8e996e2bd3	fix(1025): gate auto-block counter on mergemaster presence 1018's merge_failure_block_subscriber counted every MergeFailure transition toward the 3-strike block threshold, but mergemaster's recovery iterations (squash → fail → fix → retry) emit multiple MergeFailure transitions while making real progress. Story 997 was blocked at 10:59:46 while mergemaster was still resolving conflicts and would have succeeded a minute later. Fix: pass the AgentPool to the subscriber. When a mergemaster agent is in the pool for the story, MergeFailure transitions are recovery iterations in progress and do NOT increment the consecutive-failure counter. Block only fires for the genuinely-stuck case (no recovery agent attached and N consecutive failures accumulate). Tests: - mergemaster_running_suppresses_block: 3 failures with recovery_running=true → counter stays empty, story stays in MergeFailure - no_mergemaster_still_blocks_at_threshold: 3 failures with recovery_running=false → blocks (1018 behaviour preserved) All 2938 tests pass.	2026-05-14 12:13:37 +01:00
dave	c7a7cb4281	huskies: merge 997	2026-05-14 11:06:27 +00:00
Timmy	0572af2193	feat: outer cap on commit-recovery respawns catches flapping agents The progress-aware no-progress cap (3 consecutive byte-identical diffs) doesn't catch the degenerate pattern where the agent keeps making DIFFERENT file edits each session but never commits — every respawn resets the no-progress counter, infinite loop, budget burns. Adds ContentKey::CommitRecoveryTotalAttempts: an absolute counter that increments on every commit-recovery respawn regardless of progress. TOTAL_ATTEMPTS_CAP = 8; when hit, block with reason 'agent flapped — N respawns without ever committing'. Two caps now bound the recovery loop: - NO_PROGRESS_CAP (3): catches stuck-agent (same diff repeatedly) - TOTAL_ATTEMPTS_CAP (8): catches flapping-agent (different diffs, no commits) Easy to tune the constant lower if we see runaway in practice. All 2936 tests pass.	2026-05-14 11:34:17 +01:00
Timmy	bab337b289	feat: progress-aware commit-recovery cap (no longer block on 2nd attempt) The existing commit-recovery path blocked stories on the 2nd consecutive exit-without-commit. For long sweep refactors (e.g. story 997, the typed retries payload migration), claude-code's session-length boundary naturally terminates the coder mid-sweep before it can commit — even though substantial file-edit progress is being made each session. The old cap-of-1 misclassified normal mid-flight progress as 'agent declined to commit'. New behaviour: - Each commit-recovery respawn captures a worktree-diff byte-length fingerprint (git diff master | wc -c). - If the fingerprint differs from the previous attempt the agent made file-edit progress, the no-progress counter resets to 1. - If the fingerprint is byte-identical (no new edits between exits), increment the no-progress counter. - Block only when the counter reaches NO_PROGRESS_CAP (3) — i.e. three consecutive respawns where the agent did literally nothing. Adds ContentKey::CommitRecoveryDiffFingerprint to store the prior fingerprint. Updates the existing block-test to reflect the new cap semantics; existing 'first respawn issued' test continues to pass. All 2935 tests pass.	2026-05-14 11:24:02 +01:00
Timmy	5e5c5a0e08	revert: remove temporary merge-reap diagnostic logging Reverts the diagnostic introduced in `91b4e4ff`. Will re-add when we actively debug the disappearance bug again.	2026-05-14 10:57:37 +01:00
Timmy	91b4e4ff7c	diag: log merge-reap values to debug disappearance bug Temporary diagnostic added to reap_stale_merge_jobs to surface the t, current_boot, and decoded values being compared on every reap pass. Will revert once the disappearance bug is understood.	2026-05-14 10:42:16 +01:00
dave	309542cf2c	huskies: merge 1018	2026-05-14 09:38:15 +00:00
Timmy	8b2ba1c810	fix: post-squash compile errors reclassify as semantic merge conflicts When deterministic-merge produces a clean git squash but the post-squash compile fails (typical when master gained a Stage payload field after the feature branch forked — e.g. story 1018 hit `error[E0063]: missing field plan` after 1010's PlanState landed), the failure is morally a merge conflict that git's diff3 missed: the conflicting literal lives in a different file from the type definition that changed on master. Routing it as GatesFailed left mergemaster idle and the story stuck. Changes: - gates.rs GateFailureKind::classify: detect rustc compile errors (`error[E\d+]`) as Build instead of falling through to Test. Clippy errors (`error[clippy::...]`) still classify as Lint. - agents/merge/mod.rs: new MergeResult::to_merge_failure_kind() method. GateFailure with failure_kind=Build maps to ConflictDetected (so the existing 998 subscriber auto-spawns mergemaster). Other gate failures stay GatesFailed. - agents/pool/pipeline/merge/runner.rs: replace the inline match with a call to the new method. Tests: 6 new unit tests covering the classifier branch and every to_merge_failure_kind arm. All 2932 tests pass.	2026-05-14 10:18:33 +01:00
dave	ebf58ef224	huskies: merge 1008	2026-05-14 08:46:16 +00:00
dave	13ab97a615	huskies: merge 1010	2026-05-14 08:12:56 +00:00
dave	52180bc402	huskies: merge 1017	2026-05-13 23:55:35 +00:00
dave	29e800da21	huskies: merge 1016	2026-05-13 23:51:07 +00:00
dave	5ed1438ab9	huskies: merge 1015	2026-05-13 23:39:17 +00:00
dave	4e007bb770	huskies: merge 1009	2026-05-13 22:55:05 +00:00
dave	a5cd3a2152	huskies: merge 994	2026-05-13 22:38:51 +00:00
dave	cd9021fedf	huskies: merge 1006	2026-05-13 21:41:39 +00:00
Timmy	2758f744f2	fix: reap_stale_merge_jobs re-dispatches instead of just deleting A mid-merge server restart used to silently kill the merge: the in-flight tokio task died with the process, reap_stale_merge_jobs ran on the new boot, saw the Running entry from the previous boot, and simply deleted it. Mergemaster polling `get_merge_status` then saw "Merge job disappeared", treated it as a strike, and after three restarts escalated the story to MergeFailureFinal — even though no real merge failure ever happened (this is what trapped story 998 during the bug 1001 iteration cycle). Reap now also fires a `WatcherEvent::WorkItem reassign` for the cleared story so the auto-assign watcher loop re-runs start_merge_agent_work on the fresh boot. The story is still in 4_merge/; the merge resumes automatically. The change is contained to the reap path — start_merge_agent_work's own behaviour is unchanged. Added regression test reap_stale_merge_jobs_emits_reassign_watcher_event that asserts the new event fires. Existing reap_stale_merge_jobs_removes_old_running_entry_without_merge still passes (the "without_merge" guarantee is about agent spawning, not about absence of watcher events). Also exposes AgentPool::watcher_tx() as pub(crate) so the merge runner can fan out re-dispatch events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 21:28:10 +01:00
dave	bbdee1239b	huskies: merge 998	2026-05-13 19:33:33 +00:00
Timmy	75dc1fc15a	feat: MergeFailureFinal → Coding via operator FixupRequested MergeFailureFinal was unreachable from move_story: the only transitions out were Freeze (→ Frozen) and a self-loop on MergemasterAttempted, so once mergemaster exhausted its 3-retry budget the only way to get a story coding again was to delete + recreate it. The respawn budget is a mergemaster bookkeeping detail, not a hard ceiling. A human operator inspecting a Final story can reasonably decide the gate failure is fixable, so this adds the same FixupRequested → Coding edge that already exists for plain MergeFailure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 20:21:48 +01:00
dave	caed894db9	huskies: merge 988	2026-05-13 17:28:52 +00:00
dave	a078d3df7c	huskies: merge 985	2026-05-13 16:52:19 +00:00
dave	580480094e	huskies: merge 984	2026-05-13 16:47:51 +00:00
dave	c3c9db3d8b	huskies: merge 987	2026-05-13 16:30:31 +00:00
dave	430079ecbc	huskies: merge 986	2026-05-13 16:01:51 +00:00
dave	91fbad568a	huskies: merge 982	2026-05-13 15:34:41 +00:00
dave	dcb43c465a	huskies: merge 964	2026-05-13 14:56:08 +00:00
dave	14a39b6205	huskies: merge 980	2026-05-13 14:44:17 +00:00
dave	e5d2465f66	huskies: merge 974	2026-05-13 14:26:42 +00:00
dave	7854fbd78a	huskies: merge 979	2026-05-13 14:14:00 +00:00
dave	4b18c01835	huskies: merge 973	2026-05-13 14:08:05 +00:00
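
Commits `bab337b289` and `0572af2193` above describe two caps on the commit-recovery loop. The following is a minimal sketch of how those two counters could interact, not the repository's actual code. The constant names and values (`NO_PROGRESS_CAP = 3`, `TOTAL_ATTEMPTS_CAP = 8`) come from the commit messages; the `RecoveryState` struct, the `Decision` enum, the `on_exit_without_commit` function, and the order in which the two caps are checked are assumptions made for illustration.

```rust
/// Caps quoted in the commit messages.
const NO_PROGRESS_CAP: u32 = 3;
const TOTAL_ATTEMPTS_CAP: u32 = 8;

/// Hypothetical per-story state. The commits say the real values are stored
/// under ContentKey entries; this struct is an illustrative stand-in.
#[derive(Debug, Default)]
struct RecoveryState {
    /// Byte length of the worktree diff seen on the previous respawn.
    prev_diff_len: Option<u64>,
    /// Consecutive respawns whose diff fingerprint did not change.
    no_progress: u32,
    /// Every commit-recovery respawn, regardless of progress.
    total_attempts: u32,
}

/// Hypothetical outcome type for one recovery pass.
#[derive(Debug, PartialEq)]
enum Decision {
    Respawn,
    /// Same diff repeatedly (stuck agent).
    BlockStuck,
    /// Different diffs but never a commit (flapping agent).
    BlockFlapping,
}

/// Called when an agent session exits without committing.
/// `diff_len` is the byte-length fingerprint of the worktree diff.
fn on_exit_without_commit(state: &mut RecoveryState, diff_len: u64) -> Decision {
    state.total_attempts += 1;

    if state.prev_diff_len == Some(diff_len) {
        // Identical fingerprint: no new edits since the last exit.
        state.no_progress += 1;
    } else {
        // Fingerprint changed (or first attempt): counter resets to 1.
        state.no_progress = 1;
    }
    state.prev_diff_len = Some(diff_len);

    // Assumption: the no-progress cap is checked before the total cap.
    if state.no_progress >= NO_PROGRESS_CAP {
        Decision::BlockStuck
    } else if state.total_attempts >= TOTAL_ATTEMPTS_CAP {
        Decision::BlockFlapping
    } else {
        Decision::Respawn
    }
}
```

Under these assumptions, three exits in a row with the same diff length end in `BlockStuck`, while eight exits with eight different diff lengths end in `BlockFlapping` on the eighth. A byte-length fingerprint cannot distinguish two different diffs that happen to have the same length, so a sketch like this would treat such an edit as no progress.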

250 Commits
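
Commit `8b2ba1c810` above describes a classifier rule: rustc compile errors matching `error[E\d+]` classify as Build, clippy errors of the form `error[clippy::...]` classify as Lint, and anything else falls through to Test. The sketch below illustrates that rule with the standard library only. It is not the code in `gates.rs`: the reduced three-variant enum, the helper `is_rustc_error_code`, and the choice to let a clippy marker take precedence over a rustc error code when both appear are assumptions, and the other variants mentioned in the log (Fmt, SourceMapCheck) are omitted.

```rust
/// Reduced, illustrative version of the failure kinds named in the log.
#[derive(Debug, PartialEq, Clone, Copy)]
enum GateFailureKind {
    Build,
    Lint,
    Test,
}

/// Returns true when `rest` (the text right after "error[") starts with
/// a rustc error code such as "E0063]".
fn is_rustc_error_code(rest: &str) -> bool {
    let Some(body) = rest.strip_prefix('E') else {
        return false;
    };
    // Count leading ASCII digits; they are one byte each, so the count
    // is also a valid byte offset into `body`.
    let digits = body.chars().take_while(|c| c.is_ascii_digit()).count();
    digits > 0 && body[digits..].starts_with(']')
}

/// Classify gate output by scanning every "error[" marker.
fn classify(output: &str) -> GateFailureKind {
    const MARKER: &str = "error[";
    let mut saw_build_error = false;

    for (idx, _) in output.match_indices(MARKER) {
        let rest = &output[idx + MARKER.len()..];
        if rest.starts_with("clippy::") {
            // Assumption: a clippy marker wins over any rustc error code.
            return GateFailureKind::Lint;
        }
        if is_rustc_error_code(rest) {
            saw_build_error = true;
        }
    }

    if saw_build_error {
        GateFailureKind::Build
    } else {
        // No recognised marker: fall through to Test, as the commit describes.
        GateFailureKind::Test
    }
}
```

With this sketch, output containing `error[E0063]: missing field plan` would classify as Build, and output with no `error[` marker at all would classify as Test. Scanning for literal substrings like this is also where a misclassification of the kind investigated in commit `26527e7dae` could arise, since test output may legitimately contain text that resembles a trigger.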