huskies

Author	SHA1	Message	Date
dave	cfccc2e73c	huskies: merge 1044	2026-05-14 14:54:13 +00:00
Timmy	8e996e2bd3	fix(1025): gate auto-block counter on mergemaster presence 1018's merge_failure_block_subscriber counted every MergeFailure transition toward the 3-strike block threshold, but mergemaster's recovery iterations (squash → fail → fix → retry) emit multiple MergeFailure transitions while making real progress. Story 997 was blocked at 10:59:46 while mergemaster was still resolving conflicts and would have succeeded a minute later. Fix: pass the AgentPool to the subscriber. When a mergemaster agent is in the pool for the story, MergeFailure transitions are recovery iterations in progress and do NOT increment the consecutive-failure counter. Block only fires for the genuinely-stuck case (no recovery agent attached and N consecutive failures accumulate). Tests: - mergemaster_running_suppresses_block: 3 failures with recovery_running=true → counter stays empty, story stays in MergeFailure - no_mergemaster_still_blocks_at_threshold: 3 failures with recovery_running=false → blocks (1018 behaviour preserved) All 2938 tests pass.	2026-05-14 12:13:37 +01:00
dave	c7a7cb4281	huskies: merge 997	2026-05-14 11:06:27 +00:00
Timmy	0572af2193	feat: outer cap on commit-recovery respawns catches flapping agents The progress-aware no-progress cap (3 consecutive byte-identical diffs) doesn't catch the degenerate pattern where the agent keeps making DIFFERENT file edits each session but never commits — every respawn resets the no-progress counter, infinite loop, budget burns. Adds ContentKey::CommitRecoveryTotalAttempts: an absolute counter that increments on every commit-recovery respawn regardless of progress. TOTAL_ATTEMPTS_CAP = 8; when hit, block with reason 'agent flapped — N respawns without ever committing'. Two caps now bound the recovery loop: - NO_PROGRESS_CAP (3): catches stuck-agent (same diff repeatedly) - TOTAL_ATTEMPTS_CAP (8): catches flapping-agent (different diffs, no commits) Easy to tune the constant lower if we see runaway in practice. All 2936 tests pass.	2026-05-14 11:34:17 +01:00
Timmy	bab337b289	feat: progress-aware commit-recovery cap (no longer block on 2nd attempt) The existing commit-recovery path blocked stories on the 2nd consecutive exit-without-commit. For long sweep refactors (e.g. story 997, the typed retries payload migration), claude-code's session-length boundary naturally terminates the coder mid-sweep before it can commit — even though substantial file-edit progress is being made each session. The old cap-of-1 misclassified normal mid-flight progress as 'agent declined to commit'. New behaviour: - Each commit-recovery respawn captures a worktree-diff byte-length fingerprint (git diff master | wc -c). - If the fingerprint differs from the previous attempt the agent made file-edit progress, the no-progress counter resets to 1. - If the fingerprint is byte-identical (no new edits between exits), increment the no-progress counter. - Block only when the counter reaches NO_PROGRESS_CAP (3) — i.e. three consecutive respawns where the agent did literally nothing. Adds ContentKey::CommitRecoveryDiffFingerprint to store the prior fingerprint. Updates the existing block-test to reflect the new cap semantics; existing 'first respawn issued' test continues to pass. All 2935 tests pass.	2026-05-14 11:24:02 +01:00
Timmy	5e5c5a0e08	revert: remove temporary merge-reap diagnostic logging Reverts the diagnostic introduced in `91b4e4ff`. Will re-add when we actively debug the disappearance bug again.	2026-05-14 10:57:37 +01:00
Timmy	91b4e4ff7c	diag: log merge-reap values to debug disappearance bug Temporary diagnostic added to reap_stale_merge_jobs to surface the t, current_boot, and decoded values being compared on every reap pass. Will revert once the disappearance bug is understood.	2026-05-14 10:42:16 +01:00
dave	309542cf2c	huskies: merge 1018	2026-05-14 09:38:15 +00:00
Timmy	8b2ba1c810	fix: post-squash compile errors reclassify as semantic merge conflicts When deterministic-merge produces a clean git squash but the post-squash compile fails (typical when master gained a Stage payload field after the feature branch forked — e.g. story 1018 hit `error[E0063]: missing field plan` after 1010's PlanState landed), the failure is morally a merge conflict that git's diff3 missed: the conflicting literal lives in a different file from the type definition that changed on master. Routing it as GatesFailed left mergemaster idle and the story stuck. Changes: - gates.rs GateFailureKind::classify: detect rustc compile errors (`error[E\d+]`) as Build instead of falling through to Test. Clippy errors (`error[clippy::...]`) still classify as Lint. - agents/merge/mod.rs: new MergeResult::to_merge_failure_kind() method. GateFailure with failure_kind=Build maps to ConflictDetected (so the existing 998 subscriber auto-spawns mergemaster). Other gate failures stay GatesFailed. - agents/pool/pipeline/merge/runner.rs: replace the inline match with a call to the new method. Tests: 6 new unit tests covering the classifier branch and every to_merge_failure_kind arm. All 2932 tests pass.	2026-05-14 10:18:33 +01:00
dave	ebf58ef224	huskies: merge 1008	2026-05-14 08:46:16 +00:00
dave	13ab97a615	huskies: merge 1010	2026-05-14 08:12:56 +00:00
dave	52180bc402	huskies: merge 1017	2026-05-13 23:55:35 +00:00
dave	29e800da21	huskies: merge 1016	2026-05-13 23:51:07 +00:00
dave	5ed1438ab9	huskies: merge 1015	2026-05-13 23:39:17 +00:00
dave	4e007bb770	huskies: merge 1009	2026-05-13 22:55:05 +00:00
dave	a5cd3a2152	huskies: merge 994	2026-05-13 22:38:51 +00:00
dave	cd9021fedf	huskies: merge 1006	2026-05-13 21:41:39 +00:00
Timmy	2758f744f2	fix: reap_stale_merge_jobs re-dispatches instead of just deleting A mid-merge server restart used to silently kill the merge: the in-flight tokio task died with the process, reap_stale_merge_jobs ran on the new boot, saw the Running entry from the previous boot, and simply deleted it. Mergemaster polling `get_merge_status` then saw "Merge job disappeared", treated it as a strike, and after three restarts escalated the story to MergeFailureFinal — even though no real merge failure ever happened (this is what trapped story 998 during the bug 1001 iteration cycle). Reap now also fires a `WatcherEvent::WorkItem reassign` for the cleared story so the auto-assign watcher loop re-runs start_merge_agent_work on the fresh boot. The story is still in 4_merge/; the merge resumes automatically. The change is contained to the reap path — start_merge_agent_work's own behaviour is unchanged. Added regression test reap_stale_merge_jobs_emits_reassign_watcher_event that asserts the new event fires. Existing reap_stale_merge_jobs_removes_old_running_entry_without_merge still passes (the "without_merge" guarantee is about agent spawning, not about absence of watcher events). Also exposes AgentPool::watcher_tx() as pub(crate) so the merge runner can fan out re-dispatch events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 21:28:10 +01:00
dave	bbdee1239b	huskies: merge 998	2026-05-13 19:33:33 +00:00
dave	caed894db9	huskies: merge 988	2026-05-13 17:28:52 +00:00
dave	a078d3df7c	huskies: merge 985	2026-05-13 16:52:19 +00:00
dave	580480094e	huskies: merge 984	2026-05-13 16:47:51 +00:00
dave	c3c9db3d8b	huskies: merge 987	2026-05-13 16:30:31 +00:00
dave	430079ecbc	huskies: merge 986	2026-05-13 16:01:51 +00:00
dave	91fbad568a	huskies: merge 982	2026-05-13 15:34:41 +00:00
dave	dcb43c465a	huskies: merge 964	2026-05-13 14:56:08 +00:00
dave	14a39b6205	huskies: merge 980	2026-05-13 14:44:17 +00:00
dave	7854fbd78a	huskies: merge 979	2026-05-13 14:14:00 +00:00
dave	4b18c01835	huskies: merge 973	2026-05-13 14:08:05 +00:00
dave	e9a7468d8a	huskies: merge 981	2026-05-13 14:01:02 +00:00
dave	77dc09668c	huskies: merge 960	2026-05-13 13:24:15 +00:00
dave	93f774fcbb	huskies: merge 967	2026-05-13 12:39:47 +00:00
dave	604fb55bd8	huskies: merge 959	2026-05-13 12:28:30 +00:00
dave	184c214c34	huskies: merge 962	2026-05-13 12:05:01 +00:00
dave	28338a8e8d	huskies: merge 958	2026-05-13 11:52:51 +00:00
dave	8b53e20ca9	huskies: merge 961	2026-05-13 11:27:21 +00:00
dave	765d54fc4b	huskies: merge 954	2026-05-13 09:35:51 +00:00
dave	c228ae1640	fix: has_content_conflict_failure reads wrong CRDT key — auto-spawn mergemaster never fires The function was calling `read_content(story_id)`, which returns the story's description text (e.g. "Bug: Coder exits code 0 with uncommitted work — force a commit-only respawn..."). It then scanned that for "Merge conflict" / "CONFLICT (content):", which obviously never matched, so the auto-spawn-mergemaster-on-content-conflict guard in `pool/auto_assign/merge.rs` always saw `false` and skipped. The actual gate output (where the merge runner stores the failure message including conflict markers) lives at `format!("{story_id}:gate_output")` — that's the key `pipeline/advance/mod.rs:207` writes to. Read from there instead. Witnessed: 954's merge hit a real `CONFLICT (content)` in tests_regression.rs at 08:57:40, no mergemaster spawned, story stayed in MergeFailure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 09:03:25 +00:00
dave	6a015d6202	huskies: merge 953	2026-05-13 08:57:35 +00:00
dave	7491eec257	fmt: collapse warm-resume unwrap_or_else closure per rustfmt The 5-line spread of `.unwrap_or_else(|| { ... })` in spawn.rs (from the `bd517f28` + `65416476` warm-resume work) doesn't match rustfmt's preference for the short form. Was blocking every merge gate since the warm-resume fix landed.	2026-05-13 08:41:57 +00:00
dave	65416476e3	warm-resume: drop "read PLAN.md" from the resume nudge Follow-up to `bd517f28`. When --resume succeeds, claude-code restores the full prior conversation — the agent already has its file reads, tool results, and reasoning in context. Telling it to "read PLAN.md" forces a redundant tool call to re-read a doc it wrote itself. PLAN.md is the cold-start orientation doc (driven by AGENT.md); the resume -p prompt should just be a continuation nudge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 08:28:01 +00:00
dave	bd517f2857	fix(warm-resume): send non-empty -p prompt with --resume so watchdog respawns can actually warm claude-code's --resume <session_id> requires either: a) a deferred-tool marker in the resumed session (i.e. the prior session paused mid-tool-call), or b) a non-empty -p prompt to continue the conversation with. Watchdog-killed sessions have neither: the kill is asynchronous and leaves no deferred-tool marker, and our harness was passing an empty -p (because `resume_context_owned` is None for the common respawn case). claude-code then aborts with: "Error: No deferred tool marker found in the resumed session. Either the session was not deferred, the marker is stale (tool already ran), or it exceeds the tail-scan window. Provide a prompt to continue the conversation." The harness sees an aborted CLI with no session, prunes the recorded session_id, and respawns cold — paying the full prompt-cache miss for EVERY respawn. The new session_store logging (commit `0b50a624`) made this 100% legible: every warm spawn we observed went `mode=warm` → crash → prune → `mode=cold` within a couple of seconds. Fix: when resuming with no failure-context to send, default the -p prompt to a brief "continue from PLAN.md" line. claude-code now has a valid continuation message and warm-resume should actually work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 08:27:02 +00:00
dave	6e76b6a063	huskies: merge 930	2026-05-13 08:06:37 +00:00
dave	a7840ea4b0	huskies: merge 946	2026-05-13 08:00:49 +00:00
dave	9ce5a8df0c	huskies: merge 945	2026-05-13 06:09:34 +00:00
dave	3a8894ea8f	obs: log warm/cold spawn mode at agent respawn decision point Without this, the only way to tell whether a watchdog-respawn went warm (--resume <session_id>) vs cold (fresh CLI invocation) was to read the args list of the existing "Spawning claude with args:" log and check whether --resume was present. That made it impossible to count cold-paths or distinguish "supposed-to-be-warm but resume_failed fallback" from "first session" without source-diving. This adds one slog! per spawn, prefixed `[agent:{sid}:{name}] spawn mode=warm|cold session_id=...`, so grep "spawn mode=" answers it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 05:44:46 +00:00
dave	2f50e2198b	huskies: merge 951	2026-05-13 04:34:06 +00:00
Timmy	c5abc44a63	test: serialise merge-pipeline tests against each other The 12 tests in `agents::pool::pipeline::merge::tests` share a process-wide `server_start_time` (a `OnceLock` captured the first time the merge subsystem runs) and the global merge-job CRDT log. Default cargo parallelism has caught at least one interleaving on the merge gate's Docker scheduler where `stale_running_merge_job_is_cleared_and_retry_succeeds` flakes — `delete_merge_job` from one test lands while another is mid- assertion. Couldn't reproduce locally despite many tries. Each test now acquires a poison-tolerant `std::sync::Mutex` at entry, so the 12 tests run serially relative to each other while the rest of the suite (2862 tests) stays parallel. Module-level `#![allow(clippy::await_holding_lock)]` covers the deliberate sync guard across `.await`s. Targeted isolation — not a global `--test-threads=1`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 01:50:44 +01:00
Timmy	d78dd9e8f9	feat(934): typed Stage enum replaces directory-string state model The state machine's `Stage` enum becomes the source of truth for pipeline state. Six stages of work land together: 1. Clean wire vocabulary (`coding`, `merge`, `merge_failure`, ...) replaces legacy directory-style strings (`2_current`, `4_merge`, ...) on the wire. `Stage::from_dir` accepted both during deployment; new writes always emit the clean form via `stage_dir_name`. Lexicographic `dir >= "5_done"` checks in lifecycle.rs become typed `matches!` checks since the new vocabulary doesn't sort in pipeline order. 2. `crdt_state::write_item` takes typed `&Stage`, serialising via `stage_dir_name` at the CRDT boundary. `#[cfg(test)] write_item_str` parses legacy strings for test fixtures. 3. `WorkItem::stage()` returns typed `crdt_state::Stage`; `stage_str()` is gone from the public API. Projection dispatches on the typed enum. 4. `frozen` becomes an orthogonal CRDT register. `Stage::Frozen` and `PipelineEvent::Freeze`/`Unfreeze` are removed; `transition_to_frozen`/ `unfrozen` set the flag directly without touching the stage register. 5. Watcher sweep and `tool_update_story`'s `blocked` setter route through `apply_transition` so the typed transition table validates every stage change. `update_story` gains a `frozen` field for symmetry. 6. One-shot startup migration rewrites pre-934 directory-style stage registers (and sets `frozen=true` on items previously at `7_frozen`). `Stage::from_dir` drops legacy aliases. The db boundary keeps a small normaliser so callers with legacy strings (MCP, tests) still work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 22:31:59 +01:00
Timmy	69d91d7707	feat(929): delete db/yaml_legacy.rs entirely — CRDT is the sole source of truth Final 929 sweep: every YAML-shaped helper is gone. No production code parses or writes YAML front matter anywhere. Surface removed: - db/yaml_legacy.rs (FrontMatter/StoryMetadata structs, parse_front_matter, set_front_matter_field, yaml_residue marker) — file deleted. - ItemMeta::from_yaml — deleted; callers pass typed ItemMeta::named(...) or ItemMeta::default() and use typed CRDT setters (set_depends_on, set_blocked, set_retry_count, set_agent, set_qa_mode, set_review_hold, set_item_type, set_epic, set_mergemaster_attempted) for the rest. - write_coverage_baseline_to_story_file + read_coverage_percent_from_json — the coverage_baseline YAML field was write-only (nothing read it back); removed along with its caller in agent_tools/lifecycle.rs. - update_story_in_file's generic `front_matter` HashMap parameter — tool_update_story now intercepts every known field name and routes it to a typed CRDT setter; unknown keys are rejected with an explicit error pointing at the typed setters. The function only takes user_story / description sections now. - All 117 ItemMeta::from_yaml callsites migrated. Where tests previously passed a YAML-shaped content blob and relied on the helper to extract name/depends_on/blocked/agent/qa, they now pass: write_item_with_content(id, stage, content, ItemMeta::named("Foo")) crate::crdt_state::set_depends_on(id, &[...]) // when needed crate::crdt_state::set_blocked(id, true) // when needed crate::crdt_state::set_agent(id, Some("...")) // when needed - write_story_content + write_story_file (test helper) now take an explicit `name: Option<&str>` instead of parsing it from content. - db::ops::move_item_stage stopped re-parsing YAML on every stage transition; metadata is read straight from the CRDT view when mirroring the row into SQLite. New CRDT setters added for symmetry: - crdt_state::set_name (mirrors set_agent — explicit name updates). cargo fmt --check, clippy --all-targets -- -D warnings, and the 2830-test suite all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:55:25 +01:00
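
The two commit-recovery caps described in commits bab337b289 and 0572af2193 interact in a way that is easier to follow in code. The sketch below is illustrative only and is not taken from the repository: the constant names and values come from the commit messages, while `RecoveryState`, `Decision`, and `on_exit_without_commit` are hypothetical names, and the fingerprint is assumed to be the byte length of `git diff master` passed in by the caller.

```rust
// Values quoted in the commit messages.
const NO_PROGRESS_CAP: u32 = 3;
const TOTAL_ATTEMPTS_CAP: u32 = 8;

/// Hypothetical per-story recovery bookkeeping (the real code stores these
/// under ContentKey entries; this struct is an assumption for illustration).
#[derive(Debug, Default)]
struct RecoveryState {
    last_fingerprint: Option<u64>, // byte length of the previous worktree diff
    no_progress: u32,              // consecutive exits with an unchanged diff
    total_attempts: u32,           // every respawn, regardless of progress
}

/// Hypothetical outcome of one exit-without-commit.
#[derive(Debug, PartialEq)]
enum Decision {
    Respawn,
    BlockStuck,    // same diff NO_PROGRESS_CAP times in a row
    BlockFlapping, // TOTAL_ATTEMPTS_CAP respawns without ever committing
}

/// Called each time the coder exits without committing.
fn on_exit_without_commit(state: &mut RecoveryState, fingerprint: u64) -> Decision {
    // The outer counter never resets.
    state.total_attempts += 1;

    if state.last_fingerprint == Some(fingerprint) {
        // Identical diff size: the agent made no new edits.
        state.no_progress += 1;
    } else {
        // Diff changed: treat as progress and restart the inner counter at 1.
        state.no_progress = 1;
        state.last_fingerprint = Some(fingerprint);
    }

    if state.no_progress >= NO_PROGRESS_CAP {
        Decision::BlockStuck
    } else if state.total_attempts >= TOTAL_ATTEMPTS_CAP {
        Decision::BlockFlapping
    } else {
        Decision::Respawn
    }
}
```

Under these assumptions, three exits with the same fingerprint block the story as stuck, while eight exits with a different fingerprint each time block it as flapping on the eighth. Checking the inner cap first is an arbitrary choice in this sketch; the commit messages do not say which reason wins when both caps are hit at once.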

164 Commits