Commit Graph

164 Commits

Author SHA1 Message Date
dave cfccc2e73c huskies: merge 1044 2026-05-14 14:54:13 +00:00
Timmy 8e996e2bd3 fix(1025): gate auto-block counter on mergemaster presence
1018's merge_failure_block_subscriber counted every MergeFailure transition
toward the 3-strike block threshold, but mergemaster's recovery iterations
(squash → fail → fix → retry) emit multiple MergeFailure transitions while
making real progress. Story 997 was blocked at 10:59:46 while mergemaster
was still resolving conflicts and would have succeeded a minute later.

Fix: pass the AgentPool to the subscriber. When a mergemaster agent is in
the pool for the story, MergeFailure transitions are recovery iterations
in progress and do NOT increment the consecutive-failure counter. Block
only fires for the genuinely-stuck case (no recovery agent attached and N
consecutive failures accumulate).

Tests:
- mergemaster_running_suppresses_block: 3 failures with recovery_running=true
  → counter stays empty, story stays in MergeFailure
- no_mergemaster_still_blocks_at_threshold: 3 failures with recovery_running=false
  → blocks (1018 behaviour preserved)

All 2938 tests pass.
2026-05-14 12:13:37 +01:00
dave c7a7cb4281 huskies: merge 997 2026-05-14 11:06:27 +00:00
Timmy 0572af2193 feat: outer cap on commit-recovery respawns catches flapping agents
The progress-aware no-progress cap (3 consecutive byte-identical diffs)
doesn't catch the degenerate pattern where the agent keeps making
DIFFERENT file edits each session but never commits — every respawn
resets the no-progress counter, infinite loop, budget burns.

Adds ContentKey::CommitRecoveryTotalAttempts: an absolute counter that
increments on every commit-recovery respawn regardless of progress.
TOTAL_ATTEMPTS_CAP = 8; when hit, block with reason 'agent flapped — N
respawns without ever committing'.

Two caps now bound the recovery loop:
- NO_PROGRESS_CAP (3): catches stuck-agent (same diff repeatedly)
- TOTAL_ATTEMPTS_CAP (8): catches flapping-agent (different diffs, no commits)

Easy to tune the constant lower if we see runaway in practice.
All 2936 tests pass.
2026-05-14 11:34:17 +01:00
Timmy bab337b289 feat: progress-aware commit-recovery cap (no longer block on 2nd attempt)
The existing commit-recovery path blocked stories on the 2nd consecutive
exit-without-commit. For long sweep refactors (e.g. story 997, the typed
retries payload migration), claude-code's session-length boundary
naturally terminates the coder mid-sweep before it can commit — even
though substantial file-edit progress is being made each session. The
old cap-of-1 misclassified normal mid-flight progress as 'agent declined
to commit'.

New behaviour:
- Each commit-recovery respawn captures a worktree-diff byte-length
  fingerprint (git diff master | wc -c).
- If the fingerprint differs from the previous attempt the agent made
  file-edit progress, the no-progress counter resets to 1.
- If the fingerprint is byte-identical (no new edits between exits),
  increment the no-progress counter.
- Block only when the counter reaches NO_PROGRESS_CAP (3) — i.e. three
  consecutive respawns where the agent did literally nothing.

Adds ContentKey::CommitRecoveryDiffFingerprint to store the prior
fingerprint. Updates the existing block-test to reflect the new cap
semantics; existing 'first respawn issued' test continues to pass.

All 2935 tests pass.
2026-05-14 11:24:02 +01:00
Timmy 5e5c5a0e08 revert: remove temporary merge-reap diagnostic logging
Reverts the diagnostic introduced in 91b4e4ff. Will re-add when we
actively debug the disappearance bug again.
2026-05-14 10:57:37 +01:00
Timmy 91b4e4ff7c diag: log merge-reap values to debug disappearance bug
Temporary diagnostic added to reap_stale_merge_jobs to surface the t,
current_boot, and decoded values being compared on every reap pass.
Will revert once the disappearance bug is understood.
2026-05-14 10:42:16 +01:00
dave 309542cf2c huskies: merge 1018 2026-05-14 09:38:15 +00:00
Timmy 8b2ba1c810 fix: post-squash compile errors reclassify as semantic merge conflicts
When deterministic-merge produces a clean git squash but the post-squash
compile fails (typical when master gained a Stage payload field after the
feature branch forked — e.g. story 1018 hit `error[E0063]: missing field
plan` after 1010's PlanState landed), the failure is morally a merge
conflict that git's diff3 missed: the conflicting literal lives in a
different file from the type definition that changed on master. Routing
it as GatesFailed left mergemaster idle and the story stuck.

Changes:
- gates.rs GateFailureKind::classify: detect rustc compile errors
  (`error[E\d+]`) as Build instead of falling through to Test. Clippy
  errors (`error[clippy::...]`) still classify as Lint.
- agents/merge/mod.rs: new MergeResult::to_merge_failure_kind() method.
  GateFailure with failure_kind=Build maps to ConflictDetected (so the
  existing 998 subscriber auto-spawns mergemaster). Other gate failures
  stay GatesFailed.
- agents/pool/pipeline/merge/runner.rs: replace the inline match with a
  call to the new method.

Tests: 6 new unit tests covering the classifier branch and every
to_merge_failure_kind arm. All 2932 tests pass.
2026-05-14 10:18:33 +01:00
dave ebf58ef224 huskies: merge 1008 2026-05-14 08:46:16 +00:00
dave 13ab97a615 huskies: merge 1010 2026-05-14 08:12:56 +00:00
dave 52180bc402 huskies: merge 1017 2026-05-13 23:55:35 +00:00
dave 29e800da21 huskies: merge 1016 2026-05-13 23:51:07 +00:00
dave 5ed1438ab9 huskies: merge 1015 2026-05-13 23:39:17 +00:00
dave 4e007bb770 huskies: merge 1009 2026-05-13 22:55:05 +00:00
dave a5cd3a2152 huskies: merge 994 2026-05-13 22:38:51 +00:00
dave cd9021fedf huskies: merge 1006 2026-05-13 21:41:39 +00:00
Timmy 2758f744f2 fix: reap_stale_merge_jobs re-dispatches instead of just deleting
A mid-merge server restart used to silently kill the merge: the
in-flight tokio task died with the process, reap_stale_merge_jobs ran
on the new boot, saw the Running entry from the previous boot, and
simply deleted it. Mergemaster polling `get_merge_status` then saw
"Merge job disappeared", treated it as a strike, and after three
restarts escalated the story to MergeFailureFinal — even though no
real merge failure ever happened (this is what trapped story 998
during the bug 1001 iteration cycle).

Reap now also fires a `WatcherEvent::WorkItem reassign` for the
cleared story so the auto-assign watcher loop re-runs
start_merge_agent_work on the fresh boot. The story is still in
4_merge/; the merge resumes automatically. The change is contained to
the reap path — start_merge_agent_work's own behaviour is unchanged.

Added regression test
reap_stale_merge_jobs_emits_reassign_watcher_event that asserts the
new event fires. Existing
reap_stale_merge_jobs_removes_old_running_entry_without_merge still
passes (the "without_merge" guarantee is about agent spawning, not
about absence of watcher events).

Also exposes AgentPool::watcher_tx() as pub(crate) so the merge
runner can fan out re-dispatch events.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 21:28:10 +01:00
dave bbdee1239b huskies: merge 998 2026-05-13 19:33:33 +00:00
dave caed894db9 huskies: merge 988 2026-05-13 17:28:52 +00:00
dave a078d3df7c huskies: merge 985 2026-05-13 16:52:19 +00:00
dave 580480094e huskies: merge 984 2026-05-13 16:47:51 +00:00
dave c3c9db3d8b huskies: merge 987 2026-05-13 16:30:31 +00:00
dave 430079ecbc huskies: merge 986 2026-05-13 16:01:51 +00:00
dave 91fbad568a huskies: merge 982 2026-05-13 15:34:41 +00:00
dave dcb43c465a huskies: merge 964 2026-05-13 14:56:08 +00:00
dave 14a39b6205 huskies: merge 980 2026-05-13 14:44:17 +00:00
dave 7854fbd78a huskies: merge 979 2026-05-13 14:14:00 +00:00
dave 4b18c01835 huskies: merge 973 2026-05-13 14:08:05 +00:00
dave e9a7468d8a huskies: merge 981 2026-05-13 14:01:02 +00:00
dave 77dc09668c huskies: merge 960 2026-05-13 13:24:15 +00:00
dave 93f774fcbb huskies: merge 967 2026-05-13 12:39:47 +00:00
dave 604fb55bd8 huskies: merge 959 2026-05-13 12:28:30 +00:00
dave 184c214c34 huskies: merge 962 2026-05-13 12:05:01 +00:00
dave 28338a8e8d huskies: merge 958 2026-05-13 11:52:51 +00:00
dave 8b53e20ca9 huskies: merge 961 2026-05-13 11:27:21 +00:00
dave 765d54fc4b huskies: merge 954 2026-05-13 09:35:51 +00:00
dave c228ae1640 fix: has_content_conflict_failure reads wrong CRDT key — auto-spawn mergemaster never fires
The function was calling `read_content(story_id)`, which returns the
story's *description* text (e.g. "Bug: Coder exits code 0 with
uncommitted work — force a commit-only respawn..."). It then scanned
that for "Merge conflict" / "CONFLICT (content):", which obviously
never matched, so the auto-spawn-mergemaster-on-content-conflict guard
in `pool/auto_assign/merge.rs` always saw `false` and skipped.

The actual gate output (where the merge runner stores the failure
message including conflict markers) lives at
`format!("{story_id}:gate_output")` — that's the key
`pipeline/advance/mod.rs:207` writes to. Read from there instead.

Witnessed: 954's merge hit a real `CONFLICT (content)` in
tests_regression.rs at 08:57:40, no mergemaster spawned, story stayed
in MergeFailure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 09:03:25 +00:00
dave 6a015d6202 huskies: merge 953 2026-05-13 08:57:35 +00:00
dave 7491eec257 fmt: collapse warm-resume unwrap_or_else closure per rustfmt
The 5-line spread of `.unwrap_or_else(|| { ... })` in spawn.rs (from
the bd517f28 + 65416476 warm-resume work) doesn't match rustfmt's
preference for the short form. Was blocking every merge gate since
the warm-resume fix landed.
2026-05-13 08:41:57 +00:00
dave 65416476e3 warm-resume: drop "read PLAN.md" from the resume nudge
Follow-up to bd517f28. When --resume succeeds, claude-code restores the
full prior conversation — the agent already has its file reads, tool
results, and reasoning in context. Telling it to "read PLAN.md" forces
a redundant tool call to re-read a doc it wrote itself. PLAN.md is the
cold-start orientation doc (driven by AGENT.md); the resume -p prompt
should just be a continuation nudge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 08:28:01 +00:00
dave bd517f2857 fix(warm-resume): send non-empty -p prompt with --resume so watchdog
respawns can actually warm

claude-code's --resume <session_id> requires either:
  a) a deferred-tool marker in the resumed session (i.e. the prior
     session paused mid-tool-call), or
  b) a non-empty -p prompt to continue the conversation with.

Watchdog-killed sessions have neither: the kill is asynchronous and
leaves no deferred-tool marker, and our harness was passing an empty
-p (because `resume_context_owned` is None for the common respawn
case). claude-code then aborts with:

  "Error: No deferred tool marker found in the resumed session.
   Either the session was not deferred, the marker is stale (tool
   already ran), or it exceeds the tail-scan window. Provide a
   prompt to continue the conversation."

The harness sees an aborted CLI with no session, prunes the recorded
session_id, and respawns cold — paying the full prompt-cache miss for
EVERY respawn. The new session_store logging (commit 0b50a624) made
this 100% legible: every warm spawn we observed went `mode=warm` →
crash → prune → `mode=cold` within a couple of seconds.

Fix: when resuming with no failure-context to send, default the -p
prompt to a brief "continue from PLAN.md" line. claude-code now has a
valid continuation message and warm-resume should actually work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 08:27:02 +00:00
dave 6e76b6a063 huskies: merge 930 2026-05-13 08:06:37 +00:00
dave a7840ea4b0 huskies: merge 946 2026-05-13 08:00:49 +00:00
dave 9ce5a8df0c huskies: merge 945 2026-05-13 06:09:34 +00:00
dave 3a8894ea8f obs: log warm/cold spawn mode at agent respawn decision point
Without this, the only way to tell whether a watchdog-respawn went warm
(--resume <session_id>) vs cold (fresh CLI invocation) was to read the
args list of the existing "Spawning claude with args:" log and check
whether --resume was present. That made it impossible to count
cold-paths or distinguish "supposed-to-be-warm but resume_failed
fallback" from "first session" without source-diving.

This adds one slog! per spawn, prefixed `[agent:{sid}:{name}] spawn
mode=warm|cold session_id=...`, so grep "spawn mode=" answers it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 05:44:46 +00:00
dave 2f50e2198b huskies: merge 951 2026-05-13 04:34:06 +00:00
Timmy c5abc44a63 test: serialise merge-pipeline tests against each other
The 12 tests in `agents::pool::pipeline::merge::tests` share a
process-wide `server_start_time` (a `OnceLock` captured the first time
the merge subsystem runs) and the global merge-job CRDT log. Default
cargo parallelism has caught at least one interleaving on the merge
gate's Docker scheduler where `stale_running_merge_job_is_cleared_and_retry_succeeds`
flakes — `delete_merge_job` from one test lands while another is mid-
assertion. Couldn't reproduce locally despite many tries.

Each test now acquires a poison-tolerant `std::sync::Mutex` at entry,
so the 12 tests run serially relative to each other while the rest of
the suite (2862 tests) stays parallel. Module-level
`#![allow(clippy::await_holding_lock)]` covers the deliberate sync
guard across `.await`s.

Targeted isolation — not a global `--test-threads=1`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 01:50:44 +01:00
Timmy d78dd9e8f9 feat(934): typed Stage enum replaces directory-string state model
The state machine's `Stage` enum becomes the source of truth for pipeline
state. Six stages of work land together:

  1. Clean wire vocabulary (`coding`, `merge`, `merge_failure`, ...) replaces
     legacy directory-style strings (`2_current`, `4_merge`, ...) on the wire.
     `Stage::from_dir` accepted both during deployment; new writes always
     emit the clean form via `stage_dir_name`. Lexicographic `dir >= "5_done"`
     checks in lifecycle.rs become typed `matches!` checks since the new
     vocabulary doesn't sort in pipeline order.
  2. `crdt_state::write_item` takes typed `&Stage`, serialising via
     `stage_dir_name` at the CRDT boundary. `#[cfg(test)] write_item_str`
     parses legacy strings for test fixtures.
  3. `WorkItem::stage()` returns typed `crdt_state::Stage`; `stage_str()`
     is gone from the public API. Projection dispatches on the typed enum.
  4. `frozen` becomes an orthogonal CRDT register. `Stage::Frozen` and
     `PipelineEvent::Freeze`/`Unfreeze` are removed; `transition_to_frozen`/
     `unfrozen` set the flag directly without touching the stage register.
  5. Watcher sweep and `tool_update_story`'s `blocked` setter route through
     `apply_transition` so the typed transition table validates every
     stage change. `update_story` gains a `frozen` field for symmetry.
  6. One-shot startup migration rewrites pre-934 directory-style stage
     registers (and sets `frozen=true` on items previously at `7_frozen`).
     `Stage::from_dir` drops legacy aliases. The db boundary keeps a small
     normaliser so callers with legacy strings (MCP, tests) still work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 22:31:59 +01:00
Timmy 69d91d7707 feat(929): delete db/yaml_legacy.rs entirely — CRDT is the sole source of truth
Final 929 sweep: every YAML-shaped helper is gone. No production code
parses or writes YAML front matter anywhere.

Surface removed:
- db/yaml_legacy.rs (FrontMatter/StoryMetadata structs, parse_front_matter,
  set_front_matter_field, yaml_residue marker) — file deleted.
- ItemMeta::from_yaml — deleted; callers pass typed ItemMeta::named(...) or
  ItemMeta::default() and use typed CRDT setters (set_depends_on,
  set_blocked, set_retry_count, set_agent, set_qa_mode, set_review_hold,
  set_item_type, set_epic, set_mergemaster_attempted) for the rest.
- write_coverage_baseline_to_story_file + read_coverage_percent_from_json —
  the coverage_baseline YAML field was write-only (nothing read it back);
  removed along with its caller in agent_tools/lifecycle.rs.
- update_story_in_file's generic `front_matter` HashMap parameter —
  tool_update_story now intercepts every known field name and routes it
  to a typed CRDT setter; unknown keys are rejected with an explicit error
  pointing at the typed setters. The function only takes user_story /
  description sections now.
- All 117 ItemMeta::from_yaml callsites migrated. Where tests previously
  passed a YAML-shaped content blob and relied on the helper to extract
  name/depends_on/blocked/agent/qa, they now pass:
    write_item_with_content(id, stage, content, ItemMeta::named("Foo"))
    crate::crdt_state::set_depends_on(id, &[...])    // when needed
    crate::crdt_state::set_blocked(id, true)         // when needed
    crate::crdt_state::set_agent(id, Some("..."))    // when needed
- write_story_content + write_story_file (test helper) now take an
  explicit `name: Option<&str>` instead of parsing it from content.
- db::ops::move_item_stage stopped re-parsing YAML on every stage
  transition; metadata is read straight from the CRDT view when mirroring
  the row into SQLite.

New CRDT setters added for symmetry:
- crdt_state::set_name (mirrors set_agent — explicit name updates).

cargo fmt --check, clippy --all-targets -- -D warnings, and the
2830-test suite all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 20:55:25 +01:00