Commit Graph

3672 Commits

Author SHA1 Message Date
Timmy 8e996e2bd3 fix(1025): gate auto-block counter on mergemaster presence
1018's merge_failure_block_subscriber counted every MergeFailure transition
toward the 3-strike block threshold, but mergemaster's recovery iterations
(squash → fail → fix → retry) emit multiple MergeFailure transitions while
making real progress. Story 997 was blocked at 10:59:46 while mergemaster
was still resolving conflicts and would have succeeded a minute later.

Fix: pass the AgentPool to the subscriber. When a mergemaster agent is in
the pool for the story, MergeFailure transitions are recovery iterations
in progress and do NOT increment the consecutive-failure counter. The
block fires only in the genuinely-stuck case: no recovery agent attached
and N consecutive failures accumulated.
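
Roughly (a sketch; the bool parameter, counter map, and free-function
shape are illustrative, not the real subscriber API):

    use std::collections::HashMap;

    const BLOCK_THRESHOLD: u32 = 3;

    /// Returns true when the story should be blocked.
    fn on_merge_failure(
        has_mergemaster: bool,            // queried from the AgentPool for this story
        counters: &mut HashMap<u64, u32>, // story id -> consecutive failures
        story: u64,
    ) -> bool {
        if has_mergemaster {
            // Recovery iterations (squash -> fail -> fix -> retry) emit
            // MergeFailure transitions too; they are not strikes.
            return false;
        }
        let strikes = counters.entry(story).or_insert(0);
        *strikes += 1;
        *strikes >= BLOCK_THRESHOLD
    }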

Tests:
- mergemaster_running_suppresses_block: 3 failures with recovery_running=true
  → counter stays empty, story stays in MergeFailure
- no_mergemaster_still_blocks_at_threshold: 3 failures with recovery_running=false
  → blocks (1018 behaviour preserved)

All 2938 tests pass.
2026-05-14 12:13:37 +01:00
dave c7a7cb4281 huskies: merge 997 2026-05-14 11:06:27 +00:00
Timmy 0572af2193 feat: outer cap on commit-recovery respawns catches flapping agents
The progress-aware no-progress cap (3 consecutive byte-identical diffs)
doesn't catch the degenerate pattern where the agent keeps making
DIFFERENT file edits each session but never commits — every respawn
resets the no-progress counter, so the loop never terminates and the
budget burns.

Adds ContentKey::CommitRecoveryTotalAttempts: an absolute counter that
increments on every commit-recovery respawn regardless of progress.
TOTAL_ATTEMPTS_CAP = 8; when hit, block with reason 'agent flapped — N
respawns without ever committing'.

Two caps now bound the recovery loop:
- NO_PROGRESS_CAP (3): catches stuck-agent (same diff repeatedly)
- TOTAL_ATTEMPTS_CAP (8): catches flapping-agent (different diffs, no commits)
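
Their interplay, sketched (the enum shape and free function are
illustrative; the constants are the real values):

    const NO_PROGRESS_CAP: u32 = 3;
    const TOTAL_ATTEMPTS_CAP: u32 = 8;

    enum RespawnDecision {
        Respawn,
        Block(&'static str),
    }

    fn decide(no_progress: u32, total_attempts: u32) -> RespawnDecision {
        // Absolute counter: incremented on every respawn, never reset,
        // so it bounds the loop even when each session edits new files.
        if total_attempts >= TOTAL_ATTEMPTS_CAP {
            return RespawnDecision::Block("agent flapped — respawns without ever committing");
        }
        // Progress-aware counter: reset whenever the worktree diff changes.
        if no_progress >= NO_PROGRESS_CAP {
            return RespawnDecision::Block("agent made no progress on consecutive respawns");
        }
        RespawnDecision::Respawn
    }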

The cap is easy to tune lower if we see runaway respawns in practice.
All 2936 tests pass.
2026-05-14 11:34:17 +01:00
Timmy bab337b289 feat: progress-aware commit-recovery cap (no longer block on 2nd attempt)
The existing commit-recovery path blocked stories on the 2nd consecutive
exit-without-commit. For long sweep refactors (e.g. story 997, the typed
retries payload migration), claude-code's session-length boundary
naturally terminates the coder mid-sweep before it can commit — even
though substantial file-edit progress is being made each session. The
old cap-of-1 misclassified normal mid-flight progress as 'agent declined
to commit'.

New behaviour:
- Each commit-recovery respawn captures a worktree-diff byte-length
  fingerprint (git diff master | wc -c).
- If the fingerprint differs from the previous attempt, the agent made
  file-edit progress and the no-progress counter resets to 1.
- If the fingerprint is byte-identical (no new edits between exits),
  increment the no-progress counter.
- Block only when the counter reaches NO_PROGRESS_CAP (3) — i.e. three
  consecutive respawns where the agent did literally nothing.
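
The counter update, in essence (a sketch; the real code threads the
stored fingerprint through the content store rather than a free
function):

    /// `prev` is the fingerprint stored by the previous attempt, if any.
    fn next_no_progress_count(prev: Option<u64>, current: u64, counter: u32) -> u32 {
        match prev {
            // Diff byte-length changed: the agent edited files between
            // exits, so the count restarts at 1.
            Some(p) if p != current => 1,
            // Byte-identical diff: literally nothing new happened.
            Some(_) => counter + 1,
            // First respawn: nothing to compare against yet.
            None => 1,
        }
    }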

Adds ContentKey::CommitRecoveryDiffFingerprint to store the prior
fingerprint. Updates the existing block-test to reflect the new cap
semantics; existing 'first respawn issued' test continues to pass.

All 2935 tests pass.
2026-05-14 11:24:02 +01:00
Timmy 5e5c5a0e08 revert: remove temporary merge-reap diagnostic logging
Reverts the diagnostic introduced in 91b4e4ff. Will re-add when we
actively debug the disappearance bug again.
2026-05-14 10:57:37 +01:00
Timmy 91b4e4ff7c diag: log merge-reap values to debug disappearance bug
Temporary diagnostic added to reap_stale_merge_jobs to surface the t,
current_boot, and decoded values being compared on every reap pass.
Will revert once the disappearance bug is understood.
2026-05-14 10:42:16 +01:00
dave 309542cf2c huskies: merge 1018 2026-05-14 09:38:15 +00:00
Timmy 8b2ba1c810 fix: post-squash compile errors reclassify as semantic merge conflicts
When deterministic-merge produces a clean git squash but the post-squash
compile fails (typical when master gained a Stage payload field after the
feature branch forked — e.g. story 1018 hit `error[E0063]: missing field
plan` after 1010's PlanState landed), the failure is morally a merge
conflict that git's diff3 missed: the conflicting literal lives in a
different file from the type definition that changed on master. Routing
it as GatesFailed left mergemaster idle and the story stuck.

Changes:
- gates.rs GateFailureKind::classify: detect rustc compile errors
  (`error[E\d+]`) as Build instead of falling through to Test. Clippy
  errors (`error[clippy::...]`) still classify as Lint.
- agents/merge/mod.rs: new MergeResult::to_merge_failure_kind() method.
  GateFailure with failure_kind=Build maps to ConflictDetected (so the
  existing 998 subscriber auto-spawns mergemaster). Other gate failures
  stay GatesFailed.
- agents/pool/pipeline/merge/runner.rs: replace the inline match with a
  call to the new method.
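
Sketched together (substring checks stand in for the real
`error[E\d+]` regex; the enum shapes are simplified):

    enum GateFailureKind { Build, Lint, Test }
    enum MergeFailureKind { ConflictDetected, GatesFailed }

    fn classify(output: &str) -> GateFailureKind {
        if output.contains("error[clippy::") {
            GateFailureKind::Lint // clippy errors stay Lint
        } else if output.contains("error[E") {
            GateFailureKind::Build // rustc errors, e.g. error[E0063]
        } else {
            GateFailureKind::Test
        }
    }

    fn to_merge_failure_kind(kind: GateFailureKind) -> MergeFailureKind {
        match kind {
            // A compile failure after a clean squash is a semantic merge
            // conflict: route it so the 998 subscriber spawns mergemaster.
            GateFailureKind::Build => MergeFailureKind::ConflictDetected,
            _ => MergeFailureKind::GatesFailed,
        }
    }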

Tests: 6 new unit tests covering the classifier branch and every
to_merge_failure_kind arm. All 2932 tests pass.
2026-05-14 10:18:33 +01:00
dave e3f5875b8e huskies: merge 1019 2026-05-14 08:52:38 +00:00
dave ebf58ef224 huskies: merge 1008 2026-05-14 08:46:16 +00:00
dave 761b6934f1 huskies: merge 1007 2026-05-14 08:41:44 +00:00
dave 13ab97a615 huskies: merge 1010 2026-05-14 08:12:56 +00:00
dave 4520e0e6f9 huskies: merge 995 2026-05-14 07:55:40 +00:00
dave 52180bc402 huskies: merge 1017 2026-05-13 23:55:35 +00:00
dave 29e800da21 huskies: merge 1016 2026-05-13 23:51:07 +00:00
dave 5ed1438ab9 huskies: merge 1015 2026-05-13 23:39:17 +00:00
dave 69b207872a huskies: merge 1014 2026-05-13 23:25:10 +00:00
dave 8754c790b9 huskies: merge 1013 2026-05-13 23:12:18 +00:00
dave 4e007bb770 huskies: merge 1009 2026-05-13 22:55:05 +00:00
dave a5cd3a2152 huskies: merge 994 2026-05-13 22:38:51 +00:00
dave 1ee23e7bfe huskies: merge 996 2026-05-13 22:29:09 +00:00
dave cd9021fedf huskies: merge 1006 2026-05-13 21:41:39 +00:00
dave eb48ef19e7 huskies: merge 1011 2026-05-13 21:32:11 +00:00
Timmy 2758f744f2 fix: reap_stale_merge_jobs re-dispatches instead of just deleting
A mid-merge server restart used to silently kill the merge: the
in-flight tokio task died with the process, reap_stale_merge_jobs ran
on the new boot, saw the Running entry from the previous boot, and
simply deleted it. Mergemaster polling `get_merge_status` then saw
"Merge job disappeared", treated it as a strike, and after three
restarts escalated the story to MergeFailureFinal — even though no
real merge failure ever happened (this is what trapped story 998
during the bug 1001 iteration cycle).

Reap now also fires a `WatcherEvent::WorkItem reassign` for the
cleared story so the auto-assign watcher loop re-runs
start_merge_agent_work on the fresh boot. The story is still in
4_merge/; the merge resumes automatically. The change is contained to
the reap path — start_merge_agent_work's own behaviour is unchanged.
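
In outline (the event shape and channel are simplified stand-ins for
the real WatcherEvent and the pool's watcher_tx):

    enum WatcherEvent {
        WorkItem { story: u64, reassign: bool },
    }

    fn reap_stale_entry(story: u64, tx: &std::sync::mpsc::Sender<WatcherEvent>) {
        // As before, the stale Running entry is cleared (elided here).
        // New: re-dispatch so the auto-assign watcher loop re-runs
        // start_merge_agent_work on the fresh boot instead of leaving
        // the story in 4_merge/ with no job attached.
        let _ = tx.send(WatcherEvent::WorkItem { story, reassign: true });
    }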

Added regression test
reap_stale_merge_jobs_emits_reassign_watcher_event that asserts the
new event fires. Existing
reap_stale_merge_jobs_removes_old_running_entry_without_merge still
passes (the "without_merge" guarantee is about agent spawning, not
about absence of watcher events).

Also exposes AgentPool::watcher_tx() as pub(crate) so the merge
runner can fan out re-dispatch events.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 21:28:10 +01:00
dave bbdee1239b huskies: merge 998 2026-05-13 19:33:33 +00:00
Timmy 75dc1fc15a feat: MergeFailureFinal → Coding via operator FixupRequested
MergeFailureFinal was a dead end in move_story: the only transitions
out were Freeze (→ Frozen) and a self-loop on MergemasterAttempted, so
once mergemaster exhausted its 3-retry budget the only way to get a
story coding again was to delete + recreate it.

The respawn budget is a mergemaster bookkeeping detail, not a hard
ceiling. A human operator inspecting a Final story can reasonably
decide the gate failure is fixable, so this adds the same
FixupRequested → Coding edge that already exists for plain
MergeFailure.
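
The new edge, as a transition-table sketch (variants abbreviated; only
the relevant arms shown):

    enum Stage { MergeFailure, MergeFailureFinal, Coding, Frozen }
    enum Op { FixupRequested, Freeze }

    fn next_stage(current: Stage, op: Op) -> Option<Stage> {
        match (current, op) {
            // Pre-existing edge for plain MergeFailure...
            (Stage::MergeFailure, Op::FixupRequested) => Some(Stage::Coding),
            // ...now mirrored for the Final state.
            (Stage::MergeFailureFinal, Op::FixupRequested) => Some(Stage::Coding),
            (Stage::MergeFailureFinal, Op::Freeze) => Some(Stage::Frozen),
            _ => None,
        }
    }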

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:21:48 +01:00
Timmy b6898886d7 chore(1001): retire recover_half_written_items from MCP surface
The recovery tool was a one-shot migration aid for the half-written
items that existed before the Stage 1 allocator fix. The three live
orphans (989/1000/1001) have been migrated; the Stage 1 fix prevents
new half-writes; the tool's job is done.

Removes the MCP wrapper, schema, dispatch case, and tools-list
assertion. The db::recover module itself stays in-process (under
`#[allow(dead_code)]`) so it can be re-exposed quickly if the bug
ever resurfaces — its regression tests still run as part of the
default suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 19:36:02 +01:00
Timmy 92b1744c3a feat(1001): story_ids filter for recover_half_written_items
The first dry-run against the live pipeline surfaced 735 orphans (35
tombstoned half-writes, 700 stale content rows with no CRDT entry —
mostly artefacts of the pre-numeric-id era). Bulk-recovering would
resurrect a lot of stories the user deliberately purged in the past.

Add an optional `story_ids` filter that restricts both discovery (in
dry-run) and recovery to a named subset, so the operator can target
the specific recent half-writes without touching anything else. The
new test asserts the filter is honoured.
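
The filter's shape (an illustrative signature; the ids are the numeric
story ids):

    fn filter_orphans(discovered: Vec<u64>, story_ids: Option<&[u64]>) -> Vec<u64> {
        match story_ids {
            // No filter: full discovery, as before.
            None => discovered,
            // Filter: restrict both the dry-run listing and recovery to
            // the named subset, leaving everything else untouched.
            Some(ids) => discovered.into_iter().filter(|o| ids.contains(o)).collect(),
        }
    }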

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 19:26:07 +01:00
Timmy cd411ba443 feat(1001): recover_half_written_items MCP tool
Adds db::recover, a discovery + recovery layer for pipeline items that
got half-written before the Stage 1 fix landed (content in content
store + SQLite shadow, no live CRDT entry). For each orphan, the
content body is re-anchored to a fresh non-tombstoned id and the old
id's content row is cleared.

Exposed as the recover_half_written_items MCP tool. dry_run defaults
to true so the caller can review what would change before mutating.

YAML front-matter parsing is hand-rolled and scoped to the three
fields the create_*_file path emits (name, type, depends_on). It
tolerates missing or malformed lines by falling back to safe
defaults; the orphan is recovered with the best metadata we can pull
from the body and the rest is left to the operator to fix up.
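
In spirit (a sketch; the default values are assumptions, and `kind`
stands in for the `type` field because `type` is a Rust keyword):

    struct FrontMatter {
        name: String,
        kind: String, // the `type:` field
        depends_on: Vec<String>,
    }

    fn parse_front_matter(body: &str) -> FrontMatter {
        // Safe defaults, used when a line is missing or malformed.
        let mut fm = FrontMatter {
            name: String::new(),
            kind: String::from("story"),
            depends_on: Vec::new(),
        };
        for line in body.lines() {
            // Lines without a `key: value` shape are skipped outright.
            let Some((key, value)) = line.split_once(':') else { continue };
            match key.trim() {
                "name" => fm.name = value.trim().to_string(),
                "type" => fm.kind = value.trim().to_string(),
                "depends_on" => {
                    fm.depends_on = value
                        .trim()
                        .trim_start_matches('[')
                        .trim_end_matches(']')
                        .split(',')
                        .map(|s| s.trim().to_string())
                        .filter(|s| !s.is_empty())
                        .collect();
                }
                _ => {}
            }
        }
        fm
    }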

The discovery step is read-only and idempotent. Recovery is also
idempotent in the sense that once an orphan is lifted, the next
discovery pass won't see it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 19:16:05 +01:00
Timmy c61f715878 fix(1001): stop create_* from half-writing onto tombstoned IDs
Root cause: db::next_item_number scanned the visible CRDT index and the
content store but not the tombstone set, so it would hand out a numeric
ID whose CRDT entry had been tombstoned. crdt_state::write_item then
silently no-op'd the insert (tombstone-match guard) while the content
store and SQLite shadow happily accepted the row, producing a split-
brain half-write that was invisible to every CRDT-driven read path and
couldn't be cleaned up by delete_story / purge_story.

This change closes the loop:

- crdt_state::read::{is_tombstoned, tombstoned_ids} expose the
  tombstone set so callers outside crdt_state can consult it.

- db::next_item_number now scans tombstoned_ids() too. The allocator
  skips past tombstoned numeric IDs instead of treating their slots as
  free.

- write_item logs a WARN when it rejects a write for a tombstoned ID
  (was silent). The warn is a tripwire — if the allocator ever lets one
  slip through again we'll see it in the log.

- create_item_in_backlog adds two defence-in-depth checks:
    (a) before any write, reject if the allocator returned a
        tombstoned ID;
    (b) after the writes, call read_item to confirm the CRDT entry
        materialised. If not, roll back the content-store + shadow-DB
        rows via db::delete_item and return Err.

Regression tests cover the allocator skip, the is_tombstoned accessor,
and the create_item_in_backlog rollback path.
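
The allocator scan, roughly (the set arguments are a simplification of
the real index lookups):

    use std::collections::HashSet;

    fn next_item_number(
        visible: &HashSet<u64>,    // live CRDT index
        content: &HashSet<u64>,    // content store + SQLite shadow
        tombstoned: &HashSet<u64>, // NEW: crdt_state::read::tombstoned_ids()
    ) -> u64 {
        let mut n = 1;
        // A slot is taken if any layer knows the id, including the
        // tombstone set, whose slots were previously treated as free.
        while visible.contains(&n) || content.contains(&n) || tombstoned.contains(&n) {
            n += 1;
        }
        n
    }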

Out of scope for this commit:
- Recovery of the already-half-written items currently in the running
  pipeline (989, 1000, 1001) — Stage 2/3 of the plan, handled
  separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 19:05:48 +01:00
dave caed894db9 huskies: merge 988 2026-05-13 17:28:52 +00:00
dave a078d3df7c huskies: merge 985 2026-05-13 16:52:19 +00:00
dave 580480094e huskies: merge 984 2026-05-13 16:47:51 +00:00
dave c3c9db3d8b huskies: merge 987 2026-05-13 16:30:31 +00:00
dave 430079ecbc huskies: merge 986 2026-05-13 16:01:51 +00:00
dave 91fbad568a huskies: merge 982 2026-05-13 15:34:41 +00:00
dave e6d051d016 huskies: merge 983 2026-05-13 15:29:59 +00:00
dave f268dca5bb huskies: merge 977 2026-05-13 15:11:37 +00:00
dave dcb43c465a huskies: merge 964 2026-05-13 14:56:08 +00:00
Timmy c811672e18 huskies: progress 983 — differentiated icons for stuck-story states
Distinct icons in StagePanel/GatewayPanel/render.rs status output for
blocked-with-running-recovery (robot), blocked-with-queued-recovery (hourglass),
and blocked-cold (red circle). All 2822 tests pass.
2026-05-13 15:46:36 +01:00
dave 14a39b6205 huskies: merge 980 2026-05-13 14:44:17 +00:00
Timmy 246f44d8f3 fix: widen keepalive test timeout to eliminate CI flake
keepalive_connection_survives_with_pong_responses set ping_ms=100,
timeout_ms=250, so the server's pong-deadline fired ~560ms after the
first ping — only ~60ms past the end of the test's 500ms await window.
Under CI scheduler jitter that 60ms slack was insufficient and the
server timer fired inside the test window, closing the connection
mid-await and producing a flake.

Bump timeout_ms to 2000ms so the pong-deadline cannot fire within
the test window under any realistic jitter. ping_ms stays at 100ms
so the test still exercises multiple ping/pong rounds in the same
wall-clock budget.

Test still passes locally; was hitting 964's merge gate as a flake.
2026-05-13 15:41:25 +01:00
dave e5d2465f66 huskies: merge 974 2026-05-13 14:26:42 +00:00
dave 7854fbd78a huskies: merge 979 2026-05-13 14:14:00 +00:00
dave 4b18c01835 huskies: merge 973 2026-05-13 14:08:05 +00:00
dave e9a7468d8a huskies: merge 981 2026-05-13 14:01:02 +00:00
dave 51aa649ce4 huskies: merge 978 2026-05-13 13:51:05 +00:00
dave 6fc6c9fcb2 huskies: merge 975 2026-05-13 13:45:10 +00:00
dave 5617da5c27 huskies: merge 972 2026-05-13 13:39:20 +00:00
dave 61815ebf5c huskies: merge 976 2026-05-13 13:31:19 +00:00