Toolchain bump surfaced new lints (derivable_impls,
unnecessary_unwrap, unnecessary_sort_by, while_let_loop,
collapsible_match, unnecessary_option_map_or_else, cmp_owned)
across bft-json-crdt and huskies-server. All fixed mechanically.
Cargo.toml: dropped the no-longer-existing `rustls-tls` matrix-sdk
feature, then chased through the 0.17 API breakage:
- Relation::Reply is now a tuple variant wrapping Reply, not a
struct variant with `in_reply_to`
- UserIdentifier::UserIdOrLocalpart removed — use
UserIdentifier::Matrix(MatrixUserIdentifier::new(..))
- SendMessageLikeEventResult no longer exposes event_id directly;
it's now on the inner `response` field
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1018's merge_failure_block_subscriber counted every MergeFailure transition
toward the 3-strike block threshold, but mergemaster's recovery iterations
(squash → fail → fix → retry) emit multiple MergeFailure transitions while
making real progress. Story 997 was blocked at 10:59:46 while mergemaster
was still resolving conflicts and would have succeeded a minute later.
Fix: pass the AgentPool to the subscriber. When a mergemaster agent is in
the pool for the story, MergeFailure transitions are recovery iterations
in progress and do NOT increment the consecutive-failure counter. The block
fires only in the genuinely stuck case (no recovery agent attached and N
consecutive failures accumulated).
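A minimal sketch of the suppression rule, not the real subscriber. `FailureCounter`, `on_merge_failure` and `BLOCK_THRESHOLD` are hypothetical names, and the `recovery_running` flag stands in for the AgentPool lookup:

```rust
use std::collections::HashMap;

/// Hypothetical constant; the text above only says "3-strike".
const BLOCK_THRESHOLD: u32 = 3;

/// Illustrative stand-in for the subscriber's per-story failure counter.
#[derive(Default)]
struct FailureCounter {
    consecutive: HashMap<u32, u32>,
}

impl FailureCounter {
    /// Records one MergeFailure transition for `story_id`.
    /// Returns true when the story should be blocked.
    fn on_merge_failure(&mut self, story_id: u32, recovery_running: bool) -> bool {
        if recovery_running {
            // A mergemaster agent is attached: this is a recovery iteration,
            // so the counter is left untouched.
            return false;
        }
        let count = self.consecutive.entry(story_id).or_insert(0);
        *count += 1;
        *count >= BLOCK_THRESHOLD
    }
}
```

With `recovery_running = true` the map never gains an entry, which is the "counter stays empty" condition the first test checks.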
Tests:
- mergemaster_running_suppresses_block: 3 failures with recovery_running=true
→ counter stays empty, story stays in MergeFailure
- no_mergemaster_still_blocks_at_threshold: 3 failures with recovery_running=false
→ blocks (1018 behaviour preserved)
All 2938 tests pass.
The progress-aware no-progress cap (3 consecutive byte-identical diffs)
doesn't catch the degenerate pattern where the agent keeps making
DIFFERENT file edits each session but never commits: every respawn
resets the no-progress counter, so the loop never ends and the budget burns.
Adds ContentKey::CommitRecoveryTotalAttempts: an absolute counter that
increments on every commit-recovery respawn regardless of progress.
TOTAL_ATTEMPTS_CAP = 8; when hit, block with reason 'agent flapped — N
respawns without ever committing'.
Two caps now bound the recovery loop:
- NO_PROGRESS_CAP (3): catches stuck-agent (same diff repeatedly)
- TOTAL_ATTEMPTS_CAP (8): catches flapping-agent (different diffs, no commits)
Easy to tune the constant lower if we see runaway in practice.
All 2936 tests pass.
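A rough sketch of how the two caps could interact. `RecoveryState`, `Decision` and the order in which the caps are checked are assumptions for illustration; only the two constants and their values come from the message above:

```rust
const NO_PROGRESS_CAP: u32 = 3;
const TOTAL_ATTEMPTS_CAP: u32 = 8;

#[derive(Debug, PartialEq)]
enum Decision {
    Respawn,
    BlockNoProgress,
    BlockFlapping { attempts: u32 },
}

/// Hypothetical in-memory version of the counters the real code keeps
/// under ContentKey entries.
#[derive(Default)]
struct RecoveryState {
    last_fingerprint: Option<u64>,
    no_progress: u32,
    total_attempts: u32,
}

impl RecoveryState {
    /// Called each time the agent exits without committing.
    fn on_exit_without_commit(&mut self, fingerprint: u64) -> Decision {
        // The absolute counter increments regardless of progress.
        self.total_attempts += 1;
        // The no-progress counter resets to 1 whenever the diff changed.
        if self.last_fingerprint == Some(fingerprint) {
            self.no_progress += 1;
        } else {
            self.no_progress = 1;
        }
        self.last_fingerprint = Some(fingerprint);

        if self.no_progress >= NO_PROGRESS_CAP {
            Decision::BlockNoProgress
        } else if self.total_attempts >= TOTAL_ATTEMPTS_CAP {
            Decision::BlockFlapping { attempts: self.total_attempts }
        } else {
            Decision::Respawn
        }
    }
}
```

Eight exits with eight different fingerprints never trip the no-progress cap but do hit the total cap on the eighth.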
The existing commit-recovery path blocked stories on the 2nd consecutive
exit-without-commit. For long sweep refactors (e.g. story 997, the typed
retries payload migration), claude-code's session-length boundary
naturally terminates the coder mid-sweep before it can commit — even
though substantial file-edit progress is being made each session. The
old cap-of-1 misclassified normal mid-flight progress as 'agent declined
to commit'.
New behaviour:
- Each commit-recovery respawn captures a worktree-diff byte-length
fingerprint (git diff master | wc -c).
- If the fingerprint differs from the previous attempt, the agent made
file-edit progress and the no-progress counter resets to 1.
- If the fingerprint is unchanged (no new edits between exits),
increment the no-progress counter.
- Block only when the counter reaches NO_PROGRESS_CAP (3) — i.e. three
consecutive respawns where the agent did literally nothing.
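The fingerprint rule can be sketched as follows. The function names are hypothetical, and `worktree_fingerprint` assumes `git` is on PATH and the process runs inside the worktree:

```rust
use std::io;
use std::process::Command;

/// Byte length of a diff, mirroring `git diff master | wc -c`.
fn fingerprint(diff: &[u8]) -> usize {
    diff.len()
}

/// Hypothetical helper that shells out to git for the current worktree.
fn worktree_fingerprint() -> io::Result<usize> {
    let out = Command::new("git").args(["diff", "master"]).output()?;
    Ok(fingerprint(&out.stdout))
}

/// Next value of the no-progress counter given the previous and
/// current fingerprints.
fn next_no_progress(prev: Option<usize>, current: usize, counter: u32) -> u32 {
    if prev == Some(current) {
        counter + 1 // no new edits since the last exit
    } else {
        1 // the diff changed, so the agent made progress
    }
}
```

One caveat of a length-only fingerprint: two different diffs of the same byte length compare equal, so an edit that happens to preserve the length would be counted as no progress.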
Adds ContentKey::CommitRecoveryDiffFingerprint to store the prior
fingerprint. Updates the existing block-test to reflect the new cap
semantics; existing 'first respawn issued' test continues to pass.
All 2935 tests pass.
Temporary diagnostic added to reap_stale_merge_jobs to surface the t,
current_boot, and decoded values being compared on every reap pass.
Will revert once the disappearance bug is understood.
When deterministic-merge produces a clean git squash but the post-squash
compile fails (typical when master gained a Stage payload field after the
feature branch forked — e.g. story 1018 hit `error[E0063]: missing field
plan` after 1010's PlanState landed), the failure is morally a merge
conflict that git's diff3 missed: the conflicting literal lives in a
different file from the type definition that changed on master. Routing
it as GatesFailed left mergemaster idle and the story stuck.
Changes:
- gates.rs GateFailureKind::classify: detect rustc compile errors
(`error[E\d+]`) as Build instead of falling through to Test. Clippy
errors (`error[clippy::...]`) still classify as Lint.
- agents/merge/mod.rs: new MergeResult::to_merge_failure_kind() method.
GateFailure with failure_kind=Build maps to ConflictDetected (so the
existing 998 subscriber auto-spawns mergemaster). Other gate failures
stay GatesFailed.
- agents/pool/pipeline/merge/runner.rs: replace the inline match with a
call to the new method.
Tests: 6 new unit tests covering the classifier branch and every
to_merge_failure_kind arm. All 2932 tests pass.
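A simplified sketch of the classifier branch and the mapping. The real `classify` and `to_merge_failure_kind` live on `GateFailureKind` and `MergeResult`; the free functions, the hand-rolled matcher (used instead of a regex crate to stay dependency-free) and the check order here are assumptions:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum GateFailureKind {
    Build,
    Lint,
    Test,
}

#[derive(Debug, PartialEq)]
enum MergeFailureKind {
    ConflictDetected,
    GatesFailed,
}

/// True if `output` contains `error[E<digits>]`, i.e. a rustc error code.
fn has_rustc_error_code(output: &str) -> bool {
    const PREFIX: &str = "error[E";
    let mut rest = output;
    while let Some(pos) = rest.find(PREFIX) {
        let after = &rest[pos + PREFIX.len()..];
        let digits = after.bytes().take_while(|b| b.is_ascii_digit()).count();
        if digits > 0 && after[digits..].starts_with(']') {
            return true;
        }
        rest = after;
    }
    false
}

fn classify(output: &str) -> GateFailureKind {
    if has_rustc_error_code(output) {
        GateFailureKind::Build
    } else if output.contains("error[clippy::") {
        GateFailureKind::Lint
    } else {
        GateFailureKind::Test
    }
}

/// Build failures are routed as conflicts so mergemaster gets spawned;
/// every other gate failure stays GatesFailed.
fn to_merge_failure_kind(kind: GateFailureKind) -> MergeFailureKind {
    match kind {
        GateFailureKind::Build => MergeFailureKind::ConflictDetected,
        GateFailureKind::Lint | GateFailureKind::Test => MergeFailureKind::GatesFailed,
    }
}
```

For example, `classify("error[E0063]: missing field")` yields `Build`, which then maps to `ConflictDetected`.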
A mid-merge server restart used to silently kill the merge: the
in-flight tokio task died with the process, reap_stale_merge_jobs ran
on the new boot, saw the Running entry from the previous boot, and
simply deleted it. Mergemaster polling `get_merge_status` then saw
"Merge job disappeared", treated it as a strike, and after three
restarts escalated the story to MergeFailureFinal — even though no
real merge failure ever happened (this is what trapped story 998
during the bug 1001 iteration cycle).
Reap now also fires a `WatcherEvent::WorkItem reassign` for the
cleared story so the auto-assign watcher loop re-runs
start_merge_agent_work on the fresh boot. The story is still in
4_merge/; the merge resumes automatically. The change is contained to
the reap path — start_merge_agent_work's own behaviour is unchanged.
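A toy model of the new reap behaviour. It uses `std::sync::mpsc` in place of whatever channel the watcher really uses, and the `MergeJob` and `WatcherEvent` shapes are invented for illustration:

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

/// Hypothetical, trimmed-down event type.
#[derive(Debug, PartialEq)]
enum WatcherEvent {
    Reassign { story_id: u32 },
}

/// Hypothetical job record: only the boot it was started on.
struct MergeJob {
    boot_id: u64,
}

/// Removes jobs left over from a previous boot and emits one reassign
/// event per cleared story. Returns the number of jobs reaped.
fn reap_stale_merge_jobs(
    jobs: &mut HashMap<u32, MergeJob>,
    current_boot: u64,
    tx: &Sender<WatcherEvent>,
) -> usize {
    let stale: Vec<u32> = jobs
        .iter()
        .filter(|(_, job)| job.boot_id != current_boot)
        .map(|(id, _)| *id)
        .collect();
    for id in &stale {
        jobs.remove(id);
        // Ignore send errors: a closed receiver just means nobody is listening.
        let _ = tx.send(WatcherEvent::Reassign { story_id: *id });
    }
    stale.len()
}
```

The sketch only clears the entry and sends the event; it never starts a merge itself, which matches the "without_merge" guarantee mentioned below.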
Added regression test
reap_stale_merge_jobs_emits_reassign_watcher_event that asserts the
new event fires. Existing
reap_stale_merge_jobs_removes_old_running_entry_without_merge still
passes (the "without_merge" guarantee is about agent spawning, not
about absence of watcher events).
Also exposes AgentPool::watcher_tx() as pub(crate) so the merge
runner can fan out re-dispatch events.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MergeFailureFinal was a dead end for move_story: the only transitions
out were Freeze (→ Frozen) and a self-loop on MergemasterAttempted, so
once mergemaster exhausted its 3-retry budget the only way to get a
story coding again was to delete + recreate it.
The respawn budget is a mergemaster bookkeeping detail, not a hard
ceiling. A human operator inspecting a Final story can reasonably
decide the gate failure is fixable, so this adds the same
FixupRequested → Coding edge that already exists for plain
MergeFailure.
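The edges described above, as a small sketch. Only the transitions named in this message are modelled; the real state machine has more stages and events, and the `transition` function is a hypothetical stand-in:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Stage {
    Coding,
    MergeFailure,
    MergeFailureFinal,
    Frozen,
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum Event {
    Freeze,
    MergemasterAttempted,
    FixupRequested,
}

/// Returns the next stage, or None if the edge is not modelled here.
fn transition(stage: Stage, event: Event) -> Option<Stage> {
    match (stage, event) {
        // Pre-existing edges out of MergeFailureFinal.
        (Stage::MergeFailureFinal, Event::Freeze) => Some(Stage::Frozen),
        (Stage::MergeFailureFinal, Event::MergemasterAttempted) => {
            Some(Stage::MergeFailureFinal)
        }
        // New edge, mirroring the one plain MergeFailure already had.
        (Stage::MergeFailureFinal, Event::FixupRequested) => Some(Stage::Coding),
        (Stage::MergeFailure, Event::FixupRequested) => Some(Stage::Coding),
        _ => None,
    }
}
```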
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recovery tool was a one-shot migration aid for the half-written
items that existed before the Stage 1 allocator fix. The three live
orphans (989/1000/1001) have been migrated; the Stage 1 fix prevents
new half-writes; the tool's job is done.
Removes the MCP wrapper, schema, dispatch case, and tools-list
assertion. The db::recover module itself stays in-process (under
`#[allow(dead_code)]`) so it can be re-exposed quickly if the bug
ever resurfaces — its regression tests still run as part of the
default suite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first dry-run against the live pipeline surfaced 735 orphans (35
tombstoned half-writes, 700 stale content rows with no CRDT entry —
mostly artefacts of the pre-numeric-id era). Bulk-recovering would
resurrect a lot of stories the user deliberately purged in the past.
Add an optional `story_ids` filter that restricts both discovery (in
dry-run) and recovery to a named subset, so the operator can target
the specific recent half-writes without touching anything else. The
new test asserts the filter is honoured.
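A sketch of the filter semantics, assuming an absent `story_ids` means "all orphans". The `Orphan` struct and `filter_orphans` are hypothetical, and IDs are modelled as strings because some rows predate numeric IDs:

```rust
/// Hypothetical orphan record; the real one carries more fields.
#[derive(Debug, PartialEq)]
struct Orphan {
    id: String,
}

/// Restricts `orphans` to the named subset when a filter is given.
fn filter_orphans<'a>(orphans: &'a [Orphan], story_ids: Option<&[&str]>) -> Vec<&'a Orphan> {
    orphans
        .iter()
        .filter(|o| match story_ids {
            Some(ids) => ids.contains(&o.id.as_str()),
            None => true, // no filter: every orphan is in scope
        })
        .collect()
}
```

Applying the same function in both the dry-run and the recovery path keeps the preview and the mutation in agreement about which items are in scope.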
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds db::recover, a discovery + recovery layer for pipeline items that
got half-written before the Stage 1 fix landed (content in content
store + SQLite shadow, no live CRDT entry). For each orphan, the
content body is re-anchored to a fresh non-tombstoned id and the old
id's content row is cleared.
Exposed as the recover_half_written_items MCP tool. dry_run defaults
to true so the caller can review what would change before mutating.
YAML front-matter parsing is hand-rolled and scoped to the three
fields the create_*_file path emits (name, type, depends_on). It
tolerates missing or malformed lines by falling back to safe
defaults; the orphan is recovered with the best metadata we can pull
from the body and the rest is left to the operator to fix up.
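A hand-rolled parser in this spirit might look like the sketch below. The default values, the `FrontMatter` struct, and the inline-list form of `depends_on` (`[989, 1000]`) are assumptions; the actual on-disk format may differ:

```rust
#[derive(Debug, PartialEq)]
struct FrontMatter {
    name: String,
    item_type: String,
    depends_on: Vec<u32>,
}

/// Parses the three known fields, falling back to defaults on anything
/// missing or malformed. Never fails.
fn parse_front_matter(body: &str) -> FrontMatter {
    // Hypothetical safe defaults.
    let mut fm = FrontMatter {
        name: "untitled".to_string(),
        item_type: "story".to_string(),
        depends_on: Vec::new(),
    };
    let mut lines = body.lines();
    if lines.next().map(str::trim) != Some("---") {
        return fm; // no front-matter block at all
    }
    for line in lines {
        let line = line.trim();
        if line == "---" {
            break; // end of the front-matter block
        }
        // Lines without a colon are skipped rather than treated as errors.
        let Some((key, value)) = line.split_once(':') else {
            continue;
        };
        let value = value.trim().trim_matches('"');
        match key.trim() {
            "name" if !value.is_empty() => fm.name = value.to_string(),
            "type" if !value.is_empty() => fm.item_type = value.to_string(),
            "depends_on" => {
                // Unparseable entries are dropped, not fatal.
                fm.depends_on = value
                    .trim_start_matches('[')
                    .trim_end_matches(']')
                    .split(',')
                    .filter_map(|s| s.trim().parse().ok())
                    .collect();
            }
            _ => {}
        }
    }
    fm
}
```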
The discovery step is read-only and idempotent. Recovery is also
idempotent in the sense that once an orphan is lifted, the next
discovery pass won't see it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: db::next_item_number scanned the visible CRDT index and the
content store but not the tombstone set, so it would hand out a numeric
ID whose CRDT entry had been tombstoned. crdt_state::write_item then
silently no-op'd the insert (tombstone-match guard) while the content
store and SQLite shadow happily accepted the row, producing a split-
brain half-write that was invisible to every CRDT-driven read path and
couldn't be cleaned up by delete_story / purge_story.
This change closes the loop:
- crdt_state::read::{is_tombstoned, tombstoned_ids} expose the
tombstone set so callers outside crdt_state can consult it.
- db::next_item_number now scans tombstoned_ids() too. The allocator
skips past tombstoned numeric IDs instead of treating their slots as
free.
- write_item logs a WARN when it rejects a write for a tombstoned ID
(was silent). The warn is a tripwire — if the allocator ever lets one
slip through again we'll see it in the log.
- create_item_in_backlog adds two defence-in-depth checks:
(a) before any write, reject if the allocator returned a
tombstoned ID;
(b) after the writes, call read_item to confirm the CRDT entry
materialised. If not, roll back the content-store + shadow-DB
rows via db::delete_item and return Err.
Regression tests cover the allocator skip, the is_tombstoned accessor,
and the create_item_in_backlog rollback path.
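A toy model of the fixed allocator and the two defence-in-depth checks. It assumes the allocator hands out max-plus-one over all three ID sets, which the message does not state; `Store` and its methods are illustrative, not the real `db` or `crdt_state` API:

```rust
use std::collections::BTreeSet;

/// Next free numeric ID, considering tombstoned IDs as taken.
fn next_item_number(
    visible: &BTreeSet<u32>,
    content: &BTreeSet<u32>,
    tombstoned: &BTreeSet<u32>,
) -> u32 {
    visible
        .iter()
        .chain(content)
        .chain(tombstoned)
        .max()
        .map_or(1, |m| m + 1)
}

#[derive(Default)]
struct Store {
    crdt: BTreeSet<u32>,
    tombstones: BTreeSet<u32>,
    content: BTreeSet<u32>,
}

impl Store {
    /// CRDT insert with the tombstone guard; warns instead of failing silently.
    fn write_item(&mut self, id: u32) {
        if self.tombstones.contains(&id) {
            eprintln!("WARN: rejected write for tombstoned id {id}");
            return;
        }
        self.crdt.insert(id);
    }

    fn create_item(&mut self, id: u32) -> Result<u32, String> {
        // (a) refuse before any write if the ID is tombstoned.
        if self.tombstones.contains(&id) {
            return Err(format!("id {id} is tombstoned"));
        }
        self.content.insert(id);
        self.write_item(id);
        // (b) confirm the CRDT entry materialised; roll back otherwise.
        // In this toy model (a) already covers the only rejection cause.
        if !self.crdt.contains(&id) {
            self.content.remove(&id);
            return Err(format!("CRDT write for id {id} did not materialise"));
        }
        Ok(id)
    }
}
```

With visible IDs {1, 2}, content {3} and tombstones {4}, the allocator returns 5 rather than reusing 4.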
Out of scope for this commit:
- Recovery of the already-half-written items currently in the running
pipeline (989, 1000, 1001) — Stage 2/3 of the plan, handled
separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>