huskies

Author	SHA1	Message	Date
dave	8faf19f3ab	huskies: merge 1034	2026-05-14 14:02:21 +00:00
Timmy	8625b9a7fc	fix: rust 1.95.0 clippy lints and matrix-sdk 0.17 API changes Toolchain bump surfaced new lints (derivable_impls, unnecessary_unwrap, unnecessary_sort_by, while_let_loop, collapsible_match, unnecessary_option_map_or_else, cmp_owned) across bft-json-crdt and huskies-server. All fixed mechanically. Cargo.toml: dropped the no-longer-existing `rustls-tls` matrix-sdk feature, then chased through the 0.17 API breakage: - Relation::Reply is now a tuple variant wrapping Reply, not a struct variant with `in_reply_to` - UserIdentifier::UserIdOrLocalpart removed — use UserIdentifier::Matrix(MatrixUserIdentifier::new(..)) - SendMessageLikeEventResult no longer exposes event_id directly; it's now on the inner `response` field Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 14:48:49 +01:00
dave	9501412598	huskies: merge 1030	2026-05-14 13:29:59 +00:00
dave	f1c96595de	huskies: merge 1035	2026-05-14 13:17:38 +00:00
dave	c353c0a6be	huskies: merge 1033	2026-05-14 13:08:43 +00:00
dave	72d79deec9	huskies: merge 1026	2026-05-14 13:00:51 +00:00
dave	a80d0a497a	huskies: merge 1029	2026-05-14 12:53:01 +00:00
dave	4fad283814	huskies: merge 1027	2026-05-14 11:39:14 +00:00
dave	c64deca7c2	huskies: merge 1023	2026-05-14 11:24:05 +00:00
Timmy	8e996e2bd3	fix(1025): gate auto-block counter on mergemaster presence 1018's merge_failure_block_subscriber counted every MergeFailure transition toward the 3-strike block threshold, but mergemaster's recovery iterations (squash → fail → fix → retry) emit multiple MergeFailure transitions while making real progress. Story 997 was blocked at 10:59:46 while mergemaster was still resolving conflicts and would have succeeded a minute later. Fix: pass the AgentPool to the subscriber. When a mergemaster agent is in the pool for the story, MergeFailure transitions are recovery iterations in progress and do NOT increment the consecutive-failure counter. Block only fires for the genuinely-stuck case (no recovery agent attached and N consecutive failures accumulate). Tests: - mergemaster_running_suppresses_block: 3 failures with recovery_running=true → counter stays empty, story stays in MergeFailure - no_mergemaster_still_blocks_at_threshold: 3 failures with recovery_running=false → blocks (1018 behaviour preserved) All 2938 tests pass.	2026-05-14 12:13:37 +01:00
dave	c7a7cb4281	huskies: merge 997	2026-05-14 11:06:27 +00:00
Timmy	0572af2193	feat: outer cap on commit-recovery respawns catches flapping agents The progress-aware no-progress cap (3 consecutive byte-identical diffs) doesn't catch the degenerate pattern where the agent keeps making DIFFERENT file edits each session but never commits — every respawn resets the no-progress counter, infinite loop, budget burns. Adds ContentKey::CommitRecoveryTotalAttempts: an absolute counter that increments on every commit-recovery respawn regardless of progress. TOTAL_ATTEMPTS_CAP = 8; when hit, block with reason 'agent flapped — N respawns without ever committing'. Two caps now bound the recovery loop: - NO_PROGRESS_CAP (3): catches stuck-agent (same diff repeatedly) - TOTAL_ATTEMPTS_CAP (8): catches flapping-agent (different diffs, no commits) Easy to tune the constant lower if we see runaway in practice. All 2936 tests pass.	2026-05-14 11:34:17 +01:00
Timmy	bab337b289	feat: progress-aware commit-recovery cap (no longer block on 2nd attempt) The existing commit-recovery path blocked stories on the 2nd consecutive exit-without-commit. For long sweep refactors (e.g. story 997, the typed retries payload migration), claude-code's session-length boundary naturally terminates the coder mid-sweep before it can commit — even though substantial file-edit progress is being made each session. The old cap-of-1 misclassified normal mid-flight progress as 'agent declined to commit'. New behaviour: - Each commit-recovery respawn captures a worktree-diff byte-length fingerprint (git diff master | wc -c). - If the fingerprint differs from the previous attempt (the agent made file-edit progress), the no-progress counter resets to 1. - If the fingerprint is byte-identical (no new edits between exits), increment the no-progress counter. - Block only when the counter reaches NO_PROGRESS_CAP (3) — i.e. three consecutive respawns where the agent did literally nothing. Adds ContentKey::CommitRecoveryDiffFingerprint to store the prior fingerprint. Updates the existing block-test to reflect the new cap semantics; existing 'first respawn issued' test continues to pass. All 2935 tests pass.	2026-05-14 11:24:02 +01:00
Timmy	5e5c5a0e08	revert: remove temporary merge-reap diagnostic logging Reverts the diagnostic introduced in `91b4e4ff`. Will re-add when we actively debug the disappearance bug again.	2026-05-14 10:57:37 +01:00
Timmy	91b4e4ff7c	diag: log merge-reap values to debug disappearance bug Temporary diagnostic added to reap_stale_merge_jobs to surface the t, current_boot, and decoded values being compared on every reap pass. Will revert once the disappearance bug is understood.	2026-05-14 10:42:16 +01:00
dave	309542cf2c	huskies: merge 1018	2026-05-14 09:38:15 +00:00
Timmy	8b2ba1c810	fix: post-squash compile errors reclassify as semantic merge conflicts When deterministic-merge produces a clean git squash but the post-squash compile fails (typical when master gained a Stage payload field after the feature branch forked — e.g. story 1018 hit `error[E0063]: missing field plan` after 1010's PlanState landed), the failure is morally a merge conflict that git's diff3 missed: the conflicting literal lives in a different file from the type definition that changed on master. Routing it as GatesFailed left mergemaster idle and the story stuck. Changes: - gates.rs GateFailureKind::classify: detect rustc compile errors (`error[E\d+]`) as Build instead of falling through to Test. Clippy errors (`error[clippy::...]`) still classify as Lint. - agents/merge/mod.rs: new MergeResult::to_merge_failure_kind() method. GateFailure with failure_kind=Build maps to ConflictDetected (so the existing 998 subscriber auto-spawns mergemaster). Other gate failures stay GatesFailed. - agents/pool/pipeline/merge/runner.rs: replace the inline match with a call to the new method. Tests: 6 new unit tests covering the classifier branch and every to_merge_failure_kind arm. All 2932 tests pass.	2026-05-14 10:18:33 +01:00
dave	e3f5875b8e	huskies: merge 1019	2026-05-14 08:52:38 +00:00
dave	ebf58ef224	huskies: merge 1008	2026-05-14 08:46:16 +00:00
dave	761b6934f1	huskies: merge 1007	2026-05-14 08:41:44 +00:00
dave	13ab97a615	huskies: merge 1010	2026-05-14 08:12:56 +00:00
dave	4520e0e6f9	huskies: merge 995	2026-05-14 07:55:40 +00:00
dave	52180bc402	huskies: merge 1017	2026-05-13 23:55:35 +00:00
dave	29e800da21	huskies: merge 1016	2026-05-13 23:51:07 +00:00
dave	5ed1438ab9	huskies: merge 1015	2026-05-13 23:39:17 +00:00
dave	69b207872a	huskies: merge 1014	2026-05-13 23:25:10 +00:00
dave	4e007bb770	huskies: merge 1009	2026-05-13 22:55:05 +00:00
dave	a5cd3a2152	huskies: merge 994	2026-05-13 22:38:51 +00:00
dave	1ee23e7bfe	huskies: merge 996	2026-05-13 22:29:09 +00:00
dave	cd9021fedf	huskies: merge 1006	2026-05-13 21:41:39 +00:00
dave	eb48ef19e7	huskies: merge 1011	2026-05-13 21:32:11 +00:00
Timmy	2758f744f2	fix: reap_stale_merge_jobs re-dispatches instead of just deleting A mid-merge server restart used to silently kill the merge: the in-flight tokio task died with the process, reap_stale_merge_jobs ran on the new boot, saw the Running entry from the previous boot, and simply deleted it. Mergemaster polling `get_merge_status` then saw "Merge job disappeared", treated it as a strike, and after three restarts escalated the story to MergeFailureFinal — even though no real merge failure ever happened (this is what trapped story 998 during the bug 1001 iteration cycle). Reap now also fires a `WatcherEvent::WorkItem reassign` for the cleared story so the auto-assign watcher loop re-runs start_merge_agent_work on the fresh boot. The story is still in 4_merge/; the merge resumes automatically. The change is contained to the reap path — start_merge_agent_work's own behaviour is unchanged. Added regression test reap_stale_merge_jobs_emits_reassign_watcher_event that asserts the new event fires. Existing reap_stale_merge_jobs_removes_old_running_entry_without_merge still passes (the "without_merge" guarantee is about agent spawning, not about absence of watcher events). Also exposes AgentPool::watcher_tx() as pub(crate) so the merge runner can fan out re-dispatch events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 21:28:10 +01:00
dave	bbdee1239b	huskies: merge 998	2026-05-13 19:33:33 +00:00
Timmy	75dc1fc15a	feat: MergeFailureFinal → Coding via operator FixupRequested MergeFailureFinal was unreachable from move_story: the only transitions out were Freeze (→ Frozen) and a self-loop on MergemasterAttempted, so once mergemaster exhausted its 3-retry budget the only way to get a story coding again was to delete + recreate it. The respawn budget is a mergemaster bookkeeping detail, not a hard ceiling. A human operator inspecting a Final story can reasonably decide the gate failure is fixable, so this adds the same FixupRequested → Coding edge that already exists for plain MergeFailure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 20:21:48 +01:00
Timmy	b6898886d7	chore(1001): retire recover_half_written_items from MCP surface The recovery tool was a one-shot migration aid for the half-written items that existed before the Stage 1 allocator fix. The three live orphans (989/1000/1001) have been migrated; the Stage 1 fix prevents new half-writes; the tool's job is done. Removes the MCP wrapper, schema, dispatch case, and tools-list assertion. The db::recover module itself stays in-process (under `#[allow(dead_code)]`) so it can be re-exposed quickly if the bug ever resurfaces — its regression tests still run as part of the default suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 19:36:02 +01:00
Timmy	92b1744c3a	feat(1001): story_ids filter for recover_half_written_items The first dry-run against the live pipeline surfaced 735 orphans (35 tombstoned half-writes, 700 stale content rows with no CRDT entry — mostly artefacts of the pre-numeric-id era). Bulk-recovering would resurrect a lot of stories the user deliberately purged in the past. Add an optional `story_ids` filter that restricts both discovery (in dry-run) and recovery to a named subset, so the operator can target the specific recent half-writes without touching anything else. The new test asserts the filter is honoured. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 19:26:07 +01:00
Timmy	cd411ba443	feat(1001): recover_half_written_items MCP tool Adds db::recover, a discovery + recovery layer for pipeline items that got half-written before the Stage 1 fix landed (content in content store + SQLite shadow, no live CRDT entry). For each orphan, the content body is re-anchored to a fresh non-tombstoned id and the old id's content row is cleared. Exposed as the recover_half_written_items MCP tool. dry_run defaults to true so the caller can review what would change before mutating. YAML front-matter parsing is hand-rolled and scoped to the three fields the create_*_file path emits (name, type, depends_on). It tolerates missing or malformed lines by falling back to safe defaults; the orphan is recovered with the best metadata we can pull from the body and the rest is left to the operator to fix up. The discovery step is read-only and idempotent. Recovery is also idempotent in the sense that once an orphan is lifted, the next discovery pass won't see it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 19:16:05 +01:00
Timmy	c61f715878	fix(1001): stop create_* from half-writing onto tombstoned IDs Root cause: db::next_item_number scanned the visible CRDT index and the content store but not the tombstone set, so it would hand out a numeric ID whose CRDT entry had been tombstoned. crdt_state::write_item then silently no-op'd the insert (tombstone-match guard) while the content store and SQLite shadow happily accepted the row, producing a split-brain half-write that was invisible to every CRDT-driven read path and couldn't be cleaned up by delete_story / purge_story. This change closes the loop: - crdt_state::read::{is_tombstoned, tombstoned_ids} expose the tombstone set so callers outside crdt_state can consult it. - db::next_item_number now scans tombstoned_ids() too. The allocator skips past tombstoned numeric IDs instead of treating their slots as free. - write_item logs a WARN when it rejects a write for a tombstoned ID (was silent). The warn is a tripwire — if the allocator ever lets one slip through again we'll see it in the log. - create_item_in_backlog adds two defence-in-depth checks: (a) before any write, reject if the allocator returned a tombstoned ID; (b) after the writes, call read_item to confirm the CRDT entry materialised. If not, roll back the content-store + shadow-DB rows via db::delete_item and return Err. Regression tests cover the allocator skip, the is_tombstoned accessor, and the create_item_in_backlog rollback path. Out of scope for this commit: - Recovery of the already-half-written items currently in the running pipeline (989, 1000, 1001) — Stage 2/3 of the plan, handled separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 19:05:48 +01:00
dave	caed894db9	huskies: merge 988	2026-05-13 17:28:52 +00:00
dave	a078d3df7c	huskies: merge 985	2026-05-13 16:52:19 +00:00
dave	580480094e	huskies: merge 984	2026-05-13 16:47:51 +00:00
dave	c3c9db3d8b	huskies: merge 987	2026-05-13 16:30:31 +00:00
dave	430079ecbc	huskies: merge 986	2026-05-13 16:01:51 +00:00
dave	91fbad568a	huskies: merge 982	2026-05-13 15:34:41 +00:00
dave	f268dca5bb	huskies: merge 977	2026-05-13 15:11:37 +00:00
dave	dcb43c465a	huskies: merge 964	2026-05-13 14:56:08 +00:00
Timmy	c811672e18	huskies: progress 983 — differentiated icons for stuck-story states Distinct icons in StagePanel/GatewayPanel/render.rs status output for blocked-with-running-recovery (robot), blocked-with-queued-recovery (hourglass), and blocked-cold (red circle). All 2822 tests pass.	2026-05-13 15:46:36 +01:00
dave	14a39b6205	huskies: merge 980	2026-05-13 14:44:17 +00:00
Timmy	246f44d8f3	fix: widen keepalive test timeout to eliminate CI flake keepalive_connection_survives_with_pong_responses set ping_ms=100, timeout_ms=250, so the server's pong-deadline fired ~560ms after the first ping — only ~60ms past the end of the test's 500ms await window. Under CI scheduler jitter that 60ms slack was insufficient and the server timer fired inside the test window, closing the connection mid-await and producing a flake. Bump timeout_ms to 2000ms so the pong-deadline cannot fire within the test window under any realistic jitter. ping_ms stays at 100ms so the test still exercises multiple ping/pong rounds in the same wall-clock budget. Test still passes locally; was hitting 964's merge gate as a flake.	2026-05-13 15:41:25 +01:00
dave	e5d2465f66	huskies: merge 974	2026-05-13 14:26:42 +00:00
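
The two commit-recovery caps described in `bab337b289` and `0572af2193` can be sketched as below. This is a minimal illustration, not the huskies implementation: `RecoveryState`, `Decision`, and `on_exit_without_commit` are hypothetical names, the state is held in memory rather than under `ContentKey` entries, and the order in which the two caps are checked is an assumption. Only the constants `NO_PROGRESS_CAP` (3) and `TOTAL_ATTEMPTS_CAP` (8) and the reset-to-1 rule come from the commit messages.

```rust
// Sketch of the two-cap commit-recovery loop (hypothetical types and names).

const NO_PROGRESS_CAP: u32 = 3; // stuck agent: same diff fingerprint repeatedly
const TOTAL_ATTEMPTS_CAP: u32 = 8; // flapping agent: different diffs, no commits

/// Hypothetical per-story state; the commits persist the equivalent values
/// under ContentKey::CommitRecoveryDiffFingerprint and
/// ContentKey::CommitRecoveryTotalAttempts.
#[derive(Debug, Default)]
struct RecoveryState {
    last_fingerprint: Option<u64>,
    no_progress: u32,
    total_attempts: u32,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Respawn,
    Block(String),
}

/// Called each time the agent exits without committing. `fingerprint` stands
/// in for the byte length of the worktree diff against master.
fn on_exit_without_commit(state: &mut RecoveryState, fingerprint: u64) -> Decision {
    // The outer counter increments on every respawn, regardless of progress.
    state.total_attempts += 1;

    if state.last_fingerprint == Some(fingerprint) {
        // Same fingerprint as the previous exit: no new edits.
        state.no_progress += 1;
    } else {
        // First attempt, or the fingerprint changed: reset to 1.
        state.no_progress = 1;
    }
    state.last_fingerprint = Some(fingerprint);

    // Assumption: the no-progress cap is checked before the outer cap.
    if state.no_progress >= NO_PROGRESS_CAP {
        return Decision::Block(format!(
            "no file-edit progress across {} consecutive respawns",
            state.no_progress
        ));
    }
    if state.total_attempts >= TOTAL_ATTEMPTS_CAP {
        return Decision::Block(format!(
            "agent flapped: {} respawns without ever committing",
            state.total_attempts
        ));
    }
    Decision::Respawn
}
```

Under these assumptions, three exits with the same fingerprint block on the third call, while eight exits with eight different fingerprints block on the eighth.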

980 Commits