A mid-merge server restart used to silently kill the merge: the
in-flight tokio task died with the process, reap_stale_merge_jobs ran
on the new boot, saw the Running entry from the previous boot, and
simply deleted it. Mergemaster polling `get_merge_status` then saw
"Merge job disappeared", treated it as a strike, and after three
restarts escalated the story to MergeFailureFinal — even though no
real merge failure ever happened (this is what trapped story 998
during the bug 1001 iteration cycle).
Reap now also fires a `WatcherEvent::WorkItem reassign` for the
cleared story so the auto-assign watcher loop re-runs
start_merge_agent_work on the fresh boot. The story is still in
4_merge/; the merge resumes automatically. The change is contained to
the reap path — start_merge_agent_work's own behaviour is unchanged.
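A minimal sketch of the new reap branch, with stand-in types for the in-tree MergeJob and WatcherEvent shapes (only the shape of the change is meant to be accurate):

```rust
// Sketch only: MergeJob, WatcherEvent, and the id types are stand-ins for
// the real definitions.
use std::collections::HashMap;
use tokio::sync::mpsc::UnboundedSender;

struct MergeJob {
    boot_id: u64,
}

enum WatcherEvent {
    WorkItemReassign { story_id: u64 },
}

fn reap_stale_merge_jobs(
    jobs: &mut HashMap<u64, MergeJob>,
    current_boot: u64,
    watcher_tx: &UnboundedSender<WatcherEvent>,
) {
    jobs.retain(|story_id, job| {
        if job.boot_id == current_boot {
            return true; // started on this boot, leave it running
        }
        // Stale Running entry from a previous boot: clear it, but also tell
        // the auto-assign watcher loop so start_merge_agent_work re-runs and
        // the merge resumes instead of silently disappearing.
        let _ = watcher_tx.send(WatcherEvent::WorkItemReassign {
            story_id: *story_id,
        });
        false
    });
}
```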
Added regression test
reap_stale_merge_jobs_emits_reassign_watcher_event that asserts the
new event fires. Existing
reap_stale_merge_jobs_removes_old_running_entry_without_merge still
passes (the "without_merge" guarantee is about agent spawning, not
about absence of watcher events).
Also exposes AgentPool::watcher_tx() as pub(crate) so the merge
runner can fan out re-dispatch events.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MergeFailureFinal was a dead end in move_story: the only transitions
out were Freeze (→ Frozen) and a self-loop on MergemasterAttempted, so
once mergemaster exhausted its 3-retry budget the only way to get a
story coding again was to delete + recreate it.
The respawn budget is a mergemaster bookkeeping detail, not a hard
ceiling. A human operator inspecting a Final story can reasonably
decide the gate failure is fixable, so this adds the same
FixupRequested → Coding edge that already exists for plain
MergeFailure.
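Roughly the shape of the edge, assuming move_story checks transitions against a (state, request) match; the state and request names echo the ones above, everything else is illustrative:

```rust
// Sketch of the transition table; only the states and requests named in this
// commit are shown, and the real move_story carries more context than this.
#[derive(Clone, Copy, PartialEq, Debug)]
enum StoryState {
    Coding,
    Frozen,
    MergeFailure,
    MergeFailureFinal,
}

#[derive(Clone, Copy)]
enum Request {
    FixupRequested,
    Freeze,
    MergemasterAttempted,
}

fn next_state(current: StoryState, request: Request) -> Option<StoryState> {
    use Request::*;
    use StoryState::*;
    match (current, request) {
        (MergeFailure, FixupRequested) => Some(Coding), // pre-existing edge
        (MergeFailureFinal, Freeze) => Some(Frozen),
        (MergeFailureFinal, MergemasterAttempted) => Some(MergeFailureFinal), // self-loop
        // New: a human can pull a Final story back into coding after deciding
        // the gate failure is fixable.
        (MergeFailureFinal, FixupRequested) => Some(Coding),
        _ => None,
    }
}
```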
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recovery tool was a one-shot migration aid for the half-written
items that existed before the Stage 1 allocator fix. The three live
orphans (989/1000/1001) have been migrated; the Stage 1 fix prevents
new half-writes; the tool's job is done.
Removes the MCP wrapper, schema, dispatch case, and tools-list
assertion. The db::recover module itself stays in-process (under
`#[allow(dead_code)]`) so it can be re-exposed quickly if the bug
ever resurfaces — its regression tests still run as part of the
default suite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first dry-run against the live pipeline surfaced 735 orphans (35
tombstoned half-writes, 700 stale content rows with no CRDT entry —
mostly artefacts of the pre-numeric-id era). Bulk-recovering would
resurrect a lot of stories the user deliberately purged in the past.
Add an optional `story_ids` filter that restricts both discovery (in
dry-run) and recovery to a named subset, so the operator can target
the specific recent half-writes without touching anything else. The
new test asserts the filter is honoured.
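A sketch of where the filter bites; Orphan and the function shape are simplified stand-ins for db::recover:

```rust
// Sketch only: the real discovery walks the content store and CRDT index;
// this just shows how the optional story_ids filter narrows the result.
struct Orphan {
    story_id: u64,
}

fn filter_orphans(found: Vec<Orphan>, story_ids: Option<&[u64]>) -> Vec<Orphan> {
    match story_ids {
        // No filter: previous behaviour, every orphan is reported/recovered.
        None => found,
        // Filter given: both the dry-run report and the recovery pass see
        // only the named subset, so deliberately purged stories stay purged.
        Some(ids) => found
            .into_iter()
            .filter(|o| ids.contains(&o.story_id))
            .collect(),
    }
}
```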
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds db::recover, a discovery + recovery layer for pipeline items that
got half-written before the Stage 1 fix landed (content in content
store + SQLite shadow, no live CRDT entry). For each orphan, the
content body is re-anchored to a fresh non-tombstoned id and the old
id's content row is cleared.
Exposed as the recover_half_written_items MCP tool. dry_run defaults
to true so the caller can review what would change before mutating.
YAML front-matter parsing is hand-rolled and scoped to the three
fields the create_*_file path emits (name, type, depends_on). It
tolerates missing or malformed lines by falling back to safe
defaults; the orphan is recovered with the best metadata we can pull
from the body and the rest is left to the operator to fix up.
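A sketch of that parsing, assuming the body opens with a `---` delimiter and plain `key: value` lines; the three field names come from the create_*_file path, the rest is assumption:

```rust
// Sketch of the hand-rolled scan: anything missing or unparseable falls back
// to defaults rather than failing the recovery.
#[derive(Default)]
struct RecoveredMeta {
    name: Option<String>,
    item_type: Option<String>, // "type" in the front matter
    depends_on: Vec<String>,
}

fn parse_front_matter(body: &str) -> RecoveredMeta {
    let mut meta = RecoveredMeta::default();
    let mut lines = body.lines();
    if lines.next().map(str::trim) != Some("---") {
        return meta; // no front matter at all: recover with defaults
    }
    for line in lines.take_while(|l| l.trim() != "---") {
        let Some((key, value)) = line.split_once(':') else {
            continue; // malformed line: tolerate and move on
        };
        let value = value.trim();
        match key.trim() {
            "name" => meta.name = Some(value.to_string()),
            "type" => meta.item_type = Some(value.to_string()),
            "depends_on" => {
                meta.depends_on = value
                    .trim_start_matches('[')
                    .trim_end_matches(']')
                    .split(',')
                    .map(|s| s.trim().to_string())
                    .filter(|s| !s.is_empty())
                    .collect();
            }
            _ => {} // unknown keys are ignored rather than treated as errors
        }
    }
    meta
}
```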
The discovery step is read-only and idempotent. Recovery is also
idempotent in the sense that once an orphan is lifted, the next
discovery pass won't see it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: db::next_item_number scanned the visible CRDT index and the
content store but not the tombstone set, so it would hand out a numeric
ID whose CRDT entry had been tombstoned. crdt_state::write_item then
silently no-op'd the insert (tombstone-match guard) while the content
store and SQLite shadow happily accepted the row, producing a
split-brain half-write that was invisible to every CRDT-driven read
path and
couldn't be cleaned up by delete_story / purge_story.
This change closes the loop:
- crdt_state::read::{is_tombstoned, tombstoned_ids} expose the
tombstone set so callers outside crdt_state can consult it.
- db::next_item_number now scans tombstoned_ids() too. The allocator
skips past tombstoned numeric IDs instead of treating their slots as
free.
- write_item logs a WARN when it rejects a write for a tombstoned ID
(was silent). The warn is a tripwire — if the allocator ever lets one
slip through again we'll see it in the log.
- create_item_in_backlog adds two defence-in-depth checks:
(a) before any write, reject if the allocator returned a
tombstoned ID;
(b) after the writes, call read_item to confirm the CRDT entry
materialised. If not, roll back the content-store + shadow-DB
rows via db::delete_item and return Err.
Regression tests cover the allocator skip, the is_tombstoned accessor,
and the create_item_in_backlog rollback path.
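A minimal sketch of the allocator skip described above, with set types standing in for the real index scans:

```rust
// Sketch only: the three inputs stand in for the visible CRDT index, the
// content store, and the newly exposed crdt_state::read::tombstoned_ids().
use std::collections::BTreeSet;

fn next_item_number(
    visible: &BTreeSet<u64>,
    content: &BTreeSet<u64>,
    tombstoned: &BTreeSet<u64>,
) -> u64 {
    let mut candidate = 1;
    // A tombstoned slot is no longer treated as free: handing one out again
    // is exactly what produced the silent half-writes.
    while visible.contains(&candidate)
        || content.contains(&candidate)
        || tombstoned.contains(&candidate)
    {
        candidate += 1;
    }
    candidate
}
```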
Out of scope for this commit:
- Recovery of the already-half-written items currently in the running
pipeline (989, 1000, 1001) — Stage 2/3 of the plan, handled
separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Distinct icons in StagePanel/GatewayPanel/render.rs status output for
blocked-with-running-recovery (robot), blocked-with-queued-recovery (hourglass),
and blocked-cold (red circle). All 2822 tests pass.
keepalive_connection_survives_with_pong_responses set ping_ms=100,
timeout_ms=250, so the server's pong-deadline fired ~560ms after the
first ping — only ~60ms past the end of the test's 500ms await window.
Under CI scheduler jitter that 60ms slack was insufficient and the
server timer fired inside the test window, closing the connection
mid-await and producing a flake.
Bump timeout_ms to 2000ms so the pong-deadline cannot fire within
the test window under any realistic jitter. ping_ms stays at 100ms
so the test still exercises multiple ping/pong rounds in the same
wall-clock budget.
Test still passes locally; the flake was tripping 964's merge gate.
The function was calling `read_content(story_id)`, which returns the
story's *description* text (e.g. "Bug: Coder exits code 0 with
uncommitted work — force a commit-only respawn..."). It then scanned
that for "Merge conflict" / "CONFLICT (content):", which obviously
never matched, so the auto-spawn-mergemaster-on-content-conflict guard
in `pool/auto_assign/merge.rs` always saw `false` and skipped.
The actual gate output (where the merge runner stores the failure
message including conflict markers) lives at
`format!("{story_id}:gate_output")` — that's the key
`pipeline/advance/mod.rs:207` writes to. Read from there instead.
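Roughly the change to the guard; the key format and conflict markers are from this commit, and ContentStore is a stub standing in for the real store:

```rust
// Sketch of the guard fix.
struct ContentStore;

impl ContentStore {
    fn read_content(&self, _key: &str) -> Option<String> {
        None // stub: the real store reads the content row for the key
    }
}

fn merge_hit_content_conflict(store: &ContentStore, story_id: &str) -> bool {
    // BEFORE: read_content(story_id) returned the story *description*, which
    // never contains git conflict markers, so the guard was always false.
    // AFTER: read the gate output the merge runner actually wrote.
    let key = format!("{story_id}:gate_output");
    match store.read_content(&key) {
        Some(gate_output) => {
            gate_output.contains("Merge conflict")
                || gate_output.contains("CONFLICT (content):")
        }
        None => false,
    }
}
```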
Witnessed: 954's merge hit a real `CONFLICT (content)` in
tests_regression.rs at 08:57:40, no mergemaster spawned, story stayed
in MergeFailure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5-line spread of `.unwrap_or_else(|| { ... })` in spawn.rs (from
the bd517f28 + 65416476 warm-resume work) doesn't match rustfmt's
preference for the short form. Was blocking every merge gate since
the warm-resume fix landed.
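For reference, the shape involved (function and variable names are made up; the real expression lives in spawn.rs):

```rust
// Illustrative only.
fn default_resume_prompt() -> String {
    "Continue from PLAN.md.".to_string()
}

fn build_prompt(resume_prompt: Option<String>) -> String {
    // Previously spread over five lines as
    //     .unwrap_or_else(|| {
    //         default_resume_prompt()
    //     })
    // which the merge gate's rustfmt check rejected; the short form below is
    // what it wants for a single-expression closure.
    resume_prompt.unwrap_or_else(|| default_resume_prompt())
}
```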
Follow-up to bd517f28. When --resume succeeds, claude-code restores the
full prior conversation — the agent already has its file reads, tool
results, and reasoning in context. Telling it to "read PLAN.md" forces
a redundant tool call to re-read a doc it wrote itself. PLAN.md is the
cold-start orientation doc (driven by AGENT.md); the resume -p prompt
should just be a continuation nudge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default the resume -p prompt so watchdog respawns can actually go warm.
claude-code's --resume <session_id> requires either:
a) a deferred-tool marker in the resumed session (i.e. the prior
session paused mid-tool-call), or
b) a non-empty -p prompt to continue the conversation with.
Watchdog-killed sessions have neither: the kill is asynchronous and
leaves no deferred-tool marker, and our harness was passing an empty
-p (because `resume_context_owned` is None for the common respawn
case). claude-code then aborts with:
"Error: No deferred tool marker found in the resumed session.
Either the session was not deferred, the marker is stale (tool
already ran), or it exceeds the tail-scan window. Provide a
prompt to continue the conversation."
The harness sees an aborted CLI with no session, prunes the recorded
session_id, and respawns cold — paying the full prompt-cache miss for
EVERY respawn. The new session_store logging (commit 0b50a624) made
this 100% legible: every warm spawn we observed went `mode=warm` →
crash → prune → `mode=cold` within a couple of seconds.
Fix: when resuming with no failure-context to send, default the -p
prompt to a brief "continue from PLAN.md" line. claude-code now has a
valid continuation message and warm-resume should actually work.
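The shape of the fix; `resume_context_owned` and the CLI flags are from this commit, the surrounding arg-building is illustrative:

```rust
// Sketch only: the real spawn path builds many more args; this shows the
// prompt default that keeps --resume from aborting on an empty -p.
fn resume_args(session_id: &str, resume_context_owned: Option<String>) -> Vec<String> {
    let prompt = resume_context_owned.unwrap_or_else(|| {
        // No failure context to relay (the common watchdog-respawn case):
        // hand claude-code a brief continuation nudge instead of an empty -p.
        "Continue from PLAN.md.".to_string()
    });
    vec![
        "--resume".to_string(),
        session_id.to_string(),
        "-p".to_string(),
        prompt,
    ]
}
```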
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Helps explain WHY each spawn goes warm vs cold. The existing
`spawn mode=warm|cold` log only shows the outcome at the spawn point —
to count where warmth is being lost, we need to see:
- when a session_id is recorded (and for which key),
- what every lookup returns (key + Some/None),
- when remove_sessions_for_story prunes (which is currently the only
explicit cold-induction path beyond "first ever spawn").
After this lands, a grep of "session_store" in the logs gives the full
warm-resume health picture: which (story,agent,model) triples have a
recorded session, which lookups are hitting it, and which prunes are
costing us a warm respawn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without this, the only way to tell whether a watchdog-respawn went warm
(--resume <session_id>) vs cold (fresh CLI invocation) was to read the
args list of the existing "Spawning claude with args:" log and check
whether --resume was present. That made it impossible to count
cold-paths or distinguish "supposed-to-be-warm but resume_failed
fallback" from "first session" without source-diving.
This adds one slog! per spawn, prefixed `[agent:{sid}:{name}] spawn
mode=warm|cold session_id=...`, so grep "spawn mode=" answers it.
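Roughly the added line; tracing::info! stands in for the in-tree slog! call, and the mode/session derivation is illustrative:

```rust
// Sketch of the per-spawn log; the real call uses whatever session
// bookkeeping the spawner already holds for this agent.
fn log_spawn_mode(sid: u64, name: &str, resume_session: Option<&str>) {
    let (mode, session) = match resume_session {
        Some(id) => ("warm", id),
        None => ("cold", "none"),
    };
    // One line per spawn, so `grep "spawn mode="` counts warm vs cold paths
    // without reading the full args list.
    tracing::info!(
        "[agent:{}:{}] spawn mode={} session_id={}",
        sid,
        name,
        mode,
        session
    );
}
```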
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>