Commit Graph

162 Commits

Author SHA1 Message Date
dave 916dc2b11d huskies: merge 910 2026-05-12 16:02:49 +00:00
Timmy d04facd24f style: cargo fmt on pty/mod.rs (916 landed with a manually line-broken string literal)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:41:58 +01:00
Timmy 38df9c78af test(916): use far-future reset_at in inactivity-extension regression test to avoid spawn-time race
The original 90b31fc8 test computed reset_at = now + 3s in the test thread,
then relied on the script spawning fast enough that the rate_limit_event
arrived while reset_at was still meaningfully in the future. Under
cargo-test load the spawn could take long enough that block_until - now
clamped to 0 and the inactivity timeout killed the script before its sleep
finished. Pin reset_at to 2099-01-01 (matching the existing
rate_limit_hard_block_sends_watcher_hard_block_event test) so the
extension is essentially infinite and the assertion isolates the
extension-vs-no-extension behavior from wall-clock slack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:36:24 +01:00
dave a34c9796b5 huskies: merge 913 2026-05-12 15:30:23 +00:00
Timmy 90b31fc84f fix(916): rate-limit hard block extends inactivity deadline so the watchdog doesn't kill mid-wait
When claude-code emits a rate_limit_event with status != allowed_warning,
the subprocess waits internally for the limit to clear before retrying. No
PTY output flows during that window, so the inactivity timeout in the PTY
runner would fire and kill the agent — mergemaster especially, whose
15-minute inactivity window is shorter than typical rate-limit backoffs.

Track `block_until = Some(reset_at)` on hard-block events and add the
remaining time-until-reset to the per-iteration recv timeout. Once reset_at
passes (or an earlier emit arrives), the extension implicitly drops to 0
and the base inactivity timeout resumes. Turn/budget counts aren't affected
— they come from the session log and only advance when API calls actually
complete, so a stalled retry doesn't burn either.

Regression test in agents/pty/mod.rs spawns a script that emits a hard-block
with reset_at = now+3s, sleeps 3s, then exits, with inactivity_timeout_secs
= 1. Without the fix the runner kills the script at 1s; with the fix the
deadline is bumped past the sleep and the run completes cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:22:21 +01:00
dave 2c5326f339 huskies: merge 890 2026-05-12 14:48:52 +00:00
Timmy 98d496b1ad fix(901): unblock_story works on CRDT-only stories post-865
Bug 901: `unblock_story` (and the chat `unblock` command) routed through
`parse_front_matter` and errored with "Missing front matter" on any
post-865 story (story content is now CRDT-only with no YAML on disk).

In `chat/commands/unblock.rs::unblock_by_story_id`:
  - Drop the early `parse_front_matter` gate.
  - Read story name and blocked state from the CRDT register API instead
    of parsed YAML (`crdt_state::read_item`, `pipeline_state::read_typed`).
  - Keep the legacy fallback cleanup, but gate it on the content actually
    starting with a `---` YAML block, so CRDT-only stories don't hit a
    parse error there either.
  - Remove the now-unused `parse_front_matter` import.

Surfaced a second sub-bug: even when the state-machine transition
fired (`Blocked + Unblock → Coding`), the CRDT `blocked` register was
never explicitly cleared. Pre-865 the YAML-strip content_transform
cleared it as a side effect; post-865 there is no YAML to strip.

  - Add `crdt_state::set_blocked(story_id, bool)` parallel to
    `set_retry_count`. Wired through `crdt_state::write` and the
    crate-level re-export.
  - `agents::lifecycle::transition_to_unblocked` now calls
    `set_blocked(story_id, false)` alongside `set_retry_count(0)` so
    the legacy register stays in sync with the typed stage.

Test: `unblock_command_works_on_crdt_only_story_no_yaml` seeds a CRDT
entry with no YAML on disk, runs unblock, asserts success + cleared
blocked + retry_count=0. All 10 existing unblock tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:13:01 +01:00
dave 9be438e6d3 huskies: merge 865 2026-05-08 14:29:06 +00:00
dave 61cf7684de huskies: merge 864 2026-04-30 22:27:51 +00:00
dave 3911c24c26 test: drop opus-pin regression test that conflicts with 864's signature change
864 changes write_item_with_content to take 4 args (ItemMeta), but the
master regression test calls the 3-arg form. After 864 squash-merges,
the merged code has the 4-arg fn AND the 3-arg call site, breaking
compile in the merge worktree.

Drop the test for now (the actual run on 864 today validated the fix
end-to-end). Re-add it in a follow-up after 864 lands, using the new
signature.
2026-04-30 22:23:16 +00:00
dave 1251b869a6 style: cargo fmt on today's new code (883/884/886/opus-pin)
The mergemaster gates run rustfmt and rejected 864's merge because
several files I added/touched in master today had not been fmt'd.
Six files affected, mostly trivial line-wrapping nits. Fixes the
formatting gate for the next 864 merge attempt.
2026-04-30 22:15:37 +00:00
dave 66f340a7a3 fix: prune session_store on stdio abort, respawn cold
The bug 882 abort-respawn safeguard caps consecutive crashes at 5 then
blocks the story — but the underlying stdio abort itself stays unfixed:
each respawn calls start_agent which reads session_store.json, finds the
prior session id, passes --resume to claude-code, and re-triggers the
same crash. Five identical respawns later, the story is blocked.

Now: when an abort+no-session exit triggers respawn, we first call
session_store::remove_sessions_for_story to drop every entry for the
story. The next spawn starts cold (no --resume), which avoids the
bloated stdio replay claude-code is choking on.

The function was already implemented but #[cfg(test)] only — promoted
to a non-test pub fn. Existing remove_sessions_for_story_cleans_up test
unchanged and still green.

Net effect: instead of "5 retries, then blocked", we get "1 abort, prune,
respawn cold, agent runs normally". The story can resume work without
losing its worktree state.
2026-04-30 18:19:01 +00:00
dave a8eac3c278 fix: read agent pin from CRDT register, not just YAML front matter
After story 871 the `agent` pin lives in the typed CRDT register
(`PipelineItemView.agent`), not the YAML front matter — the YAML
mutation was removed at the same time. Both spawn-resolution paths
(`auto_assign::story_checks::read_story_front_matter_agent` and
`start::validation::read_front_matter_agent`) still read only YAML
via parse_front_matter, which returns None for any story whose pin
was set via the post-871 typed setter. The spawn then falls back to
"first available coder," silently downgrading opus-pinned stories to
the first available sonnet — which is why 855/864/866 kept hitting the
80-turn watchdog limit despite the user's explicit opus pin.

Now: both paths consult `crdt_state::read_item()` first and use
`view.agent` if non-empty. YAML parsing remains as a fallback so older
stories whose CRDT entry doesn't yet have the field still resolve.

Adds a regression test that seeds an item with empty YAML, sets the
typed CRDT register via `set_agent`, and asserts
`read_story_front_matter_agent` returns the CRDT value.
2026-04-30 16:36:18 +00:00
dave b0de86767a huskies: merge 882 2026-04-30 00:35:35 +00:00
dave 1d86202abb huskies: merge 868 2026-04-29 23:34:24 +00:00
dave e02e566648 huskies: merge 881_bug_inject_prior_gate_failure_output_into_retry_agent_s_system_prompt 2026-04-29 22:52:55 +00:00
dave 9a3f60d5d3 huskies: merge 866 2026-04-29 22:47:53 +00:00
dave a49f668b5a huskies: merge 867 2026-04-29 22:17:08 +00:00
dave 7e2f122d36 huskies: merge 880 2026-04-29 21:46:12 +00:00
dave 4d24b5b661 huskies: merge 855 2026-04-29 21:41:03 +00:00
dave a7b1572693 huskies: merge 856 2026-04-29 21:34:58 +00:00
dave fc86774618 huskies: merge 857 2026-04-29 17:45:51 +00:00
dave 8a7e1aa036 huskies: merge 873 2026-04-29 16:11:34 +00:00
dave 2655288412 huskies: merge 870 2026-04-29 15:26:57 +00:00
dave f3e4d5d072 huskies: merge 869 2026-04-29 14:58:11 +00:00
dave edeed3d1b6 huskies: merge 861 2026-04-29 11:12:20 +00:00
dave 19a2ffde96 huskies: merge 860 2026-04-29 10:53:39 +00:00
dave 11d111360d huskies: merge 858 2026-04-29 10:47:18 +00:00
dave 0403dc9871 huskies: merge 833 2026-04-29 09:55:09 +00:00
dave 4ed1fb5110 huskies: merge 854 2026-04-29 09:29:32 +00:00
dave dcd695ad0e huskies: merge 852 2026-04-29 08:55:49 +00:00
dave 549a9defc4 huskies: merge 851 2026-04-29 08:42:28 +00:00
dave 89bf4ae0cf huskies: merge 831 2026-04-29 00:16:18 +00:00
dave 6092f7efbb huskies: merge 822 2026-04-28 23:12:25 +00:00
dave 2a77f73ba4 fix(merge): use server-start-time, not pid, for stale-merge detection
The merge_jobs cleanup encoded the server's pid in the CRDT and checked
`kill(pid, 0)` to decide whether a "running" entry was stale. Two problems:

  1. The cleanup runs *inside* the server, so checking whether the
     server's own pid is alive is tautological — kill(self_pid, 0)
     always succeeds.
  2. `rebuild_and_restart` does an `execve()` re-exec, which keeps the
     same pid. After re-exec, merge_jobs from the previous server
     instance still encode "the current pid" — so the cleanup never
     fires, and stories like 799/800 sit forever with status="running"
     while no actual merge runs.

Switch to a per-process server-start-time captured lazily in a
`OnceLock<f64>` (reset by execve, so the new instance sees a fresh
boot-time). A merge_job's recorded start-time < current boot-time means
it came from a previous instance: stale, delete it.

Legacy pid-encoded entries decode to None and are also treated as stale.

MergeJob.pid → MergeJob.server_start_time. Tests updated.
2026-04-28 20:41:32 +00:00
dave f5ab75ecaa huskies: merge 819 2026-04-28 20:28:35 +00:00
dave b060d8fc88 fix(pty): always pass -p on resume so --include-partial-messages works
claude CLI 2.1.97 strictly enforces that --include-partial-messages
requires --print/-p to be set. The resume path skipped -p when the
prompt was empty (which is the common case on respawns when there's
no fresh failure context to inject), so the spawned claude process
saw `--resume <sid> ... --include-partial-messages` without -p and
exited with code 1: "include-partial-messages requires --print and
--output-format=stream-json".

Net effect: every coder respawn with prior_sessions > 0 and empty
prompt was failing immediately, looking exactly like a rate-limit
(empty agent log, zero tool calls). 819 hit retry-limit (4/3) and
got marked blocked because of this — not because of any actual code
or rate-limit issue.

Fix: always pass `-p <prompt>` on resume, even with empty prompt.
2026-04-28 20:14:32 +00:00
dave e4af2d5c08 huskies: merge 803 2026-04-28 19:10:41 +00:00
dave 619bdd9c82 huskies: merge 801 2026-04-28 16:43:04 +00:00
dave f62012ee9c huskies: merge 793 2026-04-28 15:21:51 +00:00
dave 7cd9706c0f huskies: merge 813 2026-04-28 14:22:19 +00:00
dave 8f23d13ac8 huskies: merge 779 2026-04-28 13:48:40 +00:00
dave 36ca8d5e3b huskies: merge 827 2026-04-28 13:01:48 +00:00
dave 6c2bdde695 huskies: merge 783 2026-04-28 11:17:40 +00:00
dave 7faacb6664 huskies: merge 773 2026-04-28 10:24:04 +00:00
dave 70aaffc2ab huskies: merge 777 2026-04-28 00:33:14 +00:00
dave 63ce7b9ec3 huskies: merge 759 2026-04-28 00:07:04 +00:00
dave 7ee542dd1e huskies: merge 757 2026-04-27 23:36:56 +00:00
dave 1388658ae8 huskies: merge 730_story_use_numeric_only_story_ids_across_mcp_worktrees_git_branches_and_log_paths 2026-04-27 20:22:47 +00:00
dave 615e1c7f73 huskies: merge 738_refactor_delete_fs_shadow_code_from_lifecycle_rs_and_the_work_directory_watcher 2026-04-27 19:56:53 +00:00