huskies

Author	SHA1	Message	Date
Timmy	fe9804b32c	feat: add process_kill module + use it to fix watchdog double-spawn Adds `crate::process_kill`: reliable SIGKILL-with-verify primitives used across the server in place of the various ad-hoc kill paths that ignored their kill-effective return values. The module exposes three pieces: - `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s) until every pid is verified gone. Returns survivors if not. - `pids_matching(pattern)`: pgrep -f wrapper. - `descendant_pids(root)`: recursive pgrep -P walker for process trees. Wires the watchdog's limit-termination path through it, and reorders the protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15): Before: check_agent_limits set status=Failed before the kill ran. The kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP on Unix; claude-code ignores SIGHUP, so the process kept running while the agent record was already marked terminated. The idempotency check in `start_agent` whitelists Running/Pending, so the next auto-assign pass spawned a fresh agent alongside the still-alive prior one. Two claude PIDs sharing one session_id, racing on the same worktree. After: status update is moved OUT of check_agent_limits and into the caller AFTER the kill is verified. The kill itself is now SIGKILL-the-process-tree-in-the-worktree, with explicit verification that every pid is gone. The idempotency window is closed. The existing watchdog test suite (14 tests) still passes; 7 new tests cover the process_kill primitives directly. `agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key` still use the old portable_pty SIGHUP path; they have the same bug but in lower-impact code paths (shutdown, operator stop). They will be migrated under a separate story to keep this commit focused. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 10:36:33 +01:00
dave	c7a7cb4281	huskies: merge 997	2026-05-14 11:06:27 +00:00
dave	5ed1438ab9	huskies: merge 1015	2026-05-13 23:39:17 +00:00
dave	4e007bb770	huskies: merge 1009	2026-05-13 22:55:05 +00:00
dave	8b53e20ca9	huskies: merge 961	2026-05-13 11:27:21 +00:00
dave	9ce5a8df0c	huskies: merge 945	2026-05-13 06:09:34 +00:00
dave	2f50e2198b	huskies: merge 951	2026-05-13 04:34:06 +00:00
Timmy	d78dd9e8f9	feat(934): typed Stage enum replaces directory-string state model The state machine's `Stage` enum becomes the source of truth for pipeline state. Six stages of work land together: 1. Clean wire vocabulary (`coding`, `merge`, `merge_failure`, ...) replaces legacy directory-style strings (`2_current`, `4_merge`, ...) on the wire. `Stage::from_dir` accepted both during deployment; new writes always emit the clean form via `stage_dir_name`. Lexicographic `dir >= "5_done"` checks in lifecycle.rs become typed `matches!` checks since the new vocabulary doesn't sort in pipeline order. 2. `crdt_state::write_item` takes typed `&Stage`, serialising via `stage_dir_name` at the CRDT boundary. `#[cfg(test)] write_item_str` parses legacy strings for test fixtures. 3. `WorkItem::stage()` returns typed `crdt_state::Stage`; `stage_str()` is gone from the public API. Projection dispatches on the typed enum. 4. `frozen` becomes an orthogonal CRDT register. `Stage::Frozen` and `PipelineEvent::Freeze`/`Unfreeze` are removed; `transition_to_frozen`/`unfrozen` set the flag directly without touching the stage register. 5. Watcher sweep and `tool_update_story`'s `blocked` setter route through `apply_transition` so the typed transition table validates every stage change. `update_story` gains a `frozen` field for symmetry. 6. One-shot startup migration rewrites pre-934 directory-style stage registers (and sets `frozen=true` on items previously at `7_frozen`). `Stage::from_dir` drops legacy aliases. The db boundary keeps a small normaliser so callers with legacy strings (MCP, tests) still work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 22:31:59 +01:00
dave	148ce37beb	huskies: merge 891	2026-05-12 17:09:01 +00:00
Timmy	6feb68f3e3	fix(923): watchdog counts only tool-using turns; narration-only turns no longer burn budget Observed: stories 917, 918, 920, 910 all turn-limit-killed despite producing real commits. Tally across their session logs shows 30–55% of assistant turns were pure narration ("I'll read X next", "Now let me check Y") with no tool_use. At 80 max_turns the effective work budget was ~44 tool calls, not enough for a typical bug fix's edit + test + check_criterion cycle. Changes: - New optional AgentConfig field max_tool_turns. When set the watchdog uses it instead of max_turns; only assistant messages whose data.message.content has at least one tool_use block count. - count_turns_in_log in agents/pool/auto_assign/watchdog/limits.rs filters on tool_use. Existing test helper write_fake_session_log now emits tool_use blocks; added write_fake_mixed_session_log for the narration regression test. - agents.toml: coders/coder-opus get max_turns=200 (claude-code's own --max-turns cap, sized to never bite before the watchdog) and max_tool_turns=80. qa: 120 / 40. mergemaster: 250 / 100. Budgets unchanged — the dollar cap remains the runaway-loop backstop, with ~$3-5 worst-case waste if an agent narrates indefinitely. - Two new regression tests: * watchdog_does_not_count_narration_only_turns: 5 tool + 30 narration under max_tool_turns=10 stays Running. * watchdog_max_tool_turns_overrides_max_turns: 4 tool turns at max_tool_turns=3 / max_turns=200 still terminates with TurnLimit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 17:25:11 +01:00
dave	9a3f60d5d3	huskies: merge 866	2026-04-29 22:47:53 +00:00
dave	2655288412	huskies: merge 870	2026-04-29 15:26:57 +00:00
dave	6c2bdde695	huskies: merge 783	2026-04-28 11:17:40 +00:00

13 Commits
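
The turn-counting rule described in 6feb68f3e3 can be illustrated with a minimal sketch. This is not the code from `limits.rs`: the real `count_turns_in_log` reads a session log file, whereas the sketch works on an in-memory model. The types `ContentBlock`, `AssistantTurn` and `Verdict`, the helpers `tool_turn` and `narration_turn`, and the function `check_turn_limit` are hypothetical names introduced here. Two further assumptions: only tool-using turns are counted whether or not `max_tool_turns` is set, and the limit trips when the count reaches the cap (the commit message does not state the exact boundary).

```rust
/// One content block of an assistant message (hypothetical simplified model).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ContentBlock {
    Text(String),
    ToolUse { name: String },
}

/// One assistant turn from a session log (hypothetical in-memory form).
#[derive(Debug, Clone)]
struct AssistantTurn {
    content: Vec<ContentBlock>,
}

/// Outcome of the limit check; variant names are illustrative.
#[derive(Debug, PartialEq)]
enum Verdict {
    Running,
    TurnLimit,
}

/// Build a turn that performs one tool call.
fn tool_turn(tool: &str) -> AssistantTurn {
    AssistantTurn {
        content: vec![ContentBlock::ToolUse { name: tool.to_string() }],
    }
}

/// Build a narration-only turn (text, no tool_use block).
fn narration_turn(text: &str) -> AssistantTurn {
    AssistantTurn {
        content: vec![ContentBlock::Text(text.to_string())],
    }
}

/// Count only the turns that contain at least one tool_use block.
fn count_tool_turns(turns: &[AssistantTurn]) -> usize {
    turns
        .iter()
        .filter(|turn| {
            turn.content
                .iter()
                .any(|block| matches!(block, ContentBlock::ToolUse { .. }))
        })
        .count()
}

/// Use `max_tool_turns` as the cap when it is set, otherwise `max_turns`.
/// Assumption: the limit trips when the count reaches the cap (`>=`).
fn check_turn_limit(
    turns: &[AssistantTurn],
    max_turns: usize,
    max_tool_turns: Option<usize>,
) -> Verdict {
    let limit = max_tool_turns.unwrap_or(max_turns);
    if count_tool_turns(turns) >= limit {
        Verdict::TurnLimit
    } else {
        Verdict::Running
    }
}
```

Under this model, the two regression scenarios named in the commit behave as described: 5 tool turns plus 30 narration turns with `max_tool_turns=10` stay `Running`, and 4 tool turns with `max_tool_turns=3` and `max_turns=200` yield `TurnLimit`.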