Commit Graph

3 Commits

Author SHA1 Message Date
Timmy fe9804b32c feat: add process_kill module + use it to fix watchdog double-spawn
Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:

  - `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
    until every pid is verified gone. Returns survivors if not.
  - `pids_matching(pattern)`: pgrep -f wrapper.
  - `descendant_pids(root)`: recursive pgrep -P walker for process trees.

Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):

  Before: check_agent_limits set status=Failed before the kill ran. The
  kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
  on Unix — claude-code ignores SIGHUP, so the process kept running while
  the agent record was already marked terminated. The idempotency check
  in `start_agent` whitelists Running/Pending, so the next auto-assign
  pass spawned a fresh agent alongside the still-alive prior one. Two
  claude PIDs sharing one session_id, racing on the same worktree.

  After: status update is moved OUT of check_agent_limits and into the
  caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
  process-tree-in-the-worktree, with explicit verification that every pid
  is gone. The idempotency window is closed.

The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.

`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:36:33 +01:00
Timmy 6feb68f3e3 fix(923): watchdog counts only tool-using turns; narration-only turns no longer burn budget
Observed: stories 917, 918, 920, 910 all turn-limit-killed despite producing
real commits. Tally across their session logs shows 30–55% of assistant
turns were pure narration ("I'll read X next", "Now let me check Y") with
no tool_use. At 80 max_turns the effective work budget was ~44 tool calls,
not enough for a typical bug fix's edit + test + check_criterion cycle.

Changes:
- New optional AgentConfig field max_tool_turns. When set the watchdog
  uses it instead of max_turns; only assistant messages whose
  data.message.content has at least one tool_use block count.
- count_turns_in_log in agents/pool/auto_assign/watchdog/limits.rs
  filters on tool_use. Existing test helper write_fake_session_log now
  emits tool_use blocks; added write_fake_mixed_session_log for the
  narration regression test.
- agents.toml: coders/coder-opus get max_turns=200 (claude-code's own
  --max-turns cap, sized to never bite before the watchdog) and
  max_tool_turns=80. qa: 120 / 40. mergemaster: 250 / 100. Budgets
  unchanged — the dollar cap remains the runaway-loop backstop, with
  ~$3-5 worst-case waste if an agent narrates indefinitely.
- Two new regression tests:
  * watchdog_does_not_count_narration_only_turns: 5 tool + 30 narration
    under max_tool_turns=10 stays Running.
  * watchdog_max_tool_turns_overrides_max_turns: 4 tool turns at
    max_tool_turns=3 / max_turns=200 still terminates with TurnLimit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:25:11 +01:00
dave 6c2bdde695 huskies: merge 783 2026-04-28 11:17:40 +00:00