feat: add process_kill module + use it to fix watchdog double-spawn

Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:

  - `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
    until every pid is verified gone. Returns survivors if not.
  - `pids_matching(pattern)`: pgrep -f wrapper.
  - `descendant_pids(root)`: recursive pgrep -P walker for process trees.

Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):

  Before: check_agent_limits set status=Failed before the kill ran. The
  kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
  on Unix — claude-code ignores SIGHUP, so the process kept running while
  the agent record was already marked terminated. The idempotency check
  in `start_agent` whitelists Running/Pending, so the next auto-assign
  pass spawned a fresh agent alongside the still-alive prior one. Two
  claude PIDs sharing one session_id, racing on the same worktree.

  After: status update is moved OUT of check_agent_limits and into the
  caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
  process-tree-in-the-worktree, with explicit verification that every pid
  is gone. The idempotency window is closed.

The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.

`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Timmy
2026-05-15 10:36:33 +01:00
parent 8446ab1c71
commit fe9804b32c
5 changed files with 403 additions and 12 deletions
@@ -187,13 +187,14 @@ pub(super) fn check_agent_limits(
),
};
// Mark agent as Failed with termination reason.
if let Ok(mut lock) = agents.lock()
&& let Some(agent) = lock.get_mut(key)
{
agent.status = AgentStatus::Failed;
agent.termination_reason = Some(reason.clone());
}
// NOTE: agent status is intentionally NOT updated here. Setting
// `status = Failed` before the kill (the previous behaviour)
// opened a window where the `start_agent` idempotency check
// (which whitelists Running/Pending) would let a fresh spawn
// through while the prior PTY child was still alive — directly
// causing the concurrent-agents bug we hit on story 1086
// (2026-05-15). The caller (`run_watchdog_pass`) is responsible
// for: (1) verifying the kill, (2) THEN updating the agent record.
slog!("[watchdog] Terminating agent '{key}': {reason_str}.");