Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:
- `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
until every pid is verified gone. Returns survivors if not.
- `pids_matching(pattern)`: pgrep -f wrapper.
- `descendant_pids(root)`: recursive pgrep -P walker for process trees.
Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):
Before: check_agent_limits set status=Failed before the kill ran. The
kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
on Unix — claude-code ignores SIGHUP, so the process kept running while
the agent record was already marked terminated. The idempotency check
in `start_agent` whitelists Running/Pending, so the next auto-assign
pass spawned a fresh agent alongside the still-alive prior one. Two
claude PIDs sharing one session_id, racing on the same worktree.
After: status update is moved OUT of check_agent_limits and into the
caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
process-tree-in-the-worktree, with explicit verification that every pid
is gone. The idempotency window is closed.
The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.
`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 13-file refactor pass (commits db00a5d4 through eca15b4e) introduced
~89 clippy errors and 38 cargo fmt issues — every agent in every worktree
hit them on script/test, burning their turn budget on cleanup before doing
real story work. This is the silent kill behind 644, 652, 655, 664, 667
all hitting watchdog limits this round.
Changes:
- cargo fmt --all across 37 files (formatting normalisation only)
- #![allow(unused_imports, dead_code)] on 24 split modules where the
python-script splitter imported liberally to be safe; tighter cleanup
per-import will happen as agents touch each module
- Removed truly-dead re-exports (cleanup_merge_workspace, slog_warn from
http/mcp/mod.rs, CliArgs/print_help from main.rs)
- Prefixed _auth_msg in crdt_sync/server.rs (handshake helper return is
bound but not consumed)
- Converted dangling /// doc block in crdt_sync/mod.rs to //! so it
attaches to the module
- Removed empty lines after doc comments in 4 spots (clippy lint)
All 2636 tests pass; clippy --all-targets -- -D warnings clean.
The biggest miss is #[tokio::main] — without it, async fn main() doesn't compile,
and the binary in every worktree fails 'cargo check'. Agents in those worktrees
burn their turn budgets trying to fix the build before they can do real work, then
get killed by the watchdog. That's why all three in-flight stories failed.
Other restored attributes:
- #[cfg(unix)] on 4 tests in merge/squash and scaffold (skip on non-Unix)
- #[allow(dead_code)] on AuthListenerResult test enum
- #[allow(clippy::too_many_arguments)] on run_pty_session
Same root cause as the earlier #[test] attribute losses: my line ranges started
at the fn line, missing the leading attribute on the previous line.
Lost in commit db00a5d4 when extracting tests from main.rs into cli.rs;
the line range used for the panics_on_duplicate_agent_names test in main.rs
started at the fn signature instead of the attribute line.
The 1258-line main.rs is split into:
- main.rs: mod declarations, async fn main + panics_on_duplicate_agent_names test (894 lines)
- cli.rs: CliArgs struct, parse_cli_args, print_help, resolve_path_arg + their tests (372 lines)
main.rs cannot itself become a directory (binary crate must have main.rs at the
crate root); cli.rs is a sibling module.
No behaviour change. All cli tests pass; full suite green.
In gateway mode the bot has no local CRDT or project filesystem, so all
bot commands (status, backlog, start, assign, etc.) returned empty or
broken results. Now the gateway bot proxies non-local commands via HTTP
to the active project's /api/bot/command endpoint, which already exists
on every project server.
Only a small set of gateway-local commands (help, ambient, reset, switch)
are still handled directly by the gateway. Everything else is forwarded
automatically, so new commands added in the future will work through the
proxy without additional gateway changes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>