Commit Graph

3697 Commits

Author SHA1 Message Date
Timmy fe9804b32c feat: add process_kill module + use it to fix watchdog double-spawn
Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:

  - `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
    until every pid is verified gone. Returns survivors if not.
  - `pids_matching(pattern)`: pgrep -f wrapper.
  - `descendant_pids(root)`: recursive pgrep -P walker for process trees.

Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):

  Before: check_agent_limits set status=Failed before the kill ran. The
  kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
  on Unix — claude-code ignores SIGHUP, so the process kept running while
  the agent record was already marked terminated. The idempotency check
  in `start_agent` whitelists Running/Pending, so the next auto-assign
  pass spawned a fresh agent alongside the still-alive prior one. Two
  claude PIDs sharing one session_id, racing on the same worktree.

  After: status update is moved OUT of check_agent_limits and into the
  caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
  process-tree-in-the-worktree, with explicit verification that every pid
  is gone. The idempotency window is closed.

The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.

`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:36:33 +01:00
Timmy 8446ab1c71 chore: gitignore .huskies/double_timmy_log.md
Local-only scratchpad for tracking suspected duplicate-Timmy /
duplicate-create_story incidents while we hunt the cause.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:06:37 +01:00
dave b5054b08d3 huskies: regen source-map.json 2026-05-15 08:47:38 +00:00
dave df32a1542b huskies: merge 1087 story Pipeline+Status split — Step D: migrate CRDT storage to (Pipeline, Status) and remove the Stage enum 2026-05-15 08:47:38 +00:00
dave e82602db77 huskies: merge 1086 story Pipeline+Status split — Step C: migrate auto-assign, subscribers, and lifecycle transitions to read Pipeline + Status 2026-05-15 08:26:39 +00:00
Timmy 2d6105c778 fix: skip setup commands on worktree reuse so reconciler doesn't fire npm ci every 30s
Story 1066 (merged 2026-05-14 23:39) introduced a periodic reconciler that
calls `reconcile_worktree_create` every 30 seconds (default
`reconcile_interval_secs`). The reconciler's docstring promises it is a no-op
for stories whose worktree already exists — but the implementation calls
`create_worktree`, whose reuse path was running `run_setup_commands`
unconditionally. Setup includes destructive `npm ci` (rm -rf node_modules
then reinstall), so every Coding story got `npm ci` fired every 30 seconds.

When story 1086 hit a gate-failure retry loop on 2026-05-15, the merge gate's
own `npm install`/`npm run build` raced one of these reconciler-driven
`npm ci` runs that was wiping node_modules — leaving `.bin/tsc` as a broken
symlink pointing into a half-populated `typescript/` package and producing
`sh: 1: tsc: not found`. 37 npm ci fires for 1086 in 5 hours against only
3 real Coding transitions, a 12x amplification driven entirely by the
30-second reconcile cadence.

Fix: align `create_worktree`'s behaviour with the contract `reconcile_worktree_create`
already documents — reuse is a no-op for setup commands. Sparse checkout
and `.mcp.json` rewrite still run (both cheap and idempotent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:57:38 +01:00
Timmy d89940e85b fix: drop source-map.json from agent orientation bundle
The orientation bundle was 96 KB per coder spawn with 85 KB of that being
source-map.json — a static symbol listing that drowned out the workflow rules
in AGENT.md and likely explains why PLAN.md ceremony is being skipped (the
instruction is ~5% of the bundle, buried under a wall of symbols). Agents are
excellent at grep on demand, so the source map adds little value as a preloaded
cheat sheet. File stays on disk for the merge-time source-map-check doc-coverage
gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:48:18 +01:00
dave 60fceee204 huskies: regen source-map.json 2026-05-15 02:03:30 +00:00
dave 13f7dab5f0 huskies: merge 1088 2026-05-15 02:03:30 +00:00
dave f7413cc711 huskies: regen source-map.json 2026-05-15 01:38:05 +00:00
dave b053f14d58 huskies: merge 1085 2026-05-15 01:38:05 +00:00
dave 56179d712e huskies: merge 1078 2026-05-15 01:32:29 +00:00
dave a06bf6778b huskies: regen source-map.json 2026-05-15 01:27:25 +00:00
dave 1506141155 huskies: merge 1072 2026-05-15 01:27:25 +00:00
dave ae69cd50b1 huskies: regen source-map.json 2026-05-15 00:58:57 +00:00
dave 0c23d209a0 huskies: merge 1077 2026-05-15 00:58:57 +00:00
dave eac5763e03 huskies: merge 1075 2026-05-15 00:48:06 +00:00
dave 6530eeab6d huskies: merge 811 2026-05-15 00:42:14 +00:00
dave 5eb8f2f8a7 huskies: regen source-map.json 2026-05-15 00:37:01 +00:00
dave f9b140add9 huskies: merge 1073 2026-05-15 00:37:01 +00:00
dave d4db96f709 huskies: merge 1070 2026-05-15 00:20:29 +00:00
dave 5f08573db8 huskies: merge 1076 2026-05-15 00:10:15 +00:00
dave da83fcb78d huskies: merge 1074 2026-05-15 00:01:58 +00:00
dave f04bdd1f14 huskies: regen source-map.json 2026-05-14 23:45:53 +00:00
dave bb6a6063e8 huskies: merge 1066 2026-05-14 23:45:53 +00:00
dave bf813d910b huskies: regen source-map.json 2026-05-14 23:29:32 +00:00
dave 374aa77f27 huskies: merge 1069 2026-05-14 23:29:32 +00:00
Timmy bbc4c9aa45 Bump version to 0.11.0 v0.11.0 2026-05-14 23:31:15 +01:00
Timmy 556d335997 chore: refresh source-map.json before 0.11 release
Catches up master with entries added by stories that merged in a binary
predating 1065 (merge-pipeline source-map regen): ErrorBoundary,
WsConnectivity, transition_merge_failure_to_retry, and others.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:28:47 +01:00
dave c66016394b huskies: merge 1063 2026-05-14 21:53:56 +00:00
dave 23c3301903 huskies: merge 1065 2026-05-14 21:48:09 +00:00
Timmy e6865a1bc6 fix: stop event-triggers Lagged handler from re-emitting via the same channel
Merge 1061 added a replay_current_pipeline_state() call to the broadcast::Lagged
branch, but replay broadcasts one event per CRDT item (~997) into a 256-slot
channel, deterministically re-overflowing it and triggering another Lagged. The
loop pinned CPU and likely caused today's machine crash. Revert to the pre-1061
behaviour of logging and continuing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 22:33:14 +01:00
dave 8f666bd6b3 huskies: merge 1062 2026-05-14 20:36:51 +00:00
dave 5678f2a556 huskies: merge 1061 2026-05-14 20:12:51 +00:00
dave 54d9737428 huskies: merge 1060 2026-05-14 19:31:04 +00:00
Timmy 667601012c fix: populate story_name in event buffer via CRDT lookup
`subscribe_to_watcher` was pushing StoredEvents into the event
buffer with story_name hardcoded to String::new(), so /api/events
polled by the gateway always omitted the title. The 1035 fix
patched the other path (gateway_relay status_to_stored) but left
this one bleeding empty strings.

Lookup happens once at the subscriber boundary rather than at all
44 watcher emit sites — the story_id is already in hand and
crdt_state::read_item is the canonical name source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 20:24:27 +01:00
dave b64eb69aee huskies: merge 1056 2026-05-14 19:14:03 +00:00
dave 6d53382f8c huskies: merge 1059 2026-05-14 19:09:17 +00:00
dave 7d7e02f7b0 huskies: merge 1058 2026-05-14 19:04:24 +00:00
dave 595777f366 huskies: merge 1054 2026-05-14 18:53:07 +00:00
dave 96e227d8d4 huskies: merge 1053 2026-05-14 18:40:37 +00:00
dave bb5abcd042 huskies: merge 811 2026-05-14 18:32:37 +00:00
Timmy 03a0ca258a docs: explain why libsqlite3-sys is pinned to 0.35 in server/Cargo.toml
The dep is declared only to flip on the `bundled` feature for the
static musl build, and 0.35 is the ceiling forced by rusqlite 0.37
(matrix-sdk-sqlite) and sqlx-sqlite 0.9.0-alpha.1. Future readers
no longer have to reconstruct that from cargo-tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 19:27:39 +01:00
dave b9709a6466 huskies: merge 1052 2026-05-14 18:11:57 +00:00
dave 977b954e98 huskies: merge 1051 2026-05-14 18:04:30 +00:00
dave 8f99fede34 huskies: merge 1050 2026-05-14 17:32:14 +00:00
dave 0d3c5579da huskies: merge 1047 2026-05-14 17:17:41 +00:00
dave 1f9f34ab58 huskies: merge 1038 2026-05-14 17:06:50 +00:00
dave 4553df5b21 huskies: merge 1045 2026-05-14 16:53:45 +00:00
dave 311883f45d huskies: merge 1039 2026-05-14 16:33:47 +00:00