Commit Graph

1022 Commits

Author SHA1 Message Date
Timmy b7df5cbe4e fix(agents): kill-then-status reorder in stop_agent
stop_agent had the same order-of-operations bug fixed in the watchdog:
status flipped to Failed before the claude process was verified gone,
opening the idempotency window that allowed a duplicate spawn to race
in alongside the surviving process.

Now follows the three-step protocol:
1. Read worktree path under a read-only lock (no mutation).
2. SIGKILL the worktree's process tree via process_kill and block
   until verified gone — start_agent's Running/Pending whitelist
   continues to reject duplicate spawns throughout.
3. Only then mutate the agent record, abort the task handle, and
   drop the child_killers entry.

Falls back to the old portable_pty SIGHUP path (with a warning) when
no worktree was recorded, matching the watchdog's behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:46:02 +01:00
Timmy fe9804b32c feat: add process_kill module + use it to fix watchdog double-spawn
Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:

  - `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
    until every pid is verified gone. Returns survivors if not.
  - `pids_matching(pattern)`: pgrep -f wrapper.
  - `descendant_pids(root)`: recursive pgrep -P walker for process trees.

Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):

  Before: check_agent_limits set status=Failed before the kill ran. The
  kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
  on Unix — claude-code ignores SIGHUP, so the process kept running while
  the agent record was already marked terminated. The idempotency check
  in `start_agent` whitelists Running/Pending, so the next auto-assign
  pass spawned a fresh agent alongside the still-alive prior one. Two
  claude PIDs sharing one session_id, racing on the same worktree.

  After: status update is moved OUT of check_agent_limits and into the
  caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
  process-tree-in-the-worktree, with explicit verification that every pid
  is gone. The idempotency window is closed.

The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.

`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:36:33 +01:00
dave df32a1542b huskies: merge 1087 story Pipeline+Status split — Step D: migrate CRDT storage to (Pipeline, Status) and remove the Stage enum 2026-05-15 08:47:38 +00:00
dave e82602db77 huskies: merge 1086 story Pipeline+Status split — Step C: migrate auto-assign, subscribers, and lifecycle transitions to read Pipeline + Status 2026-05-15 08:26:39 +00:00
Timmy 2d6105c778 fix: skip setup commands on worktree reuse so reconciler doesn't fire npm ci every 30s
Story 1066 (merged 2026-05-14 23:39) introduced a periodic reconciler that
calls `reconcile_worktree_create` every 30 seconds (default
`reconcile_interval_secs`). The reconciler's docstring promises it is a no-op
for stories whose worktree already exists — but the implementation calls
`create_worktree`, whose reuse path was running `run_setup_commands`
unconditionally. Setup includes destructive `npm ci` (rm -rf node_modules
then reinstall), so every Coding story got `npm ci` fired every 30 seconds.

When story 1086 hit a gate-failure retry loop on 2026-05-15, the merge gate's
own `npm install`/`npm run build` raced one of these reconciler-driven
`npm ci` runs that was wiping node_modules — leaving `.bin/tsc` as a broken
symlink pointing into a half-populated `typescript/` package and producing
`sh: 1: tsc: not found`. 37 npm ci fires for 1086 in 5 hours against only
3 real Coding transitions, a 12x amplification driven entirely by the
30-second reconcile cadence.

Fix: align `create_worktree`'s behaviour with the contract `reconcile_worktree_create`
already documents — reuse is a no-op for setup commands. Sparse checkout
and `.mcp.json` rewrite still run (both cheap and idempotent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:57:38 +01:00
Timmy d89940e85b fix: drop source-map.json from agent orientation bundle
The orientation bundle was 96 KB per coder spawn with 85 KB of that being
source-map.json — a static symbol listing that drowned out the workflow rules
in AGENT.md and likely explains why PLAN.md ceremony is being skipped (the
instruction is ~5% of the bundle, buried under a wall of symbols). Agents are
excellent at grep on demand, so the source map adds little value as a preloaded
cheat sheet. File stays on disk for the merge-time source-map-check doc-coverage
gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:48:18 +01:00
dave 13f7dab5f0 huskies: merge 1088 2026-05-15 02:03:30 +00:00
dave b053f14d58 huskies: merge 1085 2026-05-15 01:38:05 +00:00
dave 56179d712e huskies: merge 1078 2026-05-15 01:32:29 +00:00
dave 1506141155 huskies: merge 1072 2026-05-15 01:27:25 +00:00
dave 0c23d209a0 huskies: merge 1077 2026-05-15 00:58:57 +00:00
dave eac5763e03 huskies: merge 1075 2026-05-15 00:48:06 +00:00
dave f9b140add9 huskies: merge 1073 2026-05-15 00:37:01 +00:00
dave d4db96f709 huskies: merge 1070 2026-05-15 00:20:29 +00:00
dave 5f08573db8 huskies: merge 1076 2026-05-15 00:10:15 +00:00
dave da83fcb78d huskies: merge 1074 2026-05-15 00:01:58 +00:00
dave bb6a6063e8 huskies: merge 1066 2026-05-14 23:45:53 +00:00
dave 374aa77f27 huskies: merge 1069 2026-05-14 23:29:32 +00:00
Timmy bbc4c9aa45 Bump version to 0.11.0 2026-05-14 23:31:15 +01:00
dave c66016394b huskies: merge 1063 2026-05-14 21:53:56 +00:00
dave 23c3301903 huskies: merge 1065 2026-05-14 21:48:09 +00:00
Timmy e6865a1bc6 fix: stop event-triggers Lagged handler from re-emitting via the same channel
Merge 1061 added a replay_current_pipeline_state() call to the broadcast::Lagged
branch, but replay broadcasts one event per CRDT item (~997) into a 256-slot
channel, deterministically re-overflowing it and triggering another Lagged. The
loop pinned CPU and likely caused today's machine crash. Revert to the pre-1061
behaviour of logging and continuing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 22:33:14 +01:00
dave 8f666bd6b3 huskies: merge 1062 2026-05-14 20:36:51 +00:00
dave 5678f2a556 huskies: merge 1061 2026-05-14 20:12:51 +00:00
dave 54d9737428 huskies: merge 1060 2026-05-14 19:31:04 +00:00
Timmy 667601012c fix: populate story_name in event buffer via CRDT lookup
`subscribe_to_watcher` was pushing StoredEvents into the event
buffer with story_name hardcoded to String::new(), so /api/events
polled by the gateway always omitted the title. The 1035 fix
patched the other path (gateway_relay status_to_stored) but left
this one bleeding empty strings.

Lookup happens once at the subscriber boundary rather than at all
44 watcher emit sites — the story_id is already in hand and
crdt_state::read_item is the canonical name source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 20:24:27 +01:00
dave 595777f366 huskies: merge 1054 2026-05-14 18:53:07 +00:00
dave 96e227d8d4 huskies: merge 1053 2026-05-14 18:40:37 +00:00
Timmy 03a0ca258a docs: explain why libsqlite3-sys is pinned to 0.35 in server/Cargo.toml
The dep is declared only to flip on the `bundled` feature for the
static musl build, and 0.35 is the ceiling forced by rusqlite 0.37
(matrix-sdk-sqlite) and sqlx-sqlite 0.9.0-alpha.1. Future readers
no longer have to reconstruct that from cargo-tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 19:27:39 +01:00
dave b9709a6466 huskies: merge 1052 2026-05-14 18:11:57 +00:00
dave 977b954e98 huskies: merge 1051 2026-05-14 18:04:30 +00:00
dave 8f99fede34 huskies: merge 1050 2026-05-14 17:32:14 +00:00
dave 1f9f34ab58 huskies: merge 1038 2026-05-14 17:06:50 +00:00
dave 311883f45d huskies: merge 1039 2026-05-14 16:33:47 +00:00
dave 9e06fff8a8 huskies: merge 1046 2026-05-14 16:20:07 +00:00
Timmy 822fcdaf2b chore: cargo fmt after Rust 1.93 toolchain bump 2026-05-14 16:33:35 +01:00
dave ee20e54d40 huskies: merge 1036 2026-05-14 15:13:25 +00:00
dave cfccc2e73c huskies: merge 1044 2026-05-14 14:54:13 +00:00
dave 960b4f4d1d huskies: merge 1032 2026-05-14 14:47:49 +00:00
dave bc99821274 huskies: merge 1031 2026-05-14 14:36:16 +00:00
dave 3d741acefb huskies: merge 1043 2026-05-14 14:31:09 +00:00
dave 5a3f94cae1 huskies: merge 1042 2026-05-14 14:25:15 +00:00
dave 8faf19f3ab huskies: merge 1034 2026-05-14 14:02:21 +00:00
Timmy 8625b9a7fc fix: rust 1.95.0 clippy lints and matrix-sdk 0.17 API changes
Toolchain bump surfaced new lints (derivable_impls,
unnecessary_unwrap, unnecessary_sort_by, while_let_loop,
collapsible_match, unnecessary_option_map_or_else, cmp_owned)
across bft-json-crdt and huskies-server. All fixed mechanically.

Cargo.toml: dropped the no-longer-existing `rustls-tls` matrix-sdk
feature, then chased through the 0.17 API breakage:
- Relation::Reply is now a tuple variant wrapping Reply, not a
  struct variant with `in_reply_to`
- UserIdentifier::UserIdOrLocalpart removed — use
  UserIdentifier::Matrix(MatrixUserIdentifier::new(..))
- SendMessageLikeEventResult no longer exposes event_id directly;
  it's now on the inner `response` field

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 14:48:49 +01:00
dave 9501412598 huskies: merge 1030 2026-05-14 13:29:59 +00:00
dave f1c96595de huskies: merge 1035 2026-05-14 13:17:38 +00:00
dave c353c0a6be huskies: merge 1033 2026-05-14 13:08:43 +00:00
dave 72d79deec9 huskies: merge 1026 2026-05-14 13:00:51 +00:00
dave a80d0a497a huskies: merge 1029 2026-05-14 12:53:01 +00:00
dave 4fad283814 huskies: merge 1027 2026-05-14 11:39:14 +00:00