Commit Graph

994 Commits

Author SHA1 Message Date
dave efafe44db1 huskies: merge 1110 story Chat bootstrap Phase 2b: additional stack overlays (Go, Python, Ruby, JVM) 2026-05-16 23:20:31 +00:00
dave 3a43337735 huskies: merge 1107 story Chat bootstrap Phase 2a: stack-overlay framework + Rust and Node stack overlays 2026-05-16 23:01:49 +00:00
dave 10d992a7e4 huskies: merge 1106 story Chat bootstrap Phase 1: new project chat command spawns a bare project container and registers it with the gateway 2026-05-16 22:39:20 +00:00
dave 979492449e huskies: merge 1105 bug Freeze from Backlog stores wrong resume_to — Unfreeze restores to Coding instead of Backlog 2026-05-15 22:33:54 +00:00
Timmy 6fbe239313 fix(1102): require non-empty origin.id on create_* MCP tools
bug 1102 was created today with origin={kind:user, id:""} because
build_origin silently defaulted id to empty when the caller didn't pass
one, so we couldn't tell who filed it. Bug 1088's origin field is useless
as an audit trail if every caller can omit themselves.

Changes:
- build_origin (server/src/http/mcp/story_tools/mod.rs) now returns
  Result<String, String> and rejects missing/empty/whitespace-only id
  with an instructional error pointing at bug 1102 / story 1104.
- 5 create_* tool handlers (bug, spike, refactor, epic, story) now
  resolve origin BEFORE create_*_file so an attribution-less call
  leaves no half-state behind.
- 5 tool input schemas now advertise origin as a required object via
  a shared origin_schema() helper. The schema description gives every
  caller (coder agent, chat bot, user, system) a concrete example so
  the LLM populates the field correctly on first sight.
- Test fixtures pass origin = {kind:"test", id:"test-suite"}.

Story 1104 (signed actions) is the longer-term replacement; this is the
quick attribution win agreed for master ahead of that design work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:13:54 +01:00
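
The validation and ordering described in fix(1102) can be sketched as follows. This is only an illustration under stated assumptions: the real `build_origin` lives in `server/src/http/mcp/story_tools/mod.rs` and its input type is not shown in the commit, so the `(kind, id)` parameters, the `"kind:id"` output format, and the `create_bug` helper with its in-memory `store` are hypothetical stand-ins.

```rust
/// Sketch of the stricter origin check. The parameter shape and the
/// "kind:id" return format are assumptions, not the project's actual API.
pub fn build_origin(kind: &str, id: Option<&str>) -> Result<String, String> {
    match id.map(str::trim) {
        // Accept only an id that is present and non-empty after trimming.
        Some(id) if !id.is_empty() => Ok(format!("{kind}:{id}")),
        // Missing, empty, and whitespace-only ids are all rejected with a
        // message that tells the caller what to do.
        _ => Err(format!(
            "origin.id is required and must be non-empty (kind={kind}); \
             see bug 1102 / story 1104"
        )),
    }
}

/// Hypothetical create_* handler: resolve the origin BEFORE writing anything,
/// so an attribution-less call leaves no half-state behind.
pub fn create_bug(
    name: &str,
    kind: &str,
    id: Option<&str>,
    store: &mut Vec<String>, // stand-in for the on-disk create_*_file step
) -> Result<(), String> {
    let origin = build_origin(kind, id)?;
    store.push(format!("{name} [{origin}]"));
    Ok(())
}
```

With this ordering, a call such as `create_bug("x", "user", Some(""), &mut store)` returns an error and `store` stays empty.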
Timmy 26527e7dae diag(1101): log classify verdict + matched trigger on merge gate failures
Bug 1101's reframed AC1: when a non-success merge runs, log the typed
GateFailureKind, the matched classifier-trigger substring (if any) and
~90 chars of surrounding context. Fires on every gate failure regardless
of routing, so the next fixup-loop bounce will tell us which substring is
fooling classify() into Fmt|Lint|SourceMapCheck on what's actually a Test
failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:13:38 +01:00
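
A rough sketch of the diagnostic described in diag(1101): report the verdict together with the substring that triggered it and a bounded window of surrounding text. Only `GateFailureKind` and its four variants come from the commit message; the trigger table, the fall-through to `Test`, and the helper names are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateFailureKind {
    Fmt,
    Lint,
    SourceMapCheck,
    Test,
}

// Hypothetical trigger table; the real classifier's substrings are not
// listed in the commit message.
const TRIGGERS: &[(&str, GateFailureKind)] = &[
    ("rustfmt", GateFailureKind::Fmt),
    ("clippy", GateFailureKind::Lint),
    ("source-map", GateFailureKind::SourceMapCheck),
];

/// Returns the verdict plus, when a trigger matched, the trigger itself and
/// roughly 90 bytes of context around it (newlines flattened for logging).
pub fn classify_with_trigger(output: &str) -> (GateFailureKind, Option<(&'static str, String)>) {
    for &(needle, kind) in TRIGGERS {
        if let Some(pos) = output.find(needle) {
            let ctx = context_around(output, pos, needle.len(), 90);
            return (kind, Some((needle, ctx)));
        }
    }
    // Assumption: anything without a trigger is treated as a test failure.
    (GateFailureKind::Test, None)
}

fn context_around(text: &str, pos: usize, len: usize, budget: usize) -> String {
    let half = budget.saturating_sub(len) / 2;
    let mut start = pos.saturating_sub(half);
    let mut end = (pos + len + half).min(text.len());
    // Snap to UTF-8 boundaries so slicing cannot panic on multi-byte text.
    while !text.is_char_boundary(start) {
        start -= 1;
    }
    while !text.is_char_boundary(end) {
        end += 1;
    }
    text[start..end].replace('\n', " ")
}
```

In this sketch a failing test run whose output merely mentions `clippy` is classified as `Lint`, and the logged trigger and context show which substring caused the misroute.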
dave 04a57e92c2 huskies: merge 1103 bug Rate-limit warning at session start sticks the rate_limit_exit flag, causing 1053's fast-path bypass to skip completion on clean session exits 2026-05-15 21:02:37 +00:00
dave 4216ced493 huskies: merge 1100 bug Multiple LLM agents can run concurrently on the same story (coder + mergemaster + others) — enforce one-agent-per-story invariant 2026-05-15 20:24:31 +00:00
dave 63d86f1263 huskies: merge 1096 bug Shadow drift: set_agent writes CRDT agent register without updating pipeline_items.agent 2026-05-15 19:05:56 +00:00
dave 1adc734801 huskies: merge 1098 bug Shadow drift: set_retry_count / bump_retry_count write CRDT register without updating pipeline_items.retry_count 2026-05-15 18:25:25 +00:00
dave 8531bac6cd huskies: merge 1097 bug Shadow drift: set_depends_on writes CRDT depends_on register without updating pipeline_items.depends_on 2026-05-15 12:40:17 +00:00
dave 2857c3b46b huskies: merge 1094 bug delete_story leaks zombie rows in pipeline_items shadow table — 176 tombstoned items still report non-terminal stages 2026-05-15 12:27:48 +00:00
dave 62d1535e76 huskies: merge 1095 bug Shadow drift: set_name writes CRDT name register without updating pipeline_items.name 2026-05-15 12:10:11 +00:00
dave fc5481dbe4 huskies: merge 1093 bug Chat dispatcher spawns one Timmy per inbound message — needs coalesce window + per-session serial lock 2026-05-15 12:03:09 +00:00
dave 01e60a670c huskies: merge 1091 refactor Migrate the merge-gate's stale-cargo kill path to process_kill 2026-05-15 11:50:03 +00:00
dave c4010854a5 huskies: merge 1089 bug Stuck-agent detector blocks stories on legitimate exploration / debugging — uses too narrow a "progress" signal 2026-05-15 11:40:44 +00:00
dave 4aa76ce673 huskies: merge 1090 refactor Migrate AgentPool::kill_all_children and kill_child_for_key to process_kill so server shutdown and stop_agent actually kill claude 2026-05-15 11:16:16 +00:00
Timmy fb82bd7bca test(tick_loop): de-flake reconcile_never_floods_broadcast_channel
The test asserted msg_count == 0 on a process-global broadcast channel
(TRANSITION_TX is a single OnceLock<Sender> shared across the test
binary), so any concurrent test calling apply_transition could land
events in our receiver between the drain and the post-reconcile check.
Observed failure: 3 stray transitions from parallel tests.

Drop the strict count check. The real "never floods" invariant is
captured by the Lagged check alone: 1000 seeded items must not overflow
the 256-slot channel, which can only hold if the reconcile path
bypasses the broadcast (AC4).  The sibling test
`reconcile_pass_scales_to_1000_items_without_lagged_divergence` already
uses this Lagged-only pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 11:13:31 +01:00
Timmy b7df5cbe4e fix(agents): kill-then-status reorder in stop_agent
stop_agent had the same order-of-operations bug fixed in the watchdog:
status flipped to Failed before the claude process was verified gone,
opening the idempotency window that allowed a duplicate spawn to race
in alongside the surviving process.

Now follows the three-step protocol:
1. Read worktree path under a read-only lock (no mutation).
2. SIGKILL the worktree's process tree via process_kill and block
   until verified gone — start_agent's Running/Pending whitelist
   continues to reject duplicate spawns throughout.
3. Only then mutate the agent record, abort the task handle, and
   drop the child_killers entry.

Falls back to the old portable_pty SIGHUP path (with a warning) when
no worktree was recorded, matching the watchdog's behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:46:02 +01:00
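
The three-step protocol in fix(agents) might look roughly like the sketch below. The `Pool`, `Agent`, and `Status` types are simplified stand-ins, and the kill step is passed in as a closure (an assumption made so the ordering can be exercised without real processes); the actual code calls `process_kill` and also aborts the task handle and drops the `child_killers` entry.

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::RwLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Running,
    Failed,
}

pub struct Agent {
    pub status: Status,
    pub worktree: Option<PathBuf>,
}

pub struct Pool {
    pub agents: RwLock<HashMap<String, Agent>>,
}

impl Pool {
    /// `kill_tree_and_verify` is a hypothetical hook standing in for the
    /// SIGKILL-and-verify call; it must return Ok only once the tree is gone.
    pub fn stop_agent<F>(&self, key: &str, kill_tree_and_verify: F) -> Result<(), String>
    where
        F: FnOnce(&PathBuf) -> Result<(), String>,
    {
        // Step 1: read the worktree path under a read lock, no mutation.
        let worktree = {
            let agents = self.agents.read().map_err(|e| e.to_string())?;
            let wt = agents.get(key).ok_or("unknown agent")?.worktree.clone();
            wt
        };

        // Step 2: kill and verify while the record still says Running, so a
        // duplicate spawn keeps being rejected for the whole kill.
        match &worktree {
            Some(path) => kill_tree_and_verify(path)?,
            None => { /* fallback: old SIGHUP path, with a warning */ }
        }

        // Step 3: only now mutate the agent record.
        let mut agents = self.agents.write().map_err(|e| e.to_string())?;
        if let Some(agent) = agents.get_mut(key) {
            agent.status = Status::Failed;
        }
        Ok(())
    }
}
```

If the kill hook reports a survivor, the function returns early and the status is left untouched, which keeps the record consistent with the still-running process.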
Timmy fe9804b32c feat: add process_kill module + use it to fix watchdog double-spawn
Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:

  - `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
    until every pid is verified gone. Returns survivors if not.
  - `pids_matching(pattern)`: pgrep -f wrapper.
  - `descendant_pids(root)`: recursive pgrep -P walker for process trees.

Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):

  Before: check_agent_limits set status=Failed before the kill ran. The
  kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
  on Unix — claude-code ignores SIGHUP, so the process kept running while
  the agent record was already marked terminated. The idempotency check
  in `start_agent` whitelists Running/Pending, so the next auto-assign
  pass spawned a fresh agent alongside the still-alive prior one. Two
  claude PIDs sharing one session_id, racing on the same worktree.

  After: status update is moved OUT of check_agent_limits and into the
  caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
  process-tree-in-the-worktree, with explicit verification that every pid
  is gone. The idempotency window is closed.

The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.

`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:36:33 +01:00
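
The kill-then-verify idea behind `sigkill_pids_and_verify` can be sketched with the signalling and liveness probe injected as closures. This is not the module's real signature: `kill_and_verify`, its closure parameters, the 20 ms poll interval, and the caller-supplied timeout are assumptions chosen to keep the example standard-library-only and testable without spawning processes.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Signal every pid, then poll until all are gone or the timeout expires.
/// Returns the pids still alive at that point (empty means verified gone).
pub fn kill_and_verify<K, A>(
    pids: &[u32],
    mut kill: K,  // hypothetical: would send SIGKILL to one pid
    mut alive: A, // hypothetical: would probe whether a pid still exists
    timeout: Duration,
) -> Vec<u32>
where
    K: FnMut(u32),
    A: FnMut(u32) -> bool,
{
    for &pid in pids {
        kill(pid);
    }
    let deadline = Instant::now() + timeout;
    loop {
        let survivors: Vec<u32> = pids.iter().copied().filter(|&p| alive(p)).collect();
        // The caller decides what to do with survivors; the point is that
        // the result of the kill is never silently ignored.
        if survivors.is_empty() || Instant::now() >= deadline {
            return survivors;
        }
        sleep(Duration::from_millis(20));
    }
}
```

A caller following the reordered watchdog protocol would update the agent's status only when the returned vector is empty.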
dave df32a1542b huskies: merge 1087 story Pipeline+Status split — Step D: migrate CRDT storage to (Pipeline, Status) and remove the Stage enum 2026-05-15 08:47:38 +00:00
dave e82602db77 huskies: merge 1086 story Pipeline+Status split — Step C: migrate auto-assign, subscribers, and lifecycle transitions to read Pipeline + Status 2026-05-15 08:26:39 +00:00
Timmy 2d6105c778 fix: skip setup commands on worktree reuse so reconciler doesn't fire npm ci every 30s
Story 1066 (merged 2026-05-14 23:39) introduced a periodic reconciler that
calls `reconcile_worktree_create` every 30 seconds (default
`reconcile_interval_secs`). The reconciler's docstring promises it is a no-op
for stories whose worktree already exists — but the implementation calls
`create_worktree`, whose reuse path was running `run_setup_commands`
unconditionally. Setup includes destructive `npm ci` (rm -rf node_modules
then reinstall), so every Coding story got `npm ci` fired every 30 seconds.

When story 1086 hit a gate-failure retry loop on 2026-05-15, the merge gate's
own `npm install`/`npm run build` raced one of these reconciler-driven
`npm ci` runs that was wiping node_modules — leaving `.bin/tsc` as a broken
symlink pointing into a half-populated `typescript/` package and producing
`sh: 1: tsc: not found`. 37 npm ci fires for 1086 in 5 hours against only
3 real Coding transitions, a 12x amplification driven entirely by the
30-second reconcile cadence.

Fix: align `create_worktree`'s behaviour with the contract `reconcile_worktree_create`
already documents — reuse is a no-op for setup commands. Sparse checkout
and `.mcp.json` rewrite still run (both cheap and idempotent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:57:38 +01:00
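
The reuse contract from the fix above, reduced to a sketch: setup commands run only when the worktree is newly created, so a periodic reconciler calling `create_worktree` on an existing worktree triggers no further setup. The `Worktrees` struct, the `setup_runs` counter, and the `WorktreeOutcome` enum are hypothetical; the real function works on git worktrees on disk.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
pub enum WorktreeOutcome {
    Created,
    Reused,
}

pub struct Worktrees {
    existing: HashSet<String>,
    /// Counts how often the expensive setup (e.g. `npm ci`) would have run.
    pub setup_runs: u32,
}

impl Worktrees {
    pub fn new() -> Self {
        Worktrees { existing: HashSet::new(), setup_runs: 0 }
    }

    pub fn create_worktree(&mut self, story_id: &str) -> WorktreeOutcome {
        // `insert` returns true only when the id was not already present.
        let fresh = self.existing.insert(story_id.to_string());

        // Cheap, idempotent steps (sparse checkout, `.mcp.json` rewrite)
        // would run here on both paths.

        if fresh {
            // Destructive setup commands run on first creation only.
            self.setup_runs += 1;
            WorktreeOutcome::Created
        } else {
            WorktreeOutcome::Reused
        }
    }
}
```

Calling `create_worktree("1086")` once per reconcile tick then leaves `setup_runs` at 1 no matter how many ticks elapse.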
Timmy d89940e85b fix: drop source-map.json from agent orientation bundle
The orientation bundle was 96 KB per coder spawn with 85 KB of that being
source-map.json — a static symbol listing that drowned out the workflow rules
in AGENT.md and likely explains why PLAN.md ceremony is being skipped (the
instruction is ~5% of the bundle, buried under a wall of symbols). Agents are
excellent at grep on demand, so the source map adds little value as a preloaded
cheat sheet. File stays on disk for the merge-time source-map-check doc-coverage
gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:48:18 +01:00
dave 13f7dab5f0 huskies: merge 1088 2026-05-15 02:03:30 +00:00
dave b053f14d58 huskies: merge 1085 2026-05-15 01:38:05 +00:00
dave 56179d712e huskies: merge 1078 2026-05-15 01:32:29 +00:00
dave 1506141155 huskies: merge 1072 2026-05-15 01:27:25 +00:00
dave 0c23d209a0 huskies: merge 1077 2026-05-15 00:58:57 +00:00
dave eac5763e03 huskies: merge 1075 2026-05-15 00:48:06 +00:00
dave f9b140add9 huskies: merge 1073 2026-05-15 00:37:01 +00:00
dave d4db96f709 huskies: merge 1070 2026-05-15 00:20:29 +00:00
dave 5f08573db8 huskies: merge 1076 2026-05-15 00:10:15 +00:00
dave da83fcb78d huskies: merge 1074 2026-05-15 00:01:58 +00:00
dave bb6a6063e8 huskies: merge 1066 2026-05-14 23:45:53 +00:00
dave 374aa77f27 huskies: merge 1069 2026-05-14 23:29:32 +00:00
dave c66016394b huskies: merge 1063 2026-05-14 21:53:56 +00:00
dave 23c3301903 huskies: merge 1065 2026-05-14 21:48:09 +00:00
Timmy e6865a1bc6 fix: stop event-triggers Lagged handler from re-emitting via the same channel
Merge 1061 added a replay_current_pipeline_state() call to the broadcast::Lagged
branch, but replay broadcasts one event per CRDT item (~997) into a 256-slot
channel, deterministically re-overflowing it and triggering another Lagged. The
loop pinned CPU and likely caused today's machine crash. Revert to the pre-1061
behaviour of logging and continuing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 22:33:14 +01:00
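
The feedback loop described above can be reproduced with a small model of a bounded broadcast channel. This is a simplified stand-in, not the real channel implementation: `Ring`, `Recv`, and `drain` are hypothetical names, and the model only tracks counters (how many events were written and where the receiver's cursor is).

```rust
pub enum Recv {
    Event,
    Lagged(u64),
    Empty,
}

/// Counter-only model of a bounded channel with a single receiver.
pub struct Ring {
    capacity: u64,
    written: u64,
    cursor: u64,
}

impl Ring {
    pub fn new(capacity: u64) -> Self {
        Ring { capacity, written: 0, cursor: 0 }
    }

    pub fn send_many(&mut self, n: u64) {
        self.written += n;
    }

    pub fn recv(&mut self) -> Recv {
        let oldest = self.written.saturating_sub(self.capacity);
        if self.cursor < oldest {
            // The receiver fell behind: report the gap and skip ahead.
            let missed = oldest - self.cursor;
            self.cursor = oldest;
            Recv::Lagged(missed)
        } else if self.cursor < self.written {
            self.cursor += 1;
            Recv::Event
        } else {
            Recv::Empty
        }
    }
}

/// Runs the receive loop for at most `max_steps` and returns how many
/// Lagged results were seen. `replay_on_lag` models the 1061 behaviour of
/// re-broadcasting one event per item from inside the Lagged branch.
pub fn drain(ring: &mut Ring, items: u64, replay_on_lag: bool, max_steps: u32) -> u32 {
    let mut lagged = 0;
    for _ in 0..max_steps {
        match ring.recv() {
            Recv::Lagged(_) => {
                lagged += 1;
                if replay_on_lag {
                    ring.send_many(items);
                }
            }
            Recv::Event => {}
            Recv::Empty => break,
        }
    }
    lagged
}
```

In this model, with 997 items and 256 slots, logging and continuing sees a single `Lagged` and then drains normally, while replaying from the `Lagged` branch produces a `Lagged` on every step until the step limit is hit.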
dave 8f666bd6b3 huskies: merge 1062 2026-05-14 20:36:51 +00:00
dave 5678f2a556 huskies: merge 1061 2026-05-14 20:12:51 +00:00
dave 54d9737428 huskies: merge 1060 2026-05-14 19:31:04 +00:00
Timmy 667601012c fix: populate story_name in event buffer via CRDT lookup
`subscribe_to_watcher` was pushing StoredEvents into the event
buffer with story_name hardcoded to String::new(), so /api/events
polled by the gateway always omitted the title. The 1035 fix
patched the other path (gateway_relay status_to_stored) but left
this one bleeding empty strings.

Lookup happens once at the subscriber boundary rather than at all
44 watcher emit sites — the story_id is already in hand and
crdt_state::read_item is the canonical name source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 20:24:27 +01:00
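
A sketch of the "look up once at the subscriber boundary" approach: the conversion from a watcher event to a stored event takes a name-lookup function, so none of the emit sites need to change. `WatcherEvent`, `to_stored`, the field layout, and the empty-string fallback are assumptions; in the real code the lookup is `crdt_state::read_item`.

```rust
pub struct WatcherEvent {
    pub story_id: String,
    pub kind: String,
}

#[derive(Debug, PartialEq)]
pub struct StoredEvent {
    pub story_id: String,
    pub story_name: String,
    pub kind: String,
}

/// Converts one watcher event, resolving the story name via `read_name`
/// (a hypothetical stand-in for the CRDT lookup).
pub fn to_stored<F>(ev: WatcherEvent, read_name: F) -> StoredEvent
where
    F: Fn(&str) -> Option<String>,
{
    // Assumption: an unknown story falls back to an empty name rather than
    // dropping the event.
    let story_name = read_name(&ev.story_id).unwrap_or_default();
    StoredEvent {
        story_id: ev.story_id,
        story_name,
        kind: ev.kind,
    }
}
```

Because the story id is already in hand at the subscriber, a single lookup there covers every event the watcher emits.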
dave 595777f366 huskies: merge 1054 2026-05-14 18:53:07 +00:00
dave 96e227d8d4 huskies: merge 1053 2026-05-14 18:40:37 +00:00
dave b9709a6466 huskies: merge 1052 2026-05-14 18:11:57 +00:00
dave 977b954e98 huskies: merge 1051 2026-05-14 18:04:30 +00:00
dave 8f99fede34 huskies: merge 1050 2026-05-14 17:32:14 +00:00
dave 1f9f34ab58 huskies: merge 1038 2026-05-14 17:06:50 +00:00
dave 311883f45d huskies: merge 1039 2026-05-14 16:33:47 +00:00