Two issues that surfaced when story 1 ran in the adopted huskies-server
sled:
1. Dockerfile.base: the base image had no nodejs / claude CLI, so every
coder agent spawn in an adopted project sled failed with
`Unable to spawn claude: No viable candidates found in PATH`. Install
nodejs + @anthropic-ai/claude-code in the base image so every sled
built from it can spawn agents out of the box.
2. worktree/create.rs::install_pre_commit_hook: `git config --worktree`
requires `extensions.worktreeConfig = true` to be set on the repo
config; without it, every worktree creation logged a noisy
`Pre-commit hook install failed` warning. Enable the extension
idempotently before the per-worktree hooks-path set so the hook
install succeeds cleanly.
After this, rebuild huskies-project-base and recreate any adopted
project containers to pick up the CLI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Story 1130 added HUSKIES_HOST=0.0.0.0 so the server INSIDE a project
container binds to all interfaces, but the host-side `docker -p`
mapping was still `127.0.0.1:{port}:3001` and `127.0.0.1:{ssh_port}:22`
— reachable from the docker host only, blocking remote MCP clients
and out-of-host SSH onto the project container.
Switch host-side mapping to 0.0.0.0 for both the MCP and SSH ports so
project containers spawned via `new project` are reachable from
anywhere that can route to the docker host. Existing containers
created before this commit retain their localhost-only mapping and
need to be recreated to pick up the change.
Add a regression test asserting both -p arguments use 0.0.0.0 and
reject any 127.0.0.1 restriction in the mapping.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The merge gate classifier was matching trigger keywords like
`missing_doc_comments` inside passing-test name lines
(e.g. `test agents::gates::tests::classify_lint_from_missing_doc_comments ... ok`),
causing every gate failure to be mis-classified as Lint and bounced
back to a fixup coder. Strip `test … … ok` lines before scanning for
lint triggers. Also removes the temporary diagnostic block in
runner.rs that confirmed the bug.
Applied directly to master because the 1101 feature branch carried
stale work from an earlier incarnation of the story that semantically
conflicted with master's later diagnostic commit (`is_fixup` deleted
on the branch, referenced on master).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bug 1102 was created today with origin={kind:user, id:""} because
build_origin silently defaulted id to empty when the caller didn't pass
one — we couldn't tell who filed it. Bug 1088's origin field is useless
as audit if every caller can omit themselves.
Changes:
- build_origin (server/src/http/mcp/story_tools/mod.rs) now returns
Result<String, String> and rejects missing/empty/whitespace-only id
with an instructional error pointing at bug 1102 / story 1104.
- 5 create_* tool handlers (bug, spike, refactor, epic, story) now
resolve origin BEFORE create_*_file so an attribution-less call
leaves no half-state behind.
- 5 tool input schemas now advertise origin as a required object via
a shared origin_schema() helper. The schema description gives every
caller (coder agent, chat bot, user, system) a concrete example so
the LLM populates the field correctly on first sight.
- Test fixtures pass origin = {kind:"test", id:"test-suite"}.
Story 1104 (signed actions) is the longer-term replacement; this is the
quick attribution win agreed for master ahead of that design work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug 1101's reframed AC1: when a non-success merge runs, log the typed
GateFailureKind, the matched classifier-trigger substring (if any) and
~90 chars of surrounding context. Fires on every gate failure regardless
of routing, so the next fixup-loop bounce will tell us which substring is
fooling classify() into Fmt|Lint|SourceMapCheck on what's actually a Test
failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test asserted msg_count == 0 on a process-global broadcast channel
(TRANSITION_TX is a single OnceLock<Sender> shared across the test
binary), so any concurrent test calling apply_transition could land
events in our receiver between the drain and the post-reconcile check.
Observed failure: 3 stray transitions from parallel tests.
Drop the strict count check. The real "never floods" invariant is
captured by the Lagged check alone: 1000 seeded items must not overflow
the 256-slot channel, which can only hold if the reconcile path
bypasses the broadcast (AC4). The sibling test
`reconcile_pass_scales_to_1000_items_without_lagged_divergence` already
uses this Lagged-only pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stop_agent had the same order-of-operations bug fixed in the watchdog:
status flipped to Failed before the claude process was verified gone,
opening the idempotency window that allowed a duplicate spawn to race
in alongside the surviving process.
Now follows the three-step protocol:
1. Read worktree path under a read-only lock (no mutation).
2. SIGKILL the worktree's process tree via process_kill and block
until verified gone — start_agent's Running/Pending whitelist
continues to reject duplicate spawns throughout.
3. Only then mutate the agent record, abort the task handle, and
drop the child_killers entry.
Falls back to the old portable_pty SIGHUP path (with a warning) when
no worktree was recorded, matching the watchdog's behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:
- `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
until every pid is verified gone. Returns survivors if not.
- `pids_matching(pattern)`: pgrep -f wrapper.
- `descendant_pids(root)`: recursive pgrep -P walker for process trees.
Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):
Before: check_agent_limits set status=Failed before the kill ran. The
kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
on Unix — claude-code ignores SIGHUP, so the process kept running while
the agent record was already marked terminated. The idempotency check
in `start_agent` whitelists Running/Pending, so the next auto-assign
pass spawned a fresh agent alongside the still-alive prior one. Two
claude PIDs sharing one session_id, racing on the same worktree.
After: status update is moved OUT of check_agent_limits and into the
caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
process-tree-in-the-worktree, with explicit verification that every pid
is gone. The idempotency window is closed.
The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.
`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>