The merge gate classifier was matching trigger keywords like
`missing_doc_comments` inside passing-test name lines
(e.g. `test agents::gates::tests::classify_lint_from_missing_doc_comments ... ok`),
causing every gate failure to be mis-classified as Lint and bounced
back to a fixup coder. Strip `test … … ok` lines before scanning for
lint triggers. Also removes the temporary diagnostic block in
runner.rs that confirmed the bug.
Applied directly to master because the 1101 feature branch carried
stale work from an earlier incarnation of the story that semantically
conflicted with master's later diagnostic commit (`is_fixup` deleted
on the branch, referenced on master).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bug 1102 was created today with origin={kind:user, id:""} because
build_origin silently defaulted id to empty when the caller didn't pass
one — we couldn't tell who filed it. Bug 1088's origin field is useless
as audit if every caller can omit themselves.
Changes:
- build_origin (server/src/http/mcp/story_tools/mod.rs) now returns
Result<String, String> and rejects missing/empty/whitespace-only id
with an instructional error pointing at bug 1102 / story 1104.
- 5 create_* tool handlers (bug, spike, refactor, epic, story) now
resolve origin BEFORE create_*_file so an attribution-less call
leaves no half-state behind.
- 5 tool input schemas now advertise origin as a required object via
a shared origin_schema() helper. The schema description gives every
caller (coder agent, chat bot, user, system) a concrete example so
the LLM populates the field correctly on first sight.
- Test fixtures pass origin = {kind:"test", id:"test-suite"}.
Story 1104 (signed actions) is the longer-term replacement; this is the
quick attribution win agreed for master ahead of that design work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug 1101's reframed AC1: when a non-success merge runs, log the typed
GateFailureKind, the matched classifier-trigger substring (if any) and
~90 chars of surrounding context. Fires on every gate failure regardless
of routing, so the next fixup-loop bounce will tell us which substring is
fooling classify() into Fmt|Lint|SourceMapCheck on what's actually a Test
failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test asserted msg_count == 0 on a process-global broadcast channel
(TRANSITION_TX is a single OnceLock<Sender> shared across the test
binary), so any concurrent test calling apply_transition could land
events in our receiver between the drain and the post-reconcile check.
Observed failure: 3 stray transitions from parallel tests.
Drop the strict count check. The real "never floods" invariant is
captured by the Lagged check alone: 1000 seeded items must not overflow
the 256-slot channel, which can only hold if the reconcile path
bypasses the broadcast (AC4). The sibling test
`reconcile_pass_scales_to_1000_items_without_lagged_divergence` already
uses this Lagged-only pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stop_agent had the same order-of-operations bug fixed in the watchdog:
status flipped to Failed before the claude process was verified gone,
opening the idempotency window that allowed a duplicate spawn to race
in alongside the surviving process.
Now follows the three-step protocol:
1. Read worktree path under a read-only lock (no mutation).
2. SIGKILL the worktree's process tree via process_kill and block
until verified gone — start_agent's Running/Pending whitelist
continues to reject duplicate spawns throughout.
3. Only then mutate the agent record, abort the task handle, and
drop the child_killers entry.
Falls back to the old portable_pty SIGHUP path (with a warning) when
no worktree was recorded, matching the watchdog's behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `crate::process_kill` — reliable SIGKILL-with-verify primitives used
across the server in place of the various ad-hoc kill paths that ignored
their kill-effective return values. The module exposes three pieces:
- `sigkill_pids_and_verify(pids)`: SIGKILL each pid and block (up to 2s)
until every pid is verified gone. Returns survivors if not.
- `pids_matching(pattern)`: pgrep -f wrapper.
- `descendant_pids(root)`: recursive pgrep -P walker for process trees.
Wires the watchdog's limit-termination path through it, and reorders the
protocol to fix the duplicate-coder bug observed on story 1086 (2026-05-15):
Before: check_agent_limits set status=Failed before the kill ran. The
kill itself was `portable_pty::ChildKiller::kill()`, which sends SIGHUP
on Unix — claude-code ignores SIGHUP, so the process kept running while
the agent record was already marked terminated. The idempotency check
in `start_agent` whitelists Running/Pending, so the next auto-assign
pass spawned a fresh agent alongside the still-alive prior one. Two
claude PIDs sharing one session_id, racing on the same worktree.
After: status update is moved OUT of check_agent_limits and into the
caller AFTER the kill is verified. The kill itself is now SIGKILL-the-
process-tree-in-the-worktree, with explicit verification that every pid
is gone. The idempotency window is closed.
The existing watchdog test suite (14 tests) still passes; 7 new tests
cover the process_kill primitives directly.
`agents/pool/process.rs`'s `kill_all_children` and `kill_child_for_key`
still use the old portable_pty SIGHUP path — they have the same bug but
in lower-impact code paths (shutdown, operator stop). They will be
migrated under a separate story to keep this commit focused.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Story 1066 (merged 2026-05-14 23:39) introduced a periodic reconciler that
calls `reconcile_worktree_create` every 30 seconds (default
`reconcile_interval_secs`). The reconciler's docstring promises it is a no-op
for stories whose worktree already exists — but the implementation calls
`create_worktree`, whose reuse path was running `run_setup_commands`
unconditionally. Setup includes destructive `npm ci` (rm -rf node_modules
then reinstall), so every Coding story got `npm ci` fired every 30 seconds.
When story 1086 hit a gate-failure retry loop on 2026-05-15, the merge gate's
own `npm install`/`npm run build` raced one of these reconciler-driven
`npm ci` runs that was wiping node_modules — leaving `.bin/tsc` as a broken
symlink pointing into a half-populated `typescript/` package and producing
`sh: 1: tsc: not found`. 37 npm ci fires for 1086 in 5 hours against only
3 real Coding transitions, a 12x amplification driven entirely by the
30-second reconcile cadence.
Fix: align `create_worktree`'s behaviour with the contract `reconcile_worktree_create`
already documents — reuse is a no-op for setup commands. Sparse checkout
and `.mcp.json` rewrite still run (both cheap and idempotent).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The orientation bundle was 96 KB per coder spawn with 85 KB of that being
source-map.json — a static symbol listing that drowned out the workflow rules
in AGENT.md and likely explains why PLAN.md ceremony is being skipped (the
instruction is ~5% of the bundle, buried under a wall of symbols). Agents are
excellent at grep on demand, so the source map adds little value as a preloaded
cheat sheet. File stays on disk for the merge-time source-map-check doc-coverage
gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge 1061 added a replay_current_pipeline_state() call to the broadcast::Lagged
branch, but replay broadcasts one event per CRDT item (~997) into a 256-slot
channel, deterministically re-overflowing it and triggering another Lagged. The
loop pinned CPU and likely caused today's machine crash. Revert to the pre-1061
behaviour of logging and continuing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>