Bug 903 (run_tests attach instead of respawn) + 904 (MCP progress
notifications + SSE) together eliminate the transport-timeout error
mode from the agent's point of view: long test runs complete without
the MCP client ever observing a tool-call error. Production
verification (see d64f1e94 / ddc4228b deploy at 14:30 UTC today)
confirmed 78s and 65s test runs completing in single processes with
no respawn churn and no retry needed.
The "If run_tests errors with a transport timeout, call it again"
sentence in coder-1/2/3/opus system_prompts (added belt-and-braces
in a97a10fb) is now redundant. Removing it tightens the agent's
mental model down to: call run_tests, wait for the result. No
error-handling branch, no retry semantics to internalise.
This closes the last open AC on story 904.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-d64f1e94 the "call run_tests again — it attaches" guidance was a
lie (every call killed the prior job and spawned a fresh one). With
the attach fix in place, the contract is now real and safe to depend
on. Tighten the wording so agents see exactly what to do:
OLD: "Do not use ScheduleWakeup to wait for run_tests; if run_tests
appears to time out, call run_tests again — it attaches to the
in-flight test job and blocks until completion."
NEW: "If run_tests errors with a transport timeout, call it again —
it's idempotent and attaches to the same in-flight test job,
so retries are safe and eventually return a pass/fail result."
Improvements:
- "errors with a transport timeout" matches what the agent literally
observes (a tool-call error), not the vague "appears to time out".
- Explicit on idempotency so agents understand why retry is safe and
don't worry about double-running the suite.
- Drops the ScheduleWakeup clause — already enforced via the
`disallowed_tools` setting on coder-1/2/3/opus, so the prompt
reminder was redundant.
Applied uniformly across coder-1, coder-2, coder-3, coder-opus.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug 902: the Step 0 "resume from worktree state" instruction told coders
to call git_status / git_log / git_diff to discover prior session work,
which they then extended into hunting for the story `.md` file on disk
via find / ls — pointless post-865, since story content lives only in
the CRDT.
Update Step 0 in coder-1, coder-2, coder-3, and coder-opus to add an
explicit instruction: "To read story content, ACs, or description, call
the `get_story_todos` MCP tool — do NOT search for a story `.md` file
on disk; story content is CRDT-only."
Single substring replacement covers all four agents (identical Step 0
across them).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add `disallowed_tools` field to `AgentConfig` and render it as
`--disallowedTools` CLI flag in `render_agent_args`
- Set `disallowed_tools = ["ScheduleWakeup"]` on all four coder agents
(coder-1, coder-2, coder-3, coder-opus); QA and mergemaster unaffected
- Append instruction to all coder `system_prompt`s: do not use
ScheduleWakeup to wait for run_tests; if run_tests appears to time out,
call run_tests again — it attaches to the in-flight job and blocks
- Add tests: `render_agent_args_disallowed_tools` and
`coder_agents_disallow_schedule_wakeup`
`tool_run_tests` in `server/src/http/mcp/shell_tools/script.rs` is fully
blocking server-side: it spawns the test child, polls every 1s server-side
until exit (or `TEST_TIMEOUT_SECS = 1200s`), and returns the full
{passed, exit_code, output} directly. There is NO async/started-status
return path.
But two places told agents the wrong story:
1. `tools_list/system_tools.rs` description claimed "Returns immediately
with status: started. Poll get_test_result..." — agents read tool
descriptions for protocol semantics, so they followed this and burned
turns polling get_test_result.
2. `agents.toml` had been correctly saying it blocks, but my last commit
(776aad38) "fixed" it the wrong way based on a misread of the code.
Now both say: run_tests blocks server-side, returns the full result, do
not poll get_test_result. get_test_result remains for external observers
(UI checking on a job another caller started).
Reverts the prompt change in 776aad38 with the correct text.
Pre-f958f57e, run_tests blocked until completion. After that fix it became
a background-job starter, with get_test_result polling. The agent prompts
were never updated, so they still said "run_tests blocks until complete"
— and agents then waste turns polling.
Updated coder-1/2/3, coder-opus, and qa prompts to describe the actual
flow: run_tests is async, get_test_result blocks for up to 20s per call,
test suites typically take 1-5 minutes so expect a few polls.
Companion bug filed for bumping TEST_POLL_BLOCK_SECS so one poll covers
most test runs (root-cause fix; this commit is the prompt half).
Mergemaster needs more headroom for heavy merges (e.g. the slug-to-numeric
ID migration touching many files, or the FS-shadow deletion stories that
require fixing test setup across the codebase). 60 turns wasn't enough
for the larger ones.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Append to all coder/opus system_prompts: for delete/signature-change
refactors, delete first and let compiler errors guide the call-site
walk; do not pre-read files predicting breakage. Reduces exploration
overhead on mechanical refactors.
- Bump mergemaster inactivity_timeout_secs 300 -> 900 (15 min) so
mergemaster survives the 5-minute API rate-limit backoff. Without
this, mergemaster gets killed for inactivity while waiting on rate
limit clear, blocking all merges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stories like the broadcaster-consumer migrations legitimately need ~60
substantive turns (16 ProjectConfig initializer sites + main.rs subscriber
+ reading existing patterns to mirror). 50 was too tight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real cause of mergemaster turn-burnout: not merge conflicts, just polling
overhead. The server-side tool_merge_agent_work IS designed to block until
the merge completes, but the MCP client times out after 60s. The agent
then polls get_merge_status, with 30-60s sleeps between polls — each
poll cycle costs 2 turns (sleep + tool call). The merge takes 5-10 min
for a clean run, so the agent burns 10-20 turns just waiting.
Updated workflow tells mergemaster:
- 'operation timed out' is normal, do NOT immediately re-call (would queue
a duplicate merge)
- Use Bash sleep 300 (one 5-min wait = 1 turn) between polls
- Cap at 3 polls = 15 minutes total, plenty for any clean merge
- Reserve turns for actual fix-up work if gates fail
Combined with the earlier 30→60 turn / $5→$15 budget bump, this should
land any merge with no real conflicts in 3-5 turns total. Plenty of
headroom remaining for genuine gate-fix work.
30 turns is too tight for non-trivial merge gate failures. Combined with
the 3-retry cap, stories with any post-merge fix-up needed (cargo fmt
nits, slightly out-of-date diffs after parallel merges, etc.) get
permanently blocked.
This is a stopgap until story 668 lands (which will keep gates_passed=false
work in the coder stage entirely, so mergemaster only ever sees clean
diffs and the original 30 turns / $5 is fine again).
Agents now read specs/00_CONTEXT.md (what the project does) and
specs/tech/STACK.md (tech stack + source map) in addition to the
README. STACK.md rewritten to reflect current state — removes stale
references to biome, tauri-specta, .story_kit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Agents now know to add //! module comments and /// doc comments
to new public items, keeping documentation consistent with the
codebase-wide doc pass from story 542.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run_tests MCP tool now spawns tests in the background and returns
immediately. Agents poll get_test_result to check completion. This
prevents zombie cargo processes from holding the build lock when the
CLI times out the MCP call before tests finish.
Also fixes agent permission mode: acceptEdits replaces invalid
allowFullAutoEdit that was causing agents to crash-loop on spawn.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Agents were spending entire $5 budgets grepping the codebase and reading
git history instead of making fixes when the story already specifies
exact file paths and function names. Changed bug workflow from
"investigate root cause first" to "trust the story, act fast" — go
directly to the specified location when the story tells you where.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Agents were running script/test directly through the PTY, streaming
the full output of npm install, cargo clippy, cargo test, and frontend
builds into session logs. This tripled session log sizes (~200KB to
~600KB per session) and contributed to CLI SIGABRT crashes.
The run_tests MCP tool already runs script/test server-side and returns
a truncated JSON summary. Agents now use it exclusively.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>