Mergemaster needs more headroom for heavy merges (e.g. the slug-to-numeric
ID migration touching many files, or the FS-shadow deletion stories that
require fixing test setup across the codebase). 60 turns wasn't enough
for the larger ones.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Append to all coder/opus system_prompts: for delete/signature-change
refactors, delete first and let compiler errors guide the call-site
walk; do not pre-read files predicting breakage. Reduces exploration
overhead on mechanical refactors.
- Bump mergemaster inactivity_timeout_secs 300 -> 900 (15 min) so
mergemaster survives the 5-minute API rate-limit backoff. Without
this, mergemaster gets killed for inactivity while waiting for the
rate limit to clear, blocking all merges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stories like the broadcaster-consumer migrations legitimately need ~60
substantive turns (16 ProjectConfig initializer sites + main.rs subscriber
+ reading existing patterns to mirror). 50 was too tight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full inventory of all gateway and project server endpoints with caller,
purpose, latency/freshness/durability requirements. Classifies each as
write/read/external-webhook/frontend-asset. Maps write endpoints to
target CRDT collections, proposes RPC frame shapes for read endpoints,
drafts the unsigned read-RPC protocol (envelope, correlation IDs, TTL,
error codes, peer-offline handling), lists in-memory state needing CRDT
migration with proposed types, and defines a wave-ordered migration plan
with explicit dependencies (story 665 Ed25519 auth as the blocker for
write migrations).
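As a rough illustration of what a read-RPC envelope with correlation IDs, a TTL,
and error codes could look like, here is a minimal sketch. Every type, field, and
function name below (ReadRequest, ReadResponse, ReadRpcError, handle) is
hypothetical and is not taken from the inventory document itself.

```rust
/// Hypothetical error codes for an unsigned read RPC.
#[derive(Debug, PartialEq, Eq)]
pub enum ReadRpcError {
    /// The request outlived its TTL before it was handled.
    Expired,
    /// The target peer is not currently connected.
    PeerOffline,
}

/// Hypothetical request envelope.
#[derive(Debug, Clone)]
pub struct ReadRequest {
    pub correlation_id: u64,
    pub sent_at_ms: u64,
    pub ttl_ms: u64,
    pub method: String,
}

/// Hypothetical response envelope; it echoes the request's correlation ID
/// so the caller can match replies to outstanding requests.
#[derive(Debug, PartialEq, Eq)]
pub struct ReadResponse {
    pub correlation_id: u64,
    pub result: Result<String, ReadRpcError>,
}

/// Applies the peer-offline and TTL checks, then produces a placeholder payload.
pub fn handle(req: &ReadRequest, now_ms: u64, peer_online: bool) -> ReadResponse {
    let result = if !peer_online {
        Err(ReadRpcError::PeerOffline)
    } else if now_ms.saturating_sub(req.sent_at_ms) > req.ttl_ms {
        Err(ReadRpcError::Expired)
    } else {
        Ok(format!("handled {}", req.method))
    };
    ReadResponse { correlation_id: req.correlation_id, result }
}
```

The actual frame shapes, error set, and offline behaviour are whatever the
inventory proposes; this only shows how the listed pieces fit together.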
Master had 8 uncommitted single-line whitespace changes (blank-line trimming
in test mod headers, etc.) left over from a previous mergemaster cargo-fmt
run that didn't get committed. Each subsequent merge attempt hit:
cherry-pick failed: 'Your local changes to the following files would be
overwritten by merge. Please commit your changes or stash them.'
So merges had been silently un-mergeable for the last several rounds —
mergemaster correctly reported the issue but had no way to fix master's
own state from inside the merge_workspace.
Files affected (all whitespace-only):
- chat/transport/matrix/bot/messages/{handle_message,on_room_message}.rs
- chat/transport/slack/commands/{llm,mod}.rs
- http/mcp/agent_tools/worktree.rs
- http/workflow/story_ops/{create,criterion,update}.rs
cargo clippy --all-targets -- -D warnings: clean
cargo fmt --all --check: clean
2636 tests pass.
Real cause of mergemaster turn-burnout: not merge conflicts, just polling
overhead. The server-side tool_merge_agent_work IS designed to block until
the merge completes, but the MCP client times out after 60s. The agent
then polls get_merge_status, with 30-60s sleeps between polls — each
poll cycle costs 2 turns (sleep + tool call). The merge takes 5-10 min
for a clean run, so the agent burns 10-20 turns just waiting.
Updated workflow tells mergemaster:
- 'operation timed out' is normal, do NOT immediately re-call (would queue
a duplicate merge)
- Use Bash sleep 300 (one 5-min wait = 1 turn) between polls
- Cap at 3 polls = 15 minutes total, plenty for any clean merge
- Reserve turns for actual fix-up work if gates fail
Combined with the earlier 30→60 turn / $5→$15 budget bump, this should
land any merge with no real conflicts in 3-5 turns total. Plenty of
headroom remaining for genuine gate-fix work.
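The turn arithmetic above can be checked with a small sketch. The helper name
waiting_turns is hypothetical, and the assumption that every poll cycle costs
2 turns (sleep + status call) under both the old and new workflow is a reading
of this message, not something measured.

```rust
/// Turns spent purely on waiting for a merge that takes `merge_secs`,
/// when the agent sleeps `sleep_secs` between polls and each poll cycle
/// costs `turns_per_cycle` turns (assumed: one sleep plus one status call).
pub fn waiting_turns(merge_secs: u64, sleep_secs: u64, turns_per_cycle: u64) -> u64 {
    // Ceiling division: number of poll cycles needed to cover the merge.
    let cycles = (merge_secs + sleep_secs - 1) / sleep_secs;
    cycles * turns_per_cycle
}
```

With 60s sleeps, a 5-10 minute merge costs waiting_turns(300, 60, 2) = 10 to
waiting_turns(600, 60, 2) = 20 turns. With sleep 300 the same merges cost 2 to
4 turns, which together with the initial merge call gives the 3-5 total quoted
above.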
30 turns is too tight for non-trivial merge gate failures. Combined with
the 3-retry cap, stories that need any post-merge fix-up (cargo fmt
nits, slightly out-of-date diffs after parallel merges, etc.) get
permanently blocked.
This is a stopgap until story 668 lands (which will keep gates_passed=false
work in the coder stage entirely, so mergemaster only ever sees clean
diffs and the original 30 turns / $5 is fine again).
Replaces the test-time GLOBAL_STATE_LOCK approach (which was just disguised
single-threading) with proper test isolation: each test thread gets its own
SnapshotState via a thread_local!.
Pattern matches crdt_state::CRDT_STATE_TL: production keeps the global
OnceLock; tests get a per-thread OnceLock that's accessed through a
snapshot_state() helper. The unsafe `&*ptr` cast to 'static is sound because
the thread_local lives as long as the test thread that owns it.
The race: latest_snapshot_available_after_compaction captured at_seq from a
freshly-generated snapshot, then asserted it equalled SNAPSHOT_STATE's
latest.at_seq. With shared SNAPSHOT_STATE, another test thread's
apply_compaction could overwrite latest_snapshot between capture and read.
Per-thread state eliminates the race at its source.
ALL_OPS / VECTOR_CLOCK stay shared — the tests don't assert on absolute
counts, only on (this-thread's at_seq) == (this-thread's latest.at_seq).
5 consecutive default-parallel `cargo test --bin huskies` runs all green
at 2636/2636.
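A minimal sketch of the per-thread state pattern, for illustration only. It
swaps the unsafe `&*ptr` cast for Box::leak (safe, at the cost of one small
leaked allocation per thread) and selects the state with a bool instead of a
cfg(test) switch. The field latest_at_seq, the per_thread parameter, and
record_and_read are hypothetical names, not the real huskies code.

```rust
use std::sync::{Mutex, OnceLock};

/// Simplified stand-in for the real SnapshotState.
#[derive(Debug, Default)]
pub struct SnapshotState {
    pub latest_at_seq: Option<u64>,
}

/// Process-wide state, as production would use it.
static GLOBAL_STATE: OnceLock<Mutex<SnapshotState>> = OnceLock::new();

thread_local! {
    /// One independent state per thread. Box::leak hands out a `&'static`
    /// reference without `unsafe`; the allocation is never freed.
    static THREAD_STATE: &'static Mutex<SnapshotState> =
        Box::leak(Box::new(Mutex::new(SnapshotState::default())));
}

/// Returns the per-thread state when `per_thread` is true, else the global one.
pub fn snapshot_state(per_thread: bool) -> &'static Mutex<SnapshotState> {
    if per_thread {
        THREAD_STATE.with(|s| *s)
    } else {
        GLOBAL_STATE.get_or_init(|| Mutex::new(SnapshotState::default()))
    }
}

/// Writes `at_seq` into this thread's state and reads it back.
pub fn record_and_read(at_seq: u64) -> Option<u64> {
    let state = snapshot_state(true);
    state.lock().unwrap().latest_at_seq = Some(at_seq);
    let latest = state.lock().unwrap().latest_at_seq;
    latest
}
```

Because each thread owns its own SnapshotState, a write on one thread cannot
land between another thread's capture and read, which is the interleaving
described above.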
The crdt_snapshot tests share three global statics:
- SNAPSHOT_STATE (latest_snapshot, pending_acks, pending_at_seq) — coordination state
- crdt_state::ALL_OPS / VECTOR_CLOCK — op journal + vector clock
Only the CRDT itself is thread-local (init_for_test); these other globals
are shared across test threads. Under default cargo test parallelism, two tests
running concurrently interleave their op writes and snapshot generation, so
assertions like assert_eq!(at_seq, 4) fail with at_seq=5 (the other thread's
ops snuck in).
Add a module-level GLOBAL_STATE_LOCK that all 17 affected tests grab at the
top. unwrap_or_else(|e| e.into_inner()) handles the case where a prior test
panicked while holding the lock (poisoned).
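The poisoned-lock handling can be sketched as follows. The lock name comes from
this message; the helper lock_global_state is a hypothetical wrapper added for
illustration, and the real tests may call lock() inline instead.

```rust
use std::sync::{Mutex, MutexGuard};

/// Module-level lock that serializes tests touching the shared statics.
pub static GLOBAL_STATE_LOCK: Mutex<()> = Mutex::new(());

/// Acquires the lock even if an earlier holder panicked.
/// A poisoned Mutex returns Err(PoisonError); into_inner() recovers the guard,
/// which is acceptable here because the guarded value is just `()`.
pub fn lock_global_state() -> MutexGuard<'static, ()> {
    GLOBAL_STATE_LOCK.lock().unwrap_or_else(|e| e.into_inner())
}
```

Each affected test would hold the returned guard for its whole body, for
example `let _guard = lock_global_state();` as its first statement.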
Fixes bug 669: these two tests were the silent killer behind every agent's
script/test failure (see also bug 668, which advanced agents to merge despite
gates_passed=false; that compounded this by sending worktrees with failing
tests to mergemaster).
All 2636 tests now pass under default parallel execution (no --test-threads=1
needed).
Closes #669.