The function was calling `read_content(story_id)`, which returns the
story's *description* text (e.g. "Bug: Coder exits code 0 with
uncommitted work — force a commit-only respawn..."). It then scanned
that for "Merge conflict" / "CONFLICT (content):", which never matched
because the description contains no merge output, so the
auto-spawn-mergemaster-on-content-conflict guard in
`pool/auto_assign/merge.rs` always saw `false` and skipped.
The actual gate output (where the merge runner stores the failure
message including conflict markers) lives at
`format!("{story_id}:gate_output")` — that's the key
`pipeline/advance/mod.rs:207` writes to. Read from there instead.
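A minimal sketch of the corrected lookup, assuming the content store
behaves like a string-keyed map. `Store`, `gate_output_key`, and
`has_content_conflict` are hypothetical names for illustration; only the
key shape and the two marker strings come from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the content store: a plain string-keyed map.
type Store = HashMap<String, String>;

/// Key under which the merge runner's failure output is stored.
fn gate_output_key(story_id: &str) -> String {
    format!("{story_id}:gate_output")
}

/// True when the stored gate output (not the story description) contains
/// one of the conflict markers the guard scans for.
fn has_content_conflict(store: &Store, story_id: &str) -> bool {
    store
        .get(&gate_output_key(story_id))
        .map(|out| out.contains("Merge conflict") || out.contains("CONFLICT (content):"))
        .unwrap_or(false)
}
```

A story with only a description entry yields `false`; the guard fires
only once a gate-output entry containing a marker exists.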
Witnessed: 954's merge hit a real `CONFLICT (content)` in
tests_regression.rs at 08:57:40, no mergemaster spawned, story stayed
in MergeFailure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5-line spread of `.unwrap_or_else(|| { ... })` in spawn.rs (from
the bd517f28 + 65416476 warm-resume work) doesn't match rustfmt's
preference for the short form. The formatting mismatch has been
blocking every merge gate since the warm-resume fix landed.
Follow-up to bd517f28. When --resume succeeds, claude-code restores the
full prior conversation — the agent already has its file reads, tool
results, and reasoning in context. Telling it to "read PLAN.md" forces
a redundant tool call to re-read a doc it wrote itself. PLAN.md is the
cold-start orientation doc (driven by AGENT.md); the resume -p prompt
should just be a continuation nudge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
respawns can actually warm
claude-code's --resume <session_id> requires either:
  a) a deferred-tool marker in the resumed session (i.e. the prior
     session paused mid-tool-call), or
  b) a non-empty -p prompt to continue the conversation with.
Watchdog-killed sessions have neither: the kill is asynchronous and
leaves no deferred-tool marker, and our harness was passing an empty
-p (because `resume_context_owned` is None for the common respawn
case). claude-code then aborts with:
  "Error: No deferred tool marker found in the resumed session.
   Either the session was not deferred, the marker is stale (tool
   already ran), or it exceeds the tail-scan window. Provide a
   prompt to continue the conversation."
The harness sees an aborted CLI with no session, prunes the recorded
session_id, and respawns cold — paying the full prompt-cache miss for
EVERY respawn. The new session_store logging (commit 0b50a624) made
this 100% legible: every warm spawn we observed went `mode=warm` →
crash → prune → `mode=cold` within a couple of seconds.
Fix: when resuming with no failure-context to send, default the -p
prompt to a brief "continue from PLAN.md" line. claude-code now has a
valid continuation message and warm-resume should actually work.
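The fallback can be sketched as below. This is an illustration under
assumptions, not the harness code: `DEFAULT_RESUME_PROMPT`,
`resume_prompt`, and `warm_args` are hypothetical names, and the exact
fallback wording is a guess at the "continue from PLAN.md" line.

```rust
/// Hypothetical fallback text; the commit only says it is a brief
/// "continue from PLAN.md" line.
const DEFAULT_RESUME_PROMPT: &str = "Continue from PLAN.md.";

/// Pick the -p prompt for a warm respawn: the failure context when there
/// is one, otherwise a non-empty continuation line so --resume has
/// something to continue with.
fn resume_prompt(resume_context: Option<&str>) -> &str {
    match resume_context {
        Some(ctx) if !ctx.trim().is_empty() => ctx,
        _ => DEFAULT_RESUME_PROMPT,
    }
}

/// Build the argument list for a warm spawn (flags taken from the text above).
fn warm_args(session_id: &str, resume_context: Option<&str>) -> Vec<String> {
    vec![
        "--resume".to_string(),
        session_id.to_string(),
        "-p".to_string(),
        resume_prompt(resume_context).to_string(),
    ]
}
```

The point of routing both cases through one helper is that the -p
argument can never be empty on the resume path.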
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Helps explain WHY each spawn goes warm vs cold. The existing
`spawn mode=warm|cold` log only shows the outcome at the spawn point —
to count where warmth is being lost, we need to see:
- when a session_id is recorded (and for which key),
- what every lookup returns (key + Some/None),
- when remove_sessions_for_story prunes (which is currently the only
  explicit cold-induction path beyond "first ever spawn").
After this lands a grep of "session_store" in the logs gives the full
warm-resume health picture: which (story,agent,model) triples have a
recorded session, which lookups are hitting it, and which prunes are
costing us a warm respawn.
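As a rough illustration of the three logging points, here is a sketch
of an in-memory store that emits one `session_store` line per record,
lookup, and prune. The `SessionStore` type, its `log` buffer, and the
message layout are assumptions made for this example; only
`remove_sessions_for_story` and the (story, agent, model) key come from
the text above.

```rust
use std::collections::HashMap;

/// (story, agent, model) triple used as the session key.
type Key = (String, String, String);

/// Hypothetical in-memory session store that collects one log line per
/// record, lookup, and prune instead of writing to a real logger.
#[derive(Default)]
struct SessionStore {
    sessions: HashMap<Key, String>,
    log: Vec<String>,
}

impl SessionStore {
    /// Remember a session id and log which key it was stored under.
    fn record(&mut self, key: Key, session_id: &str) {
        self.log
            .push(format!("session_store record key={key:?} session_id={session_id}"));
        self.sessions.insert(key, session_id.to_string());
    }

    /// Look up a session id and log the key together with Some/None.
    fn lookup(&mut self, key: &Key) -> Option<String> {
        let hit = self.sessions.get(key).cloned();
        self.log
            .push(format!("session_store lookup key={key:?} result={hit:?}"));
        hit
    }

    /// Drop every session for a story and log how many were pruned.
    fn remove_sessions_for_story(&mut self, story: &str) -> usize {
        let before = self.sessions.len();
        self.sessions.retain(|(s, _, _), _| s.as_str() != story);
        let pruned = before - self.sessions.len();
        self.log
            .push(format!("session_store prune story={story} removed={pruned}"));
        pruned
    }
}
```

Because every line shares the `session_store` prefix, a single grep
reconstructs the record, lookup, and prune history for any key.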
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without this, the only way to tell whether a watchdog-respawn went warm
(--resume <session_id>) vs cold (fresh CLI invocation) was to read the
args list of the existing "Spawning claude with args:" log and check
whether --resume was present. That made it impossible to count
cold-paths or distinguish "supposed-to-be-warm but resume_failed
fallback" from "first session" without source-diving.
This adds one slog! per spawn, prefixed `[agent:{sid}:{name}] spawn
mode=warm|cold session_id=...`, so grep "spawn mode=" answers it.
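A sketch of how such a line could be built, assuming the mode is
derived purely from whether a session id is available.
`spawn_mode_line` is a hypothetical helper, and rendering a missing id
as `none` is an assumption; the prefix and `spawn mode=` wording follow
the format quoted above.

```rust
/// Format the per-spawn log line. A recorded session id means the spawn
/// resumes warm; no id means a fresh (cold) CLI invocation.
fn spawn_mode_line(sid: &str, name: &str, session_id: Option<&str>) -> String {
    let mode = if session_id.is_some() { "warm" } else { "cold" };
    // Rendering a missing id as "none" is an assumption of this sketch.
    let id = session_id.unwrap_or("none");
    format!("[agent:{sid}:{name}] spawn mode={mode} session_id={id}")
}
```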
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 12 tests in `agents::pool::pipeline::merge::tests` share a
process-wide `server_start_time` (a `OnceLock` captured the first time
the merge subsystem runs) and the global merge-job CRDT log. Under
default cargo parallelism, at least one interleaving on the merge
gate's Docker scheduler makes
`stale_running_merge_job_is_cleared_and_retry_succeeds` flake:
`delete_merge_job` from one test lands while another is
mid-assertion. The flake could not be reproduced locally despite
many attempts.
Each test now acquires a poison-tolerant `std::sync::Mutex` at entry,
so the 12 tests run serially relative to each other while the rest of
the suite (2862 tests) stays parallel. Module-level
`#![allow(clippy::await_holding_lock)]` covers the deliberate sync
guard across `.await`s.
Targeted isolation — not a global `--test-threads=1`.
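The guard pattern can be sketched as follows. `SERIAL` and
`serial_guard` are hypothetical names; the real module may structure
this differently. Poison tolerance matters because a panicking test
would otherwise make every later `lock()` return an error.

```rust
use std::sync::{Mutex, MutexGuard};

/// One process-wide lock shared by every test in the module.
static SERIAL: Mutex<()> = Mutex::new(());

/// Acquire the lock even if an earlier test panicked while holding it.
/// A poisoned lock still serialises, and the `()` payload cannot be
/// left in an inconsistent state.
fn serial_guard() -> MutexGuard<'static, ()> {
    SERIAL
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}
```

Each test would start with `let _guard = serial_guard();` and hold the
guard until it returns, which is what makes the
`clippy::await_holding_lock` allow necessary in async tests.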
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five tests pin down the contract of `migrate_legacy_stage_strings`:
rewrite of all pre-934 directory-style strings to clean wire form,
the lossy `7_frozen` → backlog + frozen-flag collapse, no-op on
already-clean items, idempotence, and graceful behaviour before
CRDT init. A test-only `seed_with_raw_stage` helper bypasses the
boundary normalisers (which can't produce legacy strings) by writing
directly to the CRDT register — the same shape we'll see in real
pre-migration data.
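The contract those tests pin down can be illustrated with a small
stand-alone sketch. It is not `migrate_legacy_stage_strings` itself:
`Item`, `migrate_legacy_stage`, and `migrate_item` are hypothetical,
and the pairing of `2_current` with `coding` and `4_merge` with `merge`
is an assumption. Only the `7_frozen` collapse to backlog plus a frozen
flag is stated above.

```rust
/// Hypothetical item with a raw stage register and a frozen flag.
#[derive(Debug, Clone, PartialEq)]
struct Item {
    stage: String,
    frozen: bool,
}

/// Map a legacy stage string to (clean wire string, frozen flag).
/// Returns None for values that are already clean or unknown.
fn migrate_legacy_stage(raw: &str) -> Option<(&'static str, bool)> {
    match raw {
        "2_current" => Some(("coding", false)), // assumed pairing
        "4_merge" => Some(("merge", false)),    // assumed pairing
        // Lossy: the stage the item was frozen from is not recoverable.
        "7_frozen" => Some(("backlog", true)),
        _ => None,
    }
}

/// Rewrite one item in place; returns true when something changed.
fn migrate_item(item: &mut Item) -> bool {
    match migrate_legacy_stage(&item.stage) {
        Some((clean, frozen)) => {
            item.stage = clean.to_string();
            item.frozen = item.frozen || frozen;
            true
        }
        None => false,
    }
}
```

Running `migrate_item` a second time returns `false` and leaves the
item unchanged, which is the idempotence property in miniature.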
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The state machine's `Stage` enum becomes the source of truth for pipeline
state. Six stages of work land together:
1. Clean wire vocabulary (`coding`, `merge`, `merge_failure`, ...) replaces
   legacy directory-style strings (`2_current`, `4_merge`, ...) on the wire.
   `Stage::from_dir` accepted both during deployment; new writes always
   emit the clean form via `stage_dir_name`. Lexicographic `dir >= "5_done"`
   checks in lifecycle.rs become typed `matches!` checks since the new
   vocabulary doesn't sort in pipeline order.
2. `crdt_state::write_item` takes typed `&Stage`, serialising via
   `stage_dir_name` at the CRDT boundary. `#[cfg(test)] write_item_str`
   parses legacy strings for test fixtures.
3. `WorkItem::stage()` returns typed `crdt_state::Stage`; `stage_str()`
   is gone from the public API. Projection dispatches on the typed enum.
4. `frozen` becomes an orthogonal CRDT register. `Stage::Frozen` and
   `PipelineEvent::Freeze`/`Unfreeze` are removed; `transition_to_frozen`/
   `transition_to_unfrozen` set the flag directly without touching the
   stage register.
5. Watcher sweep and `tool_update_story`'s `blocked` setter route through
   `apply_transition` so the typed transition table validates every
   stage change. `update_story` gains a `frozen` field for symmetry.
6. One-shot startup migration rewrites pre-934 directory-style stage
   registers (and sets `frozen=true` on items previously at `7_frozen`).
   `Stage::from_dir` drops legacy aliases. The db boundary keeps a small
   normaliser so callers with legacy strings (MCP, tests) still work.
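Why the lexicographic checks in item 1 had to go can be shown with a
small sketch. The enum below is a cut-down stand-in, not the real
`Stage`; `stage_dir_name` borrows the name only, its signature here is
assumed, and the `done` wire string is an assumption.

```rust
/// Cut-down stand-in for the pipeline stage enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Coding,
    Merge,
    MergeFailure,
    Done,
}

/// Clean wire form of a stage.
fn stage_dir_name(stage: Stage) -> &'static str {
    match stage {
        Stage::Coding => "coding",
        Stage::Merge => "merge",
        Stage::MergeFailure => "merge_failure",
        Stage::Done => "done", // assumed wire name
    }
}

/// Old-style check: only meaningful for numerically prefixed strings.
fn is_done_lexicographic(dir: &str, done_marker: &str) -> bool {
    dir >= done_marker
}

/// Typed replacement: independent of how the wire strings sort.
fn is_done(stage: Stage) -> bool {
    matches!(stage, Stage::Done)
}
```

With the legacy strings, `"4_merge" >= "5_done"` is false as intended.
With the clean vocabulary, `"merge" >= "done"` is true, so a string
comparison would treat an item still in merge as done; the `matches!`
form has no such dependency on sort order.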
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final 929 sweep: every YAML-shaped helper is gone. No production code
parses or writes YAML front matter anywhere.
Surface removed:
- db/yaml_legacy.rs (FrontMatter/StoryMetadata structs, parse_front_matter,
set_front_matter_field, yaml_residue marker) — file deleted.
- ItemMeta::from_yaml — deleted; callers pass typed ItemMeta::named(...) or
ItemMeta::default() and use typed CRDT setters (set_depends_on,
set_blocked, set_retry_count, set_agent, set_qa_mode, set_review_hold,
set_item_type, set_epic, set_mergemaster_attempted) for the rest.
- write_coverage_baseline_to_story_file + read_coverage_percent_from_json —
the coverage_baseline YAML field was write-only (nothing read it back);
removed along with its caller in agent_tools/lifecycle.rs.
- update_story_in_file's generic `front_matter` HashMap parameter —
tool_update_story now intercepts every known field name and routes it
to a typed CRDT setter; unknown keys are rejected with an explicit error
pointing at the typed setters. The function only takes user_story /
description sections now.
- All 117 ItemMeta::from_yaml callsites migrated. Where tests previously
passed a YAML-shaped content blob and relied on the helper to extract
name/depends_on/blocked/agent/qa, they now pass:
    write_item_with_content(id, stage, content, ItemMeta::named("Foo"))
    crate::crdt_state::set_depends_on(id, &[...])   // when needed
    crate::crdt_state::set_blocked(id, true)        // when needed
    crate::crdt_state::set_agent(id, Some("..."))   // when needed
- write_story_content + write_story_file (test helper) now take an
explicit `name: Option<&str>` instead of parsing it from content.
- db::ops::move_item_stage stopped re-parsing YAML on every stage
transition; metadata is read straight from the CRDT view when mirroring
the row into SQLite.
New CRDT setters added for symmetry:
- crdt_state::set_name (mirrors set_agent — explicit name updates).
cargo fmt --check, clippy --all-targets -- -D warnings, and the
2830-test suite all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every stage transition was reading the content body's YAML front matter to
derive name/agent/blocked/depends_on, then writing those values straight
back into the CRDT registers — but the CRDT was already the source of
truth for all of these fields. The reparse was at best a no-op and at
worst could regress the CRDT to stale YAML values during transitions on
items whose YAML was out of date.
Now move_item_stage:
- writes the new stage to the CRDT with None for every other field, so
  write_item leaves existing registers untouched.
- reads name/agent/blocked/depends_on back from the CRDT view when
  mirroring the row into the SQLite shadow table (still needed because
  the shadow stores a denormalised snapshot for read-side queries).
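The "None leaves the register untouched" behaviour can be sketched like
this. `Registers` and both function signatures are simplified
assumptions for illustration; the real `write_item` and
`move_item_stage` take different arguments.

```rust
/// Hypothetical flattened view of one item's CRDT registers.
#[derive(Debug, Default, Clone, PartialEq)]
struct Registers {
    stage: String,
    name: Option<String>,
    agent: Option<String>,
    blocked: bool,
}

/// Write the stage unconditionally; every other field is written only
/// when the caller passes Some(..).
fn write_item(
    regs: &mut Registers,
    stage: &str,
    name: Option<&str>,
    agent: Option<&str>,
    blocked: Option<bool>,
) {
    regs.stage = stage.to_string();
    if let Some(n) = name {
        regs.name = Some(n.to_string());
    }
    if let Some(a) = agent {
        regs.agent = Some(a.to_string());
    }
    if let Some(b) = blocked {
        regs.blocked = b;
    }
}

/// A stage move touches the stage register and nothing else.
fn move_item_stage(regs: &mut Registers, new_stage: &str) {
    write_item(regs, new_stage, None, None, None);
}
```

Since nothing is re-derived from the content body, stale YAML can no
longer overwrite a newer register value during a transition.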
The yaml_legacy::parse_front_matter import is gone from db/ops.rs; the
only path still using it on the production side is ItemMeta::from_yaml,
which is a caller convenience (mostly used in test fixtures).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After 932 (review_hold register) and 933 (item_type + epic registers), the
remaining production yaml_legacy callers all had typed CRDT equivalents.
Migrated:
- agents/lifecycle.rs:
  - transition_to_merge_failure writes to MergeJob.error CRDT entry instead
    of YAML body. The legacy `merge_failure: "..."` front-matter write is gone.
  - reject_story_from_qa inlines the QA-rejection notes append; no longer
    needs yaml_legacy::write_rejection_notes_to_content.
  - fields_to_clear_transform helper deleted along with all five callers;
    blocked/retry_count/merge_failure are typed CRDT fields now, so clearing
    the equivalent YAML keys is redundant.
- http/workflow/pipeline.rs:
  - load_pipeline_state reads merge_failure from MergeJob.error (mirrors
    status_tools.rs).
  - validate_story_dirs checks the typed CRDT `name` register instead of
    parsing YAML front matter.
- http/mcp/status_tools.rs: review_hold reads the typed CRDT register
  (yaml_residue wrap was the last one in this file).
- http/mcp/story_tools/criteria.rs: story_name reads from CRDT.
- service/agents/mod.rs::get_work_item_content: name/agent come from CRDT.
- service/notifications/io/mod.rs::read_story_name: same.
- http/workflow/bug_ops/{bug,refactor}.rs: name-fallback paths drop YAML
  parsing in favour of the CRDT-derived item.name.
Dead helpers removed from db/yaml_legacy.rs:
yaml_residue, write_merge_failure_in_content, write_rejection_notes_to_content,
clear_front_matter_field_in_content, write_review_hold_in_content,
clear_front_matter_field, write_review_hold (the last four shipped in 932).
Remaining surface: FrontMatter / StoryMetadata structs, parse_front_matter,
set_front_matter_field — kept for `coverage_baseline` writes via
test_results.rs and the generic update_story front_matter escape hatch.
Test fixtures rewritten to seed the CRDT register instead of relying on
YAML parsing during write_item_with_content:
- has_review_hold_returns_* tests
- item_type_from_id_uses_crdt_register_for_numeric_ids
- tool_list_epics_shows_member_rollup
- get_work_item_content (both copies — http/agents + service/agents)
- validate_story_dirs_missing_name_in_crdt
- server_side_merge_*_sets_merge_failure (assert MergeJob.error, not YAML)
cargo fmt --check, clippy --all-targets -- -D warnings, and the
2856-test suite all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the YAML-only `type: epic` / `epic: <id>` front-matter fields with
typed CRDT registers on PipelineItemCrdt. The epic-mechanism MCP tools
(`tool_list_epics`, `tool_show_epic`), the epic-context injection in agent
spawn, and the type-classifier helpers (`item_type_from_id`, `is_bug_item`,
`is_refactor_item`) now all read from the CRDT.
Schema:
- PipelineItemCrdt: `item_type: LwwRegisterCrdt<String>` and
`epic: LwwRegisterCrdt<String>` registers.
- WorkItem: typed `item_type()` and `epic()` accessors returning `Option<&str>`.
- crdt_state::set_item_type(story_id, Option<&str>) and
crdt_state::set_epic(story_id, Option<&str>) typed setters.
Write paths populate the new registers:
- create_story_file / create_bug_file / create_spike_file /
create_refactor_file / create_epic_file — each calls set_item_type after
write_story_content.
- tool_update_story intercepts `epic` and `type` fields and routes them to
the typed setters (same pattern as qa / depends_on).
Read paths migrated off yaml_legacy:
- http/mcp/story_tools/epic.rs: tool_list_epics + tool_show_epic.
- agents/lifecycle.rs::item_type_from_id (numeric-only IDs).
- agents/pool/start/spawn.rs epic-context injection.
- http/workflow/bug_ops/bug.rs::is_bug_item, refactor.rs::is_refactor_item.
- http/workflow/pipeline.rs::load_pipeline_state — review_hold/qa/epic_id
all come from the CRDT now; only merge_failure is still YAML (sweep in
929 stage 10).
All `yaml_residue(...)` wraps for item_type / epic are removed; the
remaining residue marker doc no longer references 933.
cargo fmt --check, clippy --all-targets -- -D warnings, and the 2857-test
suite all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
review_hold is now a typed bool register on PipelineItemCrdt alongside
blocked / mergemaster_attempted. Exposed via the typed setter
`crdt_state::set_review_hold(story_id, value)` and the
`WorkItem::review_hold()` accessor. Replaces the legacy
`review_hold: true` YAML front-matter field.
Migrated callers:
- http/mcp/qa_tools.rs::tool_approve_qa — clear via set_review_hold(false)
- agents/lifecycle.rs::reject_story_from_qa — clear via set_review_hold(false)
- agents/pool/pipeline/advance/helpers.rs::write_review_hold_to_store
— set via set_review_hold(true), no more content rewrite
- agents/pool/auto_assign/reconcile.rs (two callsites) — set via
set_review_hold(true) instead of FS YAML write
- agents/pool/auto_assign/story_checks.rs::has_review_hold — reads the
typed register instead of conflating with Stage::Frozen (real bug fix:
the legacy implementation returned `stage.is_frozen()`, which made
the auto-assigner treat *every* held-for-review item as frozen even
when it wasn't actually parked at the freeze stage).
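The has_review_hold conflation can be illustrated with a reduced
sketch. The `Stage` and `WorkItem` types below are hypothetical
two-field stand-ins, and `has_review_hold_legacy` is a reconstruction
of the described behaviour rather than the removed code.

```rust
/// Reduced stand-in for the stage enum at the time of this change.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Qa,
    Frozen,
}

/// Hypothetical item view: a stage plus the new typed register.
struct WorkItem {
    stage: Stage,
    review_hold: bool,
}

/// Legacy behaviour as described: the answer came from the stage.
fn has_review_hold_legacy(item: &WorkItem) -> bool {
    item.stage == Stage::Frozen
}

/// Fixed behaviour: the answer comes from the typed register only.
fn has_review_hold(item: &WorkItem) -> bool {
    item.review_hold
}
```

Under this sketch the two functions disagree in both directions: an
item held for review in Qa is missed by the legacy check, and a frozen
item with no hold is reported as held.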
Dead yaml_legacy helpers removed:
- write_review_hold(path), write_review_hold_in_content(content)
- clear_front_matter_field(path) — last caller was the qa_tools wrap
The yaml_residue marker doc now only mentions 933; the 932 line is gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
io/watcher and io/watcher/sweep were already CRDT-only — the watcher only
watches .huskies/{project,agents}.toml, work-item events come from CRDT
subscribe — so the remaining FS shadow reader was the bug-503 archived-dep
warning in story_tools/story/create.rs (via check_archived_deps_from_list,
which scanned .huskies/work/6_archived/). Migrate that call to the
CRDT-direct `dep_is_archived_crdt`. Drop the now-unused helper and the
four dead imports in bug/spike/refactor/criteria.rs that referenced it.
io/story_metadata/deps.rs is reduced to a module-level comment pointing
callers at the crdt_state helpers; nothing in io/ now scans the FS shadow
tree.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The startup reconciler still pokes review_hold into the on-disk story file
when promoting human-QA items, because no CRDT register exists yet for
review_hold (filed as sub-story 932). The two write-side callsites in
reconcile.rs were the last bare yaml_legacy:: calls in production write
paths; wrap them in yaml_residue so the gap shows up in
`grep -rn yaml_residue` like the other 932/933 markers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
transition_to_frozen and transition_to_unfrozen no longer touch YAML; both
now just call apply_transition with no content_transform. Pairs with the
stage-6 read-side change in projection.rs.
Story 934 will obviate the entire resume_to mechanism by making frozen a
flag orthogonal to Stage (story stays in its current Stage when frozen).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
projection::project_stage was the last yaml_legacy reader in pipeline_state.
Drop the read_content+parse_front_matter detour for the "7_frozen" case and
always default resume_to to Stage::Coding. The YAML write side in apply.rs
goes in stage 7.
Story 934 (sibling refactor) will replace Stage::Frozen-with-payload with a
frozen flag orthogonal to Stage, so a story frozen in Qa stays in Stage::Qa
rather than encoding a "where to resume" target. After 934 lands the
resume_to payload disappears entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrate the last three callers of the FS-scanning dependency helpers to the
CRDT-direct equivalents and delete the dead helpers:
- agents/pool/auto_assign/story_checks.rs: has_unmet_dependencies and
check_archived_dependencies now wrap check_unmet_deps_crdt /
check_archived_deps_crdt directly. Tests rewritten to seed the CRDT.
- http/mcp/story_tools/story/update.rs: bug-503 archived-dep warning now
reads from CRDT instead of scanning 6_archived.
- agents/pool/pipeline/advance/helpers.rs: resolve_qa_mode_from_store is
CRDT-only (the FS fallback for content-store-empty stories is gone).
- io/story_metadata/parser.rs: resolve_qa_mode_from_content removed.
- io/story_metadata/deps.rs: check_unmet_deps and dep_is_done deleted,
along with the unused check_unmet_deps_from_list helper.
- io/story_metadata/mod.rs: re-exports trimmed accordingly.
check_archived_deps_from_list survives because story-creation still calls
it before the CRDT entry exists (used from story_tools/story/create.rs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Read-side migrations:
- agents/pool/auto_assign/backlog.rs: depends_on check now reads from
WorkItem.depends_on() instead of parse_front_matter.
- agents/pool/auto_assign/story_checks.rs: read_story_front_matter_agent
drops its YAML fallback — post-891 the CRDT entry is reliable, and
removing the fallback makes the contract honest. The now-unused
read_story_contents helper goes too.
- agents/pool/start/validation.rs: same shape — YAML fallback removed,
CRDT register is the only source for agent pinning.
- agents/pool/start/spawn.rs: epic-context injection wraps the
parse_front_matter call in `yaml_residue(...)` since `meta.epic` has no
CRDT analog (sub-story 933).
- agents/lifecycle.rs: item_type_from_id (numeric-only ID path) wraps its
parse_front_matter in `yaml_residue(...)` for the same reason (933).
The write-side `fields_to_clear_transform` calls in lifecycle.rs are
left for stage 8, when FS-shadow writes are deleted wholesale.
Test fix:
- start_agent_returns_error_when_front_matter_agent_busy now seeds the
CRDT entry (write_item with agent="coder-opus") instead of relying on
parse_front_matter reading the YAML on disk.
Filed earlier:
- 932 (review_hold register) — note: this turns out to be a real class-1
bug: write_review_hold_to_store still writes YAML but has_review_hold
reads Stage::Frozen, so the write goes into a void. 932 is the correct
fix.
All 2861 tests pass; fmt + clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three MCP files touched:
- status_tools.rs (story-status JSON dump): every field with a CRDT
equivalent now reads from WorkItem (name, agent, blocked, qa_mode,
retry_count, depends_on, claimed_by, claimed_at) or MergeJob.error
(merge_failure detail). One field — review_hold — has no CRDT register
yet (sub-story 932) and is wrapped in `yaml_residue(parse_front_matter(...))`
so the gap is visible at every code-search.
- qa_tools.rs:
  • tool_approve_qa wraps the legacy `clear_front_matter_field("review_hold")`
    write in `yaml_residue(...)` pending sub-story 932.
  • tool_reject_qa now reads the agent name from the CRDT WorkItem instead
    of parsing front matter on disk.
- story_tools/epic.rs: the entire epic feature (item_type, epic link)
has no CRDT analog — sub-story 933. Every parse_front_matter call here
is wrapped in `yaml_residue(...)`.
Also: new identity wrapper `db::yaml_legacy::yaml_residue<T>(v: T) -> T`
that marks a yaml_legacy callsite blocked on a CRDT-register gap. Pure
identity at runtime; the distinctive name makes the residue grep-findable
(`grep -rn yaml_residue`). Sub-stories 932 and 933 enumerate the gaps.
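A minimal sketch of the marker, following the signature given above.
`legacy_review_hold` is a hypothetical stand-in for a YAML read that
has no CRDT equivalent yet; it is not the project's parser.

```rust
/// Identity marker: returns its argument unchanged. Its only job is to
/// make the wrapped callsite findable with `grep -rn yaml_residue`.
#[inline]
pub fn yaml_residue<T>(v: T) -> T {
    v
}

/// Hypothetical legacy read: scans front-matter text for the flag.
fn legacy_review_hold(front_matter: &str) -> bool {
    front_matter
        .lines()
        .any(|line| line.trim() == "review_hold: true")
}
```

A callsite would then read `yaml_residue(legacy_review_hold(text))`.
The wrapper is generic, so it adds no type constraints and can be
deleted mechanically once the matching register lands.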
Filed:
- 932: Add CRDT register for review_hold
- 933: Add CRDT registers for the epic mechanism
All 2854 tests pass; fmt + clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
delete.rs, start.rs, assign.rs all looked up the story name by reading
the content from disk/store and parsing the front matter. Replaced with
`crdt_state::read_item(&story_id).and_then(|w| w.name())`. Each callsite's
fallback chain ("CRDT → content store → filesystem") still locates the
story_id; only the name extraction moved off YAML.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each chat command that previously read parse_front_matter for story
metadata (name, agent, depends_on, blocked, retry_count, merge_failure,
qa_mode) now reads from the typed CRDT API:
- WorkItem (via crdt_state::read_item) for pipeline-item registers.
- MergeJobView (via crdt_state::read_merge_job) for the merge failure
detail text, which has its own LWW-map CRDT entry.
Files migrated: depends.rs, freeze.rs, move_story.rs, overview.rs,
status/render.rs, triage.rs, unblock.rs, unreleased.rs.
unblock.rs: also removes the legacy front-matter cleanup branch that
fired when the typed Blocked→Coding transition failed. Post-929 there
is no YAML on disk to clean; the fallback now just resets retry_count
in the CRDT.
triage.rs: drops the YAML-only `review_hold` and `coverage_baseline`
fields from the dump. These have no CRDT register and were never
load-bearing on the triage output; if needed later, add a CRDT register
and surface it back.
Tests:
- The three status/render merge-failure rendering tests now seed a
MergeJob CRDT entry via write_merge_job instead of writing YAML.
- The unblock test that asserted YAML cleanup on disk is now an assertion
on the CRDT registers (blocked=false, retry_count=0). Also re-seeded
in `2_blocked` stage so the typed Blocked → Coding transition actually
fires (not the fallback path).
All 2855 tests pass; fmt clean; clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>