The 5-line spread of `.unwrap_or_else(|| { ... })` in spawn.rs (from
the bd517f28 + 65416476 warm-resume work) doesn't match rustfmt's
preference for the short form. Was blocking every merge gate since
the warm-resume fix landed.
Follow-up to bd517f28. When --resume succeeds, claude-code restores the
full prior conversation — the agent already has its file reads, tool
results, and reasoning in context. Telling it to "read PLAN.md" forces
a redundant tool call to re-read a doc it wrote itself. PLAN.md is the
cold-start orientation doc (driven by AGENT.md); the resume -p prompt
should just be a continuation nudge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
respawns can actually warm
claude-code's --resume <session_id> requires either:
a) a deferred-tool marker in the resumed session (i.e. the prior
session paused mid-tool-call), or
b) a non-empty -p prompt to continue the conversation with.
Watchdog-killed sessions have neither: the kill is asynchronous and
leaves no deferred-tool marker, and our harness was passing an empty
-p (because `resume_context_owned` is None for the common respawn
case). claude-code then aborts with:
"Error: No deferred tool marker found in the resumed session.
Either the session was not deferred, the marker is stale (tool
already ran), or it exceeds the tail-scan window. Provide a
prompt to continue the conversation."
The harness sees an aborted CLI with no session, prunes the recorded
session_id, and respawns cold — paying the full prompt-cache miss for
EVERY respawn. The new session_store logging (commit 0b50a624) made
this 100% legible: every warm spawn we observed went `mode=warm` →
crash → prune → `mode=cold` within a couple of seconds.
Fix: when resuming with no failure-context to send, default the -p
prompt to a brief "continue from PLAN.md" line. claude-code now has a
valid continuation message and warm-resume should actually work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Helps explain WHY each spawn goes warm vs cold. The existing
`spawn mode=warm|cold` log only shows the outcome at the spawn point —
to count where warmth is being lost, we need to see:
- when a session_id is recorded (and for which key),
- what every lookup returns (key + Some/None),
- when remove_sessions_for_story prunes (which is currently the only
explicit cold-induction path beyond "first ever spawn").
After this lands a grep of "session_store" in the logs gives the full
warm-resume health picture: which (story,agent,model) triples have a
recorded session, which lookups are hitting it, and which prunes are
costing us a warm respawn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PLAN.md is supposed to be a per-worktree planning file written by coder
agents and gitignored at the project root (.gitignore line 21, added by
952). But two recent merges shipped it anyway (945, 919) because the
squash-merge pipeline doesn't filter gitignored paths from the feature
branch diff — and once tracked, .gitignore stops protecting it.
This commit just removes it from master's tree. The structural fix
(squash-merge respects root .gitignore) is filed as a separate bug. If
an in-flight feature branch commits PLAN.md before that lands, this
file will be back on master at the next merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without this, the only way to tell whether a watchdog-respawn went warm
(--resume <session_id>) vs cold (fresh CLI invocation) was to read the
args list of the existing "Spawning claude with args:" log and check
whether --resume was present. That made it impossible to count
cold-paths or distinguish "supposed-to-be-warm but resume_failed
fallback" from "first session" without source-diving.
This adds one slog! per spawn, prefixed `[agent:{sid}:{name}] spawn
mode=warm|cold session_id=...`, so grep "spawn mode=" answers it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 12 tests in `agents::pool::pipeline::merge::tests` share a
process-wide `server_start_time` (a `OnceLock` captured the first time
the merge subsystem runs) and the global merge-job CRDT log. Default
cargo parallelism has caught at least one interleaving on the merge
gate's Docker scheduler where `stale_running_merge_job_is_cleared_and_retry_succeeds`
flakes — `delete_merge_job` from one test lands while another is mid-
assertion. Couldn't reproduce locally despite many tries.
Each test now acquires a poison-tolerant `std::sync::Mutex` at entry,
so the 12 tests run serially relative to each other while the rest of
the suite (2862 tests) stays parallel. Module-level
`#![allow(clippy::await_holding_lock)]` covers the deliberate sync
guard across `.await`s.
Targeted isolation — not a global `--test-threads=1`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five tests pin down the contract of `migrate_legacy_stage_strings`:
rewrite of all pre-934 directory-style strings to clean wire form,
the lossy `7_frozen` → backlog + frozen-flag collapse, no-op on
already-clean items, idempotence, and graceful behaviour before
CRDT init. A test-only `seed_with_raw_stage` helper bypasses the
boundary normalisers (which can't produce legacy strings) by writing
directly to the CRDT register — the same shape we'll see in real
pre-migration data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The state machine's `Stage` enum becomes the source of truth for pipeline
state. Six stages of work land together:
1. Clean wire vocabulary (`coding`, `merge`, `merge_failure`, ...) replaces
legacy directory-style strings (`2_current`, `4_merge`, ...) on the wire.
`Stage::from_dir` accepted both during deployment; new writes always
emit the clean form via `stage_dir_name`. Lexicographic `dir >= "5_done"`
checks in lifecycle.rs become typed `matches!` checks since the new
vocabulary doesn't sort in pipeline order.
2. `crdt_state::write_item` takes typed `&Stage`, serialising via
`stage_dir_name` at the CRDT boundary. `#[cfg(test)] write_item_str`
parses legacy strings for test fixtures.
3. `WorkItem::stage()` returns typed `crdt_state::Stage`; `stage_str()`
is gone from the public API. Projection dispatches on the typed enum.
4. `frozen` becomes an orthogonal CRDT register. `Stage::Frozen` and
`PipelineEvent::Freeze`/`Unfreeze` are removed; `transition_to_frozen`/
`unfrozen` set the flag directly without touching the stage register.
5. Watcher sweep and `tool_update_story`'s `blocked` setter route through
`apply_transition` so the typed transition table validates every
stage change. `update_story` gains a `frozen` field for symmetry.
6. One-shot startup migration rewrites pre-934 directory-style stage
registers (and sets `frozen=true` on items previously at `7_frozen`).
`Stage::from_dir` drops legacy aliases. The db boundary keeps a small
normaliser so callers with legacy strings (MCP, tests) still work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final 929 sweep: every YAML-shaped helper is gone. No production code
parses or writes YAML front matter anywhere.
Surface removed:
- db/yaml_legacy.rs (FrontMatter/StoryMetadata structs, parse_front_matter,
set_front_matter_field, yaml_residue marker) — file deleted.
- ItemMeta::from_yaml — deleted; callers pass typed ItemMeta::named(...) or
ItemMeta::default() and use typed CRDT setters (set_depends_on,
set_blocked, set_retry_count, set_agent, set_qa_mode, set_review_hold,
set_item_type, set_epic, set_mergemaster_attempted) for the rest.
- write_coverage_baseline_to_story_file + read_coverage_percent_from_json —
the coverage_baseline YAML field was write-only (nothing read it back);
removed along with its caller in agent_tools/lifecycle.rs.
- update_story_in_file's generic `front_matter` HashMap parameter —
tool_update_story now intercepts every known field name and routes it
to a typed CRDT setter; unknown keys are rejected with an explicit error
pointing at the typed setters. The function only takes user_story /
description sections now.
- All 117 ItemMeta::from_yaml callsites migrated. Where tests previously
passed a YAML-shaped content blob and relied on the helper to extract
name/depends_on/blocked/agent/qa, they now pass:
write_item_with_content(id, stage, content, ItemMeta::named("Foo"))
crate::crdt_state::set_depends_on(id, &[...]) // when needed
crate::crdt_state::set_blocked(id, true) // when needed
crate::crdt_state::set_agent(id, Some("...")) // when needed
- write_story_content + write_story_file (test helper) now take an
explicit `name: Option<&str>` instead of parsing it from content.
- db::ops::move_item_stage stopped re-parsing YAML on every stage
transition; metadata is read straight from the CRDT view when mirroring
the row into SQLite.
New CRDT setters added for symmetry:
- crdt_state::set_name (mirrors set_agent — explicit name updates).
cargo fmt --check, clippy --all-targets -- -D warnings, and the
2830-test suite all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every stage transition was reading the content body's YAML front matter to
derive name/agent/blocked/depends_on, then writing those values straight
back into the CRDT registers — but the CRDT was already the source of
truth for all of these fields. The reparse was at best a no-op and at
worst could regress the CRDT to stale YAML values during transitions on
items whose YAML was out of date.
Now move_item_stage:
- writes the new stage to the CRDT with None for every other field, so
write_item leaves existing registers untouched.
- reads name/agent/blocked/depends_on back from the CRDT view when
mirroring the row into the SQLite shadow table (still needed because
the shadow stores a denormalised snapshot for read-side queries).
The yaml_legacy::parse_front_matter import is gone from db/ops.rs; the
only path still using it on the production side is ItemMeta::from_yaml,
which is a caller convenience (mostly used in test fixtures).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After 932 (review_hold register) and 933 (item_type + epic registers), the
remaining production yaml_legacy callers all had typed CRDT equivalents.
Migrated:
- agents/lifecycle.rs:
- transition_to_merge_failure writes to MergeJob.error CRDT entry instead
of YAML body. The legacy `merge_failure: "..."` front-matter write is gone.
- reject_story_from_qa inlines the QA-rejection notes append; no longer
needs yaml_legacy::write_rejection_notes_to_content.
- fields_to_clear_transform helper deleted along with all five callers —
blocked/retry_count/merge_failure are typed CRDT fields now, so clearing
the equivalent YAML keys is redundant.
- http/workflow/pipeline.rs:
- load_pipeline_state reads merge_failure from MergeJob.error (mirrors
status_tools.rs).
- validate_story_dirs checks the typed CRDT `name` register instead of
parsing YAML front matter.
- http/mcp/status_tools.rs: review_hold reads the typed CRDT register
(yaml_residue wrap was the last one in this file).
- http/mcp/story_tools/criteria.rs: story_name reads from CRDT.
- service/agents/mod.rs::get_work_item_content: name/agent come from CRDT.
- service/notifications/io/mod.rs::read_story_name: same.
- http/workflow/bug_ops/{bug,refactor}.rs: name-fallback paths drop YAML
parsing in favour of the CRDT-derived item.name.
Dead helpers removed from db/yaml_legacy.rs:
yaml_residue, write_merge_failure_in_content, write_rejection_notes_to_content,
clear_front_matter_field_in_content, write_review_hold_in_content,
clear_front_matter_field, write_review_hold (the last four shipped in 932).
Remaining surface: FrontMatter / StoryMetadata structs, parse_front_matter,
set_front_matter_field — kept for `coverage_baseline` writes via
test_results.rs and the generic update_story front_matter escape hatch.
Test fixtures rewritten to seed the CRDT register instead of relying on
YAML parsing during write_item_with_content:
- has_review_hold_returns_* tests
- item_type_from_id_uses_crdt_register_for_numeric_ids
- tool_list_epics_shows_member_rollup
- get_work_item_content (both copies — http/agents + service/agents)
- validate_story_dirs_missing_name_in_crdt
- server_side_merge_*_sets_merge_failure (assert MergeJob.error, not YAML)
cargo fmt --check, clippy --all-targets -- -D warnings, and the
2856-test suite all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the YAML-only `type: epic` / `epic: <id>` front-matter fields with
typed CRDT registers on PipelineItemCrdt. The epic-mechanism MCP tools
(`tool_list_epics`, `tool_show_epic`), the epic-context injection in agent
spawn, and the type-classifier helpers (`item_type_from_id`, `is_bug_item`,
`is_refactor_item`) now all read from the CRDT.
Schema:
- PipelineItemCrdt: `item_type: LwwRegisterCrdt<String>` and
`epic: LwwRegisterCrdt<String>` registers.
- WorkItem: typed `item_type()` and `epic()` accessors returning `Option<&str>`.
- crdt_state::set_item_type(story_id, Option<&str>) and
crdt_state::set_epic(story_id, Option<&str>) typed setters.
Write paths populate the new registers:
- create_story_file / create_bug_file / create_spike_file /
create_refactor_file / create_epic_file — each calls set_item_type after
write_story_content.
- tool_update_story intercepts `epic` and `type` fields and routes them to
the typed setters (same pattern as qa / depends_on).
Read paths migrated off yaml_legacy:
- http/mcp/story_tools/epic.rs: tool_list_epics + tool_show_epic.
- agents/lifecycle.rs::item_type_from_id (numeric-only IDs).
- agents/pool/start/spawn.rs epic-context injection.
- http/workflow/bug_ops/bug.rs::is_bug_item, refactor.rs::is_refactor_item.
- http/workflow/pipeline.rs::load_pipeline_state — review_hold/qa/epic_id
all come from the CRDT now; only merge_failure is still YAML (sweep in
929 stage 10).
All `yaml_residue(...)` wraps for item_type / epic are removed; the
remaining residue marker doc no longer references 933.
cargo fmt --check, clippy --all-targets -- -D warnings, and the 2857-test
suite all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
review_hold is now a typed bool register on PipelineItemCrdt alongside
blocked / mergemaster_attempted. Exposed via the typed setter
`crdt_state::set_review_hold(story_id, value)` and the
`WorkItem::review_hold()` accessor. Replaces the legacy
`review_hold: true` YAML front-matter field.
Migrated callers:
- http/mcp/qa_tools.rs::tool_approve_qa — clear via set_review_hold(false)
- agents/lifecycle.rs::reject_story_from_qa — clear via set_review_hold(false)
- agents/pool/pipeline/advance/helpers.rs::write_review_hold_to_store
— set via set_review_hold(true), no more content rewrite
- agents/pool/auto_assign/reconcile.rs (two callsites) — set via
set_review_hold(true) instead of FS YAML write
- agents/pool/auto_assign/story_checks.rs::has_review_hold — reads the
typed register instead of conflating with Stage::Frozen (real bug fix:
the legacy implementation returned `stage.is_frozen()`, which made
the auto-assigner treat *every* held-for-review item as frozen even
when it wasn't actually parked at the freeze stage).
Dead yaml_legacy helpers removed:
- write_review_hold(path), write_review_hold_in_content(content)
- clear_front_matter_field(path) — last caller was the qa_tools wrap
The yaml_residue marker doc now only mentions 933; the 932 line is gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
io/watcher and io/watcher/sweep were already CRDT-only — the watcher only
watches .huskies/{project,agents}.toml, work-item events come from CRDT
subscribe — so the remaining FS shadow reader was the bug-503 archived-dep
warning in story_tools/story/create.rs (via check_archived_deps_from_list,
which scanned .huskies/work/6_archived/). Migrate that call to the
CRDT-direct `dep_is_archived_crdt`. Drop the now-unused helper and the
four dead imports in bug/spike/refactor/criteria.rs that referenced it.
io/story_metadata/deps.rs is reduced to a module-level comment pointing
callers at the crdt_state helpers; nothing in io/ now scans the FS shadow
tree.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The startup reconciler still pokes review_hold into the on-disk story file
when promoting human-QA items, because no CRDT register exists yet for
review_hold (filed as sub-story 932). The two write-side callsites in
reconcile.rs were the last bare yaml_legacy:: calls in production write
paths; wrap them in yaml_residue so the gap shows up in
`grep -rn yaml_residue` like the other 932/933 markers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
transition_to_frozen and transition_to_unfrozen no longer touch YAML; both
now just call apply_transition with no content_transform. Pairs with the
stage-6 read-side change in projection.rs.
Story 934 will obviate the entire resume_to mechanism by making frozen a
flag orthogonal to Stage (story stays in its current Stage when frozen).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
projection::project_stage was the last yaml_legacy reader in pipeline_state.
Drop the read_content+parse_front_matter detour for the "7_frozen" case and
always default resume_to to Stage::Coding. The YAML write side in apply.rs
goes in stage 7.
Story 934 (sibling refactor) will replace Stage::Frozen-with-payload with a
frozen flag orthogonal to Stage, so a story frozen in Qa stays in Stage::Qa
rather than encoding a "where to resume" target. After 934 lands the
resume_to payload disappears entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrate the last three callers of the FS-scanning dependency helpers to the
CRDT-direct equivalents and delete the dead helpers:
- agents/pool/auto_assign/story_checks.rs: has_unmet_dependencies and
check_archived_dependencies now wrap check_unmet_deps_crdt /
check_archived_deps_crdt directly. Tests rewritten to seed the CRDT.
- http/mcp/story_tools/story/update.rs: bug-503 archived-dep warning now
reads from CRDT instead of scanning 6_archived.
- agents/pool/pipeline/advance/helpers.rs: resolve_qa_mode_from_store is
CRDT-only (the FS fallback for content-store-empty stories is gone).
- io/story_metadata/parser.rs: resolve_qa_mode_from_content removed.
- io/story_metadata/deps.rs: check_unmet_deps and dep_is_done deleted,
along with the unused check_unmet_deps_from_list helper.
- io/story_metadata/mod.rs: re-exports trimmed accordingly.
check_archived_deps_from_list survives because story-creation still calls
it before the CRDT entry exists (used from story_tools/story/create.rs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Read-side migrations:
- agents/pool/auto_assign/backlog.rs: depends_on check now reads from
WorkItem.depends_on() instead of parse_front_matter.
- agents/pool/auto_assign/story_checks.rs: read_story_front_matter_agent
drops its YAML fallback — post-891 the CRDT entry is reliable, and
removing the fallback makes the contract honest. The now-unused
read_story_contents helper goes too.
- agents/pool/start/validation.rs: same shape — YAML fallback removed,
CRDT register is the only source for agent pinning.
- agents/pool/start/spawn.rs: epic-context injection wraps the
parse_front_matter call in `yaml_residue(...)` since `meta.epic` has no
CRDT analog (sub-story 933).
- agents/lifecycle.rs: item_type_from_id (numeric-only ID path) wraps its
parse_front_matter in `yaml_residue(...)` for the same reason (933).
The write-side `fields_to_clear_transform` calls in lifecycle.rs are
left for stage 8, when FS-shadow writes are deleted wholesale.
Test fix:
- start_agent_returns_error_when_front_matter_agent_busy now seeds the
CRDT entry (write_item with agent="coder-opus") instead of relying on
parse_front_matter reading the YAML on disk.
Filed earlier:
- 932 (review_hold register) — note: this turns out to be a real class-1
bug: write_review_hold_to_store still writes YAML but has_review_hold
reads Stage::Frozen, so the write goes into a void. 932 is the correct
fix.
All 2861 tests pass; fmt + clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three MCP files touched:
- status_tools.rs (story-status JSON dump): every field with a CRDT
equivalent now reads from WorkItem (name, agent, blocked, qa_mode,
retry_count, depends_on, claimed_by, claimed_at) or MergeJob.error
(merge_failure detail). One field — review_hold — has no CRDT register
yet (sub-story 932) and is wrapped in `yaml_residue(parse_front_matter(...))`
so the gap is visible at every code-search.
- qa_tools.rs:
• tool_approve_qa wraps the legacy `clear_front_matter_field("review_hold")`
write in `yaml_residue(...)` pending sub-story 932.
• tool_reject_qa now reads the agent name from the CRDT WorkItem instead
of parsing front matter on disk.
- story_tools/epic.rs: the entire epic feature (item_type, epic link)
has no CRDT analog — sub-story 933. Every parse_front_matter call here
is wrapped in `yaml_residue(...)`.
Also: new identity wrapper `db::yaml_legacy::yaml_residue<T>(v: T) -> T`
that marks a yaml_legacy callsite blocked on a CRDT-register gap. Pure
identity at runtime; the distinctive name makes the residue grep-findable
(`grep -rn yaml_residue`). Sub-stories 932 and 933 enumerate the gaps.
Filed:
- 932: Add CRDT register for review_hold
- 933: Add CRDT registers for the epic mechanism
All 2854 tests pass; fmt + clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
delete.rs, start.rs, assign.rs all looked up the story name by reading
the content from disk/store and parsing the front matter. Replaced with
`crdt_state::read_item(&story_id).and_then(|w| w.name())`. Each callsite's
fallback chain ("CRDT → content store → filesystem") still locates the
story_id; only the name extraction moved off YAML.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each chat command that previously read parse_front_matter for story
metadata (name, agent, depends_on, blocked, retry_count, merge_failure,
qa_mode) now reads from the typed CRDT API:
- WorkItem (via crdt_state::read_item) for pipeline-item registers.
- MergeJobView (via crdt_state::read_merge_job) for the merge failure
detail text, which has its own LWW-map CRDT entry.
Files migrated: depends.rs, freeze.rs, move_story.rs, overview.rs,
status/render.rs, triage.rs, unblock.rs, unreleased.rs.
unblock.rs: also removes the legacy front-matter cleanup branch that
fired when the typed Blocked→Coding transition failed. Post-929 there
is no YAML on disk to clean; the fallback now just resets retry_count
in the CRDT.
triage.rs: drops the YAML-only `review_hold` and `coverage_baseline`
fields from the dump. These have no CRDT register and were never
load-bearing on the triage output; if needed later, add a CRDT register
and surface it back.
Tests:
- The three status/render merge-failure rendering tests now seed a
MergeJob CRDT entry via write_merge_job instead of writing YAML.
- The unblock test that asserted YAML cleanup on disk is now an assertion
on the CRDT registers (blocked=false, retry_count=0). Also re-seeded
in `2_blocked` stage so the typed Blocked → Coding transition actually
fires (not the fallback path).
All 2855 tests pass; fmt clean; clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>