huskies

Author	SHA1	Message	Date
dave	12ae7ec8bb	huskies: merge 936	2026-05-12 21:48:39 +00:00
dave	937792f208	huskies: merge 898	2026-05-12 21:33:41 +00:00
Timmy	d78dd9e8f9	feat(934): typed Stage enum replaces directory-string state model The state machine's `Stage` enum becomes the source of truth for pipeline state. Six stages of work land together: 1. Clean wire vocabulary (`coding`, `merge`, `merge_failure`, ...) replaces legacy directory-style strings (`2_current`, `4_merge`, ...) on the wire. `Stage::from_dir` accepted both during deployment; new writes always emit the clean form via `stage_dir_name`. Lexicographic `dir >= "5_done"` checks in lifecycle.rs become typed `matches!` checks since the new vocabulary doesn't sort in pipeline order. 2. `crdt_state::write_item` takes typed `&Stage`, serialising via `stage_dir_name` at the CRDT boundary. `#[cfg(test)] write_item_str` parses legacy strings for test fixtures. 3. `WorkItem::stage()` returns typed `crdt_state::Stage`; `stage_str()` is gone from the public API. Projection dispatches on the typed enum. 4. `frozen` becomes an orthogonal CRDT register. `Stage::Frozen` and `PipelineEvent::Freeze`/`Unfreeze` are removed; `transition_to_frozen`/ `unfrozen` set the flag directly without touching the stage register. 5. Watcher sweep and `tool_update_story`'s `blocked` setter route through `apply_transition` so the typed transition table validates every stage change. `update_story` gains a `frozen` field for symmetry. 6. One-shot startup migration rewrites pre-934 directory-style stage registers (and sets `frozen=true` on items previously at `7_frozen`). `Stage::from_dir` drops legacy aliases. The db boundary keeps a small normaliser so callers with legacy strings (MCP, tests) still work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 22:31:59 +01:00
dave	93443e2ff1	huskies: merge 921	2026-05-12 21:09:52 +00:00
Timmy	69d91d7707	feat(929): delete db/yaml_legacy.rs entirely — CRDT is the sole source of truth Final 929 sweep: every YAML-shaped helper is gone. No production code parses or writes YAML front matter anywhere. Surface removed: - db/yaml_legacy.rs (FrontMatter/StoryMetadata structs, parse_front_matter, set_front_matter_field, yaml_residue marker) — file deleted. - ItemMeta::from_yaml — deleted; callers pass typed ItemMeta::named(...) or ItemMeta::default() and use typed CRDT setters (set_depends_on, set_blocked, set_retry_count, set_agent, set_qa_mode, set_review_hold, set_item_type, set_epic, set_mergemaster_attempted) for the rest. - write_coverage_baseline_to_story_file + read_coverage_percent_from_json — the coverage_baseline YAML field was write-only (nothing read it back); removed along with its caller in agent_tools/lifecycle.rs. - update_story_in_file's generic `front_matter` HashMap parameter — tool_update_story now intercepts every known field name and routes it to a typed CRDT setter; unknown keys are rejected with an explicit error pointing at the typed setters. The function only takes user_story / description sections now. - All 117 ItemMeta::from_yaml callsites migrated. Where tests previously passed a YAML-shaped content blob and relied on the helper to extract name/depends_on/blocked/agent/qa, they now pass: write_item_with_content(id, stage, content, ItemMeta::named("Foo")) crate::crdt_state::set_depends_on(id, &[...]) // when needed crate::crdt_state::set_blocked(id, true) // when needed crate::crdt_state::set_agent(id, Some("...")) // when needed - write_story_content + write_story_file (test helper) now take an explicit `name: Option<&str>` instead of parsing it from content. - db::ops::move_item_stage stopped re-parsing YAML on every stage transition; metadata is read straight from the CRDT view when mirroring the row into SQLite. New CRDT setters added for symmetry: - crdt_state::set_name (mirrors set_agent — explicit name updates). cargo fmt --check, clippy --all-targets -- -D warnings, and the 2830-test suite all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:55:25 +01:00
Timmy	6c62e0fa31	refactor(929): drop redundant YAML re-parse in db::ops::move_item_stage Every stage transition was reading the content body's YAML front matter to derive name/agent/blocked/depends_on, then writing those values straight back into the CRDT registers — but the CRDT was already the source of truth for all of these fields. The reparse was at best a no-op and at worst could regress the CRDT to stale YAML values during transitions on items whose YAML was out of date. Now move_item_stage: - writes the new stage to the CRDT with None for every other field, so write_item leaves existing registers untouched. - reads name/agent/blocked/depends_on back from the CRDT view when mirroring the row into the SQLite shadow table (still needed because the shadow stores a denormalised snapshot for read-side queries). The yaml_legacy::parse_front_matter import is gone from db/ops.rs; the only path still using it on the production side is ItemMeta::from_yaml, which is a caller convenience (mostly used in test fixtures). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:15:08 +01:00
Timmy	4888f051c3	wip(929): stage 10 sweep — production callsites move to CRDT, yaml_legacy shrinks After 932 (review_hold register) and 933 (item_type + epic registers), the remaining production yaml_legacy callers all had typed CRDT equivalents. Migrated: - agents/lifecycle.rs: - transition_to_merge_failure writes to MergeJob.error CRDT entry instead of YAML body. The legacy `merge_failure: "..."` front-matter write is gone. - reject_story_from_qa inlines the QA-rejection notes append; no longer needs yaml_legacy::write_rejection_notes_to_content. - fields_to_clear_transform helper deleted along with all five callers — blocked/retry_count/merge_failure are typed CRDT fields now, so clearing the equivalent YAML keys is redundant. - http/workflow/pipeline.rs: - load_pipeline_state reads merge_failure from MergeJob.error (mirrors status_tools.rs). - validate_story_dirs checks the typed CRDT `name` register instead of parsing YAML front matter. - http/mcp/status_tools.rs: review_hold reads the typed CRDT register (yaml_residue wrap was the last one in this file). - http/mcp/story_tools/criteria.rs: story_name reads from CRDT. - service/agents/mod.rs::get_work_item_content: name/agent come from CRDT. - service/notifications/io/mod.rs::read_story_name: same. - http/workflow/bug_ops/{bug,refactor}.rs: name-fallback paths drop YAML parsing in favour of the CRDT-derived item.name. Dead helpers removed from db/yaml_legacy.rs: yaml_residue, write_merge_failure_in_content, write_rejection_notes_to_content, clear_front_matter_field_in_content, write_review_hold_in_content, clear_front_matter_field, write_review_hold (the last four shipped in 932). Remaining surface: FrontMatter / StoryMetadata structs, parse_front_matter, set_front_matter_field — kept for `coverage_baseline` writes via test_results.rs and the generic update_story front_matter escape hatch. Test fixtures rewritten to seed the CRDT register instead of relying on YAML parsing during write_item_with_content: - has_review_hold_returns_* tests - item_type_from_id_uses_crdt_register_for_numeric_ids - tool_list_epics_shows_member_rollup - get_work_item_content (both copies — http/agents + service/agents) - validate_story_dirs_missing_name_in_crdt - server_side_merge_*_sets_merge_failure (assert MergeJob.error, not YAML) cargo fmt --check, clippy --all-targets -- -D warnings, and the 2856-test suite all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:13:17 +01:00
Timmy	7d7ab85994	feat(933): add item_type + epic CRDT registers + migrate epic mechanism Replaces the YAML-only `type: epic` / `epic: <id>` front-matter fields with typed CRDT registers on PipelineItemCrdt. The epic-mechanism MCP tools (`tool_list_epics`, `tool_show_epic`), the epic-context injection in agent spawn, and the type-classifier helpers (`item_type_from_id`, `is_bug_item`, `is_refactor_item`) now all read from the CRDT. Schema: - PipelineItemCrdt: `item_type: LwwRegisterCrdt<String>` and `epic: LwwRegisterCrdt<String>` registers. - WorkItem: typed `item_type()` and `epic()` accessors returning `Option<&str>`. - crdt_state::set_item_type(story_id, Option<&str>) and crdt_state::set_epic(story_id, Option<&str>) typed setters. Write paths populate the new registers: - create_story_file / create_bug_file / create_spike_file / create_refactor_file / create_epic_file — each calls set_item_type after write_story_content. - tool_update_story intercepts `epic` and `type` fields and routes them to the typed setters (same pattern as qa / depends_on). Read paths migrated off yaml_legacy: - http/mcp/story_tools/epic.rs: tool_list_epics + tool_show_epic. - agents/lifecycle.rs::item_type_from_id (numeric-only IDs). - agents/pool/start/spawn.rs epic-context injection. - http/workflow/bug_ops/bug.rs::is_bug_item, refactor.rs::is_refactor_item. - http/workflow/pipeline.rs::load_pipeline_state — review_hold/qa/epic_id all come from the CRDT now; only merge_failure is still YAML (sweep in 929 stage 10). All `yaml_residue(...)` wraps for item_type / epic are removed; the remaining residue marker doc no longer references 933. cargo fmt --check, clippy --all-targets -- -D warnings, and the 2857-test suite all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:58:43 +01:00
Timmy	aadbb1b2af	feat(932): add review_hold CRDT register + migrate callers off yaml_legacy review_hold is now a typed bool register on PipelineItemCrdt alongside blocked / mergemaster_attempted. Exposed via the typed setter `crdt_state::set_review_hold(story_id, value)` and the `WorkItem::review_hold()` accessor. Replaces the legacy `review_hold: true` YAML front-matter field. Migrated callers: - http/mcp/qa_tools.rs::tool_approve_qa — clear via set_review_hold(false) - agents/lifecycle.rs::reject_story_from_qa — clear via set_review_hold(false) - agents/pool/pipeline/advance/helpers.rs::write_review_hold_to_store — set via set_review_hold(true), no more content rewrite - agents/pool/auto_assign/reconcile.rs (two callsites) — set via set_review_hold(true) instead of FS YAML write - agents/pool/auto_assign/story_checks.rs::has_review_hold — reads the typed register instead of conflating with Stage::Frozen (real bug fix: the legacy implementation returned `stage.is_frozen()`, which made the auto-assigner treat every held-for-review item as frozen even when it wasn't actually parked at the freeze stage). Dead yaml_legacy helpers removed: - write_review_hold(path), write_review_hold_in_content(content) - clear_front_matter_field(path) — last caller was the qa_tools wrap The yaml_residue marker doc now only mentions 933; the 932 line is gone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:49:36 +01:00
dave	f9f16d6a14	huskies: merge 925	2026-05-12 18:33:13 +00:00
Timmy	7660a460a5	wip(929): stage 9 — drop FS-archived-deps scan; story_tools/story/create.rs reads CRDT io/watcher and io/watcher/sweep were already CRDT-only — the watcher only watches .huskies/{project,agents}.toml, work-item events come from CRDT subscribe — so the remaining FS shadow reader was the bug-503 archived-dep warning in story_tools/story/create.rs (via check_archived_deps_from_list, which scanned .huskies/work/6_archived/). Migrate that call to the CRDT-direct `dep_is_archived_crdt`. Drop the now-unused helper and the four dead imports in bug/spike/refactor/criteria.rs that referenced it. io/story_metadata/deps.rs is reduced to a module-level comment pointing callers at the crdt_state helpers; nothing in io/ now scans the FS shadow tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:25:47 +01:00
Timmy	37877db38d	wip(929): stage 8 — wrap reconcile review_hold FS writes in yaml_residue The startup reconciler still pokes review_hold into the on-disk story file when promoting human-QA items, because no CRDT register exists yet for review_hold (filed as sub-story 932). The two write-side callsites in reconcile.rs were the last bare yaml_legacy:: calls in production write paths; wrap them in yaml_residue so the gap shows up in `grep -rn yaml_residue` like the other 932/933 markers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:22:26 +01:00
Timmy	23f58f5762	wip(929): stage 7 — drop resume_to_stage FS write from freeze/unfreeze transition_to_frozen and transition_to_unfrozen no longer touch YAML; both now just call apply_transition with no content_transform. Pairs with the stage-6 read-side change in projection.rs. Story 934 will obviate the entire resume_to mechanism by making frozen a flag orthogonal to Stage (story stays in its current Stage when frozen). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:18:27 +01:00
Timmy	bfea832402	wip(929): stage 6 — drop resume_to_stage YAML lookup from projection layer projection::project_stage was the last yaml_legacy reader in pipeline_state. Drop the read_content+parse_front_matter detour for the "7_frozen" case and always default resume_to to Stage::Coding. The YAML write side in apply.rs goes in stage 7. Story 934 (sibling refactor) will replace Stage::Frozen-with-payload with a frozen flag orthogonal to Stage, so a story frozen in Qa stays in Stage::Qa rather than encoding a "where to resume" target. After 934 lands the resume_to payload disappears entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:17:10 +01:00
Timmy	6e704a33b7	wip(929): stage 5 — drop FS-based dep checks and qa-mode parser from io/story_metadata Migrate the last three callers of the FS-scanning dependency helpers to the CRDT-direct equivalents and delete the dead helpers: - agents/pool/auto_assign/story_checks.rs: has_unmet_dependencies and check_archived_dependencies now wrap check_unmet_deps_crdt / check_archived_deps_crdt directly. Tests rewritten to seed the CRDT. - http/mcp/story_tools/story/update.rs: bug-503 archived-dep warning now reads from CRDT instead of scanning 6_archived. - agents/pool/pipeline/advance/helpers.rs: resolve_qa_mode_from_store is CRDT-only (the FS fallback for content-store-empty stories is gone). - io/story_metadata/parser.rs: resolve_qa_mode_from_content removed. - io/story_metadata/deps.rs: check_unmet_deps and dep_is_done deleted, along with the unused check_unmet_deps_from_list helper. - io/story_metadata/mod.rs: re-exports trimmed accordingly. check_archived_deps_from_list survives because story-creation still calls it before the CRDT entry exists (used from story_tools/story/create.rs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:14:54 +01:00
Timmy	f775f4cfb9	wip(929): stage 4 — migrate agents/pool/* + lifecycle.rs read sides off yaml_legacy Read-side migrations: - agents/pool/auto_assign/backlog.rs: depends_on check now reads from WorkItem.depends_on() instead of parse_front_matter. - agents/pool/auto_assign/story_checks.rs: read_story_front_matter_agent drops its YAML fallback — post-891 the CRDT entry is reliable, and removing the fallback makes the contract honest. The now-unused read_story_contents helper goes too. - agents/pool/start/validation.rs: same shape — YAML fallback removed, CRDT register is the only source for agent pinning. - agents/pool/start/spawn.rs: epic-context injection wraps the parse_front_matter call in `yaml_residue(...)` since `meta.epic` has no CRDT analog (sub-story 933). - agents/lifecycle.rs: item_type_from_id (numeric-only ID path) wraps its parse_front_matter in `yaml_residue(...)` for the same reason (933). The write-side `fields_to_clear_transform` calls in lifecycle.rs are left for stage 8, when FS-shadow writes are deleted wholesale. Test fix: - start_agent_returns_error_when_front_matter_agent_busy now seeds the CRDT entry (write_item with agent="coder-opus") instead of relying on parse_front_matter reading the YAML on disk. Filed earlier: - 932 (review_hold register) — note: this turns out to be a real class-1 bug: write_review_hold_to_store still writes YAML but has_review_hold reads Stage::Frozen, so the write goes into a void. 932 is the correct fix. All 2861 tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:03:51 +01:00
dave	03a99b3cf1	huskies: merge 927	2026-05-12 17:55:12 +00:00
Timmy	b8945654bf	wip(929): stage 3 — migrate http/mcp/* off yaml_legacy + introduce yaml_residue marker Three MCP files touched: - status_tools.rs (story-status JSON dump): every field with a CRDT equivalent now reads from WorkItem (name, agent, blocked, qa_mode, retry_count, depends_on, claimed_by, claimed_at) or MergeJob.error (merge_failure detail). One field — review_hold — has no CRDT register yet (sub-story 932) and is wrapped in `yaml_residue(parse_front_matter(...))` so the gap is visible at every code-search. - qa_tools.rs: • tool_approve_qa wraps the legacy `clear_front_matter_field("review_hold")` write in `yaml_residue(...)` pending sub-story 932. • tool_reject_qa now reads the agent name from the CRDT WorkItem instead of parsing front matter on disk. - story_tools/epic.rs: the entire epic feature (item_type, epic link) has no CRDT analog — sub-story 933. Every parse_front_matter call here is wrapped in `yaml_residue(...)`. Also: new identity wrapper `db::yaml_legacy::yaml_residue<T>(v: T) -> T` that marks a yaml_legacy callsite blocked on a CRDT-register gap. Pure identity at runtime; the distinctive name makes the residue grep-findable (`grep -rn yaml_residue`). Sub-stories 932 and 933 enumerate the gaps. Filed: - 932: Add CRDT register for review_hold - 933: Add CRDT registers for the epic mechanism All 2854 tests pass; fmt + clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 18:54:32 +01:00
Timmy	9eb5116f7e	wip(929): stage 2 — migrate chat/transport/matrix/* off yaml_legacy delete.rs, start.rs, assign.rs all looked up the story name by reading the content from disk/store and parsing the front matter. Replaced with `crdt_state::read_item(&story_id).and_then(|w| w.name())`. Each callsite's fallback chain ("CRDT → content store → filesystem") still locates the story_id; only the name extraction moved off YAML. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 18:45:25 +01:00
Timmy	a49a1cf7cb	wip(929): stage 1 — migrate chat/commands/* off yaml_legacy Each chat command that previously read parse_front_matter for story metadata (name, agent, depends_on, blocked, retry_count, merge_failure, qa_mode) now reads from the typed CRDT API: - WorkItem (via crdt_state::read_item) for pipeline-item registers. - MergeJobView (via crdt_state::read_merge_job) for the merge failure detail text, which has its own LWW-map CRDT entry. Files migrated: depends.rs, freeze.rs, move_story.rs, overview.rs, status/render.rs, triage.rs, unblock.rs, unreleased.rs. unblock.rs: also removes the legacy front-matter cleanup branch that fired when the typed Blocked→Coding transition failed. Post-929 there is no YAML on disk to clean; the fallback now just resets retry_count in the CRDT. triage.rs: drops the YAML-only `review_hold` and `coverage_baseline` fields from the dump. These have no CRDT register and were never load-bearing on the triage output; if needed later, add a CRDT register and surface it back. Tests: - The three status/render merge-failure rendering tests now seed a MergeJob CRDT entry via write_merge_job instead of writing YAML. - The unblock test that asserted YAML cleanup on disk is now an assertion on the CRDT registers (blocked=false, retry_count=0). Also re-seeded in `2_blocked` stage so the typed Blocked → Coding transition actually fires (not the fallback path). All 2855 tests pass; fmt clean; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 18:41:43 +01:00
dave	b940b95ec3	huskies: merge 906	2026-05-12 17:21:16 +00:00
dave	148ce37beb	huskies: merge 891	2026-05-12 17:09:01 +00:00
dave	b76633b79b	huskies: merge 892	2026-05-12 16:51:23 +00:00
dave	c3144b7937	huskies: merge 900	2026-05-12 16:46:33 +00:00
dave	86e8f2441f	huskies: merge 920	2026-05-12 16:41:24 +00:00
dave	19b7edb60c	huskies: merge 918	2026-05-12 16:36:09 +00:00
Timmy	6feb68f3e3	fix(923): watchdog counts only tool-using turns; narration-only turns no longer burn budget Observed: stories 917, 918, 920, 910 all turn-limit-killed despite producing real commits. Tally across their session logs shows 30–55% of assistant turns were pure narration ("I'll read X next", "Now let me check Y") with no tool_use. At 80 max_turns the effective work budget was ~44 tool calls, not enough for a typical bug fix's edit + test + check_criterion cycle. Changes: - New optional AgentConfig field max_tool_turns. When set the watchdog uses it instead of max_turns; only assistant messages whose data.message.content has at least one tool_use block count. - count_turns_in_log in agents/pool/auto_assign/watchdog/limits.rs filters on tool_use. Existing test helper write_fake_session_log now emits tool_use blocks; added write_fake_mixed_session_log for the narration regression test. - agents.toml: coders/coder-opus get max_turns=200 (claude-code's own --max-turns cap, sized to never bite before the watchdog) and max_tool_turns=80. qa: 120 / 40. mergemaster: 250 / 100. Budgets unchanged — the dollar cap remains the runaway-loop backstop, with ~$3-5 worst-case waste if an agent narrates indefinitely. - Two new regression tests: * watchdog_does_not_count_narration_only_turns: 5 tool + 30 narration under max_tool_turns=10 stays Running. * watchdog_max_tool_turns_overrides_max_turns: 4 tool turns at max_tool_turns=3 / max_turns=200 still terminates with TurnLimit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 17:25:11 +01:00
dave	ce07c4d7b7	huskies: merge 917	2026-05-12 16:22:33 +00:00
dave	916dc2b11d	huskies: merge 910	2026-05-12 16:02:49 +00:00
Timmy	e65f6ace84	fix: get_agent_output no longer panics on tool_result content with multi-byte UTF-8 at byte 500 agent_log::format::format_log_entry_as_text was truncating long tool_result strings via the naive byte slice `&content_str[..500]`. When byte 500 fell inside a multi-byte UTF-8 codepoint (box-drawing chars like '─', smart quotes, emoji), the slice panicked, propagating up through the MCP get_agent_output dispatcher and surfacing as an internal-error response. This blocked any diagnostic readout of a coder's session that had emitted tool output containing those chars. Walk back to the nearest char boundary with `is_char_boundary` before slicing. Regression test asserts the formatter doesn't panic on a 599-byte string with a 3-byte '─' straddling byte 500. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 17:01:24 +01:00
dave	3891de685c	huskies: merge 888	2026-05-12 15:48:38 +00:00
Timmy	d04facd24f	style: cargo fmt on pty/mod.rs (916 landed with a manually line-broken string literal) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:41:58 +01:00
dave	734597902f	huskies: merge 915	2026-05-12 15:38:25 +00:00
Timmy	38df9c78af	test(916): use far-future reset_at in inactivity-extension regression test to avoid spawn-time race The original `90b31fc8` test computed reset_at = now + 3s in the test thread, then relied on the script spawning fast enough that the rate_limit_event arrived while reset_at was still meaningfully in the future. Under cargo-test load the spawn could take long enough that block_until - now clamped to 0 and the inactivity timeout killed the script before its sleep finished. Pin reset_at to 2099-01-01 (matching the existing rate_limit_hard_block_sends_watcher_hard_block_event test) so the extension is essentially infinite and the assertion isolates the extension-vs-no-extension behavior from wall-clock slack. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:36:24 +01:00
dave	a34c9796b5	huskies: merge 913	2026-05-12 15:30:23 +00:00
Timmy	90b31fc84f	fix(916): rate-limit hard block extends inactivity deadline so the watchdog doesn't kill mid-wait When claude-code emits a rate_limit_event with status != allowed_warning, the subprocess waits internally for the limit to clear before retrying. No PTY output flows during that window, so the inactivity timeout in the PTY runner would fire and kill the agent — mergemaster especially, whose 15-minute inactivity window is shorter than typical rate-limit backoffs. Track `block_until = Some(reset_at)` on hard-block events and add the remaining time-until-reset to the per-iteration recv timeout. Once reset_at passes (or an earlier emit arrives), the extension implicitly drops to 0 and the base inactivity timeout resumes. Turn/budget counts aren't affected — they come from the session log and only advance when API calls actually complete, so a stalled retry doesn't burn either. Regression test in agents/pty/mod.rs spawns a script that emits a hard-block with reset_at = now+3s, sleeps 3s, then exits, with inactivity_timeout_secs = 1. Without the fix the runner kills the script at 1s; with the fix the deadline is bumped past the sleep and the run completes cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:22:21 +01:00
Timmy	8421104645	fix(914): thread-local ALL_OPS/VECTOR_CLOCK in cfg(test) so compaction tests don't race Root cause was not the persist channel (the test-mode channel is unbounded and its receiver is leaked, so sends never fail). It was that `ALL_OPS` and `VECTOR_CLOCK` were process-wide `OnceLock` globals while `CRDT_STATE` was already thread-local — so one test thread's `apply_compaction` would prune another test thread's freshly-written ops out of the shared journal, and the subsequent `all_ops_json()` read in `compaction_reduces_ops` would return fewer than the 5 it had just written. Mirror the pattern already used for `CRDT_STATE` and `SnapshotState`: in `cfg(test)` use thread-local `OnceLock<Mutex<...>>`s for the op journal and vector clock, accessed via new `all_ops_lock()` / `vector_clock_lock()` helpers. Production code path is unchanged (still the global statics set during `init()`). Touches ops/read/snapshot call sites to go through the helpers. Note in passing that this overlaps backlog story 518; that story is about the production-side persist path, this is the cfg(test)-only journal-isolation slice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:09:38 +01:00
dave	379ff16d3e	huskies: merge 905	2026-05-12 15:02:58 +00:00
dave	2c5326f339	huskies: merge 890	2026-05-12 14:48:52 +00:00
Timmy	bb845d17cf	docs(904): drop run_tests retry-on-timeout clause from coder prompts Bug 903 (run_tests attach instead of respawn) + 904 (MCP progress notifications + SSE) together eliminate the transport-timeout error mode from the agent's point of view: long test runs complete without the MCP client ever observing a tool-call error. Production verification (see `d64f1e94` / `ddc4228b` deploy at 14:30 UTC today) confirmed 78s and 65s test runs completing in single processes with no respawn churn and no retry needed. The "If run_tests errors with a transport timeout, call it again" sentence in coder-1/2/3/opus system_prompts (added belt-and-braces in `a97a10fb`) is now redundant. Removing it tightens the agent's mental model down to: call run_tests, wait for the result. No error-handling branch, no retry semantics to internalise. This closes the last open AC on story 904. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 15:36:53 +01:00
Timmy	734d3f2eb0	fix(gateway): bot.toml is read; perm_rx channel stays open Two latent bugs in `service/gateway/io.rs::spawn_gateway_bot`, exposed today after a long-overdue gateway rebuild: 1. The permission channel sender was bound as `_perm_tx` (the underscore prefix signalling "unused") and dropped at function return. The matrix bot's permission_listener task — which holds `perm_rx` for its lifetime per story 884 — then saw the channel close immediately and exited with "perm_rx channel closed" 1s after starting. Net effect: the listener was effectively absent on every gateway boot, so non-MCP tool permission requests had no destination at all (separate from the architectural mismatch that 898 will fix; this was a strictly worse "listener never even ran" version of the same problem). Bind as `perm_tx` and `mem::forget` it to keep the channel open for the gateway's lifetime, mirroring the existing `shutdown_tx` pattern two lines below. 2. `bot_name` was hardcoded to `"Assistant"`, ignoring `bot.toml::display_name`. So the gateway's matrix bot announced itself as "Assistant" and treated user messages addressed to "Timmy" (the actual configured display_name) as unaddressed, silently dropping them. `ambient_rooms` and `permission_timeout_secs` were similarly ignored. Load `BotConfig::load(config_dir)` and apply the same field plumbing the standard-mode initialisation in `main.rs:211-232` already uses. Symptoms seen in production today: - gateway.log: `Sending startup announcement: Assistant is online.` followed by repeated `Ignoring unaddressed message from @yossarian:crashlabs.io` lines. - gateway.log: `permission listener started` immediately followed (same timestamp) by `permission listener exiting (perm_rx channel closed)`. After this lands, rebuild the gateway binary and restart so it picks up `bot.toml` correctly and the listener stays alive for the bot's lifetime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 15:28:06 +01:00
Timmy	ddc4228b10	feat(904): MCP progress notifications + SSE for long-running tool calls Follow-up to bug 903. The attach fix made run_tests retries safe, but agents still observed the underlying MCP transport timeout as a tool-call error and had to handle it via retry. Implement the proper fix: MCP `notifications/progress` events keep the client's transport timer alive so the call never errors from the agent's perspective. What changed: server/src/http/mcp/progress.rs (new) - `ProgressEmitter` (progressToken + mpsc sender) installed in a `tokio::task_local!` scope by the SSE response path. - `emit_progress(progress, total, message)` builds a JSON-RPC `notifications/progress` message and sends it via the channel. No-op when no emitter is in scope (plain JSON path / tests / API runtimes), so tool handlers can call it unconditionally. server/src/http/mcp/mod.rs - mcp_post_handler now detects `Accept: text/event-stream` AND a `params._meta.progressToken` on tools/call. When both are present, routes through `sse_tools_call` instead of the plain JSON path. - sse_tools_call: spawns the dispatch task with the emitter installed, builds an SSE stream that interleaves incoming progress events with the final JSON-RPC response, with a 15s keep-alive interval as a backstop for tools that don't emit their own progress. - Plain JSON behaviour is unchanged for non-SSE clients and for everything other than tools/call. server/src/http/mcp/shell_tools/script.rs - tool_run_tests poll loop emits `notifications/progress` every 25s of elapsed time (well below the typical ~60s MCP transport timeout). Attached callers (the bug 903 fix path) also emit so their MCP socket stays alive while waiting for the in-flight job. - Output filtering: on a passing run the response now returns a one-line summary ("All N tests passed.") instead of the full `cargo test` stdout, which was pure noise that burned agent tokens. Failure output is unchanged (truncated tail with the `failures:` section and final test_result line). CRDT entry stores the same filtered value so attached callers see it too. Tests (3 new): - emit_progress_no_op_without_emitter — calling outside scope is safe - emit_progress_sends_notification_when_emitter_installed — full path - emit_progress_omits_optional_fields — total/message optional Not changed: coder system_prompts still tell agents to retry on transport-timeout errors. That advice is now belt-and-braces — if claude-code's HTTP MCP client honours progress notifications, no agent will ever observe the error; if not, retry is still safe post-903. We can drop the retry advice once we've observed the SSE path working in the field. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:05:04 +01:00
Timmy	a97a10fba2	docs(903): coder system_prompts — clarify run_tests retry contract Pre-d64f1e94 the "call run_tests again — it attaches" guidance was a lie (every call killed the prior job and spawned a fresh one). With the attach fix in place, the contract is now real and safe to depend on. Tighten the wording so agents see exactly what to do: OLD: "Do not use ScheduleWakeup to wait for run_tests; if run_tests appears to time out, call run_tests again — it attaches to the in-flight test job and blocks until completion." NEW: "If run_tests errors with a transport timeout, call it again — it's idempotent and attaches to the same in-flight test job, so retries are safe and eventually return a pass/fail result." Improvements: - "errors with a transport timeout" matches what the agent literally observes (a tool-call error), not the vague "appears to time out". - Explicit on idempotency so agents understand why retry is safe and don't worry about double-running the suite. - Drops the ScheduleWakeup clause — already enforced via the `disallowed_tools` setting on coder-1/2/3/opus, so the prompt reminder was redundant. Applied uniformly across coder-1, coder-2, coder-3, coder-opus. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:54:34 +01:00
Timmy	d64f1e94ff	fix(903): run_tests attaches to in-flight job instead of kill+respawn Bug 903: every `run_tests` MCP call killed the prior `cargo test` child for the same worktree and spawned a fresh one. Combined with the ~60s MCP client-side timeout and the 896 agent prompt that told agents to "call run_tests again — it attaches to the in-flight test job", this produced a respawn loop: agent calls, MCP times out at 60s, agent retries, run_tests kills the running build and starts a new one. The test suite never reaches the finish line. Server log evidence: "Started test job for <worktree> (pid N)" with a new PID every ~60-90s for the same worktree. Fix: when `run_tests` is called and a job is already in flight for that worktree, ATTACH to it instead of killing+respawning. The original job's poll loop already writes the final status to the CRDT `test_jobs` collection; attached callers just poll that CRDT entry (the same pattern `get_test_result` uses) and return the result when the in-flight job transitions out of "running". The 896 prompt's claim is now actually true. Worktrees remain isolated from each other and may run `cargo test` concurrently — there is no cross-worktree serialisation. The single invariant is "at most one test job per worktree at a time". New test: `tool_run_tests_concurrent_calls_attach_to_single_job` spawns two concurrent calls on the same worktree against a 2s `sleep`-based script and asserts total elapsed stays close to 2s (attach) rather than 4s (respawn). Note: the cross-worktree linker-OOM symptom Timmy reported in the field was downstream of the respawn loop. Killed-but-not-fully-reaped cargo invocations stack memory pressure beyond the nominal N worktrees. With the attach fix, each worktree runs exactly one in-flight build at a time and old builds finish cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:22:35 +01:00
dave	22bf203853	huskies: merge 894	2026-05-12 13:02:53 +00:00
Timmy	f06492f540	feat: add Blocked → Backlog legal transition (Demote) Pipeline gap: the state machine refused `move_story(... target='backlog')` from a Blocked story, leaving stuck items with no way to be parked while waiting on dependent fixes — operators had to either Unblock (which re-enters the active flow) or Archive (which loses the item). Extend the existing Demote rule so `Blocked + Demote → Backlog` is a legal transition, alongside the existing `Coding/Qa/Merge + Demote`. Also update `map_stage_move_to_event` in agents/lifecycle.rs so the chat/MCP `move_story` API recognises Blocked → backlog and routes it through `PipelineEvent::Demote`. Tests: - `blocked_demote_returns_to_backlog` — happy path. - `cannot_demote_from_done` / `cannot_demote_from_upcoming` — sanity checks that the broadened rule does NOT permit Demote from terminal or pre-triage stages. Pattern follows 892 (MergeFailure → Done) and 893 (MergeFailure → Coding) — pure transition.rs extension plus matching event mapping in lifecycle.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:13:18 +01:00
Timmy	e955250474	fix(902): coder system_prompts steer to get_story_todos for story content Bug 902: the Step 0 "resume from worktree state" instruction told coders to call git_status / git_log / git_diff to discover prior session work, which they then extended into hunting for the story `.md` file on disk via find / ls — pointless post-865, since story content lives only in the CRDT. Update Step 0 in coder-1, coder-2, coder-3, and coder-opus to add an explicit instruction: "To read story content, ACs, or description, call the `get_story_todos` MCP tool — do NOT search for a story `.md` file on disk; story content is CRDT-only." Single substring replacement covers all four agents (identical Step 0 across them). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:13:08 +01:00
Timmy	98d496b1ad	fix(901): unblock_story works on CRDT-only stories post-865 Bug 901: `unblock_story` (and the chat `unblock` command) routed through `parse_front_matter` and errored with "Missing front matter" on any post-865 story (story content is now CRDT-only with no YAML on disk). In `chat/commands/unblock.rs::unblock_by_story_id`: - Drop the early `parse_front_matter` gate. - Read story name and blocked state from the CRDT register API instead of parsed YAML (`crdt_state::read_item`, `pipeline_state::read_typed`). - Keep the legacy fallback cleanup, but gate it on the content actually starting with a `---` YAML block, so CRDT-only stories don't hit a parse error there either. - Remove the now-unused `parse_front_matter` import. Surfaced a second sub-bug: even when the state-machine transition fired (`Blocked + Unblock → Coding`), the CRDT `blocked` register was never explicitly cleared. Pre-865 the YAML-strip content_transform cleared it as a side effect; post-865 there is no YAML to strip. - Add `crdt_state::set_blocked(story_id, bool)` parallel to `set_retry_count`. Wired through `crdt_state::write` and the crate-level re-export. - `agents::lifecycle::transition_to_unblocked` now calls `set_blocked(story_id, false)` alongside `set_retry_count(0)` so the legacy register stays in sync with the typed stage. Test: `unblock_command_works_on_crdt_only_story_no_yaml` seeds a CRDT entry with no YAML on disk, runs unblock, asserts success + cleared blocked + retry_count=0. All 10 existing unblock tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:13:01 +01:00
Timmy	cd12cb5e2c	fix: Bash(:*) is invalid; use unconstrained Bash instead Claude Code rejects "Bash(:*)" with "Prefix cannot be empty before :*" — the rule is silently skipped, which since `5b48f0d0` left no Bash entry in the allowlist at all. Every coder agent's Bash call has been auto-denying since that commit landed (~840 of 1.4k denials in the sled log). The canonical form for "allow all bash commands" is the tool name alone: "Bash" (no parens). Apply it in three places that `5b48f0d0` touched: - .claude/settings.json (project root, inherited by new worktrees) - server/src/io/fs/scaffold/templates.rs (huskies init template) - server/src/io/fs/scaffold/tests.rs (assertion now checks "Bash") The gateway settings.json at ~/Desktop/huskies/.claude/settings.json and the four live worktrees (810, 888, 890, 894) were also corrected — not in this commit since they live outside the repo. Surfaced via /doctor; reported with rule "Invalid permission rule Bash(:*) was skipped: Prefix cannot be empty before :*". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:46:34 +01:00
dave	9be438e6d3	huskies: merge 865	2026-05-08 14:29:06 +00:00
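
The broadened Demote rule described in commit f06492f540 above (`Blocked + Demote → Backlog`, alongside the existing `Coding/Qa/Merge + Demote`) can be sketched as a typed transition table. This is a minimal illustration only: the `Stage` and `PipelineEvent` names and the `apply_transition` function are taken from the commit messages, but the variant lists, the signature, and the `IllegalTransition` error type are assumptions, since the real `transition.rs` is not shown in this log.

```rust
// Hypothetical sketch. Only the names Stage, PipelineEvent, Demote, Unblock
// and apply_transition appear in the commit messages; everything else here
// (variant set, signature, error type) is assumed for illustration.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Upcoming,
    Backlog,
    Coding,
    Qa,
    Merge,
    Blocked,
    Done,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PipelineEvent {
    Demote,
    Unblock,
}

/// Assumed error type: records the rejected (stage, event) pair.
#[derive(Debug, PartialEq, Eq)]
pub struct IllegalTransition {
    pub from: Stage,
    pub event: PipelineEvent,
}

/// Returns the next stage, or an error if the pair is not in the table.
pub fn apply_transition(
    from: Stage,
    event: PipelineEvent,
) -> Result<Stage, IllegalTransition> {
    match (from, event) {
        // Demote is legal from the active stages and, after f06492f540,
        // from Blocked as well.
        (Stage::Coding | Stage::Qa | Stage::Merge | Stage::Blocked, PipelineEvent::Demote) => {
            Ok(Stage::Backlog)
        }
        // Unblock re-enters the active flow (per the 901 commit message).
        (Stage::Blocked, PipelineEvent::Unblock) => Ok(Stage::Coding),
        // Everything else, including Demote from Done or Upcoming, is rejected.
        _ => Err(IllegalTransition { from, event }),
    }
}
```

Under these assumptions, `apply_transition(Stage::Blocked, PipelineEvent::Demote)` yields `Ok(Stage::Backlog)`, while Demote from `Done` or `Upcoming` returns an error, matching the intent of the `blocked_demote_returns_to_backlog`, `cannot_demote_from_done`, and `cannot_demote_from_upcoming` tests named in the commit.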


3628 Commits