Commit Graph

3377 Commits

Author SHA1 Message Date
dave 4c8fe910a7 huskies: merge 533_story_crdt_based_done_archived_sweep_to_replace_filesystem_based_watcher_sweep 2026-04-10 16:58:50 +00:00
dave 8f34c521fb huskies: merge 508_story_configurable_rendezvous_peer_in_project_toml_with_outbound_crdt_sync_connect 2026-04-10 16:44:50 +00:00
dave a59f4fc1a5 huskies: merge 532_story_remove_startup_reconcile_pass_and_drift_notification_no_filesystem_to_reconcile_against 2026-04-10 16:40:56 +00:00
dave b88857c2e4 huskies: merge 507_story_apply_inbound_signedops_with_causal_order_queue_for_partition_recovery 2026-04-10 16:13:07 +00:00
dave 1ca9bc1bfd huskies: merge 506_story_websocket_sync_endpoint_that_broadcasts_local_signedops_to_connected_peers 2026-04-10 15:52:49 +00:00
dave 73890c98fa huskies: merge 505_story_signedop_wire_codec_for_crdt_sync_between_nodes 2026-04-10 15:35:10 +00:00
dave bfede09fe6 huskies: merge 529_bug_stale_mergemaster_advance_moves_done_stories_back_to_merge_zombie_merge_loop 2026-04-10 15:20:34 +00:00
dave 11d19d8902 huskies: merge 530_story_eliminate_filesystem_markdown_shadows_entirely_crdt_db_is_the_only_story_store 2026-04-10 14:59:58 +00:00
dave 1dd675796b huskies: merge 531_story_mcp_tool_to_read_agent_session_logs_from_disk_not_just_live_stream 2026-04-10 13:08:51 +00:00
dave 31388da609 huskies: merge 517_story_remove_filesystem_shadow_fallback_paths_from_lifecycle_rs_finish_the_migration_to_crdt_only 2026-04-10 13:00:25 +00:00
dave fe405e81c6 huskies: merge 527_story_remove_rate_limit_hard_block_bot_notifications_from_matrix_chat 2026-04-10 11:27:36 +00:00
dave 7e5b9839e8 huskies: merge 523_refactor_introduce_script_test_script_lint_script_build_and_migrate_agent_prompts_off_tech_specific_commands 2026-04-10 11:22:51 +00:00
dave 2a24a4cc85 huskies: merge 522_story_migrate_status_command_pipeline_view_from_filesystem_to_pipeline_state_read_all_typed 2026-04-10 10:37:17 +00:00
dave 6310c8bf49 huskies: merge 518_story_apply_and_persist_should_log_when_persist_tx_send_fails_instead_of_silently_dropping_the_op 2026-04-10 10:33:01 +00:00
dave 61ae30873f huskies: merge 516_story_update_story_description_should_create_the_description_section_if_it_doesn_t_exist_instead_of_erroring 2026-04-10 10:28:53 +00:00
dave f015fe5a1d huskies: merge 515_story_add_a_debug_mcp_tool_to_dump_the_in_memory_crdt_state_for_inspection 2026-04-10 10:24:30 +00:00
dave c6b6be872b huskies: merge 509_bug_create_story_silently_drops_description_and_any_other_unknown_parameters_with_no_error 2026-04-10 10:20:13 +00:00
dave 5377eeae5b huskies: merge 513_story_startup_reconcile_pass_that_detects_drift_between_crdt_pipeline_items_and_filesystem_shadows 2026-04-10 10:16:45 +00:00
Timmy 92b212e7fd huskies: merge 504_story_update_story_front_matter_mcp_schema_should_accept_non_string_values_lists_bools_numbers
Squash merge of story 504: add MCP regression tests for non-string
front_matter values (arrays, bools, integers). The schema change itself
was already on master. Fixed the array assertion to match YAML's
space-after-comma inline sequence format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 11:08:21 +01:00
Timmy 9633ab35a6 fix: validate_story_dirs reads filesystem shadows instead of global CRDT singleton (bug 525)
The post-520 migration changed validate_story_dirs to read from
pipeline_state::read_all_typed() (the process-global CRDT singleton),
ignoring its root: &Path argument. This broke test isolation — tests
creating a tempdir saw dozens of results from ambient CRDT state,
causing non-deterministic failures that blocked every mergemaster gate.

Remove the CRDT singleton block and rely on the filesystem shadow scan
that already uses the root argument correctly. 1845/1845 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 10:52:42 +01:00
dave d1b845fd2e fix: move_item must not overwrite advanced CRDT stage when missing_ok=true (bug 524)
When a story is found in the CRDT but not in the expected source stages,
and missing_ok is true, return Ok(None) instead of proceeding with the move.
This prevents promote_ready_backlog_stories from demoting a story that has
already advanced to merge/done via a stale filesystem shadow in 1_backlog.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 00:21:39 +00:00
Timmy 934bda5904 Trying out sonnet for merges 2026-04-10 01:04:25 +01:00
Timmy 962e3d4e7d fmt 2026-04-10 01:04:09 +01:00
dave 0de9200d48 huskies: merge 512_story_migrate_chat_commands_from_filesystem_lookup_to_crdt_db 2026-04-09 23:03:58 +00:00
dave c324452b38 fix: commit uncommitted native JSON type changes on master
These changes (HashMap<String, String> → HashMap<String, Value> for front matter,
json_value_to_yaml_scalar, and oneOf schema for front_matter) were left uncommitted
on master after a previous merge, blocking the cherry-pick step of story 509's merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 22:35:52 +00:00
dave d3ee850f37 huskies: merge 500_story_remove_duplicate_pty_debug_log_lines 2026-04-09 22:16:03 +00:00
dave cbe016d7a2 huskies: merge 519_story_mergemaster_should_detect_no_commits_ahead_of_master_and_fail_loudly_instead_of_exiting_silently 2026-04-09 22:11:09 +00:00
dave 6f6d37e955 huskies: merge 514_story_delete_story_should_do_a_full_cleanup_crdt_op_db_row_filesystem_shadow_worktree_pending_timers 2026-04-09 22:05:18 +00:00
dave 84717b04bd huskies: merge 520_story_typed_pipeline_state_machine_in_rust_foundation_replaces_stringly_typed_crdt_views_with_strict_enums_subsumes_436 2026-04-09 21:27:48 +00:00
Timmy 1d9287389a feat(521): evict_item primitive + purge_story MCP tool
Adds the foundational capability to clear a story from the running
server's in-memory CRDT state without restarting the process. This is
story 521, motivated by the 2026-04-09 incident where stories 478 and
503 kept resurrecting from in-memory CRDT after every sqlite delete /
worktree removal / timers.json clear. The only previous remedy was a
full docker restart.

Changes:

  - server/src/crdt_state.rs: new `pub fn evict_item(story_id: &str)`.
    Looks up the item's CRDT OpId via the visible-index map, calls the
    bft-json-crdt list `delete()` primitive to construct a tombstone op,
    runs it through the existing `apply_and_persist` machinery (which
    signs, applies to the in-memory CRDT, and queues for persistence to
    crdt_ops), rebuilds the story_id → visible_index map, and drops the
    in-memory CONTENT_STORE entry. The tombstone survives a restart
    because it's persisted as a real CRDT op.

  - server/src/http/mcp/story_tools.rs: new `tool_purge_story` MCP
    handler that takes a story_id and calls evict_item. Deliberately
    minimal — does NOT touch agents, worktrees, pipeline_items shadow
    table, timers.json, or filesystem shadows. Compose with stop_agent,
    remove_worktree, etc. for a full purge. Story 514 (delete_story
    full cleanup) is the future "do it all" tool.

  - server/src/http/mcp/mod.rs: registers the `purge_story` tool in the
    tools list and dispatch table.

Usage:

    mcp__huskies__purge_story story_id="<full_story_id>"

Returns a string confirming the eviction. The story will no longer
appear in get_pipeline_status, list_agents, or any other API that
reads from the in-memory CRDT view, and on the next server restart
the persisted tombstone op will keep it from being reconstructed.

This is a prerequisite for story 514 (delete_story full cleanup) and
useful for any "kill it with fire" operator need.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 21:29:09 +01:00
Timmy 13635b01bc wip(501): timer cancellation infrastructure (parallel session WIP + main.rs wiring)
Bundles in-progress work from a parallel Claude session toward fixing
bug 501 (rate-limit retry timer doesn't cancel on stop_agent / move_story
/ successful completion). This commit lands the foundation but the MCP
tool wiring is still TODO.

  - server/src/chat/timer.rs: defense-in-depth check in tick_once that
    skips firing a timer for stories already past 3_qa (3_qa, 4_merge,
    5_done, 6_archived). The primary cancellation path will be in the
    MCP tools; this guards races where a timer was scheduled before the
    story was advanced and the tool didn't get a chance to cancel it.

  - server/src/http/context.rs: adds `timer_store: Arc<TimerStore>` field
    on AppContext so MCP tools (move_story, stop_agent, ...) can reach
    the shared timer store and cancel pending entries when the user
    intervenes manually. The test helper is updated to construct one.

  - server/src/main.rs: wires up a TimerStore instance in the AppContext
    initialiser so the binary actually compiles after the context.rs
    field addition. TODO: the matrix bot's spawn_bot still creates its
    own TimerStore instance (in chat/transport/matrix/bot/run.rs:220-227)
    rather than consuming the shared one — that refactor is the next
    step in the bug 501 fix.

What is NOT in this commit and is needed to actually fix bug 501:
  - The MCP tool side (move_story, stop_agent, delete_story) does not
    yet call timer_store.cancel(story_id) when invoked
  - The matrix bot's spawn_bot does not yet consume the shared
    timer_store from AppContext — it still creates its own

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 21:28:48 +01:00
Timmy 1707277bb7 sketch(520): add ExecutionMachine to the statig sketch for parity with bare
The statig version was missing the per-node ExecutionState machine that
the bare version has. This commit adds it as a sub-module so its
generated `State` enum doesn't collide with the top-level PipelineMachine's
`State` enum.

Adds:
  - ExecutionEvent enum (top-level, alongside PipelineEvent)
  - mod execution { … } sub-module containing ExecutionMachine
  - States: idle, pending, running, rate_limited, completed
  - Cross-cutting `any` superstate that handles Stopped/Reset → Idle
  - 6 new tests covering the happy path, rate-limit + resume, and
    stop-from-anywhere via the superstate

Also adds a small note about how statig's `#[action]` entry/exit hooks
would replace the bare version's external EventBus pattern (without
implementing it — we'd pick one or the other based on whether side
effects should live inside or outside the state machine).

Test count: 11 → 17 (all passing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 21:08:39 +01:00
Timmy 7c0015beb0 docs: file 12 stories from 2026-04-09 architecture session + handoff doc
Adds the markdown shadows for stories filed during today's stress-test
session, plus a SESSION_HANDOFF document for picking up the work in
a future session.

New stories (510-521):
  510 — bug: stale 1_backlog filesystem shadows get re-promoted by timers
  511 — bug: CRDT lamport clock resets to 1 on restart (FIXED in 99557635)
  512 — story: migrate chat commands from filesystem lookup to CRDT/DB
  513 — story: startup reconcile pass for state-machine drift detection
  514 — story: delete_story should do a full cleanup
  515 — story: debug MCP tool to dump in-memory CRDT state
  516 — story: update_story.description should create the section if missing
  517 — story: remove filesystem-shadow fallback paths from lifecycle.rs
  518 — story: apply_and_persist should log persist_tx send failures
  519 — story: mergemaster should fail loudly on no-op merges (mostly
                 obviated by Stage::Merge { commits_ahead: NonZeroU32 } in 520)
  520 — story: typed pipeline state machine in Rust (sketches added in f7d69cde)
  521 — story: MCP capability to write a CRDT tombstone for a story

Refactor 436 (unify story stuck states) is marked superseded by 520
via front_matter — its functionality is now part of the
Stage::Archived { reason: ArchiveReason } enum in story 520's design.

The SESSION_HANDOFF_2026-04-09.md document captures: the four-state-machine
drift situation that motivated story 520, today's bug fixes (502 + 511),
the off-leash rogue commit incident (forensic tag rogue-commit-2026-04-09-ac9f3ecf
preserved), the recommended next-session priority order, and useful
diagnostic recipes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 21:03:53 +01:00
Timmy f7d69cde50 sketch(520): typed pipeline state machine — bare and statig versions
Two parallel scratch experiments under server/examples/ exploring the
typed Rust state machine that should replace huskies's current
stringly-typed CRDT representation (story 520).

  - pipeline_state_sketch_bare.rs   — hand-rolled, plain enums + match
  - pipeline_state_sketch_statig.rs — using the statig crate

Both sketches:
  - Define the same Stage enum (Backlog, Coding, Qa, Merge, Done, Archived)
  - Define ArchiveReason (subsumes refactor 436's blocked/merge_failure/review_hold)
  - Define ExecutionState (per-node, separate from synced Stage) — bare only
  - Define PipelineEvent and the valid transitions
  - Make bug 519 unrepresentable: Stage::Merge requires NonZeroU32 commits_ahead
  - Make bug 502 unrepresentable: Coder agents can't be assigned to Merge state
  - Have happy-path tests, retry-loop tests, and invalid-transition tests

Differences:
  - Bare uses pure pattern matching, no framework. ~720 lines.
  - Statig uses #[state_machine] proc macro and gets free hierarchical
    states via the `active` superstate that factors out the cross-cutting
    Block / ReviewHold / Abandon / Supersede transitions across the four
    active stages. ~440 lines, 11 passing tests.

Run with:
  cargo run  --example pipeline_state_sketch_bare   -p huskies
  cargo run  --example pipeline_state_sketch_statig -p huskies
  cargo test --example pipeline_state_sketch_bare   -p huskies
  cargo test --example pipeline_state_sketch_statig -p huskies

Adds statig 0.3 as a dev-dependency in server/Cargo.toml. Cargo.lock
updated to include statig + statig-macro and their transitive deps.

Not wired into the main codebase. Once we agree on which version to
adopt, story 520 promotes the chosen sketch into a real
server/src/pipeline_state.rs module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 21:03:07 +01:00
Timmy 995576358f fix(511): replay CRDT ops by rowid ASC instead of seq ASC
The CRDT lamport seq is per-author and per-field, not globally
monotonic. Replaying by `seq ASC` causes field-update ops (which
have low per-field seq counters like 1, 2, 3) to be applied
BEFORE the list-insert ops they reference (which have higher
per-list seq counters like N for the Nth item ever inserted).
The field updates fail with ErrPathMismatch because the target
item doesn't exist yet, the field counter is never advanced,
and subsequent writes silently lose state.

Concretely on 2026-04-09 we observed: post-restart writes were
being persisted at seq=1,2,3,4,5,6,7 even though pre-restart
seq had reached 492. On the next replay, those low-seq field
updates would be applied before their seq=485+ creation ops,
silently dropping the updates. This was the load-bearing
"why does state keep flapping" bug today.

Fix: replay by `rowid ASC` (SQLite insertion order) instead.
Rowid preserves the causal order ops were originally applied
in, so field updates always come after the item insert they
reference.

Adds a regression test that constructs the exact scenario:
inserts a story (op gets seq=6), updates its stage (op gets
seq=1 because field counter starts at 0), persists both ops
in causal order, then replays both seq ASC (reproduces the
bug — stage update is lost) and rowid ASC (the fix — stage
update is preserved).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 21:02:01 +01:00
Timmy 5765fb57be merge(478): WebSocket CRDT sync layer (manual squash from feature/story-478)
Manual squash-merge of feature/story-478_… into master after the in-pipeline
mergemaster runs failed silently. The 478 agent did substantial real work
across multiple respawn cycles before being interrupted; commits on the
feature branch were intact and verified high-quality but never merged via
the normal pipeline path due to compounding bugs:

- The first mergemaster attempt ran ($0.82 in tokens) and exited "Done"
  cleanly but didn't push anything to master — likely the worktree was
  briefly on master rather than the feature branch when the merge_agent_work
  MCP tool ran, so it found nothing to merge.
- Subsequent timer fires defaulted to spawning coders instead of resuming
  mergemaster, burning more tokens for no progress.
- Bug 510 (split-brain shadows yanking done stories back to current) and
  bug 501 (timers don't cancel on stop/completion) compounded the cost.

What this commit lands:
- server/src/crdt_sync.rs (new, ~518 lines): GET /crdt-sync WebSocket
  handler that subscribes to locally-applied SignedOps and streams them as
  binary frames. Per-peer bounded queue (256 ops) drops slow peers.
- server/src/crdt_state.rs: new public functions subscribe_ops(),
  all_ops_json(), apply_remote_op() backing the sync handler. Adds the
  CRDT_OP_TX broadcast channel (capacity 1024).
- server/src/main.rs: wires up the sync subsystem at startup.
- server/src/http/mod.rs: registers the new endpoint.
- server/src/config.rs: adds optional rendezvous field for outbound peers.
- server/src/worktree.rs: minor changes from the original branch.
- server/Cargo.toml: cfg lint suppression for CrdtNode derive.
- crates/bft-json-crdt/src/debug.rs: fix unused-variable warnings.

Resolved a trivial test-mod merge conflict in crdt_state.rs (both 478 and
503 added new tests at the end of the test module — kept both sets).

Note: this is the squash of the original 478 work that the user explicitly
authorized landing. The earlier rogue commit ac9f3ecf — which added a
DIFFERENT, broken implementation of the same feature directly to master
under the user's identity without consent — was reverted earlier in this
session. The forensic tags rogue-commit-2026-04-09-ac9f3ecf and
pre-502-reset-2026-04-09 still exist for incident audit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:46:29 +01:00
dave 41515e3b8f huskies: merge 503_bug_depends_on_pointing_at_an_archived_story_is_silently_treated_as_deps_met_surprising_users 2026-04-09 18:31:29 +00:00
Timmy 8b2e068d3e fix(502): don't demote merge-stage stories on mergemaster attach
start_agent unconditionally called move_story_to_current at the top of
its body, before the agent-stage check. When called for mergemaster (or
qa) on a story in 4_merge/ AND a stale 1_backlog/ shadow of the story
existed (post-491/492 split-brain artifact), the move would find the
shadow and yank it to 2_current/, find_active_story_stage would then
report 2_current/, the stage check would expect a Coder agent, and
mergemaster would be rejected — leaving the story in 2_current/ to be
re-promoted by the next auto-assign tick. Infinite loop.

Gate the move so it only fires for Coder-stage agents. QA and
Mergemaster now attach to the story at its existing stage.

Adds a regression test that reproduces the split-brain scenario by
seeding both 4_merge/ and 1_backlog/ copies of the same story and
asserting (1) the stage check does not reject mergemaster, and (2) the
4_merge/ copy is preserved (i.e. not demoted to 2_current/).

Observed live on 2026-04-09 while story 478 was looping. Filed as
bug 502.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:18:01 +01:00
dave 59fbb56252 chore: ignore pipeline.db backup files in .huskies/.gitignore
The pre-478-surgery backup file was left untracked, causing the
acceptance gate to fail. Add pipeline.db.bak* pattern to ignore
such backup files.
2026-04-09 19:16:27 +01:00
Timmy 278bc8f050 Noting script/ commands for Docker rebuild and restart. 2026-04-09 18:00:20 +01:00
Timmy f5634a7434 Archiving the last of the pipeline story files 2026-04-09 17:59:54 +01:00
Timmy 8d9600183f Ignoring the huskies pipeline datastore 2026-04-09 17:59:12 +01:00
Timmy bb865687d5 Formatting 2026-04-09 17:58:29 +01:00
dave 1ffdd75475 huskies: accept 499_story_web_ui_shows_project_name_in_browser_tab_with_huskies_favicon 2026-04-08 05:04:16 +00:00
dave 46a254f80c huskies: accept 496_bug_hard_rate_limit_without_reset_at_never_auto_schedules_retry 2026-04-08 04:05:13 +00:00
dave 1baa83c1fd huskies: accept 491_story_watcher_fires_on_crdt_state_transitions_instead_of_filesystem_events 2026-04-08 04:03:12 +00:00
dave 870f49509d huskies: done 492_story_remove_filesystem_pipeline_state_and_store_story_content_in_database 2026-04-08 03:07:36 +00:00
dave 8fd49d563e huskies: merge 492_story_remove_filesystem_pipeline_state_and_store_story_content_in_database 2026-04-08 03:07:33 +00:00
dave f43d30bdae huskies: accept 497_bug_dependency_promotion_loop_missing_stories_with_met_deps_never_move_from_backlog_to_current 2026-04-08 01:33:33 +00:00
dave 6a56fa5623 huskies: done 497_bug_dependency_promotion_loop_missing_stories_with_met_deps_never_move_from_backlog_to_current 2026-04-08 01:32:29 +00:00