diff --git a/.huskies/specs/tech/STATE_MACHINE.md b/.huskies/specs/tech/STATE_MACHINE.md new file mode 100644 index 00000000..18c30055 --- /dev/null +++ b/.huskies/specs/tech/STATE_MACHINE.md @@ -0,0 +1,331 @@ +# Pipeline State Machine + +This document describes the huskies pipeline state machine in two halves: +**(a)** the model that runs in production today, and **(b)** transitions, refinements, +and corrections we have identified as needed but not yet implemented. + +The codebase is in a deliberate transitional state: a typed CRDT state machine +exists at `server/src/pipeline_state.rs` (introduced by story 520) with strict Rust +enums for every stage, archive reason, execution state, and event. It is fully +defined and tested but **not yet called from non-test code** (`#![allow(dead_code)]` +at the top of the module). Consumers will migrate incrementally. + +The model that is actually doing work is the older **filesystem-stage-string + +front-matter-flag** model. Section (a) below documents both representations and +the migration intent. + +--- + +## (a) The current state machine + +### Stages (production: filesystem string; future: typed enum) + +| Filesystem (production) | Typed (future) | Meaning | +|---|---|---| +| `work/1_backlog/` | `Stage::Backlog` | Story exists, waiting for dependencies or auto-assign promotion | +| `work/2_current/` | `Stage::Coding` | Coder agent is running (or about to) | +| `work/3_qa/` | `Stage::Qa` | Coder finished; gates / human review running | +| `work/4_merge/` | `Stage::Merge { feature_branch, commits_ahead: NonZeroU32 }` | Gates passed, mergemaster ready to squash | +| `work/5_done/` | `Stage::Done { merged_at, merge_commit }` | Mergemaster squashed to master | +| `work/6_archived/` | `Stage::Archived { archived_at, reason: ArchiveReason }` | Out of the active flow | + +`5_done` auto-sweeps to `6_archived` after four hours. The typed `Stage::Done` +variant always carries the merge SHA and timestamp; `Stage::Merge`'s +`commits_ahead: NonZeroU32` makes "Merge with nothing to merge" structurally +impossible (eliminates bug 519). + +### Archive reasons (`pipeline_state.rs::ArchiveReason`) + +The typed model already enumerates the reasons a story can leave the active flow +(subsumes the legacy `blocked`, `merge_failure`, and `review_hold` front-matter +fields per story 436): + +- `Completed` — happy-path +- `Abandoned` — user explicitly abandoned +- `Superseded { by: StoryId }` — replaced by another story +- `Blocked { reason: String }` — manually blocked, awaiting human resolution +- `MergeFailed { reason: String }` — mergemaster gave up after retry budget +- `ReviewHeld { reason: String }` — held for human review at user request + +### Per-node execution state (`pipeline_state.rs::ExecutionState`) + +Stage is shared/CRDT-replicated. Execution state is per-node and lives under +each node's pubkey in the CRDT, so there are no inter-author merge conflicts: + +- `Idle` +- `Pending { agent, since }` — worktree being created, agent about to start +- `Running { agent, started_at, last_heartbeat }` +- `RateLimited { agent, resume_at }` +- `Completed { agent, exit_code, completed_at }` + +### Pipeline events (`pipeline_state.rs::PipelineEvent`) + +The typed model defines every event that drives a Stage transition. Each variant +carries the data needed to construct the destination state, so a transition +function can never accidentally land in an underspecified state: + +- `DepsMet` — dependencies met; promote from backlog +- `GatesStarted` — coder starting gates +- `GatesPassed { feature_branch, commits_ahead }` +- `GatesFailed { reason }` +- `QaSkipped { feature_branch, commits_ahead }` — qa-mode = "server"; skip QA, go to merge +- `MergeSucceeded { merge_commit }` +- `MergeFailedFinal { reason }` +- `Accepted` — Done → Archived(Completed) + +### Transitions (current production = MCP verb shape) + +#### Backlog → Coding (a.k.a. backlog → 2_current) + +- **Auto path**: `AgentPool::auto_assign_available_work` calls + `promote_ready_backlog_stories`. A backlog story is promoted iff (a) it has + an explicit non-empty `depends_on` AND (b) every dep is in `5_done` or + `6_archived`. Stories with no `depends_on` are NOT auto-promoted — they wait + for human scheduling. + - Implemented in `server/src/agents/pool/auto_assign/auto_assign.rs::promote_ready_backlog_stories`. +- **Manual path**: `mcp__huskies__move_story story_id=X target_stage=current`, + or `mcp__huskies__start_agent` (which moves the story to current as a + side-effect of starting an agent). +- **Archived-dep warning**: if a dep was satisfied via `6_archived` rather than + `5_done` (e.g. abandoned/superseded), the auto-assigner logs a prominent + warning so the user can see the promotion was triggered by an archived dep. + +#### Coding → Qa (current → 3_qa) + +- Triggered when the coder agent finishes (gates start running). +- `mcp__huskies__request_qa` is the manual verb. + +#### Qa → Coding (qa → current — rejection path) + +- `mcp__huskies__reject_qa story_id=X notes="..."` moves qa → current, + **clears `review_hold`**, and writes the rejection notes + (`agents/lifecycle.rs:210`). +- Used when a qa agent fails or a human reviewer rejects the work. + +#### Qa → Merge (qa → 4_merge) + +- Triggered when QA gates pass. `mcp__huskies__move_story_to_merge` is the + dedicated verb. +- For server-mode QA: typed-side `PipelineEvent::QaSkipped` allows going from + Coding → Merge directly without entering Qa. + +#### Merge → Done (merge → 5_done) + +- Mergemaster picks up a story in `4_merge/`, squashes the feature branch onto + master, then transitions to `5_done`. +- `mcp__huskies__move_story_to_merge` queues; mergemaster does the actual work. + +#### Done → Archived(Completed) (5_done → 6_archived) + +- Auto-sweep after four hours, OR +- `mcp__huskies__accept_story` (immediate manual archive). + +#### Any-stage → Archived(other reasons) + +- **Abandoned / Superseded**: today done by `mcp__huskies__move_story + target_stage=done` (no first-class verbs for these reasons; see (b) below). +- **Blocked**: `blocked: true` flag in front matter is set on retry-limit + exceedance. `mcp__huskies__unblock_story` clears the flag and resets + retry_count. +- **MergeFailed**: written to front matter when mergemaster fails; auto-assign + skips these stories (`has_merge_failure` check). +- **ReviewHeld**: `review_hold: true` flag is set automatically on spike + completion; auto-assign skips these stories until the flag is cleared. + +#### Tombstone / purge + +- `mcp__huskies__delete_story` and `mcp__huskies__purge_story` permanently + remove. Purge writes a CRDT tombstone. + +### Auto-assign skip conditions (current production) + +`auto_assign_available_work` walks `2_current/`, `3_qa/`, `4_merge/` in order +and attempts to dispatch a free agent to each unassigned story. It **skips** +any story that: + +1. Has `review_hold: true` in front matter (spikes after QA, manual hold). +2. Is `frozen` (`is_story_frozen` — pipeline advancement suspended for this story). +3. Has `blocked: true` (retry limit exceeded; cleared via `unblock_story`). +4. Has unmet `depends_on` dependencies. +5. (Merge stage only) Has a recorded merge failure (`has_merge_failure`). +6. (Merge stage only) Has an empty diff on the feature branch — auto-writes + `merge_failure` and blocks immediately rather than wasting a mergemaster turn. + +### Front-matter fields that gate transitions + +| Field | Type | Effect | +|---|---|---| +| `depends_on` | list of story IDs | Blocks backlog → current promotion until all deps are in 5_done or 6_archived | +| `agent` | string (e.g. `coder-opus`) | Pins the preferred agent for next assignment | +| `review_hold` | bool | Auto-assign skips this story; cleared by `reject_qa` or manual unblock | +| `blocked` | bool | Auto-assign skips this story; cleared by `unblock_story` | +| `frozen` | bool | Auto-assign skips this story; manual unfreeze required | +| `merge_failure` | string | Auto-assign skips merge-stage agents on this story | +| `retry_count` | int | Local-only (not in CRDT); incremented by orchestrator | + +### Spike-specific behavior + +Per the typical lifecycle, a spike runs through `current → qa` like any work +item, then **stops** in qa awaiting human review (`spikes skip merge`). This +is implemented via `review_hold: true` being written automatically when a +spike's qa gates pass. The user accepts (move qa → done) or rejects (move +qa → current). Spikes do NOT auto-promote to merge. + +### Mergemaster lifecycle + +The mergemaster agent only runs against stories in `4_merge/`. It: + +1. Verifies the feature branch has commits (or the story is auto-blocked). +2. Squashes the feature branch onto master with a deterministic commit message. +3. Transitions the story to `5_done` with `merged_at` and `merge_commit`. +4. On failure beyond the retry budget, writes `merge_failure` and blocks the + story (auto-assign then skips it). + +### Watchdog (current production) + +The "watchdog" at `server/src/agents/pool/auto_assign/watchdog.rs` runs every +30 ticks of the unified background loop. Today it does **one** thing: detect +orphaned agents whose tokio task is `is_finished()` but whose status is still +`Running` or `Pending`, and mark them `Failed` with an `AgentEvent::Error` +emission. Bug 624 (now merged) extends it to also enforce `max_turns` and +`max_budget_usd` limits — an agent over either limit is killed via the +existing `kill_child_for_key` path and recorded with a typed termination +reason. + +--- + +## (b) Transitions and behaviors that don't yet exist (or are only partially wired) + +### Migration of consumers off legacy strings to typed `Stage` enum + +The biggest outstanding piece. `pipeline_state.rs` is `#![allow(dead_code)]`. +Every consumer (auto-assign, mergemaster, MCP tools, chat commands) still +works with stage strings (`"2_current"`, `"4_merge"`) and front-matter flags. +The projection layer (`TryFrom for PipelineItem` and +friends) exists but isn't called outside tests. Migration is intentionally +incremental. + +**Opportunity**: pick a leaf consumer (e.g. one MCP tool that reads the stage +string) and migrate it to read `Stage` instead. Pattern repeats outward until +all consumers go through the typed projection and the legacy stage-string +code can be deleted. + +### First-class verbs for archive reasons + +`ArchiveReason` already has six variants but only `Completed` (via +`accept_story`) and `Blocked` (via the `blocked: true` flag) have dedicated +MCP verbs. Today, `Abandoned`, `Superseded`, `MergeFailed`, and `ReviewHeld` +are reached either via `move_story target_stage=done` (which doesn't carry +the reason) or via setting front-matter flags on the live story. + +**Missing transitions**: + +- `mcp__huskies__supersede_story story_id=X by=Y` — sets stage to + `Archived { reason: Superseded { by: Y } }`. Today we use + `move_story → done`, losing the `by` reference. (Came up 2026-04-25 with + spike 621 → refactor 623.) +- `mcp__huskies__abandon_story story_id=X reason="..."` — sets + `Archived { reason: Abandoned }`. Today done via `move_story → done` or + `purge_story`. +- `mcp__huskies__hold_for_review story_id=X reason="..."` — explicitly puts + a story in `Archived { reason: ReviewHeld }` rather than relying on the + auto-set `review_hold` flag. + +### Type-conversion transitions + +Spike → story conversion is a real workflow (we do it when a spike's scope +grows into an implementation story). Today, converting type via `update_story +front_matter={"type": "story"}` does not bootstrap the +`## Acceptance Criteria` section, and `add_criterion` then permanently fails +on that story (see **bug 625** filed 2026-04-25). The `type` field passed via +front_matter is also silently dropped — same silent-drop bug class as +`acceptance_criteria`. The state machine should treat type conversion as a +transition with side effects — at minimum, ensuring the AC section exists +when transitioning to a type that requires it, and the displayed type +reflects the new value (today the display chip is parsed from the immutable +story_id prefix; story 578 in backlog will fix this by switching to +numeric-only IDs). + +### Limit-based agent termination (turn / budget) + +Pre-624 master: `max_turns` and `max_budget_usd` per-agent config were read +by the metric tool (`tool_get_agent_remaining_turns_and_budget`) but **not +enforced** anywhere. Observed `coder-1` running 282/50 turns and $10.05/$5.00 +USD on story 623 before a human stopped it (bug 624, now merged). + +The bug 624 fix adds enforcement to the watchdog. The state-machine impact: +introduces a new agent-termination path distinct from "Failed (orphan)" — +something like `Failed(LimitExceeded { kind: Turns | Budget })`. The +`ExecutionState` enum may want a corresponding terminal variant so it can be +distinguished from generic `Failed`. + +### Pinned-agent honoring under contention + +When a story has `agent: coder-opus` pinned but `coder-opus` is busy, today's +auto-assign behavior is to leave the story unassigned — but if a human stops +the running attempt and the story sits in `current/`, auto-assign **re-grabs +it with the default coder** rather than waiting for the pinned agent. +Observed multiple times on 2026-04-25 with story 623: pinning `coder-opus` +did not prevent `coder-1` (sonnet) from being auto-assigned during opus's +busy window. + +**Missing behavior**: auto-assign should treat a pinned agent as a hard +filter ("only this agent can take this story"), not a preference. Today the +workaround is to also set `depends_on` on a phantom story, or move the story +back to backlog and let the dependency system gate it. + +### Honoring the `blocked` flag (bug 559) + +`559_bug_mergemaster_ignores_blocked_flag_and_keeps_respawning_on_blocked_stories` +is in backlog. Even though `blocked: true` is documented as a skip condition +in `auto_assign_available_work`, mergemaster's spawn path apparently checks +something different (or earlier) and respawns on blocked merge-stage stories. +The state machine should make `Stage::Archived { reason: Blocked }` a single +authoritative source so no consumer can incidentally bypass it. + +### Formal "ghost story recovery" transition + +The `move_story` MCP tool description mentions "recovering a ghost story by +moving it back to current" as a valid use. Ghost stories are CRDT entries +with no corresponding filesystem stage directory (or the inverse). Today this +is an `update_story + move_story` ad-hoc dance. A first-class +`recover_ghost_story` verb that reconciles the CRDT and filesystem would +formalize the recovery path. + +### Operator-level visibility / observability + +There is no UI, CLI, or doc that shows "the state machine as a diagram." The +typed enums are the closest thing to a canonical specification, but they +aren't rendered anywhere a human can see at a glance: which stages exist, +which transitions are valid, which events trigger them. A generated state +diagram (graphviz or mermaid, dumped into this doc on each release) would +help both new contributors and operators triaging stuck pipelines. + +### Review-hold cleanup verb + +`review_hold: true` is set automatically on spike completion. Clearing it is +done as a side effect of `reject_qa` (which also moves the story qa → +current) or by manually editing front matter. There is no clean "I have +reviewed this, release the hold" verb that doesn't also move the story. + +### Cross-node concurrency for execution state + +`ExecutionState` is per-node (keyed by pubkey) so two nodes can't fight over +who's running an agent. But there is no formal transition that says "node A +hands the story to node B" if node A goes offline. The state machine's +distributed semantics for this case are not yet specified. + +--- + +## How to update this document + +Whenever you discover a transition that doesn't yet exist, or a flag that +behaves surprisingly, add it to **section (b)** with: + +- A short description of the desired behavior +- Citation of the work item or incident that surfaced it +- Pointer to the place in `pipeline_state.rs` where it should be modeled (or + note "needs a new variant" if it doesn't fit any existing enum yet) + +When a transition from (b) ships, move it to (a) with the relevant file:line +citations.