# Pipeline State Machine This document describes the huskies pipeline state machine in two halves: **(a)** the model that runs in production today, and **(b)** transitions, refinements, and corrections we have identified as needed but not yet implemented. The codebase is in a deliberate transitional state: a typed CRDT state machine exists at `server/src/pipeline_state.rs` (introduced by story 520) with strict Rust enums for every stage, archive reason, execution state, and event. It is fully defined and tested but **not yet called from non-test code** (`#![allow(dead_code)]` at the top of the module). Consumers will migrate incrementally. The model that is actually doing work is the older **filesystem-stage-string + front-matter-flag** model. Section (a) below documents both representations and the migration intent. --- ## (a) The current state machine ### Stages (production: filesystem string; future: typed enum) | Filesystem (production) | Typed (future) | Meaning | |---|---|---| | `work/1_backlog/` | `Stage::Backlog` | Story exists, waiting for dependencies or auto-assign promotion | | `work/2_current/` | `Stage::Coding` | Coder agent is running (or about to) | | `work/3_qa/` | `Stage::Qa` | Coder finished; gates / human review running | | `work/4_merge/` | `Stage::Merge { feature_branch, commits_ahead: NonZeroU32 }` | Gates passed, mergemaster ready to squash | | `work/5_done/` | `Stage::Done { merged_at, merge_commit }` | Mergemaster squashed to master | | `work/6_archived/` | `Stage::Archived { archived_at, reason: ArchiveReason }` | Out of the active flow | `5_done` auto-sweeps to `6_archived` after four hours. The typed `Stage::Done` variant always carries the merge SHA and timestamp; `Stage::Merge`'s `commits_ahead: NonZeroU32` makes "Merge with nothing to merge" structurally impossible (eliminates bug 519). ### Archive reasons (`pipeline_state.rs::ArchiveReason`) The typed model already enumerates the reasons a story can leave the active flow (subsumes the legacy `blocked`, `merge_failure`, and `review_hold` front-matter fields per story 436): - `Completed` — happy-path - `Abandoned` — user explicitly abandoned - `Superseded { by: StoryId }` — replaced by another story - `Blocked { reason: String }` — manually blocked, awaiting human resolution - `MergeFailed { reason: String }` — mergemaster gave up after retry budget - `ReviewHeld { reason: String }` — held for human review at user request ### Per-node execution state (`pipeline_state.rs::ExecutionState`) Stage is shared/CRDT-replicated. Execution state is per-node and lives under each node's pubkey in the CRDT, so there are no inter-author merge conflicts: - `Idle` - `Pending { agent, since }` — worktree being created, agent about to start - `Running { agent, started_at, last_heartbeat }` - `RateLimited { agent, resume_at }` - `Completed { agent, exit_code, completed_at }` ### Pipeline events (`pipeline_state.rs::PipelineEvent`) The typed model defines every event that drives a Stage transition. Each variant carries the data needed to construct the destination state, so a transition function can never accidentally land in an underspecified state: - `DepsMet` — dependencies met; promote from backlog - `GatesStarted` — coder starting gates - `GatesPassed { feature_branch, commits_ahead }` - `GatesFailed { reason }` - `QaSkipped { feature_branch, commits_ahead }` — qa-mode = "server"; skip QA, go to merge - `MergeSucceeded { merge_commit }` - `MergeFailedFinal { reason }` - `Accepted` — Done → Archived(Completed) ### Transitions (current production = MCP verb shape) #### Backlog → Coding (a.k.a. backlog → 2_current) - **Auto path**: `AgentPool::auto_assign_available_work` calls `promote_ready_backlog_stories`. A backlog story is promoted iff (a) it has an explicit non-empty `depends_on` AND (b) every dep is in `5_done` or `6_archived`. Stories with no `depends_on` are NOT auto-promoted — they wait for human scheduling. - Implemented in `server/src/agents/pool/auto_assign/auto_assign.rs::promote_ready_backlog_stories`. - **Manual path**: `mcp__huskies__move_story story_id=X target_stage=current`, or `mcp__huskies__start_agent` (which moves the story to current as a side-effect of starting an agent). - **Archived-dep warning**: if a dep was satisfied via `6_archived` rather than `5_done` (e.g. abandoned/superseded), the auto-assigner logs a prominent warning so the user can see the promotion was triggered by an archived dep. #### Coding → Qa (current → 3_qa) - Triggered when the coder agent finishes (gates start running). - `mcp__huskies__request_qa` is the manual verb. #### Qa → Coding (qa → current — rejection path) - `mcp__huskies__reject_qa story_id=X notes="..."` moves qa → current, **clears `review_hold`**, and writes the rejection notes (`agents/lifecycle.rs:210`). - Used when a qa agent fails or a human reviewer rejects the work. #### Qa → Merge (qa → 4_merge) - Triggered when QA gates pass. `mcp__huskies__move_story_to_merge` is the dedicated verb. - For server-mode QA: typed-side `PipelineEvent::QaSkipped` allows going from Coding → Merge directly without entering Qa. #### Merge → Done (merge → 5_done) - Mergemaster picks up a story in `4_merge/`, squashes the feature branch onto master, then transitions to `5_done`. - `mcp__huskies__move_story_to_merge` queues; mergemaster does the actual work. #### Done → Archived(Completed) (5_done → 6_archived) - Auto-sweep after four hours, OR - `mcp__huskies__accept_story` (immediate manual archive). #### Any-stage → Archived(other reasons) - **Abandoned / Superseded**: today done by `mcp__huskies__move_story target_stage=done` (no first-class verbs for these reasons; see (b) below). - **Blocked**: `blocked: true` flag in front matter is set on retry-limit exceedance. `mcp__huskies__unblock_story` clears the flag and resets retry_count. - **MergeFailed**: written to front matter when mergemaster fails; auto-assign skips these stories (`has_merge_failure` check). - **ReviewHeld**: `review_hold: true` flag is set automatically on spike completion; auto-assign skips these stories until the flag is cleared. #### Tombstone / purge - `mcp__huskies__delete_story` and `mcp__huskies__purge_story` permanently remove. Purge writes a CRDT tombstone. ### Auto-assign skip conditions (current production) `auto_assign_available_work` walks `2_current/`, `3_qa/`, `4_merge/` in order and attempts to dispatch a free agent to each unassigned story. It **skips** any story that: 1. Has `review_hold: true` in front matter (spikes after QA, manual hold). 2. Is `frozen` (`is_story_frozen` — pipeline advancement suspended for this story). 3. Has `blocked: true` (retry limit exceeded; cleared via `unblock_story`). 4. Has unmet `depends_on` dependencies. 5. (Merge stage only) Has a recorded merge failure (`has_merge_failure`). 6. (Merge stage only) Has an empty diff on the feature branch — auto-writes `merge_failure` and blocks immediately rather than wasting a mergemaster turn. ### Front-matter fields that gate transitions | Field | Type | Effect | |---|---|---| | `depends_on` | list of story IDs | Blocks backlog → current promotion until all deps are in 5_done or 6_archived | | `agent` | string (e.g. `coder-opus`) | Pins the preferred agent for next assignment | | `review_hold` | bool | Auto-assign skips this story; cleared by `reject_qa` or manual unblock | | `blocked` | bool | Auto-assign skips this story; cleared by `unblock_story` | | `frozen` | bool | Auto-assign skips this story; manual unfreeze required | | `merge_failure` | string | Auto-assign skips merge-stage agents on this story | | `retry_count` | int | Local-only (not in CRDT); incremented by orchestrator | ### Spike-specific behavior Per the typical lifecycle, a spike runs through `current → qa` like any work item, then **stops** in qa awaiting human review (`spikes skip merge`). This is implemented via `review_hold: true` being written automatically when a spike's qa gates pass. The user accepts (move qa → done) or rejects (move qa → current). Spikes do NOT auto-promote to merge. ### Mergemaster lifecycle The mergemaster agent only runs against stories in `4_merge/`. It: 1. Verifies the feature branch has commits (or the story is auto-blocked). 2. Squashes the feature branch onto master with a deterministic commit message. 3. Transitions the story to `5_done` with `merged_at` and `merge_commit`. 4. On failure beyond the retry budget, writes `merge_failure` and blocks the story (auto-assign then skips it). ### Agent terminated with committed work (bug 645 recovery path) When a coder agent terminates abnormally (e.g. the Claude Code CLI's `output.write(&bytes).is_ok()` PTY write assertion fires mid-session), the server-owned completion path detects the crash and checks for surviving work: 1. If the worktree is dirty but has commits ahead of master, reset the uncommitted files (`git checkout . && git clean -fd`) and run gates against the committed code. 2. If gates still fail but `git log master..HEAD` shows commits and `cargo check` passes, **advance to QA** instead of entering the retry/block path. This is the "work survived" check, implemented in `server/src/agents/pool/pipeline/advance.rs`. 3. Agents that die WITHOUT committed work (no commits ahead of master) still follow the existing retry → block path unchanged. This prevents false-positive blocking of stories where the agent completed meaningful work before crashing. ### Watchdog (current production) The "watchdog" at `server/src/agents/pool/auto_assign/watchdog.rs` runs every 30 ticks of the unified background loop. Today it does **one** thing: detect orphaned agents whose tokio task is `is_finished()` but whose status is still `Running` or `Pending`, and mark them `Failed` with an `AgentEvent::Error` emission. Bug 624 (now merged) extends it to also enforce `max_turns` and `max_budget_usd` limits — an agent over either limit is killed via the existing `kill_child_for_key` path and recorded with a typed termination reason. --- ## (b) Transitions and behaviors that don't yet exist (or are only partially wired) ### Migration of consumers off legacy strings to typed `Stage` enum The biggest outstanding piece. `pipeline_state.rs` is `#![allow(dead_code)]`. Every consumer (auto-assign, mergemaster, MCP tools, chat commands) still works with stage strings (`"2_current"`, `"4_merge"`) and front-matter flags. The projection layer (`TryFrom for PipelineItem` and friends) exists but isn't called outside tests. Migration is intentionally incremental. **Opportunity**: pick a leaf consumer (e.g. one MCP tool that reads the stage string) and migrate it to read `Stage` instead. Pattern repeats outward until all consumers go through the typed projection and the legacy stage-string code can be deleted. ### First-class verbs for archive reasons `ArchiveReason` already has six variants but only `Completed` (via `accept_story`) and `Blocked` (via the `blocked: true` flag) have dedicated MCP verbs. Today, `Abandoned`, `Superseded`, `MergeFailed`, and `ReviewHeld` are reached either via `move_story target_stage=done` (which doesn't carry the reason) or via setting front-matter flags on the live story. **Missing transitions**: - `mcp__huskies__supersede_story story_id=X by=Y` — sets stage to `Archived { reason: Superseded { by: Y } }`. Today we use `move_story → done`, losing the `by` reference. (Came up 2026-04-25 with spike 621 → refactor 623.) - `mcp__huskies__abandon_story story_id=X reason="..."` — sets `Archived { reason: Abandoned }`. Today done via `move_story → done` or `purge_story`. - `mcp__huskies__hold_for_review story_id=X reason="..."` — explicitly puts a story in `Archived { reason: ReviewHeld }` rather than relying on the auto-set `review_hold` flag. ### Type-conversion transitions Spike → story conversion is a real workflow (we do it when a spike's scope grows into an implementation story). Today, converting type via `update_story front_matter={"type": "story"}` does not bootstrap the `## Acceptance Criteria` section, and `add_criterion` then permanently fails on that story (see **bug 625** filed 2026-04-25). The `type` field passed via front_matter is also silently dropped — same silent-drop bug class as `acceptance_criteria`. The state machine should treat type conversion as a transition with side effects — at minimum, ensuring the AC section exists when transitioning to a type that requires it, and the displayed type reflects the new value (today the display chip is parsed from the immutable story_id prefix; story 578 in backlog will fix this by switching to numeric-only IDs). ### Limit-based agent termination (turn / budget) Pre-624 master: `max_turns` and `max_budget_usd` per-agent config were read by the metric tool (`tool_get_agent_remaining_turns_and_budget`) but **not enforced** anywhere. Observed `coder-1` running 282/50 turns and $10.05/$5.00 USD on story 623 before a human stopped it (bug 624, now merged). The bug 624 fix adds enforcement to the watchdog. The state-machine impact: introduces a new agent-termination path distinct from "Failed (orphan)" — something like `Failed(LimitExceeded { kind: Turns | Budget })`. The `ExecutionState` enum may want a corresponding terminal variant so it can be distinguished from generic `Failed`. ### Pinned-agent honoring under contention When a story has `agent: coder-opus` pinned but `coder-opus` is busy, today's auto-assign behavior is to leave the story unassigned — but if a human stops the running attempt and the story sits in `current/`, auto-assign **re-grabs it with the default coder** rather than waiting for the pinned agent. Observed multiple times on 2026-04-25 with story 623: pinning `coder-opus` did not prevent `coder-1` (sonnet) from being auto-assigned during opus's busy window. **Missing behavior**: auto-assign should treat a pinned agent as a hard filter ("only this agent can take this story"), not a preference. Today the workaround is to also set `depends_on` on a phantom story, or move the story back to backlog and let the dependency system gate it. ### Honoring the `blocked` flag (bug 559) `559_bug_mergemaster_ignores_blocked_flag_and_keeps_respawning_on_blocked_stories` is in backlog. Even though `blocked: true` is documented as a skip condition in `auto_assign_available_work`, mergemaster's spawn path apparently checks something different (or earlier) and respawns on blocked merge-stage stories. The state machine should make `Stage::Archived { reason: Blocked }` a single authoritative source so no consumer can incidentally bypass it. ### Formal "ghost story recovery" transition The `move_story` MCP tool description mentions "recovering a ghost story by moving it back to current" as a valid use. Ghost stories are CRDT entries with no corresponding filesystem stage directory (or the inverse). Today this is an `update_story + move_story` ad-hoc dance. A first-class `recover_ghost_story` verb that reconciles the CRDT and filesystem would formalize the recovery path. ### Operator-level visibility / observability There is no UI, CLI, or doc that shows "the state machine as a diagram." The typed enums are the closest thing to a canonical specification, but they aren't rendered anywhere a human can see at a glance: which stages exist, which transitions are valid, which events trigger them. A generated state diagram (graphviz or mermaid, dumped into this doc on each release) would help both new contributors and operators triaging stuck pipelines. ### Review-hold cleanup verb `review_hold: true` is set automatically on spike completion. Clearing it is done as a side effect of `reject_qa` (which also moves the story qa → current) or by manually editing front matter. There is no clean "I have reviewed this, release the hold" verb that doesn't also move the story. ### Cross-node concurrency for execution state `ExecutionState` is per-node (keyed by pubkey) so two nodes can't fight over who's running an agent. But there is no formal transition that says "node A hands the story to node B" if node A goes offline. The state machine's distributed semantics for this case are not yet specified. --- ## How to update this document Whenever you discover a transition that doesn't yet exist, or a flag that behaves surprisingly, add it to **section (b)** with: - A short description of the desired behavior - Citation of the work item or incident that surfaced it - Pointer to the place in `pipeline_state.rs` where it should be modeled (or note "needs a new variant" if it doesn't fit any existing enum yet) When a transition from (b) ships, move it to (a) with the relevant file:line citations.