Pipeline State Machine
This document describes the huskies pipeline state machine in two halves: (a) the model that runs in production today, and (b) transitions, refinements, and corrections we have identified as needed but not yet implemented.
The codebase is in a deliberate transitional state: a typed CRDT state machine
exists at server/src/pipeline_state.rs (introduced by story 520) with strict Rust
enums for every stage, archive reason, execution state, and event. It is fully
defined and tested but not yet called from non-test code (#![allow(dead_code)]
at the top of the module). Consumers will migrate incrementally.
The model that is actually doing work is the older filesystem-stage-string + front-matter-flag model. Section (a) below documents both representations and the migration intent.
(a) The current state machine
Stages (production: filesystem string; future: typed enum)
| Filesystem (production) | Typed (future) | Meaning |
|---|---|---|
| `work/1_backlog/` | `Stage::Backlog` | Story exists, waiting for dependencies or auto-assign promotion |
| `work/2_current/` | `Stage::Coding` | Coder agent is running (or about to) |
| `work/3_qa/` | `Stage::Qa` | Coder finished; gates / human review running |
| `work/4_merge/` | `Stage::Merge { feature_branch, commits_ahead: NonZeroU32 }` | Gates passed, mergemaster ready to squash |
| `work/5_done/` | `Stage::Done { merged_at, merge_commit }` | Mergemaster squashed to master |
| `work/6_archived/` | `Stage::Archived { archived_at, reason: ArchiveReason }` | Out of the active flow |
5_done auto-sweeps to 6_archived after four hours. The typed Stage::Done
variant always carries the merge SHA and timestamp; Stage::Merge's
commits_ahead: NonZeroU32 makes "Merge with nothing to merge" structurally
impossible (eliminates bug 519).
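To illustrate why `NonZeroU32` rules out an empty merge, here is a minimal sketch. It is not the code in pipeline_state.rs: the field types (`String`, `u64`) and the `merge_stage` helper are assumptions made for the example, and `Stage::Archived` is left out for brevity.

```rust
use std::num::NonZeroU32;

/// Illustrative mirror of the typed stages. Field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merged_at: u64, merge_commit: String },
}

/// Hypothetical helper: build a Merge stage from a raw commit count.
/// Returns None when there is nothing to merge, because
/// NonZeroU32::new(0) is None.
pub fn merge_stage(feature_branch: &str, commits_ahead: u32) -> Option<Stage> {
    let commits_ahead = NonZeroU32::new(commits_ahead)?;
    Some(Stage::Merge {
        feature_branch: feature_branch.to_string(),
        commits_ahead,
    })
}
```

A caller that counts zero commits ahead of master gets `None` back and has to handle that case explicitly; there is no way to construct `Stage::Merge` with a zero count.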
Archive reasons (pipeline_state.rs::ArchiveReason)
The typed model already enumerates the reasons a story can leave the active flow
(subsumes the legacy blocked, merge_failure, and review_hold front-matter
fields per story 436):
- `Completed`: happy path
- `Abandoned`: user explicitly abandoned
- `Superseded { by: StoryId }`: replaced by another story
- `Blocked { reason: String }`: manually blocked, awaiting human resolution
- `MergeFailed { reason: String }`: mergemaster gave up after retry budget
- `ReviewHeld { reason: String }`: held for human review at user request
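As a rough sketch of how the three legacy front-matter fields could fold into a typed reason: the `LegacyFlags` struct, the `reason_from_legacy` function, the precedence order, and the placeholder reason strings are all assumptions for illustration, not the projection that ships in pipeline_state.rs.

```rust
/// `StoryId` as a String alias is an assumption for this sketch.
pub type StoryId = String;

#[derive(Debug, Clone, PartialEq)]
pub enum ArchiveReason {
    Completed,
    Abandoned,
    Superseded { by: StoryId },
    Blocked { reason: String },
    MergeFailed { reason: String },
    ReviewHeld { reason: String },
}

/// Hypothetical view of the three legacy front-matter fields.
pub struct LegacyFlags {
    pub blocked: bool,
    pub merge_failure: Option<String>,
    pub review_hold: bool,
}

/// Map legacy flags to a typed reason, or None if no flag is set.
/// Precedence (merge_failure, then blocked, then review_hold) is assumed.
pub fn reason_from_legacy(flags: &LegacyFlags) -> Option<ArchiveReason> {
    if let Some(msg) = &flags.merge_failure {
        return Some(ArchiveReason::MergeFailed { reason: msg.clone() });
    }
    if flags.blocked {
        // Placeholder text: the legacy bool carries no reason of its own.
        return Some(ArchiveReason::Blocked {
            reason: "retry limit exceeded".to_string(),
        });
    }
    if flags.review_hold {
        return Some(ArchiveReason::ReviewHeld {
            reason: "held for review".to_string(),
        });
    }
    None
}
```

The point of the typed form is that a story carries exactly one reason, whereas the legacy flags can be set in any combination.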
Per-node execution state (pipeline_state.rs::ExecutionState)
Stage is shared/CRDT-replicated. Execution state is per-node and lives under each node's pubkey in the CRDT, so there are no inter-author merge conflicts:
- `Idle`
- `Pending { agent, since }`: worktree being created, agent about to start
- `Running { agent, started_at, last_heartbeat }`
- `RateLimited { agent, resume_at }`
- `Completed { agent, exit_code, completed_at }`
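The per-node keying can be pictured as a map from pubkey to state. The sketch below is an assumption-laden illustration: the `ExecutionStates` container, the `String` pubkey, and the integer timestamps are hypothetical, and the real CRDT layout may differ.

```rust
use std::collections::HashMap;

/// Illustrative mirror of the execution states. Field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum ExecutionState {
    Idle,
    Pending { agent: String, since: u64 },
    Running { agent: String, started_at: u64, last_heartbeat: u64 },
    RateLimited { agent: String, resume_at: u64 },
    Completed { agent: String, exit_code: i32, completed_at: u64 },
}

/// Hypothetical container: one slot per node pubkey.
#[derive(Default)]
pub struct ExecutionStates {
    by_node: HashMap<String, ExecutionState>,
}

impl ExecutionStates {
    /// A node only ever writes under its own key, so writes from
    /// different nodes never touch the same entry.
    pub fn set(&mut self, node_pubkey: &str, state: ExecutionState) {
        self.by_node.insert(node_pubkey.to_string(), state);
    }

    /// Nodes that have never written are treated as Idle.
    pub fn get(&self, node_pubkey: &str) -> ExecutionState {
        self.by_node
            .get(node_pubkey)
            .cloned()
            .unwrap_or(ExecutionState::Idle)
    }
}
```

Because each author writes only its own slot, merging two replicas is a union of disjoint entries rather than a conflict resolution.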
Pipeline events (pipeline_state.rs::PipelineEvent)
The typed model defines every event that drives a Stage transition. Each variant carries the data needed to construct the destination state, so a transition function can never accidentally land in an underspecified state:
- `DepsMet`: dependencies met; promote from backlog
- `GatesStarted`: coder starting gates
- `GatesPassed { feature_branch, commits_ahead }`
- `GatesFailed { reason }`
- `QaSkipped { feature_branch, commits_ahead }`: qa-mode = "server"; skip QA, go to merge
- `MergeSucceeded { merge_commit }`
- `MergeFailedFinal { reason }`
- `Accepted`: Done → Archived(Completed)
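A transition function over these events might look like the following sketch. It is a simplified illustration, not the function in pipeline_state.rs: `GatesFailed` and `MergeFailedFinal` are omitted, `Done` drops its timestamp, `ArchivedCompleted` stands in for `Archived { reason: Completed }`, and the pairing of each event with a source stage is inferred from the transition descriptions in this document.

```rust
use std::num::NonZeroU32;

#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merge_commit: String },
    ArchivedCompleted,
}

#[derive(Debug, Clone, PartialEq)]
pub enum PipelineEvent {
    DepsMet,
    GatesStarted,
    GatesPassed { feature_branch: String, commits_ahead: NonZeroU32 },
    QaSkipped { feature_branch: String, commits_ahead: NonZeroU32 },
    MergeSucceeded { merge_commit: String },
    Accepted,
}

/// Apply one event to a stage. Every valid arm builds the destination
/// purely from data carried by the event, so no field is left unset.
pub fn transition(stage: Stage, event: PipelineEvent) -> Result<Stage, String> {
    use PipelineEvent::*;
    match (stage, event) {
        (Stage::Backlog, DepsMet) => Ok(Stage::Coding),
        (Stage::Coding, GatesStarted) => Ok(Stage::Qa),
        (Stage::Qa, GatesPassed { feature_branch, commits_ahead })
        | (Stage::Coding, QaSkipped { feature_branch, commits_ahead }) => {
            Ok(Stage::Merge { feature_branch, commits_ahead })
        }
        (Stage::Merge { .. }, MergeSucceeded { merge_commit }) => {
            Ok(Stage::Done { merge_commit })
        }
        (Stage::Done { .. }, Accepted) => Ok(Stage::ArchivedCompleted),
        // Anything else is rejected rather than silently ignored.
        (stage, event) => Err(format!("invalid transition: {:?} on {:?}", stage, event)),
    }
}
```

Returning a `Result` makes an out-of-order event (for example `Accepted` on a backlog story) a visible error at the call site.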
Transitions (current production = MCP verb shape)
Backlog → Coding (a.k.a. backlog → 2_current)
- Auto path: `AgentPool::auto_assign_available_work` calls `promote_ready_backlog_stories`. A backlog story is promoted iff (a) it has an explicit non-empty `depends_on` AND (b) every dep is in `5_done` or `6_archived`. Stories with no `depends_on` are NOT auto-promoted; they wait for human scheduling.
  - Implemented in `server/src/agents/pool/auto_assign/auto_assign.rs::promote_ready_backlog_stories`.
- Manual path: `mcp__huskies__move_story story_id=X target_stage=current`, or `mcp__huskies__start_agent` (which moves the story to current as a side-effect of starting an agent).
- Archived-dep warning: if a dep was satisfied via `6_archived` rather than `5_done` (e.g. abandoned/superseded), the auto-assigner logs a prominent warning so the user can see the promotion was triggered by an archived dep.
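The auto-promotion rule can be sketched as a pure predicate. The function names and the `HashMap` of story ID to stage directory are hypothetical shapes chosen for the example; the real `promote_ready_backlog_stories` reads its inputs differently.

```rust
use std::collections::HashMap;

/// Decide whether a backlog story is ready for auto-promotion.
/// `stage_of` maps story IDs to stage directory names (assumed shape).
pub fn is_promotable(depends_on: &[&str], stage_of: &HashMap<&str, &str>) -> bool {
    // (a) Stories without explicit dependencies wait for human scheduling.
    if depends_on.is_empty() {
        return false;
    }
    // (b) Every dependency must be in 5_done or 6_archived.
    // A dependency missing from the map counts as unmet.
    depends_on.iter().all(|dep| {
        stage_of
            .get(dep)
            .map_or(false, |stage| *stage == "5_done" || *stage == "6_archived")
    })
}

/// Dependencies that were satisfied via 6_archived; these are the ones
/// that warrant the archived-dep warning.
pub fn archived_deps<'a>(depends_on: &[&'a str], stage_of: &HashMap<&str, &str>) -> Vec<&'a str> {
    depends_on
        .iter()
        .copied()
        .filter(|dep| stage_of.get(dep).map_or(false, |stage| *stage == "6_archived"))
        .collect()
}
```

With deps `["100", "101"]` where 100 is in `5_done` and 101 is in `6_archived`, the story is promotable and `archived_deps` returns `["101"]`, which is what the warning would report.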
Coding → Qa (current → 3_qa)
- Triggered when the coder agent finishes (gates start running). `mcp__huskies__request_qa` is the manual verb.
Qa → Coding (qa → current — rejection path)
- `mcp__huskies__reject_qa story_id=X notes="..."` moves qa → current, clears `review_hold`, and writes the rejection notes (`agents/lifecycle.rs:210`).
- Used when a qa agent fails or a human reviewer rejects the work.
Qa → Merge (qa → 4_merge)
- Triggered when QA gates pass. `mcp__huskies__move_story_to_merge` is the dedicated verb.
- For server-mode QA: typed-side `PipelineEvent::QaSkipped` allows going from Coding → Merge directly without entering Qa.
Merge → Done (merge → 5_done)
- Mergemaster picks up a story in `4_merge/`, squashes the feature branch onto master, then transitions to `5_done`.
- `mcp__huskies__move_story_to_merge` queues; mergemaster does the actual work.
Done → Archived(Completed) (5_done → 6_archived)
- Auto-sweep after four hours, OR `mcp__huskies__accept_story` (immediate manual archive).
Any-stage → Archived(other reasons)
- Abandoned / Superseded: today done by `mcp__huskies__move_story target_stage=done` (no first-class verbs for these reasons; see (b) below).
- Blocked: `blocked: true` flag in front matter is set on retry-limit exceedance. `mcp__huskies__unblock_story` clears the flag and resets `retry_count`.
- MergeFailed: written to front matter when mergemaster fails; auto-assign skips these stories (`has_merge_failure` check).
- ReviewHeld: `review_hold: true` flag is set automatically on spike completion; auto-assign skips these stories until the flag is cleared.
Tombstone / purge
- `mcp__huskies__delete_story` and `mcp__huskies__purge_story` permanently remove a story. Purge writes a CRDT tombstone.
Auto-assign skip conditions (current production)
auto_assign_available_work walks 2_current/, 3_qa/, 4_merge/ in order
and attempts to dispatch a free agent to each unassigned story. It skips
any story that:
- Has `review_hold: true` in front matter (spikes after QA, manual hold).
- Is `frozen` (`is_story_frozen`: pipeline advancement suspended for this story).
- Has `blocked: true` (retry limit exceeded; cleared via `unblock_story`).
- Has unmet `depends_on` dependencies.
- (Merge stage only) Has a recorded merge failure (`has_merge_failure`).
- (Merge stage only) Has an empty diff on the feature branch; auto-assign writes `merge_failure` and blocks immediately rather than wasting a mergemaster turn.
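The skip conditions above amount to an ordered series of checks. The sketch below is illustrative only: `StoryView`, `SkipReason`, and `skip_reason` are hypothetical names, the check order simply follows the list, and the side effect of writing `merge_failure` on an empty diff is not modeled.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Stage {
    Current,
    Qa,
    Merge,
}

#[derive(Debug, PartialEq)]
pub enum SkipReason {
    ReviewHold,
    Frozen,
    Blocked,
    UnmetDependencies,
    MergeFailure,
    EmptyDiff,
}

/// Hypothetical flattened view of one story as auto-assign sees it.
#[derive(Debug, Clone)]
pub struct StoryView {
    pub stage: Stage,
    pub review_hold: bool,
    pub frozen: bool,
    pub blocked: bool,
    pub has_unmet_deps: bool,
    pub merge_failure: Option<String>,
    pub commits_ahead: u32,
}

/// Return the first reason auto-assign would skip this story, if any.
pub fn skip_reason(story: &StoryView) -> Option<SkipReason> {
    if story.review_hold {
        return Some(SkipReason::ReviewHold);
    }
    if story.frozen {
        return Some(SkipReason::Frozen);
    }
    if story.blocked {
        return Some(SkipReason::Blocked);
    }
    if story.has_unmet_deps {
        return Some(SkipReason::UnmetDependencies);
    }
    // The last two checks apply to the merge stage only.
    if story.stage == Stage::Merge {
        if story.merge_failure.is_some() {
            return Some(SkipReason::MergeFailure);
        }
        if story.commits_ahead == 0 {
            return Some(SkipReason::EmptyDiff);
        }
    }
    None
}
```

Returning the reason rather than a bare bool makes it easy to log why a story was passed over, which helps when triaging a pipeline that looks stuck.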
Front-matter fields that gate transitions
| Field | Type | Effect |
|---|---|---|
| `depends_on` | list of story IDs | Blocks backlog → current promotion until all deps are in `5_done` or `6_archived` |
| `agent` | string (e.g. `coder-opus`) | Pins the preferred agent for next assignment |
| `review_hold` | bool | Auto-assign skips this story; cleared by `reject_qa` or manual unblock |
| `blocked` | bool | Auto-assign skips this story; cleared by `unblock_story` |
| `frozen` | bool | Auto-assign skips this story; manual unfreeze required |
| `merge_failure` | string | Auto-assign skips merge-stage agents on this story |
| `retry_count` | int | Local-only (not in CRDT); incremented by orchestrator |
Spike-specific behavior
Per the typical lifecycle, a spike runs through current → qa like any work
item, then stops in qa awaiting human review (spikes skip merge). This
is implemented via review_hold: true being written automatically when a
spike's qa gates pass. The user accepts (move qa → done) or rejects (move
qa → current). Spikes do NOT auto-promote to merge.
Mergemaster lifecycle
The mergemaster agent only runs against stories in 4_merge/. It:
- Verifies the feature branch has commits (or the story is auto-blocked).
- Squashes the feature branch onto master with a deterministic commit message.
- Transitions the story to `5_done` with `merged_at` and `merge_commit`.
- On failure beyond the retry budget, writes `merge_failure` and blocks the story (auto-assign then skips it).
Agent terminated with committed work (bug 645 recovery path)
When a coder agent terminates abnormally (e.g. the Claude Code CLI's
output.write(&bytes).is_ok() PTY write assertion fires mid-session), the
server-owned completion path detects the crash and checks for surviving work:
- If the worktree is dirty but has commits ahead of master, reset the uncommitted files (`git checkout . && git clean -fd`) and run gates against the committed code.
- If gates still fail but `git log master..HEAD` shows commits and `cargo check` passes, advance to QA instead of entering the retry/block path. This is the "work survived" check, implemented in `server/src/agents/pool/pipeline/advance.rs`.
- Agents that die WITHOUT committed work (no commits ahead of master) still follow the existing retry → block path unchanged.
This prevents false-positive blocking of stories where the agent completed meaningful work before crashing.
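The recovery rules reduce to a small decision function. This is a sketch under assumptions: `CrashFacts`, `CrashOutcome`, and `decide_after_crash` are hypothetical names, the inputs are taken as already computed, and the case where both gates and `cargo check` fail is assumed to fall through to the retry/block path.

```rust
#[derive(Debug, PartialEq)]
pub enum CrashOutcome {
    /// Gates passed on the committed code; continue on the normal path.
    ProceedNormally,
    /// Gates failed but the "work survived" check passed.
    AdvanceToQa,
    /// No surviving work; existing retry -> block path.
    RetryOrBlock,
}

/// Facts gathered after the crash (hypothetical shape).
pub struct CrashFacts {
    /// Number of commits reported by `git log master..HEAD`.
    pub commits_ahead: u32,
    /// Result of running gates against the committed code after the reset.
    pub gates_pass: bool,
    /// Result of `cargo check` on the committed code.
    pub cargo_check_passes: bool,
}

pub fn decide_after_crash(facts: &CrashFacts) -> CrashOutcome {
    // Without committed work there is nothing to salvage.
    if facts.commits_ahead == 0 {
        return CrashOutcome::RetryOrBlock;
    }
    if facts.gates_pass {
        return CrashOutcome::ProceedNormally;
    }
    if facts.cargo_check_passes {
        return CrashOutcome::AdvanceToQa;
    }
    // Assumption: commits exist but nothing compiles, so retry as before.
    CrashOutcome::RetryOrBlock
}
```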
Watchdog (current production)
The "watchdog" at server/src/agents/pool/auto_assign/watchdog.rs runs every
30 ticks of the unified background loop. Its original job is to detect
orphaned agents whose tokio task is is_finished() but whose status is still
Running or Pending, and mark them Failed with an AgentEvent::Error
emission. The fix for bug 624 (now merged) extends it to also enforce
max_turns and max_budget_usd limits: an agent over either limit is killed
via the existing kill_child_for_key path and recorded with a typed
termination reason.
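The orphan check can be sketched as below. Everything here is illustrative: `AgentSnapshot`, `watchdog_due`, and `sweep_orphans` are hypothetical names, `task_finished` stands in for the tokio task's `is_finished()` result, the event emission is not modeled, and skipping tick 0 is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AgentStatus {
    Pending,
    Running,
    Completed,
    Failed,
}

/// Hypothetical snapshot of one agent as the watchdog sees it.
#[derive(Debug, Clone, PartialEq)]
pub struct AgentSnapshot {
    pub name: String,
    pub status: AgentStatus,
    /// Stand-in for the result of the task handle's is_finished().
    pub task_finished: bool,
}

pub const WATCHDOG_INTERVAL_TICKS: u64 = 30;

/// The watchdog runs on every 30th tick of the background loop.
pub fn watchdog_due(tick: u64) -> bool {
    tick > 0 && tick % WATCHDOG_INTERVAL_TICKS == 0
}

/// An agent is orphaned when its task has ended but its status still
/// claims it is active.
pub fn is_orphaned(agent: &AgentSnapshot) -> bool {
    agent.task_finished
        && matches!(agent.status, AgentStatus::Running | AgentStatus::Pending)
}

/// Mark every orphaned agent Failed and return their names.
pub fn sweep_orphans(agents: &mut [AgentSnapshot]) -> Vec<String> {
    let mut failed = Vec::new();
    for agent in agents.iter_mut() {
        if is_orphaned(agent) {
            agent.status = AgentStatus::Failed;
            failed.push(agent.name.clone());
        }
    }
    failed
}
```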
(b) Transitions and behaviors that don't yet exist (or are only partially wired)
Migration of consumers off legacy strings to typed Stage enum
The biggest outstanding piece. pipeline_state.rs is #![allow(dead_code)].
Every consumer (auto-assign, mergemaster, MCP tools, chat commands) still
works with stage strings ("2_current", "4_merge") and front-matter flags.
The projection layer (TryFrom<PipelineItemView> for PipelineItem and
friends) exists but isn't called outside tests. Migration is intentionally
incremental.
Opportunity: pick a leaf consumer (e.g. one MCP tool that reads the stage
string) and migrate it to read Stage instead. Pattern repeats outward until
all consumers go through the typed projection and the legacy stage-string
code can be deleted.
First-class verbs for archive reasons
ArchiveReason already has six variants, but only Completed (via
accept_story) has a dedicated MCP verb; Blocked is set through the
blocked: true flag and cleared by unblock_story. Today, Abandoned,
Superseded, MergeFailed, and ReviewHeld are reached either via
move_story target_stage=done (which doesn't carry the reason) or via
setting front-matter flags on the live story.
Missing transitions:
- `mcp__huskies__supersede_story story_id=X by=Y`: sets stage to `Archived { reason: Superseded { by: Y } }`. Today we use `move_story → done`, losing the `by` reference. (Came up 2026-04-25 with spike 621 → refactor 623.)
- `mcp__huskies__abandon_story story_id=X reason="..."`: sets `Archived { reason: Abandoned }`. Today done via `move_story → done` or `purge_story`.
- `mcp__huskies__hold_for_review story_id=X reason="..."`: explicitly puts a story in `Archived { reason: ReviewHeld }` rather than relying on the auto-set `review_hold` flag.
Type-conversion transitions
Spike → story conversion is a real workflow (we do it when a spike's scope
grows into an implementation story). Today, converting type via update_story front_matter={"type": "story"} does not bootstrap the
## Acceptance Criteria section, and add_criterion then permanently fails
on that story (see bug 625 filed 2026-04-25). The type field passed via
front_matter is also silently dropped — same silent-drop bug class as
acceptance_criteria. The state machine should treat type conversion as a
transition with side effects — at minimum, ensuring the AC section exists
when transitioning to a type that requires it, and the displayed type
reflects the new value (today the display chip is parsed from the immutable
story_id prefix; story 578 in backlog will fix this by switching to
numeric-only IDs).
Limit-based agent termination (turn / budget)
Pre-624 master: max_turns and max_budget_usd per-agent config were read
by the metric tool (tool_get_agent_remaining_turns_and_budget) but not
enforced anywhere. Observed coder-1 running 282/50 turns and $10.05/$5.00
USD on story 623 before a human stopped it (bug 624, now merged).
The bug 624 fix adds enforcement to the watchdog. The state-machine impact:
introduces a new agent-termination path distinct from "Failed (orphan)" —
something like Failed(LimitExceeded { kind: Turns | Budget }). The
ExecutionState enum may want a corresponding terminal variant so it can be
distinguished from generic Failed.
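One possible shape for the distinct termination reason is sketched below. The names `FailureReason`, `LimitKind`, `AgentUsage`, `AgentLimits`, and `limit_exceeded` are hypothetical, as is the choice to report turns before budget when both are over; the actual bug 624 fix may model this differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LimitKind {
    Turns,
    Budget,
}

/// Hypothetical failure reasons, keeping limit kills apart from orphans.
#[derive(Debug, Clone, PartialEq)]
pub enum FailureReason {
    Orphaned,
    LimitExceeded { kind: LimitKind },
}

pub struct AgentUsage {
    pub turns: u32,
    pub cost_usd: f64,
}

pub struct AgentLimits {
    pub max_turns: u32,
    pub max_budget_usd: f64,
}

/// Return a typed reason if the agent is over either limit.
/// "Over" is read as strictly greater than the configured maximum.
pub fn limit_exceeded(usage: &AgentUsage, limits: &AgentLimits) -> Option<FailureReason> {
    if usage.turns > limits.max_turns {
        return Some(FailureReason::LimitExceeded { kind: LimitKind::Turns });
    }
    if usage.cost_usd > limits.max_budget_usd {
        return Some(FailureReason::LimitExceeded { kind: LimitKind::Budget });
    }
    None
}
```

With the numbers observed on story 623 (282 turns against a limit of 50, $10.05 against $5.00), this check reports a turn-limit termination.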
Pinned-agent honoring under contention
When a story has agent: coder-opus pinned but coder-opus is busy, today's
auto-assign behavior is to leave the story unassigned — but if a human stops
the running attempt and the story sits in current/, auto-assign re-grabs
it with the default coder rather than waiting for the pinned agent.
Observed multiple times on 2026-04-25 with story 623: pinning coder-opus
did not prevent coder-1 (sonnet) from being auto-assigned during opus's
busy window.
Missing behavior: auto-assign should treat a pinned agent as a hard
filter ("only this agent can take this story"), not a preference. Today the
workaround is to also set depends_on on a phantom story, or move the story
back to backlog and let the dependency system gate it.
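The difference between a preference and a hard filter is small in code. The following sketch shows the hard-filter behavior; `pick_agent` and its inputs are hypothetical, and taking the first free agent for unpinned stories is an assumption about the default policy.

```rust
/// Choose an agent for a story from the currently free agents.
/// A pinned agent is a hard filter: if it is not free, nobody is chosen.
pub fn pick_agent<'a>(pinned: Option<&str>, free_agents: &[&'a str]) -> Option<&'a str> {
    match pinned {
        // Only the pinned agent may take the story; otherwise wait.
        Some(name) => free_agents.iter().copied().find(|agent| *agent == name),
        // Unpinned stories take the first free agent (assumed default).
        None => free_agents.first().copied(),
    }
}
```

With `coder-opus` pinned and only `coder-1` free, `pick_agent` returns `None`, so the story stays unassigned until `coder-opus` frees up instead of falling back to `coder-1`.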
Honoring the blocked flag (bug 559)
559_bug_mergemaster_ignores_blocked_flag_and_keeps_respawning_on_blocked_stories
is in backlog. Even though blocked: true is documented as a skip condition
in auto_assign_available_work, mergemaster's spawn path apparently checks
something different (or earlier) and respawns on blocked merge-stage stories.
The state machine should make Stage::Archived { reason: Blocked } a single
authoritative source so no consumer can incidentally bypass it.
Formal "ghost story recovery" transition
The move_story MCP tool description mentions "recovering a ghost story by
moving it back to current" as a valid use. Ghost stories are CRDT entries
with no corresponding filesystem stage directory (or the inverse). Today this
is an update_story + move_story ad-hoc dance. A first-class
recover_ghost_story verb that reconciles the CRDT and filesystem would
formalize the recovery path.
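Detecting ghosts is a set comparison between the two sources of truth. The sketch below assumes both sides can be listed as sets of story IDs; `GhostReport` and `find_ghosts` are hypothetical names, and the reconciliation step itself is left out.

```rust
use std::collections::BTreeSet;

/// Story IDs present on only one side (hypothetical report type).
#[derive(Debug, PartialEq)]
pub struct GhostReport {
    /// In the CRDT but missing from every filesystem stage directory.
    pub crdt_only: Vec<String>,
    /// On the filesystem but missing from the CRDT.
    pub filesystem_only: Vec<String>,
}

/// Compare the two ID sets and report both kinds of ghost.
/// BTreeSet keeps the output sorted and deterministic.
pub fn find_ghosts(crdt_ids: &BTreeSet<String>, fs_ids: &BTreeSet<String>) -> GhostReport {
    GhostReport {
        crdt_only: crdt_ids.difference(fs_ids).cloned().collect(),
        filesystem_only: fs_ids.difference(crdt_ids).cloned().collect(),
    }
}
```

A `recover_ghost_story` verb could start from a report like this and then decide, per ID, which side to treat as authoritative.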
Operator-level visibility / observability
There is no UI, CLI, or doc that shows "the state machine as a diagram." The typed enums are the closest thing to a canonical specification, but they aren't rendered anywhere a human can see at a glance: which stages exist, which transitions are valid, which events trigger them. A generated state diagram (graphviz or mermaid, dumped into this doc on each release) would help both new contributors and operators triaging stuck pipelines.
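A generated diagram could come from a plain transition table. The sketch below emits Mermaid `stateDiagram-v2` text; the `TRANSITIONS` table and `to_mermaid` function are hypothetical, and the table is hand-copied from section (a) rather than derived from the typed enums, which a real generator would want to do.

```rust
/// (from, to, label) triples, hand-copied from the transitions in
/// section (a). Labels are event or verb names; this table is an
/// illustration, not generated from pipeline_state.rs.
pub const TRANSITIONS: &[(&str, &str, &str)] = &[
    ("Backlog", "Coding", "DepsMet"),
    ("Coding", "Qa", "GatesStarted"),
    ("Qa", "Coding", "reject_qa"),
    ("Qa", "Merge", "GatesPassed"),
    ("Coding", "Merge", "QaSkipped"),
    ("Merge", "Done", "MergeSucceeded"),
    ("Done", "Archived", "Accepted"),
];

/// Render the transitions as a Mermaid state diagram.
pub fn to_mermaid(transitions: &[(&str, &str, &str)]) -> String {
    let mut out = String::from("stateDiagram-v2\n");
    for (from, to, label) in transitions {
        out.push_str(&format!("    {} --> {}: {}\n", from, to, label));
    }
    out
}
```

Calling `to_mermaid(TRANSITIONS)` yields text that can be pasted into a `mermaid` fence in this document on each release.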
Review-hold cleanup verb
review_hold: true is set automatically on spike completion. Clearing it is
done as a side effect of reject_qa (which also moves the story qa →
current) or by manually editing front matter. There is no clean "I have
reviewed this, release the hold" verb that doesn't also move the story.
Cross-node concurrency for execution state
ExecutionState is per-node (keyed by pubkey) so two nodes can't fight over
who's running an agent. But there is no formal transition that says "node A
hands the story to node B" if node A goes offline. The state machine's
distributed semantics for this case are not yet specified.
How to update this document
Whenever you discover a transition that doesn't yet exist, or a flag that behaves surprisingly, add it to section (b) with:
- A short description of the desired behavior
- Citation of the work item or incident that surfaced it
- Pointer to the place in `pipeline_state.rs` where it should be modeled (or note "needs a new variant" if it doesn't fit any existing enum yet)
When a transition from (b) ships, move it to (a) with the relevant file:line citations.