huskies/.huskies/specs/tech/STATE_MACHINE.md
dave 2097787e1f docs: add pipeline state machine reference (current + planned transitions)
Captures the dual representation we have today (legacy filesystem stage
strings + front-matter flags vs the typed Stage/ArchiveReason/ExecutionState
enums in pipeline_state.rs that are defined-but-not-wired) and itemises the
transitions and behaviours we have identified as missing or partially
implemented (first-class supersede/abandon/hold verbs, type-conversion side
effects, pinned-agent honouring under contention, blocked-flag enforcement
beyond auto-assign, ghost-story recovery, etc.).

Section (b) is intended as a living dumping ground — append new
transitions and incidents as they come up so that the state-machine
roadmap (spike 613 in backlog) has a ready-made input.
2026-04-25 13:33:57 +00:00

Pipeline State Machine

This document describes the huskies pipeline state machine in two halves: (a) the model that runs in production today, and (b) transitions, refinements, and corrections we have identified as needed but not yet implemented.

The codebase is in a deliberate transitional state: a typed CRDT state machine exists at server/src/pipeline_state.rs (introduced by story 520) with strict Rust enums for every stage, archive reason, execution state, and event. It is fully defined and tested but not yet called from non-test code (#![allow(dead_code)] at the top of the module). Consumers will migrate incrementally.

The model that is actually doing work is the older filesystem-stage-string + front-matter-flag model. Section (a) below documents both representations and the migration intent.


(a) The current state machine

Stages (production: filesystem string; future: typed enum)

| Filesystem (production) | Typed (future) | Meaning |
|---|---|---|
| work/1_backlog/ | Stage::Backlog | Story exists, waiting for dependencies or auto-assign promotion |
| work/2_current/ | Stage::Coding | Coder agent is running (or about to) |
| work/3_qa/ | Stage::Qa | Coder finished; gates / human review running |
| work/4_merge/ | Stage::Merge { feature_branch, commits_ahead: NonZeroU32 } | Gates passed, mergemaster ready to squash |
| work/5_done/ | Stage::Done { merged_at, merge_commit } | Mergemaster squashed to master |
| work/6_archived/ | Stage::Archived { archived_at, reason: ArchiveReason } | Out of the active flow |

5_done auto-sweeps to 6_archived after four hours. The typed Stage::Done variant always carries the merge SHA and timestamp; Stage::Merge's commits_ahead: NonZeroU32 makes "Merge with nothing to merge" structurally impossible (eliminates bug 519).
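
As a minimal sketch of the typed side of the table above: the variant and field names come from this document, but the field types (String for branches and commits, u64 seconds for timestamps), the trimmed ArchiveReason, and the helpers merge_stage and legacy_dir are assumptions made for illustration, not the contents of pipeline_state.rs.

```rust
use std::num::NonZeroU32;

/// Trimmed placeholder; the document lists six reasons.
#[derive(Debug, Clone, PartialEq)]
pub enum ArchiveReason {
    Completed,
    Abandoned,
}

/// Sketch of the typed stage. Field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merged_at: u64, merge_commit: String },
    Archived { archived_at: u64, reason: ArchiveReason },
}

/// Hypothetical constructor: returns None when there is nothing to merge,
/// because NonZeroU32::new(0) is None.
pub fn merge_stage(feature_branch: &str, commits_ahead: u32) -> Option<Stage> {
    let commits_ahead = NonZeroU32::new(commits_ahead)?;
    Some(Stage::Merge {
        feature_branch: feature_branch.to_string(),
        commits_ahead,
    })
}

/// Maps a typed stage back to the legacy directory name.
pub fn legacy_dir(stage: &Stage) -> &'static str {
    match stage {
        Stage::Backlog => "1_backlog",
        Stage::Coding => "2_current",
        Stage::Qa => "3_qa",
        Stage::Merge { .. } => "4_merge",
        Stage::Done { .. } => "5_done",
        Stage::Archived { .. } => "6_archived",
    }
}
```

With this shape, a caller cannot build Stage::Merge for a branch that is zero commits ahead; the check happens once, at construction.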

Archive reasons (pipeline_state.rs::ArchiveReason)

The typed model already enumerates the reasons a story can leave the active flow (subsumes the legacy blocked, merge_failure, and review_hold front-matter fields per story 436):

  • Completed — happy-path
  • Abandoned — user explicitly abandoned
  • Superseded { by: StoryId } — replaced by another story
  • Blocked { reason: String } — manually blocked, awaiting human resolution
  • MergeFailed { reason: String } — mergemaster gave up after retry budget
  • ReviewHeld { reason: String } — held for human review at user request
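
One way to picture how the typed reasons could subsume the legacy flags is a small projection function. This is a sketch only: LegacyFlags, reason_from_flags, the precedence order, and the placeholder reason strings are all hypothetical, and StoryId is modeled as a plain String.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ArchiveReason {
    Completed,
    Abandoned,
    Superseded { by: String }, // StoryId modeled as String (assumption)
    Blocked { reason: String },
    MergeFailed { reason: String },
    ReviewHeld { reason: String },
}

/// The three legacy front-matter fields named above.
#[derive(Debug, Default, Clone)]
pub struct LegacyFlags {
    pub blocked: bool,
    pub merge_failure: Option<String>,
    pub review_hold: bool,
}

/// Hypothetical projection from legacy flags to a typed reason.
/// The precedence (merge failure, then blocked, then review hold)
/// is an assumption, not documented behavior.
pub fn reason_from_flags(flags: &LegacyFlags) -> Option<ArchiveReason> {
    if let Some(msg) = &flags.merge_failure {
        return Some(ArchiveReason::MergeFailed { reason: msg.clone() });
    }
    if flags.blocked {
        return Some(ArchiveReason::Blocked {
            reason: "retry limit exceeded".to_string(),
        });
    }
    if flags.review_hold {
        return Some(ArchiveReason::ReviewHeld {
            reason: "awaiting human review".to_string(),
        });
    }
    None
}
```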

Per-node execution state (pipeline_state.rs::ExecutionState)

Stage is shared/CRDT-replicated. Execution state is per-node and lives under each node's pubkey in the CRDT, so there are no inter-author merge conflicts:

  • Idle
  • Pending { agent, since } — worktree being created, agent about to start
  • Running { agent, started_at, last_heartbeat }
  • RateLimited { agent, resume_at }
  • Completed { agent, exit_code, completed_at }
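
The "one writer per key" idea can be sketched as a map keyed by node pubkey. The variant names follow the list above; the field types, the ExecutionMap wrapper, and merge_remote are assumptions for illustration. Ordering of updates from the same remote node is not modeled here, which a real CRDT would have to handle.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum ExecutionState {
    Idle,
    Pending { agent: String, since: u64 },
    Running { agent: String, started_at: u64, last_heartbeat: u64 },
    RateLimited { agent: String, resume_at: u64 },
    Completed { agent: String, exit_code: i32, completed_at: u64 },
}

/// Execution state for one story, keyed by node pubkey (a string here).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ExecutionMap {
    by_node: BTreeMap<String, ExecutionState>,
}

impl ExecutionMap {
    /// A node only ever writes under its own key.
    pub fn set_local(&mut self, node_pubkey: &str, state: ExecutionState) {
        self.by_node.insert(node_pubkey.to_string(), state);
    }

    pub fn get(&self, node_pubkey: &str) -> Option<&ExecutionState> {
        self.by_node.get(node_pubkey)
    }

    /// Takes entries for other nodes from a remote replica and never
    /// overwrites the local node's own entry.
    pub fn merge_remote(&mut self, local_pubkey: &str, remote: &ExecutionMap) {
        for (node, state) in &remote.by_node {
            if node != local_pubkey {
                self.by_node.insert(node.clone(), state.clone());
            }
        }
    }
}
```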

Pipeline events (pipeline_state.rs::PipelineEvent)

The typed model defines every event that drives a Stage transition. Each variant carries the data needed to construct the destination state, so a transition function can never accidentally land in an underspecified state:

  • DepsMet — dependencies met; promote from backlog
  • GatesStarted — coder starting gates
  • GatesPassed { feature_branch, commits_ahead }
  • GatesFailed { reason }
  • QaSkipped { feature_branch, commits_ahead } — qa-mode = "server"; skip QA, go to merge
  • MergeSucceeded { merge_commit }
  • MergeFailedFinal { reason }
  • Accepted — Done → Archived(Completed)
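
A transition function over these events might look roughly like the sketch below. The event and stage names are taken from this document; the field types, the `now` parameter, the String error, and in particular the destination of GatesFailed (assumed here to be Qa → Coding) are assumptions.

```rust
use std::num::NonZeroU32;

#[derive(Debug, Clone, PartialEq)]
pub enum ArchiveReason {
    Completed,
    MergeFailed { reason: String },
}

#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merged_at: u64, merge_commit: String },
    Archived { archived_at: u64, reason: ArchiveReason },
}

#[derive(Debug, Clone, PartialEq)]
pub enum PipelineEvent {
    DepsMet,
    GatesStarted,
    GatesPassed { feature_branch: String, commits_ahead: NonZeroU32 },
    GatesFailed { reason: String },
    QaSkipped { feature_branch: String, commits_ahead: NonZeroU32 },
    MergeSucceeded { merge_commit: String },
    MergeFailedFinal { reason: String },
    Accepted,
}

/// Each arm builds the destination state from the event payload, so an
/// underspecified Merge or Done cannot be produced.
pub fn transition(stage: Stage, event: PipelineEvent, now: u64) -> Result<Stage, String> {
    use PipelineEvent as E;
    match (stage, event) {
        (Stage::Backlog, E::DepsMet) => Ok(Stage::Coding),
        (Stage::Coding, E::GatesStarted) => Ok(Stage::Qa),
        (Stage::Qa, E::GatesPassed { feature_branch, commits_ahead }) => {
            Ok(Stage::Merge { feature_branch, commits_ahead })
        }
        // Assumed destination: a failed gate sends the story back to coding.
        (Stage::Qa, E::GatesFailed { .. }) => Ok(Stage::Coding),
        (Stage::Coding, E::QaSkipped { feature_branch, commits_ahead }) => {
            Ok(Stage::Merge { feature_branch, commits_ahead })
        }
        (Stage::Merge { .. }, E::MergeSucceeded { merge_commit }) => {
            Ok(Stage::Done { merged_at: now, merge_commit })
        }
        (Stage::Merge { .. }, E::MergeFailedFinal { reason }) => Ok(Stage::Archived {
            archived_at: now,
            reason: ArchiveReason::MergeFailed { reason },
        }),
        (Stage::Done { .. }, E::Accepted) => Ok(Stage::Archived {
            archived_at: now,
            reason: ArchiveReason::Completed,
        }),
        (stage, event) => Err(format!("invalid transition: {:?} on {:?}", stage, event)),
    }
}
```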

Transitions (current production = MCP verb shape)

Backlog → Coding (a.k.a. backlog → 2_current)

  • Auto path: AgentPool::auto_assign_available_work calls promote_ready_backlog_stories. A backlog story is promoted iff (a) it has an explicit non-empty depends_on AND (b) every dep is in 5_done or 6_archived. Stories with no depends_on are NOT auto-promoted — they wait for human scheduling.
    • Implemented in server/src/agents/pool/auto_assign/auto_assign.rs::promote_ready_backlog_stories.
  • Manual path: mcp__huskies__move_story story_id=X target_stage=current, or mcp__huskies__start_agent (which moves the story to current as a side-effect of starting an agent).
  • Archived-dep warning: if a dep was satisfied via 6_archived rather than 5_done (e.g. abandoned/superseded), the auto-assigner logs a prominent warning so the user can see the promotion was triggered by an archived dep.
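
The auto-path rule above can be sketched as a pure function. This is not the code in promote_ready_backlog_stories; the types LegacyStage and Promotion, the function name check_promotion, and the treatment of an unknown dependency ID as unmet are assumptions.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LegacyStage {
    Backlog,
    Current,
    Qa,
    Merge,
    Done,
    Archived,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Promotion {
    /// No depends_on, or at least one dependency is unmet.
    Wait,
    /// All deps satisfied; `archived_deps` lists those satisfied via
    /// 6_archived, which should trigger the warning.
    Promote { archived_deps: Vec<String> },
}

pub fn check_promotion(
    depends_on: &[String],
    stages: &HashMap<String, LegacyStage>,
) -> Promotion {
    // Stories without explicit dependencies wait for human scheduling.
    if depends_on.is_empty() {
        return Promotion::Wait;
    }
    let mut archived_deps = Vec::new();
    for dep in depends_on {
        match stages.get(dep) {
            Some(LegacyStage::Done) => {}
            Some(LegacyStage::Archived) => archived_deps.push(dep.clone()),
            // Unknown or still-active dependency: treated as unmet (assumption).
            _ => return Promotion::Wait,
        }
    }
    Promotion::Promote { archived_deps }
}
```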

Coding → Qa (current → 3_qa)

  • Triggered when the coder agent finishes (gates start running).
  • mcp__huskies__request_qa is the manual verb.

Qa → Coding (qa → current — rejection path)

  • mcp__huskies__reject_qa story_id=X notes="..." moves qa → current, clears review_hold, and writes the rejection notes (agents/lifecycle.rs:210).
  • Used when a qa agent fails or a human reviewer rejects the work.

Qa → Merge (qa → 4_merge)

  • Triggered when QA gates pass. mcp__huskies__move_story_to_merge is the dedicated verb.
  • For server-mode QA: typed-side PipelineEvent::QaSkipped allows going from Coding → Merge directly without entering Qa.

Merge → Done (merge → 5_done)

  • Mergemaster picks up a story in 4_merge/, squashes the feature branch onto master, then transitions to 5_done.
  • mcp__huskies__move_story_to_merge only queues the story; mergemaster performs the squash and the transition.

Done → Archived(Completed) (5_done → 6_archived)

  • Auto-sweep after four hours, OR
  • mcp__huskies__accept_story (immediate manual archive).

Any-stage → Archived(other reasons)

  • Abandoned / Superseded: today done by mcp__huskies__move_story target_stage=done (no first-class verbs for these reasons; see (b) below).
  • Blocked: blocked: true flag in front matter is set on retry-limit exceedance. mcp__huskies__unblock_story clears the flag and resets retry_count.
  • MergeFailed: written to front matter when mergemaster fails; auto-assign skips these stories (has_merge_failure check).
  • ReviewHeld: review_hold: true flag is set automatically on spike completion; auto-assign skips these stories until the flag is cleared.

Tombstone / purge

  • mcp__huskies__delete_story and mcp__huskies__purge_story permanently remove. Purge writes a CRDT tombstone.

Auto-assign skip conditions (current production)

auto_assign_available_work walks 2_current/, 3_qa/, 4_merge/ in order and attempts to dispatch a free agent to each unassigned story. It skips any story that:

  1. Has review_hold: true in front matter (spikes after QA, manual hold).
  2. Is frozen (is_story_frozen — pipeline advancement suspended for this story).
  3. Has blocked: true (retry limit exceeded; cleared via unblock_story).
  4. Has unmet depends_on dependencies.
  5. (Merge stage only) Has a recorded merge failure (has_merge_failure).
  6. (Merge stage only) Has an empty diff on the feature branch — auto-writes merge_failure and blocks immediately rather than wasting a mergemaster turn.
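
The six skip conditions can be read as one ordered check. The sketch below assumes a flattened StoryView with hypothetical field names and checks the conditions in the order listed; it does not model the side effect in condition 6 (writing merge_failure and blocking the story).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActiveStage {
    Current,
    Qa,
    Merge,
}

/// Hypothetical flattened view of what auto-assign inspects.
#[derive(Debug, Default, Clone)]
pub struct StoryView {
    pub review_hold: bool,
    pub frozen: bool,
    pub blocked: bool,
    pub has_unmet_deps: bool,
    pub merge_failure: Option<String>,
    pub feature_branch_diff_empty: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SkipReason {
    ReviewHold,
    Frozen,
    Blocked,
    UnmetDeps,
    MergeFailure,
    EmptyDiff,
}

/// Returns the first matching skip condition, or None if the story
/// is eligible for dispatch.
pub fn skip_reason(stage: ActiveStage, story: &StoryView) -> Option<SkipReason> {
    if story.review_hold {
        return Some(SkipReason::ReviewHold);
    }
    if story.frozen {
        return Some(SkipReason::Frozen);
    }
    if story.blocked {
        return Some(SkipReason::Blocked);
    }
    if story.has_unmet_deps {
        return Some(SkipReason::UnmetDeps);
    }
    // The last two conditions apply to the merge stage only.
    if stage == ActiveStage::Merge {
        if story.merge_failure.is_some() {
            return Some(SkipReason::MergeFailure);
        }
        if story.feature_branch_diff_empty {
            return Some(SkipReason::EmptyDiff);
        }
    }
    None
}
```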

Front-matter fields that gate transitions

| Field | Type | Effect |
|---|---|---|
| depends_on | list of story IDs | Blocks backlog → current promotion until all deps are in 5_done or 6_archived |
| agent | string (e.g. coder-opus) | Pins the preferred agent for next assignment |
| review_hold | bool | Auto-assign skips this story; cleared by reject_qa or by manually editing front matter |
| blocked | bool | Auto-assign skips this story; cleared by unblock_story |
| frozen | bool | Auto-assign skips this story; manual unfreeze required |
| merge_failure | string | Auto-assign skips merge-stage agents on this story |
| retry_count | int | Local-only (not in CRDT); incremented by orchestrator |

Spike-specific behavior

A spike runs through current → qa like any other work item, then stops in qa awaiting human review; it never auto-promotes to merge. This is implemented by writing review_hold: true automatically when a spike's qa gates pass. The user then accepts (move qa → done) or rejects (move qa → current).

Mergemaster lifecycle

The mergemaster agent only runs against stories in 4_merge/. It:

  1. Verifies the feature branch has commits; if it has none, the story is auto-blocked.
  2. Squashes the feature branch onto master with a deterministic commit message.
  3. Transitions the story to 5_done with merged_at and merge_commit.
  4. On failure beyond the retry budget, writes merge_failure and blocks the story (auto-assign then skips it).

Watchdog (current production)

The "watchdog" at server/src/agents/pool/auto_assign/watchdog.rs runs every 30 ticks of the unified background loop. Its original job is orphan detection: it finds agents whose tokio task is is_finished() but whose status is still Running or Pending, and marks them Failed with an AgentEvent::Error emission. The fix for bug 624 (now merged) extends it to also enforce max_turns and max_budget_usd limits: an agent over either limit is killed via the existing kill_child_for_key path and recorded with a typed termination reason.
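
The orphan-detection pass can be sketched without tokio by representing "task is finished" as a boolean. AgentRecord, sweep_orphans, and should_run_watchdog are hypothetical names, and the exact phase of the 30-tick cadence is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentStatus {
    Pending,
    Running,
    Completed,
    Failed,
}

/// Hypothetical record; `task_finished` stands in for the tokio
/// JoinHandle's is_finished() result.
#[derive(Debug, Clone)]
pub struct AgentRecord {
    pub key: String,
    pub status: AgentStatus,
    pub task_finished: bool,
}

/// The watchdog runs every 30 ticks (phase is an assumption).
pub fn should_run_watchdog(tick: u64) -> bool {
    tick % 30 == 0
}

/// Marks orphaned agents as Failed and returns their keys so the caller
/// can emit an error event for each.
pub fn sweep_orphans(agents: &mut [AgentRecord]) -> Vec<String> {
    let mut orphaned = Vec::new();
    for agent in agents.iter_mut() {
        let looks_live = matches!(agent.status, AgentStatus::Pending | AgentStatus::Running);
        if looks_live && agent.task_finished {
            agent.status = AgentStatus::Failed;
            orphaned.push(agent.key.clone());
        }
    }
    orphaned
}
```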


(b) Transitions and behaviors that don't yet exist (or are only partially wired)

Migration of consumers off legacy strings to typed Stage enum

The biggest outstanding piece. pipeline_state.rs is still marked #![allow(dead_code)]: every consumer (auto-assign, mergemaster, MCP tools, chat commands) still works with stage strings ("2_current", "4_merge") and front-matter flags. The projection layer (TryFrom<PipelineItemView> for PipelineItem and friends) exists but isn't called outside tests. Migration is intentionally incremental.

Opportunity: pick a leaf consumer (e.g. one MCP tool that reads the stage string) and migrate it to read Stage instead. Pattern repeats outward until all consumers go through the typed projection and the legacy stage-string code can be deleted.
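
For a leaf consumer, the first step might be as small as parsing the legacy string once at the boundary. The sketch below uses a payload-free StageKind because a directory name alone cannot supply the data that Stage::Merge or Stage::Done carry; StageKind and is_active are hypothetical names, not part of the existing projection layer.

```rust
use std::str::FromStr;

/// Payload-free view of the stage, recoverable from the directory name alone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StageKind {
    Backlog,
    Coding,
    Qa,
    Merge,
    Done,
    Archived,
}

impl FromStr for StageKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "1_backlog" => Ok(StageKind::Backlog),
            "2_current" => Ok(StageKind::Coding),
            "3_qa" => Ok(StageKind::Qa),
            "4_merge" => Ok(StageKind::Merge),
            "5_done" => Ok(StageKind::Done),
            "6_archived" => Ok(StageKind::Archived),
            other => Err(format!("unknown stage string: {other}")),
        }
    }
}

/// Example consumer logic written against the enum instead of strings.
pub fn is_active(kind: StageKind) -> bool {
    matches!(kind, StageKind::Coding | StageKind::Qa | StageKind::Merge)
}
```

Once a consumer matches on an enum, adding or renaming a stage becomes a compile error at every call site rather than a silent string mismatch.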

First-class verbs for archive reasons

ArchiveReason already has six variants, but only Completed (via accept_story) and Blocked (via the blocked: true flag) have a dedicated mechanism. Today, Abandoned, Superseded, MergeFailed, and ReviewHeld are reached either via move_story target_stage=done (which doesn't carry the reason) or by setting front-matter flags on the live story.

Missing transitions:

  • mcp__huskies__supersede_story story_id=X by=Y — sets stage to Archived { reason: Superseded { by: Y } }. Today we use move_story → done, losing the by reference. (Came up 2026-04-25 with spike 621 → refactor 623.)
  • mcp__huskies__abandon_story story_id=X reason="..." — sets Archived { reason: Abandoned }. Today done via move_story → done or purge_story.
  • mcp__huskies__hold_for_review story_id=X reason="..." — explicitly puts a story in Archived { reason: ReviewHeld } rather than relying on the auto-set review_hold flag.
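
The core of the proposed supersede and abandon verbs could be sketched as below. None of this exists yet: the function names, the collapsed Stage::Active variant, the String story IDs, and the validation rules (no self-supersede, no re-archiving) are assumptions.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ArchiveReason {
    Abandoned,
    Superseded { by: String }, // StoryId modeled as String (assumption)
}

/// Active stages are collapsed into one variant for brevity.
#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Active,
    Archived { archived_at: u64, reason: ArchiveReason },
}

fn ensure_not_archived(stage: &Stage) -> Result<(), String> {
    if matches!(stage, Stage::Archived { .. }) {
        return Err("story is already archived".to_string());
    }
    Ok(())
}

/// Hypothetical core of supersede_story: keeps the `by` reference that
/// move_story → done loses today.
pub fn supersede(stage: &Stage, story_id: &str, by: &str, now: u64) -> Result<Stage, String> {
    if story_id == by {
        return Err("a story cannot supersede itself".to_string());
    }
    ensure_not_archived(stage)?;
    Ok(Stage::Archived {
        archived_at: now,
        reason: ArchiveReason::Superseded { by: by.to_string() },
    })
}

/// Hypothetical core of abandon_story.
pub fn abandon(stage: &Stage, now: u64) -> Result<Stage, String> {
    ensure_not_archived(stage)?;
    Ok(Stage::Archived { archived_at: now, reason: ArchiveReason::Abandoned })
}
```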

Type-conversion transitions

Spike → story conversion is a real workflow (we do it when a spike's scope grows into an implementation story). Today, converting type via update_story front_matter={"type": "story"} does not bootstrap the ## Acceptance Criteria section, and add_criterion then permanently fails on that story (see bug 625, filed 2026-04-25). The type field passed via front_matter is also silently dropped, the same silent-drop bug class as acceptance_criteria.

The state machine should treat type conversion as a transition with side effects. At minimum, it should ensure that the AC section exists when transitioning to a type that requires it, and that the displayed type reflects the new value. Today the display chip is parsed from the immutable story_id prefix; story 578 in backlog will fix this by switching to numeric-only IDs.
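
The AC-section side effect could be sketched as an idempotent helper that runs as part of the conversion. The heading text comes from this document; the function names, the exact-match heading check, and the assumption that only the "story" type requires the section are illustrative.

```rust
const AC_HEADING: &str = "## Acceptance Criteria";

/// Assumption: only the "story" type requires an AC section.
pub fn requires_acceptance_criteria(item_type: &str) -> bool {
    item_type == "story"
}

/// Appends an empty AC section if the body has none. Calling it again
/// on its own output returns the same text.
pub fn ensure_ac_section(body: &str) -> String {
    if body.lines().any(|line| line.trim_end() == AC_HEADING) {
        return body.to_string();
    }
    let mut out = body.trim_end().to_string();
    if !out.is_empty() {
        out.push_str("\n\n");
    }
    out.push_str(AC_HEADING);
    out.push('\n');
    out
}

/// Hypothetical conversion step: applies the side effect only when the
/// target type needs it.
pub fn convert_body(body: &str, new_type: &str) -> String {
    if requires_acceptance_criteria(new_type) {
        ensure_ac_section(body)
    } else {
        body.to_string()
    }
}
```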

Limit-based agent termination (turn / budget)

Before the fix for bug 624 landed on master, the per-agent max_turns and max_budget_usd settings were read by the metric tool (tool_get_agent_remaining_turns_and_budget) but not enforced anywhere. We observed coder-1 reach 282 turns against a limit of 50 and $10.05 against a $5.00 budget on story 623 before a human stopped it (bug 624, now merged).

The bug 624 fix adds enforcement to the watchdog. The state-machine impact: introduces a new agent-termination path distinct from "Failed (orphan)" — something like Failed(LimitExceeded { kind: Turns | Budget }). The ExecutionState enum may want a corresponding terminal variant so it can be distinguished from generic Failed.
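
A limit check of the kind described might look like the sketch below. LimitKind mirrors the Turns | Budget split suggested above; AgentLimits, AgentUsage, the strict greater-than comparison, and checking turns before budget are assumptions rather than the merged implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LimitKind {
    Turns,
    Budget,
}

/// Per-agent limits; None means "no limit configured".
#[derive(Debug, Clone, Copy)]
pub struct AgentLimits {
    pub max_turns: Option<u32>,
    pub max_budget_usd: Option<f64>,
}

#[derive(Debug, Clone, Copy)]
pub struct AgentUsage {
    pub turns: u32,
    pub spent_usd: f64,
}

/// Returns which limit was exceeded, if any. Turns are checked first.
pub fn exceeded_limit(usage: &AgentUsage, limits: &AgentLimits) -> Option<LimitKind> {
    if let Some(max_turns) = limits.max_turns {
        if usage.turns > max_turns {
            return Some(LimitKind::Turns);
        }
    }
    if let Some(max_budget) = limits.max_budget_usd {
        if usage.spent_usd > max_budget {
            return Some(LimitKind::Budget);
        }
    }
    None
}
```

Fed the observed figures from story 623 (282 turns against 50), this check returns LimitKind::Turns.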

Pinned-agent honoring under contention

When a story has agent: coder-opus pinned but coder-opus is busy, today's auto-assign behavior is to leave the story unassigned — but if a human stops the running attempt and the story sits in current/, auto-assign re-grabs it with the default coder rather than waiting for the pinned agent. Observed multiple times on 2026-04-25 with story 623: pinning coder-opus did not prevent coder-1 (sonnet) from being auto-assigned during opus's busy window.

Missing behavior: auto-assign should treat a pinned agent as a hard filter ("only this agent can take this story"), not a preference. Today the workaround is to also set depends_on on a phantom story, or move the story back to backlog and let the dependency system gate it.
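
The difference between "preference" and "hard filter" is small in code. A minimal sketch, assuming agents are identified by name and that an unpinned story takes the first free agent (pick_agent is a hypothetical function, not the current auto-assign code):

```rust
/// Selects an agent for a story. A pinned agent is a hard filter: if it
/// is not in the free list, the story stays unassigned.
pub fn pick_agent<'a>(pinned: Option<&str>, free_agents: &[&'a str]) -> Option<&'a str> {
    match pinned {
        // Only the pinned agent may take the story.
        Some(name) => free_agents.iter().copied().find(|agent| *agent == name),
        // No pin: take the first free agent (assumed default policy).
        None => free_agents.first().copied(),
    }
}
```

Under this rule, the story 623 scenario (coder-opus pinned and busy, coder-1 free) yields no assignment instead of falling back to coder-1.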

Honoring the blocked flag (bug 559)

559_bug_mergemaster_ignores_blocked_flag_and_keeps_respawning_on_blocked_stories is in backlog. Even though blocked: true is documented as a skip condition in auto_assign_available_work, mergemaster's spawn path apparently checks something different (or earlier) and respawns on blocked merge-stage stories. The state machine should make Stage::Archived { reason: Blocked } a single authoritative source so no consumer can incidentally bypass it.

Formal "ghost story recovery" transition

The move_story MCP tool description mentions "recovering a ghost story by moving it back to current" as a valid use. Ghost stories are CRDT entries with no corresponding filesystem stage directory (or the inverse). Today this is an update_story + move_story ad-hoc dance. A first-class recover_ghost_story verb that reconciles the CRDT and filesystem would formalize the recovery path.
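
Detecting ghosts is a set difference in both directions. The sketch below assumes story IDs can be collected as strings from each side; GhostReport and find_ghosts are hypothetical, and the reconciliation step itself is left out.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq)]
pub struct GhostReport {
    /// In the CRDT but with no filesystem stage directory entry.
    pub crdt_only: Vec<String>,
    /// On the filesystem but missing from the CRDT (the inverse case).
    pub fs_only: Vec<String>,
}

/// Compares the two sources of truth and reports IDs present in only one.
pub fn find_ghosts(crdt_ids: &BTreeSet<String>, fs_ids: &BTreeSet<String>) -> GhostReport {
    GhostReport {
        crdt_only: crdt_ids.difference(fs_ids).cloned().collect(),
        fs_only: fs_ids.difference(crdt_ids).cloned().collect(),
    }
}
```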

Operator-level visibility / observability

There is no UI, CLI, or doc that shows "the state machine as a diagram." The typed enums are the closest thing to a canonical specification, but they aren't rendered anywhere a human can see at a glance: which stages exist, which transitions are valid, which events trigger them. A generated state diagram (graphviz or mermaid, dumped into this doc on each release) would help both new contributors and operators triaging stuck pipelines.
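
A generated diagram could start from a flat transition table. The sketch below emits Mermaid stateDiagram-v2 text; the table is hand-written from the happy-path transitions in section (a) and is an assumption, not something derived from pipeline_state.rs. GatesFailed is omitted because its destination is not stated in this document.

```rust
/// (from, to, event) triples, hand-copied from section (a).
pub const TRANSITIONS: &[(&str, &str, &str)] = &[
    ("Backlog", "Coding", "DepsMet"),
    ("Coding", "Qa", "GatesStarted"),
    ("Qa", "Merge", "GatesPassed"),
    ("Coding", "Merge", "QaSkipped"),
    ("Merge", "Done", "MergeSucceeded"),
    ("Merge", "Archived", "MergeFailedFinal"),
    ("Done", "Archived", "Accepted"),
];

/// Renders the edges as a Mermaid state diagram, one edge per line.
pub fn to_mermaid(edges: &[(&str, &str, &str)]) -> String {
    let mut out = String::from("stateDiagram-v2\n");
    for (from, to, event) in edges {
        out.push_str(&format!("    {from} --> {to}: {event}\n"));
    }
    out
}
```

In a real generator the table would ideally be produced from the transition function itself, so the diagram cannot drift from the code.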

Review-hold cleanup verb

review_hold: true is set automatically on spike completion. Clearing it is done as a side effect of reject_qa (which also moves the story qa → current) or by manually editing front matter. There is no clean "I have reviewed this, release the hold" verb that doesn't also move the story.

Cross-node concurrency for execution state

ExecutionState is per-node (keyed by pubkey) so two nodes can't fight over who's running an agent. But there is no formal transition that says "node A hands the story to node B" if node A goes offline. The state machine's distributed semantics for this case are not yet specified.


How to update this document

Whenever you discover a transition that doesn't yet exist, or a flag that behaves surprisingly, add it to section (b) with:

  • A short description of the desired behavior
  • Citation of the work item or incident that surfaced it
  • Pointer to the place in pipeline_state.rs where it should be modeled (or note "needs a new variant" if it doesn't fit any existing enum yet)

When a transition from (b) ships, move it to (a) with the relevant file:line citations.