# Pipeline State Machine
This document describes the huskies pipeline state machine in two halves:
**(a)** the model that runs in production today, and **(b)** transitions, refinements,
and corrections we have identified as needed but not yet implemented.

The codebase is in a deliberate transitional state: a typed CRDT state machine
exists at `server/src/pipeline_state.rs` (introduced by story 520) with strict Rust
enums for every stage, archive reason, execution state, and event. It is fully
defined and tested but **not yet called from non-test code** (`#![allow(dead_code)]`
at the top of the module). Consumers will migrate incrementally.

The model that is actually doing work is the older **filesystem-stage-string +
front-matter-flag** model. Section (a) below documents both representations and
the migration intent.

---
## (a) The current state machine
### Stages (production: filesystem string; future: typed enum)
| Filesystem (production) | Typed (future) | Meaning |
|---|---|---|
| `work/1_backlog/` | `Stage::Backlog` | Story exists, waiting for dependencies or auto-assign promotion |
| `work/2_current/` | `Stage::Coding` | Coder agent is running (or about to) |
| `work/3_qa/` | `Stage::Qa` | Coder finished; gates / human review running |
| `work/4_merge/` | `Stage::Merge { feature_branch, commits_ahead: NonZeroU32 }` | Gates passed, mergemaster ready to squash |
| `work/5_done/` | `Stage::Done { merged_at, merge_commit }` | Mergemaster squashed to master |
| `work/6_archived/` | `Stage::Archived { archived_at, reason: ArchiveReason }` | Out of the active flow |

`5_done` auto-sweeps to `6_archived` after four hours. The typed `Stage::Done`
variant always carries the merge SHA and timestamp; `Stage::Merge`'s
`commits_ahead: NonZeroU32` makes "Merge with nothing to merge" structurally
impossible (eliminates bug 519).
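As a minimal sketch of why the `NonZeroU32` field rules out an empty merge, the enum below mirrors the table above with simplified payloads. The field types (`u64` timestamps, `String` SHAs, a `String` archive reason) and the `merge_stage` helper are assumptions for illustration, not the definitions in `pipeline_state.rs`.

```rust
use std::num::NonZeroU32;

/// Illustrative stand-in for the typed stage enum; payload types are simplified
/// assumptions (the real `Archived` variant carries an `ArchiveReason`).
#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merged_at: u64, merge_commit: String },
    Archived { archived_at: u64, reason: String },
}

/// Hypothetical helper: build a Merge stage from a raw commit count.
/// Returns None when there is nothing to merge, because
/// `NonZeroU32::new(0)` is None.
pub fn merge_stage(feature_branch: &str, commits_ahead: u32) -> Option<Stage> {
    let commits_ahead = NonZeroU32::new(commits_ahead)?;
    Some(Stage::Merge {
        feature_branch: feature_branch.to_string(),
        commits_ahead,
    })
}
```

A caller holding a raw count of zero cannot construct `Stage::Merge` at all, so the check happens once at the boundary rather than in every consumer.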
### Archive reasons (`pipeline_state.rs::ArchiveReason`)
The typed model already enumerates the reasons a story can leave the active flow
(subsumes the legacy `blocked`, `merge_failure`, and `review_hold` front-matter
fields per story 436):
- `Completed` — happy-path
- `Abandoned` — user explicitly abandoned
- `Superseded { by: StoryId }` — replaced by another story
- `Blocked { reason: String }` — manually blocked, awaiting human resolution
- `MergeFailed { reason: String }` — mergemaster gave up after retry budget
- `ReviewHeld { reason: String }` — held for human review at user request
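One way to picture how the legacy flags collapse into a single typed reason is the projection sketched below. The `LegacyFlags` struct, the `reason_from_flags` function, the placeholder reason strings, and the precedence order (merge failure, then blocked, then review hold) are all assumptions; `StoryId` is simplified to `String`.

```rust
/// Hypothetical view of the three legacy front-matter fields.
#[derive(Debug, Default)]
pub struct LegacyFlags {
    pub blocked: bool,
    pub merge_failure: Option<String>,
    pub review_hold: bool,
}

/// Mirrors the six variants listed above; `StoryId` is simplified to String.
#[derive(Debug, Clone, PartialEq)]
pub enum ArchiveReason {
    Completed,
    Abandoned,
    Superseded { by: String },
    Blocked { reason: String },
    MergeFailed { reason: String },
    ReviewHeld { reason: String },
}

/// Projects legacy flags onto at most one typed reason.
/// The precedence order here is an assumption, not documented behavior.
pub fn reason_from_flags(flags: &LegacyFlags) -> Option<ArchiveReason> {
    if let Some(msg) = &flags.merge_failure {
        return Some(ArchiveReason::MergeFailed { reason: msg.clone() });
    }
    if flags.blocked {
        return Some(ArchiveReason::Blocked {
            reason: "retry limit exceeded".to_string(),
        });
    }
    if flags.review_hold {
        return Some(ArchiveReason::ReviewHeld {
            reason: "held for review".to_string(),
        });
    }
    None
}
```

Because the result is a single enum value, a story cannot be simultaneously "blocked" and "merge failed" in the typed model, which is a combination the three independent flags allow.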
### Per-node execution state (`pipeline_state.rs::ExecutionState`)
Stage is shared/CRDT-replicated. Execution state is per-node and lives under
each node's pubkey in the CRDT, so there are no inter-author merge conflicts:
- `Idle`
- `Pending { agent, since }` — worktree being created, agent about to start
- `Running { agent, started_at, last_heartbeat }`
- `RateLimited { agent, resume_at }`
- `Completed { agent, exit_code, completed_at }`
### Pipeline events (`pipeline_state.rs::PipelineEvent`)
The typed model defines every event that drives a Stage transition. Each variant
carries the data needed to construct the destination state, so a transition
function can never accidentally land in an underspecified state:
- `DepsMet` — dependencies met; promote from backlog
- `GatesStarted` — coder starting gates
- `GatesPassed { feature_branch, commits_ahead }`
- `GatesFailed { reason }`
- `QaSkipped { feature_branch, commits_ahead }` — qa-mode = "server"; skip QA, go to merge
- `MergeSucceeded { merge_commit }`
- `MergeFailedFinal { reason }`
- `Accepted` — Done → Archived(Completed)
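The event list above suggests a transition function along the following lines. This is a sketch under stated assumptions: the payload types are simplified, the archive reason is a plain `String`, and the exact (stage, event) pairs are inferred from this document (for example, `GatesFailed` returning a story to `Coding`) rather than copied from `pipeline_state.rs`.

```rust
use std::num::NonZeroU32;

#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merge_commit: String },
    Archived { reason: String },
}

#[derive(Debug, Clone, PartialEq)]
pub enum PipelineEvent {
    DepsMet,
    GatesStarted,
    GatesPassed { feature_branch: String, commits_ahead: NonZeroU32 },
    GatesFailed { reason: String },
    QaSkipped { feature_branch: String, commits_ahead: NonZeroU32 },
    MergeSucceeded { merge_commit: String },
    MergeFailedFinal { reason: String },
    Accepted,
}

/// Returns the next stage, or hands back the rejected (stage, event) pair.
/// The pairing of stages and events is an inferred, illustrative table.
pub fn transition(
    stage: Stage,
    event: PipelineEvent,
) -> Result<Stage, (Stage, PipelineEvent)> {
    use PipelineEvent as E;
    match (stage, event) {
        (Stage::Backlog, E::DepsMet) => Ok(Stage::Coding),
        (Stage::Coding, E::GatesStarted) => Ok(Stage::Qa),
        // Both routes into Merge must supply the branch and a non-zero count.
        (Stage::Coding, E::QaSkipped { feature_branch, commits_ahead })
        | (Stage::Qa, E::GatesPassed { feature_branch, commits_ahead }) => {
            Ok(Stage::Merge { feature_branch, commits_ahead })
        }
        (Stage::Qa, E::GatesFailed { .. }) => Ok(Stage::Coding),
        (Stage::Merge { .. }, E::MergeSucceeded { merge_commit }) => {
            Ok(Stage::Done { merge_commit })
        }
        (Stage::Merge { .. }, E::MergeFailedFinal { reason }) => {
            Ok(Stage::Archived { reason: format!("MergeFailed: {reason}") })
        }
        (Stage::Done { .. }, E::Accepted) => {
            Ok(Stage::Archived { reason: "Completed".to_string() })
        }
        (stage, event) => Err((stage, event)),
    }
}
```

Since the destination variants have no default values, the only way to reach `Merge` or `Done` is through an event that carries the required data, which is the property the paragraph above describes.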
### Transitions (current production = MCP verb shape)
#### Backlog → Coding (a.k.a. backlog → 2_current)
- **Auto path**: `AgentPool::auto_assign_available_work` calls
`promote_ready_backlog_stories`. A backlog story is promoted iff (a) it has
an explicit non-empty `depends_on` AND (b) every dep is in `5_done` or
`6_archived`. Stories with no `depends_on` are NOT auto-promoted — they wait
for human scheduling.
- Implemented in `server/src/agents/pool/auto_assign/auto_assign.rs::promote_ready_backlog_stories`.
- **Manual path**: `mcp__huskies__move_story story_id=X target_stage=current`,
or `mcp__huskies__start_agent` (which moves the story to current as a
side-effect of starting an agent).
- **Archived-dep warning**: if a dep was satisfied via `6_archived` rather than
`5_done` (e.g. abandoned/superseded), the auto-assigner logs a prominent
warning so the user can see the promotion was triggered by an archived dep.
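The auto-path rule can be sketched as a pure function over the stages of a story's dependencies. The `DepStage` and `Promotion` types and the `check_promotion` name are hypothetical; the real logic lives in `promote_ready_backlog_stories` and may differ in detail.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DepStage {
    Backlog,
    Current,
    Qa,
    Merge,
    Done,
    Archived,
}

/// Outcome of the promotion check for one backlog story.
#[derive(Debug, PartialEq)]
pub enum Promotion {
    /// No `depends_on`: the story waits for human scheduling.
    WaitForHuman,
    /// At least one dependency is not yet in done or archived.
    NotReady,
    /// All dependencies satisfied; `via_archived` lists the ones that were
    /// satisfied from the archive and therefore deserve a warning.
    Promote { via_archived: Vec<String> },
}

/// `deps` pairs each dependency id with its current stage.
pub fn check_promotion(deps: &[(&str, DepStage)]) -> Promotion {
    if deps.is_empty() {
        return Promotion::WaitForHuman;
    }
    let all_met = deps
        .iter()
        .all(|(_, stage)| matches!(stage, DepStage::Done | DepStage::Archived));
    if !all_met {
        return Promotion::NotReady;
    }
    let via_archived = deps
        .iter()
        .filter(|(_, stage)| *stage == DepStage::Archived)
        .map(|(id, _)| id.to_string())
        .collect();
    Promotion::Promote { via_archived }
}
```

A non-empty `via_archived` list is the point where a caller would log the archived-dep warning described above.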
#### Coding → Qa (current → 3_qa)
- Triggered when the coder agent finishes (gates start running).
- `mcp__huskies__request_qa` is the manual verb.
#### Qa → Coding (qa → current — rejection path)
- `mcp__huskies__reject_qa story_id=X notes="..."` moves qa → current,
**clears `review_hold`**, and writes the rejection notes
(`agents/lifecycle.rs:210`).
- Used when a qa agent fails or a human reviewer rejects the work.
#### Qa → Merge (qa → 4_merge)
- Triggered when QA gates pass. `mcp__huskies__move_story_to_merge` is the
dedicated verb.
- For server-mode QA: typed-side `PipelineEvent::QaSkipped` allows going from
Coding → Merge directly without entering Qa.
#### Merge → Done (merge → 5_done)
- Mergemaster picks up a story in `4_merge/`, squashes the feature branch onto
master, then transitions to `5_done`.
- `mcp__huskies__move_story_to_merge` queues; mergemaster does the actual work.
#### Done → Archived(Completed) (5_done → 6_archived)
- Auto-sweep after four hours, OR
- `mcp__huskies__accept_story` (immediate manual archive).
#### Any-stage → Archived(other reasons)
- **Abandoned / Superseded**: today done by `mcp__huskies__move_story
target_stage=done` (no first-class verbs for these reasons; see (b) below).
- **Blocked**: `blocked: true` flag in front matter is set on retry-limit
exceedance. `mcp__huskies__unblock_story` clears the flag and resets
retry_count.
- **MergeFailed**: written to front matter when mergemaster fails; auto-assign
skips these stories (`has_merge_failure` check).
- **ReviewHeld**: `review_hold: true` flag is set automatically on spike
completion; auto-assign skips these stories until the flag is cleared.
#### Tombstone / purge
- `mcp__huskies__delete_story` and `mcp__huskies__purge_story` permanently
remove. Purge writes a CRDT tombstone.
### Auto-assign skip conditions (current production)
`auto_assign_available_work` walks `2_current/`, `3_qa/`, `4_merge/` in order
and attempts to dispatch a free agent to each unassigned story. It **skips**
any story that:
1. Has `review_hold: true` in front matter (spikes after QA, manual hold).
2. Is `frozen` (`is_story_frozen` — pipeline advancement suspended for this story).
3. Has `blocked: true` (retry limit exceeded; cleared via `unblock_story`).
4. Has unmet `depends_on` dependencies.
5. (Merge stage only) Has a recorded merge failure (`has_merge_failure`).
6. (Merge stage only) Has an empty diff on the feature branch — auto-writes
`merge_failure` and blocks immediately rather than wasting a mergemaster turn.
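The six skip conditions can be read as an ordered series of early returns. The sketch below assumes the checks run in the order listed; the `StoryFlags` struct, the `Skip` enum, and `skip_reason` are hypothetical names, and the real code derives these facts from front matter and git rather than from a prepared struct.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ActiveStage {
    Current,
    Qa,
    Merge,
}

/// Hypothetical pre-computed facts about one story.
#[derive(Debug, Default)]
pub struct StoryFlags {
    pub review_hold: bool,
    pub frozen: bool,
    pub blocked: bool,
    pub unmet_deps: bool,
    pub merge_failure: bool,
    pub empty_diff: bool,
}

#[derive(Debug, PartialEq)]
pub enum Skip {
    ReviewHold,
    Frozen,
    Blocked,
    UnmetDeps,
    MergeFailure,
    EmptyDiff,
}

/// Returns the first matching skip condition, or None if the story may be
/// dispatched. Evaluation order is assumed to follow the numbered list.
pub fn skip_reason(stage: ActiveStage, f: &StoryFlags) -> Option<Skip> {
    if f.review_hold {
        return Some(Skip::ReviewHold);
    }
    if f.frozen {
        return Some(Skip::Frozen);
    }
    if f.blocked {
        return Some(Skip::Blocked);
    }
    if f.unmet_deps {
        return Some(Skip::UnmetDeps);
    }
    // The last two conditions apply to the merge stage only.
    if stage == ActiveStage::Merge {
        if f.merge_failure {
            return Some(Skip::MergeFailure);
        }
        if f.empty_diff {
            return Some(Skip::EmptyDiff);
        }
    }
    None
}
```

Returning the reason rather than a bare `bool` makes it straightforward to log why a story was passed over.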
### Front-matter fields that gate transitions
| Field | Type | Effect |
|---|---|---|
| `depends_on` | list of story IDs | Blocks backlog → current promotion until all deps are in 5_done or 6_archived |
| `agent` | string (e.g. `coder-opus`) | Pins the preferred agent for next assignment |
| `review_hold` | bool | Auto-assign skips this story; cleared by `reject_qa` or manual unblock |
| `blocked` | bool | Auto-assign skips this story; cleared by `unblock_story` |
| `frozen` | bool | Auto-assign skips this story; manual unfreeze required |
| `merge_failure` | string | Auto-assign skips merge-stage agents on this story |
| `retry_count` | int | Local-only (not in CRDT); incremented by orchestrator |
### Spike-specific behavior
Per the typical lifecycle, a spike runs through `current → qa` like any work
item, then **stops** in qa awaiting human review (`spikes skip merge`). This
is implemented via `review_hold: true` being written automatically when a
spike's qa gates pass. The user accepts (move qa → done) or rejects (move
qa → current). Spikes do NOT auto-promote to merge.
### Mergemaster lifecycle
The mergemaster agent only runs against stories in `4_merge/`. It:
1. Verifies the feature branch has commits; if it has none, the story is auto-blocked.
2. Squashes the feature branch onto master with a deterministic commit message.
3. Transitions the story to `5_done` with `merged_at` and `merge_commit`.
4. On failure beyond the retry budget, writes `merge_failure` and blocks the
story (auto-assign then skips it).
### Agent terminated with committed work (bug 645 recovery path)
When a coder agent terminates abnormally (e.g. the Claude Code CLI's
`output.write(&bytes).is_ok()` PTY write assertion fires mid-session), the
server-owned completion path detects the crash and checks for surviving work:
1. If the worktree is dirty but has commits ahead of master, reset the
uncommitted files (`git checkout . && git clean -fd`) and run gates
against the committed code.
2. If gates still fail but `git log master..HEAD` shows commits and
`cargo check` passes, **advance to QA** instead of entering the
retry/block path. This is the "work survived" check, implemented in
`server/src/agents/pool/pipeline/advance.rs`.
3. Agents that die WITHOUT committed work (no commits ahead of master)
still follow the existing retry → block path unchanged.

This prevents false-positive blocking of stories where the agent completed
meaningful work before crashing.
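The recovery decision above reduces to a small function over three facts. This is a sketch, not the code in `advance.rs`: the `CrashFacts` struct and `after_crash` name are hypothetical, and the outcome when commits exist but both the gates and `cargo check` fail is assumed to be the ordinary retry/block path.

```rust
#[derive(Debug, PartialEq)]
pub enum CrashOutcome {
    AdvanceToQa,
    RetryOrBlock,
}

/// Facts gathered after the worktree has been reset to its committed state.
pub struct CrashFacts {
    /// Commits on the feature branch ahead of master.
    pub commits_ahead: u32,
    /// Whether the gates passed against the committed code.
    pub gates_passed: bool,
    /// Whether `cargo check` passed against the committed code.
    pub cargo_check_passed: bool,
}

pub fn after_crash(f: &CrashFacts) -> CrashOutcome {
    // No committed work: the existing retry -> block path applies unchanged.
    if f.commits_ahead == 0 {
        return CrashOutcome::RetryOrBlock;
    }
    // Work survived if the gates pass, or if they fail but the code at
    // least compiles.
    if f.gates_passed || f.cargo_check_passed {
        CrashOutcome::AdvanceToQa
    } else {
        CrashOutcome::RetryOrBlock
    }
}
```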
### Watchdog (current production)
The "watchdog" at `server/src/agents/pool/auto_assign/watchdog.rs` runs every
30 ticks of the unified background loop. Its original job is to detect
orphaned agents whose tokio task is `is_finished()` but whose status is still
`Running` or `Pending`, and mark them `Failed` with an `AgentEvent::Error`
emission. Bug 624 (now merged) extends it to also enforce `max_turns` and
`max_budget_usd` limits: an agent over either limit is killed via the
existing `kill_child_for_key` path and recorded with a typed termination
reason.
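The orphan check can be sketched as a pure function over a snapshot of agent entries. The `AgentEntry` struct, its `task_finished` field (standing in for the tokio handle's `is_finished()`), and `sweep_orphans` are hypothetical; the real watchdog also emits `AgentEvent::Error`, which is omitted here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AgentStatus {
    Pending,
    Running,
    Completed,
    Failed,
}

pub struct AgentEntry {
    pub key: String,
    pub status: AgentStatus,
    /// Stand-in for the task handle reporting that it has finished.
    pub task_finished: bool,
}

/// The watchdog runs once every 30 ticks of the background loop.
pub const WATCHDOG_EVERY_TICKS: u64 = 30;

/// Marks orphaned agents as Failed and returns their keys.
/// Does nothing on ticks that are not a multiple of 30.
pub fn sweep_orphans(tick: u64, agents: &mut [AgentEntry]) -> Vec<String> {
    if tick % WATCHDOG_EVERY_TICKS != 0 {
        return Vec::new();
    }
    let mut orphaned = Vec::new();
    for agent in agents.iter_mut() {
        let looks_live =
            matches!(agent.status, AgentStatus::Pending | AgentStatus::Running);
        if looks_live && agent.task_finished {
            agent.status = AgentStatus::Failed;
            orphaned.push(agent.key.clone());
        }
    }
    orphaned
}
```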

---
## (b) Transitions and behaviors that don't yet exist (or are only partially wired)
### Migration of consumers off legacy strings to typed `Stage` enum
The biggest outstanding piece. `pipeline_state.rs` is `#![allow(dead_code)]`.
Every consumer (auto-assign, mergemaster, MCP tools, chat commands) still
works with stage strings (`"2_current"`, `"4_merge"`) and front-matter flags.
The projection layer (`TryFrom<PipelineItemView> for PipelineItem` and
friends) exists but isn't called outside tests. Migration is intentionally
incremental.

**Opportunity**: pick a leaf consumer (e.g. one MCP tool that reads the stage
string) and migrate it to read `Stage` instead. Pattern repeats outward until
all consumers go through the typed projection and the legacy stage-string
code can be deleted.
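A leaf migration might start by parsing the directory string once and branching on an enum afterwards. The sketch below uses the directory names from the stage table; `StageKind` (a payload-free stand-in for `Stage`) and `needs_mergemaster` are hypothetical names, and the real projection goes through `TryFrom<PipelineItemView>` rather than `FromStr`.

```rust
use std::str::FromStr;

/// Payload-free stand-in for the typed `Stage` enum.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StageKind {
    Backlog,
    Coding,
    Qa,
    Merge,
    Done,
    Archived,
}

impl FromStr for StageKind {
    type Err = String;

    /// Maps a legacy stage directory name onto the typed stage.
    fn from_str(dir: &str) -> Result<Self, Self::Err> {
        match dir {
            "1_backlog" => Ok(StageKind::Backlog),
            "2_current" => Ok(StageKind::Coding),
            "3_qa" => Ok(StageKind::Qa),
            "4_merge" => Ok(StageKind::Merge),
            "5_done" => Ok(StageKind::Done),
            "6_archived" => Ok(StageKind::Archived),
            other => Err(format!("unknown stage directory: {other}")),
        }
    }
}

/// Hypothetical leaf consumer: decides on the typed value instead of
/// comparing stage strings inline.
pub fn needs_mergemaster(dir: &str) -> Result<bool, String> {
    let kind: StageKind = dir.parse()?;
    Ok(kind == StageKind::Merge)
}
```

Unknown directory names surface as an error at the parse boundary instead of silently failing a string comparison deeper in the consumer.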
### First-class verbs for archive reasons
`ArchiveReason` already has six variants, but only `Completed` (via
`accept_story`) has a dedicated MCP verb; `Blocked` is reached through the
`blocked: true` flag and cleared by `unblock_story`. Today, `Abandoned`,
`Superseded`, `MergeFailed`, and `ReviewHeld` are reached either via
`move_story target_stage=done` (which doesn't carry the reason) or via
setting front-matter flags on the live story.

**Missing transitions**:
- `mcp__huskies__supersede_story story_id=X by=Y` — sets stage to
`Archived { reason: Superseded { by: Y } }`. Today we use
`move_story → done`, losing the `by` reference. (Came up 2026-04-25 with
spike 621 → refactor 623.)
- `mcp__huskies__abandon_story story_id=X reason="..."` — sets
`Archived { reason: Abandoned }`. Today done via `move_story → done` or
`purge_story`.
- `mcp__huskies__hold_for_review story_id=X reason="..."` — explicitly puts
a story in `Archived { reason: ReviewHeld }` rather than relying on the
auto-set `review_hold` flag.
### Type-conversion transitions
Spike → story conversion is a real workflow (we do it when a spike's scope
grows into an implementation story). Today, converting type via `update_story
front_matter={"type": "story"}` does not bootstrap the
`## Acceptance Criteria` section, and `add_criterion` then permanently fails
on that story (see **bug 625** filed 2026-04-25). The `type` field passed via
front_matter is also silently dropped — same silent-drop bug class as
`acceptance_criteria`. The state machine should treat type conversion as a
transition with side effects — at minimum, ensuring the AC section exists
when transitioning to a type that requires it, and the displayed type
reflects the new value (today the display chip is parsed from the immutable
story_id prefix; story 578 in backlog will fix this by switching to
numeric-only IDs).
### Limit-based agent termination (turn / budget)
Pre-624 master: `max_turns` and `max_budget_usd` per-agent config were read
by the metric tool (`tool_get_agent_remaining_turns_and_budget`) but **not
enforced** anywhere. Observed `coder-1` running 282/50 turns and $10.05/$5.00
USD on story 623 before a human stopped it (bug 624, now merged).
The bug 624 fix adds enforcement to the watchdog. The state-machine impact:
introduces a new agent-termination path distinct from "Failed (orphan)" —
something like `Failed(LimitExceeded { kind: Turns | Budget })`. The
`ExecutionState` enum may want a corresponding terminal variant so it can be
distinguished from generic `Failed`.
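A limit check in the spirit of the proposed `LimitExceeded { kind }` reason could look like the following. The `Limits` and `Usage` structs and the `exceeded` function are hypothetical, turns are assumed to be checked before budget, and "over the limit" is assumed to mean strictly greater than.

```rust
#[derive(Debug, PartialEq)]
pub enum LimitKind {
    Turns,
    Budget,
}

/// Per-agent configuration values.
pub struct Limits {
    pub max_turns: u32,
    pub max_budget_usd: f64,
}

/// What the agent has consumed so far.
pub struct Usage {
    pub turns: u32,
    pub spent_usd: f64,
}

/// Returns which limit was exceeded, if any. Turns are checked first
/// (an assumption; the real ordering is not specified here).
pub fn exceeded(limits: &Limits, usage: &Usage) -> Option<LimitKind> {
    if usage.turns > limits.max_turns {
        Some(LimitKind::Turns)
    } else if usage.spent_usd > limits.max_budget_usd {
        Some(LimitKind::Budget)
    } else {
        None
    }
}
```

With the figures observed on story 623 (282 turns against a limit of 50, $10.05 against $5.00), this check reports `Turns`, since that limit is tested first.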
### Pinned-agent honoring under contention
When a story has `agent: coder-opus` pinned but `coder-opus` is busy, today's
auto-assign behavior is to leave the story unassigned — but if a human stops
the running attempt and the story sits in `current/`, auto-assign **re-grabs
it with the default coder** rather than waiting for the pinned agent.
Observed multiple times on 2026-04-25 with story 623: pinning `coder-opus`
did not prevent `coder-1` (sonnet) from being auto-assigned during opus's
busy window.

**Missing behavior**: auto-assign should treat a pinned agent as a hard
filter ("only this agent can take this story"), not a preference. Today the
workaround is to also set `depends_on` on a phantom story, or move the story
back to backlog and let the dependency system gate it.
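The desired hard-filter behavior can be stated as a small selection function. `pick_agent` is a hypothetical name, and taking the first free agent for unpinned stories is an assumption about the default policy.

```rust
/// Chooses an agent for a story from the currently free agents.
/// A pinned agent is a hard filter: if it is not free, nobody is assigned.
pub fn pick_agent<'a>(pinned: Option<&str>, free_agents: &[&'a str]) -> Option<&'a str> {
    match pinned {
        // Only the pinned agent may take the story.
        Some(name) => free_agents.iter().copied().find(|agent| *agent == name),
        // Unpinned stories take the first free agent (assumed default).
        None => free_agents.first().copied(),
    }
}
```

Under this rule, a story pinned to `coder-opus` stays unassigned while only `coder-1` is free, which is the behavior the 2026-04-25 incidents were missing.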
### Honoring the `blocked` flag (bug 559)
`559_bug_mergemaster_ignores_blocked_flag_and_keeps_respawning_on_blocked_stories`
is in backlog. Even though `blocked: true` is documented as a skip condition
in `auto_assign_available_work`, mergemaster's spawn path apparently checks
something different (or earlier) and respawns on blocked merge-stage stories.
The state machine should make `Stage::Archived { reason: Blocked }` a single
authoritative source so no consumer can incidentally bypass it.
### Formal "ghost story recovery" transition
The `move_story` MCP tool description mentions "recovering a ghost story by
moving it back to current" as a valid use. Ghost stories are CRDT entries
with no corresponding filesystem stage directory (or the inverse). Today this
is an `update_story + move_story` ad-hoc dance. A first-class
`recover_ghost_story` verb that reconciles the CRDT and filesystem would
formalize the recovery path.
### Operator-level visibility / observability
There is no UI, CLI, or doc that shows "the state machine as a diagram." The
typed enums are the closest thing to a canonical specification, but they
aren't rendered anywhere a human can see at a glance: which stages exist,
which transitions are valid, which events trigger them. A generated state
diagram (graphviz or mermaid, dumped into this doc on each release) would
help both new contributors and operators triaging stuck pipelines.
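A generated diagram could be as simple as rendering a transition table to Mermaid's `stateDiagram-v2` syntax. The `TRANSITIONS` table below is inferred from the events listed in section (a) and is illustrative only; a real generator would derive the edges from the typed transition function rather than a hand-written list.

```rust
/// (from, event, to) edges; an illustrative table inferred from this document.
pub const TRANSITIONS: &[(&str, &str, &str)] = &[
    ("Backlog", "DepsMet", "Coding"),
    ("Coding", "GatesStarted", "Qa"),
    ("Coding", "QaSkipped", "Merge"),
    ("Qa", "GatesPassed", "Merge"),
    ("Qa", "GatesFailed", "Coding"),
    ("Merge", "MergeSucceeded", "Done"),
    ("Merge", "MergeFailedFinal", "Archived"),
    ("Done", "Accepted", "Archived"),
];

/// Renders the edges as a Mermaid state diagram, one labeled edge per line.
pub fn to_mermaid(edges: &[(&str, &str, &str)]) -> String {
    let mut out = String::from("stateDiagram-v2\n");
    for (from, event, to) in edges {
        out.push_str(&format!("    {from} --> {to}: {event}\n"));
    }
    out
}
```

The returned string can be pasted into a fenced `mermaid` block in this document as part of a release step.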
### Review-hold cleanup verb
`review_hold: true` is set automatically on spike completion. Clearing it is
done as a side effect of `reject_qa` (which also moves the story qa →
current) or by manually editing front matter. There is no clean "I have
reviewed this, release the hold" verb that doesn't also move the story.
### Cross-node concurrency for execution state
`ExecutionState` is per-node (keyed by pubkey) so two nodes can't fight over
who's running an agent. But there is no formal transition that says "node A
hands the story to node B" if node A goes offline. The state machine's
distributed semantics for this case are not yet specified.

---
## How to update this document
Whenever you discover a transition that doesn't yet exist, or a flag that
behaves surprisingly, add it to **section (b)** with:
- A short description of the desired behavior
- Citation of the work item or incident that surfaced it
- Pointer to the place in `pipeline_state.rs` where it should be modeled (or
note "needs a new variant" if it doesn't fit any existing enum yet)
When a transition from (b) ships, move it to (a) with the relevant file:line
citations.