docs: add pipeline state machine reference (current + planned transitions)
Captures the dual representation we have today (legacy filesystem stage strings + front-matter flags vs the typed Stage/ArchiveReason/ExecutionState enums in pipeline_state.rs that are defined-but-not-wired) and itemises the transitions and behaviours we have identified as missing or partially implemented (first-class supersede/abandon/hold verbs, type-conversion side effects, pinned-agent honouring under contention, blocked-flag enforcement beyond auto-assign, ghost-story recovery, etc.). Section (b) is intended as a living dumping ground — append new transitions and incidents as they come up so that the state-machine roadmap (spike 613 in backlog) has a ready-made input.
This commit is contained in:
@@ -0,0 +1,331 @@
|
||||
# Pipeline State Machine
|
||||
|
||||
This document describes the huskies pipeline state machine in two halves:
|
||||
**(a)** the model that runs in production today, and **(b)** transitions, refinements,
|
||||
and corrections we have identified as needed but not yet implemented.
|
||||
|
||||
The codebase is in a deliberate transitional state: a typed CRDT state machine
|
||||
exists at `server/src/pipeline_state.rs` (introduced by story 520) with strict Rust
|
||||
enums for every stage, archive reason, execution state, and event. It is fully
|
||||
defined and tested but **not yet called from non-test code** (`#![allow(dead_code)]`
|
||||
at the top of the module). Consumers will migrate incrementally.
|
||||
|
||||
The model that is actually doing work is the older **filesystem-stage-string +
|
||||
front-matter-flag** model. Section (a) below documents both representations and
|
||||
the migration intent.
|
||||
|
||||
---
|
||||
|
||||
## (a) The current state machine
|
||||
|
||||
### Stages (production: filesystem string; future: typed enum)
|
||||
|
||||
| Filesystem (production) | Typed (future) | Meaning |
|
||||
|---|---|---|
|
||||
| `work/1_backlog/` | `Stage::Backlog` | Story exists, waiting for dependencies or auto-assign promotion |
|
||||
| `work/2_current/` | `Stage::Coding` | Coder agent is running (or about to) |
|
||||
| `work/3_qa/` | `Stage::Qa` | Coder finished; gates / human review running |
|
||||
| `work/4_merge/` | `Stage::Merge { feature_branch, commits_ahead: NonZeroU32 }` | Gates passed, mergemaster ready to squash |
|
||||
| `work/5_done/` | `Stage::Done { merged_at, merge_commit }` | Mergemaster squashed to master |
|
||||
| `work/6_archived/` | `Stage::Archived { archived_at, reason: ArchiveReason }` | Out of the active flow |
|
||||
|
||||
`5_done` auto-sweeps to `6_archived` after four hours. The typed `Stage::Done`
|
||||
variant always carries the merge SHA and timestamp; `Stage::Merge`'s
|
||||
`commits_ahead: NonZeroU32` makes "Merge with nothing to merge" structurally
|
||||
impossible (eliminates bug 519).
|
||||
|
||||
### Archive reasons (`pipeline_state.rs::ArchiveReason`)
|
||||
|
||||
The typed model already enumerates the reasons a story can leave the active flow
|
||||
(subsumes the legacy `blocked`, `merge_failure`, and `review_hold` front-matter
|
||||
fields per story 436):
|
||||
|
||||
- `Completed` — happy-path
|
||||
- `Abandoned` — user explicitly abandoned
|
||||
- `Superseded { by: StoryId }` — replaced by another story
|
||||
- `Blocked { reason: String }` — manually blocked, awaiting human resolution
|
||||
- `MergeFailed { reason: String }` — mergemaster gave up after retry budget
|
||||
- `ReviewHeld { reason: String }` — held for human review at user request
|
||||
|
||||
### Per-node execution state (`pipeline_state.rs::ExecutionState`)
|
||||
|
||||
Stage is shared/CRDT-replicated. Execution state is per-node and lives under
|
||||
each node's pubkey in the CRDT, so there are no inter-author merge conflicts:
|
||||
|
||||
- `Idle`
|
||||
- `Pending { agent, since }` — worktree being created, agent about to start
|
||||
- `Running { agent, started_at, last_heartbeat }`
|
||||
- `RateLimited { agent, resume_at }`
|
||||
- `Completed { agent, exit_code, completed_at }`
|
||||
|
||||
### Pipeline events (`pipeline_state.rs::PipelineEvent`)
|
||||
|
||||
The typed model defines every event that drives a Stage transition. Each variant
|
||||
carries the data needed to construct the destination state, so a transition
|
||||
function can never accidentally land in an underspecified state:
|
||||
|
||||
- `DepsMet` — dependencies met; promote from backlog
|
||||
- `GatesStarted` — coder starting gates
|
||||
- `GatesPassed { feature_branch, commits_ahead }`
|
||||
- `GatesFailed { reason }`
|
||||
- `QaSkipped { feature_branch, commits_ahead }` — qa-mode = "server"; skip QA, go to merge
|
||||
- `MergeSucceeded { merge_commit }`
|
||||
- `MergeFailedFinal { reason }`
|
||||
- `Accepted` — Done → Archived(Completed)
|
||||
|
||||
### Transitions (current production = MCP verb shape)
|
||||
|
||||
#### Backlog → Coding (a.k.a. backlog → 2_current)
|
||||
|
||||
- **Auto path**: `AgentPool::auto_assign_available_work` calls
|
||||
`promote_ready_backlog_stories`. A backlog story is promoted iff (a) it has
|
||||
an explicit non-empty `depends_on` AND (b) every dep is in `5_done` or
|
||||
`6_archived`. Stories with no `depends_on` are NOT auto-promoted — they wait
|
||||
for human scheduling.
|
||||
- Implemented in `server/src/agents/pool/auto_assign/auto_assign.rs::promote_ready_backlog_stories`.
|
||||
- **Manual path**: `mcp__huskies__move_story story_id=X target_stage=current`,
|
||||
or `mcp__huskies__start_agent` (which moves the story to current as a
|
||||
side-effect of starting an agent).
|
||||
- **Archived-dep warning**: if a dep was satisfied via `6_archived` rather than
|
||||
`5_done` (e.g. abandoned/superseded), the auto-assigner logs a prominent
|
||||
warning so the user can see the promotion was triggered by an archived dep.
|
||||
|
||||
#### Coding → Qa (current → 3_qa)
|
||||
|
||||
- Triggered when the coder agent finishes (gates start running).
|
||||
- `mcp__huskies__request_qa` is the manual verb.
|
||||
|
||||
#### Qa → Coding (qa → current — rejection path)
|
||||
|
||||
- `mcp__huskies__reject_qa story_id=X notes="..."` moves qa → current,
|
||||
**clears `review_hold`**, and writes the rejection notes
|
||||
(`agents/lifecycle.rs:210`).
|
||||
- Used when a qa agent fails or a human reviewer rejects the work.
|
||||
|
||||
#### Qa → Merge (qa → 4_merge)
|
||||
|
||||
- Triggered when QA gates pass. `mcp__huskies__move_story_to_merge` is the
|
||||
dedicated verb.
|
||||
- For server-mode QA: typed-side `PipelineEvent::QaSkipped` allows going from
|
||||
Coding → Merge directly without entering Qa.
|
||||
|
||||
#### Merge → Done (merge → 5_done)
|
||||
|
||||
- Mergemaster picks up a story in `4_merge/`, squashes the feature branch onto
|
||||
master, then transitions to `5_done`.
|
||||
- `mcp__huskies__move_story_to_merge` queues; mergemaster does the actual work.
|
||||
|
||||
#### Done → Archived(Completed) (5_done → 6_archived)
|
||||
|
||||
- Auto-sweep after four hours, OR
|
||||
- `mcp__huskies__accept_story` (immediate manual archive).
|
||||
|
||||
#### Any-stage → Archived(other reasons)
|
||||
|
||||
- **Abandoned / Superseded**: today done by `mcp__huskies__move_story
|
||||
target_stage=done` (no first-class verbs for these reasons; see (b) below).
|
||||
- **Blocked**: `blocked: true` flag in front matter is set on retry-limit
|
||||
exceedance. `mcp__huskies__unblock_story` clears the flag and resets
|
||||
retry_count.
|
||||
- **MergeFailed**: written to front matter when mergemaster fails; auto-assign
|
||||
skips these stories (`has_merge_failure` check).
|
||||
- **ReviewHeld**: `review_hold: true` flag is set automatically on spike
|
||||
completion; auto-assign skips these stories until the flag is cleared.
|
||||
|
||||
#### Tombstone / purge
|
||||
|
||||
- `mcp__huskies__delete_story` and `mcp__huskies__purge_story` permanently
|
||||
remove. Purge writes a CRDT tombstone.
|
||||
|
||||
### Auto-assign skip conditions (current production)
|
||||
|
||||
`auto_assign_available_work` walks `2_current/`, `3_qa/`, `4_merge/` in order
|
||||
and attempts to dispatch a free agent to each unassigned story. It **skips**
|
||||
any story that:
|
||||
|
||||
1. Has `review_hold: true` in front matter (spikes after QA, manual hold).
|
||||
2. Is `frozen` (`is_story_frozen` — pipeline advancement suspended for this story).
|
||||
3. Has `blocked: true` (retry limit exceeded; cleared via `unblock_story`).
|
||||
4. Has unmet `depends_on` dependencies.
|
||||
5. (Merge stage only) Has a recorded merge failure (`has_merge_failure`).
|
||||
6. (Merge stage only) Has an empty diff on the feature branch — auto-writes
|
||||
`merge_failure` and blocks immediately rather than wasting a mergemaster turn.
|
||||
|
||||
### Front-matter fields that gate transitions
|
||||
|
||||
| Field | Type | Effect |
|
||||
|---|---|---|
|
||||
| `depends_on` | list of story IDs | Blocks backlog → current promotion until all deps are in 5_done or 6_archived |
|
||||
| `agent` | string (e.g. `coder-opus`) | Pins the preferred agent for next assignment |
|
||||
| `review_hold` | bool | Auto-assign skips this story; cleared by `reject_qa` or manual unblock |
|
||||
| `blocked` | bool | Auto-assign skips this story; cleared by `unblock_story` |
|
||||
| `frozen` | bool | Auto-assign skips this story; manual unfreeze required |
|
||||
| `merge_failure` | string | Auto-assign skips merge-stage agents on this story |
|
||||
| `retry_count` | int | Local-only (not in CRDT); incremented by orchestrator |
|
||||
|
||||
### Spike-specific behavior
|
||||
|
||||
Per the typical lifecycle, a spike runs through `current → qa` like any work
|
||||
item, then **stops** in qa awaiting human review (`spikes skip merge`). This
|
||||
is implemented via `review_hold: true` being written automatically when a
|
||||
spike's qa gates pass. The user accepts (move qa → done) or rejects (move
|
||||
qa → current). Spikes do NOT auto-promote to merge.
|
||||
|
||||
### Mergemaster lifecycle
|
||||
|
||||
The mergemaster agent only runs against stories in `4_merge/`. It:
|
||||
|
||||
1. Verifies the feature branch has commits (or the story is auto-blocked).
|
||||
2. Squashes the feature branch onto master with a deterministic commit message.
|
||||
3. Transitions the story to `5_done` with `merged_at` and `merge_commit`.
|
||||
4. On failure beyond the retry budget, writes `merge_failure` and blocks the
|
||||
story (auto-assign then skips it).
|
||||
|
||||
### Watchdog (current production)
|
||||
|
||||
The "watchdog" at `server/src/agents/pool/auto_assign/watchdog.rs` runs every
|
||||
30 ticks of the unified background loop. Today it does **one** thing: detect
|
||||
orphaned agents whose tokio task is `is_finished()` but whose status is still
|
||||
`Running` or `Pending`, and mark them `Failed` with an `AgentEvent::Error`
|
||||
emission. Bug 624 (now merged) extends it to also enforce `max_turns` and
|
||||
`max_budget_usd` limits — an agent over either limit is killed via the
|
||||
existing `kill_child_for_key` path and recorded with a typed termination
|
||||
reason.
|
||||
|
||||
---
|
||||
|
||||
## (b) Transitions and behaviors that don't yet exist (or are only partially wired)
|
||||
|
||||
### Migration of consumers off legacy strings to typed `Stage` enum
|
||||
|
||||
The biggest outstanding piece. `pipeline_state.rs` is `#![allow(dead_code)]`.
|
||||
Every consumer (auto-assign, mergemaster, MCP tools, chat commands) still
|
||||
works with stage strings (`"2_current"`, `"4_merge"`) and front-matter flags.
|
||||
The projection layer (`TryFrom<PipelineItemView> for PipelineItem` and
|
||||
friends) exists but isn't called outside tests. Migration is intentionally
|
||||
incremental.
|
||||
|
||||
**Opportunity**: pick a leaf consumer (e.g. one MCP tool that reads the stage
|
||||
string) and migrate it to read `Stage` instead. Pattern repeats outward until
|
||||
all consumers go through the typed projection and the legacy stage-string
|
||||
code can be deleted.
|
||||
|
||||
### First-class verbs for archive reasons
|
||||
|
||||
`ArchiveReason` already has six variants but only `Completed` (via
|
||||
`accept_story`) and `Blocked` (via the `blocked: true` flag) have dedicated
|
||||
MCP verbs. Today, `Abandoned`, `Superseded`, `MergeFailed`, and `ReviewHeld`
|
||||
are reached either via `move_story target_stage=done` (which doesn't carry
|
||||
the reason) or via setting front-matter flags on the live story.
|
||||
|
||||
**Missing transitions**:
|
||||
|
||||
- `mcp__huskies__supersede_story story_id=X by=Y` — sets stage to
|
||||
`Archived { reason: Superseded { by: Y } }`. Today we use
|
||||
`move_story → done`, losing the `by` reference. (Came up 2026-04-25 with
|
||||
spike 621 → refactor 623.)
|
||||
- `mcp__huskies__abandon_story story_id=X reason="..."` — sets
|
||||
`Archived { reason: Abandoned }`. Today done via `move_story → done` or
|
||||
`purge_story`.
|
||||
- `mcp__huskies__hold_for_review story_id=X reason="..."` — explicitly puts
|
||||
a story in `Archived { reason: ReviewHeld }` rather than relying on the
|
||||
auto-set `review_hold` flag.
|
||||
|
||||
### Type-conversion transitions
|
||||
|
||||
Spike → story conversion is a real workflow (we do it when a spike's scope
|
||||
grows into an implementation story). Today, converting type via `update_story
|
||||
front_matter={"type": "story"}` does not bootstrap the
|
||||
`## Acceptance Criteria` section, and `add_criterion` then permanently fails
|
||||
on that story (see **bug 625** filed 2026-04-25). The `type` field passed via
|
||||
front_matter is also silently dropped — same silent-drop bug class as
|
||||
`acceptance_criteria`. The state machine should treat type conversion as a
|
||||
transition with side effects — at minimum, ensuring the AC section exists
|
||||
when transitioning to a type that requires it, and the displayed type
|
||||
reflects the new value (today the display chip is parsed from the immutable
|
||||
story_id prefix; story 578 in backlog will fix this by switching to
|
||||
numeric-only IDs).
|
||||
|
||||
### Limit-based agent termination (turn / budget)
|
||||
|
||||
Pre-624 master: `max_turns` and `max_budget_usd` per-agent config were read
|
||||
by the metric tool (`tool_get_agent_remaining_turns_and_budget`) but **not
|
||||
enforced** anywhere. Observed `coder-1` running 282/50 turns and $10.05/$5.00
|
||||
USD on story 623 before a human stopped it (bug 624, now merged).
|
||||
|
||||
The bug 624 fix adds enforcement to the watchdog. The state-machine impact:
|
||||
introduces a new agent-termination path distinct from "Failed (orphan)" —
|
||||
something like `Failed(LimitExceeded { kind: Turns | Budget })`. The
|
||||
`ExecutionState` enum may want a corresponding terminal variant so it can be
|
||||
distinguished from generic `Failed`.
|
||||
|
||||
### Pinned-agent honoring under contention
|
||||
|
||||
When a story has `agent: coder-opus` pinned but `coder-opus` is busy, today's
|
||||
auto-assign behavior is to leave the story unassigned — but if a human stops
|
||||
the running attempt and the story sits in `current/`, auto-assign **re-grabs
|
||||
it with the default coder** rather than waiting for the pinned agent.
|
||||
Observed multiple times on 2026-04-25 with story 623: pinning `coder-opus`
|
||||
did not prevent `coder-1` (sonnet) from being auto-assigned during opus's
|
||||
busy window.
|
||||
|
||||
**Missing behavior**: auto-assign should treat a pinned agent as a hard
|
||||
filter ("only this agent can take this story"), not a preference. Today the
|
||||
workaround is to also set `depends_on` on a phantom story, or move the story
|
||||
back to backlog and let the dependency system gate it.
|
||||
|
||||
### Honoring the `blocked` flag (bug 559)
|
||||
|
||||
`559_bug_mergemaster_ignores_blocked_flag_and_keeps_respawning_on_blocked_stories`
|
||||
is in backlog. Even though `blocked: true` is documented as a skip condition
|
||||
in `auto_assign_available_work`, mergemaster's spawn path apparently checks
|
||||
something different (or earlier) and respawns on blocked merge-stage stories.
|
||||
The state machine should make `Stage::Archived { reason: Blocked }` a single
|
||||
authoritative source so no consumer can incidentally bypass it.
|
||||
|
||||
### Formal "ghost story recovery" transition
|
||||
|
||||
The `move_story` MCP tool description mentions "recovering a ghost story by
|
||||
moving it back to current" as a valid use. Ghost stories are CRDT entries
|
||||
with no corresponding filesystem stage directory (or the inverse). Today this
|
||||
is an `update_story + move_story` ad-hoc dance. A first-class
|
||||
`recover_ghost_story` verb that reconciles the CRDT and filesystem would
|
||||
formalize the recovery path.
|
||||
|
||||
### Operator-level visibility / observability
|
||||
|
||||
There is no UI, CLI, or doc that shows "the state machine as a diagram." The
|
||||
typed enums are the closest thing to a canonical specification, but they
|
||||
aren't rendered anywhere a human can see at a glance: which stages exist,
|
||||
which transitions are valid, which events trigger them. A generated state
|
||||
diagram (graphviz or mermaid, dumped into this doc on each release) would
|
||||
help both new contributors and operators triaging stuck pipelines.
|
||||
|
||||
### Review-hold cleanup verb
|
||||
|
||||
`review_hold: true` is set automatically on spike completion. Clearing it is
|
||||
done as a side effect of `reject_qa` (which also moves the story qa →
|
||||
current) or by manually editing front matter. There is no clean "I have
|
||||
reviewed this, release the hold" verb that doesn't also move the story.
|
||||
|
||||
### Cross-node concurrency for execution state
|
||||
|
||||
`ExecutionState` is per-node (keyed by pubkey) so two nodes can't fight over
|
||||
who's running an agent. But there is no formal transition that says "node A
|
||||
hands the story to node B" if node A goes offline. The state machine's
|
||||
distributed semantics for this case are not yet specified.
|
||||
|
||||
---
|
||||
|
||||
## How to update this document
|
||||
|
||||
Whenever you discover a transition that doesn't yet exist, or a flag that
|
||||
behaves surprisingly, add it to **section (b)** with:
|
||||
|
||||
- A short description of the desired behavior
|
||||
- Citation of the work item or incident that surfaced it
|
||||
- Pointer to the place in `pipeline_state.rs` where it should be modeled (or
|
||||
note "needs a new variant" if it doesn't fit any existing enum yet)
|
||||
|
||||
When a transition from (b) ships, move it to (a) with the relevant file:line
|
||||
citations.
|
||||
Reference in New Issue
Block a user