2026-04-25 13:33:57 +00:00

# Pipeline State Machine

This document describes the huskies pipeline state machine in two halves:
**(a)** the model that runs in production today, and **(b)** transitions, refinements,
and corrections we have identified as needed but not yet implemented.

The codebase is in a deliberate transitional state: a typed CRDT state machine
exists at `server/src/pipeline_state.rs` (introduced by story 520) with strict Rust
enums for every stage, archive reason, execution state, and event. It is fully
defined and tested but **not yet called from non-test code** (`#![allow(dead_code)]`
at the top of the module). Consumers will migrate incrementally.

The model that is actually doing work is the older **filesystem-stage-string +
front-matter-flag** model. Section (a) below documents both representations and
the migration intent.

---

## (a) The current state machine

### Stages (production: filesystem string; future: typed enum)

| Filesystem (production) | Typed (future) | Meaning |
|---|---|---|
| `work/1_backlog/` | `Stage::Backlog` | Story exists, waiting for dependencies or auto-assign promotion |
| `work/2_current/` | `Stage::Coding` | Coder agent is running (or about to start) |
| `work/3_qa/` | `Stage::Qa` | Coder finished; gates / human review running |
| `work/4_merge/` | `Stage::Merge { feature_branch, commits_ahead: NonZeroU32 }` | Gates passed, mergemaster ready to squash |
| `work/5_done/` | `Stage::Done { merged_at, merge_commit }` | Mergemaster squashed to master |
| `work/6_archived/` | `Stage::Archived { archived_at, reason: ArchiveReason }` | Out of the active flow |

`5_done` auto-sweeps to `6_archived` after four hours. The typed `Stage::Done`
variant always carries the merge SHA and timestamp; `Stage::Merge`'s
`commits_ahead: NonZeroU32` makes "Merge with nothing to merge" structurally
impossible (eliminates bug 519).

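
As a rough illustration of why `NonZeroU32` rules out an empty merge, the sketch below mirrors the variant names from the table above. The field types (`String` branch names and SHAs, `u64` timestamps) and the `merge_stage` helper are assumptions made for this example, not the definitions in `pipeline_state.rs`.

```rust
use std::num::NonZeroU32;

/// Archive reasons, using the variant names listed in this document.
#[derive(Debug, Clone, PartialEq)]
pub enum ArchiveReason {
    Completed,
    Abandoned,
    Superseded { by: String },
    Blocked { reason: String },
    MergeFailed { reason: String },
    ReviewHeld { reason: String },
}

/// Pipeline stage. Field types are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merged_at: u64, merge_commit: String },
    Archived { archived_at: u64, reason: ArchiveReason },
}

/// Hypothetical helper: builds a Merge stage, or returns None when the
/// branch has no commits ahead, because `NonZeroU32::new(0)` is None.
pub fn merge_stage(feature_branch: &str, commits_ahead: u32) -> Option<Stage> {
    let commits_ahead = NonZeroU32::new(commits_ahead)?;
    Some(Stage::Merge {
        feature_branch: feature_branch.to_string(),
        commits_ahead,
    })
}
```

With this shape, `merge_stage("feature/x", 0)` returns `None`, so a caller has to handle the empty-branch case before a `Stage::Merge` value can exist at all.
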
### Archive reasons (`pipeline_state.rs::ArchiveReason`)

The typed model already enumerates the reasons a story can leave the active flow.
It subsumes the legacy `blocked`, `merge_failure`, and `review_hold` front-matter
fields (per story 436):

- `Completed`: happy path
- `Abandoned`: user explicitly abandoned
- `Superseded { by: StoryId }`: replaced by another story
- `Blocked { reason: String }`: manually blocked, awaiting human resolution
- `MergeFailed { reason: String }`: mergemaster gave up after retry budget
- `ReviewHeld { reason: String }`: held for human review at user request

### Per-node execution state (`pipeline_state.rs::ExecutionState`)

Stage is shared and CRDT-replicated. Execution state is per-node and lives under
each node's pubkey in the CRDT, so there are no inter-author merge conflicts:

- `Idle`
- `Pending { agent, since }`: worktree being created, agent about to start
- `Running { agent, started_at, last_heartbeat }`
- `RateLimited { agent, resume_at }`
- `Completed { agent, exit_code, completed_at }`

### Pipeline events (`pipeline_state.rs::PipelineEvent`)

The typed model defines every event that drives a Stage transition. Each variant
carries the data needed to construct the destination state, so a transition
function can never accidentally land in an underspecified state:

- `DepsMet`: dependencies met; promote from backlog
- `GatesStarted`: coder starting gates
- `GatesPassed { feature_branch, commits_ahead }`
- `GatesFailed { reason }`
- `QaSkipped { feature_branch, commits_ahead }`: qa-mode = "server"; skip QA, go to merge
- `MergeSucceeded { merge_commit }`
- `MergeFailedFinal { reason }`
- `Accepted`: Done → Archived(Completed)

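
To make the "event carries the destination's data" idea concrete, here is a minimal sketch of a transition function over the events listed above. The payload types, the trimmed `ArchiveReason`, the `String` error type, and the choice to send `GatesFailed` back to `Coding` are all assumptions for illustration; the real signatures in `pipeline_state.rs` may differ.

```rust
use std::num::NonZeroU32;

/// Trimmed archive reasons: only the two reachable in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum ArchiveReason {
    Completed,
    MergeFailed { reason: String },
}

/// Stage payloads are simplified (timestamps omitted) for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merge_commit: String },
    Archived { reason: ArchiveReason },
}

#[derive(Debug, Clone, PartialEq)]
pub enum PipelineEvent {
    DepsMet,
    GatesStarted,
    GatesPassed { feature_branch: String, commits_ahead: NonZeroU32 },
    GatesFailed { reason: String },
    QaSkipped { feature_branch: String, commits_ahead: NonZeroU32 },
    MergeSucceeded { merge_commit: String },
    MergeFailedFinal { reason: String },
    Accepted,
}

/// Hypothetical transition function: each destination stage is built
/// entirely from the current stage plus the event payload.
pub fn transition(stage: Stage, event: PipelineEvent) -> Result<Stage, String> {
    use PipelineEvent::*;
    match (stage, event) {
        (Stage::Backlog, DepsMet) => Ok(Stage::Coding),
        (Stage::Coding, GatesStarted) => Ok(Stage::Qa),
        (Stage::Coding, QaSkipped { feature_branch, commits_ahead })
        | (Stage::Qa, GatesPassed { feature_branch, commits_ahead }) => {
            Ok(Stage::Merge { feature_branch, commits_ahead })
        }
        // Assumption: failed gates send the story back to Coding.
        (Stage::Qa, GatesFailed { .. }) => Ok(Stage::Coding),
        (Stage::Merge { .. }, MergeSucceeded { merge_commit }) => {
            Ok(Stage::Done { merge_commit })
        }
        (Stage::Merge { .. }, MergeFailedFinal { reason }) => Ok(Stage::Archived {
            reason: ArchiveReason::MergeFailed { reason },
        }),
        (Stage::Done { .. }, Accepted) => Ok(Stage::Archived {
            reason: ArchiveReason::Completed,
        }),
        (stage, event) => Err(format!("invalid transition: {:?} on {:?}", stage, event)),
    }
}
```

Because the only way to produce `Stage::Merge` is from an event that already holds a branch name and a non-zero commit count, no match arm can construct it with missing data.
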
### Transitions (current production = MCP verb shape)

#### Backlog → Coding (a.k.a. backlog → 2_current)

- **Auto path**: `AgentPool::auto_assign_available_work` calls
  `promote_ready_backlog_stories`. A backlog story is promoted iff (a) it has
  an explicit non-empty `depends_on` AND (b) every dep is in `5_done` or
  `6_archived`. Stories with no `depends_on` are NOT auto-promoted; they wait
  for human scheduling.
  - Implemented in `server/src/agents/pool/auto_assign/auto_assign.rs::promote_ready_backlog_stories`.
- **Manual path**: `mcp__huskies__move_story story_id=X target_stage=current`,
  or `mcp__huskies__start_agent` (which moves the story to current as a
  side-effect of starting an agent).
- **Archived-dep warning**: if a dep was satisfied via `6_archived` rather than
  `5_done` (e.g. abandoned/superseded), the auto-assigner logs a prominent
  warning so the user can see the promotion was triggered by an archived dep.

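
The auto-path rule above can be sketched as a pure predicate. The `DepStage` enum, the `Promotion` result type, and the choice to treat a dependency that is missing from the map as unmet are assumptions for this example; it is not the body of `promote_ready_backlog_stories`.

```rust
use std::collections::HashMap;

/// Where a dependency currently sits (hypothetical mirror of the six
/// filesystem stage directories).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DepStage {
    Backlog,
    Current,
    Qa,
    Merge,
    Done,
    Archived,
}

#[derive(Debug, PartialEq)]
pub enum Promotion {
    /// No `depends_on`, or at least one dep unfinished: stay in backlog.
    Wait,
    /// All deps finished; `archived_deps` lists those satisfied via
    /// `6_archived`, so the caller can log the archived-dep warning.
    Promote { archived_deps: Vec<String> },
}

pub fn check_promotion(depends_on: &[String], stages: &HashMap<String, DepStage>) -> Promotion {
    // Stories without an explicit, non-empty `depends_on` are never auto-promoted.
    if depends_on.is_empty() {
        return Promotion::Wait;
    }
    let mut archived_deps = Vec::new();
    for dep in depends_on {
        match stages.get(dep) {
            Some(DepStage::Done) => {}
            Some(DepStage::Archived) => archived_deps.push(dep.clone()),
            // Assumption: an unknown or still-active dep counts as unmet.
            _ => return Promotion::Wait,
        }
    }
    Promotion::Promote { archived_deps }
}
```

Returning the archived deps alongside the decision keeps the warning logic at the call site, where the logger lives, while the predicate itself stays easy to test.
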
#### Coding → Qa (current → 3_qa)

- Triggered when the coder agent finishes (gates start running).
- `mcp__huskies__request_qa` is the manual verb.

#### Qa → Coding (qa → current; rejection path)

- `mcp__huskies__reject_qa story_id=X notes="..."` moves qa → current,
  **clears `review_hold`**, and writes the rejection notes
  (`agents/lifecycle.rs:210`).
- Used when a qa agent fails or a human reviewer rejects the work.

#### Qa → Merge (qa → 4_merge)

- Triggered when QA gates pass. `mcp__huskies__move_story_to_merge` is the
  dedicated verb.
- For server-mode QA: typed-side `PipelineEvent::QaSkipped` allows going from
  Coding → Merge directly without entering Qa.

#### Merge → Done (merge → 5_done)

- Mergemaster picks up a story in `4_merge/`, squashes the feature branch onto
  master, then transitions to `5_done`.
- `mcp__huskies__move_story_to_merge` only queues the story; mergemaster does
  the actual work.

#### Done → Archived(Completed) (5_done → 6_archived)

- Auto-sweep after four hours, OR
- `mcp__huskies__accept_story` (immediate manual archive).

#### Any-stage → Archived(other reasons)

- **Abandoned / Superseded**: today done via
  `mcp__huskies__move_story target_stage=done` (no first-class verbs for these
  reasons; see (b) below).
- **Blocked**: the `blocked: true` flag in front matter is set when the retry
  limit is exceeded. `mcp__huskies__unblock_story` clears the flag and resets
  `retry_count`.
- **MergeFailed**: written to front matter when mergemaster fails; auto-assign
  skips these stories (`has_merge_failure` check).
- **ReviewHeld**: the `review_hold: true` flag is set automatically on spike
  completion; auto-assign skips these stories until the flag is cleared.

#### Tombstone / purge

- `mcp__huskies__delete_story` and `mcp__huskies__purge_story` permanently
  remove a story. Purge writes a CRDT tombstone.

### Auto-assign skip conditions (current production)

`auto_assign_available_work` walks `2_current/`, `3_qa/`, `4_merge/` in order
and attempts to dispatch a free agent to each unassigned story. It **skips**
any story that:

1. Has `review_hold: true` in front matter (spikes after QA, manual hold).
2. Is `frozen` (`is_story_frozen`: pipeline advancement suspended for this story).
3. Has `blocked: true` (retry limit exceeded; cleared via `unblock_story`).
4. Has unmet `depends_on` dependencies.
5. (Merge stage only) Has a recorded merge failure (`has_merge_failure`).
6. (Merge stage only) Has an empty diff on the feature branch. In this case
   auto-assign writes `merge_failure` and blocks the story immediately rather
   than wasting a mergemaster turn.

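
The six skip conditions above can be read as one ordered check. The sketch below assumes a flattened `StoryFlags` view of the front matter and a `SkipReason` enum, both hypothetical. It only reports why a story would be skipped and does not model the side effect in condition 6 (writing `merge_failure`).

```rust
/// The three directories auto-assign walks.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ActiveStage {
    Current,
    Qa,
    Merge,
}

/// Hypothetical flattened view of the fields that gate assignment.
#[derive(Debug, Clone, Default)]
pub struct StoryFlags {
    pub review_hold: bool,
    pub frozen: bool,
    pub blocked: bool,
    pub has_unmet_deps: bool,
    pub merge_failure: Option<String>,
    pub feature_branch_diff_empty: bool,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SkipReason {
    ReviewHold,
    Frozen,
    Blocked,
    UnmetDeps,
    MergeFailure,
    EmptyDiff,
}

/// Returns the first matching skip reason, or None if the story may be
/// handed to a free agent.
pub fn skip_reason(stage: ActiveStage, story: &StoryFlags) -> Option<SkipReason> {
    if story.review_hold {
        return Some(SkipReason::ReviewHold);
    }
    if story.frozen {
        return Some(SkipReason::Frozen);
    }
    if story.blocked {
        return Some(SkipReason::Blocked);
    }
    if story.has_unmet_deps {
        return Some(SkipReason::UnmetDeps);
    }
    // Conditions 5 and 6 apply to the merge stage only.
    if stage == ActiveStage::Merge {
        if story.merge_failure.is_some() {
            return Some(SkipReason::MergeFailure);
        }
        if story.feature_branch_diff_empty {
            return Some(SkipReason::EmptyDiff);
        }
    }
    None
}
```

Returning a reason rather than a bare `bool` would also give operators something to log when a story appears stuck.
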
### Front-matter fields that gate transitions

| Field | Type | Effect |
|---|---|---|
| `depends_on` | list of story IDs | Blocks backlog → current promotion until all deps are in `5_done` or `6_archived` |
| `agent` | string (e.g. `coder-opus`) | Pins the preferred agent for next assignment |
| `review_hold` | bool | Auto-assign skips this story; cleared by `reject_qa` or a manual front-matter edit |
| `blocked` | bool | Auto-assign skips this story; cleared by `unblock_story` |
| `frozen` | bool | Auto-assign skips this story; manual unfreeze required |
| `merge_failure` | string | Auto-assign does not dispatch merge-stage agents to this story |
| `retry_count` | int | Local-only (not in CRDT); incremented by orchestrator |

### Spike-specific behavior

A spike runs through `current → qa` like any work item, then **stops** in qa
awaiting human review ("spikes skip merge"). This is implemented via
`review_hold: true` being written automatically when a spike's qa gates pass.
The user accepts (move qa → done) or rejects (move qa → current). Spikes do
NOT auto-promote to merge.

### Mergemaster lifecycle

The mergemaster agent only runs against stories in `4_merge/`. It:

1. Verifies that the feature branch has commits; if it has none, the story is
   auto-blocked.
2. Squashes the feature branch onto master with a deterministic commit message.
3. Transitions the story to `5_done` with `merged_at` and `merge_commit`.
4. On failure beyond the retry budget, writes `merge_failure` and blocks the
   story (auto-assign then skips it).

### Agent terminated with committed work (bug 645 recovery path)

When a coder agent terminates abnormally (e.g. the Claude Code CLI's
`output.write(&bytes).is_ok()` PTY write assertion fires mid-session), the
server-owned completion path detects the crash and checks for surviving work:

1. If the worktree is dirty but has commits ahead of master, reset the
   uncommitted files (`git checkout . && git clean -fd`) and run gates
   against the committed code.
2. If gates still fail but `git log master..HEAD` shows commits and
   `cargo check` passes, **advance to QA** instead of entering the
   retry/block path. This is the "work survived" check, implemented in
   `server/src/agents/pool/pipeline/advance.rs`.
3. Agents that die WITHOUT committed work (no commits ahead of master)
   still follow the existing retry → block path unchanged.

This prevents false-positive blocking of stories where the agent completed
meaningful work before crashing.

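
The three recovery steps above reduce to a small decision once the probes have run. The sketch below assumes the probe results are gathered beforehand into a hypothetical `CrashProbe` struct. It also assumes that passing gates lead to QA as on the normal path and that failing both gates and `cargo check` falls back to retry/block; the text above does not spell out either case.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Recovery {
    AdvanceToQa,
    RetryOrBlock,
}

/// Hypothetical summary of what the completion path found after a crash.
pub struct CrashProbe {
    /// Commits on the feature branch ahead of master (`git log master..HEAD`).
    pub commits_ahead: u32,
    /// Result of running gates on the committed code, after uncommitted
    /// files were reset.
    pub gates_pass: bool,
    /// Result of `cargo check` on the committed code.
    pub cargo_check_passes: bool,
}

pub fn recover_after_crash(probe: &CrashProbe) -> Recovery {
    // Step 3: no committed work means the existing retry -> block path.
    if probe.commits_ahead == 0 {
        return Recovery::RetryOrBlock;
    }
    // Steps 1 and 2: committed work that passes gates, or at least
    // compiles, is treated as having survived the crash.
    if probe.gates_pass || probe.cargo_check_passes {
        Recovery::AdvanceToQa
    } else {
        // Assumption: work that does not even compile is retried.
        Recovery::RetryOrBlock
    }
}
```
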
### Watchdog (current production)

The "watchdog" at `server/src/agents/pool/auto_assign/watchdog.rs` runs every
30 ticks of the unified background loop. Its original job is to detect
orphaned agents whose tokio task is `is_finished()` but whose status is still
`Running` or `Pending`, and mark them `Failed` with an `AgentEvent::Error`
emission. The fix for bug 624 (now merged) extends it to also enforce
`max_turns` and `max_budget_usd` limits: an agent over either limit is killed
via the existing `kill_child_for_key` path and recorded with a typed
termination reason.

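
A minimal sketch of the per-agent decision the watchdog makes, covering both the orphan check and the limit enforcement described above. The `AgentSnapshot` struct, the `WatchdogAction` enum, and the order of the checks are assumptions for illustration; the real code in `watchdog.rs` works against the agent pool rather than a plain struct.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AgentStatus {
    Pending,
    Running,
    Completed,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LimitKind {
    Turns,
    Budget,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WatchdogAction {
    Nothing,
    /// Task finished but status was never updated: mark `Failed`.
    MarkFailedOrphan,
    /// Agent exceeded a configured limit: kill it and record why.
    KillOverLimit(LimitKind),
}

/// Hypothetical point-in-time view of one agent.
#[derive(Debug, Clone, Copy)]
pub struct AgentSnapshot {
    pub status: AgentStatus,
    pub task_finished: bool,
    pub turns_used: u32,
    pub max_turns: Option<u32>,
    pub spent_usd: f64,
    pub max_budget_usd: Option<f64>,
}

pub fn watchdog_check(agent: &AgentSnapshot) -> WatchdogAction {
    // Only agents that claim to be alive are of interest.
    if !matches!(agent.status, AgentStatus::Pending | AgentStatus::Running) {
        return WatchdogAction::Nothing;
    }
    if agent.task_finished {
        return WatchdogAction::MarkFailedOrphan;
    }
    if agent.max_turns.map_or(false, |max| agent.turns_used > max) {
        return WatchdogAction::KillOverLimit(LimitKind::Turns);
    }
    if agent.max_budget_usd.map_or(false, |max| agent.spent_usd > max) {
        return WatchdogAction::KillOverLimit(LimitKind::Budget);
    }
    WatchdogAction::Nothing
}
```
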
---

## (b) Transitions and behaviors that don't yet exist (or are only partially wired)

### Migration of consumers off legacy strings to typed `Stage` enum

This is the biggest outstanding piece. `pipeline_state.rs` is still marked
`#![allow(dead_code)]`. Every consumer (auto-assign, mergemaster, MCP tools,
chat commands) still works with stage strings (`"2_current"`, `"4_merge"`) and
front-matter flags. The projection layer (`TryFrom<PipelineItemView> for
PipelineItem` and friends) exists but isn't called outside tests. Migration is
intentionally incremental.

**Opportunity**: pick a leaf consumer (e.g. one MCP tool that reads the stage
string) and migrate it to read `Stage` instead. The pattern repeats outward
until all consumers go through the typed projection and the legacy
stage-string code can be deleted.

### First-class verbs for archive reasons

`ArchiveReason` already has six variants, but only `Completed` (via
`accept_story`) and `Blocked` (via the `blocked: true` flag, cleared by
`unblock_story`) have dedicated handling. Today, `Abandoned`, `Superseded`,
`MergeFailed`, and `ReviewHeld` are reached either via
`move_story target_stage=done` (which doesn't carry the reason) or via setting
front-matter flags on the live story.

**Missing transitions**:

- `mcp__huskies__supersede_story story_id=X by=Y`: sets stage to
  `Archived { reason: Superseded { by: Y } }`. Today we use
  `move_story → done`, losing the `by` reference. (Came up 2026-04-25 with
  spike 621 → refactor 623.)
- `mcp__huskies__abandon_story story_id=X reason="..."`: sets
  `Archived { reason: Abandoned }`. Today done via `move_story → done` or
  `purge_story`.
- `mcp__huskies__hold_for_review story_id=X reason="..."`: explicitly puts
  a story in `Archived { reason: ReviewHeld }` rather than relying on the
  auto-set `review_hold` flag.

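
One possible shape for the three proposed verbs, sketched as a mapping from a request to the `ArchiveReason` it would record. The `ArchiveVerb` enum, the `StoryId` newtype, and the function name are hypothetical. Note that `Abandoned` carries no payload in the variant list in section (a), so in this sketch the free-text reason passed to `abandon_story` has nowhere to go.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct StoryId(pub String);

/// Variant names as listed in section (a).
#[derive(Debug, Clone, PartialEq)]
pub enum ArchiveReason {
    Completed,
    Abandoned,
    Superseded { by: StoryId },
    Blocked { reason: String },
    MergeFailed { reason: String },
    ReviewHeld { reason: String },
}

/// Hypothetical request types for the three missing verbs.
#[derive(Debug, Clone, PartialEq)]
pub enum ArchiveVerb {
    Supersede { by: StoryId },
    Abandon { reason: String },
    HoldForReview { reason: String },
}

/// Maps a verb to the reason that would be stored in `Stage::Archived`.
pub fn archive_reason_for(verb: ArchiveVerb) -> ArchiveReason {
    match verb {
        // The `by` reference survives, unlike with `move_story → done`.
        ArchiveVerb::Supersede { by } => ArchiveReason::Superseded { by },
        // `Abandoned` has no payload, so the reason text is dropped here;
        // keeping it would need a new field or a separate note.
        ArchiveVerb::Abandon { reason: _ } => ArchiveReason::Abandoned,
        ArchiveVerb::HoldForReview { reason } => ArchiveReason::ReviewHeld { reason },
    }
}
```
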
### Type-conversion transitions

Spike → story conversion is a real workflow (we do it when a spike's scope
grows into an implementation story). Today, converting type via `update_story
front_matter={"type": "story"}` does not bootstrap the
`## Acceptance Criteria` section, and `add_criterion` then permanently fails
on that story (see **bug 625** filed 2026-04-25). The `type` field passed via
front_matter is also silently dropped, which is the same silent-drop bug class
as `acceptance_criteria`. The state machine should treat type conversion as a
transition with side effects. At minimum, it should ensure that the AC section
exists when transitioning to a type that requires it, and that the displayed
type reflects the new value (today the display chip is parsed from the
immutable story_id prefix; story 578 in backlog will fix this by switching to
numeric-only IDs).

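
The "ensure the AC section exists" side effect could look roughly like the helper below. The function name and the exact whitespace handling are assumptions; the only detail taken from the text above is the `## Acceptance Criteria` heading.

```rust
/// Hypothetical helper: returns the story body unchanged if it already has
/// an `## Acceptance Criteria` heading, otherwise appends an empty section.
pub fn ensure_acceptance_criteria(body: &str) -> String {
    const HEADING: &str = "## Acceptance Criteria";

    // Match the heading as a whole line so prose mentioning it does not count.
    if body.lines().any(|line| line.trim_end() == HEADING) {
        return body.to_string();
    }

    let mut out = body.trim_end().to_string();
    if !out.is_empty() {
        // Separate the new section from existing content with a blank line.
        out.push_str("\n\n");
    }
    out.push_str(HEADING);
    out.push('\n');
    out
}
```

Making the helper idempotent means a type-conversion transition could call it unconditionally, including on stories that were converted before and already have the section.
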
### Limit-based agent termination (turn / budget)

On master before the bug 624 fix, the per-agent `max_turns` and
`max_budget_usd` settings were read by the metric tool
(`tool_get_agent_remaining_turns_and_budget`) but **not enforced** anywhere.
We observed `coder-1` reach 282 turns against a limit of 50, and $10.05
against a $5.00 budget, on story 623 before a human stopped it (bug 624, now
merged).

The bug 624 fix adds enforcement to the watchdog. The state-machine impact is
that it introduces a new agent-termination path distinct from
"Failed (orphan)", something like
`Failed(LimitExceeded { kind: Turns | Budget })`. The `ExecutionState` enum may
want a corresponding terminal variant so it can be distinguished from generic
`Failed`.

### Pinned-agent honoring under contention

When a story has `agent: coder-opus` pinned but `coder-opus` is busy, today's
auto-assign behavior is to leave the story unassigned. However, if a human
stops the running attempt and the story sits in `current/`, auto-assign
**re-grabs it with the default coder** rather than waiting for the pinned
agent. Observed multiple times on 2026-04-25 with story 623: pinning
`coder-opus` did not prevent `coder-1` (sonnet) from being auto-assigned during
opus's busy window.

**Missing behavior**: auto-assign should treat a pinned agent as a hard
filter ("only this agent can take this story"), not a preference. Today the
workaround is to also set `depends_on` on a phantom story, or move the story
back to backlog and let the dependency system gate it.

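
The hard-filter behavior described above can be sketched as a selection function. The function name and the "first free agent" fallback for unpinned stories are assumptions for this example; they are not the current auto-assign logic.

```rust
/// Hypothetical selection rule: a pinned story can only go to its pinned
/// agent; an unpinned story takes the first free agent.
pub fn eligible_agent<'a>(pinned: Option<&str>, free_agents: &[&'a str]) -> Option<&'a str> {
    match pinned {
        // Hard filter: if the pinned agent is not free, nobody takes the story.
        Some(pin) => free_agents.iter().copied().find(|agent| *agent == pin),
        None => free_agents.first().copied(),
    }
}
```

Under this rule, a story pinned to `coder-opus` stays unassigned while only `coder-1` is free, which is the behavior the story 623 incident was missing.
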
### Honoring the `blocked` flag (bug 559)

`559_bug_mergemaster_ignores_blocked_flag_and_keeps_respawning_on_blocked_stories`
is in backlog. Even though `blocked: true` is documented as a skip condition
in `auto_assign_available_work`, mergemaster's spawn path apparently checks
something different (or earlier) and respawns on blocked merge-stage stories.
The state machine should make `Stage::Archived { reason: Blocked }` a single
authoritative source so no consumer can incidentally bypass it.

### Formal "ghost story recovery" transition

The `move_story` MCP tool description mentions "recovering a ghost story by
moving it back to current" as a valid use. Ghost stories are CRDT entries
with no corresponding filesystem stage directory (or the inverse). Today this
is an `update_story + move_story` ad-hoc dance. A first-class
`recover_ghost_story` verb that reconciles the CRDT and filesystem would
formalize the recovery path.

### Operator-level visibility / observability

There is no UI, CLI, or doc that shows "the state machine as a diagram." The
typed enums are the closest thing to a canonical specification, but they
aren't rendered anywhere a human can see at a glance: which stages exist,
which transitions are valid, which events trigger them. A generated state
diagram (graphviz or mermaid, dumped into this doc on each release) would
help both new contributors and operators triaging stuck pipelines.

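
As a starting point for such a generator, the sketch below renders a list of `(from, event, to)` triples as a Mermaid `stateDiagram-v2` block. The hand-written `TRANSITIONS` table is an assumption assembled from the events listed in section (a); a real generator would presumably derive it from the typed enums instead.

```rust
/// Hypothetical transition table: (from stage, event, to stage).
pub const TRANSITIONS: &[(&str, &str, &str)] = &[
    ("Backlog", "DepsMet", "Coding"),
    ("Coding", "GatesStarted", "Qa"),
    ("Coding", "QaSkipped", "Merge"),
    ("Qa", "GatesPassed", "Merge"),
    ("Merge", "MergeSucceeded", "Done"),
    ("Merge", "MergeFailedFinal", "Archived"),
    ("Done", "Accepted", "Archived"),
];

/// Renders the transitions as Mermaid state-diagram source text.
pub fn to_mermaid(transitions: &[(&str, &str, &str)]) -> String {
    let mut out = String::from("stateDiagram-v2\n");
    for (from, event, to) in transitions {
        // Mermaid edge syntax: `From --> To: label`.
        out.push_str(&format!("    {} --> {}: {}\n", from, to, event));
    }
    out
}
```

Calling `to_mermaid(TRANSITIONS)` yields text that can be pasted into a fenced `mermaid` block in this document.
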
### Review-hold cleanup verb

`review_hold: true` is set automatically on spike completion. Clearing it is
done as a side effect of `reject_qa` (which also moves the story qa →
current) or by manually editing front matter. There is no clean "I have
reviewed this, release the hold" verb that doesn't also move the story.

### Cross-node concurrency for execution state

`ExecutionState` is per-node (keyed by pubkey) so two nodes can't fight over
who's running an agent. But there is no formal transition that says "node A
hands the story to node B" if node A goes offline. The state machine's
distributed semantics for this case are not yet specified.

---

## How to update this document

Whenever you discover a transition that doesn't yet exist, or a flag that
behaves surprisingly, add it to **section (b)** with:

- A short description of the desired behavior
- Citation of the work item or incident that surfaced it
- Pointer to the place in `pipeline_state.rs` where it should be modeled (or
  note "needs a new variant" if it doesn't fit any existing enum yet)

When a transition from (b) ships, move it to (a) with the relevant file:line
citations.