docs: add pipeline state machine reference (current + planned transitions)

Captures the dual representation we have today (legacy filesystem stage strings + front-matter flags vs the typed Stage/ArchiveReason/ExecutionState enums in pipeline_state.rs that are defined-but-not-wired) and itemises the transitions and behaviours we have identified as missing or partially implemented (first-class supersede/abandon/hold verbs, type-conversion side effects, pinned-agent honouring under contention, blocked-flag enforcement beyond auto-assign, ghost-story recovery, etc.). Section (b) is intended as a living dumping ground — append new transitions and incidents as they come up so that the state-machine roadmap (spike 613 in backlog) has a ready-made input.
2026-04-25 13:33:57 +00:00
parent e20083a283
commit 2097787e1f
1 changed files with 331 additions and 0 deletions
@@ -0,0 +1,331 @@
+# Pipeline State Machine
+
+This document describes the huskies pipeline state machine in two halves:
+**(a)** the model that runs in production today, and **(b)** transitions, refinements,
+and corrections we have identified as needed but not yet implemented.
+
+The codebase is in a deliberate transitional state: a typed CRDT state machine
+exists at `server/src/pipeline_state.rs` (introduced by story 520) with strict Rust
+enums for every stage, archive reason, execution state, and event. It is fully
+defined and tested but **not yet called from non-test code** (`#![allow(dead_code)]`
+at the top of the module). Consumers will migrate incrementally.
+
+The model that is actually doing work is the older **filesystem-stage-string +
+front-matter-flag** model. Section (a) below documents both representations and
+the migration intent.
+
+---
+
+## (a) The current state machine
+
+### Stages (production: filesystem string; future: typed enum)
+
+| Filesystem (production) | Typed (future) | Meaning |
+|---|---|---|
+| `work/1_backlog/` | `Stage::Backlog` | Story exists, waiting for dependencies or auto-assign promotion |
+| `work/2_current/` | `Stage::Coding` | Coder agent is running (or about to) |
+| `work/3_qa/` | `Stage::Qa` | Coder finished; gates / human review running |
+| `work/4_merge/` | `Stage::Merge { feature_branch, commits_ahead: NonZeroU32 }` | Gates passed, mergemaster ready to squash |
+| `work/5_done/` | `Stage::Done { merged_at, merge_commit }` | Mergemaster squashed to master |
+| `work/6_archived/` | `Stage::Archived { archived_at, reason: ArchiveReason }` | Out of the active flow |
+
+`5_done` auto-sweeps to `6_archived` after four hours. The typed `Stage::Done`
+variant always carries the merge SHA and timestamp; `Stage::Merge`'s
+`commits_ahead: NonZeroU32` makes "Merge with nothing to merge" structurally
+impossible (eliminates bug 519).
+
+### Archive reasons (`pipeline_state.rs::ArchiveReason`)
+
+The typed model already enumerates the reasons a story can leave the active flow
+(subsumes the legacy `blocked`, `merge_failure`, and `review_hold` front-matter
+fields per story 436):
+
+- `Completed` — happy-path
+- `Abandoned` — user explicitly abandoned
+- `Superseded { by: StoryId }` — replaced by another story
+- `Blocked { reason: String }` — manually blocked, awaiting human resolution
+- `MergeFailed { reason: String }` — mergemaster gave up after retry budget
+- `ReviewHeld { reason: String }` — held for human review at user request
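+
+The stage table and archive reasons above can be sketched as plain Rust enums.
+This is a minimal illustration, not the contents of `pipeline_state.rs`: the
+`StoryId` alias, the `u64` timestamps, the `String` fields, and the
+`merge_stage` helper are all assumptions made for the example.
+
+```rust
+use std::num::NonZeroU32;
+
+/// Hypothetical alias; the real `StoryId` type is not shown in this document.
+pub type StoryId = String;
+
+/// Why a story left the active flow (variants as listed above).
+#[derive(Debug, Clone, PartialEq)]
+pub enum ArchiveReason {
+    Completed,
+    Abandoned,
+    Superseded { by: StoryId },
+    Blocked { reason: String },
+    MergeFailed { reason: String },
+    ReviewHeld { reason: String },
+}
+
+/// Pipeline stage. Timestamps are modeled as `u64` for illustration only.
+#[derive(Debug, Clone, PartialEq)]
+pub enum Stage {
+    Backlog,
+    Coding,
+    Qa,
+    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
+    Done { merged_at: u64, merge_commit: String },
+    Archived { archived_at: u64, reason: ArchiveReason },
+}
+
+/// Hypothetical helper: builds a `Merge` stage, or returns `None` when the
+/// branch has nothing to merge, because `NonZeroU32::new(0)` is `None`.
+pub fn merge_stage(feature_branch: &str, commits_ahead: u32) -> Option<Stage> {
+    let commits_ahead = NonZeroU32::new(commits_ahead)?;
+    Some(Stage::Merge {
+        feature_branch: feature_branch.to_string(),
+        commits_ahead,
+    })
+}
+```
+
+With this shape, a caller cannot build `Stage::Merge` with zero commits ahead;
+it has to handle the `None` case before a merge stage exists.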
+
+### Per-node execution state (`pipeline_state.rs::ExecutionState`)
+
+Stage is shared/CRDT-replicated. Execution state is per-node and lives under
+each node's pubkey in the CRDT, so there are no inter-author merge conflicts:
+
+- `Idle`
+- `Pending { agent, since }` — worktree being created, agent about to start
+- `Running { agent, started_at, last_heartbeat }`
+- `RateLimited { agent, resume_at }`
+- `Completed { agent, exit_code, completed_at }`
+
+### Pipeline events (`pipeline_state.rs::PipelineEvent`)
+
+The typed model defines every event that drives a Stage transition. Each variant
+carries the data needed to construct the destination state, so a transition
+function can never accidentally land in an underspecified state:
+
+- `DepsMet` — dependencies met; promote from backlog
+- `GatesStarted` — coder starting gates
+- `GatesPassed { feature_branch, commits_ahead }`
+- `GatesFailed { reason }`
+- `QaSkipped { feature_branch, commits_ahead }` — qa-mode = "server"; skip QA, go to merge
+- `MergeSucceeded { merge_commit }`
+- `MergeFailedFinal { reason }`
+- `Accepted` — Done → Archived(Completed)
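+
+As a rough sketch of how these events could drive a transition function, the
+example below matches on `(stage, event)` pairs. The source/destination pairs
+are inferred from this document (for instance, `GatesFailed` returning a story
+to `Coding` is an assumption), and the simplified `Stage` fields and the
+`transition` function name are hypothetical.
+
+```rust
+use std::num::NonZeroU32;
+
+/// Simplified stage type for this example (timestamps omitted).
+#[derive(Debug, Clone, PartialEq)]
+pub enum Stage {
+    Backlog,
+    Coding,
+    Qa,
+    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
+    Done { merge_commit: String },
+    Archived { reason: String },
+}
+
+/// Events as listed above; each carries the data its destination needs.
+#[derive(Debug, Clone, PartialEq)]
+pub enum PipelineEvent {
+    DepsMet,
+    GatesStarted,
+    GatesPassed { feature_branch: String, commits_ahead: NonZeroU32 },
+    GatesFailed { reason: String },
+    QaSkipped { feature_branch: String, commits_ahead: NonZeroU32 },
+    MergeSucceeded { merge_commit: String },
+    MergeFailedFinal { reason: String },
+    Accepted,
+}
+
+/// Hypothetical transition function: returns the next stage, or an error
+/// for any (stage, event) pair that is not an allowed transition.
+pub fn transition(stage: Stage, event: PipelineEvent) -> Result<Stage, String> {
+    use PipelineEvent as E;
+    match (stage, event) {
+        (Stage::Backlog, E::DepsMet) => Ok(Stage::Coding),
+        (Stage::Coding, E::GatesStarted) => Ok(Stage::Qa),
+        // Server-mode QA skips Qa; passing gates leaves Qa. Both land in Merge.
+        (Stage::Coding, E::QaSkipped { feature_branch, commits_ahead })
+        | (Stage::Qa, E::GatesPassed { feature_branch, commits_ahead }) => {
+            Ok(Stage::Merge { feature_branch, commits_ahead })
+        }
+        // Assumption: failed gates send the story back to the coder.
+        (Stage::Qa, E::GatesFailed { .. }) => Ok(Stage::Coding),
+        (Stage::Merge { .. }, E::MergeSucceeded { merge_commit }) => {
+            Ok(Stage::Done { merge_commit })
+        }
+        (Stage::Merge { .. }, E::MergeFailedFinal { reason }) => {
+            Ok(Stage::Archived { reason: format!("MergeFailed: {}", reason) })
+        }
+        (Stage::Done { .. }, E::Accepted) => {
+            Ok(Stage::Archived { reason: "Completed".to_string() })
+        }
+        (stage, event) => Err(format!("invalid transition: {:?} on {:?}", stage, event)),
+    }
+}
+```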
+
+### Transitions (current production = MCP verb shape)
+
+#### Backlog → Coding (a.k.a. backlog → 2_current)
+
+- **Auto path**: `AgentPool::auto_assign_available_work` calls
+  `promote_ready_backlog_stories`. A backlog story is promoted iff (a) it has
+  an explicit non-empty `depends_on` AND (b) every dep is in `5_done` or
+  `6_archived`. Stories with no `depends_on` are NOT auto-promoted — they wait
+  for human scheduling.
+  - Implemented in `server/src/agents/pool/auto_assign/auto_assign.rs::promote_ready_backlog_stories`.
+- **Manual path**: `mcp__huskies__move_story story_id=X target_stage=current`,
+  or `mcp__huskies__start_agent` (which moves the story to current as a
+  side-effect of starting an agent).
+- **Archived-dep warning**: if a dep was satisfied via `6_archived` rather than
+  `5_done` (e.g. abandoned/superseded), the auto-assigner logs a prominent
+  warning so the user can see the promotion was triggered by an archived dep.
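+
+The auto-promotion rule above can be sketched as a pure predicate. This is an
+illustration only, not the body of `promote_ready_backlog_stories`: the
+function names, the `u32` story IDs, and the `HashMap` lookup of stage
+directory names are assumptions for the example.
+
+```rust
+use std::collections::HashMap;
+
+/// True iff the story has a non-empty `depends_on` and every dependency is
+/// in `5_done` or `6_archived`. Stories with no dependencies are never
+/// auto-promoted.
+pub fn is_promotable(depends_on: &[u32], stage_of: &HashMap<u32, &str>) -> bool {
+    !depends_on.is_empty()
+        && depends_on.iter().all(|dep| {
+            matches!(
+                stage_of.get(dep).copied(),
+                Some("5_done") | Some("6_archived")
+            )
+        })
+}
+
+/// Dependencies that were satisfied via the archive rather than `5_done`;
+/// a non-empty result is the case that warrants the prominent warning.
+pub fn archived_deps(depends_on: &[u32], stage_of: &HashMap<u32, &str>) -> Vec<u32> {
+    depends_on
+        .iter()
+        .copied()
+        .filter(|dep| stage_of.get(dep).copied() == Some("6_archived"))
+        .collect()
+}
+```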
+
+#### Coding → Qa (current → 3_qa)
+
+- Triggered when the coder agent finishes (gates start running).
+- `mcp__huskies__request_qa` is the manual verb.
+
+#### Qa → Coding (qa → current — rejection path)
+
+- `mcp__huskies__reject_qa story_id=X notes="..."` moves qa → current,
+  **clears `review_hold`**, and writes the rejection notes
+  (`agents/lifecycle.rs:210`).
+- Used when a qa agent fails or a human reviewer rejects the work.
+
+#### Qa → Merge (qa → 4_merge)
+
+- Triggered when QA gates pass. `mcp__huskies__move_story_to_merge` is the
+  dedicated verb.
+- For server-mode QA: typed-side `PipelineEvent::QaSkipped` allows going from
+  Coding → Merge directly without entering Qa.
+
+#### Merge → Done (merge → 5_done)
+
+- Mergemaster picks up a story in `4_merge/`, squashes the feature branch onto
+  master, then transitions to `5_done`.
+- `mcp__huskies__move_story_to_merge` queues; mergemaster does the actual work.
+
+#### Done → Archived(Completed) (5_done → 6_archived)
+
+- Auto-sweep after four hours, OR
+- `mcp__huskies__accept_story` (immediate manual archive).
+
+#### Any-stage → Archived(other reasons)
+
+- **Abandoned / Superseded**: today done by `mcp__huskies__move_story
+  target_stage=done` (no first-class verbs for these reasons; see (b) below).
+- **Blocked**: the `blocked: true` flag in front matter is set when the retry
+  limit is exceeded. `mcp__huskies__unblock_story` clears the flag and resets
+  `retry_count`.
+- **MergeFailed**: written to front matter when mergemaster fails; auto-assign
+  skips these stories (`has_merge_failure` check).
+- **ReviewHeld**: `review_hold: true` flag is set automatically on spike
+  completion; auto-assign skips these stories until the flag is cleared.
+
+#### Tombstone / purge
+
+- `mcp__huskies__delete_story` and `mcp__huskies__purge_story` permanently
+  remove. Purge writes a CRDT tombstone.
+
+### Auto-assign skip conditions (current production)
+
+`auto_assign_available_work` walks `2_current/`, `3_qa/`, `4_merge/` in order
+and attempts to dispatch a free agent to each unassigned story. It **skips**
+any story that:
+
+1. Has `review_hold: true` in front matter (set for spikes after QA, or by a manual hold).
+2. Is `frozen` (`is_story_frozen` — pipeline advancement suspended for this story).
+3. Has `blocked: true` (retry limit exceeded; cleared via `unblock_story`).
+4. Has unmet `depends_on` dependencies.
+5. (Merge stage only) Has a recorded merge failure (`has_merge_failure`).
+6. (Merge stage only) Has an empty diff on the feature branch — auto-writes
+   `merge_failure` and blocks immediately rather than wasting a mergemaster turn.
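+
+The six skip conditions can be read as an ordered check that returns the first
+matching reason. The sketch below assumes a simplified `FrontMatter` struct and
+a hypothetical `skip_reason` function; the real `auto_assign_available_work`
+may evaluate these in a different order or from different sources.
+
+```rust
+/// Simplified front matter with only the fields that gate auto-assign.
+#[derive(Debug, Default, Clone)]
+pub struct FrontMatter {
+    pub review_hold: bool,
+    pub frozen: bool,
+    pub blocked: bool,
+    pub merge_failure: Option<String>,
+}
+
+#[derive(Debug, PartialEq)]
+pub enum SkipReason {
+    ReviewHold,
+    Frozen,
+    Blocked,
+    UnmetDeps,
+    MergeFailure,
+    EmptyDiff,
+}
+
+/// Returns the first skip condition that applies, or `None` if the story
+/// is eligible for an agent. `stage_dir` is the filesystem stage name.
+pub fn skip_reason(
+    fm: &FrontMatter,
+    stage_dir: &str,
+    deps_met: bool,
+    commits_ahead: u32,
+) -> Option<SkipReason> {
+    if fm.review_hold {
+        return Some(SkipReason::ReviewHold);
+    }
+    if fm.frozen {
+        return Some(SkipReason::Frozen);
+    }
+    if fm.blocked {
+        return Some(SkipReason::Blocked);
+    }
+    if !deps_met {
+        return Some(SkipReason::UnmetDeps);
+    }
+    // The last two conditions apply to the merge stage only.
+    if stage_dir == "4_merge" {
+        if fm.merge_failure.is_some() {
+            return Some(SkipReason::MergeFailure);
+        }
+        if commits_ahead == 0 {
+            return Some(SkipReason::EmptyDiff);
+        }
+    }
+    None
+}
+```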
+
+### Front-matter fields that gate transitions
+
+| Field | Type | Effect |
+|---|---|---|
+| `depends_on` | list of story IDs | Blocks backlog → current promotion until all deps are in `5_done` or `6_archived` |
+| `agent` | string (e.g. `coder-opus`) | Pins the preferred agent for next assignment |
+| `review_hold` | bool | Auto-assign skips this story; cleared by `reject_qa` or by manually editing front matter |
+| `blocked` | bool | Auto-assign skips this story; cleared by `unblock_story` |
+| `frozen` | bool | Auto-assign skips this story; manual unfreeze required |
+| `merge_failure` | string | Auto-assign does not dispatch a merge-stage agent to this story |
+| `retry_count` | int | Local-only (not in CRDT); incremented by orchestrator, reset by `unblock_story` |
+
+### Spike-specific behavior
+
+A spike runs through `current → qa` like any other work item, then **stops**
+in qa awaiting human review: spikes skip merge. This is implemented by writing
+`review_hold: true` automatically when a spike's qa gates pass. The user then
+either accepts (move qa → done) or rejects (move qa → current). Spikes do NOT
+auto-promote to merge.
+
+### Mergemaster lifecycle
+
+The mergemaster agent only runs against stories in `4_merge/`. It:
+
+1. Verifies the feature branch has commits (otherwise the story is auto-blocked).
+2. Squashes the feature branch onto master with a deterministic commit message.
+3. Transitions the story to `5_done` with `merged_at` and `merge_commit`.
+4. On failure beyond the retry budget, writes `merge_failure` and blocks the
+   story (auto-assign then skips it).
+
+### Watchdog (current production)
+
+The "watchdog" at `server/src/agents/pool/auto_assign/watchdog.rs` runs every
+30 ticks of the unified background loop. Its original job is **one** thing:
+detect orphaned agents whose tokio task is `is_finished()` but whose status is
+still `Running` or `Pending`, and mark them `Failed` with an
+`AgentEvent::Error` emission. The fix for bug 624 (now merged) extends it to
+also enforce `max_turns` and `max_budget_usd` limits: an agent over either
+limit is killed via the existing `kill_child_for_key` path and recorded with a
+typed termination reason.
+
+---
+
+## (b) Transitions and behaviors that don't yet exist (or are only partially wired)
+
+### Migration of consumers off legacy strings to typed `Stage` enum
+
+The biggest outstanding piece. `pipeline_state.rs` is still marked `#![allow(dead_code)]`.
+Every consumer (auto-assign, mergemaster, MCP tools, chat commands) still
+works with stage strings (`"2_current"`, `"4_merge"`) and front-matter flags.
+The projection layer (`TryFrom<PipelineItemView> for PipelineItem` and
+friends) exists but isn't called outside tests. Migration is intentionally
+incremental.
+
+**Opportunity**: pick a leaf consumer (e.g. one MCP tool that reads the stage
+string) and migrate it to read `Stage` instead. Repeat the pattern outward
+until all consumers go through the typed projection and the legacy
+stage-string code can be deleted.
+
+### First-class verbs for archive reasons
+
+`ArchiveReason` already has six variants, but only two are reachable through a
+dedicated mechanism: `Completed` (via `accept_story`) and `Blocked` (via the
+`blocked: true` flag, cleared by `unblock_story`). Today, `Abandoned`,
+`Superseded`, `MergeFailed`, and `ReviewHeld` are reached either via
+`move_story target_stage=done` (which doesn't carry the reason) or via
+setting front-matter flags on the live story.
+
+**Missing transitions**:
+
+- `mcp__huskies__supersede_story story_id=X by=Y` — sets stage to
+  `Archived { reason: Superseded { by: Y } }`. Today we use
+  `move_story → done`, losing the `by` reference. (Came up 2026-04-25 with
+  spike 621 → refactor 623.)
+- `mcp__huskies__abandon_story story_id=X reason="..."` — sets
+  `Archived { reason: Abandoned }`. Today done via `move_story → done` or
+  `purge_story`.
+- `mcp__huskies__hold_for_review story_id=X reason="..."` — explicitly puts
+  a story in `Archived { reason: ReviewHeld }` rather than relying on the
+  auto-set `review_hold` flag.
+
+### Type-conversion transitions
+
+Spike → story conversion is a real workflow (we do it when a spike's scope
+grows into an implementation story). Today, converting type via `update_story
+front_matter={"type": "story"}` does not bootstrap the
+`## Acceptance Criteria` section, and `add_criterion` then permanently fails
+on that story (see **bug 625**, filed 2026-04-25). The `type` field passed via
+front_matter is also silently dropped, the same silent-drop bug class as
+`acceptance_criteria`.
+
+The state machine should treat type conversion as a transition with side
+effects. At minimum it should:
+
+- ensure the AC section exists when transitioning to a type that requires it;
+- make the displayed type reflect the new value (today the display chip is
+  parsed from the immutable story_id prefix; story 578 in backlog will fix
+  this by switching to numeric-only IDs).
+
+### Limit-based agent termination (turn / budget)
+
+Pre-624 master: the per-agent `max_turns` and `max_budget_usd` config values
+were read by the metric tool (`tool_get_agent_remaining_turns_and_budget`) but
+**not enforced** anywhere. `coder-1` was observed at 282 turns against a limit
+of 50, and at $10.05 against a $5.00 budget, on story 623 before a human
+stopped it (bug 624, now merged).
+
+The bug 624 fix adds enforcement to the watchdog. The state-machine impact:
+introduces a new agent-termination path distinct from "Failed (orphan)" —
+something like `Failed(LimitExceeded { kind: Turns | Budget })`. The
+`ExecutionState` enum may want a corresponding terminal variant so it can be
+distinguished from generic `Failed`.
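+
+One possible shape for that typed termination reason is sketched below. The
+names `LimitKind`, `Termination`, `AgentLimits`, and `check_limits` are
+hypothetical, and treating a limit as exceeded only when usage is strictly
+greater than the configured value is an assumption; the merged bug 624 fix may
+differ on both points.
+
+```rust
+/// Which configured limit an agent ran past.
+#[derive(Debug, Clone, Copy, PartialEq)]
+pub enum LimitKind {
+    Turns,
+    Budget,
+}
+
+/// Why the watchdog terminated an agent.
+#[derive(Debug, Clone, PartialEq)]
+pub enum Termination {
+    /// Task finished but status was still Running or Pending.
+    Orphaned,
+    /// Agent exceeded `max_turns` or `max_budget_usd`.
+    LimitExceeded { kind: LimitKind },
+}
+
+/// Per-agent limits as described in the config.
+pub struct AgentLimits {
+    pub max_turns: u32,
+    pub max_budget_usd: f64,
+}
+
+/// Returns a termination reason if either limit is exceeded.
+/// Turns are checked first; that ordering is arbitrary in this sketch.
+pub fn check_limits(turns_used: u32, spent_usd: f64, limits: &AgentLimits) -> Option<Termination> {
+    if turns_used > limits.max_turns {
+        Some(Termination::LimitExceeded { kind: LimitKind::Turns })
+    } else if spent_usd > limits.max_budget_usd {
+        Some(Termination::LimitExceeded { kind: LimitKind::Budget })
+    } else {
+        None
+    }
+}
+```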
+
+### Pinned-agent honoring under contention
+
+When a story has `agent: coder-opus` pinned but `coder-opus` is busy, today's
+auto-assign behavior is to leave the story unassigned. However, if a human
+stops the running attempt and the story sits in `2_current/`, auto-assign
+**re-grabs it with the default coder** rather than waiting for the pinned
+agent. This was observed multiple times on 2026-04-25 with story 623: pinning
+`coder-opus` did not prevent `coder-1` (sonnet) from being auto-assigned
+during opus's busy window.
+
+**Missing behavior**: auto-assign should treat a pinned agent as a hard
+filter ("only this agent can take this story"), not a preference. Today the
+workaround is to also set `depends_on` on a phantom story, or move the story
+back to backlog and let the dependency system gate it.
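+
+The hard-filter behavior could look like the following sketch. The `pick_agent`
+function and its slice-of-names input are hypothetical; the point is only that
+a pinned story yields `None` when its agent is busy, instead of falling back to
+another free agent.
+
+```rust
+/// Chooses an agent for a story from the currently free agents.
+/// A pinned agent is a hard filter: if it is not free, nobody takes the story.
+pub fn pick_agent<'a>(pinned: Option<&str>, free_agents: &[&'a str]) -> Option<&'a str> {
+    match pinned {
+        // Pinned: only the named agent is acceptable.
+        Some(name) => free_agents.iter().copied().find(|agent| *agent == name),
+        // Unpinned: take the first free agent, if any.
+        None => free_agents.first().copied(),
+    }
+}
+```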
+
+### Honoring the `blocked` flag (bug 559)
+
+`559_bug_mergemaster_ignores_blocked_flag_and_keeps_respawning_on_blocked_stories`
+is in backlog. Even though `blocked: true` is documented as a skip condition
+in `auto_assign_available_work`, mergemaster's spawn path apparently checks
+something different (or earlier) and respawns on blocked merge-stage stories.
+The state machine should make `Stage::Archived { reason: Blocked }` a single
+authoritative source so no consumer can incidentally bypass it.
+
+### Formal "ghost story recovery" transition
+
+The `move_story` MCP tool description mentions "recovering a ghost story by
+moving it back to current" as a valid use. Ghost stories are CRDT entries
+with no corresponding filesystem stage directory (or the inverse). Today this
+is an ad-hoc `update_story` + `move_story` dance. A first-class
+`recover_ghost_story` verb that reconciles the CRDT and filesystem would
+formalize the recovery path.
+
+### Operator-level visibility / observability
+
+There is no UI, CLI, or doc that shows "the state machine as a diagram." The
+typed enums are the closest thing to a canonical specification, but they
+aren't rendered anywhere a human can see at a glance: which stages exist,
+which transitions are valid, which events trigger them. A generated state
+diagram (graphviz or mermaid, dumped into this doc on each release) would
+help both new contributors and operators triaging stuck pipelines.
+
+### Review-hold cleanup verb
+
+`review_hold: true` is set automatically on spike completion. Clearing it is
+done as a side effect of `reject_qa` (which also moves the story qa →
+current) or by manually editing front matter. There is no clean "I have
+reviewed this, release the hold" verb that doesn't also move the story.
+
+### Cross-node concurrency for execution state
+
+`ExecutionState` is per-node (keyed by pubkey) so two nodes can't fight over
+who's running an agent. But there is no formal transition that says "node A
+hands the story to node B" if node A goes offline. The state machine's
+distributed semantics for this case are not yet specified.
+
+---
+
+## How to update this document
+
+Whenever you discover a transition that doesn't yet exist, or a flag that
+behaves surprisingly, add it to **section (b)** with:
+
+- A short description of the desired behavior
+- Citation of the work item or incident that surfaced it
+- Pointer to the place in `pipeline_state.rs` where it should be modeled (or
+  note "needs a new variant" if it doesn't fit any existing enum yet)
+
+When a transition from (b) ships, move it to (a) with the relevant file:line
+citations.