diff --git a/.story_kit/work/5_done/295_bug_stories_stuck_in_qa_when_qa_agent_is_busy.md b/.story_kit/work/5_done/295_bug_stories_stuck_in_qa_when_qa_agent_is_busy.md
new file mode 100644
index 0000000..1485693
--- /dev/null
+++ b/.story_kit/work/5_done/295_bug_stories_stuck_in_qa_when_qa_agent_is_busy.md
@@ -0,0 +1,64 @@
+---
+name: "Stories stuck in QA when QA agent is busy"
+review_hold: true
+---
+
+# Bug 295: Stories stuck in QA when QA agent is busy
+
+## Description
+
+When multiple stories pass coding gates simultaneously and move to QA, only the first one gets a QA agent assigned. The others fail with "Agent 'qa' is already running" and are never retried when the QA agent becomes free. Stories get stuck in QA with no agent indefinitely.
+
+The root cause is in the server-owned agent completion handler in `server/src/agents/pool.rs`. When a coder finishes and gates pass, the server calls the pipeline advance logic which tries to start a QA agent. If the QA agent is already busy on another story, the start fails with an error and the story is left in `3_qa/` with no agent. There is no retry mechanism — the `auto_assign_available_work` function is only called on startup (via `reconcile_on_startup`) and when agents are manually started, not when agents complete.
+
+## How to Reproduce
+
+1. Have 3 stories in current with coders running (e.g. coder-1, coder-2, coder-opus)
+2. All 3 coders finish within seconds of each other and pass gates
+3. Server tries to start QA agent on all 3:
+   - Story 292: `qa` agent starts successfully
+   - Story 293: fails — `"Agent 'qa' is already running on story '292'"`
+   - Story 294: fails — same error
+4. QA on 292 completes (gates pass after retry)
+5. Stories 293 and 294 remain stuck in QA with no agent — nobody retries them
+
+## Server Log Evidence (2026-03-18)
+
+```
+21:00:35 [agent:292:coder-1] Done.
+21:00:42 [agents] Server-owned completion for '292:coder-1': gates_passed=true
+21:00:47 [agent:292:qa] Spawning claude...
+
+21:01:32 [agent:293:coder-2] Done.
+21:01:34 [agent:294:coder-opus] Done.
+
+21:01:41 [agents] Server-owned completion for '293:coder-2': gates_passed=true
+21:01:41 [ERROR] Failed to start qa agent for '293': Agent 'qa' is already running on story '292'
+
+21:01:48 [agents] Server-owned completion for '294:coder-opus': gates_passed=true
+21:01:48 [ERROR] Failed to start qa agent for '294': Agent 'qa' is already running on story '292'
+
+21:08:18 [agents] Server-owned completion for '292:qa': gates_passed=true
+(293 and 294 are never picked up)
+```
+
+## Actual Result
+
+Stories 293 and 294 stuck in QA with no agent after 292's QA agent was busy. The pipeline status shows them in `3_qa` with `agent: null` indefinitely.
+
+## Expected Result
+
+When a QA agent finishes a story, `auto_assign_available_work` should be called to scan for unassigned stories in all active stages and assign free agents. Stories 293 and 294 should get QA agents as soon as the QA agent finishes with 292.
+
+## Suggested Fix
+
+In the server-owned completion handler (the code path that runs after an agent's process exits), call `auto_assign_available_work()` after processing the completed story. This ensures that when any agent becomes free, the server immediately looks for pending work to assign it to.
+
+The relevant code is in `server/src/agents/pool.rs` — the `handle_agent_completion` path (around line 804-950) and `auto_assign_available_work` (around line 1437-1559).
+
+## Acceptance Criteria
+
+- [ ] When an agent completes (any stage), `auto_assign_available_work` is called to pick up pending stories
+- [ ] Stories that failed agent assignment due to busy agents are picked up when agents become available
+- [ ] Server logs when a story is queued for retry vs permanently failed
+- [ ] Multiple stories completing QA sequentially works correctly (story A finishes QA → story B gets QA agent)
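
Review note: the suggested fix can be illustrated with a minimal, self-contained sketch. It is not the code in `server/src/agents/pool.rs`. The `Pool` struct, its `busy` and `waiting` fields, and the `advance` and `start_agent` helpers are hypothetical stand-ins. Only the names `handle_agent_completion` and `auto_assign_available_work` come from the bug report, and their signatures here are assumptions. The point is only that freeing an agent and rescanning for unassigned stories happen in the same completion path.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical stand-in for the agent pool; models scheduling state only.
#[derive(Default)]
struct Pool {
    /// Agent name -> story it is currently running on.
    busy: HashMap<String, String>,
    /// Story id -> agent it needs, for stories sitting in a stage with no agent.
    /// A BTreeMap keeps the pickup order deterministic (assumption: lowest id first).
    waiting: BTreeMap<String, String>,
}

impl Pool {
    /// Start `agent` on `story`, failing if the agent is already busy.
    fn start_agent(&mut self, story: &str, agent: &str) -> Result<(), String> {
        if let Some(current) = self.busy.get(agent) {
            return Err(format!(
                "Agent '{agent}' is already running on story '{current}'"
            ));
        }
        self.busy.insert(agent.to_string(), story.to_string());
        self.waiting.remove(story);
        Ok(())
    }

    /// Move a story into a stage that needs `agent`. If the agent is busy,
    /// record the story as waiting and log that it is queued for retry.
    fn advance(&mut self, story: &str, agent: &str) {
        if let Err(e) = self.start_agent(story, agent) {
            eprintln!("[agents] '{story}' queued for retry: {e}");
            self.waiting.insert(story.to_string(), agent.to_string());
        }
    }

    /// Scan waiting stories and start every one whose agent is now free.
    /// Returns the ids of the stories that were picked up.
    fn auto_assign_available_work(&mut self) -> Vec<String> {
        let candidates: Vec<(String, String)> = self
            .waiting
            .iter()
            .map(|(story, agent)| (story.clone(), agent.clone()))
            .collect();
        let mut started = Vec::new();
        for (story, agent) in candidates {
            if self.start_agent(&story, &agent).is_ok() {
                started.push(story);
            }
        }
        started
    }

    /// Completion path: free the agent, then immediately look for pending work.
    /// The second call is the step the bug report says is missing.
    fn handle_agent_completion(&mut self, agent: &str) -> Vec<String> {
        self.busy.remove(agent);
        self.auto_assign_available_work()
    }
}
```

Replaying the scenario from the report against this sketch: advancing 292, 293 and 294 to QA leaves 293 and 294 waiting. The first completion of `qa` then picks up 293, and the next completion picks up 294.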