story-kit: create 295_bug_stories_stuck_in_qa_when_qa_agent_is_busy

2026-03-18 21:24:11 +00:00
parent 483dca5b95
commit 13c0ee4c08
1 changed file with 41 additions and 8 deletions
--- a/.story_kit/work/1_backlog/295_bug_stories_stuck_in_qa_when_qa_agent_is_busy.md
+++ b/.story_kit/work/1_backlog/295_bug_stories_stuck_in_qa_when_qa_agent_is_busy.md
@@ -8,23 +8,56 @@ name: "Stories stuck in QA when QA agent is busy"

 When multiple stories pass coding gates simultaneously and move to QA, only the first one gets a QA agent assigned. The others fail with "Agent 'qa' is already running" and are never retried when the QA agent becomes free. Stories get stuck in QA with no agent indefinitely.

+The root cause is in the server-owned agent completion handler in `server/src/agents/pool.rs`. When a coder finishes and gates pass, the server calls the pipeline advance logic which tries to start a QA agent. If the QA agent is already busy on another story, the start fails with an error and the story is left in `3_qa/` with no agent. There is no retry mechanism — the `auto_assign_available_work` function is only called on startup (via `reconcile_on_startup`) and when agents are manually started, not when agents complete.
+
 ## How to Reproduce

-1. Have 3 stories in current with coders running
-2. All 3 coders finish within seconds of each other
-3. Server tries to start QA agent on all 3 — first succeeds, other 2 fail
-4. After first QA completes, the other 2 stories remain in QA with no agent
+1. Have 3 stories in current with coders running (e.g. coder-1, coder-2, coder-opus)
+2. All 3 coders finish within seconds of each other and pass gates
+3. Server tries to start QA agent on all 3:
+   - Story 292: `qa` agent starts successfully
+   - Story 293: fails — `"Agent 'qa' is already running on story '292'"`
+   - Story 294: fails — same error
+4. QA on 292 completes (gates pass after retry)
+5. Stories 293 and 294 remain stuck in QA with no agent — nobody retries them
+
+## Server Log Evidence (2026-03-18)
+
+```
+21:00:35 [agent:292:coder-1] Done.
+21:00:42 [agents] Server-owned completion for '292:coder-1': gates_passed=true
+21:00:47 [agent:292:qa] Spawning claude...
+
+21:01:32 [agent:293:coder-2] Done.
+21:01:34 [agent:294:coder-opus] Done.
+
+21:01:41 [agents] Server-owned completion for '293:coder-2': gates_passed=true
+21:01:41 [ERROR] Failed to start qa agent for '293': Agent 'qa' is already running on story '292'
+
+21:01:48 [agents] Server-owned completion for '294:coder-opus': gates_passed=true
+21:01:48 [ERROR] Failed to start qa agent for '294': Agent 'qa' is already running on story '292'
+
+21:08:18 [agents] Server-owned completion for '292:qa': gates_passed=true
+(293 and 294 are never picked up)
+```

 ## Actual Result

-Stories 293 and 294 stuck in QA with no agent after 292's QA agent was busy.
+Stories 293 and 294 stay stuck in QA with no agent because the `qa` agent was busy with 292 when they arrived. The pipeline status shows them in `3_qa` with `agent: null` indefinitely.

 ## Expected Result

-When a QA agent finishes a story, auto-assign should check for other stories waiting in QA and pick them up.
+When a QA agent finishes a story, `auto_assign_available_work` should be called to scan for unassigned stories in all active stages and assign free agents. Stories 293 and 294 should get QA agents as soon as the QA agent finishes with 292.
+
+## Suggested Fix
+
+In the server-owned completion handler (the code path that runs after an agent's process exits), call `auto_assign_available_work()` after processing the completed story. This ensures that when any agent becomes free, the server immediately looks for pending work to assign it to.
+
+The relevant code is in `server/src/agents/pool.rs`: the `handle_agent_completion` path (around lines 804-950) and `auto_assign_available_work` (around lines 1437-1559).

 ## Acceptance Criteria

- [ ] When an agent completes, auto-assign scans for unassigned stories in all active stages and assigns free agents
- [ ] Stories that failed agent assignment due to busy agents are retried when agents become available
+- [ ] When an agent completes (any stage), `auto_assign_available_work` is called to pick up pending stories
+- [ ] Stories that failed agent assignment due to busy agents are picked up when agents become available
 - [ ] Server logs when a story is queued for retry vs permanently failed
+- [ ] Multiple stories completing QA sequentially works correctly (story A finishes QA → story B gets QA agent)
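
The suggested fix in this story can be sketched in Rust as follows. This is a minimal illustration, not the actual code in `server/src/agents/pool.rs`: the `Pool` struct, its fields, and all method signatures are assumptions made for the example. Only the function names `handle_agent_completion` and `auto_assign_available_work` and the error text are taken from the story.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical, simplified stand-in for the real agent pool.
#[derive(Default)]
pub struct Pool {
    /// Agent name -> story the agent is currently running on.
    running: HashMap<String, u32>,
    /// Stories sitting in a stage with no agent: story id -> agent they need.
    unassigned: BTreeMap<u32, String>,
}

impl Pool {
    /// Try to start `agent` on `story`. If the agent is busy, the story is
    /// recorded as unassigned so a later scan can pick it up.
    pub fn start_agent(&mut self, story: u32, agent: &str) -> Result<(), String> {
        if let Some(&busy_on) = self.running.get(agent) {
            self.unassigned.insert(story, agent.to_string());
            return Err(format!(
                "Agent '{agent}' is already running on story '{busy_on}'"
            ));
        }
        self.running.insert(agent.to_string(), story);
        self.unassigned.remove(&story);
        Ok(())
    }

    /// Scan unassigned stories (lowest id first) and start every agent that
    /// is currently free. Returns the stories that received an agent.
    pub fn auto_assign_available_work(&mut self) -> Vec<u32> {
        let waiting: Vec<(u32, String)> = self
            .unassigned
            .iter()
            .map(|(story, agent)| (*story, agent.clone()))
            .collect();
        let mut assigned = Vec::new();
        for (story, agent) in waiting {
            if !self.running.contains_key(&agent) {
                self.running.insert(agent, story);
                self.unassigned.remove(&story);
                assigned.push(story);
            }
        }
        assigned
    }

    /// Completion path: free the agent, then immediately rescan for pending
    /// work. The rescan is the call the story says is missing today.
    pub fn handle_agent_completion(&mut self, agent: &str) -> Vec<u32> {
        self.running.remove(agent);
        self.auto_assign_available_work()
    }
}
```

Replaying the logged scenario against this sketch: starting `qa` on 292 succeeds, 293 and 294 fail and are recorded as unassigned, and each later `handle_agent_completion("qa")` hands the freed agent to the next waiting story. The real handler would also need to process gate results and stage moves before the rescan, which this sketch leaves out.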