diff --git a/.story_kit/work/1_upcoming/118_bug_agent_pool_retains_stale_running_state_after_completion_blocking_auto_assign.md b/.story_kit/work/1_upcoming/118_bug_agent_pool_retains_stale_running_state_after_completion_blocking_auto_assign.md index ccae866..ba39a58 100644 --- a/.story_kit/work/1_upcoming/118_bug_agent_pool_retains_stale_running_state_after_completion_blocking_auto_assign.md +++ b/.story_kit/work/1_upcoming/118_bug_agent_pool_retains_stale_running_state_after_completion_blocking_auto_assign.md @@ -26,6 +26,65 @@ Agent pool reports mergemaster as running on the completed/archived story. Auto- When an agent process exits and the story advances, the agent pool should clear the running state so auto-assign can immediately dispatch the agent to the next queued item. +## Root Cause Analysis + +The bug is in `server/src/agents.rs`, in the `start_agent` method. + +### The Leak + +In `start_agent` (line ~177), a `Pending` entry is inserted into the in-memory `HashMap` at line ~263: + +```rust +{ + let mut agents = self.agents.lock().map_err(|e| e.to_string())?; + agents.insert( + key.clone(), + StoryAgent { + agent_name: resolved_name.clone(), + status: AgentStatus::Pending, + // ... + }, + ); +} +``` + +Then at line ~290, `create_worktree` is called: + +```rust +let wt_info = worktree::create_worktree(project_root, story_id, &config, self.port).await?; +``` + +**If `create_worktree` fails** (e.g. `pnpm run build` error during worktree setup), the function returns `Err` but **never removes the Pending entry** from the HashMap. + +### The Blocking Effect + +`find_free_agent_for_stage` (line ~1418) considers an agent "busy" if any HashMap entry has `Running | Pending` status: + +```rust +let is_busy = agents.values().any(|a| { + a.agent_name == agent_config.name + && matches!(a.status, AgentStatus::Running | AgentStatus::Pending) +}); +``` + +The leaked Pending entry permanently blocks this agent from being auto-assigned until someone manually stops the stale entry via the API. + +### Scope + +This affects **all agent types** (coders, QA, mergemaster) equally — anywhere `start_agent` is called and the subsequent worktree creation or process spawn can fail. Anywhere there's a gate that can fail after the Pending entry is inserted, the leak can happen. + +The code currently enforces gates but doesn't clean up if a gate fails — the Pending entry just stays in the HashMap forever. + +### Fix Strategy + +Add cleanup logic: if any step after the Pending insertion fails, remove the entry from the HashMap before returning the error. A guard/RAII pattern or explicit cleanup in the error path would both work. The key is that `start_agent` must be atomic — either the agent is fully started, or no trace of it remains in the pool. + +Also audit other code paths that insert entries into the agents HashMap to ensure they all have proper cleanup on failure. + ## Acceptance Criteria -- [ ] Bug is fixed and verified +- [ ] `start_agent` cleans up the Pending entry from the HashMap if `create_worktree` or any subsequent step fails +- [ ] No leaked Pending/Running entries remain after a failed agent start +- [ ] Automated test covers the failure case: simulate `create_worktree` failure and verify the agent pool is clean afterward +- [ ] All agent types (coder, QA, mergemaster) benefit from the fix +- [ ] Bug is fixed and verified with `cargo test` and `cargo clippy`