story-kit: create 118_bug_agent_pool_retains_stale_running_state_after_completion_blocking_auto_assign

This commit is contained in:
Dave
2026-02-23 22:38:04 +00:00
parent e694e7f040
commit 4839b64d4d

View File

@@ -26,6 +26,65 @@ Agent pool reports mergemaster as running on the completed/archived story. Auto-
When an agent process exits and the story advances, the agent pool should clear the running state so auto-assign can immediately dispatch the agent to the next queued item.
## Root Cause Analysis
The bug is in `server/src/agents.rs`, in the `start_agent` method.
### The Leak
In `start_agent` (line ~177), a `Pending` entry is inserted into the in-memory `HashMap<String, StoryAgent>` at line ~263:
```rust
{
let mut agents = self.agents.lock().map_err(|e| e.to_string())?;
agents.insert(
key.clone(),
StoryAgent {
agent_name: resolved_name.clone(),
status: AgentStatus::Pending,
// ...
},
);
}
```
Then at line ~290, `create_worktree` is called:
```rust
let wt_info = worktree::create_worktree(project_root, story_id, &config, self.port).await?;
```
**If `create_worktree` fails** (e.g. `pnpm run build` error during worktree setup), the function returns `Err` but **never removes the Pending entry** from the HashMap.
### The Blocking Effect
`find_free_agent_for_stage` (line ~1418) considers an agent "busy" if any HashMap entry has `Running | Pending` status:
```rust
let is_busy = agents.values().any(|a| {
a.agent_name == agent_config.name
&& matches!(a.status, AgentStatus::Running | AgentStatus::Pending)
});
```
The leaked Pending entry permanently blocks this agent from being auto-assigned until someone manually stops the stale entry via the API.
### Scope
This affects **all agent types** (coders, QA, mergemaster) equally — anywhere `start_agent` is called and the subsequent worktree creation or process spawn can fail. Anywhere there's a gate that can fail after the Pending entry is inserted, the leak can happen.
The code currently enforces gates but doesn't clean up if a gate fails — the Pending entry just stays in the HashMap forever.
### Fix Strategy
Add cleanup logic: if any step after the Pending insertion fails, remove the entry from the HashMap before returning the error. A guard/RAII pattern or explicit cleanup in the error path would both work. The key is that `start_agent` must be atomic — either the agent is fully started, or no trace of it remains in the pool.
Also audit other code paths that insert entries into the agents HashMap to ensure they all have proper cleanup on failure.
## Acceptance Criteria
- [ ] Bug is fixed and verified
- [ ] `start_agent` cleans up the Pending entry from the HashMap if `create_worktree` or any subsequent step fails
- [ ] No leaked Pending/Running entries remain after a failed agent start
- [ ] Automated test covers the failure case: simulate `create_worktree` failure and verify the agent pool is clean afterward
- [ ] All agent types (coder, QA, mergemaster) benefit from the fix
- [ ] Bug is fixed and verified with `cargo test` and `cargo clippy`