story-kit: create 118_bug_agent_pool_retains_stale_running_state_after_completion_blocking_auto_assign
This commit is contained in:
@@ -26,6 +26,65 @@ Agent pool reports mergemaster as running on the completed/archived story. Auto-
|
||||
|
||||
When an agent process exits and the story advances, the agent pool should clear the running state so auto-assign can immediately dispatch the agent to the next queued item.
|
||||
|
||||
## Root Cause Analysis
|
||||
|
||||
The bug is in `server/src/agents.rs`, in the `start_agent` method.
|
||||
|
||||
### The Leak
|
||||
|
||||
In `start_agent` (line ~177), a `Pending` entry is inserted into the in-memory `HashMap<String, StoryAgent>` at line ~263:
|
||||
|
||||
```rust
|
||||
{
|
||||
let mut agents = self.agents.lock().map_err(|e| e.to_string())?;
|
||||
agents.insert(
|
||||
key.clone(),
|
||||
StoryAgent {
|
||||
agent_name: resolved_name.clone(),
|
||||
status: AgentStatus::Pending,
|
||||
// ...
|
||||
},
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
Then at line ~290, `create_worktree` is called:
|
||||
|
||||
```rust
|
||||
let wt_info = worktree::create_worktree(project_root, story_id, &config, self.port).await?;
|
||||
```
|
||||
|
||||
**If `create_worktree` fails** (e.g. `pnpm run build` error during worktree setup), the function returns `Err` but **never removes the Pending entry** from the HashMap.
|
||||
|
||||
### The Blocking Effect
|
||||
|
||||
`find_free_agent_for_stage` (line ~1418) considers an agent "busy" if any HashMap entry has `Running | Pending` status:
|
||||
|
||||
```rust
|
||||
let is_busy = agents.values().any(|a| {
|
||||
a.agent_name == agent_config.name
|
||||
&& matches!(a.status, AgentStatus::Running | AgentStatus::Pending)
|
||||
});
|
||||
```
|
||||
|
||||
The leaked Pending entry permanently blocks this agent from being auto-assigned until someone manually stops the stale entry via the API.
|
||||
|
||||
### Scope
|
||||
|
||||
This affects **all agent types** (coders, QA, mergemaster) equally — anywhere `start_agent` is called and the subsequent worktree creation or process spawn can fail. Anywhere there's a gate that can fail after the Pending entry is inserted, the leak can happen.
|
||||
|
||||
The code currently enforces gates but doesn't clean up if a gate fails — the Pending entry just stays in the HashMap forever.
|
||||
|
||||
### Fix Strategy
|
||||
|
||||
Add cleanup logic: if any step after the Pending insertion fails, remove the entry from the HashMap before returning the error. A guard/RAII pattern or explicit cleanup in the error path would both work. The key is that `start_agent` must be atomic — either the agent is fully started, or no trace of it remains in the pool.
|
||||
|
||||
Also audit other code paths that insert entries into the agents HashMap to ensure they all have proper cleanup on failure.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [ ] Bug is fixed and verified
|
||||
- [ ] `start_agent` cleans up the Pending entry from the HashMap if `create_worktree` or any subsequent step fails
|
||||
- [ ] No leaked Pending/Running entries remain after a failed agent start
|
||||
- [ ] Automated test covers the failure case: simulate `create_worktree` failure and verify the agent pool is clean afterward
|
||||
- [ ] All agent types (coder, QA, mergemaster) benefit from the fix
|
||||
- [ ] Bug is fixed and verified with `cargo test` and `cargo clippy`
|
||||
|
||||
Reference in New Issue
Block a user