From 00e5fc2cccbce9b0693cb4ef0dac5d480db22074 Mon Sep 17 00:00:00 2001
From: Dave
Date: Tue, 24 Feb 2026 12:01:49 +0000
Subject: [PATCH] story-kit: create 132_story_fix_toctou_race_in_agent_check_and_insert

---
 ...x_toctou_race_in_agent_check_and_insert.md | 28 +++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/.story_kit/work/1_upcoming/132_story_fix_toctou_race_in_agent_check_and_insert.md b/.story_kit/work/1_upcoming/132_story_fix_toctou_race_in_agent_check_and_insert.md
index 91a3a62..54fa1a3 100644
--- a/.story_kit/work/1_upcoming/132_story_fix_toctou_race_in_agent_check_and_insert.md
+++ b/.story_kit/work/1_upcoming/132_story_fix_toctou_race_in_agent_check_and_insert.md
@@ -15,6 +15,34 @@ As a user running multiple agents, I want the agent pool to correctly enforce si
 - [ ] A test demonstrates that concurrent start_agent calls for the same agent name on different stories result in exactly one running agent and one rejection
 - [ ] A test demonstrates that concurrent auto_assign_available_work calls do not produce duplicate assignments
+## Analysis
+
+### Race 1: start_agent check-then-insert (agents.rs)
+
+The single-instance check at ~lines 262-296 acquires the mutex, checks for duplicate agents, then **releases the lock**. The HashMap insert happens later at ~line 324, after **re-acquiring the lock**. Between release and reacquire, a concurrent call can pass the same check:
+
+```
+Thread A: lock → check coder-1 available? YES → unlock
+Thread B: lock → check coder-1 available? YES → unlock → lock → insert "86:coder-1"
+Thread A: lock → insert "130:coder-1"
+Result: both coder-1 entries exist, two processes spawned
+```
+
+The composite key at ~line 27 is `format!("{story_id}:{agent_name}")`, so `86:coder-1` and `130:coder-1` are different keys. The name-only check at ~lines 277-295 iterates the HashMap looking for a Running/Pending agent with the same name — but both threads read the HashMap before either has inserted, so both pass.
+
+**Fix**: Hold the lock from the check (~line 264) through the insert (~line 324). This means the worktree setup and process spawn (~lines 297-322) must either happen inside the lock (blocking other callers), or the entry must be inserted as `Pending` before the lock is released, with the process spawn happening afterwards.
+
+### Race 2: auto_assign_available_work (agents.rs)
+
+At ~lines 1196-1215, the function locks the mutex, calls `find_free_agent_for_stage` to pick an available agent name, then **releases the lock**. It then calls `start_agent` at ~line 1228, which re-acquires the lock. Two concurrent `auto_assign` calls can therefore both select the same free agent for different stories (or the same story) in this window.
+
+**Fix**: Either hold the lock across the full loop iteration, or restructure so that `start_agent` receives a reservation/guard rather than a bare agent name string.
+
+### Observed symptoms
+
+- Both `coder-1` and `coder-2` showing as "running" on the same story
+- `coder-1` appearing on story 86 immediately after completing on bug 130, because pipeline advancement calls `auto_assign_available_work` concurrently with other state transitions
+
 ## Out of Scope
 
 - TBD
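To make the "insert as `Pending` before releasing the lock" fix for Race 1 concrete, here is a minimal standalone sketch. It is an illustration, not the repo's actual code: the `AgentPool`, `AgentState`, `try_reserve`, and `mark_running` names are hypothetical, and the real `agents.rs` entries carry richer agent records than a bare state enum.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug)]
enum AgentState {
    Pending, // reserved under the lock; process not yet spawned
    Running,
}

#[derive(Default)]
struct AgentPool {
    // Keyed by the composite "{story_id}:{agent_name}", as in the analysis.
    agents: Mutex<HashMap<String, AgentState>>,
}

impl AgentPool {
    /// Check for a same-name agent and insert a `Pending` reservation in one
    /// critical section. Because the lock is held across both steps, a second
    /// caller cannot pass the check before the first caller's insert lands.
    fn try_reserve(&self, story_id: u32, agent_name: &str) -> Result<String, String> {
        let mut agents = self.agents.lock().unwrap();
        // The real check also filters on Running/Pending state; in this sketch
        // every entry counts as live, so matching the key's name part suffices.
        let name_in_use = agents
            .keys()
            .any(|key| key.split_once(':').map_or(false, |(_, name)| name == agent_name));
        if name_in_use {
            return Err(format!("agent {agent_name} already has an assignment"));
        }
        let key = format!("{story_id}:{agent_name}");
        agents.insert(key.clone(), AgentState::Pending);
        Ok(key)
        // Worktree setup and process spawn happen after this returns, outside
        // the lock; the caller then flips the entry to Running, or removes the
        // reservation if the spawn fails.
    }

    fn mark_running(&self, key: &str) {
        if let Some(state) = self.agents.lock().unwrap().get_mut(key) {
            *state = AgentState::Running;
        }
    }
}
```

Under this scheme the interleaving in the diagram cannot occur: two concurrent calls for `coder-1` on stories 86 and 130 serialize on the mutex, and whichever acquires it second sees the other's `Pending` entry and is rejected.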
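The same pattern closes Race 2: pick the free agent and record the assignment inside a single lock acquisition, so the selection can never go stale between the check and the reservation. A minimal sketch under the same caveat (the `Scheduler` and `assign_free_agent` names are hypothetical, and the real loop also matches agents to pipeline stages):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct Scheduler {
    roster: Vec<String>,                      // all configured agent names
    assignments: Mutex<HashMap<String, u32>>, // agent name -> story id
}

impl Scheduler {
    /// Select a free agent and record the assignment without dropping the
    /// lock in between. Concurrent callers serialize here, so two stories
    /// can never both be handed the same agent.
    fn assign_free_agent(&self, story_id: u32) -> Option<String> {
        let mut assignments = self.assignments.lock().unwrap();
        let free = self
            .roster
            .iter()
            .find(|name| !assignments.contains_key(*name))?
            .clone();
        assignments.insert(free.clone(), story_id);
        Some(free) // the caller hands this reservation on to start_agent
    }
}
```

This is the "reservation/guard rather than a bare agent name" shape: by the time `assign_free_agent` returns, the agent is already recorded as taken, so the later `start_agent` call cannot observe a conflicting claim.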