fix: prune session_store on stdio abort, respawn cold

The bug 882 abort-respawn safeguard caps consecutive crashes at 5 and then blocks the story, but it leaves the underlying stdio abort unfixed. Each respawn calls start_agent, which reads session_store.json, finds the prior session id, passes --resume to claude-code, and re-triggers the same crash. Five identical respawns later, the story is blocked.

Now, when an abort+no-session exit triggers a respawn, we first call session_store::remove_sessions_for_story to drop every entry for the story. The next spawn starts cold (no --resume), which avoids the bloated stdio replay that claude-code is choking on.

remove_sessions_for_story was already implemented but gated behind #[cfg(test)]; it is now a regular (non-test) pub fn. The existing remove_sessions_for_story_cleans_up test is unchanged and still green.

Net effect: instead of "5 retries, then blocked", we get "1 abort, prune, respawn cold, agent runs normally". The story can resume work without losing its worktree state.
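
For illustration only, a minimal sketch of the prune-then-cold-spawn idea. It uses an in-memory map as a stand-in for session_store.json. The names SessionStore, prune_story and resume_args are hypothetical. The "{story_id}:{agent}" key layout and the session id being passed as the argument to --resume are assumptions; only the "{story_id}:" prefix is visible in the diff.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the contents of session_store.json.
/// Assumed layout: key = "{story_id}:{agent}", value = session id.
type SessionStore = HashMap<String, String>;

/// Drop every entry whose key starts with "{story_id}:".
/// Returns the number of entries removed.
fn prune_story(store: &mut SessionStore, story_id: &str) -> usize {
    // The trailing ':' keeps story "882_story" from matching "882_story_other".
    let prefix = format!("{story_id}:");
    let before = store.len();
    store.retain(|key, _| !key.starts_with(&prefix));
    before - store.len()
}

/// Extra CLI arguments for the next spawn (assumed shape):
/// `--resume <id>` when a session is recorded, nothing for a cold start.
fn resume_args(store: &SessionStore, story_id: &str, agent: &str) -> Vec<String> {
    let key = format!("{story_id}:{agent}");
    match store.get(&key) {
        Some(id) => vec!["--resume".to_string(), id.clone()],
        None => Vec::new(),
    }
}
```

Calling prune_story before the respawn means resume_args finds no entry for the story, so the spawn carries no --resume flag and starts cold. Entries for other stories are left alone.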
2026-04-30 18:19:01 +00:00
parent a8eac3c278
commit 66f340a7a3
2 changed files with 16 additions and 3 deletions
@@ -73,8 +73,12 @@ pub fn lookup_session(
    read_store(project_root).get(&key).cloned()
 }

-/// Remove all session entries for a story (called when a story reaches done/archived).
-#[cfg(test)]
+/// Remove all session entries for a story.
+///
+/// Called when the story reaches done/archived, OR when claude-code keeps
+/// crashing on session resume — in the latter case the next spawn must start
+/// cold (no `--resume` flag) so the bloated stdio replay doesn't re-trigger
+/// the same abort. See bug 882 follow-up.
 pub fn remove_sessions_for_story(project_root: &Path, story_id: &str) {
    let mut data = read_store(project_root);
    let prefix = format!("{story_id}:");