From f10edd6718289b214535de26333c254a246aabcf Mon Sep 17 00:00:00 2001
From: dave
Date: Tue, 31 Mar 2026 11:17:47 +0000
Subject: [PATCH] storkit: create 452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart

---
 ..._with_fatal_runtime_error_on_agent_restart.md | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/.storkit/work/1_backlog/452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart.md b/.storkit/work/1_backlog/452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart.md
index 55de68a6..f143d7e3 100644
--- a/.storkit/work/1_backlog/452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart.md
+++ b/.storkit/work/1_backlog/452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart.md
@@ -6,21 +6,27 @@ name: "Claude Code PTY crashes with fatal runtime error on agent restart"
 
 ## Description
 
-When an agent is restarted on a worktree where coding is already complete (clean worktree, all changes committed), the Claude Code PTY process crashes immediately on startup with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns, with `Session: None`. This burns through retries and blocks the story. The crash comes from inside the Claude Code binary itself, not storkit code. Observed on both story 449 and 450 worktrees.
+When agent processes (Claude Code PTY sessions) exit, storkit does not reap the child processes, leaving them as zombies (`[claude] <defunct>`). These accumulate over time — observed 101 zombie processes in one session. When the zombie count gets high enough, new PTY allocations fail and Claude Code crashes immediately on startup with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns (Session: None), burns through retries, and blocks the story.
+
+Root cause: storkit spawns Claude Code via `portable-pty` but does not call `waitpid` or equivalent to reap child processes after they exit. The `child.wait()` may not be called in all code paths (e.g. when the agent is killed during a rebuild or when gates fail and the agent is restarted).
+
+Observed in `agents/pty.rs` and `llm/providers/claude_code.rs` — both use the same pattern of `pair.slave.spawn_command()` followed by `drop(pair.master)` but may not properly wait on the child in all exit paths.
 
 ## How to Reproduce
 
-Have a story in current stage with a worktree that has all coding done and committed (clean `git status`). Let the gates fail for an unrelated reason (e.g. a test failure). The pipeline restarts the agent on the same worktree. The Claude Code process crashes immediately with the fatal runtime error.
+Run several agent sessions (or trigger gate failures that cause restarts). After enough sessions, zombie `[claude] <defunct>` processes accumulate. Check with `ps aux | grep '<defunct>'`. Once enough zombies exist, new agent spawns crash immediately with the fatal runtime error.
 
 ## Actual Result
 
-`fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting` — process exits instantly, session is None, burns retries.
+`fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting` — process exits instantly, session is None, burns retries. `ps aux` shows many `[claude] <defunct>` zombie processes.
 
 ## Expected Result
 
-Claude Code should start successfully on the worktree and the agent should be able to fix any gate failures.
+All child processes should be properly reaped after exit. No zombie accumulation. New agent spawns should always succeed regardless of how many previous sessions have run.
 
 ## Acceptance Criteria
 
-- [ ] Agent restart on a worktree with committed work does not crash the PTY
+- [ ] `child.wait()` is called in all exit paths in `agents/pty.rs` and `llm/providers/claude_code.rs`
+- [ ] No zombie `[claude] <defunct>` processes accumulate during normal operation
+- [ ] Agent restart after gate failure does not crash the PTY
 - [ ] If Claude Code fails to start, the error is handled gracefully without burning retries
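
Reviewer note: one way to satisfy the "`child.wait()` is called in all exit paths" criterion is to tie reaping to a guard's `Drop`, so early returns, panics, and kill-on-rebuild paths all reap the child. The sketch below is illustrative only. It uses `std::process` rather than `portable-pty`, and the `ReapOnDrop` type is a hypothetical name, not something taken from storkit. The same pattern may carry over to the child handle returned by `spawn_command()`, assuming that handle offers equivalent kill and wait operations.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};

/// Hypothetical guard (not storkit code): owns a child process and
/// guarantees it is reaped on every exit path, so it cannot linger
/// as a zombie.
pub struct ReapOnDrop {
    child: Option<Child>,
}

impl ReapOnDrop {
    /// Spawn the command and wrap the resulting child in the guard.
    pub fn spawn(cmd: &mut Command) -> io::Result<Self> {
        Ok(Self { child: Some(cmd.spawn()?) })
    }

    /// Normal path: block until the child exits and return its status.
    /// Taking the child out of the guard makes the later `Drop` a no-op.
    pub fn wait(mut self) -> io::Result<ExitStatus> {
        let mut child = self.child.take().expect("child is present until wait or drop");
        child.wait()
    }
}

impl Drop for ReapOnDrop {
    fn drop(&mut self) {
        // Abnormal paths (early return, panic, agent killed during a
        // rebuild): stop the child if it is still running, then wait
        // on it so the kernel can release its process-table entry.
        if let Some(mut child) = self.child.take() {
            let _ = child.kill(); // result ignored: the child may already have exited
            let _ = child.wait(); // reaps the child
        }
    }
}
```

With a guard like this, a call site that returns early (for example when a gate fails) still reaps the process when the guard goes out of scope. A call site that needs the exit status calls `wait()` explicitly.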