From d5d82bdb0091acc40dc780e97129947209353b18 Mon Sep 17 00:00:00 2001
From: dave
Date: Tue, 31 Mar 2026 11:21:45 +0000
Subject: [PATCH] storkit: create 452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart

---
 ..._crashes_with_fatal_runtime_error_on_agent_restart.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/.storkit/work/1_backlog/452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart.md b/.storkit/work/1_backlog/452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart.md
index f143d7e3..99ab7aa4 100644
--- a/.storkit/work/1_backlog/452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart.md
+++ b/.storkit/work/1_backlog/452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart.md
@@ -8,9 +8,14 @@ name: "Claude Code PTY crashes with fatal runtime error on agent restart"

 When agent processes (Claude Code PTY sessions) exit, storkit does not reap the child processes, leaving them as zombies (`[claude] <defunct>`). These accumulate over time; 101 zombie processes were observed in one session. When the zombie count gets high enough, new PTY allocations fail and Claude Code crashes immediately on startup with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns (Session: None), burns through retries, and blocks the story.

-Root cause: storkit spawns Claude Code via `portable-pty` but does not call `waitpid` or equivalent to reap child processes after they exit. The `child.wait()` may not be called in all code paths (e.g. when the agent is killed during a rebuild or when gates fail and the agent is restarted).
+Root cause: storkit runs as PID 1 inside Docker. As PID 1, it is responsible for reaping all orphaned child processes. The zombies are mostly grandchildren (`esbuild`, `echo`, `sh`, `cargo`) spawned by `npm run build`, `cargo test`, etc. during worktree setup and gate checks. When these grandchild processes exit, they get reparented to PID 1 (storkit), which never calls `waitpid(-1, WNOHANG)` to reap them.

-Observed in `agents/pty.rs` and `llm/providers/claude_code.rs` — both use the same pattern of `pair.slave.spawn_command()` followed by `drop(pair.master)` but may not properly wait on the child in all exit paths.
+Additionally, `llm/providers/claude_code.rs` has two code paths that call `child.kill()` without a following `child.wait()` (the cancel path at line 286 and the timeout kill at line 367), leaving direct-child zombies. `agents/pty.rs` correctly calls `child.wait()` after `child.kill()`.
+
+The fix is one or more of:
+1. Add a background reaper thread/task in storkit that periodically calls `waitpid(-1, WNOHANG)` to clean up orphaned processes
+2. Use `tini` or `dumb-init` as the Docker entrypoint to handle PID 1 reaping responsibilities
+3. Add missing `child.wait()` calls after `child.kill()` in `claude_code.rs`

 ## How to Reproduce