storkit: create 452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart
+5
-5
@@ -8,14 +8,14 @@ name: "Claude Code PTY crashes with fatal runtime error on agent restart"
When agent processes (Claude Code PTY sessions) exit, storkit does not reap the child processes, leaving them as zombies (`[claude] <defunct>`). These accumulate over time — observed 101 zombie processes in one session. When the zombie count gets high enough, new PTY allocations fail and Claude Code crashes immediately on startup with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns (Session: None), burns through retries, and blocks the story.
-Root cause: storkit runs as PID 1 inside Docker. As PID 1, it is responsible for reaping all orphaned child processes. The zombies are mostly grandchildren (`esbuild`, `echo`, `sh`, `cargo`) spawned by `npm run build`, `cargo test`, etc. during worktree setup and gate checks. When these grandchild processes exit, they get reparented to PID 1 (storkit), which never calls `waitpid(-1, WNOHANG)` to reap them.
+Root cause: storkit does not reap orphaned grandchild processes. The zombies are mostly grandchildren (`esbuild`, `echo`, `sh`, `cargo`) spawned by `npm run build`, `cargo test`, etc. during worktree setup and gate checks. This happens both natively (observed 27 zombies on macOS host) and in Docker containers. When the intermediate parent exits, these grandchildren get reparented to storkit (or PID 1 in Docker) and become zombies because nobody calls `waitpid` for them.
Additionally, `llm/providers/claude_code.rs` has two code paths that call `child.kill()` without a following `child.wait()` (the cancel path at line 286 and the timeout kill at line 367), leaving direct-child zombies. `agents/pty.rs` correctly calls `child.wait()` after `child.kill()`.
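The kill-then-wait pattern described above can be sketched as follows. This is a minimal illustration, not the code in `claude_code.rs` or `agents/pty.rs`; the helper name `kill_and_reap` is hypothetical.

```rust
use std::io;
use std::process::{Child, ExitStatus};

/// Hypothetical helper: kill a direct child and then reap it, so that it
/// does not linger as a `<defunct>` entry in the process table.
fn kill_and_reap(child: &mut Child) -> io::Result<ExitStatus> {
    // kill() only delivers the signal; it does not collect the exit status.
    // The result is ignored because the child may already have exited.
    let _ = child.kill();
    // wait() is the call that actually reaps the process.
    child.wait()
}
```

Routing both the cancel path and the timeout path through one helper like this would make it harder to reintroduce a `kill()` without its matching `wait()`.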
-The fix is one or both of:
-1. Add a background reaper thread/task in storkit that periodically calls `waitpid(-1, WNOHANG)` to clean up orphaned processes
-2. Use `tini` or `dumb-init` as the Docker entrypoint to handle PID 1 reaping responsibilities
-3. Add missing `child.wait()` calls after `child.kill()` in `claude_code.rs`
+The fix should include:
+1. Add a background reaper thread/task in storkit that periodically calls `waitpid(-1, WNOHANG)` to clean up orphaned processes (handles both native and Docker)
+2. Add missing `child.wait()` calls after `child.kill()` in `claude_code.rs`
+3. Optionally use `tini` or `dumb-init` as the Docker entrypoint as a belt-and-suspenders measure
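The background reaper from the list above might look roughly like the sketch below. It is an illustration under stated assumptions, not storkit's implementation: the names `reap_zombies` and `spawn_reaper` are hypothetical, `waitpid` is declared by hand only to keep the example dependency-free (a real project would more likely use the `libc` or `nix` crate), and the hard-coded `WNOHANG` value of 1 is assumed to apply to Linux and macOS only.

```rust
use std::os::raw::c_int;
use std::thread;
use std::time::Duration;

// Assumption: hand-written binding to the POSIX waitpid(2) call.
extern "C" {
    fn waitpid(pid: c_int, status: *mut c_int, options: c_int) -> c_int;
}

// Assumption: WNOHANG is 1 on Linux and macOS.
const WNOHANG: c_int = 1;

/// Hypothetical helper: reap every child that has already exited, without
/// blocking. Returns how many processes were reaped in this pass.
fn reap_zombies() -> usize {
    let mut reaped = 0;
    loop {
        let mut status: c_int = 0;
        // SAFETY: `status` is a valid, writable pointer for the whole call.
        let pid = unsafe { waitpid(-1, &mut status, WNOHANG) };
        // 0 means children exist but none has exited yet;
        // a negative value means an error such as "no children at all".
        if pid <= 0 {
            break;
        }
        reaped += 1;
    }
    reaped
}

/// Hypothetical helper: run `reap_zombies` on a fixed interval in a
/// background thread.
fn spawn_reaper(interval: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        reap_zombies();
        thread::sleep(interval);
    })
}
```

One caveat with this design: `waitpid(-1, ...)` reaps any child of the process, including direct children that other code still intends to `wait()` on, so those later `child.wait()` calls may fail or lose the exit status. A real implementation would probably need to coordinate the reaper with the code that owns each `Child`.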
## How to Reproduce