storkit: create 452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart
+7
-2
@@ -8,9 +8,14 @@ name: "Claude Code PTY crashes with fatal runtime error on agent restart"
When agent processes (Claude Code PTY sessions) exit, storkit does not reap the child processes, leaving them as zombies (`[claude] <defunct>`). These accumulate over time; 101 zombie processes were observed in one session. Once the zombie count is high enough, new PTY allocations fail and Claude Code crashes immediately on startup with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns (Session: None), burns through its retries, and blocks the story.
Root cause: storkit spawns Claude Code via `portable-pty` but does not call `waitpid` or an equivalent to reap child processes after they exit. `child.wait()` may not be called in all code paths (e.g. when the agent is killed during a rebuild, or when gates fail and the agent is restarted).
Root cause: storkit runs as PID 1 inside Docker. As PID 1, it is responsible for reaping all orphaned child processes. The zombies are mostly grandchildren (`esbuild`, `echo`, `sh`, `cargo`) spawned by `npm run build`, `cargo test`, etc. during worktree setup and gate checks. When these grandchild processes exit, they get reparented to PID 1 (storkit), which never calls `waitpid(-1, WNOHANG)` to reap them.
Observed in `agents/pty.rs` and `llm/providers/claude_code.rs` — both use the same pattern of `pair.slave.spawn_command()` followed by `drop(pair.master)` but may not properly wait on the child in all exit paths.
Additionally, `llm/providers/claude_code.rs` has two code paths that call `child.kill()` without a following `child.wait()` (the cancel path at line 286 and the timeout kill at line 367), leaving direct-child zombies. `agents/pty.rs` correctly calls `child.wait()` after `child.kill()`.
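A minimal sketch of the kill-then-wait pattern described above. It uses `std::process::Child` as a stand-in so the example is self-contained; the helper name `kill_and_reap` is hypothetical and is not taken from `claude_code.rs`. It assumes the real child handle offers comparable `kill()` and `wait()` calls.

```rust
use std::io;
use std::process::{Child, ExitStatus};

/// Hypothetical helper: terminate a child and then reap it, so that it
/// does not linger as a `<defunct>` entry in the process table.
fn kill_and_reap(child: &mut Child) -> io::Result<ExitStatus> {
    // The child may already have exited on its own; a failed kill is not
    // fatal here, because the wait below is what actually reaps it.
    let _ = child.kill();
    // Blocks until the kernel hands over the exit status, which removes
    // the zombie entry for this direct child.
    child.wait()
}
```

Routing both the cancel path and the timeout path through one helper like this would make it harder to add a future `kill()` call that forgets the matching `wait()`.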
The fix is one or more of:
1. Add a background reaper thread/task in storkit that periodically calls `waitpid(-1, WNOHANG)` to clean up orphaned processes
2. Use `tini` or `dumb-init` as the Docker entrypoint to handle PID 1 reaping responsibilities
3. Add missing `child.wait()` calls after `child.kill()` in `claude_code.rs`
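Option 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not storkit's implementation: the names `reap_exited_children` and `spawn_reaper` are hypothetical, `waitpid` is declared by hand to avoid an external crate, and the constant `WNOHANG = 1` is assumed to match the target (it does on Linux and macOS).

```rust
use std::os::raw::c_int;
use std::thread;
use std::time::Duration;

// Assumption: Linux or macOS, where pid_t is a 32-bit int and WNOHANG is 1.
const WNOHANG: c_int = 1;

unsafe extern "C" {
    // POSIX waitpid(2), provided by the C library that std already links.
    fn waitpid(pid: c_int, status: *mut c_int, options: c_int) -> c_int;
}

/// Reap every child that has already exited, without blocking.
/// Returns the number of processes reaped in this pass.
fn reap_exited_children() -> usize {
    let mut reaped = 0;
    loop {
        let mut status: c_int = 0;
        // SAFETY: `status` is a valid, writable c_int for the duration of the call.
        let pid = unsafe { waitpid(-1, &mut status, WNOHANG) };
        if pid <= 0 {
            // 0: children exist but none have exited yet.
            // -1: no children at all (ECHILD) or another error.
            break;
        }
        reaped += 1;
    }
    reaped
}

/// Hypothetical background reaper: runs one reaping pass per interval.
fn spawn_reaper(interval: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        reap_exited_children();
        thread::sleep(interval);
    })
}
```

One caveat with this design: `waitpid(-1, ...)` also collects direct children, so a reaper like this can consume the exit status of a child that other code later tries to `wait()` on, and that later wait may then fail. Using an init process as in option 2 avoids this interaction, because storkit would no longer be PID 1.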
## How to Reproduce