storkit: create 452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart
+15
-5
@@ -10,12 +10,22 @@ Storkit accumulates zombie processes over time from unrereaped child and grandch
 
 Root cause: storkit does not reap orphaned grandchild processes. The zombies are mostly grandchildren (`esbuild`, `echo`, `sh`, `cargo`) spawned by `npm run build`, `cargo test`, etc. during worktree setup and gate checks. This happens both natively (observed 27 zombies on macOS host) and in Docker containers. When the intermediate parent exits, these grandchildren get reparented to storkit (or PID 1 in Docker) and become zombies because nobody calls `waitpid` for them.
 
-Additionally, `llm/providers/claude_code.rs` has two code paths that call `child.kill()` without a following `child.wait()` (the cancel path at line 286 and the timeout kill at line 367), leaving direct-child zombies. `agents/pty.rs` correctly calls `child.wait()` after `child.kill()`.
+**Already fixed:**
+- `docker-compose.yml` now has `init: true` which uses tini as PID 1 in Docker; this handles zombie reaping inside containers
+- `llm/providers/claude_code.rs` now has `child.wait()` after `child.kill()` in all code paths, and the reader thread is joined before returning
+- `agents/pty.rs` reader thread is now joined before returning
 
-The fix should include:
-1. Add a background reaper thread/task in storkit that periodically calls `waitpid(-1, WNOHANG)` to clean up orphaned processes (handles both native and Docker)
-2. Add missing `child.wait()` calls after `child.kill()` in `claude_code.rs`
-3. Optionally use `tini` or `dumb-init` as the Docker entrypoint as a belt-and-suspenders measure
+**Remaining:** Storkit running natively (e.g. on macOS) still accumulates zombie grandchildren because there is no tini. The fix is to add a background reaper thread that periodically calls `waitpid(-1, WNOHANG)` in a loop to clean up any orphaned children. This should be spawned early in `main()` on Unix platforms. Example:
+
+```rust
+#[cfg(unix)]
+std::thread::spawn(|| {
+    loop {
+        unsafe { while libc::waitpid(-1, std::ptr::null_mut(), libc::WNOHANG) > 0 {} }
+        std::thread::sleep(std::time::Duration::from_secs(5));
+    }
+});
+```
 
 ## How to Reproduce
 
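For reference, the `child.kill()` followed by `child.wait()` pattern that the story describes for direct children can be sketched with the standard library alone. This is an illustration under stated assumptions, not storkit's actual code: the helper name `kill_and_reap` is hypothetical, and `sleep 30` merely stands in for a long-running agent process on a Unix host.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};

/// Hypothetical helper: kill a direct child and then wait on it.
/// Without the `wait`, the killed child stays in the process table
/// as a zombie until the parent collects its exit status.
fn kill_and_reap(child: &mut Child) -> io::Result<ExitStatus> {
    // Ask the OS to terminate the child (SIGKILL on Unix).
    child.kill()?;
    // Collect the exit status, which releases the process-table entry.
    child.wait()
}

fn main() -> io::Result<()> {
    // Stand-in for a long-running agent process (assumes `sleep` is on PATH).
    let mut child = Command::new("sleep").arg("30").spawn()?;
    let status = kill_and_reap(&mut child)?;
    println!("child exited successfully: {}", status.success());
    Ok(())
}
```

One design consideration when combining this with a blanket `waitpid(-1, WNOHANG)` reaper: the reaper may collect a direct child's status before `child.wait()` runs, in which case `wait` can return an error, so callers may want to treat that case as already reaped rather than as a failure.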