storkit: done 452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart

dave
2026-04-02 10:31:08 +00:00
parent c6020b7f43
commit f06111f045
@@ -1,48 +0,0 @@
---
name: "Zombie process accumulation from unreaped child processes"
---
# Bug 452: Zombie process accumulation from unreaped child processes
## Description
Storkit accumulates zombie processes over time from unreaped child and grandchild processes. Observed: 101 zombies in a Docker container and 27 on a macOS host. Breakdown: 51 esbuild, 36 echo, 5 claude, 5 sh, 2 bash, 1 cargo.
Root cause: storkit does not reap orphaned grandchild processes. Most of the zombies are grandchildren (`esbuild`, `echo`, `sh`, `cargo`) spawned by `npm run build`, `cargo test`, etc. during worktree setup and gate checks. This happens both natively (27 zombies observed on a macOS host) and in Docker containers. When the intermediate parent exits, these grandchildren are reparented to storkit (or to PID 1 in Docker) and, once they exit, stay zombies because nothing calls `waitpid` for them.
**Already fixed:**
- `docker-compose.yml` now sets `init: true`, which runs tini as PID 1 in Docker and handles zombie reaping inside containers
- `llm/providers/claude_code.rs` now has `child.wait()` after `child.kill()` in all code paths, and the reader thread is joined before returning
- `agents/pty.rs` reader thread is now joined before returning
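
The kill-then-wait-then-join pattern described in the fixes above can be sketched as follows. This is a minimal illustration using only `std`, not the code from `claude_code.rs` or `pty.rs`; the helper names `kill_and_reap` and `spawn_sleeper` and the `sleep 30` placeholder child are assumptions made for the example.

```rust
use std::io::Read;
use std::process::{Child, Command, ExitStatus, Stdio};
use std::thread;

/// Kills `child`, reaps it, and joins the thread draining its stdout.
/// Hypothetical helper for illustration; not taken from storkit's source.
fn kill_and_reap(mut child: Child) -> std::io::Result<ExitStatus> {
    // Drain stdout on a separate thread, as a pipe or PTY reader would.
    let stdout = child.stdout.take();
    let reader = thread::spawn(move || {
        let mut buf = Vec::new();
        if let Some(mut out) = stdout {
            let _ = out.read_to_end(&mut buf);
        }
        buf.len()
    });

    // kill() only delivers the signal; it does not reap the process.
    // An error here (for example, the child already exited) is ignored.
    let _ = child.kill();
    // wait() collects the exit status, which lets the kernel release the
    // process table entry. Skipping it leaves a zombie behind.
    let status = child.wait()?;
    // The reader sees EOF once the child's end of the pipe is closed.
    let _ = reader.join();
    Ok(status)
}

/// Spawns a long-running placeholder child with piped stdout.
fn spawn_sleeper() -> std::io::Result<Child> {
    Command::new("sleep").arg("30").stdout(Stdio::piped()).spawn()
}
```

Calling `kill_and_reap(spawn_sleeper()?)` returns an `ExitStatus` that reports failure, because the child was terminated by a signal rather than exiting normally.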
**Remaining:** Storkit running natively (e.g. on macOS) still accumulates zombie grandchildren because there is no tini. The fix is to add a background reaper thread that periodically calls `waitpid(-1, WNOHANG)` in a loop to clean up any orphaned children. This should be spawned early in `main()` on Unix platforms. Example:
```rust
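// Requires the `libc` crate. Spawn once, early in `main()`.
// Note: this also reaps children owned by `std::process::Child`, so a later
// `wait()` on such a child may fail with ECHILD.
#[cfg(unix)]
std::thread::spawn(|| loop {
    // Reap every child that has already exited. With WNOHANG, waitpid returns
    // 0 (children still running) or -1 (no children) instead of blocking.
    // SAFETY: a null status pointer is allowed; the exit status is discarded.
    while unsafe { libc::waitpid(-1, std::ptr::null_mut(), libc::WNOHANG) } > 0 {}
    std::thread::sleep(std::time::Duration::from_secs(5));
});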
```
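
For testing, the reaper above could be split into a function that performs one reaping pass and a function that runs it on a timer. The sketch below is one possible shape, not a settled design. To stay free of external crates it declares `waitpid` by hand and hard-codes `WNOHANG` as `1`, which holds on Linux and macOS but is an assumption elsewhere; real code would more likely use `libc::waitpid` and `libc::WNOHANG` as in the snippet above. The names `reap_exited_children` and `spawn_reaper` and the `interval` parameter are hypothetical.

```rust
use std::thread;
use std::time::Duration;

// Declared by hand so the sketch builds without the `libc` crate; std already
// links the platform C library on Unix. `pid_t` and `int` are i32 on Linux
// and macOS. `unsafe extern` needs Rust 1.82 or newer; on older toolchains
// drop the leading `unsafe` keyword.
unsafe extern "C" {
    fn waitpid(pid: i32, status: *mut i32, options: i32) -> i32;
}

/// Value of WNOHANG on Linux and macOS (assumption: other Unixes may differ).
const WNOHANG: i32 = 1;

/// Reaps every already-exited child without blocking.
/// Returns the number of children reaped in this pass.
fn reap_exited_children() -> usize {
    let mut reaped = 0;
    // waitpid returns a PID (> 0) for each reaped child, 0 if children exist
    // but none has exited yet, and -1 if the process has no children.
    // SAFETY: a null status pointer is allowed; the exit status is discarded.
    while unsafe { waitpid(-1, std::ptr::null_mut(), WNOHANG) } > 0 {
        reaped += 1;
    }
    reaped
}

/// Spawns the background reaper thread. `interval` is a hypothetical knob;
/// the bug report proposes a fixed 5 seconds.
fn spawn_reaper(interval: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        reap_exited_children();
        thread::sleep(interval);
    })
}
```

One design consideration: `waitpid(-1, ...)` reaps any child of the process, including ones still owned by a `std::process::Child`. If the reaper wins that race, a later `child.wait()` can fail with `ECHILD`, so code paths that need an exit status may have to tolerate that error.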
## How to Reproduce
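Run several agent sessions, then count zombies by command name with `ps -eo stat,comm | awk '$1 ~ /^Z/ {print $2}' | sort | uniq -c | sort -rn`. Matching on the STAT column avoids counting processes that merely have a `Z` in their command name.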
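
The same check could be automated, for example in a regression test. The sketch below is one way to do it under stated assumptions: it parses `ps -eo stat,comm` output and counts rows whose STAT column starts with `Z`, grouped by the first word of the command. The function names `count_zombies` and `zombie_report` are hypothetical and not part of storkit.

```rust
use std::collections::BTreeMap;
use std::process::Command;

/// Counts zombie processes per command name from `ps -eo stat,comm` output.
/// A row counts as a zombie when its STAT column starts with 'Z'.
fn count_zombies(ps_output: &str) -> BTreeMap<String, usize> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for line in ps_output.lines() {
        let mut fields = line.split_whitespace();
        // The header row ("STAT COMMAND") is skipped naturally because
        // "STAT" does not start with 'Z'.
        if let (Some(stat), Some(comm)) = (fields.next(), fields.next()) {
            if stat.starts_with('Z') {
                *counts.entry(comm.to_string()).or_insert(0) += 1;
            }
        }
    }
    counts
}

/// Runs `ps` and returns the current zombie counts for the whole system.
fn zombie_report() -> std::io::Result<BTreeMap<String, usize>> {
    let out = Command::new("ps").args(&["-eo", "stat,comm"]).output()?;
    Ok(count_zombies(&String::from_utf8_lossy(&out.stdout)))
}
```

A test could call `zombie_report()` before and after a batch of agent sessions and assert that the counts do not grow.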
## Actual Result
Zombie processes accumulate continuously and are never reaped.
## Expected Result
No zombie accumulation during normal operation.
## Acceptance Criteria
- [x] `child.wait()` is called after `child.kill()` in all code paths in `claude_code.rs`
- [x] Reader threads are joined in both `pty.rs` and `claude_code.rs`
- [x] `init: true` added to docker-compose.yml for Docker deployments
- [ ] Background reaper thread added for native (non-Docker) deployments
- [ ] Verified with `ps aux | grep '<defunct>'` after running multiple agent sessions natively on macOS