storkit: create 452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart

dave
2026-03-31 11:17:47 +00:00
parent 3f6cd55833
commit f10edd6718
@@ -6,21 +6,27 @@ name: "Claude Code PTY crashes with fatal runtime error on agent restart"
 ## Description
-When an agent is restarted on a worktree where coding is already complete (clean worktree, all changes committed), the Claude Code PTY process crashes immediately on startup with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns, with `Session: None`. This burns through retries and blocks the story. The crash comes from inside the Claude Code binary itself, not storkit code. Observed on both story 449 and 450 worktrees.
+When agent processes (Claude Code PTY sessions) exit, storkit does not reap the child processes, leaving them as zombies (`[claude] <defunct>`). These accumulate over time — observed 101 zombie processes in one session. When the zombie count gets high enough, new PTY allocations fail and Claude Code crashes immediately on startup with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns (Session: None), burns through retries, and blocks the story.
+Root cause: storkit spawns Claude Code via `portable-pty` but does not call `waitpid` or equivalent to reap child processes after they exit. The `child.wait()` may not be called in all code paths (e.g. when the agent is killed during a rebuild or when gates fail and the agent is restarted).
+Observed in `agents/pty.rs` and `llm/providers/claude_code.rs` — both use the same pattern of `pair.slave.spawn_command()` followed by `drop(pair.master)` but may not properly wait on the child in all exit paths.
 ## How to Reproduce
-Have a story in current stage with a worktree that has all coding done and committed (clean `git status`). Let the gates fail for an unrelated reason (e.g. a test failure). The pipeline restarts the agent on the same worktree. The Claude Code process crashes immediately with the fatal runtime error.
+Run several agent sessions (or trigger gate failures that cause restarts). After enough sessions, zombie `[claude] <defunct>` processes accumulate. Check with `ps aux | grep '<defunct>'`. Once enough zombies exist, new agent spawns crash immediately with the fatal runtime error.
 ## Actual Result
-`fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting` — process exits instantly, session is None, burns retries.
+`fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting` — process exits instantly, session is None, burns retries. `ps aux` shows many `[claude] <defunct>` zombie processes.
 ## Expected Result
-Claude Code should start successfully on the worktree and the agent should be able to fix any gate failures.
+All child processes should be properly reaped after exit. No zombie accumulation. New agent spawns should always succeed regardless of how many previous sessions have run.
 ## Acceptance Criteria
-- [ ] Agent restart on a worktree with committed work does not crash the PTY
+- [ ] `child.wait()` is called in all exit paths in `agents/pty.rs` and `llm/providers/claude_code.rs`
+- [ ] No zombie `[claude] <defunct>` processes accumulate during normal operation
+- [ ] Agent restart after gate failure does not crash the PTY
 - [ ] If Claude Code fails to start, the error is handled gracefully without burning retries
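
One possible way to meet the "`child.wait()` is called in all exit paths" criterion is to tie reaping to `Drop`, so that early returns, kills during a rebuild, and gate-failure restarts all pass through the same cleanup. The sketch below is not storkit code: `ReapOnDrop` is a hypothetical name, and it uses `std::process::Child` as a stand-in so the example compiles on its own. The assumption is that the child handle returned by `spawn_command()` in `portable-pty` can be wrapped the same way, which has not been verified against `agents/pty.rs` or `llm/providers/claude_code.rs`.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};

/// Hypothetical guard (not part of storkit): owns a child process and
/// guarantees it is reaped, even on exit paths that forget to wait.
pub struct ReapOnDrop {
    child: Option<Child>,
}

impl ReapOnDrop {
    /// Spawn the command and take ownership of the resulting child.
    pub fn spawn(cmd: &mut Command) -> io::Result<Self> {
        Ok(Self { child: Some(cmd.spawn()?) })
    }

    /// OS process id, or `None` if the child was already reaped.
    pub fn id(&self) -> Option<u32> {
        self.child.as_ref().map(|c| c.id())
    }

    /// Normal path: block until the child exits and return its status.
    pub fn wait(mut self) -> io::Result<ExitStatus> {
        let mut child = self.child.take().expect("child already reaped");
        child.wait()
    }
}

impl Drop for ReapOnDrop {
    fn drop(&mut self) {
        // Abnormal path: the guard went out of scope without `wait()`.
        // Kill the child (errors ignored, it may have exited already),
        // then wait on it so the kernel can release the process entry.
        if let Some(mut child) = self.child.take() {
            let _ = child.kill();
            let _ = child.wait();
        }
    }
}
```

With a guard like this, a caller that returns early (for example with `?` after a gate failure) still reaps the process when the guard is dropped, while the normal path calls `guard.wait()` to obtain the exit status. Blocking in `drop` is a deliberate trade-off in this sketch: it is simple, but a child that ignores the kill signal would stall the caller.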