storkit: create 452_bug_claude_code_pty_crashes_with_fatal_runtime_error_on_agent_restart
+11
-5
@@ -6,21 +6,27 @@ name: "Claude Code PTY crashes with fatal runtime error on agent restart"

 ## Description

-When an agent is restarted on a worktree where coding is already complete (clean worktree, all changes committed), the Claude Code PTY process crashes immediately on startup with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns, with `Session: None`. This burns through retries and blocks the story. The crash comes from inside the Claude Code binary itself, not storkit code. Observed on both story 449 and 450 worktrees.
+When agent processes (Claude Code PTY sessions) exit, storkit does not reap the child processes, leaving them as zombies (`[claude] <defunct>`). These accumulate over time — observed 101 zombie processes in one session. When the zombie count gets high enough, new PTY allocations fail and Claude Code crashes immediately on startup with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns (Session: None), burns through retries, and blocks the story.
+
+Root cause: storkit spawns Claude Code via `portable-pty` but does not call `waitpid` or equivalent to reap child processes after they exit. The `child.wait()` may not be called in all code paths (e.g. when the agent is killed during a rebuild or when gates fail and the agent is restarted).
+
+Observed in `agents/pty.rs` and `llm/providers/claude_code.rs` — both use the same pattern of `pair.slave.spawn_command()` followed by `drop(pair.master)` but may not properly wait on the child in all exit paths.

 ## How to Reproduce

-Have a story in current stage with a worktree that has all coding done and committed (clean `git status`). Let the gates fail for an unrelated reason (e.g. a test failure). The pipeline restarts the agent on the same worktree. The Claude Code process crashes immediately with the fatal runtime error.
+Run several agent sessions (or trigger gate failures that cause restarts). After enough sessions, zombie `[claude] <defunct>` processes accumulate. Check with `ps aux | grep '<defunct>'`. Once enough zombies exist, new agent spawns crash immediately with the fatal runtime error.

 ## Actual Result

-`fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting` — process exits instantly, session is None, burns retries.
+`fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting` — process exits instantly, session is None, burns retries. `ps aux` shows many `[claude] <defunct>` zombie processes.

 ## Expected Result

-Claude Code should start successfully on the worktree and the agent should be able to fix any gate failures.
+All child processes should be properly reaped after exit. No zombie accumulation. New agent spawns should always succeed regardless of how many previous sessions have run.

 ## Acceptance Criteria

-- [ ] Agent restart on a worktree with committed work does not crash the PTY
+- [ ] `child.wait()` is called in all exit paths in `agents/pty.rs` and `llm/providers/claude_code.rs`
+- [ ] No zombie `[claude] <defunct>` processes accumulate during normal operation
+- [ ] Agent restart after gate failure does not crash the PTY
 - [ ] If Claude Code fails to start, the error is handled gracefully without burning retries
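
The first acceptance criterion asks for `child.wait()` on every exit path. One way this might be approached is a drop guard that owns the child handle, so that an early return or a restart still reaps the process. The following is only a sketch under assumptions: it uses `std::process::Child` as a stand-in for the `portable-pty` child handle, and `ReapOnDrop` and `spawn_guarded` are hypothetical names that do not appear in storkit.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};

/// Hypothetical guard (not storkit code): owns a child process and makes
/// sure it is reaped, whichever way the owning scope exits.
struct ReapOnDrop {
    child: Option<Child>,
}

impl ReapOnDrop {
    fn new(child: Child) -> Self {
        ReapOnDrop { child: Some(child) }
    }

    /// Normal path: block until the child exits and return its status.
    fn wait(mut self) -> io::Result<ExitStatus> {
        // take() leaves None behind, so Drop does not wait a second time.
        let mut child = self
            .child
            .take()
            .expect("child is present until wait or drop");
        child.wait()
    }
}

impl Drop for ReapOnDrop {
    fn drop(&mut self) {
        // Early-exit path: error return, agent restart, or panic unwind.
        if let Some(mut child) = self.child.take() {
            // Result ignored: the child may already have exited.
            let _ = child.kill();
            // Collect the exit status so no <defunct> entry is left behind.
            let _ = child.wait();
        }
    }
}

/// Hypothetical helper: spawn a program and wrap it in the guard.
fn spawn_guarded(program: &str, args: &[&str]) -> io::Result<ReapOnDrop> {
    let child = Command::new(program).args(args).spawn()?;
    Ok(ReapOnDrop::new(child))
}
```

On the normal path a caller would use `spawn_guarded("claude", &[])?.wait()?`. On any other path the guard goes out of scope and its `Drop` implementation kills and reaps the child. Whether the same wrapper fits the `portable-pty` handle used in `agents/pty.rs` and `llm/providers/claude_code.rs` would need checking against that crate's API.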