storkit: done 453_bug_agent_pty_crashes_with_fatal_runtime_error_on_restart_after_gate_failure
@@ -0,0 +1,53 @@
---
name: "Agent PTY crashes with fatal runtime error on restart after gate failure"
---
# Bug 453: Agent PTY crashes with fatal runtime error on restart after gate failure
## Description
When an agent completes coding and the acceptance gates fail (e.g. a test failure), the pipeline restarts the agent on the same worktree. The restarted Claude Code PTY process crashes immediately with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns (Session: None), burns through all 3 retries, and blocks the story.

Key observations:

- The crash is **deterministic, not intermittent**: the first PTY spawn in a worktree always works; the second spawn (restart) always crashes
- Running `claude -p "hello"` manually in the same worktree works fine (no crash), so the issue is specific to spawning via portable-pty
- The worktree is clean (all changes committed), so the agent has nothing to do but fix the gate failure
- The crash is inside the Claude Code binary, not storkit code
- Observed on every story that needed a restart: 329, 400, 420, 438, 446, 449, 450
- Stories that passed gates on the first run were never affected, because they never triggered a second spawn

Likely cause: the reader thread started with `std::thread::spawn` in `pty.rs` (lines 248-255) is never joined. After `run_agent_pty_streaming` returns, the pipeline immediately calls `start_agent` for the retry, but the old reader thread may still be running and holding a cloned PTY reader fd. The new PTY allocation could then collide with the still-open fd from the previous session.
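One way to test the unjoined-thread hypothesis is to have the session function keep the reader thread's `JoinHandle` and join it before returning. The sketch below is not the code in `pty.rs`: `spawn_reader` and `run_session` are hypothetical names, and an in-memory reader stands in for the cloned PTY reader so the example needs only the standard library.

```rust
use std::io::{BufRead, BufReader, Read};
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

/// Spawns a thread that forwards lines from `reader` to `tx`.
/// The thread owns `reader`, so the underlying fd is released only
/// when the thread exits.
fn spawn_reader<R>(reader: R, tx: mpsc::Sender<String>) -> JoinHandle<()>
where
    R: Read + Send + 'static,
{
    thread::spawn(move || {
        let buf = BufReader::new(reader);
        for line in buf.lines() {
            match line {
                Ok(text) => {
                    // Stop if the receiving side has gone away.
                    if tx.send(text).is_err() {
                        break;
                    }
                }
                // A read error ends the session.
                Err(_) => break,
            }
        }
        // `buf` (and the reader inside it) is dropped here.
    })
}

/// Hypothetical stand-in for `run_agent_pty_streaming`: drains the output,
/// then joins the reader thread so no reader outlives the session.
fn run_session<R>(reader: R) -> Vec<String>
where
    R: Read + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let handle = spawn_reader(reader, tx);
    // The iterator ends once the reader thread drops its sender.
    let lines: Vec<String> = rx.iter().collect();
    // After the join, the thread and its reader are gone before any retry.
    handle.join().expect("reader thread panicked");
    lines
}
```

Calling `run_session(std::io::Cursor::new(b"first\nsecond\n".to_vec()))` returns the two lines and only returns after the thread has finished. With a real PTY the blocking read typically ends only after the child side has closed, so the child would need to be killed or waited on before the join, otherwise the join itself could block.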

The root cause is not yet confirmed. It is NOT caused by zombie process accumulation, which is a separate issue tracked in #452.

**Timeline:** The crash first appeared on 2026-03-21. Agent logs go back to 2026-02-23 and show no instances before 2026-03-21. Stories that hit it: 329 (Mar 21), 400 (Mar 26), 420 (Mar 28), 438 (Mar 28), 446 (Mar 30), 449 (Mar 31), 450 (Mar 31).

**Suspect commits around 2026-03-21:**

- `4344081b` (storkit: merge 343_refactor_abstract_agent_runtime_to_support_non_claude_code_backends): refactored the agent runtime layer
- `c4e45b28` (The great storkit name conversion)
- Story 359, Docker security hardening (`cap_drop: ALL`, with only `SETUID`/`SETGID` added back): could affect PTY allocation
- Story 329, Docker/OrbStack evaluation spike (the first crash was on this story's mergemaster)

**Ruled out:** Docker capability restrictions (`cap_drop: ALL`). Temporarily removing all `cap_drop`/`security_opt` settings did not stop the crash.

**Evidence of stale PTY fd:** After all agents stopped, storkit (PID 7) was still holding an open fd to `/dev/pts/ptmx` (fd 46). This is a PTY master fd leaked from a previous agent session. The likely source is the reader thread started with `std::thread::spawn` in `pty.rs`: it is never joined, so the cloned reader fd can stay open in the storkit process after the agent exits.
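The fd check above can be repeated programmatically, for example from a debug endpoint or a test that runs after all agents have stopped. The following is a minimal sketch assuming a Linux procfs layout; `is_pty_master` and `pty_master_fds` are hypothetical helpers, not part of storkit or portable-pty.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns true if a `/proc/<pid>/fd` symlink target looks like a PTY master.
fn is_pty_master(target: &Path) -> bool {
    matches!(target.to_str(), Some("/dev/ptmx") | Some("/dev/pts/ptmx"))
}

/// Lists (fd number, target) pairs for PTY master fds held by process `pid`.
/// Linux only: relies on `/proc/<pid>/fd` being available and readable.
fn pty_master_fds(pid: u32) -> io::Result<Vec<(u32, PathBuf)>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(format!("/proc/{pid}/fd"))? {
        let entry = entry?;
        // Entries can vanish between listing and readlink; skip those.
        let Ok(target) = fs::read_link(entry.path()) else { continue };
        // Directory entry names are the fd numbers.
        let Some(fd) = entry.file_name().to_str().and_then(|s| s.parse::<u32>().ok()) else {
            continue;
        };
        if is_pty_master(&target) {
            found.push((fd, target));
        }
    }
    found.sort();
    Ok(found)
}
```

Calling `pty_master_fds(std::process::id())` once no agent is running should return an empty list if nothing is leaked; an entry such as fd 46 pointing at `/dev/pts/ptmx` would match the observation above.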

Remaining areas to investigate: the unjoined reader thread leaking PTY fds, and whether the leaked fd from the first session interferes with the second PTY allocation.
## How to Reproduce
1. Have a story in current stage with committed code in its worktree.
2. Introduce a test failure that causes gates to fail.
3. The pipeline restarts the agent on the same worktree.
4. The Claude Code process crashes immediately on spawn.
## Actual Result
`fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting` — process exits instantly (same second as spawn), Session: None. Burns through retries and blocks the story.
## Expected Result
The restarted agent should start successfully, receive the gate failure context, and be able to fix the issue.
## Acceptance Criteria
- [ ] Agent restart after gate failure successfully spawns a Claude Code PTY session
- [ ] No fatal runtime error on PTY restart in a worktree with prior committed work
- [ ] If Claude Code fails to start, the error is handled gracefully without burning retries
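The last criterion implies that the retry logic can tell a process that never started a session apart from an agent that ran and failed. A rough sketch of such a classification follows; `RunOutcome`, `classify_run`, `retries_after`, and the `startup_grace` threshold are all assumptions made for illustration and do not describe storkit's existing retry code.

```rust
use std::time::Duration;

/// How a finished agent run should be treated by the retry logic.
#[derive(Debug, PartialEq)]
enum RunOutcome {
    /// The agent established a session and exited successfully.
    Completed,
    /// The process died before establishing a session; do not charge a retry.
    SpawnFailure,
    /// The agent ran but failed; charge one retry.
    AgentFailure,
}

/// Hypothetical classifier. `session_id` mirrors the "Session: None" log
/// field; `startup_grace` is an assumed cutoff for "exited right after spawn".
fn classify_run(
    exit_ok: bool,
    session_id: Option<&str>,
    runtime: Duration,
    startup_grace: Duration,
) -> RunOutcome {
    match (exit_ok, session_id) {
        (true, Some(_)) => RunOutcome::Completed,
        (_, None) if runtime < startup_grace => RunOutcome::SpawnFailure,
        _ => RunOutcome::AgentFailure,
    }
}

/// Returns the retries left after an outcome; spawn failures are not charged.
fn retries_after(outcome: &RunOutcome, retries_left: u32) -> u32 {
    match outcome {
        RunOutcome::AgentFailure => retries_left.saturating_sub(1),
        RunOutcome::Completed | RunOutcome::SpawnFailure => retries_left,
    }
}
```

Because spawn failures are free in this sketch, a real implementation would probably want a separate cap or backoff for them so that a persistent startup crash cannot loop forever.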