From 3f97e34f210659399e49b232b2591c3055a3595a Mon Sep 17 00:00:00 2001
From: dave
Date: Tue, 31 Mar 2026 14:13:22 +0000
Subject: [PATCH] storkit: create 453_bug_agent_pty_crashes_with_fatal_runtime_error_on_restart_after_gate_failure

---
 ..._fatal_runtime_error_on_restart_after_gate_failure.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/.storkit/work/1_backlog/453_bug_agent_pty_crashes_with_fatal_runtime_error_on_restart_after_gate_failure.md b/.storkit/work/1_backlog/453_bug_agent_pty_crashes_with_fatal_runtime_error_on_restart_after_gate_failure.md
index 9cce9738..afc69e72 100644
--- a/.storkit/work/1_backlog/453_bug_agent_pty_crashes_with_fatal_runtime_error_on_restart_after_gate_failure.md
+++ b/.storkit/work/1_backlog/453_bug_agent_pty_crashes_with_fatal_runtime_error_on_restart_after_gate_failure.md
@@ -9,11 +9,14 @@ name: "Agent PTY crashes with fatal runtime error on restart after gate failure"
 When an agent completes coding and the acceptance gates fail (e.g. a test failure), the pipeline restarts the agent on the same worktree. The restarted Claude Code PTY process crashes immediately with `fatal runtime error: assertion failed: output.write(&bytes).is_ok(), aborting`. The process exits in the same second it spawns (Session: None), burns through all 3 retries, and blocks the story.
 
 Key observations:
-- Running `claude -p "hello"` manually in the same worktree works fine (no crash)
-- The crash happens specifically when spawned via portable-pty in `agents/pty.rs`
+- The crash is **deterministic, not intermittent**: the first PTY spawn in a worktree always works; the second spawn (restart) always crashes
+- Running `claude -p "hello"` manually in the same worktree works fine (no crash) — the issue is specific to spawning via portable-pty
 - The worktree is clean (all changes committed) — the agent has nothing to do but fix the gate failure
 - The crash is inside the Claude Code binary, not storkit code
-- Observed on stories 449 and 450 — both had their coding done but failed the same unrelated test
+- Observed on every story that needed a restart: 329, 400, 420, 438, 446, 449, 450
+- Stories that passed gates on the first run were never affected — they never triggered a second spawn
+
+Likely cause: the reader thread spawned by `std::thread::spawn` in `pty.rs` (line 248-255) is never joined. After `run_agent_pty_streaming` returns, the pipeline immediately calls `start_agent` for the retry, but the old reader thread may still be running and holding a cloned PTY reader fd. The new PTY allocation could collide with the still-open fd from the previous session.
 
 The root cause is unknown. It is NOT caused by zombie process accumulation (that is a separate issue in #452).
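
The "Likely cause" paragraph suspects a reader thread that outlives its session because it is never joined. As a rough illustration of what joining would look like, here is a minimal sketch using only the standard library. It is not the code in `pty.rs`: the function name `stream_and_join` is hypothetical, and a generic `Read` stands in for the cloned PTY reader. The sketch also assumes the reader eventually reports EOF; a real PTY reader may block until the master side is closed, so the join could hang unless that is handled first.

```rust
use std::io::{self, Read};
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a streaming session: drains `reader` on a
/// background thread and returns only after that thread has exited.
fn stream_and_join<R>(mut reader: R) -> io::Result<Vec<u8>>
where
    R: Read + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Keep the JoinHandle instead of letting the thread run detached.
    let handle = thread::spawn(move || -> io::Result<()> {
        let mut buf = [0u8; 4096];
        loop {
            let n = reader.read(&mut buf)?;
            if n == 0 {
                break; // EOF: the other side has closed
            }
            if tx.send(buf[..n].to_vec()).is_err() {
                break; // receiver dropped, nothing left to forward to
            }
        }
        Ok(())
        // `reader` and `tx` are dropped here, when the thread ends.
    });

    // Collect chunks until the sender is dropped by the reader thread.
    let mut output = Vec::new();
    for chunk in rx {
        output.extend_from_slice(&chunk);
    }

    // Join before returning, so the reader (and whatever handle it owns)
    // is guaranteed to be released before the caller starts a new session.
    match handle.join() {
        Ok(result) => result?,
        Err(_) => {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                "reader thread panicked",
            ))
        }
    }

    Ok(output)
}
```

Calling `stream_and_join(std::io::Cursor::new(b"hello".to_vec()))` returns the bytes only after the reader thread has finished, so a retry started afterwards cannot overlap with the previous reader. Whether this ordering actually prevents the crash described in this bug is untested here.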