fix: reap_stale_merge_jobs re-dispatches instead of just deleting

A mid-merge server restart used to silently kill the merge: the
in-flight tokio task died with the process, reap_stale_merge_jobs ran
on the new boot, saw the Running entry from the previous boot, and
simply deleted it. The mergemaster, polling `get_merge_status`, then
saw "Merge job disappeared", treated it as a strike, and after three
such restarts escalated the story to MergeFailureFinal, even though no
real merge failure ever happened (this is what trapped story 998
during the bug 1001 iteration cycle).

The reaper now also fires a `WatcherEvent::WorkItem` with action
"reassign" for the cleared story, so the auto-assign watcher loop
re-runs start_merge_agent_work on the fresh boot. The story is still
in 4_merge/, so the merge resumes automatically. The change is
contained to the reap path; start_merge_agent_work's own behaviour is
unchanged.
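
For context, the consuming side has roughly the following shape. This
is an illustrative sketch, not the actual watcher loop: the field
names come from the WatcherEvent::WorkItem variant in the diff below,
but the channel receiver (watcher_rx), the pool binding, and the exact
dispatch structure are assumptions.

    // Sketch: how the auto-assign loop is expected to route the new
    // "reassign" event back into the merge entry point.
    while let Some(event) = watcher_rx.recv().await {
        if let WatcherEvent::WorkItem { stage, item_id, action, .. } = event {
            if stage == "4_merge" && action == "reassign" {
                // reap has already deleted the stale Running entry,
                // so this starts a fresh merge rather than colliding
                // with a leftover job.
                pool.start_merge_agent_work(&item_id);
            }
        }
    }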

Added a regression test,
reap_stale_merge_jobs_emits_reassign_watcher_event, that asserts the
new event fires. The existing
reap_stale_merge_jobs_removes_old_running_entry_without_merge test
still passes (its "without_merge" guarantee is about agent spawning,
not about the absence of watcher events).
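
Condensed to its essential shape, the new test looks like the
following; the setup helpers here are hypothetical stand-ins for the
real test fixtures.

    #[tokio::test]
    async fn reap_stale_merge_jobs_emits_reassign_watcher_event() {
        // Hypothetical helpers: build a pool wired to a test watcher
        // channel and seed a Running job stamped before this boot.
        let (pool, mut watcher_rx) = test_pool_with_watcher();
        seed_running_merge_job("story-998", previous_boot_time());
        pool.reap_stale_merge_jobs();
        // The stale entry must be re-dispatched, not just deleted.
        match watcher_rx.try_recv() {
            Ok(WatcherEvent::WorkItem { stage, item_id, action, .. }) => {
                assert_eq!(stage, "4_merge");
                assert_eq!(item_id, "story-998");
                assert_eq!(action, "reassign");
            }
            _ => panic!("expected a reassign WorkItem event"),
        }
    }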

Also exposes AgentPool::watcher_tx() as pub(crate) so the merge
runner can fan out re-dispatch events.
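
The accessor itself is only a visibility change; something like the
sketch below, assuming the pool holds a tokio::sync::mpsc unbounded
sender (the non-awaited send(..) in the diff points that way).

    impl AgentPool {
        /// Sender for watcher events, exposed so the merge reaper can
        /// emit re-dispatch events from outside the watcher module.
        /// (Assumes an mpsc::UnboundedSender<WatcherEvent> field.)
        pub(crate) fn watcher_tx(&self) -> &mpsc::UnboundedSender<WatcherEvent> {
            &self.watcher_tx
        }
    }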

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: Timmy
Date:   2026-05-13 21:28:10 +01:00
Parent: bbdee1239b
Commit: 2758f744f2

3 changed files with 87 additions and 9 deletions
@@ -11,16 +11,27 @@ use super::time::{
 };
 impl AgentPool {
-    /// Sweep all Running merge jobs in the CRDT and delete any that were left
-    /// behind by a previous server instance.
+    /// Sweep all Running merge jobs in the CRDT and clear any left behind by
+    /// a previous server instance, re-dispatching them to the auto-assigner
+    /// so the merge resumes on the new boot.
     ///
-    /// A job is considered stale when its recorded `server_start` timestamp is
-    /// older than the current server's boot time, or when the `error` field
-    /// cannot be decoded (legacy / malformed entries).
+    /// A job is considered stale when its recorded `server_start` timestamp
+    /// is older than the current server's boot time, or when the `error`
+    /// field cannot be decoded (legacy / malformed entries).
     ///
-    /// Called at the top of [`start_merge_agent_work`] to unblock retries, and
-    /// also by the periodic background reaper in the tick loop so stale entries
-    /// are cleaned up even when no new merge is triggered.
+    /// For every stale job we (a) delete the orphaned Running entry and (b)
+    /// emit a `WatcherEvent::WorkItem reassign` for the same story_id in
+    /// `4_merge/`, so the auto-assign watcher loop re-triggers
+    /// `start_merge_agent_work` on the fresh boot. Without (b), a mid-merge
+    /// server restart would leave the mergemaster polling `get_merge_status`
+    /// only to see "Merge job disappeared", burn a retry, and after three
+    /// such interruptions escalate to `MergeFailureFinal` — a process bug,
+    /// not a real merge failure (this is what happened to story 998 during
+    /// the bug 1001 fix iterations).
+    ///
+    /// Called at the top of [`start_merge_agent_work`] to unblock retries,
+    /// and also by the periodic background reaper in the tick loop so stale
+    /// entries are cleaned up even when no new merge is triggered.
     pub(crate) fn reap_stale_merge_jobs(&self) {
         if let Some(jobs) = crate::crdt_state::read_all_merge_jobs() {
             let current_boot = server_start_time();
@@ -34,10 +45,22 @@
             };
             if stale {
                 slog!(
-                    "[merge] Cleared stale Running merge job for '{}' (server restarted)",
+                    "[merge] Cleared stale Running merge job for '{}' (server restarted) — re-dispatching",
                     job.story_id
                 );
                 crate::crdt_state::delete_merge_job(&job.story_id);
+                // Re-trigger the merge on the new boot. Auto-assign sees
+                // a story in 4_merge/ with no Running job and will call
+                // start_merge_agent_work again.
+                let _ = self
+                    .watcher_tx()
+                    .send(crate::io::watcher::WatcherEvent::WorkItem {
+                        stage: "4_merge".to_string(),
+                        item_id: job.story_id.clone(),
+                        action: "reassign".to_string(),
+                        commit_msg: String::new(),
+                        from_stage: None,
+                    });
             }
         }
     }