fix(merge): use server-start-time, not pid, for stale-merge detection

The merge_jobs cleanup encoded the server's pid in the CRDT and checked `kill(pid, 0)` to decide whether a "running" entry was stale. Two problems:

1. The cleanup runs *inside* the server, so checking whether the server's own pid is alive is tautological: `kill(self_pid, 0)` always succeeds.
2. `rebuild_and_restart` does an `execve()` re-exec, which keeps the same pid. After re-exec, merge_jobs from the previous server instance still encode "the current pid", so the cleanup never fires, and stories like 799/800 sit forever with status="running" while no actual merge runs.

Switch to a per-process server-start-time captured lazily in a `OnceLock<f64>` (reset by execve, so the new instance sees a fresh boot-time). A merge_job's recorded start-time < current boot-time means it came from a previous instance: stale, delete it. Legacy pid-encoded entries decode to None and are also treated as stale.

MergeJob.pid → MergeJob.server_start_time. Tests updated.
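
A minimal sketch of the start-time check described above, not the code from this commit. The names `SERVER_START_TIME`, `server_start_time`, and `is_stale` are hypothetical, and the sketch assumes a decoded entry yields `Option<f64>`, with `None` standing in for a legacy pid-encoded entry.

```rust
use std::sync::OnceLock;
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical name: per-process start time in Unix seconds.
// A static lives in process memory, so execve() discards it and the
// re-exec'd server initialises a fresh, later value on first use.
static SERVER_START_TIME: OnceLock<f64> = OnceLock::new();

/// Returns this process's start time, capturing it lazily on first call.
fn server_start_time() -> f64 {
    *SERVER_START_TIME.get_or_init(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs_f64())
            // Assumption: fall back to 0.0 if the clock is before the epoch.
            .unwrap_or(0.0)
    })
}

/// Hypothetical helper: decides whether a "running" entry is stale.
/// `recorded` is the start time stored with the job; `None` models a
/// legacy pid-encoded entry that no longer decodes.
fn is_stale(recorded: Option<f64>, current_start: f64) -> bool {
    match recorded {
        // Written by an earlier server instance.
        Some(t) => t < current_start,
        // Legacy entry: treat as stale.
        None => true,
    }
}
```

A cleanup pass would call `is_stale(job_start, server_start_time())` for each running entry and delete those that return true. Entries written by the current instance carry the same start time, so the strict `<` comparison leaves them alone.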
2026-04-28 20:41:32 +00:00
parent 8f392f4fc7
commit 2a77f73ba4
2 changed files with 51 additions and 52 deletions
@@ -19,12 +19,14 @@ pub enum MergeJobStatus {
 pub struct MergeJob {
    pub story_id: String,
    pub status: MergeJobStatus,
-    /// PID of the server process that started this job.
+    /// Server start-time (Unix seconds) of the server instance that started
+    /// this job.
    ///
    /// Used by stale-lock recovery: on a new merge attempt the system checks
-    /// every Running entry and removes any whose owning process is no longer
-    /// alive (e.g. the server crashed and restarted).
-    pub pid: u32,
+    /// every Running entry and removes any whose recorded start-time is older
+    /// than the current server's boot time. This survives `rebuild_and_restart`
+    /// (which re-execs and keeps the same PID).
+    pub server_start_time: f64,
 }

 /// Result of a mergemaster merge operation.