fix(merge): use server-start-time, not pid, for stale-merge detection

The merge_jobs cleanup encoded the server's pid in the CRDT and checked
`kill(pid, 0)` to decide whether a "running" entry was stale. Two problems:

  1. The cleanup runs *inside* the server, so checking whether the
     server's own pid is alive is tautological — kill(self_pid, 0)
     always succeeds.
  2. `rebuild_and_restart` does an `execve()` re-exec, which keeps the
     same pid. After re-exec, merge_jobs from the previous server
     instance still encode "the current pid" — so the cleanup never
     fires, and stories like 799/800 sit forever with status="running"
     while no actual merge runs.

Switch to a per-process server start time, captured lazily in a
`OnceLock<f64>`. The static is reset by `execve()`, so the re-exec'd
instance records a fresh start time. A merge_job whose recorded start
time is earlier than the current instance's start time came from a
previous instance: it is stale, so delete it.

Legacy pid-encoded entries decode to None and are also treated as stale.
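
A minimal sketch of the idea, not the code from this commit. The names
`SERVER_START_TIME`, `server_start_time()` and `is_stale()` are
hypothetical, and the `Option<f64>` argument stands in for whatever the
real decoder returns for a merge_job entry:

```rust
use std::sync::OnceLock;
use std::time::{SystemTime, UNIX_EPOCH};

/// Start time of this server process, in Unix seconds.
/// A `static` lives in the process image, so `execve()` resets it.
/// (Hypothetical name, chosen for this sketch.)
static SERVER_START_TIME: OnceLock<f64> = OnceLock::new();

/// Returns the start time, capturing it on the first call only.
fn server_start_time() -> f64 {
    *SERVER_START_TIME.get_or_init(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs_f64())
            // Clock set before 1970: fall back to 0.0 rather than panic.
            .unwrap_or(0.0)
    })
}

/// Hypothetical staleness check. `recorded` is `None` for legacy
/// pid-encoded entries that fail to decode as a start time.
fn is_stale(recorded: Option<f64>, current_start: f64) -> bool {
    match recorded {
        // Strictly older than this instance: left over from a previous one.
        Some(t) => t < current_start,
        // Undecodable legacy entry: treat as stale.
        None => true,
    }
}
```

A job started by the current instance records the same value that
`server_start_time()` returns, so the strict `<` keeps it, while any
entry written before the re-exec compares as older and is removed.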

MergeJob.pid → MergeJob.server_start_time. Tests updated.
dave
2026-04-28 20:41:32 +00:00
parent 8f392f4fc7
commit 2a77f73ba4
2 changed files with 51 additions and 52 deletions
+6 -4
@@ -19,12 +19,14 @@ pub enum MergeJobStatus {
 pub struct MergeJob {
     pub story_id: String,
     pub status: MergeJobStatus,
-    /// PID of the server process that started this job.
+    /// Server start-time (Unix seconds) of the server instance that started
+    /// this job.
     ///
     /// Used by stale-lock recovery: on a new merge attempt the system checks
-    /// every Running entry and removes any whose owning process is no longer
-    /// alive (e.g. the server crashed and restarted).
-    pub pid: u32,
+    /// every Running entry and removes any whose recorded start-time is older
+    /// than the current server's boot time. This survives `rebuild_and_restart`
+    /// (which re-execs and keeps the same PID).
+    pub server_start_time: f64,
 }
 
 /// Result of a mergemaster merge operation.