fix(merge): use server-start-time, not pid, for stale-merge detection

The merge_jobs cleanup encoded the server's pid in the CRDT and checked
`kill(pid, 0)` to decide whether a "running" entry was stale. Two problems:

1. The cleanup runs *inside* the server, so checking whether the
   server's own pid is alive is tautological: `kill(self_pid, 0)`
   always succeeds.

2. `rebuild_and_restart` does an `execve()` re-exec, which keeps the
   same pid. After re-exec, merge_jobs from the previous server
   instance still encode "the current pid", so the cleanup never
   fires, and stories like 799/800 sit forever with status="running"
   while no actual merge runs.

Switch to a per-process server-start-time captured lazily in a
`OnceLock<f64>` (reset by execve, so the new instance sees a fresh
boot-time, i.e. its own start-time). A merge_job whose recorded
start-time is earlier than the current boot-time came from a previous
instance: it is stale, so delete it.

Legacy pid-encoded entries decode to None and are also treated as stale.

MergeJob.pid → MergeJob.server_start_time. Tests updated.
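
The start-time capture and staleness rule described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the names `SERVER_START_TIME`, `server_start_time()` and `is_stale()` are hypothetical, and the recorded value is assumed to have already been decoded from the CRDT into an `Option<f64>` (with `None` standing for a legacy pid-encoded entry).

```rust
use std::sync::OnceLock;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical static holding this server instance's start time in Unix
/// seconds. A re-exec loads a fresh process image, so the cell is empty
/// again and the next call records a later timestamp.
static SERVER_START_TIME: OnceLock<f64> = OnceLock::new();

/// Returns the start time of this instance, capturing it on first use.
pub fn server_start_time() -> f64 {
    *SERVER_START_TIME.get_or_init(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs_f64())
            // Assumption: fall back to 0.0 if the clock is before the epoch.
            .unwrap_or(0.0)
    })
}

/// Decides whether a "running" merge job entry is stale.
/// `recorded` is the start time stored with the job; `None` models a
/// legacy pid-encoded entry that no longer decodes.
pub fn is_stale(recorded: Option<f64>, current_start: f64) -> bool {
    match recorded {
        // Recorded by an earlier instance: stale.
        Some(t) => t < current_start,
        // Legacy entry with no start time: treated as stale.
        None => true,
    }
}
```

Under these assumptions, a job created by the running instance records `server_start_time()` and compares equal on cleanup, so it is kept, while any entry written before the re-exec carries an earlier value and is removed.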
@@ -19,12 +19,14 @@ pub enum MergeJobStatus {
 pub struct MergeJob {
     pub story_id: String,
     pub status: MergeJobStatus,
-    /// PID of the server process that started this job.
+    /// Server start-time (Unix seconds) of the server instance that started
+    /// this job.
     ///
     /// Used by stale-lock recovery: on a new merge attempt the system checks
-    /// every Running entry and removes any whose owning process is no longer
-    /// alive (e.g. the server crashed and restarted).
-    pub pid: u32,
+    /// every Running entry and removes any whose recorded start-time is older
+    /// than the current server's boot time. This survives `rebuild_and_restart`
+    /// (which re-execs and keeps the same PID).
+    pub server_start_time: f64,
 }
 
 /// Result of a mergemaster merge operation.