Commit Graph

4 Commits

Author SHA1 Message Date
dave 7408cc5b4b fix(crdt_snapshot): per-thread SNAPSHOT_STATE in cfg(test) instead of shared static (bug 669)
Replaces the test-time GLOBAL_STATE_LOCK approach (which was just disguised
single-threading) with proper test isolation: each test thread gets its own
SnapshotState via a thread_local!.

Pattern matches crdt_state::CRDT_STATE_TL — production keeps the global
OnceLock; tests get a per-thread OnceLock that's accessed through a
snapshot_state() helper. The unsafe `&*ptr` cast to 'static is safe because
the thread_local lives as long as the spawning test thread.

The race: latest_snapshot_available_after_compaction captured at_seq from a
freshly-generated snapshot, then asserted it equalled SNAPSHOT_STATE's
latest.at_seq. With shared SNAPSHOT_STATE, another test thread's
apply_compaction could overwrite latest_snapshot between capture and read.
Per-thread state eliminates the race at its source.

ALL_OPS / VECTOR_CLOCK stay shared — the tests don't assert on absolute
counts, only on (this-thread's at_seq) == (this-thread's latest.at_seq).

5 consecutive default-parallel `cargo test --bin huskies` runs all green
at 2636/2636.
2026-04-27 02:49:53 +00:00
dave fc71c22305 Revert "fix(crdt_snapshot): serialise tests that share global SNAPSHOT_STATE / ALL_OPS / VECTOR_CLOCK (bug 669)"
This reverts commit 8e608feec1.
2026-04-27 02:45:01 +00:00
dave 8e608feec1 fix(crdt_snapshot): serialise tests that share global SNAPSHOT_STATE / ALL_OPS / VECTOR_CLOCK (bug 669)
The crdt_snapshot tests share three global statics:
- SNAPSHOT_STATE (latest_snapshot, pending_acks, pending_at_seq) — coordination state
- crdt_state::ALL_OPS / VECTOR_CLOCK — op journal + vector clock

Only the per-thread CRDT is thread-local (init_for_test); these other globals
are shared across test threads. Under default cargo test parallelism, two tests
running concurrently interleave their op writes and snapshot generation, so
assertions like assert_eq!(at_seq, 4) fail with at_seq=5 (the other thread's
ops snuck in).

Add a module-level GLOBAL_STATE_LOCK that all 17 affected tests grab at the
top. unwrap_or_else(|e| e.into_inner()) handles the case where a prior test
panicked while holding the lock (poisoned).

Fixes bug 669 — these two tests were the silent killer behind every agent's
script/test failure (see also bug 668, which advanced agents to merge despite
gates_passed=false; that compounded this by sending failing-tests worktrees
to mergemaster).

All 2636 tests now pass under default parallel execution (no --test-threads=1
needed).

Closes #669.
2026-04-27 02:43:49 +00:00
dave b84ce1f6bb huskies: merge 636_story_full_crdt_snapshot_compaction_with_cross_node_coordination 2026-04-26 01:19:05 +00:00