From 23fd70c1311b6044bd41a2f4bc15d35b16aca989 Mon Sep 17 00:00:00 2001
From: dave
Date: Fri, 24 Apr 2026 22:57:48 +0000
Subject: [PATCH] spike(613): add architecture roadmap for transports, services, state machine, CRDT
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Documents current state and recommended next steps across four layers:
- Service layer: 21 modules extracted, remaining work in http/ws.rs and http/mcp/
- Chat transports: 4 backends (Matrix/Slack/WhatsApp/Discord), Bug 501 noted
- Pipeline state machine: typed enum in place, consumer migration (Story 520) remaining
- CRDT: source-of-truth migration ongoing, cleanup stories 511/513/517/518/519/521 prioritised

Phases A–E chart the dependency order: state machine → transport registry →
CRDT cleanup → cryptographic auth → build agent polish.
---
 docs/architecture/roadmap.md | 196 +++++++++++++++++++++++++++++++++++
 1 file changed, 196 insertions(+)
 create mode 100644 docs/architecture/roadmap.md

diff --git a/docs/architecture/roadmap.md b/docs/architecture/roadmap.md
new file mode 100644
index 00000000..c7aede48
--- /dev/null
+++ b/docs/architecture/roadmap.md
@@ -0,0 +1,196 @@
+# Architecture Roadmap: Transports, Services, State Machine, CRDT
+
+*Spike 613 — April 2026*
+
+This document captures the current architecture across four key layers and charts
+the recommended next steps for each.
+
+---
+
+## 1. Current State
+
+### 1.1 Service Layer
+
+Stories 604–619 established a clean service extraction pattern. The
+`server/src/service/` directory now has 21 sub-modules, each following the
+functional-core / imperative-shell convention documented in
+[service-modules.md](service-modules.md).
+
+**Extracted so far:**
+`agents`, `anthropic`, `bot_command`, `common`, `diagnostics`, `events`,
+`file_io`, `gateway`, `git_ops`, `health`, `merge`, `notifications`, `oauth`,
+`pipeline`, `project`, `qa`, `settings`, `shell`, `story`, `timer`, `wizard`,
+`ws`
+
+**Remaining in HTTP handlers** (see [future-extractions.md](future-extractions.md)):
+The list there was written before stories 615–619. After those stories landed,
+the remaining surface is smaller. The HTTP handlers still containing inline
+business logic are: `http/ws.rs` (WebSocket dispatch) and scattered ad-hoc
+helpers in `http/mcp/` that have not yet been migrated to typed service modules.
+
+### 1.2 Chat Transports
+
+Four transport backends implement `ChatTransport` (defined in `chat/mod.rs`):
+
+| Transport | Connection model | Rooms / channels |
+|-----------|-----------------|-----------------|
+| Matrix | Long-lived WebSocket to homeserver | Dynamic (per-room history) |
+| Slack | HTTP webhook (Events API) | Fixed at startup from bot.toml |
+| WhatsApp | HTTP webhook (Meta Graph API or Twilio) | Ambient (tracked active senders) |
+| Discord | Gateway WebSocket + REST | Fixed at startup from bot.toml |
+
+All four are instantiated manually in `main.rs` (~lines 567–690) and passed into
+`AppContext`. Stage-transition notifications are pushed through
+`service/notifications/`.
+
+**Known issue (Bug 501):** The Matrix bot spawns its own `TimerStore` instead of
+consuming the shared `AppContext.timer_store`. This means MCP-tool cancellations
+and the bot's tick loop see different in-memory state.
+
+### 1.3 Pipeline State Machine
+
+`server/src/pipeline_state.rs` provides a typed, compile-time-safe state machine
+that replaces the old stringly-typed CRDT views.
+
+**Synced stages (all nodes converge):**
+```
+Backlog → Coding → Qa → Merge { feature_branch, commits_ahead: NonZeroU32 }
+                      → Done { merged_at, merge_commit }
+                      → Archived { archived_at, reason }
+```
+
+`ArchiveReason` subsumes the old `blocked`, `merge_failure`, and `review_hold`
+flags: `Completed | Abandoned | Superseded | Blocked | MergeFailed | ReviewHeld`.
+
+`NonZeroU32` in `Merge` makes zero-commit merges structurally impossible.
+
+**Per-node execution state (local, not replicated):**
+`Idle → Pending → Running → RateLimited → Completed`
+
+**Status:** The typed state machine is defined and the projection layer
+(`PipelineItemView → PipelineItem via TryFrom`) is in place. Consumer
+migration — replacing ad-hoc string comparisons across the codebase — is the
+remaining work (tracked by Story 520).
+
+### 1.4 CRDT Layer
+
+`server/src/crdt_state.rs` + `crdt_sync.rs` form the distributed-state
+foundation:
+
+- **Document model:** `PipelineDoc { items: ListCrdt, nodes: ListCrdt }`
+- **Registers:** `LwwRegisterCrdt` for all mutable fields
+- **Persistence:** Ops stored in SQLite (`pipeline.db`); `CrdtEvent` broadcast on every stage change
+- **Sync protocol:** WebSocket `/crdt-sync` — bulk dump on connect (text), individual `SignedOp`s in real-time (binary)
+- **Backpressure:** Slow peers are disconnected; they reconnect and get a fresh bulk dump
+
+**Filesystem shadows** (`huskies/work/`) are now a secondary output only — CRDT is
+the source of truth. Several clean-up stories (513, 517) remain backlogged to
+remove the remaining fallback paths.
+
+---
+
+## 2. Roadmap
+
+### Phase A — Finish the State Machine Migration (Story 520)
+
+**Goal:** Every pipeline query uses the typed `PipelineItem` enum instead of
+raw string comparisons on `stage`.
+
+Work:
+1. Replace `stage == "current"` / `"qa"` / `"merge"` patterns in `agents/`,
+   `http/mcp/`, `chat/commands/`, and `gateway.rs` with `matches!(item, PipelineItem::Coding)` etc.
+2. Remove the `PipelineItemView` → string projection paths once all consumers
+   use the typed enum.
+3. Add exhaustive match tests in `pipeline_state.rs` so new stages cause
+   compile-time failures, not silent mismatches.
+
+### Phase B — Transport Registry Abstraction
+
+**Goal:** Replace the manual transport wiring in `main.rs` with a pluggable
+registry, making it easy to add or remove transports without modifying the
+startup sequence.
+
+Work:
+1. Define a `TransportRegistry` that holds `Vec>` keyed
+   by `TransportKind` (Matrix, Slack, WhatsApp, Discord).
+2. Move the per-transport instantiation logic from `main.rs` into
+   `service/transport/` following the service module conventions.
+3. Unify webhook signature verification (currently duplicated between Slack and
+   WhatsApp) into a shared `service/transport/verify.rs`.
+4. Fix Bug 501: pass the shared `AppContext.timer_store` into the Matrix bot
+   instead of spawning a private instance.
+5. Unify message history persistence (each transport currently owns a separate
+   history file format) into a common `service/transport/history.rs`.
+
+### Phase C — CRDT Cleanup (Stories 511, 513, 517, 518, 519, 521)
+
+**Goal:** Remove all legacy filesystem-first paths and complete the
+CRDT-as-source-of-truth migration.
+
+Priority order (based on risk/value):
+1. **519** — Mergemaster must detect zero-commits-ahead and fail loudly instead of
+   silently exiting. Structural fix: `Merge { commits_ahead: NonZeroU32 }` already
+   enforces this — just ensure mergemaster reads from the typed enum.
+2. **518** — `apply_and_persist` should log when the persist tx fails instead of
+   silently dropping ops.
+3. **513** — Startup reconciliation pass: detect drift between CRDT pipeline items
+   and filesystem shadows, heal or report.
+4. **517** — Remove filesystem shadow fallback paths from `lifecycle.rs`.
+5. **521** — MCP HTTP capability to write a CRDT tombstone-delete op, clearing a
+   story from in-memory state cleanly.
+6. **511** — Lamport clock inner seq resets to 1 on restart instead of resuming
+   from `max(own_author_seq) + 1`. Low risk to fix, high risk to leave.
+
+### Phase D — Distributed Node Authentication (Story 480)
+
+**Goal:** Cryptographic node identity for the distributed mesh.
+
+Nodes already carry an Ed25519 pubkey as their `node_id` in `NodePresenceCrdt`.
+Work:
+1. Sign each `SignedOp` with the node's Ed25519 key before broadcast.
+2. Verify signatures on receipt in `crdt_sync.rs` before applying ops.
+3. Expose the node's public key via `NodePresenceCrdt.address` so peers can
+   bootstrap trust.
+4. Add a key-rotation path for long-lived nodes.
+
+### Phase E — Build Agent Mode Polish (Story 479)
+
+**Goal:** Stable headless build-agent mode (`huskies --rendezvous`) for
+distributing story processing across multiple machines.
+
+Work:
+1. Resolve claim-timeout races: if a node claims a story and dies, the claim
+   should expire after a configurable TTL and be re-claimable.
+2. Stale merge-job lock (Bug 498) — a lock left by a dead node should be
+   detectable and clearable by the surviving cluster.
+3. CRDT Lamport clock fix (511) is a prerequisite — distributed agents need
+   monotonically increasing sequences to converge correctly.
+
+---
+
+## 3. Dependency Graph
+
+```
+Phase A (State Machine)
+        ↓
+Phase B (Transport Registry)    Phase C (CRDT Cleanup: 511, 518, 513, 517, 521, 519)
+                                        ↓
+                                Phase D (Cryptographic Auth)
+                                        ↓
+                                Phase E (Build Agent Polish)
+```
+
+Phase A and C can progress in parallel. Phase B is independent of C/D/E.
+Phase D requires Phase C (especially 511 and 518). Phase E requires Phase D.
+
+---
+
+## 4. What NOT to Do
+
+- **Don't split `crdt_state.rs` prematurely.** It's large but internally
+  cohesive. A split should wait until the cleanup stories (Phase C) are done.
+- **Don't add a transport abstraction layer before fixing Bug 501.** A registry
+  that instantiates a broken Matrix bot just propagates the bug.
+- **Don't extract `http/ws.rs` to a service module before Phase A is done.**
+  The WebSocket handler touches pipeline state in string form; migrating it
+  while the state machine migration is in progress will cause double-churn.
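
To make the last point concrete, here is a rough sketch of what moving from string-form stage checks to a typed enum could look like. It is not the contents of `pipeline_state.rs`: the `Stage` enum is a simplified stand-in for the synced stages, and `parse_stage` and `is_agent_work` are hypothetical helpers invented for illustration. The mapping of the legacy string `"current"` to `Coding` is an assumption read off the Phase A work item. Only the variant names, `NonZeroU32`, and the legacy strings come from the roadmap.

```rust
use std::num::NonZeroU32;

/// Hypothetical, simplified stand-in for the synced pipeline stages.
/// The real type also has `Done` and `Archived` variants with their own fields.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge {
        feature_branch: String,
        commits_ahead: NonZeroU32,
    },
}

/// Hypothetical projection from the legacy string form into the typed enum.
/// Assumption: the legacy string "current" corresponds to `Stage::Coding`.
fn parse_stage(raw: &str, feature_branch: &str, commits_ahead: u32) -> Result<Stage, String> {
    match raw {
        "backlog" => Ok(Stage::Backlog),
        "current" => Ok(Stage::Coding),
        "qa" => Ok(Stage::Qa),
        "merge" => {
            // NonZeroU32::new returns None for 0, so a zero-commit merge
            // is rejected here instead of flowing further into the pipeline.
            let commits_ahead = NonZeroU32::new(commits_ahead).ok_or_else(|| {
                format!("merge stage with zero commits ahead on {}", feature_branch)
            })?;
            Ok(Stage::Merge {
                feature_branch: feature_branch.to_string(),
                commits_ahead,
            })
        }
        other => Err(format!("unknown stage string: {}", other)),
    }
}

/// Exhaustive match with no wildcard arm: adding a new `Stage` variant
/// makes this function fail to compile until the new case is handled.
fn is_agent_work(stage: &Stage) -> bool {
    match stage {
        Stage::Coding | Stage::Qa => true,
        Stage::Backlog | Stage::Merge { .. } => false,
    }
}
```

A caller that previously wrote `stage == "current"` would instead parse once at the boundary and then use `matches!(stage, Stage::Coding)` or an exhaustive `match`. A handler that is still string-based, such as `http/ws.rs`, would need this conversion applied at every comparison, which is why extracting it before Phase A lands means touching the same code twice.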