# Architecture Roadmap: Transports, Services, State Machine, CRDT

*Spike 613, April 2026*

This document captures the current architecture across four key layers and charts the recommended next steps for each.

---

## 1. Current State

### 1.1 Service Layer

Stories 604–619 established a clean service extraction pattern. The `server/src/service/` directory now has 21 sub-modules, each following the functional-core / imperative-shell convention documented in [service-modules.md](service-modules.md).

**Extracted so far:** `agents`, `anthropic`, `bot_command`, `common`, `diagnostics`, `events`, `file_io`, `gateway`, `git_ops`, `health`, `merge`, `notifications`, `oauth`, `pipeline`, `project`, `qa`, `settings`, `shell`, `story`, `timer`, `wizard`, `ws`

**Remaining in HTTP handlers** (see [future-extractions.md](future-extractions.md)):

The list there was written before stories 615–619. After those stories landed, the remaining surface is smaller. The HTTP handlers still containing inline business logic are `http/ws.rs` (WebSocket dispatch) and scattered ad-hoc helpers in `http/mcp/` that have not yet been migrated to typed service modules.

### 1.2 Chat Transports

Four transport backends implement `ChatTransport` (defined in `chat/mod.rs`):

| Transport | Connection model | Rooms / channels |
|-----------|-----------------|-----------------|
| Matrix | Long-lived WebSocket to homeserver | Dynamic (per-room history) |
| Slack | HTTP webhook (Events API) | Fixed at startup from bot.toml |
| WhatsApp | HTTP webhook (Meta Graph API or Twilio) | Ambient (tracked active senders) |
| Discord | Gateway WebSocket + REST | Fixed at startup from bot.toml |

All four are instantiated manually in `main.rs` (~lines 567–690) and passed into `AppContext`. Stage-transition notifications are pushed through `service/notifications/`.

**Known issue (Bug 501):** The Matrix bot spawns its own `TimerStore` instead of consuming the shared `AppContext.timer_store`.
This means MCP-tool cancellations and the bot's tick loop see different in-memory state.

### 1.3 Pipeline State Machine

`server/src/pipeline_state.rs` provides a typed, compile-time-safe state machine that replaces the old stringly-typed CRDT views.

**Synced stages (all nodes converge):**

```
Backlog → Coding → Qa
  → Merge { feature_branch, commits_ahead: NonZeroU32 }
  → Done { merged_at, merge_commit }
  → Archived { archived_at, reason }
```

`ArchiveReason` subsumes the old `blocked`, `merge_failure`, and `review_hold` flags: `Completed | Abandoned | Superseded | Blocked | MergeFailed | ReviewHeld`. `NonZeroU32` in `Merge` makes zero-commit merges structurally impossible.

**Per-node execution state (local, not replicated):** `Idle → Pending → Running → RateLimited → Completed`

**Status:** The typed state machine is defined and the projection layer (`PipelineItemView` → `PipelineItem` via `TryFrom`) is in place. Consumer migration, meaning the replacement of ad-hoc string comparisons across the codebase, is the remaining work (tracked by Story 520).

### 1.4 CRDT Layer

`server/src/crdt_state.rs` + `crdt_sync.rs` form the distributed-state foundation:

- **Document model:** `PipelineDoc { items: ListCrdt, nodes: ListCrdt }`
- **Registers:** `LwwRegisterCrdt` for all mutable fields
- **Persistence:** Ops stored in SQLite (`pipeline.db`); `CrdtEvent` broadcast on every stage change
- **Sync protocol:** WebSocket `/crdt-sync`, with a bulk dump on connect (text) and individual `SignedOp`s in real time (binary)
- **Backpressure:** Slow peers are disconnected; they reconnect and get a fresh bulk dump

**Filesystem shadows** (`huskies/work/`) are now a secondary output only; the CRDT is the source of truth. Clean-up stories 513 and 517 remain backlogged to remove the remaining fallback paths.

---

## 2. Roadmap

### Phase A: Finish the State Machine Migration (Story 520)

**Goal:** Every pipeline query uses the typed `PipelineItem` enum instead of raw string comparisons on `stage`.

Work:

1. Replace `stage == "current"` / `"qa"` / `"merge"` patterns in `agents/`, `http/mcp/`, `chat/commands/`, and `gateway.rs` with `matches!(item, PipelineItem::Coding)` etc.
2. Remove the `PipelineItemView` → string projection paths once all consumers use the typed enum.
3. Add exhaustive match tests in `pipeline_state.rs` so new stages cause compile-time failures, not silent mismatches.

### Phase B: Transport Registry Abstraction

**Goal:** Replace the manual transport wiring in `main.rs` with a pluggable registry, so that transports can be added or removed without modifying the startup sequence.

Work:

1. Define a `TransportRegistry` that holds a `Vec` of `ChatTransport` trait objects keyed by `TransportKind` (Matrix, Slack, WhatsApp, Discord).
2. Move the per-transport instantiation logic from `main.rs` into `service/transport/` following the service module conventions.
3. Unify webhook signature verification (currently duplicated between Slack and WhatsApp) into a shared `service/transport/verify.rs`.
4. Fix Bug 501: pass the shared `AppContext.timer_store` into the Matrix bot instead of spawning a private instance.
5. Unify message history persistence (each transport currently owns a separate history file format) into a common `service/transport/history.rs`.

### Phase C: CRDT Cleanup (Stories 511, 513, 517, 518, 519, 521)

**Goal:** Remove all legacy filesystem-first paths and complete the CRDT-as-source-of-truth migration.

Priority order (based on risk/value):

1. **519**: Mergemaster must detect zero-commits-ahead and fail loudly instead of silently exiting. Structural fix: `Merge { commits_ahead: NonZeroU32 }` already enforces this, so the remaining step is to ensure mergemaster reads from the typed enum.
2. **518**: `apply_and_persist` should log when the persist tx fails instead of silently dropping ops.
3. **513**: Startup reconciliation pass: detect drift between CRDT pipeline items and filesystem shadows, then heal or report.
4. **517**: Remove filesystem shadow fallback paths from `lifecycle.rs`.
5. **521**: MCP HTTP capability to write a CRDT tombstone-delete op, clearing a story from in-memory state cleanly.
6. **511**: Lamport clock inner seq resets to 1 on restart instead of resuming from `max(own_author_seq) + 1`. Low risk to fix, high risk to leave.

### Phase D: Distributed Node Authentication (Story 480)

**Goal:** Cryptographic node identity for the distributed mesh. Nodes already carry an Ed25519 pubkey as their `node_id` in `NodePresenceCrdt`.

Work:

1. Sign each `SignedOp` with the node's Ed25519 key before broadcast.
2. Verify signatures on receipt in `crdt_sync.rs` before applying ops.
3. Expose the node's public key via `NodePresenceCrdt.address` so peers can bootstrap trust.
4. Add a key-rotation path for long-lived nodes.

### Phase E: Build Agent Mode Polish (Story 479)

**Goal:** Stable headless build-agent mode (`huskies --rendezvous`) for distributing story processing across multiple machines.

Work:

1. Resolve claim-timeout races: if a node claims a story and dies, the claim should expire after a configurable TTL and become re-claimable.
2. Stale merge-job lock (Bug 498): a lock left by a dead node should be detectable and clearable by the surviving cluster.
3. The CRDT Lamport clock fix (511) is a prerequisite: distributed agents need monotonically increasing sequences to converge correctly.

---

## 3. Dependency Graph

```
Phase A (State Machine)
    ↓
Phase B (Transport Registry)

Phase C (CRDT Cleanup: 511, 518, 513, 517, 521, 519)
    ↓
Phase D (Cryptographic Auth)
    ↓
Phase E (Build Agent Polish)
```

Phase A and C can progress in parallel. Phase B is independent of C/D/E. Phase D requires Phase C (especially 511 and 518). Phase E requires Phase D.

---

## 4. What NOT to Do

- **Don't split `crdt_state.rs` prematurely.** It's large but internally cohesive. A split should wait until the cleanup stories (Phase C) are done.
- **Don't add a transport abstraction layer before fixing Bug 501.** A registry that instantiates a broken Matrix bot just propagates the bug.
- **Don't extract `http/ws.rs` to a service module before Phase A is done.** The WebSocket handler touches pipeline state in string form; migrating it while the state machine migration is in progress will cause double-churn.
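
To make the cost of string-form stage handling concrete, here is a minimal sketch of the typed pattern Phase A migrates consumers toward. It is not the code in `pipeline_state.rs`: the `Stage` enum, its field types, the helper names (`label`, `is_active`, `enter_merge`), and the label strings other than `"current"`, `"qa"`, and `"merge"` are assumptions made for illustration only.

```rust
use std::num::NonZeroU32;

/// Hypothetical, simplified stand-in for the typed pipeline stages.
/// Field names follow section 1.3; field types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merge_commit: String },
    Archived { reason: ArchiveReason },
}

/// Variant names taken from section 1.3.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArchiveReason {
    Completed,
    Abandoned,
    Superseded,
    Blocked,
    MergeFailed,
    ReviewHeld,
}

/// Exhaustive match with no wildcard arm: adding a new `Stage` variant
/// makes this function fail to compile until the new case is handled.
/// The returned strings are illustrative, not the real wire format.
pub fn label(stage: &Stage) -> &'static str {
    match stage {
        Stage::Backlog => "backlog",
        Stage::Coding => "current",
        Stage::Qa => "qa",
        Stage::Merge { .. } => "merge",
        Stage::Done { .. } => "done",
        Stage::Archived { .. } => "archived",
    }
}

/// Typed replacement for checks like `stage == "current" || stage == "qa"`.
pub fn is_active(stage: &Stage) -> bool {
    matches!(stage, Stage::Coding | Stage::Qa | Stage::Merge { .. })
}

/// Builds a `Merge` stage, returning `None` when there are zero commits
/// ahead, because `NonZeroU32::new(0)` is `None`.
pub fn enter_merge(feature_branch: &str, commits_ahead: u32) -> Option<Stage> {
    let commits_ahead = NonZeroU32::new(commits_ahead)?;
    Some(Stage::Merge {
        feature_branch: feature_branch.to_string(),
        commits_ahead,
    })
}
```

With this shape, a caller such as a merge job would have to handle the `None` from `enter_merge("feature/x", 0)` explicitly, which is the "fail loudly" behaviour Story 519 asks for. A handler that still compares raw strings gets none of these compile-time checks, which is why extracting it mid-migration means touching it twice.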