Documents current state and recommended next steps across four layers:
- Service layer: 22 modules extracted; remaining work in http/ws.rs and http/mcp/
- Chat transports: 4 backends (Matrix/Slack/WhatsApp/Discord); Bug 501 noted
- Pipeline state machine: typed enum in place; consumer migration (Story 520) remaining
- CRDT: source-of-truth migration ongoing; cleanup stories 511/513/517/518/519/521 prioritised

Phases A–E chart the dependency order: state machine → transport registry → CRDT cleanup → cryptographic auth → build agent polish.
Architecture Roadmap: Transports, Services, State Machine, CRDT
Spike 613 — April 2026
This document captures the current architecture across four key layers and charts the recommended next steps for each.
1. Current State
1.1 Service Layer
Stories 604–619 established a clean service extraction pattern. The
server/src/service/ directory now has 22 sub-modules, each following the
functional-core / imperative-shell convention documented in
service-modules.md.
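To make the convention concrete, here is a minimal, hypothetical sketch of the functional-core / imperative-shell split (all names invented for illustration; the real conventions are documented in service-modules.md):

```rust
// Functional core: a pure decision function, trivially unit-testable.
fn next_retry_delay_secs(attempt: u32) -> u64 {
    // Exponential backoff, capped at 60 seconds.
    (1u64 << attempt.min(6)).min(60)
}

// Imperative shell: performs I/O, delegates all decisions to the core.
fn shell_retry_loop(attempt: u32) -> u64 {
    let delay = next_retry_delay_secs(attempt);
    // A real shell would sleep(delay) here and re-issue the request.
    delay
}

fn main() {
    assert_eq!(next_retry_delay_secs(0), 1);
    assert_eq!(next_retry_delay_secs(10), 60); // cap applies
    assert_eq!(shell_retry_loop(2), 4);
}
```

The point of the split is that everything worth testing lives in the pure core, while the shell stays thin enough that its bugs are obvious.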
Extracted so far:
agents, anthropic, bot_command, common, diagnostics, events,
file_io, gateway, git_ops, health, merge, notifications, oauth,
pipeline, project, qa, settings, shell, story, timer, wizard,
ws
Remaining in HTTP handlers (see future-extractions.md):
The list there was written before stories 615–619. After those stories landed,
the remaining surface is smaller. The HTTP handlers still containing inline
business logic are: http/ws.rs (WebSocket dispatch) and scattered ad-hoc
helpers in http/mcp/ that have not yet been migrated to typed service modules.
1.2 Chat Transports
Four transport backends implement ChatTransport (defined in chat/mod.rs):
| Transport | Connection model | Rooms / channels |
|---|---|---|
| Matrix | Long-lived WebSocket to homeserver | Dynamic (per-room history) |
| Slack | HTTP webhook (Events API) | Fixed at startup from bot.toml |
| WhatsApp | HTTP webhook (Meta Graph API or Twilio) | Ambient (tracked active senders) |
| Discord | Gateway WebSocket + REST | Fixed at startup from bot.toml |
All four are instantiated manually in main.rs (~lines 567–690) and passed into
AppContext. Stage-transition notifications are pushed through
service/notifications/.
Known issue (Bug 501): The Matrix bot spawns its own TimerStore instead of
consuming the shared AppContext.timer_store. This means MCP-tool cancellations
and the bot's tick loop see different in-memory state.
1.3 Pipeline State Machine
server/src/pipeline_state.rs provides a typed, compile-time-safe state machine
that replaces the old stringly-typed CRDT views.
Synced stages (all nodes converge):
Backlog → Coding → Qa → Merge { feature_branch, commits_ahead: NonZeroU32 }
→ Done { merged_at, merge_commit }
→ Archived { archived_at, reason }
ArchiveReason subsumes the old blocked, merge_failure, and review_hold
flags: Completed | Abandoned | Superseded | Blocked | MergeFailed | ReviewHeld.
NonZeroU32 in Merge makes zero-commit merges structurally impossible.
Per-node execution state (local, not replicated):
Idle → Pending → Running → RateLimited → Completed
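The synced-stage shape above can be sketched as a Rust enum. This is a minimal sketch of the shape implied by the stage list, not the real definition; the actual types (and timestamp field types) live in server/src/pipeline_state.rs and may differ:

```rust
use std::num::NonZeroU32;

#[derive(Debug, Clone, PartialEq)]
enum PipelineItem {
    Backlog,
    Coding,
    Qa,
    // NonZeroU32 makes a zero-commit Merge unrepresentable.
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merged_at: String, merge_commit: String },
    Archived { archived_at: String, reason: ArchiveReason },
}

#[derive(Debug, Clone, PartialEq)]
enum ArchiveReason {
    Completed,
    Abandoned,
    Superseded,
    Blocked,
    MergeFailed,
    ReviewHeld,
}

fn main() {
    // NonZeroU32::new(0) is None, so a zero-commit Merge cannot be built.
    assert!(NonZeroU32::new(0).is_none());
    let item = PipelineItem::Merge {
        feature_branch: "feature/613-roadmap".into(),
        commits_ahead: NonZeroU32::new(3).unwrap(),
    };
    assert!(matches!(item, PipelineItem::Merge { .. }));
}
```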
Status: The typed state machine is defined and the projection layer
(PipelineItemView → PipelineItem via TryFrom) is in place. Consumer
migration — replacing ad-hoc string comparisons across the codebase — is the
remaining work (tracked by Story 520).
1.4 CRDT Layer
server/src/crdt_state.rs + crdt_sync.rs form the distributed-state
foundation:
- Document model: PipelineDoc { items: ListCrdt<PipelineItemCrdt>, nodes: ListCrdt<NodePresenceCrdt> }
- Registers: LwwRegisterCrdt<T> for all mutable fields
- Persistence: ops stored in SQLite (pipeline.db); CrdtEvent broadcast on every stage change
- Sync protocol: WebSocket /crdt-sync — bulk dump on connect (text), individual SignedOps in real time (binary)
- Backpressure: slow peers are disconnected; they reconnect and get a fresh bulk dump
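For readers new to the CRDT layer, the last-writer-wins register semantics behind a type like LwwRegisterCrdt<T> can be sketched as follows. This is a hedged illustration, assuming Lamport timestamps with an author-id tiebreak; the real type in crdt_state.rs may use different fields:

```rust
#[derive(Debug, Clone, PartialEq)]
struct LwwRegister<T> {
    value: T,
    timestamp: u64, // Lamport time of the last write
    author: u64,    // tiebreak so concurrent writes converge deterministically
}

impl<T: Clone> LwwRegister<T> {
    // Keep whichever write has the greater (timestamp, author) pair.
    fn merge(&mut self, other: &LwwRegister<T>) {
        if (other.timestamp, other.author) > (self.timestamp, self.author) {
            *self = other.clone();
        }
    }
}

fn main() {
    let mut a = LwwRegister { value: "qa", timestamp: 4, author: 1 };
    let b = LwwRegister { value: "merge", timestamp: 5, author: 2 };
    a.merge(&b);
    assert_eq!(a.value, "merge");
    // Re-merging the older write is a no-op: merge order does not matter.
    a.merge(&LwwRegister { value: "qa", timestamp: 4, author: 1 });
    assert_eq!(a.value, "merge");
}
```

Because merge is commutative, associative, and idempotent, every node converges on the same winning write regardless of delivery order.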
Filesystem shadows (huskies/work/) are now a secondary output only — CRDT is
the source of truth. Several clean-up stories (513, 517) remain backlogged to
remove the remaining fallback paths.
2. Roadmap
Phase A — Finish the State Machine Migration (Story 520)
Goal: Every pipeline query uses the typed PipelineItem enum instead of
raw string comparisons on stage.
Work:
- Replace stage == "current"/"qa"/"merge" patterns in agents/, http/mcp/, chat/commands/, and gateway.rs with matches!(item, PipelineItem::Coding) etc.
- Remove the PipelineItemView → string projection paths once all consumers use the typed enum.
- Add exhaustive match tests in pipeline_state.rs so new stages cause compile-time failures, not silent mismatches.
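A hedged before/after sketch of the consumer migration (the enum here is a simplified stand-in for the real PipelineItem):

```rust
#[derive(Debug)]
enum PipelineItem {
    Backlog,
    Coding,
    Qa,
    Merge,
}

// Before: stringly-typed. A typo compiles fine and fails silently at runtime.
fn is_in_qa_old(stage: &str) -> bool {
    stage == "qa"
}

// After: typed. matches! is concise at call sites...
fn is_in_qa(item: &PipelineItem) -> bool {
    matches!(item, PipelineItem::Qa)
}

// ...and an exhaustive match (no catch-all arm) turns a newly added stage
// into a compile error in every consumer, not a silent mismatch.
fn describe(item: &PipelineItem) -> &'static str {
    match item {
        PipelineItem::Backlog => "backlog",
        PipelineItem::Coding => "coding",
        PipelineItem::Qa => "qa",
        PipelineItem::Merge => "merge",
    }
}

fn main() {
    assert!(is_in_qa_old("qa"));
    assert!(is_in_qa(&PipelineItem::Qa));
    assert_eq!(describe(&PipelineItem::Coding), "coding");
}
```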
Phase B — Transport Registry Abstraction
Goal: Replace the manual transport wiring in main.rs with a pluggable
registry, making it easy to add or remove transports without modifying the
startup sequence.
Work:
- Define a TransportRegistry that holds Vec<Box<dyn ChatTransport>> keyed by TransportKind (Matrix, Slack, WhatsApp, Discord).
- Move the per-transport instantiation logic from main.rs into service/transport/ following the service module conventions.
- Unify webhook signature verification (currently duplicated between Slack and WhatsApp) into a shared service/transport/verify.rs.
- Fix Bug 501: pass the shared AppContext.timer_store into the Matrix bot instead of spawning a private instance.
- Unify message history persistence (each transport currently owns a separate history file format) into a common service/transport/history.rs.
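The registry shape described above might look like the following. This is a sketch under assumptions: the trait methods and lookup API are invented for illustration and are not the real ChatTransport interface:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TransportKind {
    Matrix,
    Slack,
    WhatsApp,
    Discord,
}

// Illustrative stand-in for the real ChatTransport trait in chat/mod.rs.
trait ChatTransport {
    fn kind(&self) -> TransportKind;
    fn send(&self, room: &str, text: &str) -> Result<(), String>;
}

#[derive(Default)]
struct TransportRegistry {
    transports: Vec<Box<dyn ChatTransport>>,
}

impl TransportRegistry {
    fn register(&mut self, t: Box<dyn ChatTransport>) {
        self.transports.push(t);
    }

    fn get(&self, kind: TransportKind) -> Option<&dyn ChatTransport> {
        self.transports
            .iter()
            .find(|t| t.kind() == kind)
            .map(|t| t.as_ref())
    }
}

// Toy transport used only to exercise the registry.
struct FakeSlack;
impl ChatTransport for FakeSlack {
    fn kind(&self) -> TransportKind { TransportKind::Slack }
    fn send(&self, _room: &str, _text: &str) -> Result<(), String> { Ok(()) }
}

fn main() {
    let mut reg = TransportRegistry::default();
    reg.register(Box::new(FakeSlack));
    assert!(reg.get(TransportKind::Slack).is_some());
    assert!(reg.get(TransportKind::Matrix).is_none());
}
```

With this shape, adding or removing a transport becomes a single register call at startup instead of an edit to the wiring block in main.rs.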
Phase C — CRDT Cleanup (Stories 511, 513, 517, 518, 519, 521)
Goal: Remove all legacy filesystem-first paths and complete the CRDT-as-source-of-truth migration.
Priority order (based on risk/value):
- 519 — Mergemaster must detect zero-commits-ahead and fail loudly instead of silently exiting. Structural fix: Merge { commits_ahead: NonZeroU32 } already enforces this — just ensure mergemaster reads from the typed enum.
- 518 — apply_and_persist should log when the persist tx fails instead of silently dropping ops.
- 513 — Startup reconciliation pass: detect drift between CRDT pipeline items and filesystem shadows, heal or report.
- 517 — Remove filesystem shadow fallback paths from lifecycle.rs.
- 521 — MCP HTTP capability to write a CRDT tombstone-delete op, clearing a story from in-memory state cleanly.
- 511 — Lamport clock inner seq resets to 1 on restart instead of resuming from max(own_author_seq) + 1. Low risk to fix, high risk to leave.
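The 511 fix reduces to one rule at startup. A minimal sketch (the function name is illustrative, not the real API):

```rust
// On restart, resume the per-author Lamport sequence from the highest
// persisted own-author seq, rather than resetting to 1.
fn resume_author_seq(max_persisted_own_seq: Option<u64>) -> u64 {
    match max_persisted_own_seq {
        Some(max) => max + 1, // continue the monotonic sequence
        None => 1,            // genuinely fresh node: start at 1
    }
}

fn main() {
    assert_eq!(resume_author_seq(None), 1);
    assert_eq!(resume_author_seq(Some(41)), 42);
}
```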
Phase D — Distributed Node Authentication (Story 480)
Goal: Cryptographic node identity for the distributed mesh.
Nodes already carry an Ed25519 pubkey as their node_id in NodePresenceCrdt.
Work:
- Sign each SignedOp with the node's Ed25519 key before broadcast.
- Verify signatures on receipt in crdt_sync.rs before applying ops.
- Expose the node's public key via NodePresenceCrdt.address so peers can bootstrap trust.
- Add a key-rotation path for long-lived nodes.
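The verify-before-apply gate can be sketched as below. To keep the example self-contained, the signature scheme is abstracted behind a trait with a toy implementation; the real code would verify Ed25519 signatures against the node_id public key:

```rust
struct SignedOp {
    author: [u8; 32],   // node_id: the author's Ed25519 public key
    payload: Vec<u8>,   // serialized CRDT op
    signature: Vec<u8>, // signature over the payload
}

trait SignatureScheme {
    fn verify(&self, pubkey: &[u8; 32], msg: &[u8], sig: &[u8]) -> bool;
}

// Drop unverifiable ops before they can touch CRDT state.
fn apply_if_valid<S: SignatureScheme>(
    scheme: &S,
    op: &SignedOp,
    apply: impl FnOnce(&[u8]),
) -> bool {
    if scheme.verify(&op.author, &op.payload, &op.signature) {
        apply(&op.payload);
        true
    } else {
        false
    }
}

// Toy scheme for demonstration only: "valid" iff signature equals payload.
struct ToyScheme;
impl SignatureScheme for ToyScheme {
    fn verify(&self, _pk: &[u8; 32], msg: &[u8], sig: &[u8]) -> bool {
        msg == sig
    }
}

fn main() {
    let good = SignedOp { author: [0; 32], payload: vec![1, 2], signature: vec![1, 2] };
    let bad = SignedOp { author: [0; 32], payload: vec![1, 2], signature: vec![9] };
    let mut applied = 0;
    assert!(apply_if_valid(&ToyScheme, &good, |_| applied += 1));
    assert!(!apply_if_valid(&ToyScheme, &bad, |_| applied += 1));
    assert_eq!(applied, 1);
}
```

The design point is that verification is the only path into apply: an op that fails verification never reaches the CRDT, so a compromised peer cannot forge state on behalf of another node_id.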
Phase E — Build Agent Mode Polish (Story 479)
Goal: Stable headless build-agent mode (huskies --rendezvous) for
distributing story processing across multiple machines.
Work:
- Resolve claim-timeout races: if a node claims a story and dies, the claim should expire after a configurable TTL and be re-claimable.
- Stale merge-job lock (Bug 498) — a lock left by a dead node should be detectable and clearable by the surviving cluster.
- CRDT Lamport clock fix (511) is a prerequisite — distributed agents need monotonically increasing sequences to converge correctly.
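The claim-TTL rule reduces to a single pure check. A sketch under assumptions (the Claim struct and function name are illustrative):

```rust
use std::time::{Duration, Instant};

struct Claim {
    claimed_at: Instant,
}

// A claim from a dead node becomes re-claimable once its TTL elapses.
fn claim_expired(claim: &Claim, ttl: Duration, now: Instant) -> bool {
    now.duration_since(claim.claimed_at) >= ttl
}

fn main() {
    let t0 = Instant::now();
    let claim = Claim { claimed_at: t0 };
    let ttl = Duration::from_secs(300);
    // Fresh claim is still held; after the TTL it becomes re-claimable.
    assert!(!claim_expired(&claim, ttl, t0 + Duration::from_secs(10)));
    assert!(claim_expired(&claim, ttl, t0 + Duration::from_secs(301)));
}
```

Keeping the check pure (now passed in, not read from a clock) makes the race-prone part trivially testable, in keeping with the functional-core convention used elsewhere in the codebase.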
3. Dependency Graph
Phase A (State Machine)         Phase C (CRDT Cleanup: 511, 518, 513, 517, 521, 519)
        ↓                               ↓
Phase B (Transport Registry)    Phase D (Cryptographic Auth)
                                        ↓
                                Phase E (Build Agent Polish)
Phase A and C can progress in parallel. Phase B is independent of C/D/E. Phase D requires Phase C (especially 511 and 518). Phase E requires Phase D.
4. What NOT to Do
- Don't split crdt_state.rs prematurely. It's large but internally cohesive. A split should wait until the cleanup stories (Phase C) are done.
- Don't add a transport abstraction layer before fixing Bug 501. A registry that instantiates a broken Matrix bot just propagates the bug.
- Don't extract http/ws.rs to a service module before Phase A is done. The WebSocket handler touches pipeline state in string form; migrating it while the state machine migration is in progress will cause double-churn.