huskies/docs/architecture/roadmap.md
dave 23fd70c131 spike(613): add architecture roadmap for transports, services, state machine, CRDT
Documents current state and recommended next steps across four layers:
- Service layer: 21 modules extracted, remaining work in http/ws.rs and http/mcp/
- Chat transports: 4 backends (Matrix/Slack/WhatsApp/Discord), Bug 501 noted
- Pipeline state machine: typed enum in place, consumer migration (Story 520) remaining
- CRDT: source-of-truth migration ongoing, cleanup stories 511/513/517/518/519/521 prioritised

Phases A–E chart the dependency order: state machine → transport registry →
CRDT cleanup → cryptographic auth → build agent polish.
2026-04-24 22:57:48 +00:00


Architecture Roadmap: Transports, Services, State Machine, CRDT

Spike 613 — April 2026

This document captures the current architecture across four key layers and charts the recommended next steps for each.


1. Current State

1.1 Service Layer

Stories 604–619 established a clean service extraction pattern. The server/src/service/ directory now has 21 sub-modules, each following the functional-core / imperative-shell convention documented in service-modules.md.

Extracted so far: agents, anthropic, bot_command, common, diagnostics, events, file_io, gateway, git_ops, health, merge, notifications, oauth, pipeline, project, qa, settings, shell, story, timer, wizard, ws

Remaining in HTTP handlers (see future-extractions.md): The list there was written before stories 615–619 landed, so the remaining surface is now smaller than it suggests. The HTTP handlers that still contain inline business logic are http/ws.rs (WebSocket dispatch) and scattered ad-hoc helpers in http/mcp/ that have not yet been migrated to typed service modules.

1.2 Chat Transports

Four transport backends implement ChatTransport (defined in chat/mod.rs):

| Transport | Connection model | Rooms / channels |
|-----------|------------------|------------------|
| Matrix | Long-lived WebSocket to homeserver | Dynamic (per-room history) |
| Slack | HTTP webhook (Events API) | Fixed at startup from bot.toml |
| WhatsApp | HTTP webhook (Meta Graph API or Twilio) | Ambient (tracked active senders) |
| Discord | Gateway WebSocket + REST | Fixed at startup from bot.toml |

All four are instantiated manually in main.rs (~lines 567–690) and passed into AppContext. Stage-transition notifications are pushed through service/notifications/.

Known issue (Bug 501): The Matrix bot spawns its own TimerStore instead of consuming the shared AppContext.timer_store. This means MCP-tool cancellations and the bot's tick loop see different in-memory state.
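
The split-state problem in Bug 501 can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the real TimerStore, AppContext and Matrix bot have different fields and methods, and the real store may use a different lock type. The only point shown is that handing the bot a clone of one shared handle makes cancellations visible to its tick loop.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the real TimerStore: a set of active timer ids.
#[derive(Default)]
pub struct TimerStore {
    active: HashSet<u64>,
}

impl TimerStore {
    pub fn schedule(&mut self, id: u64) {
        self.active.insert(id);
    }

    /// Returns true if the timer existed and was removed.
    pub fn cancel(&mut self, id: u64) -> bool {
        self.active.remove(&id)
    }

    pub fn is_active(&self, id: u64) -> bool {
        self.active.contains(&id)
    }
}

/// Assumed shape of the shared handle that AppContext would own.
pub type SharedTimerStore = Arc<Mutex<TimerStore>>;

/// Hypothetical bot that receives the shared store instead of creating its own.
pub struct MatrixBot {
    timers: SharedTimerStore,
}

impl MatrixBot {
    pub fn new(timers: SharedTimerStore) -> Self {
        Self { timers }
    }

    /// What a tick loop would ask: is this timer still pending?
    pub fn sees_timer(&self, id: u64) -> bool {
        self.timers.lock().unwrap().is_active(id)
    }
}
```

With this wiring, a cancellation made through the shared handle (for example by an MCP tool) is observed by the bot on its next tick, because both sides lock the same store.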

1.3 Pipeline State Machine

server/src/pipeline_state.rs provides a typed, compile-time-safe state machine that replaces the old stringly-typed CRDT views.

Synced stages (all nodes converge):

Backlog → Coding → Qa → Merge { feature_branch, commits_ahead: NonZeroU32 }
       → Done { merged_at, merge_commit }
       → Archived { archived_at, reason }

ArchiveReason subsumes the old blocked, merge_failure, and review_hold flags: Completed | Abandoned | Superseded | Blocked | MergeFailed | ReviewHeld.

NonZeroU32 in Merge makes zero-commit merges structurally impossible.

Per-node execution state (local, not replicated): Idle → Pending → Running → RateLimited → Completed

Status: The typed state machine is defined and the projection layer (PipelineItemView → PipelineItem via TryFrom) is in place. Consumer migration — replacing ad-hoc string comparisons across the codebase — is the remaining work (tracked by Story 520).
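
To make the projection step concrete, here is a minimal sketch of a PipelineItemView → PipelineItem conversion via TryFrom. The field names on the view, the error type, the "backlog" stage string and the mapping of "current" to Coding are assumptions for illustration; the real definitions in pipeline_state.rs carry more stages and fields.

```rust
use std::convert::TryFrom;
use std::num::NonZeroU32;

/// Hypothetical loosely-typed view, standing in for PipelineItemView.
pub struct PipelineItemView {
    pub stage: String,
    pub feature_branch: Option<String>,
    pub commits_ahead: u32,
}

/// Trimmed typed item; Done and Archived are omitted for brevity.
#[derive(Debug, PartialEq)]
pub enum PipelineItem {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
}

/// Hypothetical error type for failed projections.
#[derive(Debug, PartialEq)]
pub enum ProjectionError {
    UnknownStage(String),
    MissingBranch,
    ZeroCommitsAhead,
}

impl TryFrom<PipelineItemView> for PipelineItem {
    type Error = ProjectionError;

    fn try_from(view: PipelineItemView) -> Result<Self, Self::Error> {
        match view.stage.as_str() {
            "backlog" => Ok(PipelineItem::Backlog),
            // Assumed mapping: the legacy "current" string denotes Coding.
            "current" => Ok(PipelineItem::Coding),
            "qa" => Ok(PipelineItem::Qa),
            "merge" => {
                let feature_branch = view
                    .feature_branch
                    .ok_or(ProjectionError::MissingBranch)?;
                // NonZeroU32::new returns None for 0, so a zero-commit
                // merge is rejected here and cannot be constructed.
                let commits_ahead = NonZeroU32::new(view.commits_ahead)
                    .ok_or(ProjectionError::ZeroCommitsAhead)?;
                Ok(PipelineItem::Merge { feature_branch, commits_ahead })
            }
            other => Err(ProjectionError::UnknownStage(other.to_string())),
        }
    }
}
```

Invalid data is rejected once, at the projection boundary, so consumers that hold a PipelineItem never need to re-validate it.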

1.4 CRDT Layer

server/src/crdt_state.rs + crdt_sync.rs form the distributed-state foundation:

  • Document model: PipelineDoc { items: ListCrdt<PipelineItemCrdt>, nodes: ListCrdt<NodePresenceCrdt> }
  • Registers: LwwRegisterCrdt<T> for all mutable fields
  • Persistence: Ops stored in SQLite (pipeline.db); CrdtEvent broadcast on every stage change
  • Sync protocol: WebSocket /crdt-sync — bulk dump on connect (text), individual SignedOps in real-time (binary)
  • Backpressure: Slow peers are disconnected; they reconnect and get a fresh bulk dump
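
For readers unfamiliar with last-writer-wins registers, the following is a generic sketch of the idea, not the project's LwwRegisterCrdt<T>. It assumes writes are ordered by a (seq, author) pair with the author id as tie-breaker; the real type may order writes differently.

```rust
/// Minimal last-writer-wins register. Writes are ordered by (seq, author);
/// this ordering is an assumption made for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct LwwRegister<T> {
    value: T,
    seq: u64,
    author: u64,
}

impl<T: Clone> LwwRegister<T> {
    pub fn new(value: T, author: u64) -> Self {
        Self { value, seq: 0, author }
    }

    /// Local write: bump the sequence number and record the writer.
    pub fn set(&mut self, value: T, author: u64) {
        self.seq += 1;
        self.value = value;
        self.author = author;
    }

    /// Merge a remote copy: the copy with the higher (seq, author) wins.
    /// Merging is commutative and idempotent, so replicas converge
    /// regardless of delivery order.
    pub fn merge(&mut self, other: &Self) {
        if (other.seq, other.author) > (self.seq, self.author) {
            *self = other.clone();
        }
    }

    pub fn get(&self) -> &T {
        &self.value
    }
}
```

Two replicas that write concurrently and then exchange state end up with the same value, which is the property the sync protocol relies on.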

Filesystem shadows (huskies/work/) are now a secondary output only; the CRDT is the source of truth. Cleanup stories 513 and 517 remain in the backlog to remove the remaining fallback paths.


2. Roadmap

Phase A — Finish the State Machine Migration (Story 520)

Goal: Every pipeline query uses the typed PipelineItem enum instead of raw string comparisons on stage.

Work:

  1. Replace stage == "current" / "qa" / "merge" patterns in agents/, http/mcp/, chat/commands/, and gateway.rs with matches!(item, PipelineItem::Coding) etc.
  2. Remove the PipelineItemView → string projection paths once all consumers use the typed enum.
  3. Add exhaustive match tests in pipeline_state.rs so new stages cause compile-time failures, not silent mismatches.
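
Steps 1 and 3 above can be sketched as follows. The Stage enum is a simplified, payload-free stand-in for PipelineItem, and the is_active grouping is a hypothetical example of a consumer query, not a function from the codebase.

```rust
/// Simplified stand-in for the typed pipeline enum (payloads omitted).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Stage {
    Backlog,
    Coding,
    Qa,
    Merge,
    Done,
    Archived,
}

/// Typed replacement for a check such as `stage == "current"`.
pub fn is_coding(stage: Stage) -> bool {
    matches!(stage, Stage::Coding)
}

/// Exhaustive match with no wildcard arm: adding a new variant to Stage
/// makes this function fail to compile until the variant is handled.
pub fn is_active(stage: Stage) -> bool {
    match stage {
        Stage::Coding | Stage::Qa | Stage::Merge => true,
        Stage::Backlog | Stage::Done | Stage::Archived => false,
    }
}
```

The design choice is to avoid `_ =>` arms in queries like is_active, since a wildcard would silently classify any future stage.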

Phase B — Transport Registry Abstraction

Goal: Replace the manual transport wiring in main.rs with a pluggable registry, making it easy to add or remove transports without modifying the startup sequence.

Work:

  1. Define a TransportRegistry that holds Vec<Box<dyn ChatTransport>> keyed by TransportKind (Matrix, Slack, WhatsApp, Discord).
  2. Move the per-transport instantiation logic from main.rs into service/transport/ following the service module conventions.
  3. Unify webhook signature verification (currently duplicated between Slack and WhatsApp) into a shared service/transport/verify.rs.
  4. Fix Bug 501: pass the shared AppContext.timer_store into the Matrix bot instead of spawning a private instance.
  5. Unify message history persistence (each transport currently owns a separate history file format) into a common service/transport/history.rs.
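
One possible shape for the registry in step 1 is sketched below. The ChatTransport methods shown here are hypothetical (the real trait in chat/mod.rs is not reproduced in this document), and a HashMap is used as one way to realise "keyed by TransportKind". RecordingTransport is a test double invented for the example.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum TransportKind {
    Matrix,
    Slack,
    WhatsApp,
    Discord,
}

/// Hypothetical minimal trait shape, for illustration only.
pub trait ChatTransport {
    fn kind(&self) -> TransportKind;
    fn send(&mut self, room: &str, body: &str) -> Result<(), String>;
}

#[derive(Default)]
pub struct TransportRegistry {
    transports: HashMap<TransportKind, Box<dyn ChatTransport>>,
}

impl TransportRegistry {
    /// Registers a transport under its kind, returning any transport
    /// that was previously registered for the same kind.
    pub fn register(&mut self, t: Box<dyn ChatTransport>) -> Option<Box<dyn ChatTransport>> {
        self.transports.insert(t.kind(), t)
    }

    /// Sends to every registered transport and collects the failures.
    pub fn broadcast(&mut self, room: &str, body: &str) -> Vec<(TransportKind, String)> {
        let mut failures = Vec::new();
        for (kind, transport) in self.transports.iter_mut() {
            if let Err(e) = transport.send(room, body) {
                failures.push((*kind, e));
            }
        }
        failures
    }

    pub fn len(&self) -> usize {
        self.transports.len()
    }
}

/// Test double: optionally fails every send.
pub struct RecordingTransport {
    pub kind: TransportKind,
    pub fail: bool,
}

impl ChatTransport for RecordingTransport {
    fn kind(&self) -> TransportKind {
        self.kind
    }

    fn send(&mut self, _room: &str, _body: &str) -> Result<(), String> {
        if self.fail {
            Err("send failed".to_string())
        } else {
            Ok(())
        }
    }
}
```

Under this sketch, startup code would only call register for each configured transport, and adding a fifth backend would not touch the notification path.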

Phase C — CRDT Cleanup (Stories 511, 513, 517, 518, 519, 521)

Goal: Remove all legacy filesystem-first paths and complete the CRDT-as-source-of-truth migration.

Priority order (based on risk/value):

  1. 519 — Mergemaster must detect zero-commits-ahead and fail loudly instead of silently exiting. Structural fix: Merge { commits_ahead: NonZeroU32 } already enforces this — just ensure mergemaster reads from the typed enum.
  2. 518 — apply_and_persist should log when the persist tx fails instead of silently dropping ops.
  3. 513 — Startup reconciliation pass: detect drift between CRDT pipeline items and filesystem shadows, heal or report.
  4. 517 — Remove filesystem shadow fallback paths from lifecycle.rs.
  5. 521 — MCP HTTP capability to write a CRDT tombstone-delete op, clearing a story from in-memory state cleanly.
  6. 511 — Lamport clock inner seq resets to 1 on restart instead of resuming from max(own_author_seq) + 1. Low risk to fix, high risk to leave.
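
The fix described for story 511 amounts to seeding the clock from persisted state on startup. A minimal sketch, assuming the node can enumerate the sequence numbers of its own previously persisted ops (the LamportClock type and its method names are hypothetical):

```rust
/// Hypothetical per-author sequence counter.
pub struct LamportClock {
    next_seq: u64,
}

impl LamportClock {
    /// Resume from persisted ops: the next sequence is
    /// max(own_author_seq) + 1, or 1 when no ops exist yet.
    pub fn resume<I: IntoIterator<Item = u64>>(own_author_seqs: I) -> Self {
        let max_seen = own_author_seqs.into_iter().max().unwrap_or(0);
        Self { next_seq: max_seen + 1 }
    }

    /// Returns the sequence number for a new local op and advances the clock.
    pub fn tick(&mut self) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        seq
    }
}
```

Resetting to 1 on restart would reissue sequence numbers that peers have already seen from this author, which is why the sketch derives the starting point from stored ops rather than from a constant.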

Phase D — Distributed Node Authentication (Story 480)

Goal: Cryptographic node identity for the distributed mesh.

Nodes already carry an Ed25519 pubkey as their node_id in NodePresenceCrdt. Work:

  1. Sign each SignedOp with the node's Ed25519 key before broadcast.
  2. Verify signatures on receipt in crdt_sync.rs before applying ops.
  3. Expose the node's public key via NodePresenceCrdt.address so peers can bootstrap trust.
  4. Add a key-rotation path for long-lived nodes.

Phase E — Build Agent Mode Polish (Story 479)

Goal: Stable headless build-agent mode (huskies --rendezvous) for distributing story processing across multiple machines.

Work:

  1. Resolve claim-timeout races: if a node claims a story and dies, the claim should expire after a configurable TTL and be re-claimable.
  2. Stale merge-job lock (Bug 498) — a lock left by a dead node should be detectable and clearable by the surviving cluster.
  3. CRDT Lamport clock fix (511) is a prerequisite — distributed agents need monotonically increasing sequences to converge correctly.
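
The claim-expiry rule in step 1 can be sketched as a pure function. This is a local illustration only: the Claim struct, the try_claim helper and the use of caller-supplied wall-clock seconds are assumptions, and the sketch does not address how claims are replicated or how clock skew between nodes is handled.

```rust
use std::time::Duration;

/// Hypothetical claim record: who holds the story and since when.
#[derive(Debug, Clone, PartialEq)]
pub struct Claim {
    pub node_id: String,
    pub claimed_at_secs: u64,
}

impl Claim {
    /// A claim is expired once at least `ttl` has elapsed since it was made.
    pub fn is_expired(&self, now_secs: u64, ttl: Duration) -> bool {
        now_secs.saturating_sub(self.claimed_at_secs) >= ttl.as_secs()
    }
}

/// Attempts to take the claim. Succeeds when the slot is empty, when the
/// existing claim has expired, or when the caller already holds it (refresh).
pub fn try_claim(slot: &mut Option<Claim>, node_id: &str, now_secs: u64, ttl: Duration) -> bool {
    let claimable = match slot.as_ref() {
        None => true,
        Some(existing) => existing.node_id == node_id || existing.is_expired(now_secs, ttl),
    };
    if claimable {
        *slot = Some(Claim {
            node_id: node_id.to_string(),
            claimed_at_secs: now_secs,
        });
    }
    claimable
}
```

Treating a repeat claim by the same node as a refresh lets a live node keep its story by re-claiming before the TTL elapses, while a dead node's claim lapses on its own.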

3. Dependency Graph

Phase A (State Machine)
    ↓
Phase B (Transport Registry)     Phase C (CRDT Cleanup: 511, 518, 513, 517, 521, 519)
                                      ↓
                                 Phase D (Cryptographic Auth)
                                      ↓
                                 Phase E (Build Agent Polish)

Phases A and C can progress in parallel. Phase B is independent of C/D/E. Phase D requires Phase C (especially 511 and 518). Phase E requires Phase D.


4. What NOT to Do

  • Don't split crdt_state.rs prematurely. It's large but internally cohesive. A split should wait until the cleanup stories (Phase C) are done.
  • Don't add a transport abstraction layer before fixing Bug 501. A registry that instantiates a broken Matrix bot just propagates the bug.
  • Don't extract http/ws.rs to a service module before Phase A is done. The WebSocket handler touches pipeline state in string form; migrating it while the state machine migration is in progress will cause double-churn.