dave 23fd70c131 spike(613): add architecture roadmap for transports, services, state machine, CRDT
Documents current state and recommended next steps across four layers:
- Service layer: 21 modules extracted, remaining work in http/ws.rs and http/mcp/
- Chat transports: 4 backends (Matrix/Slack/WhatsApp/Discord), Bug 501 noted
- Pipeline state machine: typed enum in place, consumer migration (Story 520) remaining
- CRDT: source-of-truth migration ongoing, cleanup stories 511/513/517/518/519/521 prioritised

Phases A–E chart the dependency order: state machine → transport registry →
CRDT cleanup → cryptographic auth → build agent polish.
2026-04-24 22:57:48 +00:00

# Architecture Roadmap: Transports, Services, State Machine, CRDT
*Spike 613 — April 2026*
This document captures the current architecture across four key layers and charts
the recommended next steps for each.
---
## 1. Current State
### 1.1 Service Layer
Stories 604–619 established a clean service extraction pattern. The
`server/src/service/` directory now has 21 sub-modules, each following the
functional-core / imperative-shell convention documented in
[service-modules.md](service-modules.md).
**Extracted so far:**
`agents`, `anthropic`, `bot_command`, `common`, `diagnostics`, `events`,
`file_io`, `gateway`, `git_ops`, `health`, `merge`, `notifications`, `oauth`,
`pipeline`, `project`, `qa`, `settings`, `shell`, `story`, `timer`, `wizard`,
`ws`
**Remaining in HTTP handlers** (see [future-extractions.md](future-extractions.md)):
The list there was written before stories 615–619. After those stories landed,
the remaining surface is smaller. The HTTP handlers still containing inline
business logic are: `http/ws.rs` (WebSocket dispatch) and scattered ad-hoc
helpers in `http/mcp/` that have not yet been migrated to typed service modules.
### 1.2 Chat Transports
Four transport backends implement `ChatTransport` (defined in `chat/mod.rs`):
| Transport | Connection model | Rooms / channels |
|-----------|-----------------|-----------------|
| Matrix | Long-lived WebSocket to homeserver | Dynamic (per-room history) |
| Slack | HTTP webhook (Events API) | Fixed at startup from bot.toml |
| WhatsApp | HTTP webhook (Meta Graph API or Twilio) | Ambient (tracked active senders) |
| Discord | Gateway WebSocket + REST | Fixed at startup from bot.toml |
All four are instantiated manually in `main.rs` (~lines 567–690) and passed into
`AppContext`. Stage-transition notifications are pushed through
`service/notifications/`.
**Known issue (Bug 501):** The Matrix bot spawns its own `TimerStore` instead of
consuming the shared `AppContext.timer_store`. This means MCP-tool cancellations
and the bot's tick loop see different in-memory state.
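
The divergence described in Bug 501 comes down to two independent stores versus one shared handle. The following is a minimal sketch of the shared-handle pattern, not the project's code: the `TimerStore` fields and methods, the `SharedTimerStore` alias, and the `MatrixBot::should_fire` method are all hypothetical stand-ins, and the real `AppContext.timer_store` may use a different lock or wrapper type.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the real `TimerStore`; only the sharing pattern matters.
#[derive(Default)]
pub struct TimerStore {
    // timer id -> story id (illustrative payload)
    timers: HashMap<u64, String>,
}

impl TimerStore {
    pub fn schedule(&mut self, id: u64, story: &str) {
        self.timers.insert(id, story.to_string());
    }

    /// Returns true if a timer was actually removed.
    pub fn cancel(&mut self, id: u64) -> bool {
        self.timers.remove(&id).is_some()
    }

    pub fn is_active(&self, id: u64) -> bool {
        self.timers.contains_key(&id)
    }
}

/// Assumed shape of the shared handle held by the application context.
pub type SharedTimerStore = Arc<Mutex<TimerStore>>;

/// Hypothetical bot that receives the shared handle instead of creating its own store.
pub struct MatrixBot {
    timers: SharedTimerStore,
}

impl MatrixBot {
    pub fn new(timers: SharedTimerStore) -> Self {
        MatrixBot { timers }
    }

    /// What a tick loop would consult before firing a timer.
    pub fn should_fire(&self, id: u64) -> bool {
        self.timers.lock().unwrap().is_active(id)
    }
}
```

Because the bot holds a clone of the same `Arc`, a cancellation made through any other holder of the handle is visible on the bot's next tick.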
### 1.3 Pipeline State Machine
`server/src/pipeline_state.rs` provides a typed, compile-time-safe state machine
that replaces the old stringly-typed CRDT views.
**Synced stages (all nodes converge):**
```
Backlog → Coding → Qa → Merge { feature_branch, commits_ahead: NonZeroU32 }
→ Done { merged_at, merge_commit }
→ Archived { archived_at, reason }
```
`ArchiveReason` subsumes the old `blocked`, `merge_failure`, and `review_hold`
flags: `Completed | Abandoned | Superseded | Blocked | MergeFailed | ReviewHeld`.
`NonZeroU32` in `Merge` makes zero-commit merges structurally impossible.
**Per-node execution state (local, not replicated):**
`Idle → Pending → Running → RateLimited → Completed`
**Status:** The typed state machine is defined and the projection layer
(`PipelineItemView → PipelineItem via TryFrom`) is in place. Consumer
migration — replacing ad-hoc string comparisons across the codebase — is the
remaining work (tracked by Story 520).
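
To make the projection concrete, here is a minimal sketch of a `TryFrom` conversion from a loosely typed view into the typed enum. The view's field names, the `ProjectionError` type, and the `"backlog"` stage string are assumptions made for illustration (only `"current"`, `"qa"`, and `"merge"` appear in this document), and the `Done` and `Archived` stages are omitted for brevity.

```rust
use std::convert::TryFrom;
use std::num::NonZeroU32;

/// Loosely typed view (field names are illustrative assumptions).
#[derive(Debug, Clone, Default)]
pub struct PipelineItemView {
    pub stage: String,
    pub feature_branch: Option<String>,
    pub commits_ahead: Option<u32>,
}

/// Typed item; `Done` and `Archived` are left out to keep the sketch short.
#[derive(Debug, Clone, PartialEq)]
pub enum PipelineItem {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
}

/// Hypothetical error type for failed projections.
#[derive(Debug, PartialEq)]
pub enum ProjectionError {
    UnknownStage(String),
    MissingField(&'static str),
    ZeroCommitsAhead,
}

impl TryFrom<PipelineItemView> for PipelineItem {
    type Error = ProjectionError;

    fn try_from(view: PipelineItemView) -> Result<Self, Self::Error> {
        match view.stage.as_str() {
            "backlog" => Ok(PipelineItem::Backlog),
            "current" => Ok(PipelineItem::Coding),
            "qa" => Ok(PipelineItem::Qa),
            "merge" => {
                let feature_branch = view
                    .feature_branch
                    .ok_or(ProjectionError::MissingField("feature_branch"))?;
                let raw = view
                    .commits_ahead
                    .ok_or(ProjectionError::MissingField("commits_ahead"))?;
                // A zero count is rejected here, so no `Merge` value can carry it.
                let commits_ahead =
                    NonZeroU32::new(raw).ok_or(ProjectionError::ZeroCommitsAhead)?;
                Ok(PipelineItem::Merge { feature_branch, commits_ahead })
            }
            other => Err(ProjectionError::UnknownStage(other.to_string())),
        }
    }
}
```

In this sketch a view with `stage == "merge"` and `commits_ahead == Some(0)` fails with `ZeroCommitsAhead` at the projection boundary, so downstream consumers never see a zero-commit merge.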
### 1.4 CRDT Layer
`server/src/crdt_state.rs` + `crdt_sync.rs` form the distributed-state
foundation:
- **Document model:** `PipelineDoc { items: ListCrdt<PipelineItemCrdt>, nodes: ListCrdt<NodePresenceCrdt> }`
- **Registers:** `LwwRegisterCrdt<T>` for all mutable fields
- **Persistence:** Ops stored in SQLite (`pipeline.db`); `CrdtEvent` broadcast on every stage change
- **Sync protocol:** WebSocket `/crdt-sync` — bulk dump on connect (text), individual `SignedOp`s in real-time (binary)
- **Backpressure:** Slow peers are disconnected; they reconnect and get a fresh bulk dump
**Filesystem shadows** (`huskies/work/`) are now a secondary output only — CRDT is
the source of truth. Several clean-up stories (513, 517) remain backlogged to
remove the remaining fallback paths.
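
For readers unfamiliar with last-writer-wins registers, the following is an illustrative sketch of the merge rule that lets replicas converge regardless of delivery order. It is not the project's `LwwRegisterCrdt<T>`: the `LwwRegister` type, its `(timestamp, author)` stamp, and the numeric author ids are assumptions chosen to keep the example self-contained.

```rust
/// Illustrative last-writer-wins register; not the project's `LwwRegisterCrdt<T>`.
#[derive(Debug, Clone, PartialEq)]
pub struct LwwRegister<T> {
    value: T,
    // (logical timestamp, author id), compared lexicographically, so the
    // author id breaks ties between concurrent writes.
    stamp: (u64, u64),
}

impl<T: Clone> LwwRegister<T> {
    pub fn new(value: T, author: u64) -> Self {
        LwwRegister { value, stamp: (0, author) }
    }

    /// Local write: bump the logical timestamp and record the author.
    pub fn set(&mut self, value: T, author: u64) {
        self.value = value;
        self.stamp = (self.stamp.0 + 1, author);
    }

    /// Merge a remote replica: keep whichever write has the larger stamp.
    /// Applying merges in either order yields the same final state.
    pub fn merge(&mut self, other: &Self) {
        if other.stamp > self.stamp {
            self.value = other.value.clone();
            self.stamp = other.stamp;
        }
    }

    pub fn get(&self) -> &T {
        &self.value
    }
}
```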
---
## 2. Roadmap
### Phase A — Finish the State Machine Migration (Story 520)
**Goal:** Every pipeline query uses the typed `PipelineItem` enum instead of
raw string comparisons on `stage`.
Work:
1. Replace `stage == "current"` / `"qa"` / `"merge"` patterns in `agents/`,
`http/mcp/`, `chat/commands/`, and `gateway.rs` with `matches!(item, PipelineItem::Coding)` etc.
2. Remove the `PipelineItemView` → string projection paths once all consumers
use the typed enum.
3. Add exhaustive match tests in `pipeline_state.rs` so new stages cause
compile-time failures, not silent mismatches.
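
A small sketch of what the migrated consumer code could look like, under stated assumptions: the enum below is a simplified stand-in with payload fields omitted, `is_in_progress` and `stage_label` are hypothetical helpers, and the `"backlog"`, `"done"`, and `"archived"` strings are guesses (only `"current"`, `"qa"`, and `"merge"` appear in this document).

```rust
/// Simplified stand-in for the typed enum (payload fields omitted).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PipelineItem {
    Backlog,
    Coding,
    Qa,
    Merge,
    Done,
    Archived,
}

/// Replaces checks of the form `stage == "current" || stage == "qa" || ...`
/// with a typed predicate built on `matches!`.
pub fn is_in_progress(item: &PipelineItem) -> bool {
    matches!(
        *item,
        PipelineItem::Coding | PipelineItem::Qa | PipelineItem::Merge
    )
}

/// No wildcard arm: adding a variant to `PipelineItem` makes this function
/// fail to compile until the new stage is handled explicitly.
pub fn stage_label(item: &PipelineItem) -> &'static str {
    match *item {
        PipelineItem::Backlog => "backlog",
        PipelineItem::Coding => "current",
        PipelineItem::Qa => "qa",
        PipelineItem::Merge => "merge",
        PipelineItem::Done => "done",
        PipelineItem::Archived => "archived",
    }
}
```

The design point is the absence of a `_ =>` arm in `stage_label`: the compiler then enforces step 3, whereas a `matches!` predicate silently returns `false` for a new variant and still needs a test.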
### Phase B — Transport Registry Abstraction
**Goal:** Replace the manual transport wiring in `main.rs` with a pluggable
registry, making it easy to add or remove transports without modifying the
startup sequence.
Work:
1. Define a `TransportRegistry` that holds the `Box<dyn ChatTransport>` instances,
keyed by `TransportKind` (Matrix, Slack, WhatsApp, Discord).
2. Move the per-transport instantiation logic from `main.rs` into
`service/transport/` following the service module conventions.
3. Unify webhook signature verification (currently duplicated between Slack and
WhatsApp) into a shared `service/transport/verify.rs`.
4. Fix Bug 501: pass the shared `AppContext.timer_store` into the Matrix bot
instead of spawning a private instance.
5. Unify message history persistence (each transport currently owns a separate
history file format) into a common `service/transport/history.rs`.
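
One possible shape for the registry in step 1 is sketched below. Everything here is an assumption for illustration: the real `ChatTransport` trait in `chat/mod.rs` is not shown in this document, so the single `send` method is a hypothetical reduction, and the sketch uses a `HashMap` so that "keyed by `TransportKind`" is literal. `NullTransport` and `OfflineTransport` are test doubles, not real backends.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum TransportKind {
    Matrix,
    Slack,
    WhatsApp,
    Discord,
}

/// Hypothetical, heavily reduced version of the `ChatTransport` trait.
pub trait ChatTransport {
    fn send(&mut self, room: &str, body: &str) -> Result<(), String>;
}

pub struct TransportRegistry {
    transports: HashMap<TransportKind, Box<dyn ChatTransport>>,
}

impl TransportRegistry {
    pub fn new() -> Self {
        TransportRegistry { transports: HashMap::new() }
    }

    /// Registers a transport; returns true if it replaced an existing entry.
    pub fn register(&mut self, kind: TransportKind, transport: Box<dyn ChatTransport>) -> bool {
        self.transports.insert(kind, transport).is_some()
    }

    pub fn send_via(&mut self, kind: TransportKind, room: &str, body: &str) -> Result<(), String> {
        match self.transports.get_mut(&kind) {
            Some(t) => t.send(room, body),
            None => Err(format!("{:?} transport is not registered", kind)),
        }
    }

    /// Sends to every registered transport and collects failures instead of aborting.
    pub fn broadcast(&mut self, room: &str, body: &str) -> Vec<(TransportKind, String)> {
        let mut failures = Vec::new();
        for (kind, t) in self.transports.iter_mut() {
            if let Err(e) = t.send(room, body) {
                failures.push((*kind, e));
            }
        }
        failures
    }
}

/// Test double that accepts every message.
pub struct NullTransport;
impl ChatTransport for NullTransport {
    fn send(&mut self, _room: &str, _body: &str) -> Result<(), String> {
        Ok(())
    }
}

/// Test double that always fails.
pub struct OfflineTransport;
impl ChatTransport for OfflineTransport {
    fn send(&mut self, _room: &str, _body: &str) -> Result<(), String> {
        Err("offline".to_string())
    }
}
```

With a registry like this, the startup sequence would only need to call `register` for each configured backend, and adding a fifth transport would not touch the broadcast path.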
### Phase C — CRDT Cleanup (Stories 511, 513, 517, 518, 519, 521)
**Goal:** Remove all legacy filesystem-first paths and complete the
CRDT-as-source-of-truth migration.
Priority order (based on risk/value):
1. **519** — Mergemaster must detect zero-commits-ahead and fail loudly instead of
silently exiting. Structural fix: `Merge { commits_ahead: NonZeroU32 }` already
enforces this — just ensure mergemaster reads from the typed enum.
2. **518** — `apply_and_persist` should log when the persist tx fails instead of
silently dropping ops.
3. **513** — Startup reconciliation pass: detect drift between CRDT pipeline items
and filesystem shadows, heal or report.
4. **517** — Remove filesystem shadow fallback paths from `lifecycle.rs`.
5. **521** — MCP HTTP capability to write a CRDT tombstone-delete op, clearing a
story from in-memory state cleanly.
6. **511** — Lamport clock inner seq resets to 1 on restart instead of resuming
from `max(own_author_seq) + 1`. Low risk to fix, high risk to leave.
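
The fix described for story 511 amounts to recomputing the next sequence number from the persisted ops at startup. A minimal sketch, assuming a hypothetical `StoredOp` record with `author` and `seq` fields (the real op schema in `pipeline.db` is not shown in this document):

```rust
/// Minimal op record; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredOp {
    pub author: String,
    pub seq: u64,
}

/// Next sequence number for `me` after a restart:
/// `max(own_author_seq) + 1`, or 1 if this node has never written an op.
/// Ops from other authors are ignored, because each author numbers its
/// own ops independently in this sketch.
pub fn next_seq_after_restart(ops: &[StoredOp], me: &str) -> u64 {
    ops.iter()
        .filter(|op| op.author == me)
        .map(|op| op.seq)
        .max()
        .map_or(1, |max| max + 1)
}
```

Resetting to 1 instead would reuse sequence numbers the node has already published, which is why the item is described as high risk to leave.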
### Phase D — Distributed Node Authentication (Story 480)
**Goal:** Cryptographic node identity for the distributed mesh.
Nodes already carry an Ed25519 pubkey as their `node_id` in `NodePresenceCrdt`.
Work:
1. Sign each `SignedOp` with the node's Ed25519 key before broadcast.
2. Verify signatures on receipt in `crdt_sync.rs` before applying ops.
3. Expose the node's public key via `NodePresenceCrdt.address` so peers can
bootstrap trust.
4. Add a key-rotation path for long-lived nodes.
### Phase E — Build Agent Mode Polish (Story 479)
**Goal:** Stable headless build-agent mode (`huskies --rendezvous`) for
distributing story processing across multiple machines.
Work:
1. Resolve claim-timeout races: if a node claims a story and dies, the claim
should expire after a configurable TTL and be re-claimable.
2. Stale merge-job lock (Bug 498) — a lock left by a dead node should be
detectable and clearable by the surviving cluster.
3. CRDT Lamport clock fix (511) is a prerequisite — distributed agents need
monotonically increasing sequences to converge correctly.
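
The claim-expiry rule in item 1 can be sketched as a pure function. This is an illustration under assumptions, not the project's design: the `Claim` type, the `try_claim` helper, and the millisecond timestamps are hypothetical, and the sketch assumes nodes agree on time closely enough for a TTL comparison to be meaningful.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub struct Claim {
    pub node_id: String,
    /// Milliseconds since an agreed epoch (the clock source is an assumption).
    pub claimed_at_ms: u64,
}

impl Claim {
    /// A claim expires once at least `ttl` has elapsed since it was made.
    pub fn is_expired(&self, now_ms: u64, ttl: Duration) -> bool {
        now_ms.saturating_sub(self.claimed_at_ms) >= ttl.as_millis() as u64
    }
}

/// Tries to claim a story for `node_id`. Succeeds if there is no claim,
/// the claim already belongs to this node (a renewal), or the existing
/// claim has expired. Returns the new claim on success.
pub fn try_claim(
    current: Option<&Claim>,
    node_id: &str,
    now_ms: u64,
    ttl: Duration,
) -> Option<Claim> {
    let free = match current {
        None => true,
        Some(c) => c.node_id == node_id || c.is_expired(now_ms, ttl),
    };
    if free {
        Some(Claim { node_id: node_id.to_string(), claimed_at_ms: now_ms })
    } else {
        None
    }
}
```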
---
## 3. Dependency Graph
```
Phase A (State Machine)
Phase B (Transport Registry)    Phase C (CRDT Cleanup: 511, 518, 513, 517, 521, 519)
                                    ↓
                                Phase D (Cryptographic Auth)
                                    ↓
                                Phase E (Build Agent Polish)
```
Phases A and C can progress in parallel. Phase B is independent of C/D/E.
Phase D requires Phase C (especially 511 and 518). Phase E requires Phase D.
---
## 4. What NOT to Do
- **Don't split `crdt_state.rs` prematurely.** It's large but internally
cohesive. A split should wait until the cleanup stories (Phase C) are done.
- **Don't add a transport abstraction layer before fixing Bug 501.** A registry
that instantiates a broken Matrix bot just propagates the bug.
- **Don't extract `http/ws.rs` to a service module before Phase A is done.**
The WebSocket handler touches pipeline state in string form; migrating it
while the state machine migration is in progress will cause double-churn.