# Architecture Roadmap: Transports, Services, State Machine, CRDT

*Spike 613 — April 2026*

This document captures the current architecture across four key layers and charts
the recommended next steps for each.

---

## 1. Current State

### 1.1 Service Layer

Stories 604–619 established a clean service extraction pattern. The
`server/src/service/` directory now has 21 sub-modules, each following the
functional-core / imperative-shell convention documented in
[service-modules.md](service-modules.md).

**Extracted so far:**
`agents`, `anthropic`, `bot_command`, `common`, `diagnostics`, `events`,
`file_io`, `gateway`, `git_ops`, `health`, `merge`, `notifications`, `oauth`,
`pipeline`, `project`, `qa`, `settings`, `shell`, `story`, `timer`, `wizard`,
`ws`

**Remaining in HTTP handlers** (see [future-extractions.md](future-extractions.md)):
The list there was written before stories 615–619. After those stories landed,
the remaining surface is smaller. The HTTP handlers still containing inline
business logic are: `http/ws.rs` (WebSocket dispatch) and scattered ad-hoc
helpers in `http/mcp/` that have not yet been migrated to typed service modules.

### 1.2 Chat Transports

Four transport backends implement `ChatTransport` (defined in `chat/mod.rs`):

| Transport | Connection model | Rooms / channels |
|-----------|------------------|------------------|
| Matrix | Long-lived WebSocket to homeserver | Dynamic (per-room history) |
| Slack | HTTP webhook (Events API) | Fixed at startup from bot.toml |
| WhatsApp | HTTP webhook (Meta Graph API or Twilio) | Ambient (tracked active senders) |
| Discord | Gateway WebSocket + REST | Fixed at startup from bot.toml |

All four are instantiated manually in `main.rs` (~lines 567–690) and passed into
`AppContext`. Stage-transition notifications are pushed through
`service/notifications/`.

**Known issue (Bug 501):** The Matrix bot spawns its own `TimerStore` instead of
consuming the shared `AppContext.timer_store`. This means MCP-tool cancellations
and the bot's tick loop see different in-memory state.

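
The effect of Bug 501 can be reproduced in a few lines. This is a minimal sketch, not the project's code: `TimerStore` here is a hypothetical stand-in (a shared set of timer ids behind `Arc<Mutex<..>>`), and `MatrixBot`, `bot_with_private_store`, and `bot_with_shared_store` are illustrative names. The point is only that a cloned handle shares state while a freshly built store does not.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the real `TimerStore`; only the sharing
/// behaviour matters for this sketch.
#[derive(Clone, Default)]
pub struct TimerStore {
    active: Arc<Mutex<HashSet<String>>>,
}

impl TimerStore {
    pub fn schedule(&self, id: &str) {
        self.active.lock().unwrap().insert(id.to_string());
    }

    /// Returns true if the timer existed and was cancelled.
    pub fn cancel(&self, id: &str) -> bool {
        self.active.lock().unwrap().remove(id)
    }

    pub fn is_active(&self, id: &str) -> bool {
        self.active.lock().unwrap().contains(id)
    }
}

/// Hypothetical bot wrapper holding whatever store it was given.
pub struct MatrixBot {
    pub timers: TimerStore,
}

/// Bug 501 shape: the bot builds its own store, so cancellations made
/// through the shared store never reach it.
pub fn bot_with_private_store() -> MatrixBot {
    MatrixBot { timers: TimerStore::default() }
}

/// Intended shape: the bot receives a clone of the shared handle, which
/// points at the same underlying set.
pub fn bot_with_shared_store(shared: &TimerStore) -> MatrixBot {
    MatrixBot { timers: shared.clone() }
}
```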
### 1.3 Pipeline State Machine

`server/src/pipeline_state.rs` provides a typed, compile-time-safe state machine
that replaces the old stringly-typed CRDT views.

**Synced stages (all nodes converge):**
```
Backlog → Coding → Qa → Merge { feature_branch, commits_ahead: NonZeroU32 }
                      → Done { merged_at, merge_commit }
                      → Archived { archived_at, reason }
```

`ArchiveReason` subsumes the old `blocked`, `merge_failure`, and `review_hold`
flags: `Completed | Abandoned | Superseded | Blocked | MergeFailed | ReviewHeld`.

`NonZeroU32` in `Merge` makes zero-commit merges structurally impossible.

**Per-node execution state (local, not replicated):**
`Idle → Pending → Running → RateLimited → Completed`

**Status:** The typed state machine is defined and the projection layer
(`PipelineItemView` → `PipelineItem` via `TryFrom`) is in place. Consumer
migration — replacing ad-hoc string comparisons across the codebase — is the
remaining work (tracked by Story 520).

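
As an illustration of what such a projection might look like, here is a reduced sketch. The field names on `PipelineItemView`, the `ProjectionError` type, and the `"backlog"` stage string are assumptions made for the example; `"current"`, `"qa"`, and `"merge"` are the legacy strings mentioned under Phase A. `Done` and `Archived` are left out to keep it short.

```rust
use std::convert::TryFrom;
use std::num::NonZeroU32;

/// Hypothetical stringly-typed view; the real `PipelineItemView` is not
/// shown in this document and certainly has more fields.
pub struct PipelineItemView {
    pub stage: String,
    pub feature_branch: Option<String>,
    pub commits_ahead: u32,
}

/// Reduced version of the typed enum (Done and Archived omitted).
#[derive(Debug, PartialEq)]
pub enum PipelineItem {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
}

/// Hypothetical error type for failed projections.
#[derive(Debug, PartialEq)]
pub enum ProjectionError {
    UnknownStage(String),
    MissingBranch,
    ZeroCommitsAhead,
}

impl TryFrom<PipelineItemView> for PipelineItem {
    type Error = ProjectionError;

    fn try_from(view: PipelineItemView) -> Result<Self, Self::Error> {
        match view.stage.as_str() {
            "backlog" => Ok(PipelineItem::Backlog),
            "current" => Ok(PipelineItem::Coding),
            "qa" => Ok(PipelineItem::Qa),
            "merge" => {
                let feature_branch = view
                    .feature_branch
                    .ok_or(ProjectionError::MissingBranch)?;
                // NonZeroU32::new returns None for 0, so a zero-commit
                // Merge value cannot be constructed at all.
                let commits_ahead = NonZeroU32::new(view.commits_ahead)
                    .ok_or(ProjectionError::ZeroCommitsAhead)?;
                Ok(PipelineItem::Merge { feature_branch, commits_ahead })
            }
            other => Err(ProjectionError::UnknownStage(other.to_string())),
        }
    }
}
```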
### 1.4 CRDT Layer

`server/src/crdt_state.rs` + `crdt_sync.rs` form the distributed-state
foundation:

- **Document model:** `PipelineDoc { items: ListCrdt<PipelineItemCrdt>, nodes: ListCrdt<NodePresenceCrdt> }`
- **Registers:** `LwwRegisterCrdt<T>` for all mutable fields
- **Persistence:** Ops stored in SQLite (`pipeline.db`); `CrdtEvent` broadcast on every stage change
- **Sync protocol:** WebSocket `/crdt-sync` — bulk dump on connect (text), individual `SignedOp`s in real-time (binary)
- **Backpressure:** Slow peers are disconnected; they reconnect and get a fresh bulk dump

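
The backpressure rule in the list above can be sketched with a bounded queue per peer. This is an assumption-laden illustration rather than the actual `crdt_sync.rs` logic: `Peer`, `connect`, and `broadcast` are hypothetical names, and a standard-library `sync_channel` stands in for the WebSocket send buffer. A peer whose queue is full is dropped instead of being waited on.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical per-peer outbound queue with a fixed capacity.
pub struct Peer {
    pub id: u32,
    tx: SyncSender<Vec<u8>>,
}

/// Creates a peer handle plus the receiving end its writer task would drain.
pub fn connect(id: u32, capacity: usize) -> (Peer, Receiver<Vec<u8>>) {
    let (tx, rx) = sync_channel(capacity);
    (Peer { id, tx }, rx)
}

/// Sends `op` to every peer without blocking. Peers whose queue is full
/// (or whose receiver is gone) are removed from the list; their ids are
/// returned so the caller can close the socket and let them reconnect.
pub fn broadcast(peers: &mut Vec<Peer>, op: &[u8]) -> Vec<u32> {
    let mut dropped = Vec::new();
    peers.retain(|peer| match peer.tx.try_send(op.to_vec()) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
            dropped.push(peer.id);
            false
        }
    });
    dropped
}
```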
**Filesystem shadows** (`huskies/work/`) are now a secondary output only — CRDT is
the source of truth. Several clean-up stories (513, 517) remain backlogged to
remove the remaining fallback paths.

---

## 2. Roadmap

### Phase A — Finish the State Machine Migration (Story 520)

**Goal:** Every pipeline query uses the typed `PipelineItem` enum instead of
raw string comparisons on `stage`.

Work:
1. Replace `stage == "current"` / `"qa"` / `"merge"` patterns in `agents/`,
   `http/mcp/`, `chat/commands/`, and `gateway.rs` with `matches!(item, PipelineItem::Coding)` etc.
2. Remove the `PipelineItemView` → string projection paths once all consumers
   use the typed enum.
3. Add exhaustive match tests in `pipeline_state.rs` so new stages cause
   compile-time failures, not silent mismatches.

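
Steps 1 and 3 might look roughly like the following. The enum is a payload-free stand-in for the real `PipelineItem`, and `is_coding` and `legacy_label` are hypothetical helpers; the `"backlog"`, `"done"`, and `"archived"` labels are assumed for illustration. The important detail is the absence of a wildcard arm, which is what turns a newly added stage into a compile error.

```rust
/// Reduced stand-in for the typed enum; payload fields are omitted.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PipelineItem {
    Backlog,
    Coding,
    Qa,
    Merge,
    Done,
    Archived,
}

/// Before: `item.stage == "current"`. After: a typed check.
pub fn is_coding(item: PipelineItem) -> bool {
    matches!(item, PipelineItem::Coding)
}

/// No `_ =>` arm: adding a variant to `PipelineItem` makes this function
/// fail to compile until the new stage is handled explicitly.
pub fn legacy_label(item: PipelineItem) -> &'static str {
    match item {
        PipelineItem::Backlog => "backlog",
        PipelineItem::Coding => "current",
        PipelineItem::Qa => "qa",
        PipelineItem::Merge => "merge",
        PipelineItem::Done => "done",
        PipelineItem::Archived => "archived",
    }
}
```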
### Phase B — Transport Registry Abstraction

**Goal:** Replace the manual transport wiring in `main.rs` with a pluggable
registry, making it easy to add or remove transports without modifying the
startup sequence.

Work:
1. Define a `TransportRegistry` that holds `Box<dyn ChatTransport>` instances keyed
   by `TransportKind` (Matrix, Slack, WhatsApp, Discord).
2. Move the per-transport instantiation logic from `main.rs` into
   `service/transport/` following the service module conventions.
3. Unify webhook signature verification (currently duplicated between Slack and
   WhatsApp) into a shared `service/transport/verify.rs`.
4. Fix Bug 501: pass the shared `AppContext.timer_store` into the Matrix bot
   instead of spawning a private instance.
5. Unify message history persistence (each transport currently owns a separate
   history file format) into a common `service/transport/history.rs`.

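
One possible shape for the registry in step 1 is sketched below. It reads "keyed by `TransportKind`" as a map from kind to boxed transport. The `ChatTransport` trait shown here is a heavily reduced, synchronous stand-in (the real trait in `chat/mod.rs` is not reproduced in this document), and `register`, `kinds`, `broadcast`, and `StubTransport` are hypothetical names.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum TransportKind {
    Matrix,
    Slack,
    WhatsApp,
    Discord,
}

/// Hypothetical, reduced version of the trait defined in `chat/mod.rs`.
pub trait ChatTransport {
    fn kind(&self) -> TransportKind;
    fn send(&mut self, room: &str, body: &str) -> Result<(), String>;
}

#[derive(Default)]
pub struct TransportRegistry {
    transports: BTreeMap<TransportKind, Box<dyn ChatTransport>>,
}

impl TransportRegistry {
    /// Registers a transport, returning the one it replaced (if any).
    pub fn register(&mut self, t: Box<dyn ChatTransport>) -> Option<Box<dyn ChatTransport>> {
        self.transports.insert(t.kind(), t)
    }

    /// Registered kinds, in the enum's declaration order.
    pub fn kinds(&self) -> Vec<TransportKind> {
        self.transports.keys().copied().collect()
    }

    /// Sends to every registered transport and collects per-transport failures.
    pub fn broadcast(&mut self, room: &str, body: &str) -> Vec<(TransportKind, String)> {
        let mut failures = Vec::new();
        for (kind, transport) in self.transports.iter_mut() {
            if let Err(e) = transport.send(room, body) {
                failures.push((*kind, e));
            }
        }
        failures
    }
}

/// Test double: succeeds or fails on demand.
pub struct StubTransport {
    pub kind: TransportKind,
    pub fail: bool,
}

impl ChatTransport for StubTransport {
    fn kind(&self) -> TransportKind {
        self.kind
    }

    fn send(&mut self, _room: &str, _body: &str) -> Result<(), String> {
        if self.fail {
            Err("send failed".to_string())
        } else {
            Ok(())
        }
    }
}
```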
### Phase C — CRDT Cleanup (Stories 511, 513, 517, 518, 519, 521)

**Goal:** Remove all legacy filesystem-first paths and complete the
CRDT-as-source-of-truth migration.

Priority order (based on risk/value):
1. **519** — Mergemaster must detect zero-commits-ahead and fail loudly instead of
   silently exiting. Structural fix: `Merge { commits_ahead: NonZeroU32 }` already
   enforces this — just ensure mergemaster reads from the typed enum.
2. **518** — `apply_and_persist` should log when the persist tx fails instead of
   silently dropping ops.
3. **513** — Startup reconciliation pass: detect drift between CRDT pipeline items
   and filesystem shadows, heal or report.
4. **517** — Remove filesystem shadow fallback paths from `lifecycle.rs`.
5. **521** — MCP HTTP capability to write a CRDT tombstone-delete op, clearing a
   story from in-memory state cleanly.
6. **511** — Lamport clock inner seq resets to 1 on restart instead of resuming
   from `max(own_author_seq) + 1`. Low risk to fix, high risk to leave.

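
The fix described for 511 comes down to a single fold over the persisted ops. A minimal sketch, assuming a hypothetical `StoredOp` record that exposes only the author and sequence number (the real op schema in `pipeline.db` is not shown here):

```rust
/// Hypothetical persisted op record: just the author and its sequence number.
pub struct StoredOp {
    pub author: u64,
    pub seq: u64,
}

/// Next sequence number for node `me`: one past the highest seq this node
/// has already persisted, or 1 if it has never written an op.
pub fn resume_seq(ops: &[StoredOp], me: u64) -> u64 {
    ops.iter()
        .filter(|op| op.author == me)
        .map(|op| op.seq)
        .max()
        .map_or(1, |max_seq| max_seq + 1)
}
```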
### Phase D — Distributed Node Authentication (Story 480)

**Goal:** Cryptographic node identity for the distributed mesh.

Nodes already carry an Ed25519 pubkey as their `node_id` in `NodePresenceCrdt`.

Work:
1. Sign each `SignedOp` with the node's Ed25519 key before broadcast.
2. Verify signatures on receipt in `crdt_sync.rs` before applying ops.
3. Expose the node's public key via `NodePresenceCrdt.address` so peers can
   bootstrap trust.
4. Add a key-rotation path for long-lived nodes.

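
The verify-before-apply gate from step 2 can be sketched independently of any particular crypto library. Everything here is an assumption for illustration: the `SignedOp` fields, the `SignatureVerifier` trait, and `verify_then_apply` are hypothetical, and `ReversedPayloadVerifier` is a test double with no cryptographic value. A real implementation would put an Ed25519 verifier behind the trait.

```rust
/// Hypothetical shape of a signed op; the real `SignedOp` is not shown here.
pub struct SignedOp {
    pub author: [u8; 32], // Ed25519 public key doubling as the node id
    pub payload: Vec<u8>,
    pub signature: Vec<u8>,
}

/// Abstraction over the signature scheme so the gate can be exercised
/// without a crypto dependency.
pub trait SignatureVerifier {
    fn verify(&self, public_key: &[u8; 32], message: &[u8], signature: &[u8]) -> bool;
}

#[derive(Debug, PartialEq)]
pub enum ApplyError {
    BadSignature,
}

/// Appends the op's payload to `log` only if its signature checks out.
pub fn verify_then_apply<V: SignatureVerifier>(
    verifier: &V,
    log: &mut Vec<Vec<u8>>,
    op: SignedOp,
) -> Result<(), ApplyError> {
    if !verifier.verify(&op.author, &op.payload, &op.signature) {
        return Err(ApplyError::BadSignature);
    }
    log.push(op.payload);
    Ok(())
}

/// NOT cryptography: a test double that accepts a "signature" equal to
/// the payload bytes in reverse order.
pub struct ReversedPayloadVerifier;

impl SignatureVerifier for ReversedPayloadVerifier {
    fn verify(&self, _public_key: &[u8; 32], message: &[u8], signature: &[u8]) -> bool {
        message.iter().rev().eq(signature.iter())
    }
}
```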
### Phase E — Build Agent Mode Polish (Story 479)

**Goal:** Stable headless build-agent mode (`huskies --rendezvous`) for
distributing story processing across multiple machines.

Work:
1. Resolve claim-timeout races: if a node claims a story and dies, the claim
   should expire after a configurable TTL and be re-claimable.
2. Stale merge-job lock (Bug 498) — a lock left by a dead node should be
   detectable and clearable by the surviving cluster.
3. CRDT Lamport clock fix (511) is a prerequisite — distributed agents need
   monotonically increasing sequences to converge correctly.

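
The TTL rule in step 1 can be sketched as a pure function. `Claim` and `try_claim` are hypothetical names, timestamps are passed in as plain `Duration`s since an arbitrary epoch so the logic is testable without a clock, and the sketch ignores clock skew between nodes, which a real distributed implementation would need to address.

```rust
use std::time::Duration;

/// Hypothetical claim record: who holds the story and since when.
#[derive(Debug, Clone, PartialEq)]
pub struct Claim {
    pub node: String,
    pub claimed_at: Duration,
}

/// Tries to claim a story for `node` at time `now`. Succeeds if the story
/// is unclaimed or the existing claim is at least `ttl` old; on success the
/// claim is overwritten with the new holder.
pub fn try_claim(current: &mut Option<Claim>, node: &str, now: Duration, ttl: Duration) -> bool {
    let claimable = match current {
        None => true,
        Some(claim) => now.saturating_sub(claim.claimed_at) >= ttl,
    };
    if claimable {
        *current = Some(Claim { node: node.to_string(), claimed_at: now });
    }
    claimable
}
```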

---

## 3. Dependency Graph

```
Phase A (State Machine)
    ↓
Phase B (Transport Registry)    Phase C (CRDT Cleanup: 511, 518, 513, 517, 521, 519)
                                    ↓
                                Phase D (Cryptographic Auth)
                                    ↓
                                Phase E (Build Agent Polish)
```

Phase A and C can progress in parallel. Phase B is independent of C/D/E.
Phase D requires Phase C (especially 511 and 518). Phase E requires Phase D.

---

## 4. What NOT to Do

- **Don't split `crdt_state.rs` prematurely.** It's large but internally
  cohesive. A split should wait until the cleanup stories (Phase C) are done.
- **Don't add a transport abstraction layer before fixing Bug 501.** A registry
  that instantiates a broken Matrix bot just propagates the bug.
- **Don't extract `http/ws.rs` to a service module before Phase A is done.**
  The WebSocket handler touches pipeline state in string form; migrating it
  while the state machine migration is in progress will cause double-churn.