# Architecture Roadmap: Transports, Services, State Machine, CRDT
*Spike 613 — April 2026*
This document captures the current architecture across four key layers and charts
the recommended next steps for each.
---
## 1. Current State
### 1.1 Service Layer
Stories 604–619 established a clean service extraction pattern. The
`server/src/service/` directory now has 22 sub-modules, each following the
functional-core / imperative-shell convention documented in
[service-modules.md](service-modules.md).
**Extracted so far:**
`agents`, `anthropic`, `bot_command`, `common`, `diagnostics`, `events`,
`file_io`, `gateway`, `git_ops`, `health`, `merge`, `notifications`, `oauth`,
`pipeline`, `project`, `qa`, `settings`, `shell`, `story`, `timer`, `wizard`,
`ws`
**Remaining in HTTP handlers** (see [future-extractions.md](future-extractions.md)):
The list there was written before stories 615–619. After those stories landed,
the remaining surface is smaller. The HTTP handlers still containing inline
business logic are: `http/ws.rs` (WebSocket dispatch) and scattered ad-hoc
helpers in `http/mcp/` that have not yet been migrated to typed service modules.
### 1.2 Chat Transports
Four transport backends implement `ChatTransport` (defined in `chat/mod.rs`):
| Transport | Connection model | Rooms / channels |
|-----------|-----------------|-----------------|
| Matrix | Long-lived WebSocket to homeserver | Dynamic (per-room history) |
| Slack | HTTP webhook (Events API) | Fixed at startup from bot.toml |
| WhatsApp | HTTP webhook (Meta Graph API or Twilio) | Ambient (tracked active senders) |
| Discord | Gateway WebSocket + REST | Fixed at startup from bot.toml |
All four are instantiated manually in `main.rs` (~lines 567–690) and passed into
`AppContext`. Stage-transition notifications are pushed through
`service/notifications/`.
**Known issue (Bug 501):** The Matrix bot spawns its own `TimerStore` instead of
consuming the shared `AppContext.timer_store`. This means MCP-tool cancellations
and the bot's tick loop see different in-memory state.
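The shape of the fix is small. A minimal sketch, assuming a `Mutex<HashMap>`-backed `TimerStore` and a simplified `MatrixBot` constructor (the real types live in the timer service and the Matrix transport): the bot accepts the `AppContext`'s shared handle instead of constructing a private store.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Sketch only: `TimerStore` internals are assumptions for illustration.
#[derive(Default)]
struct TimerStore {
    timers: Mutex<HashMap<String, u64>>, // story id -> deadline (epoch secs)
}

struct MatrixBot {
    timer_store: Arc<TimerStore>,
}

impl MatrixBot {
    // Accept the shared handle; never call TimerStore::default() here.
    fn new(shared: Arc<TimerStore>) -> Self {
        Self { timer_store: shared }
    }
}
```

With this shape, an MCP-tool cancellation and the bot's tick loop observe the same in-memory state, because both hold clones of one `Arc`.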
### 1.3 Pipeline State Machine
`server/src/pipeline_state.rs` provides a typed, compile-time-safe state machine
that replaces the old stringly-typed CRDT views.
**Synced stages (all nodes converge):**
```
Backlog → Coding → Qa → Merge { feature_branch, commits_ahead: NonZeroU32 }
→ Done { merged_at, merge_commit }
→ Archived { archived_at, reason }
```
`ArchiveReason` subsumes the old `blocked`, `merge_failure`, and `review_hold`
flags: `Completed | Abandoned | Superseded | Blocked | MergeFailed | ReviewHeld`.
`NonZeroU32` in `Merge` makes zero-commit merges structurally impossible.
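A sketch of what the synced-stage enum looks like, with variant names taken from the document; the timestamp payloads (`u64` epoch seconds) are placeholders, not the real types:

```rust
use std::num::NonZeroU32;

#[derive(Debug)]
enum ArchiveReason { Completed, Abandoned, Superseded, Blocked, MergeFailed, ReviewHeld }

#[derive(Debug)]
enum PipelineItem {
    Backlog,
    Coding,
    Qa,
    Merge { feature_branch: String, commits_ahead: NonZeroU32 },
    Done { merged_at: u64, merge_commit: String },
    Archived { archived_at: u64, reason: ArchiveReason },
}

// NonZeroU32::new returns None for 0, so a zero-commit Merge cannot be built.
fn into_merge(feature_branch: String, commits_ahead: u32) -> Option<PipelineItem> {
    NonZeroU32::new(commits_ahead)
        .map(|n| PipelineItem::Merge { feature_branch, commits_ahead: n })
}
```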
**Per-node execution state (local, not replicated):**
`Idle → Pending → Running → RateLimited → Completed`
**Status:** The typed state machine is defined and the projection layer
(`PipelineItemView → PipelineItem via TryFrom`) is in place. Consumer
migration — replacing ad-hoc string comparisons across the codebase — is the
remaining work (tracked by Story 520).
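The projection layer can be pictured like this. A minimal sketch, assuming a stringly-typed view struct and the legacy stage spellings mentioned in Phase A (`"current"`, `"qa"`, `"merge"`); the real `PipelineItemView` and `PipelineItem` are richer:

```rust
// Hypothetical stringly-typed CRDT view; the real one lives in crdt_state.rs.
struct PipelineItemView { stage: String }

#[derive(Debug, PartialEq)]
enum Stage { Backlog, Coding, Qa, Merge }

impl TryFrom<&PipelineItemView> for Stage {
    type Error = String;

    fn try_from(view: &PipelineItemView) -> Result<Self, Self::Error> {
        match view.stage.as_str() {
            "backlog" => Ok(Stage::Backlog),
            "current" => Ok(Stage::Coding), // legacy name for the coding stage
            "qa" => Ok(Stage::Qa),
            "merge" => Ok(Stage::Merge),
            other => Err(format!("unknown stage: {other}")),
        }
    }
}
```

`TryFrom` (rather than `From`) keeps the failure mode explicit: an unrecognized stage string becomes an error at the projection boundary instead of a silent mismatch downstream.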
### 1.4 CRDT Layer
`server/src/crdt_state.rs` + `crdt_sync.rs` form the distributed-state
foundation:
- **Document model:** `PipelineDoc { items: ListCrdt<PipelineItemCrdt>, nodes: ListCrdt<NodePresenceCrdt> }`
- **Registers:** `LwwRegisterCrdt<T>` for all mutable fields
- **Persistence:** Ops stored in SQLite (`pipeline.db`); `CrdtEvent` broadcast on every stage change
- **Sync protocol:** WebSocket `/crdt-sync` — bulk dump on connect (text), individual `SignedOp`s in real-time (binary)
- **Backpressure:** Slow peers are disconnected; they reconnect and get a fresh bulk dump
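For orientation, a dependency-free sketch of a last-writer-wins register; the real `LwwRegisterCrdt`'s field layout is an assumption here, but the merge rule (higher `(timestamp, author)` pair wins) is the standard LWW tie-break that makes concurrent writes converge:

```rust
#[derive(Clone, Debug, PartialEq)]
struct LwwRegister<T> {
    value: T,
    timestamp: u64, // Lamport time of the last write
    author: u64,    // tie-breaker so concurrent same-timestamp writes converge
}

impl<T: Clone> LwwRegister<T> {
    // Keep whichever write has the greater (timestamp, author) pair.
    fn merge(&mut self, other: &LwwRegister<T>) {
        if (other.timestamp, other.author) > (self.timestamp, self.author) {
            *self = other.clone();
        }
    }
}
```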
**Filesystem shadows** (`huskies/work/`) are now a secondary output only — CRDT is
the source of truth. Several clean-up stories (513, 517) remain backlogged to
remove the remaining fallback paths.
---
## 2. Roadmap
### Phase A — Finish the State Machine Migration (Story 520)
**Goal:** Every pipeline query uses the typed `PipelineItem` enum instead of
raw string comparisons on `stage`.
Work:
1. Replace `stage == "current"` / `"qa"` / `"merge"` patterns in `agents/`,
`http/mcp/`, `chat/commands/`, and `gateway.rs` with `matches!(item, PipelineItem::Coding)` etc.
2. Remove the `PipelineItemView` → string projection paths once all consumers
use the typed enum.
3. Add exhaustive match tests in `pipeline_state.rs` so new stages cause
compile-time failures, not silent mismatches.
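The step-1 replacement pattern, sketched with a cut-down `PipelineItem` (only the variants shown are assumed here):

```rust
enum PipelineItem {
    Coding,
    Qa,
    Merge { commits_ahead: u32 },
}

fn is_mergeable(item: &PipelineItem) -> bool {
    // Old: item_view.stage == "merge"  (silently false on a typo)
    // New: a typed match that breaks at compile time if a variant is renamed.
    matches!(item, PipelineItem::Merge { .. })
}
```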
### Phase B — Transport Registry Abstraction
**Goal:** Replace the manual transport wiring in `main.rs` with a pluggable
registry, making it easy to add or remove transports without modifying the
startup sequence.
Work:
1. Define a `TransportRegistry` that holds `Vec<Box<dyn ChatTransport>>` keyed
by `TransportKind` (Matrix, Slack, WhatsApp, Discord).
2. Move the per-transport instantiation logic from `main.rs` into
`service/transport/` following the service module conventions.
3. Unify webhook signature verification (currently duplicated between Slack and
WhatsApp) into a shared `service/transport/verify.rs`.
4. Fix Bug 501: pass the shared `AppContext.timer_store` into the Matrix bot
instead of spawning a private instance.
5. Unify message history persistence (each transport currently owns a separate
history file format) into a common `service/transport/history.rs`.
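A sketch of the registry shape for step 1. The `ChatTransport` trait surface is an assumption (only a `send` method is shown), and a `HashMap` stands in for the "`Vec<Box<dyn ChatTransport>>` keyed by `TransportKind`" described above:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum TransportKind { Matrix, Slack, WhatsApp, Discord }

trait ChatTransport {
    fn send(&self, channel: &str, text: &str) -> Result<(), String>;
}

#[derive(Default)]
struct TransportRegistry {
    transports: HashMap<TransportKind, Box<dyn ChatTransport>>,
}

impl TransportRegistry {
    fn register(&mut self, kind: TransportKind, transport: Box<dyn ChatTransport>) {
        self.transports.insert(kind, transport);
    }

    fn get(&self, kind: TransportKind) -> Option<&dyn ChatTransport> {
        self.transports.get(&kind).map(|b| b.as_ref())
    }
}
```

Startup then becomes a loop over enabled transports from `bot.toml` rather than ~120 bespoke lines in `main.rs`.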
### Phase C — CRDT Cleanup (Stories 511, 513, 517, 518, 519, 521)
**Goal:** Remove all legacy filesystem-first paths and complete the
CRDT-as-source-of-truth migration.
Priority order (based on risk/value):
1. **519** — Mergemaster must detect zero-commits-ahead and fail loudly instead of
silently exiting. Structural fix: `Merge { commits_ahead: NonZeroU32 }` already
enforces this — just ensure mergemaster reads from the typed enum.
2. **518** — `apply_and_persist` should log when the persist tx fails instead of
silently dropping ops.
3. **513** — Startup reconciliation pass: detect drift between CRDT pipeline items
and filesystem shadows, heal or report.
4. **517** — Remove filesystem shadow fallback paths from `lifecycle.rs`.
5. **521** — MCP HTTP capability to write a CRDT tombstone-delete op, clearing a
story from in-memory state cleanly.
6. **511** — Lamport clock inner seq resets to 1 on restart instead of resuming
from `max(own_author_seq) + 1`. Low risk to fix, high risk to leave.
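The 511 fix can be sketched as follows; `max_persisted_seq` stands in for a hypothetical SQLite query over this node's own ops, and the struct layout is illustrative:

```rust
struct LamportClock { seq: u64 }

impl LamportClock {
    // None means a brand-new node with no ops on disk.
    fn resume(max_persisted_seq: Option<u64>) -> Self {
        Self { seq: max_persisted_seq.unwrap_or(0) }
    }

    // The next op gets max(own_author_seq) + 1, as the story requires.
    fn next(&mut self) -> u64 {
        self.seq += 1;
        self.seq
    }

    // On receiving a remote op, jump past its sequence so local ops stay ahead.
    fn observe(&mut self, remote_seq: u64) {
        self.seq = self.seq.max(remote_seq);
    }
}
```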
### Phase D — Distributed Node Authentication (Story 480)
**Goal:** Cryptographic node identity for the distributed mesh.
Nodes already carry an Ed25519 pubkey as their `node_id` in `NodePresenceCrdt`.
Work:
1. Sign each `SignedOp` with the node's Ed25519 key before broadcast.
2. Verify signatures on receipt in `crdt_sync.rs` before applying ops.
3. Expose the node's public key via `NodePresenceCrdt.address` so peers can
bootstrap trust.
4. Add a key-rotation path for long-lived nodes.
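Steps 1–2 amount to a verify-before-apply gate in `crdt_sync.rs`. A dependency-free sketch: the real scheme is Ed25519, but here the verifier is injected as a closure and the `SignedOp` fields are assumptions, so the control flow is the point:

```rust
// Hypothetical wire shape; the real SignedOp is defined in crdt_sync.rs.
struct SignedOp {
    author: Vec<u8>,  // node's public key bytes
    payload: Vec<u8>, // serialized CRDT op
    sig: Vec<u8>,
}

fn apply_if_valid(
    op: &SignedOp,
    verify: impl Fn(&[u8], &[u8], &[u8]) -> bool,
    apply: impl FnOnce(&[u8]),
) -> bool {
    if verify(&op.author, &op.payload, &op.sig) {
        apply(&op.payload);
        true
    } else {
        false // drop unverifiable ops before they can mutate state
    }
}
```

The key property: an op that fails verification never reaches the apply path, so a forged or corrupted op cannot poison replicated state.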
### Phase E — Build Agent Mode Polish (Story 479)
**Goal:** Stable headless build-agent mode (`huskies --rendezvous`) for
distributing story processing across multiple machines.
Work:
1. Resolve claim-timeout races: if a node claims a story and dies, the claim
should expire after a configurable TTL and be re-claimable.
2. Stale merge-job lock (Bug 498) — a lock left by a dead node should be
detectable and clearable by the surviving cluster.
3. CRDT Lamport clock fix (511) is a prerequisite — distributed agents need
monotonically increasing sequences to converge correctly.
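The claim-TTL idea in step 1 is small in code terms. A sketch under stated assumptions (field names are hypothetical, and in practice the claim would live in the CRDT document rather than a local struct):

```rust
use std::time::{Duration, Instant};

struct Claim {
    node_id: String,
    claimed_at: Instant,
}

impl Claim {
    // A dead node's claim becomes re-claimable once the TTL elapses.
    fn is_expired(&self, ttl: Duration) -> bool {
        self.claimed_at.elapsed() >= ttl
    }
}
```

The same expiry predicate also covers Bug 498: a merge-job lock is just a claim with a different payload, so surviving nodes can clear any lock whose TTL has lapsed.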
---
## 3. Dependency Graph
```
Phase A (State Machine)       Phase C (CRDT Cleanup: 511, 518, 513, 517, 521, 519)
                                      │
Phase B (Transport Registry)          ▼
                              Phase D (Cryptographic Auth)
                                      │
                                      ▼
                              Phase E (Build Agent Polish)
```
Phase A and C can progress in parallel. Phase B is independent of C/D/E.
Phase D requires Phase C (especially 511 and 518). Phase E requires Phase D.
---
## 4. What NOT to Do
- **Don't split `crdt_state.rs` prematurely.** It's large but internally
cohesive. A split should wait until the cleanup stories (Phase C) are done.
- **Don't add a transport abstraction layer before fixing Bug 501.** A registry
that instantiates a broken Matrix bot just propagates the bug.
- **Don't extract `http/ws.rs` to a service module before Phase A is done.**
The WebSocket handler touches pipeline state in string form; migrating it
while the state machine migration is in progress will cause double-churn.