Merge spike branch 'feature/story-613_spike_architecture_roadmap_transports_services_state_machine_crdt' into master
@@ -0,0 +1,196 @@
# Architecture Roadmap: Transports, Services, State Machine, CRDT

*Spike 613 — April 2026*

This document captures the current architecture across four key layers and charts
the recommended next steps for each.

---

## 1. Current State

### 1.1 Service Layer

Stories 604–619 established a clean service extraction pattern. The
`server/src/service/` directory now has 22 sub-modules, each following the
functional-core / imperative-shell convention documented in
[service-modules.md](service-modules.md).

**Extracted so far:**
`agents`, `anthropic`, `bot_command`, `common`, `diagnostics`, `events`,
`file_io`, `gateway`, `git_ops`, `health`, `merge`, `notifications`, `oauth`,
`pipeline`, `project`, `qa`, `settings`, `shell`, `story`, `timer`, `wizard`,
`ws`

**Remaining in HTTP handlers** (see [future-extractions.md](future-extractions.md)):
The list there was written before stories 615–619 landed, so the remaining
surface is now smaller than it suggests. The HTTP handlers that still contain
inline business logic are `http/ws.rs` (WebSocket dispatch) and scattered ad-hoc
helpers in `http/mcp/` that have not yet been migrated to typed service modules.
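
As a rough illustration of the functional-core / imperative-shell convention mentioned above, the sketch below separates a pure decision function from the thin wrapper that gathers its inputs. The names `HealthStatus`, `evaluate_health`, and `health_endpoint` are hypothetical and are not taken from the actual `health` module.

```rust
/// Result of a health evaluation. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
pub enum HealthStatus {
    Ok,
    Degraded(Vec<String>),
}

/// Functional core: a pure function with no I/O, so it can be unit-tested
/// with plain data.
pub fn evaluate_health(checks: &[(String, bool)]) -> HealthStatus {
    let failed: Vec<String> = checks
        .iter()
        .filter(|(_, ok)| !*ok)
        .map(|(name, _)| name.clone())
        .collect();
    if failed.is_empty() {
        HealthStatus::Ok
    } else {
        HealthStatus::Degraded(failed)
    }
}

/// Imperative shell: talks to the outside world (here via the `probe`
/// callback), then delegates the decision to the pure core.
pub fn health_endpoint(probe: impl Fn(&str) -> bool) -> HealthStatus {
    // Assumed dependency names; a real handler would read these from config.
    let names = ["db", "git"];
    let checks: Vec<(String, bool)> = names
        .iter()
        .map(|name| (name.to_string(), probe(name)))
        .collect();
    evaluate_health(&checks)
}
```

An HTTP handler written this way stays small: it collects inputs, calls the core, and serialises the result.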

### 1.2 Chat Transports

Four transport backends implement `ChatTransport` (defined in `chat/mod.rs`):

| Transport | Connection model | Rooms / channels |
|-----------|-----------------|-----------------|
| Matrix | Long-lived WebSocket to homeserver | Dynamic (per-room history) |
| Slack | HTTP webhook (Events API) | Fixed at startup from bot.toml |
| WhatsApp | HTTP webhook (Meta Graph API or Twilio) | Ambient (tracked active senders) |
| Discord | Gateway WebSocket + REST | Fixed at startup from bot.toml |

All four are instantiated manually in `main.rs` (~lines 567–690) and passed into
`AppContext`. Stage-transition notifications are pushed through
`service/notifications/`.

**Known issue (Bug 501):** The Matrix bot spawns its own `TimerStore` instead of
consuming the shared `AppContext.timer_store`. This means MCP-tool cancellations
and the bot's tick loop see different in-memory state.
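
The shape of the Bug 501 fix can be sketched as follows, assuming the store is shared behind an `Arc<Mutex<…>>`. The `TimerStore` and `MatrixBot` below are hypothetical stand-ins with invented methods; only the idea of injecting the shared store instead of constructing a private one comes from the text above.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the real `TimerStore`: tracks active timer ids.
#[derive(Default)]
pub struct TimerStore {
    active: HashSet<u64>,
}

impl TimerStore {
    pub fn schedule(&mut self, id: u64) {
        self.active.insert(id);
    }

    /// Returns true if the timer existed and was removed.
    pub fn cancel(&mut self, id: u64) -> bool {
        self.active.remove(&id)
    }

    pub fn is_active(&self, id: u64) -> bool {
        self.active.contains(&id)
    }
}

/// Assumed sharing strategy; the real `AppContext.timer_store` may differ.
pub type SharedTimerStore = Arc<Mutex<TimerStore>>;

/// Hypothetical bot that receives the shared store instead of creating one.
pub struct MatrixBot {
    timers: SharedTimerStore,
}

impl MatrixBot {
    pub fn new(timers: SharedTimerStore) -> Self {
        Self { timers }
    }

    /// What a tick loop would consult before firing a timer.
    pub fn should_fire(&self, id: u64) -> bool {
        self.timers.lock().unwrap().is_active(id)
    }
}
```

With a single store, a cancellation made through any other holder of the `Arc` is visible to the bot on its next tick.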

### 1.3 Pipeline State Machine

`server/src/pipeline_state.rs` provides a typed, compile-time-safe state machine
that replaces the old stringly-typed CRDT views.

**Synced stages (all nodes converge):**
```
Backlog → Coding → Qa → Merge { feature_branch, commits_ahead: NonZeroU32 }
                      → Done { merged_at, merge_commit }
                      → Archived { archived_at, reason }
```

`ArchiveReason` subsumes the old `blocked`, `merge_failure`, and `review_hold`
flags: `Completed | Abandoned | Superseded | Blocked | MergeFailed | ReviewHeld`.

`NonZeroU32` in `Merge` makes zero-commit merges structurally impossible.

**Per-node execution state (local, not replicated):**
`Idle → Pending → Running → RateLimited → Completed`

**Status:** The typed state machine is defined and the projection layer
(`PipelineItemView` → `PipelineItem` via `TryFrom`) is in place. The remaining
work, tracked by Story 520, is consumer migration: replacing ad-hoc string
comparisons across the codebase with the typed enum.
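
A minimal sketch of the projection described above, under stated assumptions: the field layout of `PipelineItemView`, the `ProjectionError` type, and the `"backlog"` stage string are hypothetical, and `Done` / `Archived` are omitted for brevity. The `"current"` to `Coding` mapping follows the string patterns named in Phase A.

```rust
use std::convert::TryFrom;
use std::num::NonZeroU32;

/// Hypothetical, trimmed-down view: the stage is still a raw string here.
pub struct PipelineItemView {
    pub stage: String,
    pub feature_branch: Option<String>,
    pub commits_ahead: u32,
}

/// Typed stages (subset; `Done` and `Archived` left out of this sketch).
#[derive(Debug, PartialEq)]
pub enum PipelineItem {
    Backlog,
    Coding,
    Qa,
    Merge {
        feature_branch: String,
        commits_ahead: NonZeroU32,
    },
}

/// Hypothetical error type for failed projections.
#[derive(Debug, PartialEq)]
pub enum ProjectionError {
    UnknownStage(String),
    MissingBranch,
    ZeroCommitsAhead,
}

impl TryFrom<PipelineItemView> for PipelineItem {
    type Error = ProjectionError;

    fn try_from(view: PipelineItemView) -> Result<Self, Self::Error> {
        match view.stage.as_str() {
            "backlog" => Ok(PipelineItem::Backlog),
            "current" => Ok(PipelineItem::Coding),
            "qa" => Ok(PipelineItem::Qa),
            "merge" => {
                let feature_branch = view
                    .feature_branch
                    .ok_or(ProjectionError::MissingBranch)?;
                // NonZeroU32::new returns None for 0, so a zero-commit
                // merge cannot be represented in the typed enum.
                let commits_ahead = NonZeroU32::new(view.commits_ahead)
                    .ok_or(ProjectionError::ZeroCommitsAhead)?;
                Ok(PipelineItem::Merge {
                    feature_branch,
                    commits_ahead,
                })
            }
            other => Err(ProjectionError::UnknownStage(other.to_string())),
        }
    }
}
```

Invalid combinations are rejected once, at the projection boundary, so downstream consumers never need to re-check them.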

### 1.4 CRDT Layer

`server/src/crdt_state.rs` and `crdt_sync.rs` form the distributed-state
foundation:

- **Document model:** `PipelineDoc { items: ListCrdt<PipelineItemCrdt>, nodes: ListCrdt<NodePresenceCrdt> }`
- **Registers:** `LwwRegisterCrdt<T>` for all mutable fields
- **Persistence:** Ops stored in SQLite (`pipeline.db`); `CrdtEvent` broadcast on every stage change
- **Sync protocol:** WebSocket `/crdt-sync`: bulk dump on connect (text), then individual `SignedOp`s in real time (binary)
- **Backpressure:** Slow peers are disconnected; they reconnect and get a fresh bulk dump
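
For readers unfamiliar with last-writer-wins registers, the sketch below shows the general technique behind a type like `LwwRegisterCrdt<T>`. It is a generic textbook version, not the project's implementation: the `(sequence, author)` stamp and the method names are assumptions.

```rust
/// Hypothetical minimal last-writer-wins register.
#[derive(Debug, Clone, PartialEq)]
pub struct LwwRegister<T> {
    value: T,
    /// (Lamport sequence, author id). The author id breaks ties so that
    /// every replica picks the same winner.
    stamp: (u64, u64),
}

impl<T: Clone> LwwRegister<T> {
    pub fn new(value: T, seq: u64, author: u64) -> Self {
        Self { value, stamp: (seq, author) }
    }

    /// Apply a write; it only takes effect if its stamp is strictly newer.
    pub fn set(&mut self, value: T, seq: u64, author: u64) {
        if (seq, author) > self.stamp {
            self.value = value;
            self.stamp = (seq, author);
        }
    }

    /// Merge state from another replica, keeping the value with the
    /// greater stamp. Merging is commutative and idempotent.
    pub fn merge(&mut self, other: &Self) {
        if other.stamp > self.stamp {
            self.value = other.value.clone();
            self.stamp = other.stamp;
        }
    }

    pub fn get(&self) -> &T {
        &self.value
    }
}
```

Because the winner depends only on the stamps, two replicas that exchange state in either order end up with the same value.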

**Filesystem shadows** (`huskies/work/`) are now a secondary output only; the CRDT
is the source of truth. Cleanup stories 513 and 517 remain in the backlog to
remove the remaining fallback paths.

---

## 2. Roadmap

### Phase A — Finish the State Machine Migration (Story 520)

**Goal:** Every pipeline query uses the typed `PipelineItem` enum instead of
raw string comparisons on `stage`.

Work:
1. Replace `stage == "current"` / `"qa"` / `"merge"` patterns in `agents/`,
   `http/mcp/`, `chat/commands/`, and `gateway.rs` with `matches!(item, PipelineItem::Coding)` etc.
2. Remove the `PipelineItemView` → string projection paths once all consumers
   use the typed enum.
3. Add exhaustive match tests in `pipeline_state.rs` so new stages cause
   compile-time failures, not silent mismatches.
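
Steps 1 and 3 can be sketched as below. The enum is a simplified stand-in (`Done` and `Archived` carry fields in the real type), and `is_in_progress`, `stage_label`, and the `"backlog"`, `"done"`, and `"archived"` strings are hypothetical.

```rust
use std::num::NonZeroU32;

/// Simplified stand-in for the typed enum in `pipeline_state.rs`.
#[derive(Debug)]
pub enum PipelineItem {
    Backlog,
    Coding,
    Qa,
    Merge {
        feature_branch: String,
        commits_ahead: NonZeroU32,
    },
    Done,
    Archived,
}

/// Replaces checks such as `stage == "current" || stage == "qa"`.
/// Struct-like variants need `{ .. }` inside `matches!`.
pub fn is_in_progress(item: &PipelineItem) -> bool {
    matches!(
        item,
        PipelineItem::Coding | PipelineItem::Qa | PipelineItem::Merge { .. }
    )
}

/// Exhaustive match with no `_` arm: adding a new variant to
/// `PipelineItem` makes this function fail to compile until it is handled.
pub fn stage_label(item: &PipelineItem) -> &'static str {
    match item {
        PipelineItem::Backlog => "backlog",
        PipelineItem::Coding => "current",
        PipelineItem::Qa => "qa",
        PipelineItem::Merge { .. } => "merge",
        PipelineItem::Done => "done",
        PipelineItem::Archived => "archived",
    }
}
```

The design choice worth noting is the missing wildcard arm: it trades a little verbosity for a compile error whenever the set of stages changes.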

### Phase B — Transport Registry Abstraction

**Goal:** Replace the manual transport wiring in `main.rs` with a pluggable
registry, making it easy to add or remove transports without modifying the
startup sequence.

Work:
1. Define a `TransportRegistry` that holds the `Box<dyn ChatTransport>`
   instances in a collection keyed by `TransportKind` (Matrix, Slack, WhatsApp,
   Discord).
2. Move the per-transport instantiation logic from `main.rs` into
   `service/transport/` following the service module conventions.
3. Unify webhook signature verification (currently duplicated between Slack and
   WhatsApp) into a shared `service/transport/verify.rs`.
4. Fix Bug 501: pass the shared `AppContext.timer_store` into the Matrix bot
   instead of spawning a private instance.
5. Unify message history persistence (each transport currently owns a separate
   history file format) into a common `service/transport/history.rs`.
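
One possible shape for the registry in step 1 is sketched below, assuming a `HashMap` keyed by `TransportKind`. The one-method `ChatTransport` trait is a reduced, hypothetical stand-in for the real trait in `chat/mod.rs`, and `NullTransport` exists only to make the example self-contained.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum TransportKind {
    Matrix,
    Slack,
    WhatsApp,
    Discord,
}

/// Reduced, hypothetical stand-in for the real `ChatTransport` trait.
pub trait ChatTransport {
    fn send(&mut self, room: &str, body: &str) -> Result<(), String>;
}

/// Illustrative transport that accepts any message for a non-empty room.
pub struct NullTransport;

impl ChatTransport for NullTransport {
    fn send(&mut self, room: &str, _body: &str) -> Result<(), String> {
        if room.is_empty() {
            Err("empty room".to_string())
        } else {
            Ok(())
        }
    }
}

/// Registry holding at most one transport per kind.
#[derive(Default)]
pub struct TransportRegistry {
    transports: HashMap<TransportKind, Box<dyn ChatTransport>>,
}

impl TransportRegistry {
    /// Registers a transport, returning the one it replaced, if any.
    pub fn register(
        &mut self,
        kind: TransportKind,
        transport: Box<dyn ChatTransport>,
    ) -> Option<Box<dyn ChatTransport>> {
        self.transports.insert(kind, transport)
    }

    /// Sends through the transport registered for `kind`.
    pub fn send_via(
        &mut self,
        kind: TransportKind,
        room: &str,
        body: &str,
    ) -> Result<(), String> {
        match self.transports.get_mut(&kind) {
            Some(transport) => transport.send(room, body),
            None => Err(format!("no transport registered for {:?}", kind)),
        }
    }

    pub fn len(&self) -> usize {
        self.transports.len()
    }
}
```

Startup code would then reduce to a loop over configured kinds calling `register`, rather than one hand-written block per transport.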

### Phase C — CRDT Cleanup (Stories 511, 513, 517, 518, 519, 521)

**Goal:** Remove all legacy filesystem-first paths and complete the
CRDT-as-source-of-truth migration.

Priority order (based on risk/value):
1. **519** — Mergemaster must detect zero-commits-ahead and fail loudly instead of
   silently exiting. Structural fix: `Merge { commits_ahead: NonZeroU32 }` already
   enforces this, so mergemaster only needs to read from the typed enum.
2. **518** — `apply_and_persist` should log when the persist tx fails instead of
   silently dropping ops.
3. **513** — Startup reconciliation pass: detect drift between CRDT pipeline items
   and filesystem shadows, heal or report.
4. **517** — Remove filesystem shadow fallback paths from `lifecycle.rs`.
5. **521** — MCP HTTP capability to write a CRDT tombstone-delete op, clearing a
   story from in-memory state cleanly.
6. **511** — Lamport clock inner seq resets to 1 on restart instead of resuming
   from `max(own_author_seq) + 1`. Low risk to fix, high risk to leave.
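
The fix for 511 amounts to seeding the clock from persisted ops on startup. A minimal sketch, assuming the node can enumerate the sequence numbers of its own previously persisted ops; `LamportClock`, `resume`, and `tick` are hypothetical names.

```rust
/// Hypothetical per-author sequence counter.
pub struct LamportClock {
    next_seq: u64,
}

impl LamportClock {
    /// Resume after a restart from the node's own persisted sequence
    /// numbers: the next value is max(own_author_seq) + 1, or 1 when the
    /// node has never written an op.
    pub fn resume<I: IntoIterator<Item = u64>>(own_author_seqs: I) -> Self {
        let max_seen = own_author_seqs.into_iter().max().unwrap_or(0);
        Self { next_seq: max_seen + 1 }
    }

    /// Returns the sequence number for a new op and advances the clock.
    pub fn tick(&mut self) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        seq
    }
}
```

Resetting to 1 instead would reuse sequence numbers that peers have already seen from this author, which is why the text calls leaving it unfixed high risk.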

### Phase D — Distributed Node Authentication (Story 480)

**Goal:** Cryptographic node identity for the distributed mesh.

Nodes already carry an Ed25519 public key as their `node_id` in `NodePresenceCrdt`.

Work:
1. Sign each `SignedOp` with the node's Ed25519 key before broadcast.
2. Verify signatures on receipt in `crdt_sync.rs` before applying ops.
3. Expose the node's public key via `NodePresenceCrdt.address` so peers can
   bootstrap trust.
4. Add a key-rotation path for long-lived nodes.
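
Step 2 is essentially a gate in front of the apply path. The sketch below shows only that control flow: the `SignedOp` layout, the `OpVerifier` trait, and `apply_verified` are hypothetical, and `MarkerVerifier` is a test double that performs no cryptography at all. A real implementation would put Ed25519 verification behind the trait.

```rust
/// Hypothetical layout; the real `SignedOp` may differ.
pub struct SignedOp {
    pub author: [u8; 32],
    pub payload: Vec<u8>,
    pub signature: Vec<u8>,
}

/// Abstraction over signature checking so the gate can be tested
/// without a cryptography dependency.
pub trait OpVerifier {
    fn verify(&self, op: &SignedOp) -> bool;
}

/// Test double only: accepts ops whose signature bytes equal b"valid".
/// This is NOT cryptographic verification.
pub struct MarkerVerifier;

impl OpVerifier for MarkerVerifier {
    fn verify(&self, op: &SignedOp) -> bool {
        op.signature.as_slice() == &b"valid"[..]
    }
}

#[derive(Debug, PartialEq)]
pub enum ApplyError {
    BadSignature,
}

/// Applies an op only after its signature has been verified.
/// `applied` stands in for the CRDT document being mutated.
pub fn apply_verified(
    verifier: &dyn OpVerifier,
    applied: &mut Vec<Vec<u8>>,
    op: SignedOp,
) -> Result<(), ApplyError> {
    if !verifier.verify(&op) {
        return Err(ApplyError::BadSignature);
    }
    applied.push(op.payload);
    Ok(())
}
```

Keeping verification behind a trait also leaves room for the key-rotation path in step 4, since the verifier can consult more than one key per author.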

### Phase E — Build Agent Mode Polish (Story 479)

**Goal:** Stable headless build-agent mode (`huskies --rendezvous`) for
distributing story processing across multiple machines.

Work:
1. Resolve claim-timeout races: if a node claims a story and dies, the claim
   should expire after a configurable TTL and be re-claimable.
2. Fix the stale merge-job lock (Bug 498): a lock left by a dead node should be
   detectable and clearable by the surviving cluster.
3. The CRDT Lamport clock fix (511) is a prerequisite: distributed agents need
   monotonically increasing sequences to converge correctly.
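
The claim-expiry rule in step 1 can be expressed as a small pure function. This is a sketch under assumptions: `Claim`, `is_expired`, and `try_claim` are hypothetical, timestamps are plain seconds supplied by the caller, and clock skew between nodes is ignored.

```rust
use std::time::Duration;

/// Hypothetical claim record: who holds the story and since when.
pub struct Claim {
    pub node_id: String,
    pub claimed_at_secs: u64,
}

/// A claim is expired once at least `ttl` has elapsed since it was taken.
/// `saturating_sub` guards against `now` being earlier than the claim time.
pub fn is_expired(claim: &Claim, now_secs: u64, ttl: Duration) -> bool {
    now_secs.saturating_sub(claim.claimed_at_secs) >= ttl.as_secs()
}

/// Returns a new claim for `node_id` if the story is unclaimed or the
/// existing claim has expired; otherwise returns `None`.
pub fn try_claim(
    current: Option<&Claim>,
    node_id: &str,
    now_secs: u64,
    ttl: Duration,
) -> Option<Claim> {
    match current {
        Some(existing) if !is_expired(existing, now_secs, ttl) => None,
        _ => Some(Claim {
            node_id: node_id.to_string(),
            claimed_at_secs: now_secs,
        }),
    }
}
```

Passing `now_secs` in as a parameter keeps the rule deterministic and easy to test; how the resulting claim is replicated and how concurrent re-claims are resolved is left to the CRDT layer.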

---

## 3. Dependency Graph

```
Phase A (State Machine)
    ↓
Phase B (Transport Registry)    Phase C (CRDT Cleanup: 511, 518, 513, 517, 521, 519)
                                    ↓
                                Phase D (Cryptographic Auth)
                                    ↓
                                Phase E (Build Agent Polish)
```

Phases A and C can progress in parallel. Phase B is independent of C/D/E.
Phase D requires Phase C (especially 511 and 518). Phase E requires Phase D.

---

## 4. What NOT to Do

- **Don't split `crdt_state.rs` prematurely.** It's large but internally
  cohesive. A split should wait until the cleanup stories (Phase C) are done.
- **Don't add a transport abstraction layer before fixing Bug 501.** A registry
  that instantiates a broken Matrix bot just propagates the bug.
- **Don't extract `http/ws.rs` to a service module before Phase A is done.**
  The WebSocket handler touches pipeline state in string form; migrating it
  while the state machine migration is in progress will cause double-churn.