huskies/.huskies/specs/tech/SPIKE_679_HTTP_TO_CRDT_BUS.md
dave 756c790b9f spike 679: document HTTP-to-CRDT-bus migration plan
Full inventory of all gateway and project server endpoints with caller,
purpose, latency/freshness/durability requirements. Classifies each as
write/read/external-webhook/frontend-asset. Maps write endpoints to
target CRDT collections, proposes RPC frame shapes for read endpoints,
drafts the unsigned read-RPC protocol (envelope, correlation IDs, TTL,
error codes, peer-offline handling), lists in-memory state needing CRDT
migration with proposed types, and defines a wave-ordered migration plan
with explicit dependencies (story 665 Ed25519 auth as the blocker for
write migrations).
2026-04-27 14:49:38 +00:00


Spike 679: Migrate Inter-Component HTTP to Signed CRDT WebSocket Bus

1. Endpoint Inventory

Every HTTP/WS endpoint currently exposed by the gateway and project servers, with caller, purpose, and requirements.

Standard-Mode Server Endpoints

WebSocket

Path Caller Purpose Latency Freshness Durability
/ws Browser frontend Chat messages, command output streaming Real-time N/A (stream) Ephemeral
/crdt-sync Peer nodes, headless agents CRDT op replication, snapshot exchange Sub-second Must converge Durable (SQLite)

MCP

Method Path Caller Purpose Latency Freshness Durability
GET/POST /mcp Claude Code agent (stdio), gateway proxy Agent tool calls (story create/update, git, shell, etc.) <500 ms Strong (mutations) Durable via CRDT

Agents API

Method Path Caller Purpose Latency Freshness Durability
POST /api/agents/start Frontend, MCP Start a coding agent for a story <1 s N/A Durable (process started)
POST /api/agents/stop Frontend, MCP Stop a running agent <1 s N/A Durable (process killed)
GET /api/agents Frontend List active agents and status <100 ms Near-real-time None (in-memory)
GET /api/agents/config Frontend Read agent config from project.toml <100 ms Seconds OK None
POST /api/agents/config/reload Frontend Reload config from disk <500 ms N/A None
POST /api/agents/worktrees MCP Create worktree for a story <1 s N/A Durable (git)
GET /api/agents/worktrees Frontend, MCP List worktrees <100 ms Seconds OK None
DELETE /api/agents/worktrees/:story_id MCP Remove a worktree <1 s N/A Durable (git)
GET /api/agents/:story_id/:name/output Frontend, MCP Read agent log file <200 ms Seconds OK Durable (JSONL file)
GET /api/work-items/:story_id MCP Get story test results <100 ms Seconds OK Durable (file)
GET /api/work-items/:story_id/test-results MCP Fetch cached test run output <100 ms Seconds OK Durable (file)
GET /api/work-items/:story_id/token-cost MCP Get token usage for story <100 ms Seconds OK Durable (file)
GET /api/token-usage Frontend Aggregate token usage <100 ms Minutes OK Durable (file)

Project Management

Method Path Caller Purpose Latency Freshness Durability
GET /api/project Frontend Get current project config <100 ms Seconds OK Durable (file)
POST /api/project Frontend Update project config <500 ms N/A Durable (file)
DELETE /api/project Frontend Reset project config <500 ms N/A Durable (file)
GET /api/projects Frontend List all known projects <100 ms Seconds OK Durable (file)
POST /api/projects/forget Frontend Remove project from registry <500 ms N/A Durable (file)

Chat

Method Path Caller Purpose Latency Freshness Durability
POST /api/chat/cancel Frontend Cancel an in-progress chat <100 ms N/A None

Settings

Method Path Caller Purpose Latency Freshness Durability
GET/PUT /api/settings Frontend Read/write general settings <100 ms Seconds OK Durable (JSON store)
GET/PUT /api/settings/editor Frontend Read/write editor setting <100 ms Seconds OK Durable (JSON store)
POST /api/settings/open-file Frontend Open file in editor <500 ms N/A None

IO (Filesystem/Shell)

Method Path Caller Purpose Latency Freshness Durability
POST /api/io/fs/read Agent (MCP alt), Frontend Read file contents <200 ms Real-time N/A
POST /api/io/fs/write Agent (MCP alt), Frontend Write file contents <500 ms N/A Durable (fs)
POST /api/io/fs/list Frontend List directory relative to project <100 ms Real-time N/A
POST /api/io/fs/list/absolute Frontend List absolute path directory <100 ms Real-time N/A
POST /api/io/fs/create/absolute Frontend Create file at absolute path <500 ms N/A Durable (fs)
GET /api/io/fs/home Frontend Get home directory <50 ms Stable N/A
GET /api/io/fs/files Frontend File tree of project <500 ms Seconds OK N/A
POST /api/io/search Frontend Ripgrep search <1 s Real-time N/A
POST /api/io/shell/exec Frontend Execute shell command Variable N/A None

Model / LLM Config

Method Path Caller Purpose Latency Freshness Durability
GET/POST /api/model Frontend Read/write active model selection <100 ms Seconds OK Durable (JSON store)
GET /api/ollama/models Frontend List available Ollama models <1 s Minutes OK None
GET /api/anthropic/key/exists Frontend Check if API key is set <50 ms Seconds OK None
POST /api/anthropic/key Frontend Store Anthropic API key <100 ms N/A Durable (store)
GET /api/anthropic/models Frontend List Claude models <1 s Minutes OK None

Wizard

Method Path Caller Purpose Latency Freshness Durability
GET /api/wizard Frontend Get wizard state <100 ms Real-time Durable (store)
PUT /api/wizard/step/:step/content Frontend Update step content <200 ms N/A Durable (store)
POST /api/wizard/step/:step/confirm Frontend Confirm a wizard step <200 ms N/A Durable
POST /api/wizard/step/:step/skip Frontend Skip a wizard step <100 ms N/A Durable
POST /api/wizard/step/:step/generating Frontend Mark step as generating <100 ms N/A Durable

Bot / Transports

Method Path Caller Purpose Latency Freshness Durability
POST /api/bot/command Frontend Send a bot command <500 ms N/A None
GET/PUT /api/bot/config Frontend Read/write bot config <100 ms Seconds OK Durable (file)

Auth / OAuth

Method Path Caller Purpose Latency Freshness Durability
GET /oauth/authorize Browser redirect Start OAuth flow <200 ms N/A None
GET /callback OAuth provider redirect Handle OAuth callback <500 ms N/A Durable (token)
GET /oauth/status Frontend Check OAuth connection status <100 ms Seconds OK None

Webhooks (External Inbound)

Method Path Caller Purpose Latency Freshness Durability
GET/POST /webhook/whatsapp WhatsApp platform Receive WhatsApp messages <200 ms Real-time None (forwarded)
POST /webhook/slack Slack platform Receive Slack events <200 ms Real-time None (forwarded)
POST /webhook/slack/command Slack platform Receive Slack slash commands <200 ms Real-time None (forwarded)

Debug / Health

Method Path Caller Purpose Latency Freshness Durability
GET /health Gateway, load balancer Health check <50 ms Real-time None
GET /debug/crdt Developer/ops Dump raw CRDT state <500 ms Real-time None
GET (SSE) /api/agents/:story_id/:name/stream Frontend Stream live agent output Real-time N/A None
GET /api/events Gateway polling task Poll project events <200 ms Seconds OK None

Frontend Assets

Path Purpose
/ SPA entry point
/assets/* JS/CSS/fonts (rust-embed)
/*path SPA fallback

Gateway-Mode Server Endpoints

Method Path Caller Purpose Latency Freshness Durability
GET /health Load balancer, project containers Health check <50 ms Real-time None
GET /bot-config Browser Serve bot config HTML page <100 ms N/A N/A
GET /api/gateway Frontend Get gateway state (active project, project list) <100 ms Seconds OK Durable (toml)
POST /api/gateway/switch Frontend, MCP Switch active project <200 ms N/A Durable (in-memory + file)
GET /api/gateway/pipeline Frontend Aggregate pipeline status across all projects <1 s Seconds OK None (aggregated)
POST /api/gateway/projects Frontend, init_project MCP Register a new project in projects.toml <500 ms N/A Durable (file)
DELETE /api/gateway/projects/:name Frontend Remove a registered project <500 ms N/A Durable (file)
GET/PUT /api/gateway/bot-config Frontend Read/write bot config file <100 ms Seconds OK Durable (file)
GET/POST /mcp Claude Code agent MCP proxy to active project <500 ms Strong Durable via upstream
GET /gateway/mode Frontend Check whether gateway mode is active <50 ms Stable None
POST /gateway/tokens Ops/admin Generate a headless-agent join token <100 ms N/A In-memory HashMap
POST /gateway/register Headless build agent at startup Register agent with token, supply address <200 ms N/A In-memory Vec
GET /gateway/agents Frontend, ops List all registered headless agents <100 ms Seconds OK In-memory Vec
DELETE /gateway/agents/:id Frontend, ops Deregister an agent <200 ms N/A In-memory Vec
POST /gateway/agents/:id/assign Frontend, ops Assign agent to a project <200 ms N/A In-memory Vec
POST /gateway/agents/:id/heartbeat Headless agent (periodic) Signal agent is alive <100 ms Real-time In-memory Vec

2. Classification

Endpoint Group Classification
/webhook/whatsapp, /webhook/slack, /webhook/slack/command external-webhook
/, /assets/*, /*path, /bot-config (HTML) frontend-asset
POST /api/agents/start, POST /api/agents/stop, POST /api/agents/worktrees, DELETE /api/agents/worktrees/:id write
POST /api/project, DELETE /api/project, POST /api/projects/forget write
PUT /api/settings, PUT /api/settings/editor, POST /api/settings/open-file write
POST /api/model, POST /api/anthropic/key write
POST /api/wizard/step/*, PUT /api/wizard/step/* write
POST /api/bot/command, PUT /api/bot/config write
POST /api/io/fs/write, POST /api/io/fs/create/absolute, POST /api/io/shell/exec write
POST /api/gateway/switch, POST /api/gateway/projects, DELETE /api/gateway/projects/:name write
POST /gateway/tokens, POST /gateway/register, DELETE /gateway/agents/:id, POST /gateway/agents/:id/assign write
POST /gateway/agents/:id/heartbeat write
POST /mcp, GET /mcp write (mutations dominate; reads via CRDT subscription eventually)
All remaining GET endpoints read
POST /api/chat/cancel, POST /api/agents/config/reload write (side-effect only, stateless result)

3. Write Endpoints → Target CRDT Collections

| Endpoint | Current Storage | Target CRDT Collection | Notes |
| --- | --- | --- | --- |
| POST /gateway/tokens | GatewayState.pending_tokens: HashMap<String, PendingToken> | tokens: LWW map keyed by token UUID | TTL field; garbage-collect expired entries |
| POST /gateway/register | GatewayState.joined_agents: Vec<JoinedAgent> | nodes: existing CRDT node collection (extend with agent metadata) | Already partially exists for CRDT mesh peers |
| POST /gateway/agents/:id/assign | joined_agents Vec mutation | nodes: LWW field assigned_project per node entry | |
| DELETE /gateway/agents/:id | joined_agents Vec mutation | nodes: tombstone / remove entry | Add-wins or explicit remove flag |
| POST /gateway/agents/:id/heartbeat | joined_agents Vec last_seen field | nodes: LWW last_seen_ms field per node | Low-cost: just a timestamp LWW |
| POST /api/agents/start | AgentPool.agents: HashMap | No new CRDT; agent process is local. Side-effect only. | Assign record if cross-node visibility needed → active_agents LWW map |
| POST /api/agents/stop | AgentPool.agents mutation | Same as above | |
| POST /api/agents/worktrees | git filesystem | No CRDT needed; git worktrees are local | |
| POST /api/gateway/switch | GatewayState.active_project in-memory | gateway_config: LWW field active_project | |
| POST /api/gateway/projects | projects.toml file | gateway_config.projects: LWW map by project name | |
| DELETE /api/gateway/projects/:name | projects.toml file | gateway_config.projects: tombstone entry | |
| PUT /api/settings, PUT /api/settings/editor | JsonFileStore | settings: LWW map per key | Low priority; settings are single-node today |
| POST /api/model | JsonFileStore | settings: same LWW map | |
| POST /api/anthropic/key | Encrypted file/env | Stay out of CRDT (secrets) | |
| PUT /api/bot/config | .huskies/bot.toml file | Stay out of CRDT (credentials) | |
| POST /mcp | CRDT (already) | Already replicated via CRDT WebSocket bus | Story/pipeline mutations are CRDT-native |
| Merge job tracking | AgentPool.merge_jobs: HashMap<String, MergeJob> | merge_jobs: LWW map by story_id, or append-only log | Needed for cross-node merge visibility |
| Test job tracking | AppContext.test_job_registry: HashMap<WorkPath, TestJob> | test_jobs: LWW map by story_id | Needed so any node can query test status |
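The tokens row calls for a TTL field with garbage collection of expired entries. A minimal sketch of that pruning step follows, assuming a hypothetical `TokenEntry` struct with an `expires_at_ms` field (the real `PendingToken` layout is not shown in this spec) and assuming pruning runs against the local materialised view rather than emitting CRDT tombstones.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a join-token record; only the expiry matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct TokenEntry {
    /// Absolute expiry time in Unix milliseconds (assumed representation).
    pub expires_at_ms: u64,
}

/// Drops every token whose expiry is at or before `now_ms`.
/// Returns the number of entries removed.
pub fn gc_expired(tokens: &mut HashMap<String, TokenEntry>, now_ms: u64) -> usize {
    let before = tokens.len();
    tokens.retain(|_, token| token.expires_at_ms > now_ms);
    before - tokens.len()
}
```

Whether expiry should also be replicated as an explicit remove is left open here; a purely local prune is safe only if every node applies the same expiry rule.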

4. Read Endpoints → Proposed RPC Frame Shapes

Endpoint Request Fields Response Fields
GET /health (none) {status: "ok", version: string, node_id: string}
GET /api/gateway (none) {active_project: string, projects: {name, url, healthy}[]}
GET /api/gateway/pipeline (none) {projects: {name: string, pipeline: PipelineStages}[]}
GET /gateway/agents (none) {agents: {id, label, address, assigned_project, last_seen_ms, alive: bool}[]}
GET /api/agents (none) {agents: {story_id, agent_name, pid, status, started_at}[]}
GET /api/agents/worktrees (none) {worktrees: {story_id, path, branch}[]}
GET /api/agents/:id/:name/output (path params) {lines: AgentLogLine[]}
GET /api/work-items/:story_id/test-results (path param) {passed: bool, output: string, ran_at: timestamp}
GET /api/work-items/:story_id/token-cost (path param) {input_tokens: u64, output_tokens: u64, cost_usd: f64}
GET /api/token-usage (none) {total_input: u64, total_output: u64, per_agent: {...}[]}
GET /api/settings (none) {settings: Record<string, JsonValue>}
GET /api/model (none) {provider: string, model: string}
GET /api/events {since: unix_ms} {events: {type, payload, ts}[], next_since: unix_ms}
GET /debug/crdt (none) {crdt_doc: json}
GET /api/wizard (none) {steps: WizardStep[], current_step: string}
GET /api/anthropic/models (none) {models: {id, name}[]}
GET /api/ollama/models (none) {models: {name, size}[]}
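The `since` / `next_since` pair on the events row implies a cursor-style poll. The sketch below shows one way that could behave, under two assumptions not stated in the table: `since` is exclusive, and `next_since` is the largest timestamp returned (or the unchanged cursor when nothing is new). The `kind` field stands in for the wire field `type`, which is a reserved word in Rust.

```rust
/// One event as listed in the response shape {type, payload, ts}.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub kind: String,    // wire name: "type"
    pub payload: String, // raw JSON text, kept opaque in this sketch
    pub ts: u64,         // Unix milliseconds
}

/// Response for one poll: the new events plus the cursor for the next call.
#[derive(Debug, PartialEq)]
pub struct PollResult {
    pub events: Vec<Event>,
    pub next_since: u64,
}

/// Returns all events strictly newer than `since` (assumed exclusive).
pub fn poll_events(log: &[Event], since: u64) -> PollResult {
    let events: Vec<Event> = log.iter().filter(|e| e.ts > since).cloned().collect();
    // With no new events the cursor stays where it was.
    let next_since = events.iter().map(|e| e.ts).max().unwrap_or(since);
    PollResult { events, next_since }
}
```

Feeding `next_since` back in as the next `since` yields each event once, provided no two events share a timestamp that straddles a poll boundary.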

5. Draft: Unsigned Read-RPC Protocol

Rationale

Write mutations already flow through the CRDT bus (signed ops). Read endpoints are the remaining HTTP surface that could be migrated to the same WebSocket channel. This section drafts the envelope format so read RPCs can share the bus without requiring Ed25519 auth (unsigned reads are fine; only writes need authenticity guarantees).

Frame Envelope (JSON over WebSocket)

// Request (caller → peer)
{
  "version": 1,
  "kind": "rpc_request",
  "correlation_id": "uuid-v4",
  "ttl_ms": 5000,
  "method": "pipeline.get",
  "params": {}
}

// Success response (peer → caller)
{
  "version": 1,
  "kind": "rpc_response",
  "correlation_id": "uuid-v4",
  "ok": true,
  "result": { ... }
}

// Error response
{
  "version": 1,
  "kind": "rpc_response",
  "correlation_id": "uuid-v4",
  "ok": false,
  "error": "human-readable message",
  "code": "NOT_FOUND | TIMEOUT | PEER_OFFLINE | UNAUTHORIZED | INTERNAL"
}

Correlation IDs

Each request carries a UUID v4 correlation_id. The responder echoes it verbatim. Callers maintain a HashMap<String, oneshot::Sender> to route responses back to waiting futures. On TTL expiry the entry is removed and the caller receives Err(Timeout).
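The routing table described above can be sketched with the standard library alone. This version substitutes `std::sync::mpsc` channels for the async `oneshot::Sender` named in the text so that it compiles without external crates; `PendingRequests`, `RpcOutcome`, and the string-typed payloads are illustrative names, not part of the spec.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Outcome handed to a waiting caller: raw JSON result text, or an error code.
pub type RpcOutcome = Result<String, String>;

/// Tracks in-flight requests by correlation ID.
#[derive(Default)]
pub struct PendingRequests {
    waiting: HashMap<String, Sender<RpcOutcome>>,
}

impl PendingRequests {
    /// Registers a request and returns the receiver the caller waits on.
    pub fn register(&mut self, correlation_id: &str) -> Receiver<RpcOutcome> {
        let (tx, rx) = channel();
        self.waiting.insert(correlation_id.to_string(), tx);
        rx
    }

    /// Routes a response to its waiter and removes the entry.
    /// Returns false for unknown or already-expired IDs; such responses are dropped.
    pub fn resolve(&mut self, correlation_id: &str, outcome: RpcOutcome) -> bool {
        match self.waiting.remove(correlation_id) {
            Some(tx) => tx.send(outcome).is_ok(),
            None => false,
        }
    }

    /// Called on TTL expiry: removes the entry and delivers a timeout error.
    pub fn expire(&mut self, correlation_id: &str) -> bool {
        self.resolve(correlation_id, Err("TIMEOUT".to_string()))
    }

    /// Number of requests still awaiting a response.
    pub fn in_flight(&self) -> usize {
        self.waiting.len()
    }
}
```

Removing the entry on both paths means a late response after expiry finds no waiter and is silently discarded, which matches the echo-and-route behaviour described above.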

TTL Semantics

  • Caller specifies ttl_ms (default 5000, max 30000).
  • If the responding peer does not answer within the TTL, the caller synthesises a TIMEOUT error response locally.
  • Responders do not need to track TTLs; they answer as fast as they can.
  • Callers may use stale cached results if ttl_ms == 0 is supplied and a cache entry exists (opt-in freshness trade-off).
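The TTL rules above reduce to a clamp plus a bounded wait on the caller side. The following is a rough sketch using a blocking `std::sync::mpsc` receiver in place of whatever async primitive the implementation ends up with. The helper names are hypothetical, and mapping a dropped sender to PEER_OFFLINE is an assumption of this sketch, not something the draft specifies.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

pub const DEFAULT_TTL_MS: u64 = 5_000;
pub const MAX_TTL_MS: u64 = 30_000;

/// Applies the default and the cap. A value of 0 is passed through
/// unchanged because it is the opt-in signal for cached results.
pub fn effective_ttl_ms(requested: Option<u64>) -> u64 {
    requested.unwrap_or(DEFAULT_TTL_MS).min(MAX_TTL_MS)
}

/// Waits for a response for at most `ttl_ms`. On expiry the caller
/// synthesises TIMEOUT locally; the responder is never consulted.
pub fn await_response(rx: &Receiver<String>, ttl_ms: u64) -> Result<String, &'static str> {
    match rx.recv_timeout(Duration::from_millis(ttl_ms)) {
        Ok(body) => Ok(body),
        Err(RecvTimeoutError::Timeout) => Err("TIMEOUT"),
        // Assumption: a dropped sender means the connection went away.
        Err(RecvTimeoutError::Disconnected) => Err("PEER_OFFLINE"),
    }
}
```

Because only the caller enforces the deadline, responders stay stateless with respect to TTLs, as the third bullet requires.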

Error Codes

| Code | Meaning |
| --- | --- |
| NOT_FOUND | Resource does not exist |
| TIMEOUT | Peer did not respond within TTL |
| PEER_OFFLINE | No live peer with the requested capability is connected |
| UNAUTHORIZED | Caller lacks permission (future, when auth lands) |
| INTERNAL | Unexpected server-side error |
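One possible in-code representation of these codes is a plain enum with a fixed wire string per variant. The enum name and helper methods below are illustrative; the draft only fixes the strings themselves.

```rust
use std::fmt;
use std::str::FromStr;

/// The five error codes of the read-RPC draft.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RpcErrorCode {
    NotFound,
    Timeout,
    PeerOffline,
    Unauthorized,
    Internal,
}

impl RpcErrorCode {
    /// Wire string carried in the `code` field of an error response.
    pub fn as_str(self) -> &'static str {
        match self {
            RpcErrorCode::NotFound => "NOT_FOUND",
            RpcErrorCode::Timeout => "TIMEOUT",
            RpcErrorCode::PeerOffline => "PEER_OFFLINE",
            RpcErrorCode::Unauthorized => "UNAUTHORIZED",
            RpcErrorCode::Internal => "INTERNAL",
        }
    }

    /// TIMEOUT and PEER_OFFLINE are produced locally by the caller,
    /// never sent by a responder.
    pub fn is_synthesised_by_caller(self) -> bool {
        matches!(self, RpcErrorCode::Timeout | RpcErrorCode::PeerOffline)
    }
}

impl fmt::Display for RpcErrorCode {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

impl FromStr for RpcErrorCode {
    type Err = String;

    /// Parses a wire string; unknown codes are rejected rather than guessed.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "NOT_FOUND" => Ok(RpcErrorCode::NotFound),
            "TIMEOUT" => Ok(RpcErrorCode::Timeout),
            "PEER_OFFLINE" => Ok(RpcErrorCode::PeerOffline),
            "UNAUTHORIZED" => Ok(RpcErrorCode::Unauthorized),
            "INTERNAL" => Ok(RpcErrorCode::Internal),
            other => Err(format!("unknown RPC error code: {other}")),
        }
    }
}
```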

Peer-Offline Handling

  • Before sending a request the caller checks whether any peer that can serve the method is currently connected.
  • If no peer is online, the caller immediately returns PEER_OFFLINE without queuing (fail-fast).
  • For idempotent reads, callers may fall back to a local CRDT-materialized view if PEER_OFFLINE or TIMEOUT is received.
  • Non-idempotent reads (e.g., exec_shell) must not be retried automatically.
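The fail-fast check can be expressed as a pure decision made before any frame is sent. The sketch below assumes a hypothetical capability registry (peer ID mapped to the set of methods it serves) and a flag saying whether a local CRDT-materialised view exists for an idempotent read. Neither structure is defined by this spec, and picking the lexicographically smallest peer is just a deterministic placeholder.

```rust
use std::collections::{HashMap, HashSet};

/// What the caller should do with a request before sending anything.
#[derive(Debug, PartialEq, Eq)]
pub enum Dispatch {
    /// Send the request to this connected peer.
    SendTo(String),
    /// No peer available, but an idempotent read can use the local view.
    LocalFallback,
    /// No peer and no fallback: fail fast without queuing.
    PeerOffline,
}

/// `connected` maps peer ID to the methods that peer can serve.
pub fn plan_dispatch(
    method: &str,
    connected: &HashMap<String, HashSet<String>>,
    has_local_view: bool,
) -> Dispatch {
    let mut candidates: Vec<&String> = connected
        .iter()
        .filter(|(_, methods)| methods.contains(method))
        .map(|(peer, _)| peer)
        .collect();
    // Sort so the choice does not depend on HashMap iteration order.
    candidates.sort();
    match candidates.first() {
        Some(peer) => Dispatch::SendTo(peer.to_string()),
        None if has_local_view => Dispatch::LocalFallback,
        None => Dispatch::PeerOffline,
    }
}
```

A caller would pass `has_local_view = false` for anything non-idempotent, so such calls can only be sent once or fail with PEER_OFFLINE.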

Method Naming Convention

<noun>.<verb> — e.g. pipeline.get, agents.list, health.check, events.poll.
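A responder could validate incoming method names against this convention before dispatching. The sketch below assumes exactly one dot and that both parts are non-empty and consist of lowercase ASCII letters or underscores; the character set is inferred from the examples, not stated by the draft.

```rust
/// Splits a method name of the form `<noun>.<verb>` into its two parts.
/// Returns None when the name does not follow the convention.
pub fn parse_method(name: &str) -> Option<(&str, &str)> {
    let (noun, verb) = name.split_once('.')?;
    // Assumed character set: lowercase ASCII letters and underscores.
    let valid = |part: &str| {
        !part.is_empty() && part.chars().all(|c| c.is_ascii_lowercase() || c == '_')
    };
    if valid(noun) && valid(verb) {
        Some((noun, verb))
    } else {
        None
    }
}
```

For example, `parse_method("pipeline.get")` yields the noun `pipeline` and the verb `get`, while a name without a dot is rejected.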


6. In-Memory State → CRDT Collection Migration

| Location | Field | Current Type | Proposed CRDT Type | Rationale |
| --- | --- | --- | --- | --- |
| gateway.rs::GatewayState | pending_tokens | HashMap<String, PendingToken> | LWW-map keyed by token UUID, with expires_at TTL field | Tokens are short-lived; LWW is fine; GC by TTL |
| gateway.rs::GatewayState | joined_agents | Vec<JoinedAgent> | Extend existing nodes CRDT collection with agent metadata fields (label, address, assigned_project, last_seen_ms) | Nodes collection already exists for CRDT mesh peers |
| agents/pool/mod.rs::AgentPool | merge_jobs | HashMap<String, MergeJob> | LWW-map keyed by story_id; fields: node_id, status, started_at, error | Required for cross-node merge visibility |
| agents/pool/mod.rs::AgentPool | agents (running agent handles) | HashMap<String, StoryAgent> | LWW-map active_agents keyed by story_id; fields: node_id, agent_name, pid (optional), started_at, status | Process handles stay local; only metadata replicated |
| http/context.rs::AppContext | test_job_registry | HashMap<WorkPath, TestJob> (TestJobRegistry) | LWW-map test_jobs keyed by story_id; fields: node_id, status, started_at, finished_at | Needed so any node can query test run status |
| agents/pool/auto_assign | agent throttle / last-seen timestamps | Local variables / in-memory | LWW-map agent_throttle keyed by agent_name; field: last_dispatched_at | Prevents double-dispatch on multi-node |
| gateway.rs::GatewayState | active_project | Arc<RwLock<String>> | LWW register in gateway_config collection, field active_project | Single-value; LWW is correct |
| gateway.rs::GatewayState | projects (BTreeMap) | Arc<RwLock<BTreeMap<String, ProjectEntry>>> | LWW-map in gateway_config.projects keyed by project name | Infrequently mutated; LWW correct |

Summary of Proposed New CRDT Collections

| Collection | Type | Notes |
| --- | --- | --- |
| tokens | LWW-map | Join tokens with TTL; garbage-collect on expiry |
| nodes | LWW-map (extend existing) | Already exists; add agent metadata fields |
| merge_jobs | LWW-map | One entry per story; overwritten on each merge attempt |
| active_agents | LWW-map | One entry per story; metadata only (not process handles) |
| test_jobs | LWW-map | One entry per story; test run status |
| agent_throttle | LWW-map | One entry per agent name; last-dispatched timestamp |
| gateway_config | LWW-map (or flat LWW fields) | active_project, projects map |
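Since every proposed collection is a last-writer-wins map, a small generic sketch may help make the merge rule concrete. It assumes each write carries a millisecond timestamp and the writer's node ID, with the node ID breaking timestamp ties. The existing CRDT layer may order writes differently, and tombstones are left out.

```rust
use std::collections::HashMap;

/// A value together with the stamp of the write that produced it.
#[derive(Debug, Clone, PartialEq)]
pub struct Stamped<V> {
    pub value: V,
    pub ts_ms: u64,
    pub node_id: String,
}

/// Last-writer-wins map keyed by string (story_id, agent name, and so on).
#[derive(Debug)]
pub struct LwwMap<V> {
    entries: HashMap<String, Stamped<V>>,
}

impl<V: Clone> LwwMap<V> {
    pub fn new() -> Self {
        LwwMap { entries: HashMap::new() }
    }

    /// Applies a write. It is kept only if its (timestamp, node_id) pair is
    /// strictly greater than the stored one. Returns whether it won.
    pub fn set(&mut self, key: &str, value: V, ts_ms: u64, node_id: &str) -> bool {
        let wins = match self.entries.get(key) {
            Some(cur) => (ts_ms, node_id) > (cur.ts_ms, cur.node_id.as_str()),
            None => true,
        };
        if wins {
            let entry = Stamped { value, ts_ms, node_id: node_id.to_string() };
            self.entries.insert(key.to_string(), entry);
        }
        wins
    }

    pub fn get(&self, key: &str) -> Option<&V> {
        self.entries.get(key).map(|e| &e.value)
    }

    /// Merges another replica into this one, entry by entry.
    pub fn merge(&mut self, other: &LwwMap<V>) {
        for (key, e) in &other.entries {
            self.set(key, e.value.clone(), e.ts_ms, &e.node_id);
        }
    }
}
```

Because the comparison is a total order on (timestamp, node ID), two replicas that exchange state in either direction end up with the same entry for each key.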

7. Migration Order and Dependencies

Blocking Dependency

Story 665 (Ed25519 auth) must land before any write operation is migrated to the CRDT bus. Unsigned writes on a shared bus would allow any connected peer to forge mutations. Read RPCs do not require auth.

Wave 0 — Foundation (no story 665 needed)

These can land in parallel with or before story 665:

  1. Extend nodes CRDT collection with label, address, assigned_project, last_seen_ms fields. This is a pure schema addition.
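  2. Add merge_jobs and active_agents LWW-maps to the CRDT document schema (additive; older nodes skip the unknown fields, which serde does unless deny_unknown_fields is set, and newer nodes can fill fields missing from older documents via serde(default)).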
  3. Implement unsigned read-RPC multiplexer on the existing /crdt-sync WebSocket channel (new kind: "rpc_request"/"rpc_response" frame types, ignored by old peers).
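Item 3 amounts to a routing decision on the `kind` field of each incoming frame. The sketch below assumes the field has already been extracted from the JSON frame, if present, and that anything other than the two new kinds belongs to the existing sync handling. How current sync frames are actually tagged is not described in this spec.

```rust
/// Where an incoming frame on /crdt-sync should be handled.
#[derive(Debug, PartialEq, Eq)]
pub enum FrameRoute {
    /// New read-RPC request: hand to the method dispatcher.
    RpcRequest,
    /// New read-RPC response: hand to the correlation-ID table.
    RpcResponse,
    /// Everything else: existing CRDT op / snapshot handling, untouched.
    CrdtSync,
}

/// Routes on the `kind` field; `None` means the frame carried no such field.
pub fn route_frame(kind: Option<&str>) -> FrameRoute {
    match kind {
        Some("rpc_request") => FrameRoute::RpcRequest,
        Some("rpc_response") => FrameRoute::RpcResponse,
        _ => FrameRoute::CrdtSync,
    }
}
```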

Wave 1 — Migrate Heartbeat + Agent Registration (after nodes schema extended)

  • Replace POST /gateway/agents/:id/heartbeat HTTP call with a CRDT LWW write to nodes[id].last_seen_ms.
  • Replace POST /gateway/register with a CRDT insert into nodes collection.
  • Replace POST /gateway/tokens / token validation with CRDT tokens map read/write.
  • Blocks on story 665 for the write side; read queries (list agents, check token) can migrate via read-RPC first.
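With heartbeats stored as an LWW timestamp, liveness becomes a derived value rather than something the gateway tracks. A minimal sketch follows, assuming a hypothetical `NodeEntry` struct and a caller-chosen silence threshold, since the spec does not fix a heartbeat interval or timeout.

```rust
/// Hypothetical slice of a nodes[id] entry; only the heartbeat field is shown.
#[derive(Debug, Clone, PartialEq)]
pub struct NodeEntry {
    pub last_seen_ms: u64,
}

/// Records a heartbeat. For a timestamp field, last-writer-wins reduces to
/// keeping the maximum, so late or reordered heartbeats cannot move it back.
pub fn record_heartbeat(node: &mut NodeEntry, seen_at_ms: u64) {
    node.last_seen_ms = node.last_seen_ms.max(seen_at_ms);
}

/// Derives liveness from the replicated timestamp.
/// `max_silence_ms` is an assumed tuning parameter, not part of the spec.
pub fn is_alive(node: &NodeEntry, now_ms: u64, max_silence_ms: u64) -> bool {
    now_ms.saturating_sub(node.last_seen_ms) <= max_silence_ms
}
```

Any node holding the replicated entry can compute the same answer, which is what lets the agent list be served without an HTTP call to the gateway.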

Wave 2 — Migrate Read Endpoints to Read-RPC (no auth required)

Can land in parallel with Wave 1 write migration:

  • GET /health → health.check RPC (gateway reads from CRDT nodes liveness)
  • GET /gateway/agents → agents.list RPC reading from CRDT nodes
  • GET /api/events polling loop → subscribe to CRDT op stream directly (eliminate polling)
  • GET /api/gateway/pipeline → pipeline.get RPC or direct CRDT materialisation (already replicated)
  • GET /api/agents → active_agents.list RPC reading from CRDT active_agents

Wave 3 — Migrate Merge and Test Job Tracking (after waves 0–1)

  • Replace merge_jobs HashMap with CRDT merge_jobs map writes on merge start/completion.
  • Replace test_job_registry HashMap with CRDT test_jobs map writes on test start/completion.
  • Enables any node to query merge or test status without an HTTP call to the node that started the job.

Wave 4 — Migrate Gateway Config Writes (after story 665)

  • POST /api/gateway/switch, POST /api/gateway/projects, DELETE /api/gateway/projects/:name → CRDT gateway_config LWW writes.
  • Low urgency; these are infrequent admin operations. Can keep HTTP as a thin wrapper that writes to CRDT.

Endpoints That Stay HTTP

| Endpoint | Reason |
| --- | --- |
| /webhook/whatsapp, /webhook/slack | External platform callbacks; must remain HTTP |
| /oauth/authorize, /callback | OAuth redirect flow; must remain HTTP |
| /api/io/*, /api/io/shell/exec | Local filesystem/shell; process-local, not cross-node |
| /api/io/fs/* | Same: local I/O only |
| /mcp | External MCP clients (Claude Code CLI) speak HTTP/SSE; gateway proxy stays HTTP |
| /assets/*, /, /*path | Static frontend assets |
| /api/anthropic/key, PUT /api/bot/config | Credentials; must stay local, never in CRDT |
| GET /debug/crdt | Debug only; HTTP fine |

Dependency Graph Summary

story 665 (Ed25519 auth)
    └── Wave 1 write migrations (heartbeat, register, assign, tokens)
        └── Wave 4 gateway config writes

Wave 0 (schema extensions + read-RPC multiplexer)  [can start now, parallel]
    ├── Wave 2 read endpoint migrations            [can start now, parallel]
    └── Wave 3 merge/test job tracking             [after Wave 0 schema]

Critical path: Story 665 → Wave 1 → Wave 4. Everything else is parallel.
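The dependency graph is small enough to check mechanically. The sketch below takes the edges from the graph above as (prerequisite, dependent) pairs and derives one valid start order with a simple topological sort. The function name is illustrative, and ties are broken alphabetically only to keep the output deterministic.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns one order in which all items can be started so that every
/// prerequisite comes first, or None if the edges contain a cycle.
pub fn start_order<'a>(deps: &[(&'a str, &'a str)]) -> Option<Vec<String>> {
    // For each item, the set of prerequisites that are still unmet.
    let mut blockers: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for &(pre, dep) in deps {
        blockers.entry(pre).or_default();
        blockers.entry(dep).or_default().insert(pre);
    }
    let mut order = Vec::new();
    while !blockers.is_empty() {
        // First item (alphabetically) with nothing left blocking it;
        // if none exists the remaining items form a cycle.
        let ready = blockers
            .iter()
            .find(|(_, unmet)| unmet.is_empty())
            .map(|(name, _)| *name)?;
        blockers.remove(ready);
        for unmet in blockers.values_mut() {
            unmet.remove(ready);
        }
        order.push(ready.to_string());
    }
    Some(order)
}
```

Called with the four edges story 665 → Wave 1, Wave 1 → Wave 4, Wave 0 → Wave 2, and Wave 0 → Wave 3, it places Wave 0 and its dependents independently of the story 665 chain, which reflects the claim that only that chain is on the critical path.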