Spike 814: Chat-Driven Update Command for Multi-Project Gateway

1. Problem Statement

In a multi-project gateway deployment (Docker Compose or similar), each project runs as its own container. Today, updating a project container requires direct operator access to the host — `docker pull`, `docker compose up -d <project>`, or equivalent. There is no way to trigger an update from chat.

This spike designs an `update` bot command that:

  • Can be typed in the Matrix/Slack/Discord chat room.
  • Pulls the latest image (or rebuilds from source) for one or all project containers managed by the gateway.
  • Reports progress and outcome back to the room.
  • Supports rollback when a container fails to start cleanly.

2. Command Surface

Basic syntax

```
update [<project>|all] [--rollback]
```

| Invocation | Effect |
| --- | --- |
| `update huskies` | Update and restart the huskies container. |
| `update all` | Update every registered project container, one at a time. |
| `update` (no args) | Same as `update all`. |
| `update huskies --rollback` | Roll back huskies to its previous image tag. |
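
A minimal parsing sketch for this surface; the `UpdateTarget` and `UpdateCommand` types and the `parse_update` helper are illustrative names, not existing code:

```rust
/// Illustrative types for the parsed `update` command (hypothetical names).
#[derive(Debug, PartialEq)]
enum UpdateTarget {
    All,
    Project(String),
}

#[derive(Debug, PartialEq)]
struct UpdateCommand {
    target: UpdateTarget,
    rollback: bool,
}

/// Parse `update [<project>|all] [--rollback]` from a chat message body.
fn parse_update(body: &str) -> Option<UpdateCommand> {
    let mut words = body.trim().split_whitespace();
    if words.next()? != "update" {
        return None;
    }
    let mut target = UpdateTarget::All; // bare `update` means `update all`
    let mut rollback = false;
    for word in words {
        match word {
            "all" => target = UpdateTarget::All,
            "--rollback" => rollback = true,
            name => target = UpdateTarget::Project(name.to_string()),
        }
    }
    Some(UpdateCommand { target, rollback })
}
```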

Progress feedback

The bot posts incremental updates to the room (editing the same message where the platform supports it):

```
[huskies] Pulling image…  ⏳
[huskies] Image pulled (sha256:abc123). Stopping container…
[huskies] Container stopped. Starting new container…
[huskies] Health check passed ✅ (2 s)
```

On failure:

```
[huskies] Health check failed after 30 s ❌
[huskies] Rolling back to previous image (sha256:def456)…
[huskies] Rollback complete ✅
```

Error cases

| Condition | Response |
| --- | --- |
| Unknown project name | `Unknown project 'foo'. Known projects: huskies, robot-studio` |
| No Docker socket access | `Update not available: Docker socket not mounted` |
| Rollback with no previous image | `No previous image recorded for 'huskies'; cannot roll back` |
| Project container not managed by Docker | `'huskies' is not a container-managed project; rebuild it manually` |

3. Auth

3.1 Threat model

The update command triggers container replacement — a privileged operation equivalent to docker compose up -d. An unauthenticated attacker who can send a message to the bot room could force a rolling restart or roll back a working container.

3.2 Proposed approach: room + role guard

Layer 1 — Room restriction. The update command is only accepted in a designated ops room, configured in bot.toml (or projects.toml):

```toml
[gateway.ops_room]
room_id = "!abc123:homeserver.example.com"
```

Messages from other rooms are rejected with: `The update command is only available in the ops room.`

Layer 2 — Sender role check (Matrix/Slack). The bot checks the sender's power level (Matrix) or admin status (Slack/Discord). Only users with power level ≥ 50 (moderator) on Matrix, or workspace admin on Slack, may issue `update`. Unapproved senders receive: `You do not have permission to issue update commands.`
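
A sketch of how the two guard layers might compose; the function and its parameters are hypothetical, with Slack/Discord admin status mapped onto the power-level argument:

```rust
/// Layers 1 and 2 as a single guard. All names here are hypothetical;
/// `sender_power_level` would be the Matrix power level, with Slack/Discord
/// admin status mapped to 100.
fn authorize_update(
    room_id: &str,
    ops_room_id: &str,
    sender_power_level: i64,
) -> Result<(), &'static str> {
    if room_id != ops_room_id {
        return Err("The update command is only available in the ops room.");
    }
    if sender_power_level < 50 {
        return Err("You do not have permission to issue update commands.");
    }
    Ok(())
}
```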

Layer 3 — Confirmation prompt for destructive operations. update all affects every project. The bot responds with a confirmation challenge:

```
This will restart all 3 project containers. Reply `yes` within 60 s to confirm, or `no` to cancel.
```

Single-project updates (update huskies) do not require confirmation — they are already scoped.
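
The 60 s window can sit on a plain `tokio::time::timeout`; a sketch, assuming a hypothetical `wait_for_reply` future that resolves with the sender's next message in the room:

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Confirmation challenge for `update all`. `wait_for_reply` is a
/// hypothetical future resolving with the sender's next ops-room message.
async fn confirm_update_all(
    wait_for_reply: impl std::future::Future<Output = String>,
) -> bool {
    match timeout(Duration::from_secs(60), wait_for_reply).await {
        Ok(reply) => reply.trim().eq_ignore_ascii_case("yes"),
        Err(_elapsed) => false, // window expired: treat as `no`
    }
}
```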

3.3 Future: Ed25519 operator token

When story 665 (Ed25519 auth) lands, the gateway's node identity keypair can sign an operator token. The bot verifies the token against the node's public key before acting. This removes the room/role dependency and allows the command to be issued programmatically (e.g. from a CI pipeline via MCP).
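
A verification sketch for that future token check, assuming the ed25519-dalek 2.x API and an illustrative token layout of payload bytes plus a detached 64-byte signature:

```rust
use ed25519_dalek::{Signature, Verifier, VerifyingKey};

/// Verify an operator token against the node's public key. The token
/// layout (payload + detached signature) is an assumption for illustration.
fn verify_operator_token(
    node_public_key: &VerifyingKey,
    payload: &[u8],
    signature_bytes: &[u8; 64],
) -> bool {
    let signature = Signature::from_bytes(signature_bytes);
    node_public_key.verify(payload, &signature).is_ok()
}
```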

For now the room + role guard is sufficient.


4. Rollout Approach

4.1 Docker-managed containers (primary path)

The gateway process has access to the Docker socket (mounted as a volume at /var/run/docker.sock). The update sequence for a single project:

  1. Record current image — read the running container's image digest and store it in the gateway's update_history LWW-map in the CRDT, keyed by project name.
  2. Pull new image — `docker pull <image>` (or the compose-file equivalent tag).
  3. Drain connections — the gateway marks the project as updating; new proxy requests return 503 with a `Retry-After: 5` header; in-flight requests are allowed to complete (30 s grace window).
  4. Stop old container — `docker stop --time=30 <container_name>`.
  5. Start new container — `docker start <container_name>` (or `docker compose up -d <service>`).
  6. Health check — poll the project's `/health` endpoint until 200 OK or a 30 s timeout.
  7. Restore routing — remove the updating flag; the proxy resumes normal operation.

Steps 1–7 are serialised per project. When `update all` is used, projects are updated one at a time (not in parallel) to limit the blast radius. A sketch of the whole sequence follows below.
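
In code, the sequence might look like this — every interface here (`DockerApi`, `GatewayState`, `UpdateState`, `health_check`, `rollback`) is hypothetical; `DockerApi` is sketched further in §6.1:

```rust
use std::time::Duration;

/// Steps 1–7 for a single project, over hypothetical interfaces.
async fn update_project(
    docker: &impl DockerApi,
    state: &GatewayState,
    project: &str,
) -> anyhow::Result<()> {
    // 1. Record the running image digest in the update_history LWW-map.
    let previous = docker.image_digest(project).await?;
    state.record_previous_image(project, &previous);

    // 2. Pull the project's configured image tag.
    let image = state.image_for(project);
    docker.pull(&image).await?;

    // 3. Drain: proxy answers 503 + Retry-After: 5 while we swap containers.
    state.set_update_state(project, UpdateState::Updating);
    state.drain_connections(project, Duration::from_secs(30)).await;

    // 4–5. Replace the container with a 30 s stop grace period.
    docker.stop(project, 30).await?;
    docker.start(project).await?;

    // 6. Health check; on timeout or non-200, fall into automatic rollback.
    if !health_check(project, Duration::from_secs(30)).await {
        return rollback(docker, state, project).await;
    }

    // 7. Restore routing.
    state.set_update_state(project, UpdateState::Idle);
    Ok(())
}
```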

4.2 Source-rebuild path (non-Docker / dev mode)

When Docker is not available (the gateway binary is running directly on the host, not in a container), the update command falls back to the existing `rebuild_and_restart` flow (`server/src/rebuild.rs`): `cargo build` → re-exec. This path cannot update individual projects independently — it rebuilds the gateway itself.

4.3 Gateway state during update

```
normal → updating → (success) normal
                  → (failure) rolling_back → normal
```

The CRDT `gateway_config` collection gains four new LWW fields per project:

| Field | Type | Purpose |
| --- | --- | --- |
| `update_state` | `"idle" \| "updating" \| "rolling_back" \| "error"` | Current update lifecycle stage |
| `update_started_at` | `u64` (unix ms) | When the update was triggered |
| `previous_image` | `string` | Image digest before the most recent update |
| `current_image` | `string` | Image digest currently running |

These fields are replicated to all nodes so that other gateway instances and headless agents can observe update progress without polling HTTP.
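
A possible Rust-side representation of the lifecycle field — the serde derives are an assumption, and the enum includes the `"error"` state that §5.1 introduces:

```rust
use serde::{Deserialize, Serialize};

/// Lifecycle stage stored in the per-project `update_state` LWW field.
/// `Error` is the stuck state set when a rollback health check also
/// fails (§5.1).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
enum UpdateState {
    Idle,
    Updating,
    RollingBack,
    Error,
}
```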


5. Rollback Approach

5.1 Automatic rollback

If the health check in step 6 (§4.1) times out or returns a non-200 status, the gateway automatically:

  1. Logs the failure: [update] health check failed for huskies after 30 s.
  2. Posts to the ops room: Health check failed. Rolling back….
  3. Runs docker stop on the new container.
  4. Pulls and starts the previous image digest (stored in previous_image).
  5. Re-runs the health check on the rolled-back container.
  6. Reports outcome to the room.

If the rollback health check also fails, the bot reports:

```
Rollback failed. Manual intervention required. Previous image: sha256:def456
```

and sets update_state = "error" in the CRDT. The ops room is notified; no further automatic action is taken.

5.2 Manual rollback

An operator can issue `update huskies --rollback` at any time while the project is in the idle state. The command replays steps 3–7 of §4.1 with `previous_image` substituted for the target image. `previous_image` is overwritten with the image that was displaced, so repeated rollbacks alternate between two images.
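
The alternation falls out of a plain swap; a sketch over a hypothetical view of the two CRDT fields:

```rust
/// Rollback bookkeeping: swapping the two digests makes the displaced image
/// the new `previous_image`, so repeated rollbacks alternate between the
/// same two images. `ProjectImages` is a hypothetical view of the CRDT fields.
struct ProjectImages {
    current_image: String,
    previous_image: String,
}

impl ProjectImages {
    /// Returns the image digest the rollback should start.
    fn swap_for_rollback(&mut self) -> &str {
        std::mem::swap(&mut self.current_image, &mut self.previous_image);
        &self.current_image
    }
}
```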

5.3 Rollback unavailability

Rollback is unavailable when:

  • No previous_image is recorded (first-ever update on this installation).
  • update_state is already "updating" or "rolling_back" (only one concurrent update per project).

6. Implementation Sketch

6.1 New files

| Path | Purpose |
| --- | --- |
| `server/src/chat/commands/update.rs` | Synchronous `handle_update` stub (returns `None` — async, like `rebuild`) |
| `server/src/service/gateway/update.rs` | Core update/rollback logic; calls the Docker API or falls back to `rebuild.rs` |
| `server/src/service/gateway/docker.rs` | Thin wrapper around the Docker socket HTTP API (`/containers/:id/start` etc.) |
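
A possible surface for the `docker.rs` wrapper — the trait and method names are hypothetical, but the commented routes are the documented Docker Engine API endpoints reached over `/var/run/docker.sock`:

```rust
/// Sketch of the thin Docker socket wrapper (hypothetical shape).
#[async_trait::async_trait]
trait DockerApi {
    /// POST /images/create?fromImage=<image> (the API form of `docker pull`)
    async fn pull(&self, image: &str) -> anyhow::Result<()>;
    /// POST /containers/{id}/stop?t=<grace_seconds>
    async fn stop(&self, container: &str, grace_seconds: u32) -> anyhow::Result<()>;
    /// POST /containers/{id}/start
    async fn start(&self, container: &str) -> anyhow::Result<()>;
    /// GET /containers/{id}/json → .Image (running image digest)
    async fn image_digest(&self, container: &str) -> anyhow::Result<String>;
}
```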

6.2 New CRDT fields

Extend the gateway_config CRDT document (already exists per Spike 679 §6) with:

  • projects.<name>.update_state (LWW string)
  • projects.<name>.update_started_at (LWW u64)
  • projects.<name>.previous_image (LWW string)
  • projects.<name>.current_image (LWW string)

6.3 Gateway HTTP changes

Add one endpoint for the Docker-fallback check:

```
GET /gateway/update/available
→ {"available": true, "mode": "docker"} | {"available": true, "mode": "rebuild"} | {"available": false}
```

The frontend can use this to show/hide an "Update" button in the gateway project list.
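
A sketch of the response type and mode detection; the names and detection inputs are illustrative, and the handler wiring depends on the gateway's HTTP framework:

```rust
use serde::Serialize;

/// Body for GET /gateway/update/available; field shapes mirror the JSON above.
#[derive(Serialize)]
struct UpdateAvailability {
    available: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    mode: Option<&'static str>, // "docker" | "rebuild"
}

fn detect_update_mode(docker_socket: bool, can_rebuild: bool) -> UpdateAvailability {
    match (docker_socket, can_rebuild) {
        (true, _) => UpdateAvailability { available: true, mode: Some("docker") },
        (false, true) => UpdateAvailability { available: true, mode: Some("rebuild") },
        (false, false) => UpdateAvailability { available: false, mode: None },
    }
}
```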

6.4 Async dispatch

`update` is an async command (like `rebuild`, `htop`, `start`). The command keyword is detected in `on_room_message` before `try_handle_command` is invoked. The handler spawns a task with `tokio::spawn`, posts incremental updates via the existing transport's `send_message` / `edit_message` API, and returns.
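
A dispatch sketch, reusing the hypothetical `UpdateCommand` from §2; the `Transport` type, the `run_update` driver, and the exact `send_message`/`edit_message` signatures are assumptions:

```rust
use std::sync::Arc;

/// Dispatch sketch: only the spawn-and-return shape is the point here.
fn dispatch_update(transport: Arc<Transport>, cmd: UpdateCommand) {
    // Return to the message loop immediately; the update runs in background.
    tokio::spawn(async move {
        // Post the first progress line; later steps edit this same message
        // where the platform supports it (transport.edit_message).
        let msg = transport.send_message("[gateway] update started ⏳").await;
        run_update(&transport, msg, cmd).await;
    });
}
```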


7. Open Questions

| # | Question | Notes |
| --- | --- | --- |
| 1 | Should the Docker socket be mounted in the gateway container by default? | Security trade-off: socket access = container-escape risk. Alternative: `docker exec` via a sidecar. |
| 2 | Should `update all` use a sequential or rolling strategy? | Sequential is safer; rolling is faster. Sequential chosen for v1. |
| 3 | How do we handle projects not managed by Docker (e.g. running on bare metal)? | The rebuild fallback covers the gateway itself; project-specific fallback is out of scope for v1. |
| 4 | Should the confirmation challenge expire? | Yes — 60 s timeout, configurable in bot.toml. |
| 5 | Should update history be persisted beyond the CRDT (i.e. across full gateway restarts)? | The CRDT persists to SQLite, so yes, as long as the CRDT DB survives the restart. |
| 6 | Multi-gateway HA: which node triggers the actual Docker call? | The node that owns the Docker socket. The CRDT `update_state` prevents double-triggering. |

8. Dependencies

| Story / Spike | Dependency type |
| --- | --- |
| Spike 679 (HTTP → CRDT bus) | Soft — `gateway_config` LWW collection needed for update state; can stub without it |
| Story 665 (Ed25519 auth) | Soft — operator-token auth is a future hardening step; the room+role guard suffices for v1 |
| `server/src/rebuild.rs` | Direct — reuse `rebuild_and_restart` for the non-Docker path |
| `server/src/gateway_relay.rs` | Indirect — update-state changes should trigger relay events to connected frontends |