# Spike 814: Chat-Driven Update Command for Multi-Project Gateway

## 1. Problem Statement

In a multi-project gateway deployment (Docker Compose or similar), each project runs as its own container.
Today, updating a project container requires direct operator access to the host: `docker pull`, `docker compose up -d <project>`, or equivalent.
There is no way to trigger an update from chat.

This spike designs an `update` bot command that:
- Can be typed in the Matrix/Slack/Discord chat room.
- Pulls the latest image (or rebuilds from source) for one or all project containers managed by the gateway.
- Reports progress and outcome back to the room.
- Supports rollback when a container fails to start cleanly.

---

## 2. Command Surface

### Basic syntax

```
update [<project>|all] [--rollback]
```

| Invocation | Effect |
|-----------|--------|
| `update huskies` | Update and restart the `huskies` container. |
| `update all` | Update every registered project container, one at a time. |
| `update` (no args) | Same as `update all`. |
| `update huskies --rollback` | Roll back `huskies` to its previous image tag. |

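The syntax above is small enough to parse by hand. The following is a minimal sketch, not the project's actual parser: the names `Target`, `UpdateCommand`, and `parse_update` are hypothetical, and the handling of unknown flags and extra arguments is an assumption, since the spike does not define it.

```rust
/// Target of an `update` command (illustrative names, not from the codebase).
#[derive(Debug, PartialEq)]
pub enum Target {
    All,
    Project(String),
}

#[derive(Debug, PartialEq)]
pub struct UpdateCommand {
    pub target: Target,
    pub rollback: bool,
}

/// Parse `update [<project>|all] [--rollback]`.
/// Returns `None` when the message is not an `update` command at all,
/// and `Some(Err(..))` when it is one but is malformed.
pub fn parse_update(msg: &str) -> Option<Result<UpdateCommand, String>> {
    let mut words = msg.split_whitespace();
    if words.next()? != "update" {
        return None;
    }
    let mut target = Target::All; // bare `update` behaves like `update all`
    let mut seen_target = false;
    let mut rollback = false;
    for word in words {
        match word {
            "--rollback" => rollback = true,
            // Assumption: any other flag is rejected rather than ignored.
            flag if flag.starts_with("--") => {
                return Some(Err(format!("Unknown flag '{flag}'")));
            }
            _ if seen_target => {
                return Some(Err("Expected at most one project name".to_string()));
            }
            "all" => seen_target = true,
            name => {
                target = Target::Project(name.to_string());
                seen_target = true;
            }
        }
    }
    Some(Ok(UpdateCommand { target, rollback }))
}
```

Returning `Option<Result<..>>` lets the caller distinguish "not my command" from "my command, but wrong", so that only the latter produces an error reply in the room.
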
### Progress feedback

The bot posts incremental updates to the room (editing the same message where the platform supports it):

```
[huskies] Pulling image… ⏳
[huskies] Image pulled (sha256:abc123). Stopping container…
[huskies] Container stopped. Starting new container…
[huskies] Health check passed ✅ (2 s)
```

On failure:
```
[huskies] Health check failed after 30 s ❌
[huskies] Rolling back to previous image (sha256:def456)…
[huskies] Rollback complete ✅
```

### Error cases

| Condition | Response |
|-----------|----------|
| Unknown project name | `Unknown project 'foo'. Known projects: huskies, robot-studio` |
| No Docker socket access | `Update not available: Docker socket not mounted` |
| Rollback with no previous image | `No previous image recorded for 'huskies'; cannot roll back` |
| Project container not managed by Docker | `'huskies' is not a container-managed project; rebuild it manually` |

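One way to keep these replies consistent is to derive them from a single error type. The sketch below assumes a hypothetical `UpdateError` enum (not an existing type in the codebase); only the message texts come from the table above.

```rust
use std::fmt;

/// Hypothetical error type covering the four error cases in the table.
#[derive(Debug)]
pub enum UpdateError {
    UnknownProject { name: String, known: Vec<String> },
    DockerUnavailable,
    NoPreviousImage { project: String },
    NotContainerManaged { project: String },
}

impl fmt::Display for UpdateError {
    /// Render the exact chat reply for each error case.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            UpdateError::UnknownProject { name, known } => write!(
                f,
                "Unknown project '{}'. Known projects: {}",
                name,
                known.join(", ")
            ),
            UpdateError::DockerUnavailable => {
                write!(f, "Update not available: Docker socket not mounted")
            }
            UpdateError::NoPreviousImage { project } => write!(
                f,
                "No previous image recorded for '{}'; cannot roll back",
                project
            ),
            UpdateError::NotContainerManaged { project } => write!(
                f,
                "'{}' is not a container-managed project; rebuild it manually",
                project
            ),
        }
    }
}
```
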
---

## 3. Auth

### 3.1 Threat model

The update command triggers container replacement, a privileged operation equivalent to `docker compose up -d`.
An unauthenticated attacker who can send a message to the bot room could force a rolling restart or roll back a working container.

### 3.2 Proposed approach: room + role guard

**Layer 1: Room restriction.**
The update command is only accepted in a designated *ops room*, configured in `bot.toml` (or `projects.toml`):

```toml
[gateway.ops_room]
room_id = "!abc123:homeserver.example.com"
```

Messages from other rooms are rejected with: `The update command is only available in the ops room.`

**Layer 2: Sender role check (Matrix/Slack).**
The bot checks the sender's power level (Matrix) or admin status (Slack/Discord).
Only users with power level ≥ 50 (moderator) on Matrix, or workspace admin on Slack, may issue `update`; the equivalent Discord permission is not yet defined in this spike.
Unapproved senders receive: `You do not have permission to issue update commands.`

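Layers 1 and 2 combine into a single check that runs before any Docker call. The sketch below is illustrative only: `SenderRole`, `Denied`, and `authorize` are hypothetical names, how the power level or admin flag is fetched from each platform is out of scope, and Discord is omitted because no threshold is specified for it.

```rust
/// Why a request was refused (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Denied {
    WrongRoom,
    InsufficientRole,
}

/// Platform-specific view of the sender, already looked up by the transport.
pub enum SenderRole {
    MatrixPowerLevel(i64),
    SlackWorkspaceAdmin(bool),
}

/// Moderator threshold on Matrix, as stated above.
pub const MIN_MATRIX_POWER_LEVEL: i64 = 50;

/// Layer 1 (room restriction) followed by layer 2 (sender role).
pub fn authorize(ops_room_id: &str, room_id: &str, role: &SenderRole) -> Result<(), Denied> {
    if room_id != ops_room_id {
        return Err(Denied::WrongRoom);
    }
    let allowed = match role {
        SenderRole::MatrixPowerLevel(level) => *level >= MIN_MATRIX_POWER_LEVEL,
        SenderRole::SlackWorkspaceAdmin(is_admin) => *is_admin,
    };
    if allowed {
        Ok(())
    } else {
        Err(Denied::InsufficientRole)
    }
}

impl Denied {
    /// Chat reply for each refusal, using the texts from this section.
    pub fn message(&self) -> &'static str {
        match self {
            Denied::WrongRoom => "The update command is only available in the ops room.",
            Denied::InsufficientRole => "You do not have permission to issue update commands.",
        }
    }
}
```

Checking the room first means a sender outside the ops room learns nothing about whether their role would have been sufficient.
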
**Layer 3: Confirmation prompt for destructive operations.**
`update all` (or a bare `update`) affects every project, so the bot first responds with a confirmation challenge:

```
This will restart all 3 project containers. Reply `yes` within 60 s to confirm, or `no` to cancel.
```

Single-project updates (`update huskies`) do **not** require confirmation because they are already scoped to one project.

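A pending challenge can be modelled as a small value with a deadline. This is a sketch under assumptions: `PendingConfirmation` and `ConfirmOutcome` are hypothetical names, and the rule that only the original requester may confirm is an assumption that the spike does not state.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum ConfirmOutcome {
    Confirmed,
    Cancelled,
    Expired,
    /// Reply from someone else, or text that is neither `yes` nor `no`.
    Ignored,
}

/// One outstanding `update all` challenge (hypothetical type).
pub struct PendingConfirmation {
    requested_by: String,
    deadline: Instant,
}

impl PendingConfirmation {
    /// `ttl` would be 60 s by default.
    pub fn new(requested_by: &str, now: Instant, ttl: Duration) -> Self {
        PendingConfirmation {
            requested_by: requested_by.to_string(),
            deadline: now + ttl,
        }
    }

    /// Classify a reply that arrives at time `now`.
    pub fn resolve(&self, sender: &str, reply: &str, now: Instant) -> ConfirmOutcome {
        if now > self.deadline {
            return ConfirmOutcome::Expired;
        }
        // Assumption: only the user who issued `update all` may answer.
        if sender != self.requested_by {
            return ConfirmOutcome::Ignored;
        }
        match reply.trim().to_ascii_lowercase().as_str() {
            "yes" => ConfirmOutcome::Confirmed,
            "no" => ConfirmOutcome::Cancelled,
            _ => ConfirmOutcome::Ignored,
        }
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside `resolve`, keeps the expiry logic testable without sleeping.
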
### 3.3 Future: Ed25519 operator token

When story 665 (Ed25519 auth) lands, the gateway's node identity keypair can sign an operator token.
The bot verifies the token against the node's public key before acting.
This removes the room/role dependency and allows the command to be issued programmatically (e.g. from a CI pipeline via MCP).

For now, the room + role guard is sufficient.

---

## 4. Rollout Approach

### 4.1 Docker-managed containers (primary path)

The gateway process has access to the Docker socket (mounted as a volume at `/var/run/docker.sock`).
The update sequence for a single project:

1. **Record current image**: read the running container's image digest and store it in the gateway's `update_history` LWW-map in the CRDT, keyed by project name.
2. **Pull new image**: `docker pull <image>` (or the compose-file equivalent tag).
3. **Drain connections**: the gateway marks the project as `updating`; new proxy requests return 503 with a `Retry-After: 5` header; in-flight requests are allowed to complete (30 s grace window).
4. **Stop old container**: `docker stop --time=30 <container_name>`.
5. **Start new container**: recreate the container from the newly pulled image, e.g. `docker compose up -d <service>`. A plain `docker start <container_name>` is not enough, because it restarts the existing container with the image it was created from.
6. **Health check**: poll the project's `/health` endpoint until 200 OK or 30 s timeout.
7. **Restore routing**: remove the `updating` flag; the proxy resumes normal operation.

Steps 1–7 are serialised per project. When `update all` is used, projects are updated **one at a time** (not in parallel) to limit blast radius.

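The ordering of the seven steps can be sketched as a single function over an abstract runtime. Everything here is hypothetical: the `Runtime` trait, `Outcome`, `update_project`, and the `FakeRuntime` test double are illustrative stand-ins for the real Docker and proxy calls. Error handling is deliberately simplified (a failure in step 4 or 5 leaves the project drained).

```rust
/// Hypothetical abstraction over the Docker and proxy calls used in steps 1-7.
pub trait Runtime {
    fn running_digest(&mut self, project: &str) -> Result<String, String>;
    fn pull(&mut self, project: &str) -> Result<String, String>;
    fn set_draining(&mut self, project: &str, draining: bool);
    fn stop(&mut self, project: &str) -> Result<(), String>;
    fn start(&mut self, project: &str, digest: &str) -> Result<(), String>;
    fn wait_healthy(&mut self, project: &str) -> bool;
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Updated { previous: String, current: String },
    /// Routing stays drained so the caller can roll back to `previous`.
    Unhealthy { previous: String },
}

/// Run steps 1-7 for one project, strictly in order.
pub fn update_project<R: Runtime>(rt: &mut R, project: &str) -> Result<Outcome, String> {
    let previous = rt.running_digest(project)?; // 1. record current image
    let current = rt.pull(project)?; // 2. pull new image
    rt.set_draining(project, true); // 3. drain connections
    rt.stop(project)?; // 4. stop old container
    rt.start(project, &current)?; // 5. start new container
    if !rt.wait_healthy(project) {
        // 6. health check failed: hand over to rollback
        return Ok(Outcome::Unhealthy { previous });
    }
    rt.set_draining(project, false); // 7. restore routing
    Ok(Outcome::Updated { previous, current })
}

/// In-memory test double that records every call it receives.
pub struct FakeRuntime {
    pub calls: Vec<String>,
    pub healthy: bool,
}

impl Runtime for FakeRuntime {
    fn running_digest(&mut self, p: &str) -> Result<String, String> {
        self.calls.push(format!("digest {p}"));
        Ok("sha256:def456".to_string())
    }
    fn pull(&mut self, p: &str) -> Result<String, String> {
        self.calls.push(format!("pull {p}"));
        Ok("sha256:abc123".to_string())
    }
    fn set_draining(&mut self, p: &str, draining: bool) {
        self.calls.push(format!("drain {p} {draining}"));
    }
    fn stop(&mut self, p: &str) -> Result<(), String> {
        self.calls.push(format!("stop {p}"));
        Ok(())
    }
    fn start(&mut self, p: &str, digest: &str) -> Result<(), String> {
        self.calls.push(format!("start {p} {digest}"));
        Ok(())
    }
    fn wait_healthy(&mut self, p: &str) -> bool {
        self.calls.push(format!("health {p}"));
        self.healthy
    }
}
```

Putting the Docker calls behind a trait keeps the sequencing logic testable without a Docker socket. An `update all` implementation would call `update_project` in a plain loop, which gives the one-at-a-time behaviour for free.
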
### 4.2 Source-rebuild path (non-Docker / dev mode)

When Docker is not available (the gateway binary is running directly on the host, not in a container),
the update command falls back to the existing `rebuild_and_restart` flow (`server/src/rebuild.rs`):
`cargo build` → re-exec.
This path cannot update individual projects independently; it rebuilds the gateway itself.

### 4.3 Gateway state during update

```
normal → updating → (success) normal
                  → (failure) rolling_back → normal
```

The CRDT `gateway_config` collection gains four new LWW fields per project:

| Field | Type | Purpose |
|-------|------|---------|
| `update_state` | `"idle" \| "updating" \| "rolling_back" \| "error"` | Current update lifecycle stage (`"error"` is set when a rollback also fails, see §5.1) |
| `update_started_at` | `u64` (unix ms) | When the update was triggered |
| `previous_image` | `string` | Image digest before the most recent update |
| `current_image` | `string` | Image digest currently running |

These fields are replicated to all nodes so that other gateway instances and headless agents
can observe update progress without polling HTTP.

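The lifecycle above, together with the `"error"` state from §5.1, can be written as an explicit transition function. This is a sketch with hypothetical names (`UpdateState`, `Event`, `next`); it assumes that the `normal` state in the diagram corresponds to `update_state = "idle"`, and that any transition not listed is rejected.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum UpdateState {
    Idle,
    Updating,
    RollingBack,
    Error,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Event {
    UpdateRequested,
    RollbackRequested,
    HealthCheckPassed,
    HealthCheckFailed,
}

impl UpdateState {
    /// String form stored in the `update_state` field.
    pub fn as_str(self) -> &'static str {
        match self {
            UpdateState::Idle => "idle",
            UpdateState::Updating => "updating",
            UpdateState::RollingBack => "rolling_back",
            UpdateState::Error => "error",
        }
    }

    /// Next state for an event, or `None` if the event is not allowed here
    /// (for example a second update while one is already running).
    pub fn next(self, event: Event) -> Option<UpdateState> {
        use Event::*;
        use UpdateState::*;
        match (self, event) {
            (Idle, UpdateRequested) => Some(Updating),
            (Idle, RollbackRequested) => Some(RollingBack),
            (Updating, HealthCheckPassed) => Some(Idle),
            (Updating, HealthCheckFailed) => Some(RollingBack),
            (RollingBack, HealthCheckPassed) => Some(Idle),
            (RollingBack, HealthCheckFailed) => Some(Error),
            _ => None,
        }
    }
}
```
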
---

## 5. Rollback Approach

### 5.1 Automatic rollback

If the health check in step 6 (§4.1) times out or returns a non-200 status, the gateway automatically:

1. Logs the failure: `[update] health check failed for huskies after 30 s`.
2. Posts to the ops room: `Health check failed. Rolling back…`.
3. Runs `docker stop` on the new container.
4. Pulls the previous image by digest (stored in `previous_image`) and starts a container from it.
5. Re-runs the health check on the rolled-back container.
6. Reports outcome to the room.

If the rollback health check also fails, the bot reports:
```
Rollback failed. Manual intervention required. Previous image: sha256:def456
```
and sets `update_state = "error"` in the CRDT. The ops room is notified; no further automatic action is taken.

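Both the update and the rollback depend on the same bounded health poll. A minimal sketch, assuming a hypothetical `poll_until_healthy` helper: the probe is passed in as a closure (in the gateway it would issue the HTTP request to `/health` and return `true` on 200 OK), and the polling interval is an assumed parameter that the spike does not specify.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Call `probe` repeatedly until it returns `true` or `timeout` elapses.
/// Returns whether the service became healthy in time.
pub fn poll_until_healthy<F: FnMut() -> bool>(
    mut probe: F,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return true;
        }
        // Give up if the next attempt would start after the deadline.
        if Instant::now() + interval > deadline {
            return false;
        }
        sleep(interval);
    }
}
```

With the 30 s timeout from §4.1, a call might look like `poll_until_healthy(check, Duration::from_secs(30), Duration::from_secs(1))`, where `check` is the caller's probe.
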
### 5.2 Manual rollback

An operator can issue `update huskies --rollback` at any time when the project is in `idle` state.
The command replays steps 3–7 of §4.1 with `previous_image` substituted for the target image.
`previous_image` is overwritten with the image that was displaced, so repeated rollbacks alternate between two images.

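The alternating behaviour follows directly from swapping the two digests on every rollback. A small sketch, assuming a hypothetical `ImageRecord` struct that mirrors the `current_image` and `previous_image` fields:

```rust
/// Per-project image bookkeeping (hypothetical type mirroring the CRDT fields).
#[derive(Debug, PartialEq)]
pub struct ImageRecord {
    pub current_image: String,
    pub previous_image: Option<String>,
}

impl ImageRecord {
    /// Record a successful update: the old current image becomes the previous one.
    pub fn record_update(&mut self, new_digest: &str) {
        let displaced = std::mem::replace(&mut self.current_image, new_digest.to_string());
        self.previous_image = Some(displaced);
    }

    /// Swap current and previous images; fails if no previous image is recorded.
    pub fn roll_back(&mut self) -> Result<(), String> {
        let target = self
            .previous_image
            .take()
            .ok_or_else(|| "No previous image recorded; cannot roll back".to_string())?;
        let displaced = std::mem::replace(&mut self.current_image, target);
        self.previous_image = Some(displaced);
        Ok(())
    }
}
```

Calling `roll_back` twice in a row returns the record to where it started, which is the "alternate between two images" behaviour described above.
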
### 5.3 Rollback unavailability

Rollback is unavailable when:
- No `previous_image` is recorded (first-ever update on this installation).
- `update_state` is already `"updating"` or `"rolling_back"` (only one concurrent update per project).

---

## 6. Implementation Sketch

### 6.1 New files

| Path | Purpose |
|------|---------|
| `server/src/chat/commands/update.rs` | Synchronous `handle_update` stub; returns `None` because the command is dispatched asynchronously, like `rebuild` |
| `server/src/service/gateway/update.rs` | Core update/rollback logic; calls Docker API or falls back to `rebuild.rs` |
| `server/src/service/gateway/docker.rs` | Thin wrapper around Docker socket HTTP API (`/containers/:id/start` etc.) |

### 6.2 New CRDT fields

Extend the `gateway_config` CRDT document (already exists per Spike 679 §6) with:
- `projects.<name>.update_state` (LWW string)
- `projects.<name>.update_started_at` (LWW u64)
- `projects.<name>.previous_image` (LWW string)
- `projects.<name>.current_image` (LWW string)

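For readers unfamiliar with LWW (last-writer-wins) fields, the merge rule can be sketched generically. This illustrates LWW semantics in general and is not the project's CRDT implementation: the `Lww` type, the millisecond timestamp, and the tie-break on `node_id` are all assumptions.

```rust
/// A generic last-writer-wins register (illustrative only).
#[derive(Clone, Debug, PartialEq)]
pub struct Lww<T> {
    pub value: T,
    pub timestamp_ms: u64,
    /// Assumed tie-breaker so that concurrent writes merge deterministically.
    pub node_id: u32,
}

impl<T: Clone> Lww<T> {
    /// Keep whichever write has the larger (timestamp, node_id) pair.
    /// The result is the same regardless of merge order.
    pub fn merge(&self, other: &Self) -> Self {
        if (other.timestamp_ms, other.node_id) > (self.timestamp_ms, self.node_id) {
            other.clone()
        } else {
            self.clone()
        }
    }
}
```

Because the merge is commutative, two gateway nodes that exchange their `update_state` writes in either order converge on the same value.
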
### 6.3 Gateway HTTP changes

Add one endpoint that reports whether updates are available and which path (Docker or the rebuild fallback) would be used:

```
GET /gateway/update/available
→ {"available": true, "mode": "docker"} | {"available": true, "mode": "rebuild"} | {"available": false}
```

The frontend can use this to show/hide an "Update" button in the gateway project list.

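The three response shapes map naturally onto an enum. A minimal sketch, assuming a hypothetical `Availability` type and `detect` function; the two boolean inputs, and the rule that `{"available": false}` is returned when neither path works, are assumptions. A real handler would more likely serialise through the server's existing JSON layer than return string literals.

```rust
/// Which update path the gateway can offer (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Availability {
    Docker,
    Rebuild,
    Unavailable,
}

/// Prefer Docker; fall back to rebuild; otherwise report unavailable.
pub fn detect(docker_socket_mounted: bool, rebuild_supported: bool) -> Availability {
    if docker_socket_mounted {
        Availability::Docker
    } else if rebuild_supported {
        Availability::Rebuild
    } else {
        Availability::Unavailable
    }
}

impl Availability {
    /// Response body for `GET /gateway/update/available`.
    pub fn to_json(&self) -> &'static str {
        match self {
            Availability::Docker => r#"{"available": true, "mode": "docker"}"#,
            Availability::Rebuild => r#"{"available": true, "mode": "rebuild"}"#,
            Availability::Unavailable => r#"{"available": false}"#,
        }
    }
}
```
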
### 6.4 Async dispatch

`update` is an async command (like `rebuild`, `htop`, `start`).
The command keyword is detected in `on_room_message` before `try_handle_command` is invoked.
The handler spawns a background task with `tokio::spawn`, posts incremental updates via the existing transport's `send_message` / `edit_message` API, and returns.

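The shape of this dispatch (detect the keyword, hand the work to a background task, return immediately, stream progress back) can be illustrated with the standard library alone. In the sketch below, `std::thread` and an `mpsc` channel stand in for `tokio::spawn` and the transport's `send_message` / `edit_message` calls. The function name `dispatch_update` and the fixed stage messages are hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Returns `None` if `msg` is not an `update` command. Otherwise starts the
/// work in the background and returns a receiver of progress messages.
pub fn dispatch_update(msg: &str) -> Option<Receiver<String>> {
    let mut words = msg.split_whitespace();
    if words.next() != Some("update") {
        return None; // fall through to the synchronous command path
    }
    // First non-flag argument is the project; default to `all`.
    let project = words
        .find(|w| !w.starts_with("--"))
        .unwrap_or("all")
        .to_string();
    let (tx, rx) = channel();
    // Stand-in for `tokio::spawn`: the caller is not blocked by the update.
    thread::spawn(move || {
        // Placeholder stages; the real task would run the rollout sequence.
        for stage in ["Pulling image", "Starting new container", "Health check passed"] {
            if tx.send(format!("[{project}] {stage}")).is_err() {
                break; // receiver dropped, nobody is listening any more
            }
        }
    });
    Some(rx)
}
```
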
---

## 7. Open Questions

| # | Question | Notes |
|---|----------|-------|
| 1 | Should the Docker socket be mounted in the gateway container by default? | Security trade-off: socket access = container escape risk. Alternative: `docker exec` via a sidecar. |
| 2 | Should `update all` use a sequential or rolling strategy? | Sequential is safer; rolling is faster. Sequential chosen for v1. |
| 3 | How do we handle projects not managed by Docker (e.g. running on bare metal)? | Fallback to `rebuild` covers the gateway itself; project-specific fallback is out of scope for v1. |
| 4 | Should the confirmation challenge expire? | Yes: 60 s timeout, configurable in `bot.toml`. |
| 5 | Should update history be persisted beyond CRDT (i.e. across full gateway restarts)? | CRDT persists to SQLite, so yes, as long as the CRDT DB survives the restart. |
| 6 | Multi-gateway HA: which node triggers the actual Docker call? | The node that owns the Docker socket. CRDT `update_state` prevents double-triggering. |

---

## 8. Dependencies

| Story / Spike | Dependency type |
|--------------|----------------|
| Spike 679 (HTTP → CRDT bus) | Soft: `gateway_config` LWW collection needed for update state; can stub without it |
| Story 665 (Ed25519 auth) | Soft: operator token auth is a future hardening step; room+role guard suffices for v1 |
| `server/src/rebuild.rs` | Direct: reuse `rebuild_and_restart` for the non-Docker path |
| `server/src/gateway_relay.rs` | Indirect: update state changes should trigger relay events to connected frontends |