diff --git a/.huskies/specs/tech/SPIKE_GATEWAY_UPDATE_COMMAND.md b/.huskies/specs/tech/SPIKE_GATEWAY_UPDATE_COMMAND.md
new file mode 100644
index 00000000..614f0e55
--- /dev/null
+++ b/.huskies/specs/tech/SPIKE_GATEWAY_UPDATE_COMMAND.md
@@ -0,0 +1,241 @@

# Spike 814: Chat-Driven Update Command for Multi-Project Gateway

## 1. Problem Statement

In a multi-project gateway deployment (Docker Compose or similar), each project runs as its own container.
Today, updating a project container requires direct operator access to the host — `docker pull`, `docker compose up -d <project>`, or equivalent.
There is no way to trigger an update from chat.

This spike designs an `update` bot command that:
- Can be typed in the Matrix/Slack/Discord chat room.
- Pulls the latest image (or rebuilds from source) for one or all project containers managed by the gateway.
- Reports progress and outcome back to the room.
- Supports rollback when a container fails to start cleanly.

---

## 2. Command Surface

### Basic syntax

```
update [<project>|all] [--rollback]
```

| Invocation | Effect |
|-----------|--------|
| `update huskies` | Update and restart the `huskies` container. |
| `update all` | Update every registered project container, one at a time. |
| `update` (no args) | Same as `update all`. |
| `update huskies --rollback` | Roll back `huskies` to its previous image tag. |

### Progress feedback

The bot posts incremental updates to the room (editing the same message where the platform supports it):

```
[huskies] Pulling image… ⏳
[huskies] Image pulled (sha256:abc123). Stopping container…
[huskies] Container stopped.
[huskies] Starting new container…
[huskies] Health check passed ✅ (2 s)
```

On failure:

```
[huskies] Health check failed after 30 s ❌
[huskies] Rolling back to previous image (sha256:def456)…
[huskies] Rollback complete ✅
```

### Error cases

| Condition | Response |
|-----------|----------|
| Unknown project name | `Unknown project 'foo'. Known projects: huskies, robot-studio` |
| No Docker socket access | `Update not available: Docker socket not mounted` |
| Rollback with no previous image | `No previous image recorded for 'huskies'; cannot roll back` |
| Project container not managed by Docker | `'huskies' is not a container-managed project; rebuild it manually` |

---

## 3. Auth

### 3.1 Threat model

The update command triggers container replacement — a privileged operation equivalent to `docker compose up -d`.
An unauthenticated attacker who can send a message to the bot room could force a rolling restart or roll back a working container.

### 3.2 Proposed approach: room + role guard

**Layer 1 — Room restriction.**
The update command is only accepted in a designated *ops room*, configured in `bot.toml` (or `projects.toml`):

```toml
[gateway.ops_room]
room_id = "!abc123:homeserver.example.com"
```

Messages from other rooms are rejected with: `The update command is only available in the ops room.`

**Layer 2 — Sender role check (Matrix/Slack/Discord).**
The bot checks the sender's power level (Matrix) or admin status (Slack/Discord).
Only users with power level ≥ 50 (moderator) on Matrix, or workspace/server admin on Slack/Discord, may issue `update`.
Unapproved senders receive: `You do not have permission to issue update commands.`

**Layer 3 — Confirmation prompt for destructive operations.**
`update all` affects every project.
The bot responds with a confirmation challenge:

```
This will restart all 3 project containers. Reply `yes` within 60 s to confirm, or `no` to cancel.
```

Single-project updates (`update huskies`) do **not** require confirmation — they are already scoped.

### 3.3 Future: Ed25519 operator token

When story 665 (Ed25519 auth) lands, the gateway's node identity keypair can sign an operator token.
The bot verifies the token against the node's public key before acting.
This removes the room/role dependency and allows the command to be issued programmatically
(e.g. from a CI pipeline via MCP).

For now the room + role guard is sufficient.

---

## 4. Rollout Approach

### 4.1 Docker-managed containers (primary path)

The gateway process has access to the Docker socket (mounted as a volume at `/var/run/docker.sock`).
The update sequence for a single project:

1. **Record current image** — read the running container's image digest (store it in the gateway's `update_history` LWW-map in the CRDT, keyed by project name).
2. **Pull new image** — `docker pull <image>` (or the compose-file equivalent tag).
3. **Drain connections** — the gateway marks the project as `updating`; new proxy requests return 503 with a `Retry-After: 5` header; in-flight requests are allowed to complete (30 s grace window).
4. **Stop old container** — `docker stop --time=30 <container>`.
5. **Start new container** — recreate the container from the new image (`docker compose up -d <project>`, or the Docker API equivalent of create + start; plain `docker start` would reuse the old image).
6. **Health check** — poll the project's `/health` endpoint until 200 OK or a 30 s timeout.
7. **Restore routing** — remove the `updating` flag; the proxy resumes normal operation.

Steps 1–7 are serialised per project. When `update all` is used, projects are updated **one at a time** (not in parallel) to limit blast radius.

### 4.2 Source-rebuild path (non-Docker / dev mode)

When Docker is not available (the gateway binary is running directly on the host, not in a container),
the update command falls back to the existing `rebuild_and_restart` flow (`server/src/rebuild.rs`):
`cargo build` → re-exec.
This path cannot update individual projects independently — it rebuilds the gateway itself.
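The per-project sequence of §4.1, together with the automatic rollback of §5.1, can be sketched as plain state logic. This is a hedged illustration rather than the project's actual code: `ContainerOps`, `UpdateRecord`, and `update_project` are assumed names, and the Docker calls are abstracted behind a trait so the ordering can be exercised without a daemon.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum UpdateOutcome {
    Updated,
    RolledBack,
    Error,
}

pub struct UpdateRecord {
    pub previous_image: Option<String>,
    pub current_image: String,
}

/// Container operations are injected so the sequencing logic is testable;
/// the real implementation would talk to the Docker socket.
pub trait ContainerOps {
    fn pull(&mut self, image: &str) -> Result<String, String>; // returns digest
    fn stop(&mut self, project: &str);
    fn start(&mut self, project: &str, digest: &str) -> Result<(), String>;
    fn healthy(&mut self, project: &str) -> bool; // poll /health, 30 s cap
}

pub fn update_project(
    ops: &mut dyn ContainerOps,
    history: &mut HashMap<String, UpdateRecord>,
    project: &str,
    target_image: &str,
) -> UpdateOutcome {
    // Step 1: record the digest currently running (None on first-ever update).
    let old_digest = history.get(project).map(|r| r.current_image.clone());
    // Step 2: pull the new image.
    let new_digest = match ops.pull(target_image) {
        Ok(d) => d,
        Err(_) => return UpdateOutcome::Error,
    };
    // Steps 4–6: stop the old container, start from the new image, health-check.
    ops.stop(project);
    if ops.start(project, &new_digest).is_ok() && ops.healthy(project) {
        history.insert(
            project.to_string(),
            UpdateRecord { previous_image: old_digest, current_image: new_digest },
        );
        return UpdateOutcome::Updated;
    }
    // §5.1: health check failed — roll back to the recorded previous digest.
    if let Some(prev) = old_digest {
        ops.stop(project);
        if ops.start(project, &prev).is_ok() && ops.healthy(project) {
            return UpdateOutcome::RolledBack;
        }
    }
    // No previous image, or the rollback also failed: manual intervention.
    UpdateOutcome::Error
}
```

With this shape, `update all` reduces to a serial loop over registered projects, matching the one-at-a-time decision recorded for v1.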

### 4.3 Gateway state during update

```
normal → updating → (success) normal
                  → (failure) rolling_back → (success) normal
                                           → (failure) error
```

The CRDT `gateway_config` collection gains four new LWW fields per project:

| Field | Type | Purpose |
|-------|------|---------|
| `update_state` | `"idle" \| "updating" \| "rolling_back" \| "error"` | Current update lifecycle stage |
| `update_started_at` | `u64` (unix ms) | When the update was triggered |
| `previous_image` | `string` | Image digest before the most recent update |
| `current_image` | `string` | Image digest currently running |

These fields are replicated to all nodes so that other gateway instances and headless agents
can observe update progress without polling HTTP.

---

## 5. Rollback Approach

### 5.1 Automatic rollback

If the health check in step 6 (§4.1) times out or returns a non-200 status, the gateway automatically:

1. Logs the failure: `[update] health check failed for huskies after 30 s`.
2. Posts to the ops room: `Health check failed. Rolling back…`.
3. Runs `docker stop` on the new container.
4. Pulls and starts the previous image digest (stored in `previous_image`).
5. Re-runs the health check on the rolled-back container.
6. Reports the outcome to the room.

If the rollback health check also fails, the bot reports:
```
Rollback failed. Manual intervention required. Previous image: sha256:def456
```
and sets `update_state = "error"` in the CRDT. The ops room is notified; no further automatic action is taken.

### 5.2 Manual rollback

An operator can issue `update huskies --rollback` at any time when the project is in the `idle` state.
The command replays steps 3–7 of §4.1 with `previous_image` substituted for the target image.
`previous_image` is overwritten with the image that was displaced, so repeated rollbacks alternate between two images.

### 5.3 Rollback unavailability

Rollback is unavailable when:
- No `previous_image` is recorded (first-ever update on this installation).
- `update_state` is already `"updating"` or `"rolling_back"` (only one concurrent update per project).

---

## 6. Implementation Sketch

### 6.1 New files

| Path | Purpose |
|------|---------|
| `server/src/chat/commands/update.rs` | Command entry point; the synchronous `handle_update` stub returns `None` and the real work runs asynchronously (same pattern as `rebuild`) |
| `server/src/service/gateway/update.rs` | Core update/rollback logic; calls the Docker API or falls back to `rebuild.rs` |
| `server/src/service/gateway/docker.rs` | Thin wrapper around the Docker socket HTTP API (`/containers/:id/start` etc.) |

### 6.2 New CRDT fields

Extend the `gateway_config` CRDT document (which already exists per Spike 679 §6) with:
- `projects.<project>.update_state` (LWW string)
- `projects.<project>.update_started_at` (LWW u64)
- `projects.<project>.previous_image` (LWW string)
- `projects.<project>.current_image` (LWW string)

### 6.3 Gateway HTTP changes

Add one endpoint for the Docker-fallback check:

```
GET /gateway/update/available
→ {"available": true, "mode": "docker"} | {"available": true, "mode": "rebuild"} | {"available": false}
```

The frontend can use this to show or hide an "Update" button in the gateway project list.

### 6.4 Async dispatch

`update` is an async command (like `rebuild`, `htop`, `start`).
The command keyword is detected in `on_room_message` before `try_handle_command` is invoked.
The handler spawns a `tokio::spawn` task, posts incremental updates via the existing transport's `send_message` / `edit_message` API, and returns.

---

## 7. Open Questions

| # | Question | Notes |
|---|----------|-------|
| 1 | Should the Docker socket be mounted in the gateway container by default? | Security trade-off: socket access = container-escape risk. Alternative: `docker exec` via a sidecar. |
| 2 | Should `update all` use a sequential or rolling strategy? | Sequential is safer; rolling is faster. Sequential chosen for v1. |
| 3 | How do we handle projects not managed by Docker (e.g. running on bare metal)? | Fallback to `rebuild` covers the gateway itself; project-specific fallback is out of scope for v1. |
| 4 | Should the confirmation challenge expire? | Yes — 60 s timeout, configurable in `bot.toml`. |
| 5 | Should update history be persisted beyond the CRDT (i.e. across full gateway restarts)? | The CRDT persists to SQLite, so yes, as long as the CRDT DB survives the restart. |
| 6 | Multi-gateway HA: which node triggers the actual Docker call? | The node that owns the Docker socket. The CRDT `update_state` prevents double-triggering. |

---

## 8. Dependencies

| Story / Spike | Dependency type |
|--------------|----------------|
| Spike 679 (HTTP → CRDT bus) | Soft — the `gateway_config` LWW collection is needed for update state; can stub without it |
| Story 665 (Ed25519 auth) | Soft — operator-token auth is a future hardening step; the room + role guard suffices for v1 |
| `server/src/rebuild.rs` | Direct — reuse `rebuild_and_restart` for the non-Docker path |
| `server/src/gateway_relay.rs` | Indirect — update-state changes should trigger relay events to connected frontends |
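
As a companion to the §6.1 file list: the Docker Engine API speaks plain HTTP/1.1 over the mounted unix socket, so the `docker.rs` wrapper can stay std-only. A minimal hedged sketch; `build_request` and `docker_call` are hypothetical names, not existing code:

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Format a minimal HTTP/1.1 request for the Docker Engine API.
/// Endpoints such as `POST /containers/:id/start` take an empty body.
pub fn build_request(method: &str, path: &str) -> String {
    format!(
        "{} {} HTTP/1.1\r\nHost: docker\r\nContent-Length: 0\r\nConnection: close\r\n\r\n",
        method, path
    )
}

/// Send one request over the mounted socket and return the raw response
/// (status line + headers + body); callers parse the status code.
pub fn docker_call(socket_path: &str, method: &str, path: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket_path)?;
    stream.write_all(build_request(method, path).as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

For example, `docker_call("/var/run/docker.sock", "POST", "/containers/huskies/stop")`. A production version would pin an API version prefix (e.g. `/v1.43/...`) and parse the status line rather than returning raw text.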