huskies/.huskies/specs/tech/SPIKE_GATEWAY_UPDATE_COMMAND.md

# Spike 814: Chat-Driven Update Command for Multi-Project Gateway

## 1. Problem Statement

In a multi-project gateway deployment (Docker Compose or similar), each project runs as its own container.
Today, updating a project container requires direct operator access to the host — `docker pull`, `docker compose up -d <project>`, or equivalent.
There is no way to trigger an update from chat.

This spike designs an `update` bot command that:
- Can be typed in the Matrix/Slack/Discord chat room.
- Pulls the latest image (or rebuilds from source) for one or all project containers managed by the gateway.
- Reports progress and outcome back to the room.
- Supports rollback when a container fails to start cleanly.

---

## 2. Command Surface

### Basic syntax

```
update [<project>|all] [--rollback]
```

| Invocation | Effect |
|-----------|--------|
| `update huskies` | Update and restart the `huskies` container. |
| `update all` | Update every registered project container, one at a time. |
| `update` (no args) | Same as `update all`. |
| `update huskies --rollback` | Roll back `huskies` to its previous image tag. |
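
The grammar above is small enough to parse by hand. The following is a minimal sketch, not the existing command parser: `Target`, `UpdateCommand`, and `parse_update` are hypothetical names, and the error strings for malformed input are assumptions. The spec does not say whether `update all --rollback` should be rejected, so the sketch accepts it.

```rust
/// Which containers the command addresses (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Target {
    All,
    Project(String),
}

/// Parsed form of `update [<project>|all] [--rollback]` (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct UpdateCommand {
    pub target: Target,
    pub rollback: bool,
}

/// Returns `None` when the message is not an `update` command at all,
/// and `Some(Err(..))` when it is one but malformed.
pub fn parse_update(message: &str) -> Option<Result<UpdateCommand, String>> {
    let mut words = message.split_whitespace();
    if words.next()? != "update" {
        return None;
    }
    let mut target = None;
    let mut rollback = false;
    for word in words {
        match word {
            "--rollback" => rollback = true,
            flag if flag.starts_with("--") => {
                return Some(Err(format!("Unknown flag '{flag}'")));
            }
            "all" if target.is_none() => target = Some(Target::All),
            name if target.is_none() => target = Some(Target::Project(name.to_string())),
            extra => return Some(Err(format!("Unexpected argument '{extra}'"))),
        }
    }
    // `update` with no arguments behaves like `update all`.
    Some(Ok(UpdateCommand { target: target.unwrap_or(Target::All), rollback }))
}
```

Returning `Option<Result<..>>` lets the caller distinguish "not our command" from "our command, but malformed", so unrelated chat messages never produce an error reply.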

### Progress feedback

The bot posts incremental updates to the room (editing the same message where the platform supports it):

```
[huskies] Pulling image…  ⏳
[huskies] Image pulled (sha256:abc123). Stopping container…
[huskies] Container stopped. Starting new container…
[huskies] Health check passed ✅ (2 s)
```

On failure:
```
[huskies] Health check failed after 30 s ❌
[huskies] Rolling back to previous image (sha256:def456)…
[huskies] Rollback complete ✅
```
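
One way to keep these messages consistent is to model each step as a value and render it in one place. This is a sketch under assumptions: the `Progress` enum and `render` function are hypothetical, and only the message texts come from the examples above.

```rust
use std::fmt;

/// One progress step for a single project (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Progress {
    Pulling,
    Pulled { digest: String },
    Stopped,
    HealthPassed { secs: u64 },
    HealthFailed { secs: u64 },
    RollingBack { digest: String },
    RolledBack,
}

impl fmt::Display for Progress {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Progress::Pulling => write!(f, "Pulling image…  ⏳"),
            Progress::Pulled { digest } => {
                write!(f, "Image pulled ({digest}). Stopping container…")
            }
            Progress::Stopped => write!(f, "Container stopped. Starting new container…"),
            Progress::HealthPassed { secs } => write!(f, "Health check passed ✅ ({secs} s)"),
            Progress::HealthFailed { secs } => write!(f, "Health check failed after {secs} s ❌"),
            Progress::RollingBack { digest } => {
                write!(f, "Rolling back to previous image ({digest})…")
            }
            Progress::RolledBack => write!(f, "Rollback complete ✅"),
        }
    }
}

/// Renders all steps so far as one message body, one line per step.
/// Re-rendering the whole body suits platforms where the bot edits a single message.
pub fn render(project: &str, events: &[Progress]) -> String {
    events
        .iter()
        .map(|e| format!("[{project}] {e}"))
        .collect::<Vec<_>>()
        .join("\n")
}
```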

### Error cases

| Condition | Response |
|-----------|----------|
| Unknown project name | `Unknown project 'foo'. Known projects: huskies, robot-studio` |
| No Docker socket access | `Update not available: Docker socket not mounted` |
| Rollback with no previous image | `No previous image recorded for 'huskies'; cannot roll back` |
| Project container not managed by Docker | `'huskies' is not a container-managed project; rebuild it manually` |
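
The error cases in the table map naturally onto a single error type whose `Display` output is the chat response. A minimal sketch, assuming hypothetical names (`UpdateError`, `check_project`); only the response strings are taken from the table.

```rust
use std::fmt;

/// Errors the `update` command can report to the room (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum UpdateError {
    UnknownProject { name: String, known: Vec<String> },
    NoDockerSocket,
    NoPreviousImage { project: String },
    NotContainerManaged { project: String },
}

impl fmt::Display for UpdateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            UpdateError::UnknownProject { name, known } => write!(
                f,
                "Unknown project '{name}'. Known projects: {}",
                known.join(", ")
            ),
            UpdateError::NoDockerSocket => {
                write!(f, "Update not available: Docker socket not mounted")
            }
            UpdateError::NoPreviousImage { project } => write!(
                f,
                "No previous image recorded for '{project}'; cannot roll back"
            ),
            UpdateError::NotContainerManaged { project } => write!(
                f,
                "'{project}' is not a container-managed project; rebuild it manually"
            ),
        }
    }
}

/// Checks a requested name against the registered project list.
pub fn check_project(name: &str, known: &[String]) -> Result<(), UpdateError> {
    if known.iter().any(|k| k == name) {
        Ok(())
    } else {
        Err(UpdateError::UnknownProject { name: name.to_string(), known: known.to_vec() })
    }
}
```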

---

## 3. Auth

### 3.1 Threat model

The update command triggers container replacement — a privileged operation equivalent to `docker compose up -d`.
An unauthorized user who can send a message to the bot room could force a rolling restart or roll back a working container.

### 3.2 Proposed approach: room + role guard

**Layer 1 — Room restriction.**
The update command is only accepted in a designated *ops room*, configured in `bot.toml` (or `projects.toml`):

```toml
[gateway.ops_room]
room_id = "!abc123:homeserver.example.com"
```

Messages from other rooms are rejected with: `The update command is only available in the ops room.`

**Layer 2 — Sender role check (Matrix/Slack/Discord).**
The bot checks the sender's power level (Matrix) or admin status (Slack/Discord).
Only users with power level ≥ 50 (moderator) on Matrix, or with workspace admin status on Slack or the equivalent admin status on Discord, may issue `update`.
Unauthorized senders receive: `You do not have permission to issue update commands.`
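
Layers 1 and 2 reduce to a pure function over the message's room and the sender's role. The sketch below assumes the transport has already resolved the sender's power level or admin flag; `SenderRole`, `Request`, `Denied`, and `authorize` are hypothetical names, not existing bot APIs.

```rust
/// Minimum Matrix power level allowed to issue `update` (moderator).
pub const MIN_POWER_LEVEL: i64 = 50;

/// Sender privileges as resolved by the chat transport (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SenderRole {
    /// Matrix power level in the room.
    MatrixPowerLevel(i64),
    /// Slack/Discord: whether the sender has admin status.
    Admin(bool),
}

pub struct Request<'a> {
    pub room_id: &'a str,
    pub role: SenderRole,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    WrongRoom,
    InsufficientRole,
}

impl Denied {
    /// Chat response for each rejection, as worded in this section.
    pub fn message(&self) -> &'static str {
        match self {
            Denied::WrongRoom => "The update command is only available in the ops room.",
            Denied::InsufficientRole => "You do not have permission to issue update commands.",
        }
    }
}

/// Layer 1 (room restriction) is checked before layer 2 (sender role).
pub fn authorize(ops_room_id: &str, req: &Request) -> Result<(), Denied> {
    if req.room_id != ops_room_id {
        return Err(Denied::WrongRoom);
    }
    let allowed = match req.role {
        SenderRole::MatrixPowerLevel(level) => level >= MIN_POWER_LEVEL,
        SenderRole::Admin(is_admin) => is_admin,
    };
    if allowed { Ok(()) } else { Err(Denied::InsufficientRole) }
}
```

Checking the room first means a message outside the ops room gets the room error regardless of who sent it.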

**Layer 3 — Confirmation prompt for destructive operations.**
`update all` affects every project.
The bot responds with a confirmation challenge:

```
This will restart all 3 project containers. Reply `yes` within 60 s to confirm, or `no` to cancel.
```

Single-project updates (`update huskies`) do **not** require confirmation — they are already scoped.
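
The confirmation challenge can be modelled as a small value that remembers who was asked and when. This is a sketch under assumptions: `PendingConfirmation` and `Reply` are hypothetical, and the rule that only the original sender may confirm is an assumption the spec does not state. Time is passed in explicitly so the logic can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// How long the operator has to reply (60 s in this spec).
pub const CONFIRM_TIMEOUT: Duration = Duration::from_secs(60);

/// An outstanding `update all` challenge (hypothetical type).
pub struct PendingConfirmation {
    sender: String,
    asked_at: Instant,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Reply {
    Confirmed,
    Cancelled,
    Expired,
    /// Message from someone else, or not a yes/no answer.
    Ignored,
}

impl PendingConfirmation {
    pub fn new(sender: &str, now: Instant) -> Self {
        PendingConfirmation { sender: sender.to_string(), asked_at: now }
    }

    /// Classifies a room message received while the challenge is open.
    pub fn resolve(&self, sender: &str, body: &str, now: Instant) -> Reply {
        // Assumption: only the user who issued `update all` may answer.
        if sender != self.sender {
            return Reply::Ignored;
        }
        if now.duration_since(self.asked_at) > CONFIRM_TIMEOUT {
            return Reply::Expired;
        }
        match body.trim().to_lowercase().as_str() {
            "yes" => Reply::Confirmed,
            "no" => Reply::Cancelled,
            _ => Reply::Ignored,
        }
    }
}
```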

### 3.3 Future: Ed25519 operator token

When story 665 (Ed25519 auth) lands, the gateway's node identity keypair can sign an operator token.
The bot verifies the token against the node's public key before acting.
This removes the room/role dependency and allows the command to be issued programmatically
(e.g. from a CI pipeline via MCP).

For now the room + role guard is sufficient.

---

## 4. Rollout Approach

### 4.1 Docker-managed containers (primary path)

The gateway process has access to the Docker socket (mounted as a volume at `/var/run/docker.sock`).
The update sequence for a single project:

1. **Record current image** — read the running container's image digest and store it as `previous_image` in the `gateway_config` CRDT document (§4.3), keyed by project name.
2. **Pull new image** — `docker pull <image>` (or the compose-file equivalent tag).
3. **Drain connections** — gateway marks the project as `updating`; new proxy requests return 503 with a `Retry-After: 5` header; in-flight requests are allowed to complete (30 s grace window).
4. **Stop old container** — `docker stop --time=30 <container_name>`.
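5. **Start new container** — recreate the container from the newly pulled image, e.g. `docker compose up -d <service>`. A plain `docker start <container_name>` is not enough: it restarts the existing container, which still references the old image.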
6. **Health check** — poll the project's `/health` endpoint until 200 OK or 30 s timeout.
7. **Restore routing** — remove the `updating` flag; proxy resumes normal operation.

Steps 1–7 are serialised per project. When `update all` is used, projects are updated **one at a time** (not in parallel) to limit blast radius.
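
The sequence can be sketched against a small trait so the ordering is testable without Docker. Everything named here (`Runtime`, `Outcome`, `update_project`, `update_all`, `FakeRuntime`) is hypothetical; a real implementation would back the trait with Docker socket calls. On a failed health check the sketch deliberately leaves the project in `updating` and hands control to the rollback path of §5.

```rust
/// Container operations the sequence needs (hypothetical trait).
pub trait Runtime {
    fn running_digest(&mut self, project: &str) -> Result<String, String>;
    fn pull(&mut self, project: &str) -> Result<String, String>;
    fn set_updating(&mut self, project: &str, updating: bool);
    fn stop(&mut self, project: &str) -> Result<(), String>;
    fn start(&mut self, project: &str, digest: &str) -> Result<(), String>;
    fn wait_healthy(&mut self, project: &str) -> bool;
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Updated { previous: String, current: String },
    /// The new container did not come up; the caller should roll back to `previous`.
    NeedsRollback { previous: String, reason: String },
}

pub fn update_project<R: Runtime>(rt: &mut R, project: &str) -> Result<Outcome, String> {
    let previous = rt.running_digest(project)?; // step 1
    let current = rt.pull(project)?; // step 2: on error nothing has been touched yet
    rt.set_updating(project, true); // step 3
    match replace_container(rt, project, &current) {
        Ok(()) => {
            rt.set_updating(project, false); // step 7
            Ok(Outcome::Updated { previous, current })
        }
        // Routing stays in `updating` until the rollback path finishes.
        Err(reason) => Ok(Outcome::NeedsRollback { previous, reason }),
    }
}

/// Steps 4 to 6: stop, start on the new digest, health check.
fn replace_container<R: Runtime>(rt: &mut R, project: &str, digest: &str) -> Result<(), String> {
    rt.stop(project)?;
    rt.start(project, digest)?;
    if rt.wait_healthy(project) { Ok(()) } else { Err("health check failed".to_string()) }
}

/// `update all`: strictly one project at a time.
pub fn update_all<R: Runtime>(rt: &mut R, projects: &[&str]) -> Vec<(String, Result<Outcome, String>)> {
    let mut results = Vec::new();
    for project in projects {
        results.push((project.to_string(), update_project(rt, project)));
    }
    results
}

/// In-memory stand-in that records calls; used only to exercise the sketch.
pub struct FakeRuntime {
    pub healthy: bool,
    pub log: Vec<String>,
}

impl Runtime for FakeRuntime {
    fn running_digest(&mut self, _: &str) -> Result<String, String> {
        self.log.push("digest".into());
        Ok("sha256:old".into())
    }
    fn pull(&mut self, _: &str) -> Result<String, String> {
        self.log.push("pull".into());
        Ok("sha256:new".into())
    }
    fn set_updating(&mut self, _: &str, on: bool) {
        self.log.push(format!("updating={on}"));
    }
    fn stop(&mut self, _: &str) -> Result<(), String> {
        self.log.push("stop".into());
        Ok(())
    }
    fn start(&mut self, _: &str, digest: &str) -> Result<(), String> {
        self.log.push(format!("start {digest}"));
        Ok(())
    }
    fn wait_healthy(&mut self, _: &str) -> bool {
        self.log.push("health".into());
        self.healthy
    }
}
```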

### 4.2 Source-rebuild path (non-Docker / dev mode)

When Docker is not available (the gateway binary is running directly on the host, not in a container),
the update command falls back to the existing `rebuild_and_restart` flow (`server/src/rebuild.rs`):
`cargo build` → re-exec.
This path cannot update individual projects independently — it rebuilds the gateway itself.

### 4.3 Gateway state during update

```
idle → updating → (success) idle
                → (failure) rolling_back → (success) idle
                                         → (failure) error
```

The CRDT `gateway_config` collection gains four new LWW fields per project:

| Field | Type | Purpose |
|-------|------|---------|
| `update_state` | `"idle" \| "updating" \| "rolling_back" \| "error"` | Current update lifecycle stage (`"error"` is set when a rollback fails, see §5.1) |
| `update_started_at` | `u64` (unix ms) | When the update was triggered |
| `previous_image` | `string` | Image digest before the most recent update |
| `current_image` | `string` | Image digest currently running |

These fields are replicated to all nodes so that other gateway instances and headless agents
can observe update progress without polling HTTP.
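
The lifecycle can be written down as an explicit transition function, which makes illegal moves (for example starting an update while one is running) easy to reject. This is a sketch, not the CRDT schema: `UpdateState`, `Event`, and `ProjectUpdate` are hypothetical Rust-side mirrors of the fields above, and the `Error` state is the one §5.1 sets when a rollback fails.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdateState {
    Idle,
    Updating,
    RollingBack,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    UpdateRequested,
    HealthCheckPassed,
    HealthCheckFailed,
}

impl UpdateState {
    /// String stored in the `update_state` LWW field.
    pub fn as_str(self) -> &'static str {
        match self {
            UpdateState::Idle => "idle",
            UpdateState::Updating => "updating",
            UpdateState::RollingBack => "rolling_back",
            UpdateState::Error => "error",
        }
    }

    /// Next state, or `None` if the event is not valid in the current state.
    pub fn on(self, event: Event) -> Option<UpdateState> {
        match (self, event) {
            (UpdateState::Idle, Event::UpdateRequested) => Some(UpdateState::Updating),
            (UpdateState::Updating, Event::HealthCheckPassed) => Some(UpdateState::Idle),
            (UpdateState::Updating, Event::HealthCheckFailed) => Some(UpdateState::RollingBack),
            (UpdateState::RollingBack, Event::HealthCheckPassed) => Some(UpdateState::Idle),
            (UpdateState::RollingBack, Event::HealthCheckFailed) => Some(UpdateState::Error),
            _ => None,
        }
    }
}

/// Per-project record mirroring the four fields in the table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProjectUpdate {
    pub update_state: UpdateState,
    /// Unix time in milliseconds.
    pub update_started_at: u64,
    pub previous_image: Option<String>,
    pub current_image: Option<String>,
}
```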

---

## 5. Rollback Approach

### 5.1 Automatic rollback

If the health check in step 6 (§4.1) times out or returns a non-200 status, the gateway automatically:

1. Logs the failure: `[update] health check failed for huskies after 30 s`.
2. Posts to the ops room: `Health check failed. Rolling back…`.
3. Runs `docker stop` on the new container.
4. Pulls and starts the previous image digest (stored in `previous_image`).
5. Re-runs the health check on the rolled-back container.
6. Reports outcome to the room.

If the rollback health check also fails, the bot reports:
```
Rollback failed. Manual intervention required. Previous image: sha256:def456
```
and sets `update_state = "error"` in the CRDT. The ops room is notified; no further automatic action is taken.
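
The decision at the end of an automatic rollback is a two-way branch. The sketch below takes the container restart and the health check as closures so it stays independent of any Docker client; `RollbackOutcome`, `roll_back`, and `state_after` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RollbackOutcome {
    Restored,
    Failed { message: String },
}

/// Restarts the project on `previous_image` and re-runs the health check.
/// `restart_on` and `healthy` are supplied by the caller (assumed to wrap Docker calls).
pub fn roll_back<S, H>(previous_image: &str, mut restart_on: S, mut healthy: H) -> RollbackOutcome
where
    S: FnMut(&str) -> Result<(), String>,
    H: FnMut() -> bool,
{
    // The health check only runs if the restart itself succeeded.
    let restored = restart_on(previous_image).is_ok() && healthy();
    if restored {
        RollbackOutcome::Restored
    } else {
        RollbackOutcome::Failed {
            message: format!(
                "Rollback failed. Manual intervention required. Previous image: {previous_image}"
            ),
        }
    }
}

/// Value to write to `update_state` once the rollback attempt is over.
pub fn state_after(outcome: &RollbackOutcome) -> &'static str {
    match outcome {
        RollbackOutcome::Restored => "idle",
        RollbackOutcome::Failed { .. } => "error",
    }
}
```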

### 5.2 Manual rollback

An operator can issue `update huskies --rollback` at any time when the project is in `idle` state.
The command replays steps 3–7 of §4.1 with `previous_image` substituted for the target image.
`previous_image` is overwritten with the image that was displaced, so repeated rollbacks alternate between two images.
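
The alternating behaviour follows from a plain swap of the two image fields. A minimal sketch, with `Images` and `swap_for_rollback` as hypothetical names:

```rust
/// The two image fields tracked per project (hypothetical type).
pub struct Images {
    pub previous_image: Option<String>,
    pub current_image: String,
}

/// Makes `previous_image` current and records the displaced image as the new
/// `previous_image`. Returns the digest that should now be started.
pub fn swap_for_rollback<'a>(project: &str, images: &'a mut Images) -> Result<&'a str, String> {
    match images.previous_image.take() {
        Some(previous) => {
            let displaced = std::mem::replace(&mut images.current_image, previous);
            images.previous_image = Some(displaced);
            Ok(images.current_image.as_str())
        }
        None => Err(format!(
            "No previous image recorded for '{project}'; cannot roll back"
        )),
    }
}
```

Calling it twice in a row returns the record to its starting point, which is the alternation described above.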

### 5.3 Rollback unavailability

Rollback is unavailable when:
- No `previous_image` is recorded (first-ever update on this installation).
- `update_state` is already `"updating"` or `"rolling_back"` (only one concurrent update per project).
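
Both conditions can be checked up front, before any container is touched. In this sketch `UpdateState` and `check_rollback` are hypothetical, and the "already in progress" wording is an assumption (the spec only fixes the text for the missing-image case). Following §5.2, anything other than `idle` is treated as unavailable.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdateState {
    Idle,
    Updating,
    RollingBack,
    Error,
}

/// Returns `Ok(())` when a manual rollback may start, or the chat response otherwise.
pub fn check_rollback(
    project: &str,
    state: UpdateState,
    previous_image: Option<&str>,
) -> Result<(), String> {
    if state != UpdateState::Idle {
        // Hypothetical wording; only one update or rollback may run per project.
        return Err(format!("'{project}' is not idle; an update or rollback is already in progress or needs attention"));
    }
    if previous_image.is_none() {
        return Err(format!(
            "No previous image recorded for '{project}'; cannot roll back"
        ));
    }
    Ok(())
}
```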

---

## 6. Implementation Sketch

### 6.1 New files

| Path | Purpose |
|------|---------|
| `server/src/chat/commands/update.rs` | Synchronous `handle_update` stub that returns `None`; the real work is dispatched asynchronously, as with `rebuild` |
| `server/src/service/gateway/update.rs` | Core update/rollback logic; calls Docker API or falls back to `rebuild.rs` |
| `server/src/service/gateway/docker.rs` | Thin wrapper around Docker socket HTTP API (`/containers/:id/start` etc.) |

### 6.2 New CRDT fields

Extend the `gateway_config` CRDT document (already exists per Spike 679 §6) with:
- `projects.<name>.update_state` (LWW string)
- `projects.<name>.update_started_at` (LWW u64)
- `projects.<name>.previous_image` (LWW string)
- `projects.<name>.current_image` (LWW string)

### 6.3 Gateway HTTP changes

Add one endpoint that reports whether updates are available and in which mode (Docker or the rebuild fallback):

```
GET /gateway/update/available
→ {"available": true, "mode": "docker"} | {"available": true, "mode": "rebuild"} | {"available": false}
```

The frontend can use this to show/hide an "Update" button in the gateway project list.
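
The three response shapes correspond to a three-variant enum. The sketch below is an assumption about how availability might be decided: `classify` prefers Docker when the socket is present, and the `rebuild_supported` input is hypothetical since the spec does not say when `{"available": false}` applies. The JSON is written out by hand to keep the example dependency-free.

```rust
use std::path::Path;

/// Socket path named in §4.1.
pub const DOCKER_SOCKET: &str = "/var/run/docker.sock";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Availability {
    Docker,
    Rebuild,
    Unavailable,
}

/// Docker is the primary path; rebuild is the fallback.
pub fn classify(socket_present: bool, rebuild_supported: bool) -> Availability {
    if socket_present {
        Availability::Docker
    } else if rebuild_supported {
        Availability::Rebuild
    } else {
        Availability::Unavailable
    }
}

/// Probes the filesystem for the Docker socket, then classifies.
pub fn detect(rebuild_supported: bool) -> Availability {
    classify(Path::new(DOCKER_SOCKET).exists(), rebuild_supported)
}

impl Availability {
    /// Response body for `GET /gateway/update/available`.
    pub fn to_json(self) -> &'static str {
        match self {
            Availability::Docker => r#"{"available": true, "mode": "docker"}"#,
            Availability::Rebuild => r#"{"available": true, "mode": "rebuild"}"#,
            Availability::Unavailable => r#"{"available": false}"#,
        }
    }
}
```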

### 6.4 Async dispatch

`update` is an async command (like `rebuild`, `htop`, `start`).
The command keyword is detected in `on_room_message` before `try_handle_command` is invoked.
The handler spawns a background task with `tokio::spawn`, posts incremental updates via the existing transport's `send_message` / `edit_message` API, and returns immediately.
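
The shape of this dispatch (return immediately, stream progress from a background task) can be illustrated with the standard library alone. This sketch swaps in `std::thread::spawn` and an `mpsc` channel as stand-ins for `tokio::spawn` and the transport's message API, so it shows the control flow only; `dispatch_update` and the fixed step list are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread::{self, JoinHandle};

/// Placeholder steps; a real task would run the update sequence of §4.1.
const STEPS: &[&str] = &[
    "Pulling image…",
    "Stopping container…",
    "Starting new container…",
];

/// Starts the update in the background and returns at once.
/// The receiver yields progress lines in order; it closes when the task ends.
pub fn dispatch_update(project: &str) -> (JoinHandle<()>, Receiver<String>) {
    let (tx, rx) = mpsc::channel();
    let project = project.to_string();
    let handle = thread::spawn(move || {
        for step in STEPS {
            // Stop early if nobody is listening any more.
            if tx.send(format!("[{project}] {step}")).is_err() {
                return;
            }
        }
    });
    (handle, rx)
}
```

In the async version the channel would typically be replaced by direct calls to the transport from inside the spawned task.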

---

## 7. Open Questions

| # | Question | Notes |
|---|----------|-------|
| 1 | Should the Docker socket be mounted in the gateway container by default? | Security trade-off: socket access = container escape risk. Alternative: `docker exec` via a sidecar. |
| 2 | Should `update all` use a sequential or rolling strategy? | Sequential is safer; rolling is faster. Sequential chosen for v1. |
| 3 | How do we handle projects not managed by Docker (e.g. running on bare metal)? | Fallback to `rebuild` covers the gateway itself; project-specific fallback is out of scope for v1. |
| 4 | Should the confirmation challenge expire? | Yes — 60 s timeout, configurable in `bot.toml`. |
| 5 | Should update history be persisted beyond CRDT (i.e. across full gateway restarts)? | CRDT persists to SQLite, so yes, as long as the CRDT DB survives the restart. |
| 6 | Multi-gateway HA: which node triggers the actual Docker call? | The node that owns the Docker socket. CRDT `update_state` prevents double-triggering. |

---

## 8. Dependencies

| Story / Spike | Dependency type |
|--------------|----------------|
| Spike 679 (HTTP → CRDT bus) | Soft — `gateway_config` LWW collection needed for update state; can stub without it |
| Story 665 (Ed25519 auth) | Soft — operator token auth is a future hardening step; room+role guard suffices for v1 |
| `server/src/rebuild.rs` | Direct — reuse `rebuild_and_restart` for the non-Docker path |
| `server/src/gateway_relay.rs` | Indirect — update state changes should trigger relay events to connected frontends |