huskies: merge 811

This commit is contained in:
dave
2026-05-14 18:24:25 +00:00
parent 03a0ca258a
commit bb5abcd042
2 changed files with 381 additions and 0 deletions
@@ -0,0 +1,280 @@
# Spike 811: Fly.io Machines API Integration for Multi-Tenant Huskies SaaS
## Goal
Investigate how to operate huskies as a hosted multi-tenant SaaS on
[Fly.io Machines](https://fly.io/docs/machines/). Each tenant owns one or
more huskies *project* containers; a fronting gateway routes traffic by
tenant and provisions/destroys backing machines on demand. This document
captures the architecture, the API surface we need, and the operational
concerns that need answers before we start writing production code.
## Architecture at a Glance
```
┌──────────────────────┐ ┌───────────────────────────────────────────┐
│ Browser / CLI / Bot │───────▶│ huskies-gateway (Fly app: huskies-gw) │
└──────────────────────┘ HTTPS │ * authenticates tenant │
│ * picks active project for tenant │
│ * proxies /mcp /ws /api to machine │
│ * provisions machines via Machines API │
└──────────────────┬────────────────────────┘
│ .flycast (Wireguard)
┌────────────────────────────────────────────────┐
│ huskies-project-{tenant}-{project} │
│ (Fly app: huskies-projects, machine per tier)│
│ * runs `huskies --port 3001 /data/project` │
│ * persistent volume mounted at /data │
│ * .huskies/ + sled CRDT live on volume │
└────────────────────────────────────────────────┘
```
Two Fly apps:
* `huskies-gw` — small, always-on, replicated across regions; runs the
existing `huskies --gateway` binary plus a thin **Fly orchestrator**
layer that calls the Machines API.
* `huskies-projects` — single Fly app holding *one machine per tenant
project*. Using one app (rather than one app per tenant) keeps quota
management, IAM, and image distribution simple while still giving us
per-machine networking (`{machine_id}.vm.huskies-projects.internal`)
and per-tenant Fly volumes.
## Listed Concerns
The story brief flags the following concerns. Each is addressed below.
1. Machine lifecycle & API surface
2. Tenant isolation
3. Persistence and volumes
4. Networking & routing
5. Secrets and tenant credentials
6. Cost model and idle-shutdown
7. Wake-on-request / cold-start latency
8. Observability and logs
9. Disaster recovery and backups
10. Quotas and abuse limits
---
### 1. Machine Lifecycle & API Surface
Fly Machines is a REST API at `https://api.machines.dev/v1`. Auth is a
single bearer token per Fly organization (`FLY_API_TOKEN`).
Endpoints we will call:
| Verb | Path | Use |
|------|------|-----|
| `POST` | `/apps/{app}/machines` | Create a new project machine |
| `GET` | `/apps/{app}/machines/{id}` | Poll status |
| `GET` | `/apps/{app}/machines/{id}/wait?state=started&timeout=30` | Block until state |
| `POST` | `/apps/{app}/machines/{id}/start` | Wake a stopped machine |
| `POST` | `/apps/{app}/machines/{id}/stop` | Graceful stop (idle scale-to-zero) |
| `POST` | `/apps/{app}/machines/{id}/suspend` | Suspend RAM-to-disk (fast wake) |
| `DELETE` | `/apps/{app}/machines/{id}?force=true` | Destroy permanently |
| `GET` | `/apps/{app}/machines` | Enumerate during reconcile |
| `POST` | `/apps/{app}/volumes` | Create persistent volume for tenant |
| `DELETE` | `/apps/{app}/volumes/{id}` | Reclaim volume when tenant deletes project |
States the orchestrator observes: `created → starting → started → stopping
→ stopped → destroying → destroyed` (`replacing` and `suspending` are
transient).
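As a sketch, the orchestrator can guard its lifecycle calls with an explicit transition table. The variants mirror the states above; the allowed edges (including the suspend path) are an illustrative assumption, not Fly's authoritative model:

```rust
// Machine states as the orchestrator observes them. The transition
// table below is an assumption for illustration, not Fly's spec.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MachineState {
    Created,
    Starting,
    Started,
    Stopping,
    Stopped,
    Suspending,
    Suspended,
    Destroying,
    Destroyed,
}

fn is_valid_transition(from: MachineState, to: MachineState) -> bool {
    use MachineState::*;
    matches!(
        (from, to),
        (Created, Starting)
            | (Starting, Started)
            | (Started, Stopping)
            | (Started, Suspending)
            | (Stopping, Stopped)
            | (Suspending, Suspended)
            | (Suspended, Starting) // fast wake from suspend
            | (Stopped, Starting)   // cold start
            | (Stopped, Destroying)
            | (Destroying, Destroyed)
    )
}
```

Rejecting an invalid edge (e.g. `destroyed → started`) early gives the orchestrator a cheap sanity check before it issues an API call that can only fail.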
A successful provisioning sequence is:
1. `POST /volumes` (one-time per tenant project, 1 GiB default).
2. `POST /machines` with `config = { image, env, mounts: [{volume, path:"/data"}], guest, services }`.
3. `GET /machines/{id}/wait?state=started` (~10–20 s on cold start).
4. Cache `{tenant, project} → machine_id` in the gateway CRDT
(`gateway_projects` LWW-map already exists — extend the value with
`machine_id`, `volume_id`, `last_used_at`).
Destruction:
1. `POST /machines/{id}/stop` (graceful, lets sled flush).
2. `DELETE /machines/{id}?force=true`.
3. Optionally `DELETE /volumes/{id}` (only when tenant explicitly deletes
the project; idle stop must **never** delete volumes).
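The volume-retention invariant is worth encoding explicitly in the orchestrator rather than leaving it to call-site discipline; a minimal sketch with hypothetical names:

```rust
// Why a machine is being torn down determines whether its volume may
// be reclaimed. Idle scale-to-zero must never touch tenant data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TeardownReason {
    IdleStop,       // scale-to-zero: stop/suspend machine, keep the volume
    ExplicitDelete, // tenant deleted the project: machine and volume both go
}

fn should_delete_volume(reason: TeardownReason) -> bool {
    reason == TeardownReason::ExplicitDelete
}
```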
### 2. Tenant Isolation
* **Filesystem:** each machine has its own ephemeral root and its own
Fly volume mounted at `/data`. Volumes are not shareable across
machines, so tenants cannot read each other's CRDT.
* **Network:** machines on the same Fly app can reach each other via
6PN private networking. We must explicitly *not* expose the project
server externally; only the gateway holds a public IP. Project
machines bind to `[::]:3001` and rely on `.flycast` private routing.
* **Credentials:** project machines never see the gateway's
`FLY_API_TOKEN`. Tenant-supplied secrets (Anthropic key, Matrix
password, etc.) are stored as Fly secrets *scoped to the machine* via
the `secrets` field at create time, encrypted at rest by Fly.
* **CPU/RAM:** `guest = { cpu_kind: "shared", cpus: 2, memory_mb: 2048 }`
is a sensible default; larger tenants get `performance` cpus. Hard
caps prevent a runaway agent from eating a neighbour's quota.
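A sketch of how the gateway might map billing tier to the `guest` block. Only the shared-cpu default comes from the text; the `pro` sizing is an assumption:

```rust
// Guest sizing per billing tier. The default matches the shared-cpu
// figures above; the "pro" numbers are hypothetical.
struct Guest {
    cpu_kind: &'static str,
    cpus: u32,
    memory_mb: u32,
}

fn guest_for_tier(tier: &str) -> Guest {
    match tier {
        "pro" => Guest { cpu_kind: "performance", cpus: 4, memory_mb: 4096 },
        _ => Guest { cpu_kind: "shared", cpus: 2, memory_mb: 2048 },
    }
}
```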
### 3. Persistence and Volumes
* Fly volumes are zone-pinned. We pick the volume region from the
tenant's primary region (`PRIMARY_REGION` env on the gateway), with
fallback to `iad`.
* The volume holds:
* `/data/project/.huskies/` — pipeline.db (sled), bot.toml, project.toml
* `/data/project/.git` — repository (cloned on first run)
* `/data/project/` — working tree
* Sled needs a clean shutdown. The orchestrator must always `stop`
before `destroy`. We rely on Fly's `kill_signal = "SIGTERM"` + the
existing huskies shutdown path in `rebuild.rs`.
* **Snapshots:** Fly snapshots volumes daily by default (5-day
retention). For paid tiers we extend retention via `snapshot_retention`
on the volume.
### 4. Networking & Routing
The gateway already proxies MCP/WS/REST by active project. For SaaS we
add tenant resolution **before** the project lookup:
```
Host: alice.huskies.app → tenant = alice
GET /tenants/alice/projects/foo → project_id, machine_id
proxy to fdaa:0:abcd:a7b:e2:1::3:3001 (or {machine_id}.vm.huskies-projects.internal:3001)
```
* Tenant resolution lives in a new `tenants` CRDT LWW-map keyed by
subdomain → tenant_id; reuses the existing CRDT bus.
* Internal DNS: `<machine_id>.vm.huskies-projects.internal` resolves on
the private network. `<app>.flycast` is the load-balanced anycast
name; we prefer the explicit machine address since each tenant has
exactly one project machine at a time.
* TLS terminates at the Fly edge for `*.huskies.app`. The gateway
receives plain HTTP/2 inside 6PN.
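The explicit per-machine address the gateway proxies to is pure string assembly; a sketch:

```rust
// Build the explicit 6PN address for a tenant's single project
// machine, preferring it over the load-balanced <app>.flycast name.
fn machine_addr(machine_id: &str, app: &str, port: u16) -> String {
    format!("{machine_id}.vm.{app}.internal:{port}")
}
```

The machine id in the example below is illustrative.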
### 5. Secrets and Tenant Credentials
* `FLY_API_TOKEN` lives only on the gateway (`fly secrets set
FLY_API_TOKEN=… -a huskies-gw`).
* Per-tenant `ANTHROPIC_API_KEY`, `MATRIX_PASSWORD`, etc. are POSTed by
the tenant in the SaaS UI, encrypted with the gateway's KMS key, and
passed to the machine at create time via the Machines API
`config.env` (Fly stores env values encrypted).
* Rotation: changing a tenant secret means `POST /machines/{id}/update`
with the new env, which triggers a rolling replace. The orchestrator
schedules this during the tenant's idle window when possible.
### 6. Cost Model and Idle-Shutdown
Indicative pricing (us-east, 2026):
| Machine | Hourly | Notes |
|---------|--------|-------|
| `shared-cpu-2x@2048` always-on | ~$0.027 | $19/mo if 24×7 |
| `shared-cpu-2x@2048` suspended | ~$0.0009 | $0.65/mo idle |
| Volume 1 GiB | ~$0.0002 | $0.15/mo |
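The monthly figures follow from the hourly rates, assuming ~730 machine-hours per month:

```rust
// Hourly-to-monthly conversion behind the table above (730 h/mo).
fn monthly_usd(hourly: f64) -> f64 {
    hourly * 730.0
}
```

`monthly_usd(0.027)` gives ≈ $19.7, matching the always-on row; the suspended row at $0.0009/h lands in the same ballpark as the table's $0.65/mo.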
Multi-tenant pricing requires **suspend on idle**:
* Auto-stop: in the machine config, set `services[].auto_stop_machines
  = "suspend"` and `services[].auto_start_machines = true`. Fly's
  proxy suspends the machine once it has seen no incoming traffic for
  ~5 min, as long as `min_machines_running` (0 for us) stays satisfied.
* On the next request, the proxy auto-wakes the machine. Suspend →
  resume is ~300 ms (RAM snapshot restored from disk); a full
  `stopped → started` boot is 10–20 s. We prefer `suspend` for SaaS.
* For long-lived agents (a coder agent running on the machine), the
gateway sends keepalive pings so Fly does not idle-stop while work is
in progress. Implementation: gateway tracks `active_agents` count for
each machine in CRDT; if `>0`, hit `/api/agents` once per minute.
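The keepalive rule reduces to a small predicate; a sketch using the one-minute interval from the text (the function name is hypothetical):

```rust
use std::time::Duration;

// Ping the machine's /api/agents endpoint only while agents are
// running, and at most once per minute.
fn should_ping(active_agents: u32, since_last_ping: Duration) -> bool {
    active_agents > 0 && since_last_ping >= Duration::from_secs(60)
}
```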
### 7. Wake-on-Request / Cold-Start Latency
Three latency tiers:
| Tier | Wake | When |
|------|------|------|
| Suspended | ~300 ms | Default for active tenants |
| Stopped | 10–20 s | Tenants idle > 7 days |
| Destroyed | 60–90 s (clone + boot) | Free tier reaped after 30 d |
The gateway returns a `202 Accepted` with a `Retry-After: 1` header
while wake is in progress and surfaces a "warming up" splash. The
existing `huskies-gw` MCP code path needs an explicit wake call for
in-flight requests because Fly's automatic wake only triggers on TCP
SYN to a registered service port.
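One way to pick the `Retry-After` value from the machine's current state; the mapping follows the latency tiers above, and the exact second counts are assumptions:

```rust
// None means proxy immediately; Some(n) means answer 202 Accepted
// with Retry-After: n while the wake (or re-provision) runs.
fn retry_after_secs(machine_state: &str) -> Option<u32> {
    match machine_state {
        "started" => None,      // already up: proxy the request
        "suspended" => Some(1), // ~300 ms resume, retry almost at once
        "stopped" => Some(5),   // 10-20 s cold boot
        _ => Some(30),          // destroyed/unknown: full re-provision
    }
}
```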
### 8. Observability and Logs
* `fly logs -a huskies-projects -i <machine_id>` streams stdout/stderr.
We expose this through the gateway as `GET /api/admin/tenants/{id}/logs`.
* Should each machine ship logs to the gateway via a sidecar `vector`
  process? Decision: **no** — Fly's built-in NATS log shipper is enough
  for v1; revisit if log volume grows.
* Metrics: Fly auto-exports per-machine CPU/RAM/network as Prometheus
series scrapeable from a `huskies-metrics` machine in the same 6PN.
We hook into Grafana Cloud's free tier for the dashboard.
### 9. Disaster Recovery and Backups
* Volume snapshots (daily) cover hardware failure.
* The CRDT replicates to the gateway over the existing `/crdt-sync`
WebSocket. The gateway keeps a 30-day rolling backup of each tenant's
CRDT in S3 (`s3://huskies-backups/{tenant}/{date}.ops`). This lets us
reconstruct the project tree even if a Fly volume is unrecoverable.
* Restore flow: provision a fresh machine + volume, replay the latest
snapshot, then replay incremental ops from S3. Documented in a
follow-up runbook story.
### 10. Quotas and Abuse Limits
* Per-tenant: max 2 concurrent agents, max 8 GiB volume, max 4 CPU,
max 200 OAuth-paid model dollars per month. Enforced in the gateway
before calling the Machines API. Over-quota → `429 Too Many Requests`
with a Stripe upsell page.
* Per-Fly-app: Fly soft-limits 1000 machines per app. At scale we
shard tenants across `huskies-projects-{0..9}` apps using
`consistent_hash(tenant_id)`.
* Abuse: every tenant signs up with a verified email + Stripe card.
Free tier capped at 1 project, suspended after 7 days idle, destroyed
after 30 days idle.
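Both the quota gate and the app sharding are enforceable as pure functions in the gateway before any Machines API call. A sketch: the limits mirror the numbers above, and `DefaultHasher` stands in for whatever stable hash production uses (it is not guaranteed stable across Rust releases):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Per-tenant usage snapshot checked before provisioning.
struct TenantUsage {
    agents: u32,
    volume_gb: u32,
    cpus: u32,
}

// Reject with a 429-style error when any cap is exceeded.
fn check_quota(u: &TenantUsage) -> Result<(), &'static str> {
    if u.agents > 2 {
        return Err("429: max 2 concurrent agents");
    }
    if u.volume_gb > 8 {
        return Err("429: max 8 GiB volume");
    }
    if u.cpus > 4 {
        return Err("429: max 4 CPU");
    }
    Ok(())
}

// Deterministically map a tenant to one of N project apps.
fn shard_app(tenant_id: &str, shards: u64) -> String {
    let mut h = DefaultHasher::new();
    tenant_id.hash(&mut h);
    format!("huskies-projects-{}", h.finish() % shards)
}
```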
---
## Decisions
| Decision | Choice | Rejected alternative |
|----------|--------|----------------------|
| Apps topology | **Single `huskies-projects` app, one machine per tenant** | One app per tenant: clean isolation, but blows out Fly app quotas and complicates IAM |
| Idle strategy | **Suspend, not stop** | Stop: cheaper but 20 s cold start is poor UX for chat |
| Secrets path | **Machine env via Machines API at create time** | Fly app-level secrets: shared across all tenant machines, leaks across tenants |
| State storage | **Per-tenant Fly volume holding sled + git** | Object storage only: would require rewriting sled backend |
| Tenant resolution | **Subdomain → CRDT `tenants` LWW-map** | Path prefix routing: harder to issue per-tenant TLS, breaks browser cookies |
| Volume retention | **Never delete on idle stop; only on explicit project deletion** | Auto-delete after N days idle: too easy to lose user data |
## Open Questions
1. How do we hand off long-running coder agents during a Fly host
evacuation (machine replace event)? Suspend won't survive a host
reboot; we may need a "draining" hook that finishes the current AC
and commits before allowing replacement.
2. Should the gateway also live as Fly machines (auto-scale) or stay
as Fly app v1 with replicas? Probably the former for global routing,
but that's a separate spike.
3. Billing surfaces: do we pass through Fly's per-machine cost to the
tenant, or amortize it into a flat per-project price? Product call.
4. Outbound network egress (model API calls, git pushes) is metered by
Fly. At Claude Opus rates, model API egress dwarfs everything else,
so this is a rounding error — confirm at 100-tenant scale.
## Proof-of-Concept Script
A working sketch lives at
[`fly_multitenant_poc.sh`](./fly_multitenant_poc.sh). It demonstrates
end-to-end: read `FLY_API_TOKEN`, create a volume, create a machine
attached to it, wait until started, stop, and destroy. The script is
runnable but is **not** what production code looks like — production
will translate these calls into Rust against a typed `flyio_machines`
client crate, called from a new `server::service::cloud::fly`
module that the gateway invokes on tenant signup.
@@ -0,0 +1,101 @@
#!/usr/bin/env bash
# fly_multitenant_poc.sh — Proof of concept for Spike 811.
#
# Demonstrates the Fly.io Machines API calls that the huskies gateway
# will eventually make to provision and tear down a per-tenant project
# machine. Run against a real Fly org with FLY_API_TOKEN set, or read it
# as a commented sketch — the calls are the contract.
#
# This is NOT production code. Production will issue these requests
# from Rust (see server::service::cloud::fly) with retries, structured
# errors, and CRDT writes to record machine_id/volume_id. The shell
# script exists so the spec is verifiable end-to-end.
#
# Required env:
# FLY_API_TOKEN - org-scoped Fly token
# FLY_APP - name of the huskies-projects Fly app (must exist)
# TENANT_ID - identifier used to tag and name the machine
# REGION - Fly region code, e.g. "iad" (default: iad)
set -euo pipefail
: "${FLY_API_TOKEN:?FLY_API_TOKEN must be set}"
: "${FLY_APP:?FLY_APP must be set}"
: "${TENANT_ID:?TENANT_ID must be set}"
REGION="${REGION:-iad}"
IMAGE="registry.fly.io/huskies-projects:latest"
API="https://api.machines.dev/v1"
AUTH=(-H "Authorization: Bearer ${FLY_API_TOKEN}" -H "Content-Type: application/json")
echo "==> 1. Create a 1 GiB persistent volume for tenant ${TENANT_ID}"
VOLUME_JSON=$(curl -sS -X POST "${API}/apps/${FLY_APP}/volumes" "${AUTH[@]}" --data @- <<EOF
{
"name": "huskies_${TENANT_ID}",
"region": "${REGION}",
"size_gb": 1
}
EOF
)
VOLUME_ID=$(echo "${VOLUME_JSON}" | jq -r .id)
echo " volume_id = ${VOLUME_ID}"
echo "==> 2. Create a machine attached to the volume, with auto-suspend"
MACHINE_JSON=$(curl -sS -X POST "${API}/apps/${FLY_APP}/machines" "${AUTH[@]}" --data @- <<EOF
{
"name": "huskies-${TENANT_ID}",
"region": "${REGION}",
"config": {
"image": "${IMAGE}",
"env": {
"TENANT_ID": "${TENANT_ID}",
"HUSKIES_PORT": "3001",
"PRIMARY_REGION": "${REGION}"
},
"guest": { "cpu_kind": "shared", "cpus": 2, "memory_mb": 2048 },
"mounts": [ { "volume": "${VOLUME_ID}", "path": "/data" } ],
"services": [ {
"ports": [
{ "port": 443, "handlers": ["tls","http"] },
{ "port": 80, "handlers": ["http"] }
],
"protocol": "tcp",
"internal_port": 3001,
"auto_stop_machines": "suspend",
"auto_start_machines": true,
"min_machines_running": 0
} ],
"metadata": { "tenant": "${TENANT_ID}", "managed_by": "huskies-gw" },
"restart": { "policy": "on-failure", "max_retries": 5 }
}
}
EOF
)
MACHINE_ID=$(echo "${MACHINE_JSON}" | jq -r .id)
PRIVATE_IP=$(echo "${MACHINE_JSON}" | jq -r .private_ip)
echo " machine_id = ${MACHINE_ID}"
echo " private_ip = ${PRIVATE_IP}"
echo "==> 3. Wait for the machine to reach 'started' (long-poll, 60s timeout)"
curl -sS "${API}/apps/${FLY_APP}/machines/${MACHINE_ID}/wait?state=started&timeout=60" "${AUTH[@]}" \
  | jq -r '"    state ok = \(.ok)"'
echo " machine reachable at ${MACHINE_ID}.vm.${FLY_APP}.internal:3001"
# ----- At this point the gateway would record (tenant, machine_id, volume_id)
# ----- into the CRDT and start proxying traffic. We pause here.
sleep 2
echo "==> 4. Graceful stop (lets sled flush; idle-suspend uses the same path)"
curl -sS -X POST "${API}/apps/${FLY_APP}/machines/${MACHINE_ID}/stop" "${AUTH[@]}" \
--data '{"signal":"SIGTERM","timeout":"30s"}' > /dev/null
echo "==> 5. Destroy the machine"
curl -sS -X DELETE "${API}/apps/${FLY_APP}/machines/${MACHINE_ID}?force=true" "${AUTH[@]}" > /dev/null
echo " machine destroyed"
echo "==> 6. Reclaim the volume (only when the tenant deletes the project)"
curl -sS -X DELETE "${API}/apps/${FLY_APP}/volumes/${VOLUME_ID}" "${AUTH[@]}" > /dev/null
echo " volume reclaimed"
echo "==> done."