diff --git a/.huskies/specs/tech/SPIKE_FLY_MULTITENANT.md b/.huskies/specs/tech/SPIKE_FLY_MULTITENANT.md
new file mode 100644
index 00000000..80bd8843
--- /dev/null
+++ b/.huskies/specs/tech/SPIKE_FLY_MULTITENANT.md
@@ -0,0 +1,280 @@
# Spike 811: Fly.io Machines API Integration for Multi-Tenant Huskies SaaS

## Goal

Investigate how to operate huskies as a hosted multi-tenant SaaS on
[Fly.io Machines](https://fly.io/docs/machines/). Each tenant owns one or
more huskies *project* containers; a fronting gateway routes traffic by
tenant and provisions/destroys backing machines on demand. This document
captures the architecture, the API surface we need, and the operational
concerns that need answers before we start writing production code.

## Architecture at a Glance

```
┌──────────────────────┐        ┌───────────────────────────────────────────┐
│ Browser / CLI / Bot  │───────▶│ huskies-gateway (Fly app: huskies-gw)     │
└──────────────────────┘ HTTPS  │ * authenticates tenant                    │
                                │ * picks active project for tenant         │
                                │ * proxies /mcp /ws /api to machine        │
                                │ * provisions machines via Machines API    │
                                └──────────────────┬────────────────────────┘
                                                   │ .flycast (WireGuard)
                                                   ▼
                 ┌────────────────────────────────────────────────┐
                 │ huskies-project-{tenant}-{project}             │
                 │ (Fly app: huskies-projects, one machine per    │
                 │  tenant project, sized per tier)               │
                 │ * runs `huskies --port 3001 /data/project`     │
                 │ * persistent volume mounted at /data           │
                 │ * .huskies/ + sled CRDT live on volume         │
                 └────────────────────────────────────────────────┘
```

Two Fly apps:

* `huskies-gw` — small, always-on, replicated across regions; runs the
  existing `huskies --gateway` binary plus a thin **Fly orchestrator**
  layer that calls the Machines API.
* `huskies-projects` — single Fly app holding *one machine per tenant
  project*. Using one app (rather than one app per tenant) keeps quota
  management, IAM, and image distribution simple while still giving us
  per-machine networking (`{machine_id}.vm.huskies-projects.internal`)
  and per-tenant Fly volumes.

## Listed Concerns

The story brief flags the following concerns. Each is addressed below.

1. Machine lifecycle & API surface
2. Tenant isolation
3. Persistence and volumes
4. Networking & routing
5. Secrets and tenant credentials
6. Cost model and idle-shutdown
7. Wake-on-request / cold-start latency
8. Observability and logs
9. Disaster recovery and backups
10. Quotas and abuse limits

---

### 1. Machine Lifecycle & API Surface

Fly Machines is a REST API at `https://api.machines.dev/v1`. Auth is a
single bearer token per Fly organization (`FLY_API_TOKEN`).

Endpoints we will call:

| Verb | Path | Use |
|------|------|-----|
| `POST` | `/apps/{app}/machines` | Create a new project machine |
| `GET` | `/apps/{app}/machines/{id}` | Poll status |
| `GET` | `/apps/{app}/machines/{id}/wait?state=started&timeout=30` | Block until state |
| `POST` | `/apps/{app}/machines/{id}/start` | Wake a stopped machine |
| `POST` | `/apps/{app}/machines/{id}/stop` | Graceful stop (idle scale-to-zero) |
| `POST` | `/apps/{app}/machines/{id}/suspend` | Suspend RAM-to-disk (fast wake) |
| `DELETE` | `/apps/{app}/machines/{id}?force=true` | Destroy permanently |
| `GET` | `/apps/{app}/machines` | Enumerate during reconcile |
| `POST` | `/apps/{app}/volumes` | Create persistent volume for tenant |
| `DELETE` | `/apps/{app}/volumes/{id}` | Reclaim volume when tenant deletes project |

States the orchestrator observes: `created → starting → started → stopping
→ stopped → destroying → destroyed` (`replacing` and `suspending` are
transient).

A successful provisioning sequence (sketched in code after this list) is:

1. `POST /volumes` (one-time per tenant project, 1 GiB default).
2. `POST /machines` with `config = { image, env, mounts: [{volume, path:"/data"}], guest, services }`.
3. `GET /machines/{id}/wait?state=started` (~10–20 s on cold start).
4. Cache `{tenant, project} → machine_id` in the gateway CRDT
   (`gateway_projects` LWW-map already exists — extend the value with
   `machine_id`, `volume_id`, `last_used_at`).
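To pin down the request shapes, here is a minimal sketch of steps 1–3 in
Rust, roughly the contract the future `flyio_machines` client will
implement. It assumes `reqwest` (blocking, with the `json` feature) and
`serde_json`; the JSON field names (`size_gb`, `guest`, `mounts`,
`autostop`, `autostart`) follow the Machines API docs as of writing and
should be re-verified before production use.

```rust
use reqwest::blocking::Client;
use serde_json::{json, Value};

const API: &str = "https://api.machines.dev/v1";

fn provision(client: &Client, token: &str, app: &str, tenant: &str) -> reqwest::Result<Value> {
    // 1. One volume per tenant project (region-pinned, 1 GiB default).
    let volume: Value = client
        .post(format!("{API}/apps/{app}/volumes"))
        .bearer_auth(token)
        .json(&json!({ "name": format!("data_{tenant}"), "region": "iad", "size_gb": 1 }))
        .send()?
        .json()?;

    // 2. Machine attached to the volume. `autostop`/`autostart` are the
    //    Machines API spellings of fly.toml's auto_stop_machines /
    //    auto_start_machines (assumed; verify against current docs).
    let machine: Value = client
        .post(format!("{API}/apps/{app}/machines"))
        .bearer_auth(token)
        .json(&json!({
            "name": format!("huskies-project-{tenant}"),
            "region": "iad",
            "config": {
                "image": "registry.fly.io/huskies-projects:latest",
                "guest": { "cpu_kind": "shared", "cpus": 2, "memory_mb": 2048 },
                "mounts": [{ "volume": volume["id"], "path": "/data" }],
                "services": [{
                    "protocol": "tcp",
                    "internal_port": 3001,
                    "autostop": "suspend",
                    "autostart": true
                }]
            }
        }))
        .send()?
        .json()?;

    // 3. Long-poll until the machine reports `started` (~10-20 s cold).
    let id = machine["id"].as_str().expect("machine id");
    client
        .get(format!("{API}/apps/{app}/machines/{id}/wait"))
        .bearer_auth(token)
        .query(&[("state", "started"), ("timeout", "30")])
        .send()?
        .error_for_status()?;

    Ok(machine)
}
```

Production code will be async and will persist `machine_id`/`volume_id`
to the CRDT between steps 2 and 3, so a crashed orchestrator can
reconcile against `GET /apps/{app}/machines`.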
Destruction:

1. `POST /machines/{id}/stop` (graceful, lets sled flush).
2. `DELETE /machines/{id}?force=true`.
3. Optionally `DELETE /volumes/{id}` (only when the tenant explicitly
   deletes the project; idle stop must **never** delete volumes).

### 2. Tenant Isolation

* **Filesystem:** each machine has its own ephemeral root and its own
  Fly volume mounted at `/data`. Volumes are not shareable across
  machines, so tenants cannot read each other's CRDT.
* **Network:** machines on the same Fly app can reach each other via
  6PN private networking. We must explicitly *not* expose the project
  server externally; only the gateway holds a public IP. Project
  machines bind to `[::]:3001` and rely on `.flycast` private routing.
* **Credentials:** project machines never see the gateway's
  `FLY_API_TOKEN`. Tenant-supplied secrets (Anthropic key, Matrix
  password, etc.) are passed in the machine's `config.env` at create
  time (see section 5) and are never shared across tenant machines.
* **CPU/RAM:** `guest = { cpu_kind: "shared", cpus: 2, memory_mb: 2048 }`
  is a sensible default; larger tenants get `performance` CPUs. Hard
  caps prevent a runaway agent from eating a neighbour's quota.

### 3. Persistence and Volumes

* Fly volumes are pinned to a single region (and physical host). We
  pick the volume region from the tenant's primary region
  (`PRIMARY_REGION` env on the gateway), with fallback to `iad`.
* The volume holds:
  * `/data/project/.huskies/` — pipeline.db (sled), bot.toml, project.toml
  * `/data/project/.git` — repository (cloned at first run)
  * `/data/project/` — working tree
* Sled needs a clean shutdown. The orchestrator must always `stop`
  before `destroy` (see the sketch after this list). We rely on Fly's
  `kill_signal = "SIGTERM"` + the existing huskies shutdown path in
  `rebuild.rs`.
* **Snapshots:** Fly snapshots volumes daily by default (5-day
  retention). For paid tiers we extend retention via `snapshot_retention`
  on the volume.
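The stop-before-destroy and volume-retention invariants are easy to get
wrong, so they are worth encoding once. A sketch under the same
assumptions as the provisioning example; `delete_volume` is the only
place a volume may ever be reclaimed.

```rust
use reqwest::blocking::Client;
use serde_json::json;

const API: &str = "https://api.machines.dev/v1";

/// Tear down a project machine. `delete_volume` is true only when the
/// tenant explicitly deletes the project; idle reaping always passes false.
fn destroy_project(
    client: &Client,
    token: &str,
    app: &str,
    machine_id: &str,
    volume_id: &str,
    delete_volume: bool,
) -> reqwest::Result<()> {
    // Graceful stop first: SIGTERM gives sled time to flush to /data.
    client
        .post(format!("{API}/apps/{app}/machines/{machine_id}/stop"))
        .bearer_auth(token)
        .json(&json!({ "signal": "SIGTERM", "timeout": "30s" }))
        .send()?
        .error_for_status()?;

    // Only then destroy the machine itself.
    client
        .delete(format!("{API}/apps/{app}/machines/{machine_id}?force=true"))
        .bearer_auth(token)
        .send()?
        .error_for_status()?;

    // The volume outlives idle stops; it is reclaimed only on explicit delete.
    if delete_volume {
        client
            .delete(format!("{API}/apps/{app}/volumes/{volume_id}"))
            .bearer_auth(token)
            .send()?
            .error_for_status()?;
    }
    Ok(())
}
```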
### 4. Networking & Routing

The gateway already proxies MCP/WS/REST for the active project. For
SaaS we add tenant resolution **before** the project lookup:

```
Host: alice.huskies.app          → tenant = alice
        ↓
GET /tenants/alice/projects/foo  → project_id, machine_id
        ↓
proxy to [fdaa:0:abcd:a7b:e2:1::3]:3001
  (or {machine_id}.vm.huskies-projects.internal:3001)
```

* Tenant resolution lives in a new `tenants` CRDT LWW-map keyed by
  subdomain → tenant_id; it reuses the existing CRDT bus.
* Internal DNS: `{machine_id}.vm.huskies-projects.internal` resolves on
  the private network. `huskies-projects.flycast` is the load-balanced
  anycast name; we prefer the explicit machine address since each tenant
  has exactly one project machine at a time.
* TLS terminates at the Fly edge for `*.huskies.app`. The gateway
  receives plain HTTP/2 inside 6PN.

### 5. Secrets and Tenant Credentials

* `FLY_API_TOKEN` lives only on the gateway (`fly secrets set
  FLY_API_TOKEN=… -a huskies-gw`).
* Per-tenant `ANTHROPIC_API_KEY`, `MATRIX_PASSWORD`, etc. are POSTed by
  the tenant in the SaaS UI, encrypted with the gateway's KMS key, and
  passed to the machine at create time via the Machines API
  `config.env` (Fly stores env values encrypted).
* Rotation: changing a tenant secret means updating the machine
  (`POST /machines/{id}` with the new config), which restarts it with
  the new env. The orchestrator schedules this during the tenant's idle
  window when possible.

### 6. Cost Model and Idle-Shutdown

Indicative pricing (us-east, 2026):

| Machine | Hourly | Notes |
|---------|--------|-------|
| `shared-cpu-2x@2048` always-on | ~$0.027 | ~$19/mo if 24×7 |
| `shared-cpu-2x@2048` suspended | ~$0.0009 | ~$0.65/mo idle |
| Volume 1 GiB | ~$0.0002 | ~$0.15/mo |

Multi-tenant pricing requires **suspend on idle**:

* Auto-stop: in the machine config, set the service's `autostop =
  "suspend"` and `autostart = true` (the Machines API spellings of
  fly.toml's `auto_stop_machines` / `auto_start_machines`). Fly's proxy
  suspends the machine once it has seen no traffic for ~5 min, provided
  the app is above its configured minimum running-machine count (zero
  for project machines).
* On the next request, the proxy auto-wakes the machine. Resuming from
  suspend takes ~300 ms (RAM snapshot from disk); a full `stopped →
  started` boot is 10–20 s. We prefer `suspend` for SaaS.
* For long-lived agents (a coder agent running on the machine), the
  gateway sends keepalive pings so Fly does not idle-stop the machine
  while work is in progress. Implementation: the gateway tracks an
  `active_agents` count for each machine in the CRDT; if it is `>0`, it
  hits `/api/agents` once per minute.

### 7. Wake-on-Request / Cold-Start Latency

Three latency tiers:

| Tier | Wake | When |
|------|------|------|
| Suspended | ~300 ms | Default for active tenants |
| Stopped | 10–20 s | Tenants idle > 7 days |
| Destroyed | 60–90 s (clone + boot) | Free tier reaped after 30 d |

The gateway returns a `202 Accepted` with a `Retry-After: 1` header
while a wake is in progress and surfaces a "warming up" splash. The
existing `huskies-gw` MCP code path needs an explicit wake call for
in-flight requests: Fly's automatic wake only triggers for connections
that traverse the Fly proxy (public service ports or `.flycast`), not
for the gateway's direct 6PN connections to the machine address.
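A sketch of that explicit wake path, with a hypothetical `RouteOutcome`
type standing in for the gateway's real proxy plumbing; production code
would consult the CRDT-cached machine state before polling the API.

```rust
use reqwest::blocking::Client;
use serde_json::Value;

const API: &str = "https://api.machines.dev/v1";

/// What the gateway does per request (illustrative names only).
enum RouteOutcome {
    /// Machine is up: proxy to its 6PN address now.
    Proxy(String),
    /// Wake in flight: respond `202 Accepted` + `Retry-After: 1`.
    Warming,
}

fn ensure_awake(client: &Client, token: &str, app: &str, id: &str) -> reqwest::Result<RouteOutcome> {
    let m: Value = client
        .get(format!("{API}/apps/{app}/machines/{id}"))
        .bearer_auth(token)
        .send()?
        .json()?;

    match m["state"].as_str() {
        Some("started") => Ok(RouteOutcome::Proxy(format!("{id}.vm.{app}.internal:3001"))),
        _ => {
            // `start` covers both suspended (~300 ms resume) and stopped
            // (10-20 s boot) machines; races with an in-flight start are
            // ignored in this sketch.
            let _ = client
                .post(format!("{API}/apps/{app}/machines/{id}/start"))
                .bearer_auth(token)
                .send()?;
            Ok(RouteOutcome::Warming)
        }
    }
}
```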
### 8. Observability and Logs

* `fly logs -a huskies-projects -i <machine-id>` streams stdout/stderr.
  We expose this through the gateway as `GET /api/admin/tenants/{id}/logs`.
* Should each machine ship logs to the gateway via a sidecar `vector`
  process? Decision: **no** — Fly's built-in NATS log shipper is enough
  for v1; revisit if log volume grows.
* Metrics: Fly auto-exports per-machine CPU/RAM/network as Prometheus
  series scrapeable from a `huskies-metrics` machine in the same 6PN.
  We hook into Grafana Cloud's free tier for the dashboard.

### 9. Disaster Recovery and Backups

* Volume snapshots (daily) cover hardware failure.
* The CRDT replicates to the gateway over the existing `/crdt-sync`
  WebSocket. The gateway keeps a 30-day rolling backup of each tenant's
  CRDT in S3 (`s3://huskies-backups/{tenant}/{date}.ops`). This lets us
  reconstruct the project tree even if a Fly volume is unrecoverable.
* Restore flow: provision a fresh machine + volume, restore the latest
  volume snapshot, then replay incremental ops from S3. Documented in a
  follow-up runbook story.

### 10. Quotas and Abuse Limits

* Per-tenant: max 2 concurrent agents, max 8 GiB volume, max 4 CPUs,
  max $200/month of OAuth-paid model usage. Enforced in the gateway
  before calling the Machines API. Over-quota → `429 Too Many Requests`
  with a Stripe upsell page.
* Per-Fly-app: Fly soft-limits 1000 machines per app. At scale we
  shard tenants across `huskies-projects-{0..9}` apps using
  `consistent_hash(tenant_id)`.
* Abuse: every tenant signs up with a verified email + Stripe card.
  Free tier is capped at 1 project, stopped after 7 days idle, and
  destroyed after 30 days idle.

---

## Decisions

| Decision | Choice | Rejected alternative |
|----------|--------|----------------------|
| Apps topology | **Single `huskies-projects` app, one machine per tenant** | One app per tenant: clean isolation, but blows out Fly app quotas and complicates IAM |
| Idle strategy | **Suspend, not stop** | Stop: cheaper, but a 20 s cold start is poor UX for chat |
| Secrets path | **Machine env via Machines API at create time** | Fly app-level secrets: shared across all tenant machines, leaks across tenants |
| State storage | **Per-tenant Fly volume holding sled + git** | Object storage only: would require rewriting the sled backend |
| Tenant resolution | **Subdomain → CRDT `tenants` LWW-map** | Path-prefix routing: harder to issue per-tenant TLS, breaks browser cookies |
| Volume retention | **Never delete on idle stop; only on explicit project deletion** | Auto-delete after N days idle: too easy to lose user data |

## Open Questions

1. How do we hand off long-running coder agents during a Fly host
   evacuation (machine replace event)? Suspend won't survive a host
   reboot; we may need a "draining" hook that finishes the current AC
   and commits before allowing replacement.
2. Should the gateway also run as Fly Machines (auto-scale), or stay a
   v1 Fly app with replicas? Probably the former for global routing,
   but that's a separate spike.
3. Billing surfaces: do we pass through Fly's per-machine cost to the
   tenant, or amortize it into a flat per-project price? Product call.
4. Outbound network egress (model API calls, git pushes) is metered by
   Fly. At Claude Opus rates, model API spend dwarfs egress charges, so
   egress should be a rounding error — confirm at 100-tenant scale.

## Proof-of-Concept Script

A working sketch lives at
[`fly_multitenant_poc.sh`](./fly_multitenant_poc.sh). It demonstrates
the lifecycle end-to-end: read `FLY_API_TOKEN`, create a volume, create
a machine attached to it, wait until started, stop, and destroy. The
script is runnable but is **not** what production code looks like —
production will translate these calls into Rust against a typed
`flyio_machines` client crate, called from a new
`server::service::cloud::fly` module that the gateway invokes on tenant
signup. A sketch of that module's shape follows.
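A minimal sketch of what that module's surface might look like; nothing
below exists yet, and every name (including the `flyio_machines` crate
and `anyhow` for errors) is illustrative rather than decided.

```rust
// Hypothetical shape of server::service::cloud::fly. The typed client
// wraps the endpoints from section 1; the orchestrator enforces the
// stop-before-destroy and volume-retention invariants from section 3.
pub struct FlyOrchestrator {
    client: flyio_machines::Client, // assumed crate, to be written
    app: String,                    // e.g. "huskies-projects"
}

pub struct MachineRef {
    pub machine_id: String,
    pub volume_id: String,
}

impl FlyOrchestrator {
    /// Signup path: volume, machine, wait, record ids in the CRDT.
    pub async fn provision(&self, tenant: &str, project: &str) -> anyhow::Result<MachineRef> {
        todo!()
    }

    /// Idle path: suspend only; volumes are never touched here.
    pub async fn suspend(&self, m: &MachineRef) -> anyhow::Result<()> {
        todo!()
    }

    /// Explicit deletion: stop, destroy, then optionally reclaim the volume.
    pub async fn destroy(&self, m: &MachineRef, delete_volume: bool) -> anyhow::Result<()> {
        todo!()
    }
}
```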
diff --git a/.huskies/specs/tech/fly_multitenant_poc.sh b/.huskies/specs/tech/fly_multitenant_poc.sh
new file mode 100755
index 00000000..a7b159ab
--- /dev/null
+++ b/.huskies/specs/tech/fly_multitenant_poc.sh
@@ -0,0 +1,101 @@
#!/usr/bin/env bash
# fly_multitenant_poc.sh — Proof of concept for Spike 811.
#
# Demonstrates the Fly.io Machines API calls that the huskies gateway
# will eventually make to provision and tear down a per-tenant project
# machine. Run against a real Fly org with FLY_API_TOKEN set, or read it
# as a commented sketch — the calls are the contract.
#
# This is NOT production code. Production will issue these requests
# from Rust (see server::service::cloud::fly) with retries, structured
# errors, and CRDT writes to record machine_id/volume_id. The shell
# script exists so the spec is verifiable end-to-end.
#
# Required env:
#   FLY_API_TOKEN - org-scoped Fly token
#   FLY_APP       - name of the huskies-projects Fly app (must exist)
#   TENANT_ID     - identifier used to tag and name the machine
#   REGION        - Fly region code, e.g. "iad" (default: iad)

set -euo pipefail

: "${FLY_API_TOKEN:?FLY_API_TOKEN must be set}"
: "${FLY_APP:?FLY_APP must be set}"
: "${TENANT_ID:?TENANT_ID must be set}"
REGION="${REGION:-iad}"
IMAGE="registry.fly.io/huskies-projects:latest"

API="https://api.machines.dev/v1"
AUTH=(-H "Authorization: Bearer ${FLY_API_TOKEN}" -H "Content-Type: application/json")

echo "==> 1. Create a 1 GiB persistent volume for tenant ${TENANT_ID}"
VOLUME_JSON=$(curl -sS -X POST "${API}/apps/${FLY_APP}/volumes" "${AUTH[@]}" --data @- <<JSON
{
  "name": "data_${TENANT_ID}",
  "region": "${REGION}",
  "size_gb": 1
}
JSON
)
VOLUME_ID=$(jq -r '.id' <<<"${VOLUME_JSON}")
echo "    volume ${VOLUME_ID} created in ${REGION}"

echo "==> 2. Create a machine attached to the volume, with auto-suspend"
# "autostop"/"autostart" are the Machines API spellings of fly.toml's
# auto_stop_machines / auto_start_machines. In production, tenant
# secrets (ANTHROPIC_API_KEY, ...) would be injected via config.env.
MACHINE_JSON=$(curl -sS -X POST "${API}/apps/${FLY_APP}/machines" "${AUTH[@]}" --data @- <<JSON
{
  "name": "huskies-project-${TENANT_ID}",
  "region": "${REGION}",
  "config": {
    "image": "${IMAGE}",
    "guest": { "cpu_kind": "shared", "cpus": 2, "memory_mb": 2048 },
    "mounts": [ { "volume": "${VOLUME_ID}", "path": "/data" } ],
    "services": [
      {
        "protocol": "tcp",
        "internal_port": 3001,
        "autostop": "suspend",
        "autostart": true
      }
    ]
  }
}
JSON
)
MACHINE_ID=$(jq -r '.id' <<<"${MACHINE_JSON}")
echo "    machine ${MACHINE_ID} created"

echo "==> 3. Wait for the machine to reach 'started' (long-poll, 60s timeout)"
curl -sS "${API}/apps/${FLY_APP}/machines/${MACHINE_ID}/wait?state=started&timeout=60" "${AUTH[@]}" \
  | jq -r '"    ok = \(.ok)"'

echo "    machine reachable at ${MACHINE_ID}.vm.${FLY_APP}.internal:3001"

# ----- At this point the gateway would record (tenant, machine_id, volume_id)
# ----- into the CRDT and start proxying traffic. We pause here.
sleep 2

echo "==> 4. Graceful stop (lets sled flush; idle-suspend uses the same path)"
curl -sS -X POST "${API}/apps/${FLY_APP}/machines/${MACHINE_ID}/stop" "${AUTH[@]}" \
  --data '{"signal":"SIGTERM","timeout":"30s"}' > /dev/null

echo "==> 5. Destroy the machine"
curl -sS -X DELETE "${API}/apps/${FLY_APP}/machines/${MACHINE_ID}?force=true" "${AUTH[@]}" > /dev/null
echo "    machine destroyed"

echo "==> 6. Reclaim the volume (only when the tenant deletes the project)"
curl -sS -X DELETE "${API}/apps/${FLY_APP}/volumes/${VOLUME_ID}" "${AUTH[@]}" > /dev/null
echo "    volume reclaimed"

echo "==> done."
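# Example invocation, assuming the huskies-projects app already exists
# and flyctl is logged in (`fly tokens create org` mints an org-scoped
# token; flag and command names should be re-verified against flyctl):
#
#   FLY_API_TOKEN="$(fly tokens create org)" \
#   FLY_APP=huskies-projects \
#   TENANT_ID=alice \
#   ./fly_multitenant_poc.sh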