# Spike 811: Fly.io Machines API Integration for Multi-Tenant Huskies SaaS

## Goal

Investigate how to operate huskies as a hosted multi-tenant SaaS on [Fly.io Machines](https://fly.io/docs/machines/). Each tenant owns one or more huskies *project* containers; a fronting gateway routes traffic by tenant and provisions/destroys backing machines on demand. This document captures the architecture, the API surface we need, and the operational concerns that need answers before we start writing production code.

## Architecture at a Glance

```
┌──────────────────────┐        ┌───────────────────────────────────────────┐
│ Browser / CLI / Bot  │───────▶│ huskies-gateway (Fly app: huskies-gw)     │
└──────────────────────┘ HTTPS  │  * authenticates tenant                   │
                                │  * picks active project for tenant        │
                                │  * proxies /mcp /ws /api to machine       │
                                │  * provisions machines via Machines API   │
                                └──────────────────┬────────────────────────┘
                                                   │ .flycast (WireGuard)
                                                   ▼
                                ┌────────────────────────────────────────────────┐
                                │ huskies-project-{tenant}-{project}             │
                                │  (Fly app: huskies-projects, machine per tier) │
                                │  * runs `huskies --port 3001 /data/project`    │
                                │  * persistent volume mounted at /data          │
                                │  * .huskies/ + sled CRDT live on volume        │
                                └────────────────────────────────────────────────┘
```

Two Fly apps:

* `huskies-gw` — small, always-on, replicated across regions; runs the existing `huskies --gateway` binary plus a thin **Fly orchestrator** layer that calls the Machines API.
* `huskies-projects` — a single Fly app holding *one machine per tenant project*. Using one app (rather than one app per tenant) keeps quota management, IAM, and image distribution simple while still giving us per-machine networking (`{machine_id}.vm.huskies-projects.internal`) and per-tenant Fly volumes.

## Listed Concerns

The story brief flags the following concerns. Each is addressed below.

1. Machine lifecycle & API surface
2. Tenant isolation
3. Persistence and volumes
4. Networking & routing
5. Secrets and tenant credentials
6. Cost model and idle-shutdown
7. Wake-on-request / cold-start latency
8. Observability and logs
9. Disaster recovery and backups
10. Quotas and abuse limits

---

### 1. Machine Lifecycle & API Surface

Fly Machines is a REST API at `https://api.machines.dev/v1`. Auth is a single bearer token per Fly organization (`FLY_API_TOKEN`). Endpoints we will call:

| Verb | Path | Use |
|------|------|-----|
| `POST` | `/apps/{app}/machines` | Create a new project machine |
| `GET` | `/apps/{app}/machines/{id}` | Poll status |
| `GET` | `/apps/{app}/machines/{id}/wait?state=started&timeout=30` | Block until state |
| `POST` | `/apps/{app}/machines/{id}/start` | Wake a stopped machine |
| `POST` | `/apps/{app}/machines/{id}/stop` | Graceful stop (idle scale-to-zero) |
| `POST` | `/apps/{app}/machines/{id}/suspend` | Suspend RAM-to-disk (fast wake) |
| `DELETE` | `/apps/{app}/machines/{id}?force=true` | Destroy permanently |
| `GET` | `/apps/{app}/machines` | Enumerate during reconcile |
| `POST` | `/apps/{app}/volumes` | Create persistent volume for tenant |
| `DELETE` | `/apps/{app}/volumes/{id}` | Reclaim volume when tenant deletes project |

States the orchestrator observes: `created → starting → started → stopping → stopped → destroying → destroyed` (`replacing` and `suspending` are transient).

A successful provisioning sequence (a Rust sketch follows the list):

1. `POST /volumes` (one-time per tenant project, 1 GiB default).
2. `POST /machines` with `config = { image, env, mounts: [{volume, path: "/data"}], guest, services }`.
3. `GET /machines/{id}/wait?state=started` (~10–20 s on cold start).
4. Cache `{tenant, project} → machine_id` in the gateway CRDT (the `gateway_projects` LWW-map already exists — extend the value with `machine_id`, `volume_id`, `last_used_at`).
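To make the sequence concrete, here is a minimal Rust sketch of those four steps using `reqwest` and `serde_json`. The image name, region, env, and naming scheme are placeholders, and real error handling, retries, and reconciliation are elided; production code belongs in the typed `flyio_machines` crate sketched at the end of this document.

```rust
use reqwest::Client;
use serde_json::{json, Value};

const API: &str = "https://api.machines.dev/v1";
const APP: &str = "huskies-projects";

/// Provision a volume and a machine for one tenant project, then wait for boot.
async fn provision(
    client: &Client,
    token: &str,
    tenant: &str,
    project: &str,
) -> Result<(String, String), reqwest::Error> {
    // 1. One-time volume for the tenant project (1 GiB default). Underscores,
    //    because Fly volume names do not allow hyphens.
    let vol: Value = client
        .post(format!("{API}/apps/{APP}/volumes"))
        .bearer_auth(token)
        .json(&json!({
            "name": format!("huskies_{tenant}_{project}"),
            "region": "iad",
            "size_gb": 1
        }))
        .send().await?
        .json().await?;
    let volume_id = vol["id"].as_str().unwrap().to_string();

    // 2. Machine with the volume mounted at /data.
    let machine: Value = client
        .post(format!("{API}/apps/{APP}/machines"))
        .bearer_auth(token)
        .json(&json!({
            "name": format!("huskies-project-{tenant}-{project}"),
            "region": "iad",
            "config": {
                "image": "registry.fly.io/huskies:latest",
                "env": { "HUSKIES_TENANT": tenant },
                "guest": { "cpu_kind": "shared", "cpus": 2, "memory_mb": 2048 },
                "mounts": [{ "volume": volume_id, "path": "/data" }]
            }
        }))
        .send().await?
        .json().await?;
    let machine_id = machine["id"].as_str().unwrap().to_string();

    // 3. Block until the machine reports `started` (~10–20 s on a cold start).
    client
        .get(format!("{API}/apps/{APP}/machines/{machine_id}/wait"))
        .bearer_auth(token)
        .query(&[("state", "started"), ("timeout", "30")])
        .send().await?
        .error_for_status()?;

    // 4. The caller caches {tenant, project} → (machine_id, volume_id) in the
    //    gateway CRDT alongside last_used_at.
    Ok((machine_id, volume_id))
}
```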
Destruction:

1. `POST /machines/{id}/stop` (graceful, lets sled flush).
2. `DELETE /machines/{id}?force=true`.
3. Optionally `DELETE /volumes/{id}` (only when the tenant explicitly deletes the project; an idle stop must **never** delete volumes).

### 2. Tenant Isolation

* **Filesystem:** each machine has its own ephemeral root and its own Fly volume mounted at `/data`. Volumes are not shareable across machines, so tenants cannot read each other's CRDT.
* **Network:** machines on the same Fly app can reach each other via 6PN private networking. We must explicitly *not* expose the project server externally; only the gateway holds a public IP. Project machines bind to `[::]:3001` and rely on `.flycast` private routing.
* **Credentials:** project machines never see the gateway's `FLY_API_TOKEN`. Tenant-supplied secrets (Anthropic key, Matrix password, etc.) are stored as Fly secrets *scoped to the machine* via the `secrets` field at create time, encrypted at rest by Fly.
* **CPU/RAM:** `guest = { cpu_kind: "shared", cpus: 2, memory_mb: 2048 }` is a sensible default; larger tenants get `performance` CPUs. Hard caps prevent a runaway agent from eating a neighbour's quota.

### 3. Persistence and Volumes

* Fly volumes are zone-pinned. We pick the volume region from the tenant's primary region (the `PRIMARY_REGION` env on the gateway), with a fallback to `iad`.
* The volume holds:
  * `/data/project/.huskies/` — pipeline.db (sled), bot.toml, project.toml
  * `/data/project/.git` — repository (initially cloned at first run)
  * `/data/project/` — working tree
* Sled needs a clean shutdown. The orchestrator must always `stop` before `destroy`. We rely on Fly's `kill_signal = "SIGTERM"` plus the existing huskies shutdown path in `rebuild.rs`.
* **Snapshots:** Fly snapshots volumes daily by default (5-day retention). For paid tiers we extend retention via `snapshot_retention` on the volume.

### 4. Networking & Routing

The gateway already proxies MCP/WS/REST by active project. For SaaS we add tenant resolution **before** the project lookup; a Rust sketch of this hop follows the bullets below:

```
Host: alice.huskies.app           → tenant = alice
        ↓
GET /tenants/alice/projects/foo   → project_id, machine_id
        ↓
proxy to fdaa:0:abcd:a7b:e2:1::3:3001
         (or {machine_id}.vm.huskies-projects.internal:3001)
```

* Tenant resolution lives in a new `tenants` CRDT LWW-map keyed by subdomain → tenant_id; it reuses the existing CRDT bus.
* Internal DNS: `{machine_id}.vm.huskies-projects.internal` names resolve on the private network. `.flycast` is the load-balanced anycast name; we prefer the explicit machine address since each tenant has exactly one project machine at a time.
* TLS terminates at the Fly edge for `*.huskies.app`. The gateway receives plain HTTP/2 inside 6PN.
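A minimal sketch of that resolution hop, assuming hypothetical in-memory views of the `tenants` and `gateway_projects` CRDT maps (the struct shapes are illustrative stand-ins, not the gateway's real types):

```rust
use std::collections::HashMap;

/// Illustrative snapshots of the two CRDT LWW-maps described above.
struct CrdtView {
    tenants: HashMap<String, String>,             // subdomain → tenant_id
    projects: HashMap<(String, String), Project>, // (tenant_id, project) → record
}

struct Project {
    machine_id: String,
    // volume_id, last_used_at, … elided
}

/// Resolve `Host: alice.huskies.app` plus a project name into the internal
/// address the gateway proxies /mcp, /ws, and /api traffic to.
fn resolve_target(view: &CrdtView, host: &str, project: &str) -> Option<String> {
    // Tenant resolution happens *before* the existing project lookup.
    let subdomain = host.strip_suffix(".huskies.app")?;
    let tenant_id = view.tenants.get(subdomain)?;
    let record = view
        .projects
        .get(&(tenant_id.clone(), project.to_string()))?;
    // Prefer the explicit per-machine address over the `.flycast` anycast
    // name: each tenant has exactly one project machine at a time.
    Some(format!(
        "http://{}.vm.huskies-projects.internal:3001",
        record.machine_id
    ))
}
```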
### 5. Secrets and Tenant Credentials

* `FLY_API_TOKEN` lives only on the gateway (`fly secrets set FLY_API_TOKEN=… -a huskies-gw`).
* Per-tenant `ANTHROPIC_API_KEY`, `MATRIX_PASSWORD`, etc. are POSTed by the tenant in the SaaS UI, encrypted with the gateway's KMS key, and passed to the machine at create time via the Machines API `config.env` (Fly stores env values encrypted).
* Rotation: changing a tenant secret means updating the machine config (`POST /machines/{id}` with the new env), which triggers a rolling replace. The orchestrator schedules this during the tenant's idle window when possible.

### 6. Cost Model and Idle-Shutdown

Indicative pricing (us-east, 2026):

| Resource | Hourly | Notes |
|----------|--------|-------|
| `shared-cpu-2x@2048` machine, always-on | ~$0.027 | ~$19/mo if 24×7 |
| `shared-cpu-2x@2048` machine, suspended | ~$0.0009 | ~$0.65/mo idle |
| Volume, 1 GiB | ~$0.0002 | ~$0.15/mo |

Multi-tenant pricing requires **suspend on idle**:

* Auto-stop: set `autostop = "suspend"` and `autostart = true` on each service in the machine config (the Machines API spellings of fly.toml's `auto_stop_machines` / `auto_start_machines`). Fly's proxy suspends a machine once it has seen no incoming traffic for ~5 min, down to the configured minimum running count (zero for project machines).
* On the next request, the proxy auto-wakes the machine. Resuming from suspend takes ~300 ms (the RAM snapshot is restored from disk); a full `stopped → started` transition takes 10–20 s. We prefer `suspend` for SaaS.
* For long-lived agents (a coder agent running on the machine), the gateway sends keepalive pings so Fly does not idle-stop the machine while work is in progress. Implementation: the gateway tracks an `active_agents` count per machine in the CRDT; while it is `>0`, it hits `/api/agents` once per minute.

### 7. Wake-on-Request / Cold-Start Latency

Three latency tiers:

| Tier | Wake | When |
|------|------|------|
| Suspended | ~300 ms | Default for active tenants |
| Stopped | 10–20 s | Tenants idle > 7 days |
| Destroyed | 60–90 s (clone + boot) | Free tier reaped after 30 d |

The gateway returns `202 Accepted` with a `Retry-After: 1` header while a wake is in progress and surfaces a "warming up" splash. The existing `huskies-gw` MCP code path needs an explicit wake call for in-flight requests, because Fly's automatic wake only triggers on a TCP SYN to a registered service port.

### 8. Observability and Logs

* `fly logs -a huskies-projects -i <machine-id>` streams stdout/stderr. We expose this through the gateway as `GET /api/admin/tenants/{id}/logs`.
* Should each machine ship logs to the gateway via a sidecar `vector` process? Decision: **no** — Fly's built-in NATS log shipper is enough for v1; revisit if log volume grows.
* Metrics: Fly auto-exports per-machine CPU/RAM/network as Prometheus series, scrapeable from a `huskies-metrics` machine in the same 6PN. We hook into Grafana Cloud's free tier for the dashboard.

### 9. Disaster Recovery and Backups

* Volume snapshots (daily) cover hardware failure.
* The CRDT replicates to the gateway over the existing `/crdt-sync` WebSocket. The gateway keeps a 30-day rolling backup of each tenant's CRDT in S3 (`s3://huskies-backups/{tenant}/{date}.ops`). This lets us reconstruct the project tree even if a Fly volume is unrecoverable.
* Restore flow: provision a fresh machine + volume, replay the latest snapshot, then replay incremental ops from S3. To be documented in a follow-up runbook story.

### 10. Quotas and Abuse Limits

* Per-tenant: max 2 concurrent agents, max 8 GiB volume, max 4 CPUs, max $200/month of OAuth-paid model spend. Enforced in the gateway before any Machines API call (see the first sketch below); going over quota returns `429 Too Many Requests` with a Stripe upsell page.
* Per-Fly-app: Fly soft-limits an app to 1000 machines. At scale we shard tenants across `huskies-projects-{0..9}` apps using `consistent_hash(tenant_id)` (see the second sketch below).
* Abuse: every tenant signs up with a verified email + Stripe card. The free tier is capped at 1 project, suspended after 7 days idle, destroyed after 30 days idle.
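A sketch of the per-tenant quota gate. The `TenantUsage` and `Limits` types, their fields, and the error variants are hypothetical stand-ins for whatever the billing model settles on:

```rust
/// Hypothetical usage snapshot and plan limits; not existing gateway types.
struct TenantUsage { agents: u32, volume_gib: u32, cpus: u32, model_spend_usd: f64 }
struct Limits { agents: u32, volume_gib: u32, cpus: u32, model_spend_usd: f64 }

#[derive(Debug)]
enum QuotaError { TooManyAgents, VolumeTooLarge, TooManyCpus, SpendCapReached }

/// Runs in the gateway *before* any Machines API call; an `Err` maps to
/// `429 Too Many Requests` plus the Stripe upsell page.
fn check_quota(usage: &TenantUsage, limits: &Limits) -> Result<(), QuotaError> {
    if usage.agents >= limits.agents {
        return Err(QuotaError::TooManyAgents); // would exceed concurrent-agent cap
    }
    if usage.volume_gib > limits.volume_gib {
        return Err(QuotaError::VolumeTooLarge);
    }
    if usage.cpus > limits.cpus {
        return Err(QuotaError::TooManyCpus);
    }
    if usage.model_spend_usd >= limits.model_spend_usd {
        return Err(QuotaError::SpendCapReached); // $200/mo default cap
    }
    Ok(())
}
```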
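And a sketch of the shard assignment. `consistent_hash(tenant_id)` is read here as "any hash that is stable across gateway builds and processes"; FNV-1a mod shard count is the simplest such choice, and since volumes pin a tenant to its shard, existing tenants are never rebalanced anyway:

```rust
/// Number of `huskies-projects-{0..9}` shard apps.
const SHARDS: u64 = 10;

/// FNV-1a: stable across builds and processes, unlike std's `DefaultHasher`,
/// which is randomly seeded and therefore unusable for persistent sharding.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Map a tenant to its Fly app shard, e.g. "huskies-projects-3".
fn shard_app(tenant_id: &str) -> String {
    format!("huskies-projects-{}", fnv1a(tenant_id.as_bytes()) % SHARDS)
}
```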
---

## Decisions

| Decision | Choice | Rejected alternative |
|----------|--------|----------------------|
| Apps topology | **Single `huskies-projects` app, one machine per tenant project** | One app per tenant: clean isolation, but blows out Fly app quotas and complicates IAM |
| Idle strategy | **Suspend, not stop** | Stop: cheaper, but a 20 s cold start is poor UX for chat |
| Secrets path | **Machine env via Machines API at create time** | Fly app-level secrets: shared across all tenant machines, leaks across tenants |
| State storage | **Per-tenant Fly volume holding sled + git** | Object storage only: would require rewriting the sled backend |
| Tenant resolution | **Subdomain → CRDT `tenants` LWW-map** | Path-prefix routing: harder to issue per-tenant TLS, breaks browser cookies |
| Volume retention | **Never delete on idle stop; only on explicit project deletion** | Auto-delete after N days idle: too easy to lose user data |

## Open Questions

1. How do we hand off long-running coder agents during a Fly host evacuation (machine replace event)? Suspend won't survive a host reboot; we may need a "draining" hook that finishes the current AC and commits before allowing replacement.
2. Should the gateway itself also run as Fly Machines (auto-scaled), or stay as a Fly apps v1 deployment with replicas? Probably the former for global routing, but that's a separate spike.
3. Billing: do we pass Fly's per-machine cost through to the tenant, or amortize it into a flat per-project price? Product call.
4. Outbound network egress (model API calls, git pushes) is metered by Fly. At Claude Opus rates the model API bill dwarfs everything else, so egress should be a rounding error — confirm at 100-tenant scale.

## Proof-of-Concept Script

A working sketch lives at [`fly_multitenant_poc.sh`](./fly_multitenant_poc.sh). It demonstrates the lifecycle end-to-end: read `FLY_API_TOKEN`, create a volume, create a machine attached to it, wait until started, stop, and destroy. The script is runnable, but it is **not** what production code looks like — production will translate these calls into Rust against a typed `flyio_machines` client crate, called from a new `server::service::cloud::fly` module that the gateway invokes on tenant signup.
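As a strawman for that crate's surface, mirroring the endpoint table in section 1 (every name and signature below is invented for discussion, not settled API):

```rust
use std::collections::HashMap;

// Strawman surface for the future `flyio_machines` crate.
pub struct Guest { pub cpu_kind: String, pub cpus: u32, pub memory_mb: u32 }
pub struct Mount { pub volume: String, pub path: String }

pub struct MachineConfig {
    pub image: String,
    pub env: HashMap<String, String>,
    pub guest: Guest,
    pub mounts: Vec<Mount>,
}

pub struct Volume { pub id: String }
pub struct Machine { pub id: String, pub state: State }

#[derive(Clone, Copy, Debug)]
pub enum State { Created, Starting, Started, Stopping, Stopped, Suspended, Destroyed }

#[derive(Debug)]
pub enum Error { Http(reqwest::Error), Api { status: u16, body: String } }

pub struct MachinesClient { http: reqwest::Client, token: String }

impl MachinesClient {
    // One method per row of the endpoint table in section 1.
    pub async fn create_volume(&self, app: &str, name: &str, region: &str, size_gb: u32)
        -> Result<Volume, Error> { todo!() }
    pub async fn create_machine(&self, app: &str, name: &str, config: &MachineConfig)
        -> Result<Machine, Error> { todo!() }
    pub async fn wait(&self, app: &str, id: &str, state: State, timeout_s: u32)
        -> Result<(), Error> { todo!() }
    pub async fn start(&self, app: &str, id: &str) -> Result<(), Error> { todo!() }
    pub async fn stop(&self, app: &str, id: &str) -> Result<(), Error> { todo!() }
    pub async fn suspend(&self, app: &str, id: &str) -> Result<(), Error> { todo!() }
    pub async fn destroy(&self, app: &str, id: &str, force: bool) -> Result<(), Error> { todo!() }
    pub async fn list(&self, app: &str) -> Result<Vec<Machine>, Error> { todo!() }
    pub async fn delete_volume(&self, app: &str, volume_id: &str) -> Result<(), Error> { todo!() }
}
```

Keeping every Machines API call behind one typed client keeps the orchestrator mockable in gateway tests and gives a single place to add retries and rate-limit handling.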