Spike 811: Fly.io Machines API Integration for Multi-Tenant Huskies SaaS

Goal

Investigate how to operate huskies as a hosted multi-tenant SaaS on Fly.io Machines. Each tenant owns one or more huskies project containers; a fronting gateway routes traffic by tenant and provisions/destroys backing machines on demand. This document captures the architecture, the API surface we need, and the operational concerns that need answers before we start writing production code.

Architecture at a Glance

┌──────────────────────┐        ┌───────────────────────────────────────────┐
│ Browser / CLI / Bot  │───────▶│ huskies-gateway  (Fly app: huskies-gw)    │
└──────────────────────┘  HTTPS │   * authenticates tenant                  │
                                │   * picks active project for tenant       │
                                │   * proxies /mcp /ws /api to machine      │
                                │   * provisions machines via Machines API  │
                                └──────────────────┬────────────────────────┘
                                                   │ .flycast (Wireguard)
                                                   ▼
                          ┌────────────────────────────────────────────────┐
                          │ huskies-project-{tenant}-{project}             │
                          │   (Fly app: huskies-projects, machine per tier)│
                          │   * runs `huskies --port 3001 /data/project`   │
                          │   * persistent volume mounted at /data         │
                          │   * .huskies/ + sled CRDT live on volume       │
                          └────────────────────────────────────────────────┘

Two Fly apps:

  • huskies-gw — small, always-on, replicated across regions; runs the existing huskies --gateway binary plus a thin Fly orchestrator layer that calls the Machines API.
  • huskies-projects — single Fly app holding one machine per tenant project. Using one app (rather than one app per tenant) keeps quota management, IAM, and image distribution simple while still giving us per-machine networking ({machine_id}.vm.huskies-projects.internal) and per-tenant Fly volumes.

Listed Concerns

The story brief flags the following concerns. Each is addressed below.

  1. Machine lifecycle & API surface
  2. Tenant isolation
  3. Persistence and volumes
  4. Networking & routing
  5. Secrets and tenant credentials
  6. Cost model and idle-shutdown
  7. Wake-on-request / cold-start latency
  8. Observability and logs
  9. Disaster recovery and backups
  10. Quotas and abuse limits

1. Machine Lifecycle & API Surface

Fly Machines is a REST API at https://api.machines.dev/v1. Auth is a single bearer token per Fly organization (FLY_API_TOKEN).

Endpoints we will call:

| Verb | Path | Use |
| --- | --- | --- |
| POST | /apps/{app}/machines | Create a new project machine |
| GET | /apps/{app}/machines/{id} | Poll status |
| GET | /apps/{app}/machines/{id}/wait?state=started&timeout=30 | Block until state |
| POST | /apps/{app}/machines/{id}/start | Wake a stopped machine |
| POST | /apps/{app}/machines/{id}/stop | Graceful stop (idle scale-to-zero) |
| POST | /apps/{app}/machines/{id}/suspend | Suspend RAM-to-disk (fast wake) |
| DELETE | /apps/{app}/machines/{id}?force=true | Destroy permanently |
| GET | /apps/{app}/machines | Enumerate during reconcile |
| POST | /apps/{app}/volumes | Create persistent volume for tenant |
| DELETE | /apps/{app}/volumes/{id} | Reclaim volume when tenant deletes project |

States the orchestrator observes: created → starting → started → stopping → stopped → destroying → destroyed (replacing and suspending are transient).

A successful provisioning sequence is:

  1. POST /volumes (one-time per tenant project, 1 GiB default).
  2. POST /machines with config = { image, env, mounts: [{volume, path:"/data"}], guest, services }.
  3. GET /machines/{id}/wait?state=started (~10–20 s on cold start).
  4. Cache {tenant, project} → machine_id in the gateway CRDT (gateway_projects LWW-map already exists — extend the value with machine_id, volume_id, last_used_at).

Destruction:

  1. POST /machines/{id}/stop (graceful, lets sled flush).
  2. DELETE /machines/{id}?force=true.
  3. Optionally DELETE /volumes/{id} (only when tenant explicitly deletes the project; idle stop must never delete volumes).
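
To make the lifecycle concrete, the sketch below walks the provision → wait → stop → destroy sequence against the endpoints in the table above. It is a minimal sketch, assuming reqwest (blocking + json features) and serde_json as dependencies; the app name, region, image, and volume name are illustrative placeholders, not decisions.

```rust
use std::env;
use serde_json::{json, Value};

// Minimal sketch: provision a volume + machine for one tenant project,
// wait for it to start, then stop and destroy it. App name, region,
// image and volume name below are illustrative placeholders.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let token = env::var("FLY_API_TOKEN")?;
    let base = "https://api.machines.dev/v1/apps/huskies-projects";
    let http = reqwest::blocking::Client::new();

    // 1. One-time volume for the tenant project (1 GiB default).
    let vol: Value = http
        .post(format!("{base}/volumes"))
        .bearer_auth(&token)
        .json(&json!({ "name": "tenant_alice_foo", "region": "iad", "size_gb": 1 }))
        .send()?
        .error_for_status()?
        .json()?;
    let volume_id = vol["id"].as_str().unwrap().to_string();

    // 2. Machine with the volume mounted at /data.
    let machine: Value = http
        .post(format!("{base}/machines"))
        .bearer_auth(&token)
        .json(&json!({
            "region": "iad",
            "config": {
                "image": "registry.fly.io/huskies:latest",
                "env": { "HUSKIES_PROJECT": "/data/project" },
                "guest": { "cpu_kind": "shared", "cpus": 2, "memory_mb": 2048 },
                "mounts": [{ "volume": volume_id, "path": "/data" }]
            }
        }))
        .send()?
        .error_for_status()?
        .json()?;
    let id = machine["id"].as_str().unwrap().to_string();

    // 3. Block until the machine reports `started` (~10-20 s on cold start).
    http.get(format!("{base}/machines/{id}/wait?state=started&timeout=30"))
        .bearer_auth(&token)
        .send()?
        .error_for_status()?;

    // Teardown: graceful stop (lets sled flush), then force-destroy.
    http.post(format!("{base}/machines/{id}/stop"))
        .bearer_auth(&token)
        .send()?
        .error_for_status()?;
    http.delete(format!("{base}/machines/{id}?force=true"))
        .bearer_auth(&token)
        .send()?
        .error_for_status()?;
    Ok(())
}
```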

2. Tenant Isolation

  • Filesystem: each machine has its own ephemeral root and its own Fly volume mounted at /data. Volumes are not shareable across machines, so tenants cannot read each other's CRDT.
  • Network: machines on the same Fly app can reach each other via 6PN private networking. We must explicitly not expose the project server externally; only the gateway holds a public IP. Project machines bind to [::]:3001 and rely on .flycast private routing.
  • Credentials: project machines never see the gateway's FLY_API_TOKEN. Tenant-supplied secrets (Anthropic key, Matrix password, etc.) are stored as Fly secrets scoped to the machine via the secrets field at create time, encrypted at rest by Fly.
  • CPU/RAM: guest = { cpu_kind: "shared", cpus: 2, memory_mb: 2048 } is a sensible default; larger tenants get performance cpus. Hard caps prevent a runaway agent from eating a neighbour's quota.
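
A minimal sketch of how the gateway could map billing tiers to the guest block. The tier names and the non-default sizes are illustrative; only the shared-2x/2048 default comes from the bullet above.

```rust
use serde_json::{json, Value};

/// Hypothetical billing tiers; names and non-default sizes are illustrative.
#[allow(dead_code)]
enum Tier {
    Free,
    Pro,
    Team,
}

/// Map a tier to the Machines API `guest` block. The shared 2x/2048 default
/// matches the figure quoted above; larger tiers move to `performance` CPUs.
/// These values are hard caps, not guarantees.
fn guest_for(tier: &Tier) -> Value {
    match tier {
        Tier::Free => json!({ "cpu_kind": "shared", "cpus": 1, "memory_mb": 1024 }),
        Tier::Pro => json!({ "cpu_kind": "shared", "cpus": 2, "memory_mb": 2048 }),
        Tier::Team => json!({ "cpu_kind": "performance", "cpus": 2, "memory_mb": 4096 }),
    }
}

fn main() {
    println!("{}", guest_for(&Tier::Pro));
}
```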

3. Persistence and Volumes

  • Fly volumes are pinned to a single region (and a single physical host). We pick the volume region from the tenant's primary region (PRIMARY_REGION env on the gateway), with fallback to iad.
  • The volume holds:
    • /data/project/.huskies/ — pipeline.db (sled), bot.toml, project.toml
    • /data/project/.git — repository (initially cloned at first run)
    • /data/project/ — working tree
  • Sled needs a clean shutdown. The orchestrator must always stop before destroy. We rely on Fly's kill_signal = "SIGTERM" + the existing huskies shutdown path in rebuild.rs.
  • Snapshots: Fly snapshots volumes daily by default (5-day retention). For paid tiers we extend retention via snapshot_retention on the volume.

4. Networking & Routing

The gateway already proxies MCP/WS/REST by active project. For SaaS we add tenant resolution before the project lookup:

Host: alice.huskies.app   →  tenant = alice
  ↓
GET /tenants/alice/projects/foo → project_id, machine_id
  ↓
proxy to [fdaa:0:abcd:a7b:e2:1::3]:3001  (or {machine_id}.vm.huskies-projects.internal:3001)

  • Tenant resolution lives in a new tenants CRDT LWW-map keyed by subdomain → tenant_id; reuses the existing CRDT bus.
  • Internal DNS: <machine_id>.vm.huskies-projects.internal resolves on the private network. <app>.flycast is the private, load-balanced address served by Fly Proxy; we prefer the explicit machine address since each tenant has exactly one project machine at a time.
  • TLS terminates at the Fly edge for *.huskies.app. The gateway receives plain HTTP/2 inside 6PN.
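
A minimal sketch of the tenant-resolution hop, with a plain HashMap standing in for the tenants and gateway_projects CRDT maps. The subdomain suffix and internal-DNS address format follow the bullets above; the machine id and struct names are illustrative.

```rust
use std::collections::HashMap;

/// Stand-in for the CRDT state: subdomain -> tenant_id and
/// (tenant_id, project) -> machine_id. In production these are the
/// `tenants` and `gateway_projects` LWW-maps.
struct Routes {
    tenants: HashMap<String, String>,
    machines: HashMap<(String, String), String>,
}

impl Routes {
    /// Resolve a Host header plus project name to the private address the
    /// gateway should proxy to, e.g. "<machine_id>.vm.huskies-projects.internal:3001".
    fn resolve(&self, host: &str, project: &str) -> Option<String> {
        let subdomain = host.strip_suffix(".huskies.app")?;
        let tenant = self.tenants.get(subdomain)?;
        let machine = self.machines.get(&(tenant.clone(), project.to_string()))?;
        Some(format!("{machine}.vm.huskies-projects.internal:3001"))
    }
}

fn main() {
    let mut routes = Routes { tenants: HashMap::new(), machines: HashMap::new() };
    routes.tenants.insert("alice".into(), "t_123".into());
    routes.machines.insert(("t_123".into(), "foo".into()), "3d8d9260b32d8e".into());
    println!("{:?}", routes.resolve("alice.huskies.app", "foo"));
}
```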

5. Secrets and Tenant Credentials

  • FLY_API_TOKEN lives only on the gateway (fly secrets set FLY_API_TOKEN=… -a huskies-gw).
  • Per-tenant ANTHROPIC_API_KEY, MATRIX_PASSWORD, etc. are POSTed by the tenant in the SaaS UI, encrypted with the gateway's KMS key, and passed to the machine at create time via the Machines API config.env (Fly stores env values encrypted).
  • Rotation: changing a tenant secret means updating the machine config through the Machines API with the new env, which replaces the machine. The orchestrator schedules this during the tenant's idle window when possible.

6. Cost Model and Idle-Shutdown

Indicative pricing (us-east, 2026):

| Machine | Hourly | Notes |
| --- | --- | --- |
| shared-cpu-2x@2048, always-on | ~$0.027 | $19/mo if 24×7 |
| shared-cpu-2x@2048, suspended | ~$0.0009 | $0.65/mo idle |
| Volume, 1 GiB | ~$0.0002 | $0.15/mo |

Multi-tenant pricing requires suspend on idle:

  • Auto-stop: in the machine config, set services[].auto_stop_machines = "suspend" and services[].auto_start_machines = true. Fly Proxy suspends the machine once it has seen no traffic for roughly five minutes, provided the configured minimum machine count is zero.
  • On the next request, the proxy auto-wakes the machine. Resuming from suspend takes ~300 ms (the RAM snapshot is restored from disk); a full stopped → started boot takes 10–20 s. We prefer suspend for SaaS.
  • For long-lived agents (a coder agent running on the machine), the gateway sends keepalive pings so Fly does not idle-stop while work is in progress. Implementation: gateway tracks active_agents count for each machine in CRDT; if >0, hit /api/agents once per minute.
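
A sketch of the services block plus the keepalive rule. The auto_stop_machines / auto_start_machines field names follow the text above; the exact Machines API spellings should be confirmed against the current spec before this ships, and the helper names are illustrative.

```rust
use serde_json::{json, Value};

/// Services block for a project machine: internal port 3001, suspend on
/// idle, auto-start on traffic. Field names follow the spike text; confirm
/// the exact Machines API spellings before shipping.
fn project_services() -> Value {
    json!([{
        "protocol": "tcp",
        "internal_port": 3001,
        "auto_stop_machines": "suspend",
        "auto_start_machines": true
    }])
}

/// Keepalive rule: if the CRDT reports agents still running on this machine,
/// the gateway pings it once a minute so Fly never sees it as idle.
fn needs_keepalive(active_agents: u32) -> bool {
    active_agents > 0
}

fn main() {
    println!("{}", project_services());
    assert!(needs_keepalive(1));
    assert!(!needs_keepalive(0));
}
```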

7. Wake-on-Request / Cold-Start Latency

Three latency tiers:

| Tier | Wake | When |
| --- | --- | --- |
| Suspended | ~300 ms | Default for active tenants |
| Stopped | 10–20 s | Tenants idle > 7 days |
| Destroyed | 60–90 s (clone + boot) | Free tier reaped after 30 d |

The gateway returns a 202 Accepted with a Retry-After: 1 header while wake is in progress and surfaces a "warming up" splash. The existing huskies-gw MCP code path needs an explicit wake call for in-flight requests because Fly's automatic wake only fires when traffic is routed through Fly Proxy (the public or .flycast address), not on the direct 6PN connections the gateway uses.
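
A framework-agnostic sketch of that decision: given the machine's last known state, either proxy straight through, issue an explicit start and answer 202 + Retry-After, or re-provision first. The enum and helper names are illustrative, not production API.

```rust
/// Machine states the gateway cares about when a request arrives
/// (a subset of the Machines API lifecycle states in section 1).
#[derive(Debug, PartialEq)]
enum MachineState {
    Started,
    Suspended,
    Stopped,
    Destroyed,
}

/// What the gateway should do with an incoming request for a machine in the
/// given state. Names are illustrative.
#[derive(Debug, PartialEq)]
enum WakeDecision {
    /// Forward to {machine_id}.vm.huskies-projects.internal:3001 now.
    ProxyNow,
    /// POST /machines/{id}/start, then reply 202 with "Retry-After: 1".
    StartAndRetry,
    /// Re-provision from snapshot/backup, then reply 202 with a longer delay.
    ProvisionAndRetry,
}

fn decide(state: &MachineState) -> WakeDecision {
    match state {
        MachineState::Started => WakeDecision::ProxyNow,
        // The explicit start call is needed even for suspended machines,
        // because the gateway dials the machine directly over 6PN and so
        // bypasses Fly Proxy's auto-wake.
        MachineState::Suspended | MachineState::Stopped => WakeDecision::StartAndRetry,
        MachineState::Destroyed => WakeDecision::ProvisionAndRetry,
    }
}

fn main() {
    assert_eq!(decide(&MachineState::Started), WakeDecision::ProxyNow);
    assert_eq!(decide(&MachineState::Suspended), WakeDecision::StartAndRetry);
    assert_eq!(decide(&MachineState::Destroyed), WakeDecision::ProvisionAndRetry);
}
```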

8. Observability and Logs

  • fly logs -a huskies-projects -i <machine_id> streams stdout/stderr. We expose this through the gateway as GET /api/admin/tenants/{id}/logs.
  • Should each machine ship logs to the gateway via a sidecar vector process? Decision: no — Fly's built-in NATS log shipper is enough for v1; revisit if log volume grows.
  • Metrics: Fly auto-exports per-machine CPU/RAM/network as Prometheus series scrapeable from a huskies-metrics machine in the same 6PN. We hook into Grafana Cloud's free tier for the dashboard.

9. Disaster Recovery and Backups

  • Volume snapshots (daily) cover hardware failure.
  • The CRDT replicates to the gateway over the existing /crdt-sync WebSocket. The gateway keeps a 30-day rolling backup of each tenant's CRDT in S3 (s3://huskies-backups/{tenant}/{date}.ops). This lets us reconstruct the project tree even if a Fly volume is unrecoverable.
  • Restore flow: provision a fresh machine + volume, replay the latest snapshot, then replay incremental ops from S3. Documented in a follow-up runbook story.

10. Quotas and Abuse Limits

  • Per-tenant: max 2 concurrent agents, max 8 GiB volume, max 4 CPU, max 200 OAuth-paid model dollars per month. Enforced in the gateway before calling the Machines API. Over-quota → 429 Too Many Requests with a Stripe upsell page.
  • Per-Fly-app: Fly soft-limits 1000 machines per app. At scale we shard tenants across huskies-projects-{0..9} apps using consistent_hash(tenant_id); see the sketch after this list.
  • Abuse: every tenant signs up with a verified email + Stripe card. Free tier capped at 1 project, suspended after 7 days idle, destroyed after 30 days idle.
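
Sketch of the sharding rule from the per-Fly-app bullet. A stable hash modulo the shard count stands in for true consistent hashing here; it is only safe while the shard count stays fixed, because changing it remaps most tenants and Fly volumes cannot move between apps.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a tenant to one of the huskies-projects-{0..9} apps. A plain hash mod N
/// stands in for true consistent hashing. Note: DefaultHasher's algorithm is
/// not guaranteed stable across Rust releases, so production should pin a
/// fixed-key hash (e.g. SipHash with explicit keys or FNV).
fn shard_app(tenant_id: &str, shards: u64) -> String {
    let mut h = DefaultHasher::new();
    tenant_id.hash(&mut h);
    format!("huskies-projects-{}", h.finish() % shards)
}

fn main() {
    println!("{}", shard_app("t_123", 10));
}
```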

Decisions

| Decision | Choice | Rejected alternative |
| --- | --- | --- |
| Apps topology | Single huskies-projects app, one machine per tenant project | One app per tenant: clean isolation, but blows out Fly app quotas and complicates IAM |
| Idle strategy | Suspend, not stop | Stop: cheaper, but a 20 s cold start is poor UX for chat |
| Secrets path | Machine env via Machines API at create time | Fly app-level secrets: shared across all tenant machines, leaks across tenants |
| State storage | Per-tenant Fly volume holding sled + git | Object storage only: would require rewriting the sled backend |
| Tenant resolution | Subdomain → CRDT tenants LWW-map | Path-prefix routing: harder to issue per-tenant TLS, breaks browser cookies |
| Volume retention | Never delete on idle stop; only on explicit project deletion | Auto-delete after N days idle: too easy to lose user data |

Open Questions

  1. How do we hand off long-running coder agents during a Fly host evacuation (machine replace event)? Suspend won't survive a host reboot; we may need a "draining" hook that finishes the current AC and commits before allowing replacement.
  2. Should the gateway also live as Fly machines (auto-scale) or stay as Fly app v1 with replicas? Probably the former for global routing, but that's a separate spike.
  3. Billing surfaces: do we pass through Fly's per-machine cost to the tenant, or amortize it into a flat per-project price? Product call.
  4. Outbound network egress (model API calls, git pushes) is metered by Fly, but at Claude Opus rates the model API bill dwarfs any egress charge, so egress cost is a rounding error — confirm at 100-tenant scale.

Proof-of-Concept Script

A working sketch lives at fly_multitenant_poc.sh. It demonstrates end-to-end: read FLY_API_TOKEN, create a volume, create a machine attached to it, wait until started, stop, and destroy. The script is runnable but is not what production code looks like — production will translate these calls into Rust against a typed flyio_machines client crate, called from a new server::service::cloud::fly module that the gateway invokes on tenant signup.
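
For orientation, a rough sketch of the shape that typed client could take; the trait, struct, and method names are all illustrative, and no such crate exists yet.

```rust
use serde::{Deserialize, Serialize};

// Rough shape of a typed `flyio_machines` client, as it might be called from a
// future `server::service::cloud::fly` module. All names here are illustrative.

#[derive(Serialize)]
pub struct CreateVolume {
    pub name: String,
    pub region: String,
    pub size_gb: u32,
}

#[derive(Serialize)]
pub struct CreateMachine {
    pub region: String,
    pub config: serde_json::Value, // image, env, guest, mounts, services
}

#[derive(Deserialize)]
pub struct Machine {
    pub id: String,
    pub state: String,
}

pub trait MachinesApi {
    type Error;
    fn create_volume(&self, app: &str, req: &CreateVolume) -> Result<String, Self::Error>;
    fn create_machine(&self, app: &str, req: &CreateMachine) -> Result<Machine, Self::Error>;
    fn wait_started(&self, app: &str, machine_id: &str, timeout_secs: u32) -> Result<(), Self::Error>;
    fn start(&self, app: &str, machine_id: &str) -> Result<(), Self::Error>;
    fn stop(&self, app: &str, machine_id: &str) -> Result<(), Self::Error>;
    fn destroy(&self, app: &str, machine_id: &str, force: bool) -> Result<(), Self::Error>;
}
```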