Spike 811: Fly.io Machines API Integration for Multi-Tenant Huskies SaaS
Goal
Investigate how to operate huskies as a hosted multi-tenant SaaS on Fly.io Machines. Each tenant owns one or more huskies project containers; a fronting gateway routes traffic by tenant and provisions/destroys backing machines on demand. This document captures the architecture, the API surface we need, and the operational concerns that need answers before we start writing production code.
Architecture at a Glance
```
┌─────────────────────┐        ┌──────────────────────────────────────────┐
│ Browser / CLI / Bot │───────▶│ huskies-gateway (Fly app: huskies-gw)    │
└─────────────────────┘ HTTPS  │  * authenticates tenant                  │
                               │  * picks active project for tenant       │
                               │  * proxies /mcp /ws /api to machine      │
                               │  * provisions machines via Machines API  │
                               └────────────────────┬─────────────────────┘
                                                    │ .flycast (WireGuard)
                                                    ▼
                               ┌────────────────────────────────────────────────┐
                               │ huskies-project-{tenant}-{project}             │
                               │ (Fly app: huskies-projects, machine per tier)  │
                               │  * runs `huskies --port 3001 /data/project`    │
                               │  * persistent volume mounted at /data          │
                               │  * .huskies/ + sled CRDT live on volume        │
                               └────────────────────────────────────────────────┘
```
Two Fly apps:
- `huskies-gw` — small, always-on, replicated across regions; runs the existing `huskies --gateway` binary plus a thin Fly orchestrator layer that calls the Machines API.
- `huskies-projects` — single Fly app holding one machine per tenant project. Using one app (rather than one app per tenant) keeps quota management, IAM, and image distribution simple while still giving us per-machine networking (`{machine_id}.vm.huskies-projects.internal`) and per-tenant Fly volumes.
Listed Concerns
The story brief flags the following concerns. Each is addressed below.
- Machine lifecycle & API surface
- Tenant isolation
- Persistence and volumes
- Networking & routing
- Secrets and tenant credentials
- Cost model and idle-shutdown
- Wake-on-request / cold-start latency
- Observability and logs
- Disaster recovery and backups
- Quotas and abuse limits
1. Machine Lifecycle & API Surface
The Fly Machines API is a REST API at https://api.machines.dev/v1. Auth is a single bearer token per Fly organization (`FLY_API_TOKEN`).
Endpoints we will call:

| Verb | Path | Use |
|---|---|---|
| POST | /apps/{app}/machines | Create a new project machine |
| GET | /apps/{app}/machines/{id} | Poll status |
| GET | /apps/{app}/machines/{id}/wait?state=started&timeout=30 | Block until state |
| POST | /apps/{app}/machines/{id}/start | Wake a stopped machine |
| POST | /apps/{app}/machines/{id}/stop | Graceful stop (idle scale-to-zero) |
| POST | /apps/{app}/machines/{id}/suspend | Suspend RAM-to-disk (fast wake) |
| DELETE | /apps/{app}/machines/{id}?force=true | Destroy permanently |
| GET | /apps/{app}/machines | Enumerate during reconcile |
| POST | /apps/{app}/volumes | Create persistent volume for tenant |
| DELETE | /apps/{app}/volumes/{id} | Reclaim volume when tenant deletes project |
States the orchestrator observes: `created → starting → started → stopping → stopped → destroying → destroyed` (`replacing` and `suspending` are transient).
A successful provisioning sequence (sketched in Rust after this list) is:

1. `POST /volumes` (one-time per tenant project, 1 GiB default).
2. `POST /machines` with `config = { image, env, mounts: [{volume, path: "/data"}], guest, services }`.
3. `GET /machines/{id}/wait?state=started` (~10–20 s on cold start).
4. Cache `{tenant, project} → machine_id` in the gateway CRDT (the `gateway_projects` LWW-map already exists — extend the value with `machine_id`, `volume_id`, `last_used_at`).
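
To make the call sequence concrete, here is a minimal Rust sketch of steps 1–3 using `reqwest` and `serde_json`. The app name `huskies-projects`, the guest sizing, and the `/data` mount path come from this document; the image tag, the `provision_machine` helper, and the omission of `env`/`services` and of step 4 (the CRDT write) are simplifying assumptions, not the production `flyio_machines` client.

```rust
// Sketch only: provisioning steps 1–3 against the Machines API.
use serde_json::{json, Value};

async fn provision_machine(
    client: &reqwest::Client,
    token: &str,
    region: &str,
    tenant_project: &str, // e.g. "alice_foo"; used only to name the volume
) -> Result<(String, String), reqwest::Error> {
    let base = "https://api.machines.dev/v1/apps/huskies-projects";

    // 1. Create the per-project volume (one-time, 1 GiB default).
    let vol: Value = client
        .post(format!("{base}/volumes"))
        .bearer_auth(token)
        .json(&json!({
            "name": format!("data_{tenant_project}"),
            "region": region,
            "size_gb": 1
        }))
        .send().await?.error_for_status()?.json().await?;
    let volume_id = vol["id"].as_str().unwrap_or_default().to_string();

    // 2. Create the machine with the volume mounted at /data.
    let machine: Value = client
        .post(format!("{base}/machines"))
        .bearer_auth(token)
        .json(&json!({
            "region": region,
            "config": {
                "image": "registry.fly.io/huskies-projects:latest", // assumed tag
                "guest": { "cpu_kind": "shared", "cpus": 2, "memory_mb": 2048 },
                "mounts": [{ "volume": volume_id.as_str(), "path": "/data" }]
            }
        }))
        .send().await?.error_for_status()?.json().await?;
    let machine_id = machine["id"].as_str().unwrap_or_default().to_string();

    // 3. Block until the machine reports `started` (cold start ~10–20 s).
    client
        .get(format!("{base}/machines/{machine_id}/wait?state=started&timeout=30"))
        .bearer_auth(token)
        .send().await?.error_for_status()?;

    // Step 4 (caching {tenant, project} -> machine_id in the gateway CRDT)
    // is intentionally left out of this sketch.
    Ok((machine_id, volume_id))
}
```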
Destruction:

1. `POST /machines/{id}/stop` (graceful, lets sled flush).
2. `DELETE /machines/{id}?force=true`.
3. Optionally `DELETE /volumes/{id}` (only when the tenant explicitly deletes the project; idle stop must never delete volumes).
2. Tenant Isolation
- Filesystem: each machine has its own ephemeral root and its own Fly volume mounted at `/data`. Volumes are not shareable across machines, so tenants cannot read each other's CRDT.
- Network: machines on the same Fly app can reach each other via 6PN private networking. We must explicitly not expose the project server externally; only the gateway holds a public IP. Project machines bind to `[::]:3001` and rely on `.flycast` private routing.
- Credentials: project machines never see the gateway's `FLY_API_TOKEN`. Tenant-supplied secrets (Anthropic key, Matrix password, etc.) are stored as Fly secrets scoped to the machine via the `secrets` field at create time, encrypted at rest by Fly.
- CPU/RAM: `guest = { cpu_kind: "shared", cpus: 2, memory_mb: 2048 }` is a sensible default; larger tenants get `performance` cpus (tier mapping sketched below). Hard caps prevent a runaway agent from eating a neighbour's quota.
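
As an illustration of how the caps might vary by plan, a small sketch of a tier-to-`guest` mapping. The tier names and the `guest_for` helper are hypothetical; the field names (`cpu_kind`, `cpus`, `memory_mb`) are the ones used above.

```rust
// Sketch only: mapping a (hypothetical) plan tier to the Machines API guest block.
use serde_json::{json, Value};

enum Tier { Free, Pro, Team }

fn guest_for(tier: &Tier) -> Value {
    match tier {
        // Default from this doc: shared CPUs, hard-capped RAM.
        Tier::Free => json!({ "cpu_kind": "shared", "cpus": 2, "memory_mb": 2048 }),
        Tier::Pro => json!({ "cpu_kind": "shared", "cpus": 4, "memory_mb": 4096 }),
        // Larger tenants move to dedicated `performance` CPUs.
        Tier::Team => json!({ "cpu_kind": "performance", "cpus": 2, "memory_mb": 4096 }),
    }
}
```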
3. Persistence and Volumes
- Fly volumes are zone-pinned. We pick the volume region from the tenant's primary region (`PRIMARY_REGION` env on the gateway), with fallback to `iad`.
- The volume holds:
  - `/data/project/.huskies/` — pipeline.db (sled), bot.toml, project.toml
  - `/data/project/.git` — repository (initially cloned at first run)
  - `/data/project/` — working tree
- Sled needs a clean shutdown. The orchestrator must always `stop` before `destroy`. We rely on Fly's `kill_signal = "SIGTERM"` plus the existing huskies shutdown path in `rebuild.rs` (shutdown sketch after this list).
- Snapshots: Fly snapshots volumes daily by default (5-day retention). For paid tiers we extend retention via `snapshot_retention` on the volume.
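
A minimal sketch of the shutdown ordering this relies on, assuming the project server holds a `sled::Db` and runs under tokio. The real shutdown path is the existing one in `rebuild.rs`; this only illustrates why `stop` must precede `destroy`.

```rust
// Sketch only: flush sled on SIGTERM so an orchestrator `stop` never loses
// CRDT writes sitting in the page cache.
use tokio::signal::unix::{signal, SignalKind};

async fn run_until_sigterm(db: sled::Db) -> Result<(), Box<dyn std::error::Error>> {
    let mut sigterm = signal(SignalKind::terminate())?;
    // ...the HTTP/MCP server keeps handling requests on other tasks...
    sigterm.recv().await;
    // Flush every dirty page to the Fly volume before the machine stops, so a
    // later `destroy` cannot lose data.
    db.flush_async().await?;
    Ok(())
}
```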
4. Networking & Routing
The gateway already proxies MCP/WS/REST by active project. For SaaS we add tenant resolution before the project lookup:
```
Host: alice.huskies.app → tenant = alice
        ↓
GET /tenants/alice/projects/foo → project_id, machine_id
        ↓
proxy to fdaa:0:abcd:a7b:e2:1::3:3001 (or {machine_id}.vm.huskies-projects.internal:3001)
```
- Tenant resolution lives in a new `tenants` CRDT LWW-map keyed by subdomain → tenant_id; reuses the existing CRDT bus (resolution sketch after this list).
- Internal DNS: `<machine_id>.vm.huskies-projects.internal` resolves on the private network. `<app>.flycast` is the load-balanced anycast name; we prefer the explicit machine address since each tenant has exactly one project machine at a time.
- TLS terminates at the Fly edge for `*.huskies.app`. The gateway receives plain HTTP/2 inside 6PN.
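
A small sketch of the Host-header resolution path. The two `HashMap`s stand in for the `tenants` and `gateway_projects` CRDT maps; the function name and lookup shapes are illustrative only.

```rust
// Sketch only: Host header -> tenant -> backing machine address.
use std::collections::HashMap;

fn resolve_backend(
    host: &str,                         // e.g. "alice.huskies.app"
    tenants: &HashMap<String, String>,  // subdomain -> tenant_id
    machines: &HashMap<String, String>, // tenant_id -> machine_id
) -> Option<String> {
    let subdomain = host.strip_suffix(".huskies.app")?;
    let tenant_id = tenants.get(subdomain)?;
    let machine_id = machines.get(tenant_id)?;
    // Per-machine 6PN address; port 3001 is where the project server listens.
    Some(format!("{machine_id}.vm.huskies-projects.internal:3001"))
}
```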
5. Secrets and Tenant Credentials
- `FLY_API_TOKEN` lives only on the gateway (`fly secrets set FLY_API_TOKEN=… -a huskies-gw`).
- Per-tenant `ANTHROPIC_API_KEY`, `MATRIX_PASSWORD`, etc. are POSTed by the tenant in the SaaS UI, encrypted with the gateway's KMS key, and passed to the machine at create time via the Machines API `config.env` (Fly stores env values encrypted).
- Rotation: changing a tenant secret means `POST /machines/{id}/update` with the new env, which triggers a rolling replace. The orchestrator schedules this during the tenant's idle window when possible.
6. Cost Model and Idle-Shutdown
Indicative pricing (us-east, 2026):

| Machine | Hourly | Notes |
|---|---|---|
| shared-cpu-2x@2048, always-on | ~$0.027 | $19/mo if 24×7 |
| shared-cpu-2x@2048, suspended | ~$0.0009 | $0.65/mo idle |
| Volume, 1 GiB | ~$0.0002 | $0.15/mo |
Multi-tenant pricing requires suspend on idle:
- Auto-stop: in the machine config, set `services[].auto_stop_machines = "suspend"` and `services[].auto_start_machines = true`. Fly's internal proxy stops the machine after the configured `min_machines` count is zero and there is no incoming traffic for ~5 min.
- On the next request, the proxy auto-wakes the machine. Suspend resume is ~300 ms (RAM snapshot from disk); a full `stopped → started` is 10–20 s. We prefer `suspend` for SaaS.
- For long-lived agents (a coder agent running on the machine), the gateway sends keepalive pings so Fly does not idle-stop while work is in progress. Implementation: the gateway tracks an `active_agents` count for each machine in the CRDT; if `> 0`, hit `/api/agents` once per minute (keepalive sketch below).
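
A sketch of that keepalive loop, assuming the gateway exposes the per-machine counter as a shared atomic; in reality the counter lives in the CRDT, and `/api/agents` is simply the ping target named above.

```rust
// Sketch only: gateway-side keepalive so an idle-detection suspend never
// lands while agents are mid-run on a machine.
use std::sync::{atomic::{AtomicUsize, Ordering}, Arc};
use std::time::Duration;

async fn keepalive(machine_addr: String, active_agents: Arc<AtomicUsize>) {
    let client = reqwest::Client::new();
    let mut tick = tokio::time::interval(Duration::from_secs(60));
    loop {
        tick.tick().await;
        // Only ping while at least one agent is running on this machine.
        if active_agents.load(Ordering::Relaxed) > 0 {
            // A failed ping is fine; the next tick retries.
            let _ = client
                .get(format!("http://{machine_addr}/api/agents"))
                .send()
                .await;
        }
    }
}
```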
7. Wake-on-Request / Cold-Start Latency
Three latency tiers:
| Tier | Wake | When |
|---|---|---|
| Suspended | ~300 ms | Default for active tenants |
| Stopped | 10–20 s | Tenants idle > 7 days |
| Destroyed | 60–90 s (clone + boot) | Free tier reaped after 30 d |
The gateway returns a 202 Accepted with a Retry-After: 1 header
while wake is in progress and surfaces a "warming up" splash. The
existing huskies-gw MCP code path needs an explicit wake call for
in-flight requests because Fly's automatic wake only triggers on TCP
SYN to a registered service port.
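
A sketch of that explicit wake, using the status, start, and wait endpoints from section 1. The handler shape (return `true` to proxy now, `false` to answer 202 and let the client retry) is an assumption about how the gateway code would be structured, not the existing MCP path.

```rust
// Sketch only: explicit wake for an in-flight request.
use serde_json::Value;

async fn ensure_started(
    client: &reqwest::Client,
    token: &str,
    machine_id: &str,
) -> Result<bool, reqwest::Error> {
    let base = "https://api.machines.dev/v1/apps/huskies-projects/machines";
    // Current machine state, via GET /apps/{app}/machines/{id}.
    let m: Value = client
        .get(format!("{base}/{machine_id}"))
        .bearer_auth(token)
        .send().await?
        .error_for_status()?
        .json().await?;
    match m["state"].as_str().unwrap_or("") {
        "started" => Ok(true),
        "suspended" | "stopped" => {
            // Kick off the wake; suspend resumes in ~300 ms, stopped takes 10–20 s.
            client
                .post(format!("{base}/{machine_id}/start"))
                .bearer_auth(token)
                .send().await?
                .error_for_status()?;
            // Don't block the proxy: answer 202 + Retry-After: 1 and let the
            // client retry once the machine reports `started`.
            Ok(false)
        }
        // starting / stopping / replacing: also tell the client to retry.
        _ => Ok(false),
    }
}
```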
8. Observability and Logs
- `fly logs -a huskies-projects -i <machine_id>` streams stdout/stderr. We expose this through the gateway as `GET /api/admin/tenants/{id}/logs`.
- Each machine ships logs to the gateway via a sidecar `vector` process? Decision: no — Fly's built-in NATS log shipper is enough for v1; revisit if log volume grows.
- Metrics: Fly auto-exports per-machine CPU/RAM/network as Prometheus series scrapeable from a `huskies-metrics` machine in the same 6PN. We hook into Grafana Cloud's free tier for the dashboard.
9. Disaster Recovery and Backups
- Volume snapshots (daily) cover hardware failure.
- The CRDT replicates to the gateway over the existing `/crdt-sync` WebSocket. The gateway keeps a 30-day rolling backup of each tenant's CRDT in S3 (`s3://huskies-backups/{tenant}/{date}.ops`). This lets us reconstruct the project tree even if a Fly volume is unrecoverable.
- Restore flow: provision a fresh machine + volume, replay the latest snapshot, then replay incremental ops from S3. Documented in a follow-up runbook story.
10. Quotas and Abuse Limits
- Per-tenant: max 2 concurrent agents, max 8 GiB volume, max 4 CPUs, max $200/month of OAuth-paid model usage. Enforced in the gateway before calling the Machines API. Over-quota → `429 Too Many Requests` with a Stripe upsell page.
- Per-Fly-app: Fly soft-limits 1000 machines per app. At scale we shard tenants across `huskies-projects-{0..9}` apps using `consistent_hash(tenant_id)` (sharding sketch below).
- Abuse: every tenant signs up with a verified email + Stripe card. The free tier is capped at 1 project, suspended after 7 days idle, destroyed after 30 days idle.
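
A sketch of the sharding helper. It uses a plain hash modulo rather than true consistent hashing, which is adequate while the shard count stays fixed; the function name and shard count are illustrative.

```rust
// Sketch only: deterministic tenant -> app sharding once one app is not enough.
// Swap in a real consistent-hash ring if the shard count ever needs to grow
// without remapping most tenants.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn shard_app(tenant_id: &str, shards: u64) -> String {
    let mut hasher = DefaultHasher::new();
    tenant_id.hash(&mut hasher);
    // e.g. shard_app("alice", 10) -> "huskies-projects-7"
    format!("huskies-projects-{}", hasher.finish() % shards)
}
```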
Decisions

| Decision | Choice | Rejected alternative |
|---|---|---|
| Apps topology | Single `huskies-projects` app, one machine per tenant | One app per tenant: clean isolation, but blows out Fly app quotas and complicates IAM |
| Idle strategy | Suspend, not stop | Stop: cheaper, but 20 s cold start is poor UX for chat |
| Secrets path | Machine env via Machines API at create time | Fly app-level secrets: shared across all tenant machines, leaks across tenants |
| State storage | Per-tenant Fly volume holding sled + git | Object storage only: would require rewriting the sled backend |
| Tenant resolution | Subdomain → CRDT `tenants` LWW-map | Path-prefix routing: harder to issue per-tenant TLS, breaks browser cookies |
| Volume retention | Never delete on idle stop; only on explicit project deletion | Auto-delete after N days idle: too easy to lose user data |
Open Questions
- How do we hand off long-running coder agents during a Fly host evacuation (machine replace event)? Suspend won't survive a host reboot; we may need a "draining" hook that finishes the current AC and commits before allowing replacement.
- Should the gateway also live as Fly machines (auto-scale) or stay as Fly app v1 with replicas? Probably the former for global routing, but that's a separate spike.
- Billing surfaces: do we pass through Fly's per-machine cost to the tenant, or amortize it into a flat per-project price? Product call.
- Outbound network egress (model API calls, git pushes) is metered by Fly. At Claude Opus rates the model API bill dwarfs any egress charge, so egress should be a rounding error — confirm at 100-tenant scale.
Proof-of-Concept Script
A working sketch lives at `fly_multitenant_poc.sh`. It demonstrates end-to-end: read `FLY_API_TOKEN`, create a volume, create a machine attached to it, wait until started, stop, and destroy. The script is runnable but is not what production code looks like — production will translate these calls into Rust against a typed `flyio_machines` client crate, called from a new `server::service::cloud::fly` module that the gateway invokes on tenant signup.