# Spike 811: Fly.io Machines API Integration for Multi-Tenant Huskies SaaS
## Goal
Investigate how to operate huskies as a hosted multi-tenant SaaS on
[Fly.io Machines](https://fly.io/docs/machines/). Each tenant owns one or
more huskies *project* containers; a fronting gateway routes traffic by
tenant and provisions/destroys backing machines on demand. This document
captures the architecture, the API surface we need, and the operational
concerns that need answers before we start writing production code.
## Architecture at a Glance
```
┌─────────────────────┐         ┌───────────────────────────────────────────────┐
│ Browser / CLI / Bot │────────▶│ huskies-gateway (Fly app: huskies-gw)         │
└─────────────────────┘  HTTPS  │  * authenticates tenant                       │
                                │  * picks active project for tenant            │
                                │  * proxies /mcp /ws /api to machine           │
                                │  * provisions machines via Machines API       │
                                └───────────────────────┬───────────────────────┘
                                                        │ .flycast (WireGuard)
                                ┌───────────────────────▼───────────────────────┐
                                │ huskies-project-{tenant}-{project}            │
                                │ (Fly app: huskies-projects, machine per tier) │
                                │  * runs `huskies --port 3001 /data/project`   │
                                │  * persistent volume mounted at /data         │
                                │  * .huskies/ + sled CRDT live on volume       │
                                └───────────────────────────────────────────────┘
```
Two Fly apps:
* `huskies-gw` — small, always-on, replicated across regions; runs the
existing `huskies --gateway` binary plus a thin **Fly orchestrator**
layer that calls the Machines API.
* `huskies-projects` — single Fly app holding *one machine per tenant
project*. Using one app (rather than one app per tenant) keeps quota
management, IAM, and image distribution simple while still giving us
per-machine networking (`{machine_id}.vm.huskies-projects.internal`)
and per-tenant Fly volumes.
## Listed Concerns
The story brief flags the following concerns. Each is addressed below.
1. Machine lifecycle & API surface
2. Tenant isolation
3. Persistence and volumes
4. Networking & routing
5. Secrets and tenant credentials
6. Cost model and idle-shutdown
7. Wake-on-request / cold-start latency
8. Observability and logs
9. Disaster recovery and backups
10. Quotas and abuse limits
---
### 1. Machine Lifecycle & API Surface
Fly Machines exposes a REST API at `https://api.machines.dev/v1`. Auth is a
single bearer token per Fly organization (`FLY_API_TOKEN`).
Endpoints we will call:
| Verb | Path | Use |
|------|------|-----|
| `POST` | `/apps/{app}/machines` | Create a new project machine |
| `GET` | `/apps/{app}/machines/{id}` | Poll status |
| `GET` | `/apps/{app}/machines/{id}/wait?state=started&timeout=30` | Block until state |
| `POST` | `/apps/{app}/machines/{id}/start` | Wake a stopped machine |
| `POST` | `/apps/{app}/machines/{id}/stop` | Graceful stop (idle scale-to-zero) |
| `POST` | `/apps/{app}/machines/{id}/suspend` | Suspend RAM-to-disk (fast wake) |
| `DELETE` | `/apps/{app}/machines/{id}?force=true` | Destroy permanently |
| `GET` | `/apps/{app}/machines` | Enumerate during reconcile |
| `POST` | `/apps/{app}/volumes` | Create persistent volume for tenant |
| `DELETE` | `/apps/{app}/volumes/{id}` | Reclaim volume when tenant deletes project |
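For a feel of the eventual typed client, here is a minimal Rust sketch of two
of these calls using `reqwest` and `serde_json` (all names are illustrative;
the real `flyio_machines` crate boundary gets settled in the production
story, and the remaining endpoints follow the same shape):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
pub struct Machine {
    pub id: String,
    pub state: String, // created | starting | started | stopping | stopped | ...
}

pub struct FlyClient {
    http: reqwest::Client,
    base: String,  // https://api.machines.dev/v1
    token: String, // FLY_API_TOKEN, gateway-only
}

impl FlyClient {
    pub async fn create_machine(
        &self,
        app: &str,
        config: serde_json::Value,
    ) -> Result<Machine, reqwest::Error> {
        self.http
            .post(format!("{}/apps/{}/machines", self.base, app))
            .bearer_auth(&self.token)
            .json(&serde_json::json!({ "config": config }))
            .send()
            .await?
            .error_for_status()?
            .json::<Machine>()
            .await
    }

    pub async fn wait_started(&self, app: &str, id: &str) -> Result<(), reqwest::Error> {
        // Server-side blocking poll: returns once the machine reaches
        // `started` or the 30 s timeout elapses.
        self.http
            .get(format!(
                "{}/apps/{}/machines/{}/wait?state=started&timeout=30",
                self.base, app, id
            ))
            .bearer_auth(&self.token)
            .send()
            .await?
            .error_for_status()?;
        Ok(())
    }
}
```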
States the orchestrator observes: `created → starting → started → stopping
→ stopped → destroying → destroyed` (`replacing` and `suspending` are
transient).
A successful provisioning sequence is:
1. `POST /volumes` (one-time per tenant project, 1 GiB default).
2. `POST /machines` with `config = { image, env, mounts: [{volume, path:"/data"}], guest, services }`.
3. `GET /machines/{id}/wait?state=started` (~10–20 s on cold start).
4. Cache `{tenant, project} → machine_id` in the gateway CRDT
(`gateway_projects` LWW-map already exists — extend the value with
`machine_id`, `volume_id`, `last_used_at`).
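Roughly, in the Rust orchestrator (a sketch: `create_volume` and
`gateway_projects_put` are hypothetical helpers shaped like the client
sketch above, and the image tag is a placeholder, not a settled value):

```rust
// Provisioning sketch tying steps 1-4 together.
async fn provision(fly: &FlyClient, tenant: &str, project: &str) -> anyhow::Result<Machine> {
    // 1. One-time, region-pinned volume for this tenant project (1 GiB).
    let volume = fly.create_volume("huskies-projects", tenant, project, 1).await?;

    // 2. Machine with the volume mounted at /data (placeholder image tag).
    let config = serde_json::json!({
        "image": "registry.fly.io/huskies-projects:latest",
        "guest": { "cpu_kind": "shared", "cpus": 2, "memory_mb": 2048 },
        "mounts": [{ "volume": volume.id, "path": "/data" }]
    });
    let machine = fly.create_machine("huskies-projects", config).await?;

    // 3. Block until the machine is serving (~10-20 s on a cold boot).
    fly.wait_started("huskies-projects", &machine.id).await?;

    // 4. Record {tenant, project} -> machine/volume ids in the gateway CRDT.
    gateway_projects_put(tenant, project, &machine.id, &volume.id).await?;
    Ok(machine)
}
```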
Destruction:
1. `POST /machines/{id}/stop` (graceful, lets sled flush).
2. `DELETE /machines/{id}?force=true`.
3. Optionally `DELETE /volumes/{id}` (only when tenant explicitly deletes
the project; idle stop must **never** delete volumes).
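Sketched the same way (helper methods mirror the endpoint table above):

```rust
// Teardown sketch. Order matters: stop first so sled flushes cleanly, then
// force-delete the machine. The volume is touched only when the tenant
// explicitly deleted the project, never on idle reaping.
async fn destroy(
    fly: &FlyClient,
    machine_id: &str,
    volume_id: &str,
    project_deleted: bool,
) -> anyhow::Result<()> {
    fly.stop_machine("huskies-projects", machine_id).await?;   // graceful SIGTERM
    fly.delete_machine("huskies-projects", machine_id).await?; // ?force=true
    if project_deleted {
        fly.delete_volume("huskies-projects", volume_id).await?;
    }
    Ok(())
}
```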
### 2. Tenant Isolation
* **Filesystem:** each machine has its own ephemeral root and its own
Fly volume mounted at `/data`. Volumes are not shareable across
machines, so tenants cannot read each other's CRDT.
* **Network:** machines on the same Fly app can reach each other via
6PN private networking. We must explicitly *not* expose the project
server externally; only the gateway holds a public IP. Project
machines bind to `[::]:3001` and rely on `.flycast` private routing.
* **Credentials:** project machines never see the gateway's
`FLY_API_TOKEN`. Tenant-supplied secrets (Anthropic key, Matrix
password, etc.) are stored as Fly secrets *scoped to the machine* via
the `secrets` field at create time, encrypted at rest by Fly.
* **CPU/RAM:** `guest = { cpu_kind: "shared", cpus: 2, memory_mb: 2048 }`
is a sensible default; larger tenants get `performance` cpus. Hard
caps prevent a runaway agent from eating a neighbour's quota.
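Pulling these isolation knobs into one place, the machine config we expect
to POST looks roughly like this (a sketch: the image tag, env names, and the
exact idle-handling field spellings are assumptions to confirm in the PoC):

```rust
fn machine_config(volume_id: &str, anthropic_key: &str, matrix_password: &str) -> serde_json::Value {
    serde_json::json!({
        "image": "registry.fly.io/huskies-projects:latest",
        // Hard guest caps: a runaway agent cannot exceed its own quota.
        "guest": { "cpu_kind": "shared", "cpus": 2, "memory_mb": 2048 },
        // Per-tenant volume; not shareable across machines.
        "mounts": [{ "volume": volume_id, "path": "/data" }],
        // Tenant credentials are injected here at create time; the gateway's
        // FLY_API_TOKEN is never part of this config.
        "env": {
            "ANTHROPIC_API_KEY": anthropic_key,
            "MATRIX_PASSWORD": matrix_password
        },
        // Internal-only service: the app has no public IP, so port 3001 is
        // reachable solely over 6PN/.flycast. Idle handling per section 6;
        // exact Machines API field spelling needs confirming in the PoC.
        "services": [{
            "protocol": "tcp",
            "internal_port": 3001,
            "auto_stop_machines": "suspend",
            "auto_start_machines": true
        }]
    })
}
```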
### 3. Persistence and Volumes
* Fly volumes are zone-pinned. We pick the volume region from the
tenant's primary region (`PRIMARY_REGION` env on the gateway), with
fallback to `iad`.
* The volume holds:
* `/data/project/.huskies/` — pipeline.db (sled), bot.toml, project.toml
* `/data/project/.git` — repository (initially cloned at first run)
* `/data/project/` — working tree
* Sled needs a clean shutdown. The orchestrator must always `stop`
before `destroy`. We rely on Fly's `kill_signal = "SIGTERM"` + the
existing huskies shutdown path in `rebuild.rs`.
* **Snapshots:** Fly snapshots volumes daily by default (5-day
retention). For paid tiers we extend retention via `snapshot_retention`
on the volume.
### 4. Networking & Routing
The gateway already proxies MCP/WS/REST by active project. For SaaS we
add tenant resolution **before** the project lookup:
```
Host: alice.huskies.app → tenant = alice
GET /tenants/alice/projects/foo → project_id, machine_id
proxy to fdaa:0:abcd:a7b:e2:1::3:3001 (or {machine_id}.vm.huskies-projects.internal:3001)
```
* Tenant resolution lives in a new `tenants` CRDT LWW-map keyed by
subdomain → tenant_id; reuses the existing CRDT bus.
* Internal DNS: `<machine_id>.vm.huskies-projects.internal` resolves on
the private network. `<app>.flycast` is the load-balanced anycast
name; we prefer the explicit machine address since each tenant has
exactly one project machine at a time.
* TLS terminates at the Fly edge for `*.huskies.app`. The gateway
receives plain HTTP/2 inside 6PN.
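A sketch of the resolution hop, with plain `HashMap`s standing in for the
`tenants` and `gateway_projects` CRDT maps:

```rust
use std::collections::HashMap;

// Tenant-resolution sketch: Host header -> tenant -> machine address.
fn resolve_backend(
    host: &str,
    tenants: &HashMap<String, String>,  // subdomain -> tenant_id
    machines: &HashMap<String, String>, // tenant_id -> machine_id
) -> Option<String> {
    let subdomain = host.strip_suffix(".huskies.app")?;
    let tenant_id = tenants.get(subdomain)?;
    let machine_id = machines.get(tenant_id)?;
    // Explicit per-machine 6PN address; TLS was already terminated at the
    // Fly edge, so plain HTTP inside the private network is fine.
    Some(format!("http://{machine_id}.vm.huskies-projects.internal:3001"))
}
```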
### 5. Secrets and Tenant Credentials
* `FLY_API_TOKEN` lives only on the gateway (`fly secrets set
FLY_API_TOKEN=… -a huskies-gw`).
* Per-tenant `ANTHROPIC_API_KEY`, `MATRIX_PASSWORD`, etc. are POSTed by
the tenant in the SaaS UI, encrypted with the gateway's KMS key, and
passed to the machine at create time via the Machines API
`config.env` (Fly stores env values encrypted).
* Rotation: changing a tenant secret means `POST /machines/{id}/update`
with the new env, which triggers a rolling replace. The orchestrator
schedules this during the tenant's idle window when possible.
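A rotation sketch, assuming hypothetical `get_machine_config` /
`update_machine` helpers on the client and the update endpoint named above:

```rust
// Merge the new value into the machine's env and re-POST the config.
async fn rotate_secret(fly: &FlyClient, machine_id: &str, key: &str, value: &str) -> anyhow::Result<()> {
    let mut config = fly.get_machine_config("huskies-projects", machine_id).await?;
    config["env"][key] = serde_json::Value::String(value.to_owned());
    // Triggers a rolling replace; the orchestrator schedules this during
    // the tenant's idle window when possible.
    fly.update_machine("huskies-projects", machine_id, config).await?;
    Ok(())
}
```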
### 6. Cost Model and Idle-Shutdown
Indicative pricing (us-east, 2026):
| Machine | Hourly | Notes |
|---------|--------|-------|
| `shared-cpu-2x@2048` always-on | ~$0.027 | $19/mo if 24×7 |
| `shared-cpu-2x@2048` suspended | ~$0.0009 | $0.65/mo idle |
| Volume 1 GiB | ~$0.0002 | $0.15/mo |
Multi-tenant pricing requires **suspend on idle**:
* Auto-stop: in the machine config, set `services[].auto_stop_machines
  = "suspend"` and `services[].auto_start_machines = true`. Fly's
  internal proxy suspends the machine once it has seen no incoming
  traffic for ~5 min, provided the configured `min_machines` floor
  (zero for us) allows it.
* On the next request, the proxy auto-wakes the machine. Resume from
  suspend is ~300 ms (RAM snapshot restored from disk); a full
  `stopped → started` cycle is 10–20 s. We prefer `suspend` for SaaS.
* For long-lived agents (a coder agent running on the machine), the
gateway sends keepalive pings so Fly does not idle-stop while work is
in progress. Implementation: gateway tracks `active_agents` count for
each machine in CRDT; if `>0`, hit `/api/agents` once per minute.
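A keepalive sketch (the `active_machines` callback is an illustrative
stand-in for the CRDT read of per-machine `active_agents` counts):

```rust
use std::time::Duration;

// While the CRDT reports active agents on a machine, hit it once a minute
// so Fly's proxy sees traffic and never idle-suspends a machine mid-run.
async fn keepalive_loop(active_machines: impl Fn() -> Vec<String>) {
    let mut tick = tokio::time::interval(Duration::from_secs(60));
    loop {
        tick.tick().await;
        for machine in active_machines() {
            let url = format!(
                "http://{machine}.vm.huskies-projects.internal:3001/api/agents"
            );
            let _ = reqwest::get(&url).await; // best-effort: a miss skips a beat
        }
    }
}
```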
### 7. Wake-on-Request / Cold-Start Latency
Three latency tiers:
| Tier | Wake | When |
|------|------|------|
| Suspended | ~300 ms | Default for active tenants |
| Stopped | 10–20 s | Tenants idle > 7 days |
| Destroyed | 60–90 s (clone + boot) | Free tier reaped after 30 d |
The gateway returns a `202 Accepted` with a `Retry-After: 1` header
while wake is in progress and surfaces a "warming up" splash. The
existing `huskies-gw` MCP code path needs an explicit wake call for
in-flight requests because Fly's automatic wake only triggers on TCP
SYN to a registered service port.
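A sketch of that wake path in the gateway (assumes `axum` response types
and the hypothetical `start_machine` helper from the section 1 sketch):

```rust
use axum::http::{header, StatusCode};
use axum::response::IntoResponse;

// Fly's auto-wake only fires on a TCP SYN to a registered service port, so
// the gateway's MCP handler kicks an explicit start before telling the
// client to retry.
async fn wake_and_retry(fly: &FlyClient, machine_id: &str) -> axum::response::Response {
    // POST /apps/huskies-projects/machines/{id}/start; fire-and-forget,
    // the retry will find the machine started or still warming.
    let _ = fly.start_machine("huskies-projects", machine_id).await;

    (
        StatusCode::ACCEPTED,
        [(header::RETRY_AFTER, "1")],
        "warming up",
    )
        .into_response()
}
```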
### 8. Observability and Logs
* `fly logs -a huskies-projects -i <machine_id>` streams stdout/stderr.
We expose this through the gateway as `GET /api/admin/tenants/{id}/logs`.
* Should each machine ship logs to the gateway via a sidecar `vector`
  process? Decision: **no** — Fly's built-in NATS log shipper is enough
  for v1; revisit if log volume grows.
* Metrics: Fly auto-exports per-machine CPU/RAM/network as Prometheus
series scrapeable from a `huskies-metrics` machine in the same 6PN.
We hook into Grafana Cloud's free tier for the dashboard.
### 9. Disaster Recovery and Backups
* Volume snapshots (daily) cover hardware failure.
* The CRDT replicates to the gateway over the existing `/crdt-sync`
WebSocket. The gateway keeps a 30-day rolling backup of each tenant's
CRDT in S3 (`s3://huskies-backups/{tenant}/{date}.ops`). This lets us
reconstruct the project tree even if a Fly volume is unrecoverable.
* Restore flow: provision a fresh machine + volume, replay the latest
snapshot, then replay incremental ops from S3. Documented in a
follow-up runbook story.
### 10. Quotas and Abuse Limits
* Per-tenant: max 2 concurrent agents, max 8 GiB volume, max 4 CPUs,
  max $200 per month of OAuth-paid model usage. Enforced in the gateway
  before calling the Machines API. Over-quota → `429 Too Many Requests`
  with a Stripe upsell page.
* Per-Fly-app: Fly soft-limits 1000 machines per app. At scale we
  shard tenants across `huskies-projects-{0..9}` apps using
  `consistent_hash(tenant_id)`; see the sketch after this list.
* Abuse: every tenant signs up with a verified email + Stripe card.
Free tier capped at 1 project, suspended after 7 days idle, destroyed
after 30 days idle.
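The sharding sketch referenced above (std's `DefaultHasher` is only a
placeholder here: its output is not guaranteed stable across Rust releases,
so production should pin an explicit hash before tenant placement depends
on it):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a tenant to one of ten project apps by a stable hash of its id.
fn project_app_for(tenant_id: &str) -> String {
    let mut h = DefaultHasher::new();
    tenant_id.hash(&mut h);
    format!("huskies-projects-{}", h.finish() % 10)
}
```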
---
## Decisions
| Decision | Choice | Rejected alternative |
|----------|--------|----------------------|
| Apps topology | **Single `huskies-projects` app, one machine per tenant** | One app per tenant: clean isolation, but blows out Fly app quotas and complicates IAM |
| Idle strategy | **Suspend, not stop** | Stop: cheaper but 20 s cold start is poor UX for chat |
| Secrets path | **Machine env via Machines API at create time** | Fly app-level secrets: shared across all tenant machines, leaks across tenants |
| State storage | **Per-tenant Fly volume holding sled + git** | Object storage only: would require rewriting sled backend |
| Tenant resolution | **Subdomain → CRDT `tenants` LWW-map** | Path prefix routing: harder to issue per-tenant TLS, breaks browser cookies |
| Volume retention | **Never delete on idle stop; only on explicit project deletion** | Auto-delete after N days idle: too easy to lose user data |
## Open Questions
1. How do we hand off long-running coder agents during a Fly host
evacuation (machine replace event)? Suspend won't survive a host
reboot; we may need a "draining" hook that finishes the current AC
and commits before allowing replacement.
2. Should the gateway also live as Fly machines (auto-scale) or stay
as Fly app v1 with replicas? Probably the former for global routing,
but that's a separate spike.
3. Billing surfaces: do we pass through Fly's per-machine cost to the
tenant, or amortize it into a flat per-project price? Product call.
4. Outbound network egress (model API calls, git pushes) is metered by
   Fly. At Claude Opus rates the model API bill dwarfs any Fly egress
   charge, so egress is a rounding error — confirm at 100-tenant scale.
## Proof-of-Concept Script
A working sketch lives at
[`fly_multitenant_poc.sh`](./fly_multitenant_poc.sh). It demonstrates
end-to-end: read `FLY_API_TOKEN`, create a volume, create a machine
attached to it, wait until started, stop, and destroy. The script is
runnable but is **not** what production code looks like — production
will translate these calls into Rust against a typed `flyio_machines`
client crate, called from a new `server::service::cloud::fly`
module that the gateway invokes on tenant signup.