From 93bc08574b9b8eded24e725f056a2e8daf8f70b1 Mon Sep 17 00:00:00 2001
From: dave
Date: Fri, 27 Mar 2026 10:57:41 +0000
Subject: [PATCH] storkit: create 407_spike_fly_io_machines_for_multi_tenant_storkit_saas

---
 ..._machines_for_multi_tenant_storkit_saas.md | 195 ++++++++++++++++++
 1 file changed, 195 insertions(+)
 create mode 100644 .storkit/work/1_backlog/407_spike_fly_io_machines_for_multi_tenant_storkit_saas.md

diff --git a/.storkit/work/1_backlog/407_spike_fly_io_machines_for_multi_tenant_storkit_saas.md b/.storkit/work/1_backlog/407_spike_fly_io_machines_for_multi_tenant_storkit_saas.md
new file mode 100644
index 00000000..b6136dc8
--- /dev/null
+++ b/.storkit/work/1_backlog/407_spike_fly_io_machines_for_multi_tenant_storkit_saas.md
@@ -0,0 +1,195 @@
+---
+name: "Fly.io Machines for multi-tenant storkit SaaS — docs, security & pricing"
+retry_count: 2
+blocked: true
+---
+
+# Spike 407: Fly.io Machines for multi-tenant storkit SaaS — docs, security & pricing
+
+## Question
+
+What do Fly.io's published docs, security claims, and pricing say about using Machines as the isolation layer for a multi-tenant storkit SaaS? Is there anything that rules it out before we write code?
+
+## Hypothesis
+
+Fly.io Machines (Firecracker-based microVMs) are a viable isolation primitive for tenants running arbitrary shell commands, and the pricing model is workable at early SaaS scale.
+
+## Timebox
+
+2 hours
+
+## Investigation Plan
+
+- [x] Read Fly.io Machines API docs — what are the core primitives (machine lifecycle, networking, volumes, secrets)?
+- [x] Research Fly.io's published isolation model — what security guarantees do they document for Firecracker microVMs? Summarise claims and explicitly flag what would require independent security review before production use.
+- [x] Research cold start time — what do Fly.io docs and community benchmarks claim? Note that real numbers require a test account (covered in spike 408).
+- [x] Research persistent volume support — can a volume be attached per-tenant? What are the size/count limits?
+- [x] Research secret injection options — env vars, Fly Secrets API, volume mounts. What's the right approach for per-tenant `~/.claude/.credentials.json`?
+- [x] Research machine count and org limits — any hard caps that would block SaaS growth?
+- [x] Research pricing — always-on vs stop-on-idle machine costs at 10, 100, 1000 tenants. Include volume and egress costs.
+- [x] Identify any documented showstoppers.
+
+## Findings
+
+### 1. Core API Primitives
+
+Base URL: `https://api.machines.dev` (or `http://_api.internal:4280` from within 6PN).
+Auth: `Authorization: Bearer `.
+
+**Machine lifecycle** — full REST API:
+- `POST /v1/apps/{app}/machines` — create (and optionally start, via `skip_launch: false`)
+- `POST /v1/apps/{app}/machines/{id}/start` — start a stopped machine (~10 ms same-region)
+- `POST /v1/apps/{app}/machines/{id}/stop` — stop (SIGINT/SIGKILL, retains disk)
+- `POST /v1/apps/{app}/machines/{id}/suspend` — snapshot RAM to disk for fast resume
+- `DELETE /v1/apps/{app}/machines/{id}` — destroy (irreversible)
+- `GET /v1/apps/{app}/machines/{id}/wait?state=started` — synchronize on state transitions
+
+Machine states: `created → started → stopped/suspended → destroyed`.
+Leases (`POST .../lease`) provide exclusive mutation locks — useful for orchestration.
+
+**Rate limits**: 1 req/s per action per machine/app ID (burst to 3). This matters for rapid tenant provisioning.
+
+### 2. Isolation Model
+
+Each Fly Machine is a **Firecracker microVM** — a separate Linux kernel, not a container. Defense in depth:
+1. KVM hardware-enforced memory and CPU isolation
+2. Minimal device model (5 virtual devices vs QEMU's hundreds)
+3. Rust VMM implementation (avoids the memory-safety bug classes typical of C VMMs)
+4. `seccomp-bpf` limits the Firecracker process to ~40 syscalls with argument filters
+5. Jailer chroots + namespaces + drops privileges around the Firecracker process
+
+From official docs: *"MicroVMs provide strong hardware-virtualization-based security and workload isolation, which allows us to safely run applications from different customers on shared hardware."* Full VM isolation prevents kernel sharing between apps.
+
+Tenants have full root inside their VM by design — the kernel boundary contains the blast radius.
+
+**Claims requiring independent verification before production use:**
+- Whether SMT/hyperthreading is disabled on hosts (directly relevant to Spectre/MDS side-channel attacks — Firecracker's own docs recommend disabling SMT for strict multi-tenancy, but Fly.io does not publicly document this)
+- CPU dedication is explicitly described as "best-effort", not a hard guarantee
+- Pentest scope/dates/findings for three named firms (Atredis Partners, Doyensec, Tetrel) are not published
+- Whether the SOC 2 Type II report scope covers the Firecracker isolation layer specifically
+
+**Compliance**: SOC 2 Type II certified (report available on request), ISO 27001 datacenters (Equinix), HIPAA BAA available, GDPR DPA available.
+
+### 3. Network Isolation
+
+Each machine gets a private IPv6 (6PN) address. Key isolation controls:
+- Cross-organization: Fly.io blocks all cross-org traffic at the platform level — strong boundary
+- Intra-organization: **open by default** — any machine in the same org can reach any other
+
+For multi-tenant SaaS, this means tenant machines in the same Fly.io org are NOT network-isolated from each other unless you use **Custom Private Networks (6PNs)**:
+- `POST /v1/apps` with a `network` field assigns that app to an isolated 6PN
+- Apps on different 6PNs cannot reach each other via private networking (only via public IPs)
+- **Assignment is permanent** — it cannot be changed after app creation; plan upfront
+
+Stable machine addressing: `.vm..internal` (6PN addresses change on migration).
+
+### 4. Cold Start Times
+
+| Scenario | Documented Latency |
+|---|---|
+| Cold boot (create + start, same region) | ~300 ms |
+| Start existing stopped machine (same region) | ~10 ms |
+| Start stopped machine (cross-region) | ~150 ms |
+| Resume from suspend (same region) | Sub-100 ms (implied) |
+
+Community-observed: 400–600 ms end-to-end (including app init) for stopped machine cold starts.
+FLAME workloads report 3–8 s in some restart-race conditions.
+
+Real latency numbers with our actual image size require a test account — covered by spike 408.
+
+### 5. Persistent Volume Support
+
+- Volumes are created via `POST /v1/apps/{app}/volumes` with `size_gb` (default 3 GB), region, and an encryption flag
+- Attached to a machine via `config.mounts[].volume` at create/update time
+- **1:1 constraint**: one volume per machine, one machine per volume, same region required
+- Volumes persist across machine stop/start/suspend/destroy — they are a separate resource
+- Volumes can be extended online (`PUT .../volumes/{id}/extend`)
+- Volume snapshots available (billed at $0.08/GB/month as of Jan 2026)
+- No documented per-org volume count cap (separate from machine cap)
+
+For per-tenant `~/.claude/` home directories, attach one volume per tenant machine — straightforward.
+
+### 6. Secret Injection
+
+Four methods; 1, 2, and 4 are suitable for sensitive credentials, 3 is not:
+
+1. **Fly Secrets** (`fly secrets set KEY=value`) — encrypted at rest, injected as env vars at boot to all machines in the app. **Secrets are per-app, not per-machine** — all machines in an app share the same secret set. For per-tenant isolated secrets, each tenant needs their own app (or use methods 2 and 4 together within a shared app).
+
+2. **`config.files` with `secret_name`** — writes a named secret to a file path inside the machine at start time:
+   ```json
+   {"guest_path": "/root/.claude/.credentials.json", "secret_name": "TENANT_CREDENTIALS"}
+   ```
+   This is the right approach for per-tenant `~/.claude/.credentials.json` if tenants share an app — pair with `ignore_app_secrets: true` and per-process secret scoping.
+
+3. **`config.env`** — plain env vars in machine config, not encrypted at rest. Non-sensitive config only.
+
+4. **`config.processes[].secrets`** — inject named secrets only to specific process groups; `ignore_app_secrets: true` prevents inheritance of app-level secrets.
+
+**Recommended architecture**: One app per tenant (isolated 6PN + isolated secrets) is the cleanest security model. Secrets are stored per app via Fly Secrets, and the credentials file is written via `config.files` at boot.
+
+### 7. Machine Count and Org Limits
+
+| Limit | Default | Hard Cap |
+|---|---|---|
+| Machines per org (all states) | 50 | None architectural |
+
+- The 50-machine default is a **fail-safe**, not an architectural limit. Fly.io runs customers with 100,000+ machines.
+- To raise: email `billing@fly.io` with requirements.
+- **This limit will be hit immediately in any real multi-tenant deployment** — budget for an early limit-raise request before launching.
+- The API rate limit of 1 req/s per action also needs consideration for bulk tenant provisioning scripts.
+
+### 8. Pricing (as of March 2026)
+
+**Compute (per second, billed only while running):**
+
+| Preset | Monthly cost, always-on |
+|---|---|
+| shared-cpu-1x (256 MB) | $2.05 |
+| shared-cpu-2x (512 MB) | $4.10 |
+| performance-1x (2 GB) | $32.64 |
+
+**Storage**: $0.15/GB/month (provisioned, regardless of machine state)
+**Egress**: $0.02/GB (North America/Europe), $0.04/GB (APAC/SA), $0.12/GB (Africa/India)
+**Dedicated IPv4**: $2.00/month per app (shared IPv6 is free)
+
+**No free tier** for new orgs (eliminated 2024). No minimum spend, no base fee.
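
The rates listed above can be combined into a per-tenant monthly figure. The following is a minimal sketch, assuming compute scales linearly with uptime (it is billed per second only while running) while storage and egress are independent of machine state. The function and class names are hypothetical, and the defaults are copied from the shared-cpu-1x, storage, and North America/Europe egress rates above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD rates copied from the pricing figures above (March 2026)."""
    compute_always_on: float = 2.05  # shared-cpu-1x (256 MB), per month
    storage_per_gb: float = 0.15     # provisioned, billed even when stopped
    egress_per_gb: float = 0.02      # North America/Europe


def monthly_cost_per_tenant(active_hours_per_day: float,
                            volume_gb: float = 1.0,
                            egress_gb: float = 1.0,
                            rates: Rates = Rates()) -> float:
    """Straight pro-rata: only compute depends on how long the machine runs."""
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    compute = rates.compute_always_on * (active_hours_per_day / 24)
    storage = rates.storage_per_gb * volume_gb
    egress = rates.egress_per_gb * egress_gb
    return compute + storage + egress


def fleet_cost(tenants: int, active_hours_per_day: float, **kwargs) -> float:
    """Total monthly cost for a fleet of identical tenants."""
    return tenants * monthly_cost_per_tenant(active_hours_per_day, **kwargs)
```

For example, `monthly_cost_per_tenant(24)` comes to about 2.22, in line with the always-on per-tenant estimate in this section. Because this is a straight pro-rata of three rates, any estimate that includes allowances beyond them will come out higher.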
+
+**Monthly cost estimates** (1x shared-cpu-1x, 1 GB volume, 1 GB egress/tenant, US East):
+
+| Scenario | Per Tenant | 10 Tenants | 100 Tenants | 1,000 Tenants |
+|---|---|---|---|---|
+| Always-on (730h/month) | $2.22 | $22 | $222 | $2,220 |
+| Autostop, 8h/day active | $0.92 | $9 | $92 | $920 |
+| Autostop, 2h/day active | $0.53 | $5 | $53 | $530 |
+
+At scale, volume storage becomes a significant share of the bill when machines are mostly idle. At 1,000 tenants autostopped, storage is ~$150/month vs compute of $170–$370/month.
+
+### 9. Showstoppers
+
+**None identified** that rule out Fly.io Machines. The following require action before launch:
+
+| Risk | Severity | Mitigation |
+|---|---|---|
+| Default 50-machine org cap | High (blocks launch) | Email billing@fly.io early; no architectural cap |
+| SMT/hyperthreading status not documented | Medium (security) | Request confirmation from Fly.io support before production; mitigated by VM-level isolation |
+| Intra-org network open by default | Medium (security) | Use one app per tenant with custom 6PNs |
+| Secrets are per-app, not per-machine | Low | Use one app per tenant or `config.files` with `secret_name` |
+| Volume and machine must be in the same region | Low (ops) | Enforce region consistency in provisioning code |
+| API rate limit of 1 req/s per action per machine/app ID | Low | Throttle bulk provisioning loops |
+
+## Recommendation
+
+**Proceed.** Fly.io Machines are a viable isolation layer for multi-tenant storkit SaaS.
+
+**Architecture to validate in spike 408:**
+- One Fly.io app per tenant (provides 6PN network isolation + isolated secrets)
+- One Firecracker microVM per tenant app (shared-cpu-1x 256 MB baseline; adjust per observed usage)
+- One persistent volume per tenant (1 GB baseline for `~/.claude/`, repos, storkit state)
+- Autostop/autoresume enabled — roughly 67–92% compute cost reduction vs always-on at 2–8 h/day of activity, since compute is billed only while running
+- Tenant credentials injected via `config.files` + Fly Secrets at machine start
+
+**Pricing verdict**: Workable at early SaaS scale. At 100 tenants with autostop (8h/day), the estimate is ~$92/month; at 1,000 tenants, ~$920/month. Margins are viable if per-tenant pricing is $5–$20/month.
+
+**Before production**: Confirm with Fly.io support whether SMT is disabled on worker hosts. Request that the org machine limit be raised to 200–500 during private beta.
+
+**Spike 408 scope**: Validate cold start latency, autostop resume behavior, and volume persistence with a real test machine running the storkit container image.
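
The architecture above can be sketched as the two request bodies a provisioning script might send: one to `POST /v1/apps` and one to `POST /v1/apps/{app}/machines`. This is a sketch under stated assumptions, not a verified integration. The helper names, the `storkit-tenant-` naming scheme, the `/root/.claude` mount path, and the `app_name` and `org_slug` field names are assumptions to check against the Machines API reference, as is how a volume mount and a `config.files` entry under the same path interact. It only builds payloads and makes no network calls.

```python
import json


def build_app_request(tenant_id: str, org_slug: str) -> dict:
    """Body for POST /v1/apps: one app and one custom 6PN per tenant."""
    name = f"storkit-tenant-{tenant_id}"  # hypothetical naming scheme
    return {
        "app_name": name,      # assumed field name
        "org_slug": org_slug,  # assumed field name
        "network": name,       # isolated 6PN; permanent once the app exists
    }


def build_machine_request(image: str, volume_id: str,
                          secret_name: str = "TENANT_CREDENTIALS") -> dict:
    """Body for POST /v1/apps/{app}/machines for a single tenant machine."""
    return {
        "config": {
            "image": image,
            # shared-cpu-1x, 256 MB baseline from the recommendation
            "guest": {"cpu_kind": "shared", "cpus": 1, "memory_mb": 256},
            # one volume per machine; mount path is an assumption
            "mounts": [{"volume": volume_id, "path": "/root/.claude"}],
            # credentials file written from a named secret at start time
            "files": [{
                "guest_path": "/root/.claude/.credentials.json",
                "secret_name": secret_name,
            }],
        }
    }


if __name__ == "__main__":
    # Placeholder image and volume ID, for illustration only.
    body = build_machine_request("registry.example/storkit:latest", "vol_example")
    print(json.dumps(body, indent=2))
```

Keeping payload construction separate from the HTTP calls makes it straightforward to wrap the calls in a throttle that respects the 1 req/s per-action rate limit during bulk provisioning.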