storkit: create 407_spike_fly_io_machines_for_multi_tenant_storkit_saas

2026-03-27 10:57:41 +00:00
parent 3571511349
commit 93bc08574b
1 changed files with 195 additions and 0 deletions
@@ -0,0 +1,195 @@
 ---
 name: "Fly.io Machines for multi-tenant storkit SaaS — docs, security & pricing"
 retry_count: 2
 blocked: true
 ---
 # Spike 407: Fly.io Machines for multi-tenant storkit SaaS — docs, security & pricing
 ## Question
 What do Fly.io's published docs, security claims, and pricing say about using Machines as the isolation layer for a multi-tenant storkit SaaS? Is there anything that rules it out before we write code?
 ## Hypothesis
 Fly.io Machines (Firecracker-based microVMs) are a viable isolation primitive for tenants running arbitrary shell commands, and the pricing model is workable at early SaaS scale.
 ## Timebox
 2 hours
 ## Investigation Plan
 - [x] Read Fly.io Machines API docs — what are the core primitives (machine lifecycle, networking, volumes, secrets)?
 - [x] Research Fly.io's published isolation model — what security guarantees do they document for Firecracker microVMs? Summarise claims and explicitly flag what would require independent security review before production use.
 - [x] Research cold start time — what do Fly.io docs and community benchmarks claim? Note that real numbers require a test account (covered in spike 408).
 - [x] Research persistent volume support — can a volume be attached per-tenant? What are the size/count limits?
 - [x] Research secret injection options — env vars, Fly Secrets API, volume mounts. What's the right approach for per-tenant `~/.claude/.credentials.json`?
 - [x] Research machine count and org limits — any hard caps that would block SaaS growth?
 - [x] Research pricing — always-on vs stop-on-idle machine costs at 10, 100, 1000 tenants. Include volume and egress costs.
 - [x] Identify any documented showstoppers.
 ## Findings
 ### 1. Core API Primitives
 Base URL: `https://api.machines.dev` (or `http://_api.internal:4280` from inside the 6PN private network).
 Auth: `Authorization: Bearer <fly_api_token>`.
 **Machine lifecycle** — full REST API:
 - `POST /v1/apps/{app}/machines` — create; the machine starts by default, and `skip_launch: true` creates it without starting
 - `POST /v1/apps/{app}/machines/{id}/start` — start stopped machine (~10ms same-region)
 - `POST /v1/apps/{app}/machines/{id}/stop` — stop (signals SIGINT, then SIGKILL on timeout; retains disk)
 - `POST /v1/apps/{app}/machines/{id}/suspend` — snapshot RAM to disk for fast resume
 - `DELETE /v1/apps/{app}/machines/{id}` — destroy (irreversible)
 - `GET /v1/apps/{app}/machines/{id}/wait?state=started` — synchronize on state transitions
 Machine states: `created → started → stopped/suspended → destroyed`.
 Leases (`POST .../lease`) provide exclusive mutation locks — useful for orchestration.
 **Rate limits**: 1 req/s per action per machine/app ID (burst to 3). Matters for rapid tenant provisioning.
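
The lifecycle endpoints above map onto a small URL builder. The following is a minimal sketch, not a client library: `machine_url`, `machine_request`, `wait_url`, and `auth_headers` are hypothetical helper names, and only the paths, methods, and bearer-token scheme come from the findings above. It builds requests without sending them, so it can sit in front of any HTTP library.

```python
from urllib.parse import quote

BASE_URL = "https://api.machines.dev"  # public base URL from the findings above

# Lifecycle action -> (HTTP method, path suffix), taken from the endpoint list above.
MACHINE_ACTIONS = {
    "start": ("POST", "/start"),
    "stop": ("POST", "/stop"),
    "suspend": ("POST", "/suspend"),
    "destroy": ("DELETE", ""),
}


def auth_headers(token: str) -> dict:
    """Bearer-token headers sent with every Machines API call."""
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}


def machine_url(app: str, machine_id: str) -> str:
    """URL of a single machine resource inside an app."""
    return f"{BASE_URL}/v1/apps/{quote(app, safe='')}/machines/{quote(machine_id, safe='')}"


def machine_request(app: str, machine_id: str, action: str) -> tuple:
    """Return (method, url) for a lifecycle action on an existing machine."""
    if action not in MACHINE_ACTIONS:
        raise ValueError(f"unknown machine action: {action!r}")
    method, suffix = MACHINE_ACTIONS[action]
    return method, machine_url(app, machine_id) + suffix


def wait_url(app: str, machine_id: str, state: str = "started") -> str:
    """URL of the wait endpoint used to synchronize on a state transition."""
    return f"{machine_url(app, machine_id)}/wait?state={quote(state, safe='')}"
```

A provisioning script would typically issue `machine_request(app, id, "start")` and then poll `wait_url(app, id)` before routing traffic to the tenant.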
 ### 2. Isolation Model
 Each Fly Machine is a **Firecracker microVM** — a separate Linux kernel, not a container. Defense in depth:
 1. KVM hardware-enforced memory and CPU isolation
 2. Minimal device model (5 virtual devices vs QEMU's hundreds)
 3. Rust VMM implementation (no C memory-safety bugs in VMM)
 4. `seccomp-bpf` limits Firecracker process to ~40 syscalls with argument filters
 5. Jailer chroots + namespaces + drops privileges around the Firecracker process
 From official docs: *"MicroVMs provide strong hardware-virtualization-based security and workload isolation, which allows us to safely run applications from different customers on shared hardware."* Full VM isolation prevents kernel sharing between apps.
 Tenants have full root inside their VM by design — the kernel boundary contains blast radius.
 **Claims requiring independent verification before production use:**
 - Whether SMT/hyperthreading is disabled on hosts (directly relevant to Spectre/MDS side-channel attacks — Firecracker's own docs recommend disabling SMT for strict multi-tenancy, but Fly.io does not publicly document this)
 - CPU dedication is explicitly described as "best-effort", not a hard guarantee
 - Pentest scope/dates/findings for three named firms (Atredis Partners, Doyensec, Tetrel) are not published
 - Whether the SOC 2 Type II report scope covers the Firecracker isolation layer specifically
 **Compliance**: SOC 2 Type II certified (report available on request), ISO 27001 datacenters (Equinix), HIPAA BAA available, GDPR DPA available.
 ### 3. Network Isolation
 Each machine gets a private IPv6 (6PN) address. Key isolation controls:
 - Cross-organization: Fly.io platform blocks all cross-org traffic at the platform level — strong boundary
 - Intra-organization: **open by default** — any machine in the same org can reach any other
 For multi-tenant SaaS, this means tenant machines in the same Fly.io org are NOT network-isolated from each other unless you use **Custom Private Networks (6PNs)**:
 - `POST /v1/apps` with a `network` field assigns that app to an isolated 6PN
 - Apps on different 6PNs cannot reach each other via private networking (only via public IPs)
 - **Assignment is permanent** — cannot be changed after app creation; plan upfront
 Stable machine addressing: use the DNS name `<machine_id>.vm.<appname>.internal` rather than the raw 6PN address, which changes on migration.
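
Because the network assignment cannot be changed later, the per-tenant app name and network name are worth deriving in one place. This is a sketch under assumptions: the `storkit-<tenant>` naming scheme and the helper names are hypothetical, and the `app_name` and `org_slug` body fields are assumed from the Machines API's app-creation request; only the `network` field and the `.vm.<appname>.internal` DNS form come from the findings above.

```python
import re

# Hypothetical constraint: tenant IDs must be DNS-label safe so they can be
# embedded in app names and internal hostnames.
TENANT_ID_RE = re.compile(r"^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$")


def tenant_app_name(tenant_id: str) -> str:
    """Derive the per-tenant app name (hypothetical naming scheme)."""
    if not TENANT_ID_RE.match(tenant_id):
        raise ValueError(f"tenant id is not DNS-safe: {tenant_id!r}")
    return f"storkit-{tenant_id}"


def create_app_payload(tenant_id: str, org_slug: str) -> dict:
    """Body for POST /v1/apps that pins the tenant's app to its own 6PN."""
    app_name = tenant_app_name(tenant_id)
    return {
        "app_name": app_name,            # assumed field name
        "org_slug": org_slug,            # assumed field name
        "network": f"{app_name}-net",    # permanent once the app exists
    }


def machine_dns_name(machine_id: str, app_name: str) -> str:
    """Stable internal hostname for a machine; prefer this over raw 6PN addresses."""
    return f"{machine_id}.vm.{app_name}.internal"
```

Validating the tenant ID before the first API call matters more than usual here, since a badly named app cannot be moved to a different network afterwards.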
 ### 4. Cold Start Times
 | Scenario | Documented Latency |
 |---|---|
 | Cold boot (create + start, same region) | ~300 ms |
 | Start existing stopped machine (same region) | ~10 ms |
 | Start stopped machine (cross-region) | ~150 ms |
 | Resume from suspend (same region) | Sub-100ms (implied) |
 Community-observed: 400–600ms end-to-end (including app init) for stopped machine cold starts.
 FLAME workloads report 3–8s in some restart-race conditions.
 Real latency numbers with our actual image size require a test account — covered by spike 408.
 ### 5. Persistent Volume Support
 - Volumes are created via `POST /v1/apps/{app}/volumes` with `size_gb` (default 3 GB), region, encryption flag
 - Attached to machine via `config.mounts[].volume` at create/update time
 - **1:1 constraint**: one volume per machine, one machine per volume, same region required
 - Volumes persist across machine stop/start/suspend/destroy — they are a separate resource
 - Can extend volume online (`PUT .../volumes/{id}/extend`)
 - Volume snapshots available (billed at $0.08/GB/month as of Jan 2026)
 - No documented per-org volume count cap (separate from machine cap)
 For per-tenant `~/.claude/` home directories, attach one volume per tenant machine — straightforward.
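
The same-region constraint is easy to enforce in provisioning code before any request is sent. A minimal sketch, assuming the volume-creation body uses `name`, `region`, `size_gb`, and `encrypted` fields and that a mount entry takes a `path` next to the `volume` ID; the helper names, the volume name, and the mount path in the usage note are hypothetical.

```python
def volume_payload(name: str, region: str, size_gb: int = 1, encrypted: bool = True) -> dict:
    """Body for POST /v1/apps/{app}/volumes (field names are assumptions)."""
    if size_gb < 1:
        raise ValueError("size_gb must be at least 1")
    return {"name": name, "region": region, "size_gb": size_gb, "encrypted": encrypted}


def mount_for(volume: dict, machine_region: str, path: str) -> dict:
    """Build a config.mounts entry, refusing cross-region attachments.

    `volume` is the dict describing an existing volume and must carry
    at least its `id` and `region`.
    """
    if volume["region"] != machine_region:
        raise ValueError(
            f"volume {volume['id']} is in {volume['region']}, "
            f"machine is in {machine_region}; they must match"
        )
    return {"volume": volume["id"], "path": path}
```

For example, `mount_for({"id": "vol_123", "region": "iad"}, "iad", "/root/.claude")` yields a mount entry, while passing a machine region of `lhr` raises before the API is ever called.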
 ### 6. Secret Injection
 Four methods. Methods 1, 2, and 4 are backed by Fly Secrets and suit sensitive credentials; method 3 is for non-sensitive config only:
 1. **Fly Secrets** (`fly secrets set KEY=value`) — encrypted at rest, injected as env vars at boot to all machines in the app. **Secrets are per-app, not per-machine**: all machines in an app share the same secret set. For per-tenant isolated secrets, each tenant needs their own app (or use method 2, scoped as described there).
 2. **`config.files` with `secret_name`** — writes a named secret to a file path inside the machine at start time:
   ```json
   {"guest_path": "/root/.claude/.credentials.json", "secret_name": "TENANT_CREDENTIALS"}
   ```
   This is the right approach for per-tenant `~/.claude/.credentials.json` if tenants share an app — pair with `ignore_app_secrets: true` and per-process secret scoping.
 3. **`config.env`** — plain env vars in machine config, not encrypted at rest. Non-sensitive config only.
 4. **`config.processes[].secrets`** — inject named secrets only to specific process groups; `ignore_app_secrets: true` prevents inheritance of app-level secrets.
 **Recommended architecture**: One app per tenant (isolated 6PN + isolated secrets) is the cleanest security model. Secrets stored per app via Fly Secrets, credentials file written via `config.files` at boot.
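
The recommended split (credentials via `config.files`, only non-sensitive values in `config.env`) can be checked when the machine config is assembled. This is an illustrative sketch: `tenant_machine_config` and the key-name heuristic are hypothetical, the image reference is a placeholder, and only the `files` entry shape comes from the JSON example above.

```python
import re

CREDENTIALS_PATH = "/root/.claude/.credentials.json"  # guest path from the example above

# Heuristic only: env var names that suggest a secret ended up in plain config.
SENSITIVE_HINT = re.compile(r"(TOKEN|SECRET|PASSWORD|CREDENTIAL|API_KEY)", re.IGNORECASE)


def tenant_machine_config(image: str, secret_name: str = "TENANT_CREDENTIALS", env=None) -> dict:
    """Assemble a machine config whose credentials file is sourced from a named secret."""
    env = dict(env or {})
    leaked = sorted(key for key in env if SENSITIVE_HINT.search(key))
    if leaked:
        # config.env is not encrypted at rest, so refuse secret-looking keys.
        raise ValueError(f"move these keys to Fly Secrets, not config.env: {leaked}")
    return {
        "image": image,
        "env": env,
        "files": [{"guest_path": CREDENTIALS_PATH, "secret_name": secret_name}],
    }
```

The name check is a guard rail rather than a guarantee; it catches the common mistake of passing an API key through `env`, but a secret under an innocuous name would still slip through.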
 ### 7. Machine Count and Org Limits
 | Limit | Default | Hard Cap |
 |---|---|---|
 | Machines per org (all states) | 50 | None architectural |
 - The 50-machine default is a **fail-safe**, not an architectural limit. Fly.io runs customers with 100,000+ machines.
 - To raise: email `billing@fly.io` with requirements.
 - **This limit will be hit immediately in any real multi-tenant deployment** — must budget for an early limit-raise request before launching.
 - API rate limit of 1 req/s per action also needs consideration for bulk tenant provisioning scripts.
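
One way to keep a bulk provisioning loop inside the documented limit (1 req/s with a burst of 3) is a token bucket. The sketch below is a generic token bucket, not anything Fly.io ships; the class name and the injectable `clock` and `sleep` parameters are choices made here so the logic can be exercised without real waiting.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens/second up to `burst` tokens."""

    def __init__(self, rate: float = 1.0, burst: int = 3,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so the first calls can burst
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def acquire(self) -> float:
        """Block until one token is available; return the seconds waited."""
        now = self.clock()
        # Credit tokens earned since the last call, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        waited = 0.0
        if self.tokens < 1.0:
            waited = (1.0 - self.tokens) / self.rate
            self.sleep(waited)
            self.last = self.clock()
            self.tokens = 1.0  # the wait earned exactly the missing fraction
        self.tokens -= 1.0
        return waited
```

Since the limit applies per action per machine or app ID, a real provisioner would probably keep one bucket per `(action, id)` key and call `acquire()` before each request.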
 ### 8. Pricing (as of March 2026)
 **Compute (per second, billed only while running):**
 | Preset | Per Month always-on |
 |---|---|
 | shared-cpu-1x (256 MB) | $2.05 |
 | shared-cpu-2x (512 MB) | $4.10 |
 | performance-1x (2 GB) | $32.64 |
 **Storage**: $0.15/GB/month (provisioned, regardless of machine state)
 **Egress**: $0.02/GB (North America/Europe), $0.04/GB (APAC/SA), $0.12/GB (Africa/India)
 **Dedicated IPv4**: $2.00/month per app (shared IPv6 is free)
 **No free tier** for new orgs (eliminated 2024). No minimum spend, no base fee.
 **Monthly cost estimates** (1x shared-cpu-1x, 1 GB volume, 1 GB egress/tenant, US East):
 | Scenario | Per Tenant | 10 Tenants | 100 Tenants | 1,000 Tenants |
 |---|---|---|---|---|
 | Always-on (730h/month) | $2.22 | $22 | $222 | $2,220 |
 | Autostop, 8h/day active | $0.92 | $9 | $92 | $920 |
 | Autostop, 2h/day active | $0.53 | $5 | $53 | $530 |
 Because volumes are billed whether or not the machine is running, storage becomes a major share of the bill when machines are mostly idle. At 1,000 tenants autostopped, storage is ~$150/month vs compute of $170–$370/month.
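
The estimates can be re-derived from the unit prices listed above. The sketch below is a simplified model that counts only per-second compute, provisioned volume storage, and egress; the function names are hypothetical. With these inputs it reproduces the always-on figure of $2.22, but it gives lower autostop figures than the table (about $0.85 and $0.34 per tenant), so the table may include a component not itemized here.

```python
COMPUTE_ALWAYS_ON = 2.05  # USD/month, shared-cpu-1x (256 MB), from the compute table
STORAGE_PER_GB = 0.15     # USD/GB/month, billed regardless of machine state
EGRESS_PER_GB = 0.02      # USD/GB, North America/Europe rate


def monthly_cost_per_tenant(active_hours_per_day: float,
                            volume_gb: float = 1.0,
                            egress_gb: float = 1.0) -> float:
    """Estimated USD/month for one tenant under the simplified model."""
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    # Compute is billed only while running, so it scales with the duty cycle.
    compute = COMPUTE_ALWAYS_ON * active_hours_per_day / 24
    return compute + volume_gb * STORAGE_PER_GB + egress_gb * EGRESS_PER_GB


def fleet_cost(tenants: int, active_hours_per_day: float) -> float:
    """Estimated USD/month for a fleet of identical tenants."""
    return tenants * monthly_cost_per_tenant(active_hours_per_day)


def compute_saving_fraction(active_hours_per_day: float) -> float:
    """Fraction of always-on compute cost avoided by stopping when idle."""
    return 1 - active_hours_per_day / 24
```

Calling `fleet_cost(1000, 2)` or `fleet_cost(1000, 8)` gives the fleet-level view; the storage term is constant across duty cycles, which is why it weighs more heavily as active hours drop.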
 ### 9. Showstoppers
 **None identified** that rule it out. The following require action before launch:
 | Risk | Severity | Mitigation |
 |---|---|---|
 | Default 50-machine org cap | High (blocks launch) | Email billing@fly.io early; no architectural cap |
 | SMT/hyperthreading not documented | Medium (security) | Request confirmation from Fly.io support before production; only partially mitigated by VM-level isolation, which does not by itself rule out SMT side channels |
 | Intra-org network open by default | Medium (security) | Use one app per tenant with custom 6PNs |
 | Secrets are per-app not per-machine | Low | Use one app per tenant or `config.files` with `secret_name` |
 | Volume and machine must be same region | Low (ops) | Enforce region consistency in provisioning code |
 | API rate limit 1 req/s per action per machine/app ID (burst to 3) | Low | Throttle bulk provisioning loops |
 ## Recommendation
 **Proceed.** Fly.io Machines are a viable isolation layer for multi-tenant storkit SaaS.
 **Architecture to validate in spike 408:**
 - One Fly.io app per tenant (provides 6PN network isolation + isolated secrets)
 - One Firecracker microVM per tenant app (shared-cpu-1x 256 MB baseline; adjust per observed usage)
 - One persistent volume per tenant (1 GB baseline for `~/.claude/`, repos, storkit state)
 - Autostop/autoresume enabled — roughly 67–92% compute cost reduction vs always-on at 8h/day and 2h/day active, respectively
 - Tenant credentials injected via `config.files` + Fly Secrets at machine start
 **Pricing verdict**: Workable at early SaaS scale. At 100 tenants with autostop (8h/day), costs ~$92/month; at 1,000 tenants ~$920/month. Margins are viable if per-tenant pricing is $5–$20/month.
 **Before production**: Confirm with Fly.io support whether SMT is disabled on worker hosts. Request org machine limit raised to 200–500 during private beta.
 **Spike 408 scope**: Validate cold start latency, autostop resume behavior, and volume persistence with a real test machine running the storkit container image.