From 93bc08574b9b8eded24e725f056a2e8daf8f70b1 Mon Sep 17 00:00:00 2001
From: dave <futurechimp@users.noreply.github.com>
Date: Fri, 27 Mar 2026 10:57:41 +0000
Subject: [PATCH] storkit: create
 407_spike_fly_io_machines_for_multi_tenant_storkit_saas

---
 ..._machines_for_multi_tenant_storkit_saas.md | 195 ++++++++++++++++++
 1 file changed, 195 insertions(+)
 create mode 100644 .storkit/work/1_backlog/407_spike_fly_io_machines_for_multi_tenant_storkit_saas.md

diff --git a/.storkit/work/1_backlog/407_spike_fly_io_machines_for_multi_tenant_storkit_saas.md b/.storkit/work/1_backlog/407_spike_fly_io_machines_for_multi_tenant_storkit_saas.md
new file mode 100644
index 00000000..b6136dc8
--- /dev/null
+++ b/.storkit/work/1_backlog/407_spike_fly_io_machines_for_multi_tenant_storkit_saas.md
@@ -0,0 +1,195 @@
+---
+name: "Fly.io Machines for multi-tenant storkit SaaS — docs, security & pricing"
+retry_count: 2
+blocked: true
+---
+
+# Spike 407: Fly.io Machines for multi-tenant storkit SaaS — docs, security & pricing
+
+## Question
+
+What do Fly.io's published docs, security claims, and pricing say about using Machines as the isolation layer for a multi-tenant storkit SaaS? Is there anything that rules it out before we write code?
+
+## Hypothesis
+
+Fly.io Machines (Firecracker-based microVMs) are a viable isolation primitive for tenants running arbitrary shell commands, and the pricing model is workable at early SaaS scale.
+
+## Timebox
+
+2 hours
+
+## Investigation Plan
+
+- [x] Read Fly.io Machines API docs — what are the core primitives (machine lifecycle, networking, volumes, secrets)?
+- [x] Research Fly.io's published isolation model — what security guarantees do they document for Firecracker microVMs? Summarise claims and explicitly flag what would require independent security review before production use.
+- [x] Research cold start time — what do Fly.io docs and community benchmarks claim? Note that real numbers require a test account (covered in spike 408).
+- [x] Research persistent volume support — can a volume be attached per-tenant? What are the size/count limits?
+- [x] Research secret injection options — env vars, Fly Secrets API, volume mounts. What's the right approach for per-tenant `~/.claude/.credentials.json`?
+- [x] Research machine count and org limits — any hard caps that would block SaaS growth?
+- [x] Research pricing — always-on vs stop-on-idle machine costs at 10, 100, 1000 tenants. Include volume and egress costs.
+- [x] Identify any documented showstoppers.
+
+## Findings
+
+### 1. Core API Primitives
+
+Base URL: `https://api.machines.dev` (or `http://_api.internal:4280` from within 6PN).
+Auth: `Authorization: Bearer <fly_api_token>`.
+
+**Machine lifecycle** — full REST API:
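+- `POST /v1/apps/{app}/machines` — create (starts the machine by default; set `skip_launch: true` to create without starting)
+- `POST /v1/apps/{app}/machines/{id}/start` — start a stopped machine (~10 ms same-region)
+- `POST /v1/apps/{app}/machines/{id}/stop` — stop (SIGINT by default, SIGKILL after the kill timeout; attached volumes are retained, the root filesystem is reset from the image on next start)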
+- `POST /v1/apps/{app}/machines/{id}/suspend` — snapshot RAM to disk for fast resume
+- `DELETE /v1/apps/{app}/machines/{id}` — destroy (irreversible)
+- `GET /v1/apps/{app}/machines/{id}/wait?state=started` — synchronize on state transitions
+
+Machine states: `created → started → stopped/suspended → destroyed`.
+Leases (`POST .../lease`) provide exclusive mutation locks — useful for orchestration.
+
+**Rate limits**: 1 req/s per action per machine/app ID (burst to 3). Matters for rapid tenant provisioning.
+
+### 2. Isolation Model
+
+Each Fly Machine is a **Firecracker microVM** — a separate Linux kernel, not a container. Defense in depth:
+1. KVM hardware-enforced memory and CPU isolation
+2. Minimal device model (5 virtual devices vs QEMU's hundreds)
+3. Rust VMM implementation (memory-safe by default, which rules out most C-style memory-safety bugs in the VMM)
+4. `seccomp-bpf` limits Firecracker process to ~40 syscalls with argument filters
+5. Jailer chroots + namespaces + drops privileges around the Firecracker process
+
+From official docs: *"MicroVMs provide strong hardware-virtualization-based security and workload isolation, which allows us to safely run applications from different customers on shared hardware."* Full VM isolation prevents kernel sharing between apps.
+
+Tenants have full root inside their VM by design — the kernel boundary contains blast radius.
+
+**Claims requiring independent verification before production use:**
+- Whether SMT/hyperthreading is disabled on hosts (directly relevant to Spectre/MDS side-channel attacks — Firecracker's own docs recommend disabling SMT for strict multi-tenancy, but Fly.io does not publicly document this)
+- CPU dedication is explicitly described as "best-effort", not a hard guarantee
+- Pentest scope/dates/findings for three named firms (Atredis Partners, Doyensec, Tetrel) are not published
+- Whether the SOC 2 Type II report scope covers the Firecracker isolation layer specifically
+
+**Compliance**: SOC 2 Type II certified (report available on request), ISO 27001 datacenters (Equinix), HIPAA BAA available, GDPR DPA available.
+
+### 3. Network Isolation
+
+Each machine gets a private IPv6 (6PN) address. Key isolation controls:
+- Cross-organization: Fly.io blocks all cross-org traffic at the platform level, which makes this a strong boundary
+- Intra-organization: **open by default** — any machine in the same org can reach any other
+
+For multi-tenant SaaS, this means tenant machines in the same Fly.io org are NOT network-isolated from each other unless you use **Custom Private Networks (6PNs)**:
+- `POST /v1/apps` with a `network` field assigns that app to an isolated 6PN
+- Apps on different 6PNs cannot reach each other via private networking (only via public IPs)
+- **Assignment is permanent** — cannot be changed after app creation; plan upfront
+
+Stable machine addressing: use `<machine_id>.vm.<appname>.internal` rather than raw 6PN addresses, which change on migration.
+
+### 4. Cold Start Times
+
+| Scenario | Documented Latency |
+|---|---|
+| Cold boot (create + start, same region) | ~300 ms |
+| Start existing stopped machine (same region) | ~10 ms |
+| Start stopped machine (cross-region) | ~150 ms |
+| Resume from suspend (same region) | Sub-100 ms (implied) |
+
+Community-observed: 400–600 ms end-to-end (including app init) when starting a stopped machine.
+FLAME workloads report 3–8s in some restart-race conditions.
+
+Real latency numbers with our actual image size require a test account — covered by spike 408.
+
+### 5. Persistent Volume Support
+
+- Volumes are created via `POST /v1/apps/{app}/volumes` with `size_gb` (default 3 GB), region, encryption flag
+- Attached to machine via `config.mounts[].volume` at create/update time
+- **1:1 constraint**: one volume per machine, one machine per volume, same region required
+- Volumes persist across machine stop/start/suspend/destroy — they are a separate resource
+- Can extend volume online (`PUT .../volumes/{id}/extend`)
+- Volume snapshots available (billed at $0.08/GB/month as of Jan 2026)
+- No documented per-org volume count cap (separate from machine cap)
+
+For per-tenant `~/.claude/` home directories, attach one volume per tenant machine — straightforward.
+
+### 6. Secret Injection
+
+Four methods, in order of recommendation for sensitive credentials:
+
+1. **Fly Secrets** (`fly secrets set KEY=value`) — encrypted at rest, injected as env vars at boot to all machines in the app. **Secrets are per-app, not per-machine** — all machines in an app share the same secret set. For per-tenant isolated secrets, each tenant needs its own app (or use method 2).
+
+2. **`config.files` with `secret_name`** — writes a named secret to a file path inside the machine at start time:
+   ```json
+   {"guest_path": "/root/.claude/.credentials.json", "secret_name": "TENANT_CREDENTIALS"}
+   ```
+   This is the right approach for per-tenant `~/.claude/.credentials.json` if tenants share an app — pair with `ignore_app_secrets: true` and per-process secret scoping.
+
+3. **`config.env`** — plain env vars in machine config, not encrypted at rest. Non-sensitive config only.
+
+4. **`config.processes[].secrets`** — inject named secrets only to specific process groups; `ignore_app_secrets: true` prevents inheritance of app-level secrets.
+
+**Recommended architecture**: One app per tenant (isolated 6PN + isolated secrets) is the cleanest security model. Secrets stored per app via Fly Secrets, credentials file written via `config.files` at boot.
+
+### 7. Machine Count and Org Limits
+
+| Limit | Default | Hard Cap |
+|---|---|---|
+| Machines per org (all states) | 50 | None architectural |
+
+- The 50-machine default is a **fail-safe**, not an architectural limit. Fly.io hosts customers running 100,000+ machines.
+- To raise: email `billing@fly.io` with requirements.
+- **This limit will be hit immediately in any real multi-tenant deployment** — must budget for an early limit-raise request before launching.
+- API rate limit of 1 req/s per action also needs consideration for bulk tenant provisioning scripts.
+
+### 8. Pricing (as of March 2026)
+
+**Compute (per second, billed only while running):**
+
+| Preset | Per month (always-on) |
+|---|---|
+| shared-cpu-1x (256 MB) | $2.05 |
+| shared-cpu-2x (512 MB) | $4.10 |
+| performance-1x (2 GB) | $32.64 |
+
+- **Storage**: $0.15/GB/month (provisioned, regardless of machine state)
+- **Egress**: $0.02/GB (North America/Europe), $0.04/GB (APAC/SA), $0.12/GB (Africa/India)
+- **Dedicated IPv4**: $2.00/month per app (shared IPv6 is free)
+
+**No free tier** for new orgs (eliminated 2024). No minimum spend, no base fee.
+
+**Monthly cost estimates** (1x shared-cpu-1x, 1 GB volume, 1 GB egress/tenant, US East):
+
+| Scenario | Per Tenant | 10 Tenants | 100 Tenants | 1,000 Tenants |
+|---|---|---|---|---|
+| Always-on (730h/month) | $2.22 | $22 | $222 | $2,220 |
+| Autostop, 8h/day active | $0.92 | $9 | $92 | $920 |
+| Autostop, 2h/day active | $0.53 | $5 | $53 | $530 |
+
+At scale, volume storage becomes a major share of the bill when machines are mostly idle. At 1,000 tenants autostopped, storage is ~$150/month vs compute of $170–$370/month.
+
+### 9. Showstoppers
+
+**None identified** that rule it out. The following require action before launch:
+
+| Risk | Severity | Mitigation |
+|---|---|---|
+| Default 50-machine org cap | High (blocks launch) | Email billing@fly.io early; no architectural cap |
+| SMT/hyperthreading not documented | Medium (security) | Request confirmation from Fly.io support before production; only partially mitigated by VM-level isolation, since side channels can cross sibling hyperthreads |
+| Intra-org network open by default | Medium (security) | Use one app per tenant with custom 6PNs |
+| Secrets are per-app not per-machine | Low | Use one app per tenant or `config.files` with `secret_name` |
+| Volume and machine must be same region | Low (ops) | Enforce region consistency in provisioning code |
+| API rate limit 1 req/s per action per machine/app ID (burst to 3) | Low | Throttle bulk provisioning loops |
+
+## Recommendation
+
+**Proceed.** Fly.io Machines are a viable isolation layer for multi-tenant storkit SaaS.
+
+**Architecture to validate in spike 408:**
+- One Fly.io app per tenant (provides 6PN network isolation + isolated secrets)
+- One Firecracker microVM per tenant app (shared-cpu-1x 256 MB baseline; adjust per observed usage)
+- One persistent volume per tenant (1 GB baseline for `~/.claude/`, repos, storkit state)
+- Autostop/autoresume enabled — roughly 67–92% compute cost reduction vs always-on at 2–8 active hours/day, typical for dev tool usage
+- Tenant credentials injected via `config.files` + Fly Secrets at machine start
+
+**Pricing verdict**: Workable at early SaaS scale. With autostop (8h/day active), costs are ~$92/month at 100 tenants and ~$920/month at 1,000 tenants. Margins are viable if per-tenant pricing is $5–$20/month.
+
+**Before production**: Confirm with Fly.io support whether SMT is disabled on worker hosts. Request that the org machine limit be raised to 200–500 during private beta.
+
+**Spike 408 scope**: Validate cold start latency, autostop resume behavior, and volume persistence with a real test machine running the storkit container image.