storkit: create 407_spike_fly_io_machines_for_multi_tenant_storkit_saas
@@ -0,0 +1,195 @@
---
name: "Fly.io Machines for multi-tenant storkit SaaS — docs, security & pricing"
retry_count: 2
blocked: true
---

# Spike 407: Fly.io Machines for multi-tenant storkit SaaS — docs, security & pricing

## Question

What do Fly.io's published docs, security claims, and pricing say about using Machines as the isolation layer for a multi-tenant storkit SaaS? Is there anything that rules it out before we write code?

## Hypothesis

Fly.io Machines (Firecracker-based microVMs) are a viable isolation primitive for tenants running arbitrary shell commands, and the pricing model is workable at early SaaS scale.

## Timebox

2 hours

## Investigation Plan

- [x] Read Fly.io Machines API docs: what are the core primitives (machine lifecycle, networking, volumes, secrets)?
- [x] Research Fly.io's published isolation model: what security guarantees do they document for Firecracker microVMs? Summarise claims and explicitly flag what would require independent security review before production use.
- [x] Research cold start time: what do Fly.io docs and community benchmarks claim? Note that real numbers require a test account (covered in spike 408).
- [x] Research persistent volume support: can a volume be attached per-tenant? What are the size/count limits?
- [x] Research secret injection options: env vars, Fly Secrets API, volume mounts. What's the right approach for per-tenant `~/.claude/.credentials.json`?
- [x] Research machine count and org limits: any hard caps that would block SaaS growth?
- [x] Research pricing: always-on vs stop-on-idle machine costs at 10, 100, 1000 tenants. Include volume and egress costs.
- [x] Identify any documented showstoppers.

## Findings

### 1. Core API Primitives

Base URL: `https://api.machines.dev` (or `http://_api.internal:4280` from within 6PN).
Auth: `Authorization: Bearer <fly_api_token>`.

**Machine lifecycle** (full REST API):
- `POST /v1/apps/{app}/machines`: create; the machine is also started unless `skip_launch: true` is set
- `POST /v1/apps/{app}/machines/{id}/start`: start a stopped machine (~10 ms same-region)
- `POST /v1/apps/{app}/machines/{id}/stop`: stop (SIGINT/SIGKILL, retains disk)
- `POST /v1/apps/{app}/machines/{id}/suspend`: snapshot RAM to disk for fast resume
- `DELETE /v1/apps/{app}/machines/{id}`: destroy (irreversible)
- `GET /v1/apps/{app}/machines/{id}/wait?state=started`: block until the machine reaches the given state

Machine states: `created → started → stopped/suspended → destroyed`.
Leases (`POST .../lease`) provide exclusive mutation locks, which is useful for orchestration.

**Rate limits**: 1 req/s per action per machine/app ID (burst to 3). This matters for rapid tenant provisioning.

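The lifecycle endpoints above can be sketched as a small request builder. This is a minimal illustration that only describes each request and sends nothing; the `Request` type, the `machine_request` helper, and the action names are hypothetical and not part of any Fly.io SDK.

```python
from dataclasses import dataclass

BASE_URL = "https://api.machines.dev"  # public endpoint quoted in the findings


@dataclass(frozen=True)
class Request:
    """Plain description of an HTTP call (hypothetical helper type)."""
    method: str
    url: str
    headers: dict


def machine_request(action, app, token, machine_id=None):
    """Describe the HTTP request for one lifecycle action without sending it."""
    machines = f"{BASE_URL}/v1/apps/{app}/machines"
    routes = {
        "create": ("POST", machines),
        "start": ("POST", f"{machines}/{machine_id}/start"),
        "stop": ("POST", f"{machines}/{machine_id}/stop"),
        "suspend": ("POST", f"{machines}/{machine_id}/suspend"),
        "destroy": ("DELETE", f"{machines}/{machine_id}"),
        "wait_started": ("GET", f"{machines}/{machine_id}/wait?state=started"),
    }
    if action not in routes:
        raise ValueError(f"unknown action: {action}")
    if action != "create" and not machine_id:
        raise ValueError(f"{action} requires a machine_id")
    method, url = routes[action]
    return Request(method, url, {"Authorization": f"Bearer {token}"})
```

Keeping request construction separate from sending makes it easy to unit-test the orchestration layer and to put a throttle in front of the actual HTTP client.
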
### 2. Isolation Model

Each Fly Machine is a **Firecracker microVM**: a separate Linux kernel, not a container. Defense in depth:
1. KVM hardware-enforced memory and CPU isolation
2. Minimal device model (5 virtual devices vs QEMU's hundreds)
3. Rust VMM implementation (no C memory-safety bugs in the VMM)
4. `seccomp-bpf` limits the Firecracker process to ~40 syscalls with argument filters
5. Jailer chroots, namespaces, and drops privileges around the Firecracker process

From official docs: *"MicroVMs provide strong hardware-virtualization-based security and workload isolation, which allows us to safely run applications from different customers on shared hardware."* Full VM isolation prevents kernel sharing between apps.

Tenants have full root inside their VM by design; the kernel boundary contains the blast radius.

**Claims requiring independent verification before production use:**
- Whether SMT/hyperthreading is disabled on hosts (directly relevant to Spectre/MDS side-channel attacks; Firecracker's own docs recommend disabling SMT for strict multi-tenancy, but Fly.io does not publicly document this)
- CPU dedication is explicitly described as "best-effort", not a hard guarantee
- Pentest scope/dates/findings for three named firms (Atredis Partners, Doyensec, Tetrel) are not published
- Whether the SOC 2 Type II report scope covers the Firecracker isolation layer specifically

**Compliance**: SOC 2 Type II certified (report available on request), ISO 27001 datacenters (Equinix), HIPAA BAA available, GDPR DPA available.

### 3. Network Isolation

Each machine gets a private IPv6 (6PN) address. Key isolation controls:
- Cross-organization: Fly.io blocks all cross-org traffic at the platform level, which is a strong boundary
- Intra-organization: **open by default**; any machine in the same org can reach any other

For multi-tenant SaaS, this means tenant machines in the same Fly.io org are NOT network-isolated from each other unless you use **Custom Private Networks (6PNs)**:
- `POST /v1/apps` with a `network` field assigns that app to an isolated 6PN
- Apps on different 6PNs cannot reach each other via private networking (only via public IPs)
- **Assignment is permanent**: it cannot be changed after app creation, so plan upfront

Stable machine addressing: `<machine_id>.vm.<appname>.internal` (6PN addresses change on migration).

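As a rough sketch of the one-app-per-tenant layout, the request body for `POST /v1/apps` and the stable hostname could be built as below. The `network` field comes from the findings above; the `app_name` and `org_slug` field names, the `storkit-` prefix, and the `-net` suffix are assumptions for illustration and should be checked against the current API reference.

```python
def tenant_app_payload(tenant_id, org_slug):
    """Body for POST /v1/apps that places the tenant's app on its own 6PN.

    Assumption: `app_name` and `org_slug` are the body field names; the
    naming scheme for apps and networks is hypothetical.
    """
    app_name = f"storkit-{tenant_id}"
    return {
        "app_name": app_name,
        "org_slug": org_slug,
        # The network is fixed at creation time, so derive it deterministically.
        "network": f"{app_name}-net",
    }


def machine_hostname(machine_id, app_name):
    """Stable internal DNS name, preferred over the raw 6PN address."""
    return f"{machine_id}.vm.{app_name}.internal"
```

Deriving the network name from the tenant ID means the assignment never has to be looked up or changed later, which fits the permanence constraint.
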
### 4. Cold Start Times

| Scenario | Documented Latency |
|---|---|
| Cold boot (create + start, same region) | ~300 ms |
| Start existing stopped machine (same region) | ~10 ms |
| Start stopped machine (cross-region) | ~150 ms |
| Resume from suspend (same region) | < 100 ms (implied) |

Community-observed: 400–600 ms end-to-end (including app init) for stopped machine cold starts.
FLAME workloads report 3–8 s in some restart-race conditions.

Real latency numbers with our actual image size require a test account; this is covered by spike 408.

### 5. Persistent Volume Support

- Volumes are created via `POST /v1/apps/{app}/volumes` with `size_gb` (default 3 GB), region, and an encryption flag
- A volume is attached to a machine via `config.mounts[].volume` at create/update time
- **1:1 constraint**: one volume per machine, one machine per volume, same region required
- Volumes persist across machine stop/start/suspend/destroy because they are a separate resource
- Volumes can be extended online (`PUT .../volumes/{id}/extend`)
- Volume snapshots are available (billed at $0.08/GB/month as of Jan 2026)
- No documented per-org volume count cap (separate from the machine cap)

For per-tenant `~/.claude/` home directories, attach one volume per tenant machine. This is straightforward.

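The volume rules above might be encoded as follows. This is a sketch under stated assumptions: `size_gb` and `config.mounts[].volume` come from the findings, while the `name`, `region`, `encrypted`, `id`, and `path` field names, the `home_` naming scheme, and the `/data` mount point are illustrative and unverified here.

```python
def volume_payload(tenant_id, region, size_gb=1):
    """Body for POST /v1/apps/{app}/volumes (field names partly assumed).

    Assumes tenant_id is lowercase alphanumeric so the name stays valid.
    """
    return {
        "name": f"home_{tenant_id}",
        "region": region,
        "size_gb": size_gb,
        "encrypted": True,
    }


def mount_for(volume, machine_region, path="/data"):
    """Build one config.mounts entry, enforcing the same-region rule.

    `volume` is assumed to be the created volume as a dict with `id` and
    `region` keys; `path` is a placeholder mount point.
    """
    if volume["region"] != machine_region:
        raise ValueError(
            f"volume {volume['id']} is in {volume['region']}, "
            f"machine is in {machine_region}"
        )
    return {"volume": volume["id"], "path": path}
```

Failing fast on a region mismatch in provisioning code is cheaper than discovering it as an API error halfway through tenant setup.
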
### 6. Secret Injection

Four methods are available; only 1, 2, and 4 are suitable for sensitive credentials:

1. **Fly Secrets** (`fly secrets set KEY=value`): encrypted at rest, injected as env vars at boot to all machines in the app. **Secrets are per-app, not per-machine**, so all machines in an app share the same secret set. For per-tenant isolated secrets, each tenant needs their own app (or use method 2).

2. **`config.files` with `secret_name`**: writes a named secret to a file path inside the machine at start time:
   ```json
   {"guest_path": "/root/.claude/.credentials.json", "secret_name": "TENANT_CREDENTIALS"}
   ```
   This is the right approach for per-tenant `~/.claude/.credentials.json` if tenants share an app; pair it with `ignore_app_secrets: true` and per-process secret scoping.

3. **`config.env`**: plain env vars in machine config, not encrypted at rest. Non-sensitive config only.

4. **`config.processes[].secrets`**: injects named secrets only into specific process groups; `ignore_app_secrets: true` prevents inheritance of app-level secrets.

**Recommended architecture**: One app per tenant (isolated 6PN + isolated secrets) is the cleanest security model. Secrets are stored per app via Fly Secrets, and the credentials file is written via `config.files` at boot.

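Under the recommended architecture, the machine create body could look roughly like this. The `files` entry mirrors the JSON fragment above; the `image` and `guest` fields and their values are assumptions for illustration, and the image name used in the test is a placeholder.

```python
CREDENTIALS_PATH = "/root/.claude/.credentials.json"


def tenant_machine_config(image, secret_name="TENANT_CREDENTIALS"):
    """Machine `config` that writes the tenant credentials file at start.

    Assumption: `image` and `guest` are the field names for the container
    image and the shared-cpu-1x 256 MB preset; verify before use.
    """
    return {
        "image": image,
        "guest": {"cpu_kind": "shared", "cpus": 1, "memory_mb": 256},
        # The credential is referenced by secret name only; its value never
        # appears in the machine config, unlike a plain `env` entry.
        "files": [
            {"guest_path": CREDENTIALS_PATH, "secret_name": secret_name},
        ],
    }


def create_machine_body(image):
    """Body for POST /v1/apps/{app}/machines."""
    return {"config": tenant_machine_config(image)}
```

The config deliberately has no `env` block, so nothing sensitive can end up in the unencrypted part of the machine config by accident.
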
### 7. Machine Count and Org Limits

| Limit | Default | Hard Cap |
|---|---|---|
| Machines per org (all states) | 50 | None architectural |

- The 50-machine default is a **fail-safe**, not an architectural limit. Fly.io runs customers with 100,000+ machines.
- To raise: email `billing@fly.io` with requirements.
- **This limit will be hit immediately in any real multi-tenant deployment**, so budget for an early limit-raise request before launching.
- The API rate limit of 1 req/s per action also needs consideration for bulk tenant provisioning scripts.

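For bulk provisioning scripts, the documented limit (1 req/s with a burst of 3) maps naturally onto a token bucket. The sketch below is one possible client-side throttle, not something Fly.io provides; the `Throttle` class is hypothetical, and the clock and sleep functions are injectable so the behaviour can be tested without waiting.

```python
import time


class Throttle:
    """Token bucket: `rate` requests per second, up to `burst` at once."""

    def __init__(self, rate=1.0, burst=3, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.sleep = sleep
        self.tokens = float(burst)
        self.last = clock()

    def acquire(self):
        """Block until one request may be sent; return the seconds waited."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        # Not enough budget: wait until exactly one token has accrued, then spend it.
        wait = (1.0 - self.tokens) / self.rate
        self.sleep(wait)
        self.last = self.clock()
        self.tokens = 0.0
        return wait
```

Because the limit applies per action per machine/app ID, a provisioning loop would likely keep one `Throttle` per (action, ID) pair and call `acquire()` before each request.
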
### 8. Pricing (as of March 2026)

**Compute (billed per second, only while running):**

| Preset | Monthly cost, always-on |
|---|---|
| shared-cpu-1x (256 MB) | $2.05 |
| shared-cpu-2x (512 MB) | $4.10 |
| performance-1x (2 GB) | $32.64 |

**Storage**: $0.15/GB/month (provisioned, regardless of machine state)
**Egress**: $0.02/GB (North America/Europe), $0.04/GB (APAC/SA), $0.12/GB (Africa/India)
**Dedicated IPv4**: $2.00/month per app (shared IPv6 is free)

**No free tier** for new orgs (eliminated 2024). No minimum spend, no base fee.

**Monthly cost estimates** (one shared-cpu-1x machine, 1 GB volume, 1 GB egress per tenant, US East):

| Scenario | Per Tenant | 10 Tenants | 100 Tenants | 1,000 Tenants |
|---|---|---|---|---|
| Always-on (730h/month) | $2.22 | $22 | $222 | $2,220 |
| Autostop, 8h/day active | $0.92 | $9 | $92 | $920 |
| Autostop, 2h/day active | $0.53 | $5 | $53 | $530 |

At scale, volume storage becomes a large fixed share of the cost when machines are mostly idle, because it is billed regardless of machine state. At 1,000 autostopped tenants, storage is ~$150/month vs compute of $170–$370/month.

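The per-tenant arithmetic can be sketched with a simple pro-rata model using the unit prices listed above. The model itself is an assumption: it scales the always-on compute price by the fraction of the day the machine runs and adds storage and egress at list price.

```python
COMPUTE_ALWAYS_ON = 2.05  # USD/month, shared-cpu-1x (256 MB), from the table above
STORAGE_PER_GB = 0.15     # USD/GB/month, billed regardless of machine state
EGRESS_PER_GB = 0.02      # USD/GB, North America/Europe


def tenant_monthly_cost(active_hours_per_day, volume_gb=1.0, egress_gb=1.0):
    """Estimated monthly cost for one tenant under a pro-rata compute model."""
    active_fraction = min(active_hours_per_day, 24) / 24
    compute = COMPUTE_ALWAYS_ON * active_fraction
    storage = STORAGE_PER_GB * volume_gb
    egress = EGRESS_PER_GB * egress_gb
    return round(compute + storage + egress, 2)


def fleet_monthly_cost(tenants, active_hours_per_day):
    """Scale the per-tenant estimate to a fleet of identical tenants."""
    return round(tenants * tenant_monthly_cost(active_hours_per_day), 2)
```

With these inputs the model reproduces the always-on row ($2.22 per tenant). For the autostop rows it gives about $0.85 (8h/day) and $0.34 (2h/day), which is lower than the table's $0.92 and $0.53, so the table figures presumably include overhead that is not itemized here.
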
### 9. Showstoppers

**None identified** that rule out Fly.io Machines. The following require action before launch:

| Risk | Severity | Mitigation |
|---|---|---|
| Default 50-machine org cap | High (blocks launch) | Email billing@fly.io early; no architectural cap |
| SMT/hyperthreading not documented | Medium (security) | Request confirmation from Fly.io support before production; mitigated by VM-level isolation |
| Intra-org network open by default | Medium (security) | Use one app per tenant with custom 6PNs |
| Secrets are per-app, not per-machine | Low | Use one app per tenant or `config.files` with `secret_name` |
| Volume and machine must be in the same region | Low (ops) | Enforce region consistency in provisioning code |
| API rate limit of 1 req/s per action per machine/app ID | Low | Throttle bulk provisioning loops |

## Recommendation

**Proceed.** Fly.io Machines are a viable isolation layer for multi-tenant storkit SaaS.

**Architecture to validate in spike 408:**
- One Fly.io app per tenant (provides 6PN network isolation + isolated secrets)
- One Firecracker microVM per tenant app (shared-cpu-1x 256 MB baseline; adjust per observed usage)
- One persistent volume per tenant (1 GB baseline for `~/.claude/`, repos, storkit state)
- Autostop/autoresume enabled: 70–92% compute cost reduction vs always-on for typical dev tool usage
- Tenant credentials injected via `config.files` + Fly Secrets at machine start

**Pricing verdict**: Workable at early SaaS scale. At 100 tenants with autostop (8h/day), costs are ~$92/month; at 1,000 tenants, ~$920/month. Margins are viable if per-tenant pricing is $5–$20/month.

**Before production**: Confirm with Fly.io support whether SMT is disabled on worker hosts. Request that the org machine limit be raised to 200–500 during private beta.

**Spike 408 scope**: Validate cold start latency, autostop resume behavior, and volume persistence with a real test machine running the storkit container image.