siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.

[feature] Multi-CSP VPC & ASG bootstrapping

rmvangun opened this issue

Problem Description

Very little CSP-specific code is needed to launch a production-ready Talos cluster on any CSP. Why not handle that for everyone?

  • Create the VPC, ASG, and all necessary networking components according to best practices for security and availability
  • Version the CSP layer and allow one-click upgrades from Omni
  • Allow for very basic options... Internet Gateway? NAT*? Zone selection? Some autoscaling options? Gah, the instance types 🤔

*Could you forward-proxy via WireGuard to Omni to effectively "airgap" the cluster in a VPC?

Solution

Maybe do it in Terraform and pay HashiCorp their license fees. Warning on drift and being able to version the Terraform are key, I think. Open-source the Terraform modules and let folks deploy them however they want, but permit that from Omni as well. The UI could be pretty minimal, with a terminal view (same look/feel as the log outputs already in Omni) to see the raw plan/apply if you care to look at it. Bootstrap the remote state using the CSP-native option (S3, etc.).
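To make the drift warning a bit more concrete, here is a rough Python sketch of how a wrapper could detect it. It relies on `terraform plan -detailed-exitcode`, which exits with 0 when the plan is empty, 1 on error, and 2 when there are changes to apply. The names `PlanStatus`, `classify_plan_exit_code`, and `check_drift` are made up for illustration and are not part of Omni or Terraform. Note that a non-empty plan covers both real drift and config changes that haven't been applied yet.

```python
import subprocess
from enum import Enum


class PlanStatus(Enum):
    """Hypothetical status values a UI could surface next to a cluster."""
    IN_SYNC = "in_sync"
    DRIFTED = "drifted"
    ERROR = "error"


def classify_plan_exit_code(code: int) -> PlanStatus:
    """Map a `terraform plan -detailed-exitcode` exit code to a status.

    0 = plan succeeded, no changes
    1 = error
    2 = plan succeeded, changes present
    """
    if code == 0:
        return PlanStatus.IN_SYNC
    if code == 2:
        return PlanStatus.DRIFTED
    return PlanStatus.ERROR


def check_drift(workdir: str) -> PlanStatus:
    """Run a non-interactive plan in `workdir` and classify the result.

    Assumes the `terraform` binary is on PATH and `terraform init`
    has already been run in `workdir`.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

The captured stdout from the same call is what a terminal view could show as the raw plan.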

I believe all the major CSPs provide some OIDC option, so you can auth Omni to each CSP given a documented role/policy and you're off.
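As one example of what a "documented role/policy" could look like, here is a sketch of an AWS IAM trust policy for OIDC federation, built as a Python dict. The policy shape (a `Federated` principal plus `sts:AssumeRoleWithWebIdentity`) is the standard AWS one. The issuer host and audience are pure assumptions, since this presumes Omni would expose an OIDC issuer, and `oidc_trust_policy` is a made-up helper name.

```python
import json


def oidc_trust_policy(account_id: str, issuer_host: str, audience: str) -> dict:
    """Build an IAM role trust policy that trusts a given OIDC issuer.

    `issuer_host` is the issuer URL without the https:// prefix and is a
    hypothetical value here; an IAM OIDC identity provider for it must
    already exist in the account.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer_host}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                # Only accept tokens minted for the expected audience.
                "Condition": {"StringEquals": {f"{issuer_host}:aud": audience}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID, issuer, and audience for illustration only.
    policy = oidc_trust_policy("123456789012", "omni.example.com", "sts.amazonaws.com")
    print(json.dumps(policy, indent=2))
```

The other CSPs have their own equivalents (workload identity federation and the like), so each would need its own documented snippet.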

State resolution could be a pain if things go awry. I wonder if Terraform Stacks will make things smoother here. I haven't looked into it much, but I'm assuming it may handle multi-stage rollout/rollback more like CloudFormation stacks, which would give you a higher degree of reliability in the deployment.

Alternative Solutions

  • Use Terraform CDK vs. HCL?
  • Use CSP-native solutions (CloudFormation 🤮)
  • Pulumi, Crossplane, etc... but honestly the history and staying power of Terraform feels nicer, and you don't need anything fancy, just get that VPC and ASGs rolled out

Notes

Even if there isn't a ton of code associated with each CSP, I still think it would be valuable to handle this from Omni. Multi-cloud means multi-know-how. Going from not knowing a thing about a CSP to having a production-ready k8s cluster actually seems possible.

The magic of Omni/Talos is great, but there's all this dang setup before you can really deploy it anywhere. Ironically, I have an easier time building bare metal clusters with Talos than I do cloud ones (tells ya sumthin).