CloudFormation for Workers

Question

CloudFormation for Workers

asymmetric opened this issue 7 years ago · comments

We decided to try out CloudFormation on the workers ASG. The initial implementation will deal with scaling the workers up/down.

Benefits:

Declarative
Easier to do rolling updates when the CC changes
Allows us to simplify the operator

Cons:

Rolling updates are only triggered on LC change. How do we tie the 2 CCs together, so that the LC changes when the big CC changes?
Passing in variable-sized parameters (e.g. ELB listeners) is very cumbersome. We might want to use a Go template with proper for loops, instead of relying on AWS' broken system.

Limits:

http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cloudformation-limits.html

asymmetric · Answer 1 · Tue Jun 20 2017 00:02:53 GMT+0800 (China Standard Time)

Reading the docs, it seems like rolling updates only get triggered on a few events, notably when the LC changes.

This is not what will happen in our case though, as what will change is the big CC, not the one we pass as UserData.

Next step is investigating whether we can trigger a rolling update anyway.

/cc @rossf7

Ross Fairbanks · Answer 2 · Tue Jun 20 2017 00:23:36 GMT+0800 (China Standard Time)

For upgrading clusters we may need to version the large cloudconfig. If so can we change the small cloudconfig so it's different each time? If we can I'd like to trigger the rolling update the standard way.

Good to know about the limits. There are bound to be some downsides over using the Go SDK directly.

asymmetric · Answer 3 · Tue Jun 20 2017 00:33:12 GMT+0800 (China Standard Time)

Versioning sounds like a good idea.

By the way, won't the work in https://github.com/giantswarm/giantswarm/pull/1541 obviate our need to have 2 CCs? If certs are delivered through some other mean (i.e. not in the CC), then maybe we won't hit the 16KB limit in the first place.

Joe Salisbury · Answer 4 · Tue Jun 20 2017 00:41:19 GMT+0800 (China Standard Time)

Putting the large cloudconfig hash in the small one would make sense.

asymmetric · Answer 5 · Tue Jun 20 2017 17:30:10 GMT+0800 (China Standard Time)

@rossf7 as you mentioned in https://github.com/giantswarm/giantswarm/pull/1541#discussion_r122764100, yes, I think we could simply download the certs from S3 from within the small CC.

asymmetric · Answer 6 · Tue Jun 20 2017 18:19:59 GMT+0800 (China Standard Time)

I think this change to the how we fetch certs should be done as part of this ticket. Either that or versioning both CCs.

That's because IMO the most meaningful way to test whether CF works for us is to use it for rolling updates in our ASG.

WDYT?

Ross Fairbanks · Answer 7 · Tue Jun 20 2017 18:34:10 GMT+0800 (China Standard Time)

Totally agree on needing to give CF a meaningful test. I think we should version both CCs rather than downloading the certs separately.

Downloading the certs is only needed for rolling certs which has lower priority. It also has a bigger impact on KVM.

asymmetric · Answer 8 · Tue Jun 27 2017 19:54:18 GMT+0800 (China Standard Time)

Status Update

We create a template that contains the LC and the ASG. Other inputs (the LB, the SG rules etc) are still created via the API, as I haven't found an easy way to pass in arguments with variable length (e.g., the SG rules).

To work around that, we might want to run the AWS template through Go's templating system, which would allow us to use iterators.

The creation of the stack via the template completes successfully.

asymmetric · Answer 9 · Tue Jun 27 2017 22:53:44 GMT+0800 (China Standard Time)

Issue

How does the aws-operator know when a cluster is running an old CC, and therefore needs to be updated?

One option is using annotations. Basically, in the onAdd handler (which also runs on each resyncPeriod) we:

produce the big CC
calculate its checksum
check if the cluster already exists
if no:
- annotate the cluster with the checksum
if yes:
- compare the checksums
- proceed with updating the ASG if needed

Ross Fairbanks · Answer 10 · Tue Jun 27 2017 23:27:13 GMT+0800 (China Standard Time)

We want to go with the checksum approach for detecting if the CC has changed. For storage there are 2 options. Add the checksum to the TPO itself or use an annotation.

The benefit of adding it to the cluster TPO is its part of the state of the cluster. This means the operator is updating TPOs it manages. Discussed this with @xh3b4sd and we don't think this is a problem.

So I'd rather go with this instead of the annotation. Let me know if this causes major problems or is going to be a lot more work.

Paweł Kopiczko · Answer 11 · Wed Jun 28 2017 02:29:11 GMT+0800 (China Standard Time)

@rossf7 updating TPO might be a problem for "consistent deletes". We want to store the content of the TPO in a configmap. Maybe handling the update function can help, but I'm not sure yet. Just keep that in mind.

xh3b4sd · Answer 12 · Wed Jun 28 2017 05:29:30 GMT+0800 (China Standard Time)

Can the name of an TPO be updated? I only see this case breaking the story. It is also that the configmap tracking TPOs to be deleted can be updated on update events. So we should be good.

Ross Fairbanks · Answer 13 · Wed Jun 28 2017 15:01:33 GMT+0800 (China Standard Time)

The name of the TPO shouldn't change just the checksum. Yes we should update the config map tracking TPOs when we implement the consistent deletes.

asymmetric · Answer 14 · Wed Jun 28 2017 18:47:17 GMT+0800 (China Standard Time)

OK.

I think it could make sense later on to split the whole creation of the CC out of the operator, and into its own service. That way, the CC could simply be checksummed and uploaded to an S3 bucket, and the aws-operator would just fetch it and be done with it.

asymmetric · Answer 15 · Wed Jun 28 2017 21:58:19 GMT+0800 (China Standard Time)

I think a pre-condition for this is making the operator compliant with operatorkit.

I say this because we have to start reconciling the state in the event handlers (e.g. looking at existing clusters, comparing the checksums, making necessary changes) and doing this without first moving to the operatorkit model would be a waste of resources IMO.

So if you have nothing against it, I'm going to focus the rest of the week on other issues in the backlog.

Ross Fairbanks · Answer 16 · Wed Jun 28 2017 23:01:13 GMT+0800 (China Standard Time)

Yes I think we should use the reconciler support in operatorkit. But I think upgrading aws-operator to use the latest operatorkit needs to be a separate change.

We still need to evaluate how much Cloud Formation helps but also if it introduces new problems. We should do handover on that. In the meantime working on backlog issues makes sense.

asymmetric · Answer 17 · Thu Jun 29 2017 20:30:53 GMT+0800 (China Standard Time)

Status Update

Verifying that the stack was created

We use the builtin WaitUntilStackCreateComplete function. In a previous commit, I implemented it myself. The advantage was that we had more control over the timeout and logging.

Rolling update of the ASG

Rolling updates in an ASG are only triggered when there's a change to any of the parameters of its LaunchConfiguration, and provided that the ASG has enabled the UpdatePolicy: AutoScalingRollingUpdate property.

One simple way to test that this works in the current branch is to go to the page of the Stack created by the operator, and update it by changing one of its parameters (e.g. AssociatePublicIPAddress).

The parameter we are most interested in is UserData, but updating that from the operator is too big of a task for the time I have left.

asymmetric · Answer 18 · Fri Jun 30 2017 00:38:01 GMT+0800 (China Standard Time)

General feedback on using CloudFormation

Right now, we only use the template for the ASG and the LC. This is because of the drawbacks explained below (mainly, that the templating is too rigid for our usecase)

Pros

out-of-the box support for rolling updates of the instances in the ASGs
declarative instead of imperative
less code in the operator

Cons

the templating system (i.e. the functions AWS provides on top of the bare YAML) is barebones. For example, it does not support passing in a list of security group rules, which we need if we want to make a template generic.
this could be obviated by using Go's template package
the other possibility is to keep creating some of the resources in the SDK, and the rest via CF (which is what's happening in my branch)

asymmetric · Answer 19 · Fri Jun 30 2017 23:41:53 GMT+0800 (China Standard Time)

Status Update

By using go templates, I was able (1, 2) to easily pass Listeners and Security Group rules to the CF template.

It is definitely possible to go further this way, eliminating code from the operator.

I decided to only use go templates when AWS parameters system was too cumbersome. In all other scenarios, I stuck to the default way of doing things.

I didn't remove the old, manual code from the operator's service/create/service.go for lack of times, but there's plenty that can be removed.

xh3b4sd · Answer 20 · Wed Sep 27 2017 17:17:26 GMT+0800 (China Standard Time)

Closing in favour of https://github.com/giantswarm/giantswarm/issues/1850.