bottlerocket-os / bottlerocket-ecs-updater

A service to automatically manage Bottlerocket updates in an Amazon ECS cluster.

Tune waiters max attempts

WilboMo opened this issue · comments

What I'd like:
Functions like `sendCommand` and `waitUntilDrained` use AWS API "waiters", which wait until an associated task is in a specific state before returning. These waiters ensure that a sent command has completed, or that a container instance's tasks have stopped, before allowing Updater to proceed safely with its operations.

The default for these waiters is 100 attempts, but this can easily be too short for some workloads. The waiters need to be tuned to ensure that they do not give up too early and cause an instance to be removed from operations prematurely.
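For reference, here is a minimal sketch (not Updater's actual code; the helper and package name are illustrative) of how an exhausted waiter surfaces to the caller in aws-sdk-go v1, which is why giving up too early matters:

```go
package updater // hypothetical package name, for illustration only

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// waitForTasksStopped (illustrative helper) waits with the SDK defaults and
// reports when the waiter ran out of attempts rather than the tasks stopping.
func waitForTasksStopped(svc *ecs.ECS, input *ecs.DescribeTasksInput) error {
	err := svc.WaitUntilTasksStoppedWithContext(aws.BackgroundContext(), input)
	if aerr, ok := err.(awserr.Error); ok && aerr.Code() == request.WaiterResourceNotReadyErrorCode {
		// The waiter exhausted its attempts; the tasks may still be draining.
		log.Printf("gave up waiting for tasks to stop: %v", aerr)
	}
	return err
}
```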

This issue was originally brought up in the following PR comment:

Similar to the other `wait` calls, this would also fail after `MaxAttempts: 100`
(https://github.com/aws/aws-sdk-go/blob/v1.38.20/service/ecs/waiters.go#L199).
Can you add a TODO or create an issue to decide whether we would like to control this `MaxAttempts` number?

_Originally posted by @srgothi92 in 
https://github.com/bottlerocket-os/bottlerocket-ecs-updater/pull/35#r614898274_

Looking at the waiter API, I found the following mechanism, which allows significant configurability.

err := svc.WaitUntilTasksStoppedWithContext(
	aws.BackgroundContext(),
	taskStopInput,
	request.WithWaiterMaxAttempts(5),
	request.WithWaiterDelay(request.ConstantWaiterDelay(2*time.Second)),
	request.WithWaiterLogger(aws.LoggerFunc(taskStopLogger)),
)

This syntax allows overriding both `MaxAttempts` and the delay between attempts. Would this mechanism be a good one to build on?
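One way to build on it, sketched below (the helper name and package are hypothetical, not something in Updater today), would be to centralize the tuned values so every waiter call passes the same options:

```go
package updater // hypothetical package name, for illustration only

import (
	"time"

	"github.com/aws/aws-sdk-go/aws/request"
)

// waiterOptions (illustrative helper) gathers the tuned settings in one place
// so each waiter call in Updater uses the same MaxAttempts and delay.
func waiterOptions(maxAttempts int, delay time.Duration) []request.WaiterOption {
	return []request.WaiterOption{
		request.WithWaiterMaxAttempts(maxAttempts),
		request.WithWaiterDelay(request.ConstantWaiterDelay(delay)),
	}
}
```

A call site would then spread the options, e.g. `svc.WaitUntilTasksStoppedWithContext(aws.BackgroundContext(), taskStopInput, waiterOptions(100, 6*time.Second)...)`, with the placeholder numbers left to the tuning discussed below.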

/cc: @srgothi92 @WilboMo @samuelkarp

Yeah, I think that makes a lot of sense and should work just fine. Thanks!

For `WaitUntilTasksStopped`

Another important piece of research for this task is to figure out what a good number of attempts is; maybe the default is good enough. The plan I had in mind was to launch a large number of tasks and see how many attempts it takes until all of them are stopped. Once we know that, we can add roughly 25% as a buffer and decide on the max attempts number (see the sketch below). If anyone has another idea, please feel free to share.
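To make the buffer concrete, a tiny illustrative helper (not part of Updater) for turning the worst observed attempt count into a configured maximum:

```go
package updater // hypothetical package name, for illustration only

import "math"

// attemptsWithBuffer (illustrative) adds ~25% headroom to the largest attempt
// count observed during load testing; e.g. an observed worst case of 120
// attempts would yield a MaxAttempts of 150.
func attemptsWithBuffer(observedMax int) int {
	return int(math.Ceil(float64(observedMax) * 1.25))
}
```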

For `WaitUntilCommandExecuted`

There are two types of wait: one for performing the update (`apiclient update apply --reboot`) and another for checking for an update (`apiclient update check`). For performing the update we might want to wait longer because it can take time; for the update check, not so much, but I think it is okay to wait the same amount of time as for the update because performance is not that important there.
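As a sketch of what reusing one setting for both commands could look like (the wrapper below is illustrative, not Updater's actual code), both SSM command waits would go through the same `WaitUntilCommandExecutedWithContext` call with identical options:

```go
package updater // hypothetical package name, for illustration only

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/service/ssm"
)

// waitForCommand (illustrative) waits for an SSM command to finish; both the
// "apiclient update apply --reboot" and "apiclient update check" commands
// would pass the same tuned waiter options here.
func waitForCommand(svc *ssm.SSM, commandID, instanceID string, opts ...request.WaiterOption) error {
	return svc.WaitUntilCommandExecutedWithContext(
		aws.BackgroundContext(),
		&ssm.GetCommandInvocationInput{
			CommandId:  aws.String(commandID),
			InstanceId: aws.String(instanceID),
		},
		opts...,
	)
}
```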

I think using the same wait settings for `apiclient update check` as we do for the update command is fine. It may be more than necessary, but it ensures the instance has had plenty of time to activate and return the expected JSON, which in turn ensures we don't accidentally get empty responses that trigger errors for certain checks.

The only caveat of not tuning the waiter for the update check is, as @srgothi92 said, that it would be a little slower. I think it's better to be conservative here and be accurate rather than fast.

Thank you both, great input. I believe being "slow and accurate" with these waits is a desirable approach from the customer's perspective as well, since it reduces the risk of failed updates. I'll go ahead and adjust the pattern in the relevant places, followed by measurements during testing to tune the values.

Fixed with PR #50