Bottlerocket ECS Updater MVP

webern opened this issue 4 years ago

Matthew James Briggs commented 4 years ago

Service Binary/Process Name: bottlerocket-ecs-updater
Git Repository: bottlerocket-ecs-updater

Background

We want to provide a solution for automating Bottlerocket updates in ECS clusters.

This functionality will be similar to that provided by brupop (the Bottlerocket update operator for Kubernetes).
The system will cause Bottlerocket nodes to apply OS updates as they become available through the waves system.

Throughout, the term node actually means ECS Container Instance.

Requirements

User does not have to manually initiate updates per-host.
Updates obey wave structure as normal.
Hosts are drained of ECS tasks (that are created by a service) before updating.
System should not interrupt tasks that are not part of a service.
Safe update velocity; one host at a time (for initial release).
Check the health of the updated host before moving on to the next one (perhaps with an ECS healthcheck).

Service

The program will run externally to the nodes-under-management as a Fargate task.
Running the updater in the existing cluster capacity is possible (and could be a future feature), but it would be more complex since the program might want to update its own node.

What it Does

The service periodically communicates with the ECS control plane and Bottlerocket nodes via their respective APIs. When a Bottlerocket node indicates that it has an update available, the system will cause the node to be drained of services, apply the update, reboot the node, and undrain the node.
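
The per-node sequence above can be pictured as a small state machine. The following is only a sketch under assumptions: the `Phase` enum, its variant names, and the `walk` helper are hypothetical and not part of any existing code. It starts from the point where a node has already reported an available update.

```rust
/// Phases one node passes through during a single update.
/// Variant names are illustrative, not taken from an existing codebase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    UpdateAvailable,
    Draining,
    Updating,
    Rebooting,
    Undraining,
    Done,
}

impl Phase {
    /// Returns the phase that follows `self`; `Done` is terminal.
    pub fn next(self) -> Phase {
        match self {
            Phase::UpdateAvailable => Phase::Draining,
            Phase::Draining => Phase::Updating,
            Phase::Updating => Phase::Rebooting,
            Phase::Rebooting => Phase::Undraining,
            Phase::Undraining | Phase::Done => Phase::Done,
        }
    }
}

/// Walks from `start` to `Done` and returns every phase visited, in order.
pub fn walk(start: Phase) -> Vec<Phase> {
    let mut seen = vec![start];
    let mut current = start;
    while current != Phase::Done {
        current = current.next();
        seen.push(current);
    }
    seen
}
```

Making the phases explicit could also help with the state-storage question below, since a persisted phase would tell a later run where a node was left.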

State Storage

We may need to store some state information beyond the lifetime of the program, but we have not yet figured out where to store it.
One idea is to use the ECS API putAttributes, but we need to research this further to make sure it is an appropriate use of the API.

Design

At regular intervals, a scheduled task will launch the program in a Fargate task.
This task will need an IAM role that allows it to interact with ECS to describe the cluster, drain nodes, undrain nodes and perform healthchecks.
The task might need an EC2 permission to determine whether an instance is running Bottlerocket or not (TBD).
The task will need SSM permissions to communicate with the Bottlerocket nodes it wants to manage (and the nodes will need SSM enabled in order to be managed).
SSM documents, the Fargate task, cron, etc. will be defined in a CloudFormation file.

Program Flow

Describe the cluster.
Build a list of instances, ignoring those not running Bottlerocket (hopefully an EC2 call is not required?).
Query each instance to see if it has an update available (embarrassingly parallel, to save time).
- Ignore/discard those that don’t need an update.
- List and inspect the tasks on each node that needs an update; discard nodes that have non-service tasks.
Prioritize the list (probably not too important), e.g.
- Current version descending
- Seed ascending
- Whatever
Proceed through the list of nodes to update, one-at-a-time.
- Drain the node
  - List-tasks for the node until all are stopped.
  - (TODO: timeout? Then what?)
- Apply update, reboot.
- Wait for the node to re-appear (maybe check EC2). (TODO: timeout? Then what?)
- Wait for the node to become healthy in ECS.
- Undrain the node.
- Proceed to the next node.
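
The selection and ordering steps of this flow could look roughly like the sketch below. Everything here is an assumption for illustration: the `Node` struct and its fields, the `plan` function, and the `wait_until` helper are hypothetical names. The ordering follows the example priorities listed above (current version descending, then seed ascending). The polling helper only reports whether the condition was met within the allowed attempts; what to do on timeout is still the open TODO.

```rust
use std::thread::sleep;
use std::time::Duration;

/// What the updater has learned about one container instance (fields are hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Node {
    pub id: String,
    /// Currently running OS version as (major, minor, patch).
    pub version: (u32, u32, u32),
    /// Update-wave seed reported by the node.
    pub seed: u32,
    pub update_available: bool,
    /// True when the node runs any task that was not started by a service.
    pub has_non_service_tasks: bool,
}

/// Keeps only nodes that need an update and run no non-service tasks,
/// ordered by version descending, then seed ascending.
pub fn plan(nodes: Vec<Node>) -> Vec<Node> {
    let mut candidates: Vec<Node> = nodes
        .into_iter()
        .filter(|n| n.update_available && !n.has_non_service_tasks)
        .collect();
    candidates.sort_by(|a, b| b.version.cmp(&a.version).then(a.seed.cmp(&b.seed)));
    candidates
}

/// Calls `check` up to `max_attempts` times, sleeping `interval` after each failure.
/// Returns true as soon as `check` succeeds and false if every attempt fails;
/// the caller decides what a timeout means.
pub fn wait_until<F: FnMut() -> bool>(mut check: F, max_attempts: u32, interval: Duration) -> bool {
    for _ in 0..max_attempts {
        if check() {
            return true;
        }
        sleep(interval);
    }
    false
}
```

The same `wait_until` shape would fit each waiting step (tasks stopped, node re-appeared, node healthy), with the closure wrapping the relevant ECS or EC2 query.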

Testing

Unit

We should use mediator traits as injected (dynamic) dependencies to decouple the business logic from rusoto.
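
A minimal sketch of what such a mediator might look like, assuming a hypothetical `EcsMediator` trait, a `drain_node` function standing in for business logic, and a `MockEcs` test double. None of these names come from rusoto or from existing code. The production implementation would wrap the real ECS client behind the same trait.

```rust
/// The ECS operations the business logic needs, independent of any AWS SDK.
/// Trait and method names are hypothetical.
pub trait EcsMediator {
    fn drain(&mut self, node_id: &str) -> Result<(), String>;
    fn running_task_count(&mut self, node_id: &str) -> Result<usize, String>;
    fn undrain(&mut self, node_id: &str) -> Result<(), String>;
}

/// Business logic under test: drains a node and reports whether it is now empty.
/// It only sees the trait object, so any implementation can be injected.
pub fn drain_node(ecs: &mut dyn EcsMediator, node_id: &str) -> Result<bool, String> {
    ecs.drain(node_id)?;
    Ok(ecs.running_task_count(node_id)? == 0)
}

/// Test double that records calls instead of talking to AWS.
#[derive(Default)]
pub struct MockEcs {
    pub calls: Vec<String>,
    /// Number of tasks the mock pretends are still running.
    pub tasks_remaining: usize,
}

impl EcsMediator for MockEcs {
    fn drain(&mut self, node_id: &str) -> Result<(), String> {
        self.calls.push(format!("drain:{}", node_id));
        Ok(())
    }

    fn running_task_count(&mut self, node_id: &str) -> Result<usize, String> {
        self.calls.push(format!("count:{}", node_id));
        Ok(self.tasks_remaining)
    }

    fn undrain(&mut self, node_id: &str) -> Result<(), String> {
        self.calls.push(format!("undrain:{}", node_id));
        Ok(())
    }
}
```

A unit test can then build a `MockEcs`, pass it to `drain_node`, and assert on both the returned value and the recorded `calls`, without any network access.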

Integ

We should create a binary that we can run that will do an integration test.

Approximate requirements:

Use a pre-existing ECS cluster.
Detect pre-existing instances in the cluster and abort with error informing the user that the cluster should be empty.
Create multiple nodes, either taking a Bottlerocket AMI (version < latest) as input or getting it from SSM.
Create a workload service and run it on the nodes.
Assert health of the workload throughout test?
Containerize and deploy the local changeset version of bottlerocket-ecs-updater to an ECR repo.
Run the updater in a Fargate task.
Assert the nodes do update.
Cleanup.

Issues

This issue serves as a rollup of the issues that we need to close to get to the MVP. Waypoints along the way:

Samuel Karp · Answer 1 · Wed Jun 30 2021 02:03:29 GMT+0800 (China Standard Time)

Remaining tasks are each tracked in their own issues; closing this one.