bottlerocket-os / bottlerocket-ecs-updater

A service to automatically manage Bottlerocket updates in an Amazon ECS cluster.

Add better error handling

WilboMo opened this issue

What I'd like:
Implement a mechanism to audit the errors returned by Updater's functions and methods. Each error needs to be examined to determine whether it is fatal to Updater's operation, in which case the run should be aborted, or whether it is a minor problem that can be skipped and retried the next time Updater picks up the instance for processing.

This issue originates from the following PR comment:

... continue updating next instance and worry about updating this instance in next program iteration. 
Similar things for lot many errors, I think we need to scan all the errors and proceed for non fatal error.

_Originally posted by @srgothi92 in 
https://github.com/bottlerocket-os/bottlerocket-ecs-updater/pull/38#discussion_r614438390_

Some refactoring was done in PR-56 and PR-51 which partially addresses this issue; however, we should scan the complete code base and make sure all errors are handled properly. For each error, it is important to answer three questions:

  1. Is the error fatal? If yes, make sure the updater stops.
  2. Do we need to reset any cluster state on error?
  3. Can we just log the error and continue?

Additionally, we should try to run a set of tests (manual or automated) that exercises every error path and confirms each one is handled correctly.

Re-verified all the cases and added log fixes wherever required as part of PR-77.