Add better error handling

WilboMo opened this issue 3 years ago

What I'd like:
Implement a mechanism to audit the errors returned by Updater's functions and methods. Each error needs to be examined to determine whether it is fatal to Updater's operation, in which case the instance should be aborted, or whether it is a minor problem that can be skipped and retried the next time Updater picks up the instance for processing.

This issue originates from the following PR comment:

... continue updating next instance and worry about updating this instance in next program iteration. 
The same goes for many other errors; I think we need to scan all the errors and proceed on non-fatal errors.

_Originally posted by @srgothi92 in https://github.com/bottlerocket-os/bottlerocket-ecs-updater/pull/38#discussion_r614438390_

Will Moore · Answer 1 · Sat Apr 17 2021 03:32:41 GMT+0800 (China Standard Time)

#35 (comment)

Shailesh Gothi · Answer 2 · Wed May 19 2021 23:24:22 GMT+0800 (China Standard Time)

Some refactoring was done in PR-56 and PR-51, which partially addresses this issue. However, we should scan the complete code base and make sure all errors are handled properly. For each error it is important to decide on three things:

Is the error fatal? If yes, make sure Updater stops.
Do we need to reset any cluster state on error?
Can we just log the error and continue?

Additionally, we should try to run a set of tests (manual or automated) that exercises every error path and make sure each one is handled correctly.

Shailesh Gothi · Answer 3 · Sat Jun 19 2021 04:48:25 GMT+0800 (China Standard Time)

Re-verified all the cases and added log fixes wherever required as part of PR-77.