Netflix-Skunkworks / aardvark

Aardvark is a multi-account AWS IAM Access Advisor API

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Updater threads crash when AssumeRole denied by SCP

dfkunstler opened this issue · comments

Aardvark doesn't handle exceptions in UpdateAccountThread.run() defined in file /aardvark/manage.py. Aardvark also doesn't handle exceptions from CloudAux when it attempts to get IAM client from cloudaux.aws.sts.boto3_cached_conn() in method AccountToUpdate._get_client() defined in file /aardvark/updater/__init__.py. When there is an exception due to permission being denied by SCP, the thread crashes. Each occurrence reduces the effective thread count by 1 until they're all hung, at which point killing the OS process is the only option.

This issue was discovered in a deployment where AWS accounts slated for closure are first moved into an Organizations OU ("KIA") whose SCP denies all access. With ~40 AWS accounts in KIA out of ~850 total, Aardvark would process <100 accounts over several hours before all threads were crashed. The order of the accounts in the swag database appears to determine aardvark's processing order and thus timing of the crashes, so it's possible, for example, that all threads could crash on start if all the KIA accounts were at the top of the list.

The root exception in this scenario is an AccessDenied error from botocore. The source/nature of the AccessDenied exception seems to matter. For example, if AccessDenied is due to the aardvark role missing from the target account, a message is logged and aardvark keeps on trucking. But for some reason when SCP is the culprit, Aardvark chokes. Further investigation could be done to discover the distinction, but for our purposes it's sufficient to say that root cause was an unhandled exception as described above.

Fixing this involved adding exception handling to the two methods listed in the description, and replacing one instance of a direct call from AccountToUpdate._get_arns() to cloudaux.aws.sts.boto3_cached_conn() with a local call to AccountToUpdate._get_client(). Current behavior if the call from UpdateAccountThread.run() to AccountToUpdate.update_account() returns an error (i.e., no exception), is to put account back on the queue for processing again. In the case of an exception, however, it's simpler to assume the cause is not transient and continue, thus avoiding an infinite loop. If the issue was transient then the account will be processed the next time the updater runs, without data loss.

I'd like to contribute the code fixing this issue, but obviously don't have permission to create a branch. Would you prefer that I fork the project and create pull request from that; or submit a diff; or something else? I also added some debug-level logging while troubleshooting, which I'll keep in the pull request because it's valuable for monitoring when processing hundreds of AWS accounts.

thanks!

Thanks for the report, and nice find! Our preference would be to fork the project and open a PR from there.