AWS_METADATA_SERVICE_NUM_ATTEMPTS effectively ignored due to uncaught botocore exceptions
sndrtj opened this issue
Describe the bug
The AWS_METADATA_SERVICE_NUM_ATTEMPTS environment variable is effectively ignored in many cases because botocore errors go uncaught in the AioIMDSFetcher.
Botocore has the following retryable exceptions:
- ReadTimeoutError
- EndpointConnectionError
- ConnectionClosedError
- ConnectTimeoutError
Aiobotocore, however, only retries on asyncio and aiohttp exceptions, and does not retry on the exceptions raised by botocore.
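To illustrate the mismatch described above, here is a minimal, self-contained sketch. It does not use aiobotocore or botocore; the exception classes, `FlakyRequest`, `fetch_with_retries`, and `run_demo` are all hypothetical stand-ins. It only shows the general mechanism: a retry loop that catches one family of exceptions gives up on the first attempt when the error arrives as a different type.

```python
import asyncio


class TransportTimeout(Exception):
    """Hypothetical stand-in for an asyncio/aiohttp-level error."""


class TranslatedConnectTimeout(Exception):
    """Hypothetical stand-in for a botocore-level error (e.g. ConnectTimeoutError)."""


class FlakyRequest:
    """Fails `failures` times with `exc_type`, then succeeds."""

    def __init__(self, failures, exc_type):
        self.failures = failures
        self.exc_type = exc_type
        self.calls = 0

    async def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise self.exc_type("simulated failure")
        return "credentials"


async def fetch_with_retries(request, num_attempts, retryable):
    """Retry `request` up to `num_attempts` times, but only on `retryable` errors."""
    last_error = None
    for _ in range(num_attempts):
        try:
            return await request()
        except retryable as exc:
            last_error = exc
    raise RuntimeError("all attempts failed") from last_error


def run_demo(retryable):
    """Return (outcome, number of attempts made) for a given retryable tuple."""
    request = FlakyRequest(failures=2, exc_type=TranslatedConnectTimeout)
    try:
        outcome = asyncio.run(fetch_with_retries(request, 10, retryable))
    except TranslatedConnectTimeout:
        # The error type was not in `retryable`, so it escaped the loop.
        outcome = "uncaught"
    return outcome, request.calls
```

With `retryable=(TransportTimeout,)` the request is attempted only once even though ten attempts are allowed. Adding `TranslatedConnectTimeout` to the tuple lets the loop recover on the third attempt.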
Example
The following is an illustrative example:
import aiobotocore.session
import aiobotocore.credentials
import asyncio
import logging

async def main():
    session = aiobotocore.session.get_session()
    # Issue 1000 concurrent credential lookups to put load on the metadata service.
    tasks = [aiobotocore.credentials.get_credentials(session) for _ in range(1000)]
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    logging.basicConfig(level="DEBUG")
    asyncio.run(main())
Now run this as AWS_METADATA_SERVICE_NUM_ATTEMPTS=10 python test.py on an EC2 instance. This will likely fail (it may take a couple of tries) with either a ConnectionClosedError or a ConnectTimeoutError, without any of the expected "Caught retryable HTTP exception while making metadata service request" messages in the logs.
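One way to check for the retry message without reading through DEBUG output by hand is to count matching log records. The following is only a sketch using the standard `logging` module; `RetryMessageCounter` and `install_retry_counter` are hypothetical helper names, and it assumes the message text quoted above appears verbatim in the formatted log record.

```python
import logging

RETRY_MESSAGE = "Caught retryable HTTP exception while making metadata service request"


class RetryMessageCounter(logging.Handler):
    """Counts log records whose formatted message contains RETRY_MESSAGE."""

    def __init__(self):
        super().__init__(level=logging.DEBUG)
        self.count = 0

    def emit(self, record):
        if RETRY_MESSAGE in record.getMessage():
            self.count += 1


def install_retry_counter(logger=None):
    """Attach a counter to `logger` (the root logger by default) and return it."""
    if logger is None:
        logger = logging.getLogger()
    handler = RetryMessageCounter()
    logger.addHandler(handler)
    return handler
```

Calling `install_retry_counter()` before running the reproduction and reading `.count` afterwards shows whether any retry was logged. A count of zero alongside a raised ConnectionClosedError or ConnectTimeoutError would be consistent with the behaviour reported here.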
Botocore example that does work as expected:
import botocore.session
import botocore.credentials
import logging

def main():
    session = botocore.session.get_session()
    # Same number of lookups as the aiobotocore example, but sequential.
    for _ in range(1000):
        botocore.credentials.get_credentials(session)

if __name__ == "__main__":
    logging.basicConfig(level="DEBUG")
    main()
Checklist
- I have reproduced in environment where pip check passes without errors
- I have provided pip freeze results
- I have provided sample code or detailed way to reproduce
- I have tried the same code in botocore to ensure this is an aiobotocore specific issue
- I have tried similar code in aiohttp to ensure this is an aiobotocore specific issue
- I have checked the latest and older versions of aiobotocore/aiohttp/python to see if this is a regression / injection
pip freeze results
aiobotocore==2.4.1
aiohttp==3.8.3
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.2
attrs==22.2.0
botocore==1.27.59
charset-normalizer==2.1.1
frozenlist==1.3.3
idna==3.4
jmespath==1.0.1
multidict==6.0.3
python-dateutil==2.8.2
six==1.16.0
typing_extensions==4.4.0
urllib3==1.26.13
wrapt==1.14.1
yarl==1.8.2
Environment:
- Python Version: 3.8
- OS name and version: Ubuntu 22.04
Additional context
I encountered this issue while experiencing errors in dvc with dvc pull. DVC seems to hit get_credentials for just about every object it retrieves from S3. My repository has about 7k objects, which seems to be more than enough to trigger this behaviour.
This bug is probably related to #961
thanks! will look into this asap
ya, this seems like an oversight: we didn't swap in the botocore exceptions after we started translating exceptions. Could you try importing RETRYABLE_HTTP_ERRORS from botocore.utils instead?
Yes, when I patch RETRYABLE_HTTP_ERRORS with the ones from botocore.utils it works as expected :-).
from botocore.utils import RETRYABLE_HTTP_ERRORS
import aiobotocore.session
import aiobotocore.credentials
import aiobotocore.utils
import asyncio
import logging

# patch utils
aiobotocore.utils.RETRYABLE_HTTP_ERRORS = RETRYABLE_HTTP_ERRORS

async def main():
    session = aiobotocore.session.get_session()
    tasks = [aiobotocore.credentials.get_credentials(session) for _ in range(1000)]
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    logging.basicConfig(level="DEBUG")
    asyncio.run(main())
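If a permanent module-level assignment like the one above is undesirable (for example in a test suite), the same kind of patch can be scoped with `unittest.mock.patch.object`, which restores the original value on exit. The sketch below uses a stand-in namespace instead of the real aiobotocore.utils so that it runs anywhere; `fake_utils`, `is_retryable`, and `check_with_patch` are hypothetical names, and it assumes the patched code looks up the tuple at call time.

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for a module such as aiobotocore.utils (hypothetical contents).
fake_utils = SimpleNamespace(RETRYABLE_HTTP_ERRORS=(TimeoutError,))


def is_retryable(exc):
    # The tuple is read at call time, so a patched value is honoured.
    return isinstance(exc, fake_utils.RETRYABLE_HTTP_ERRORS)


def check_with_patch(exc, extra_errors):
    """Evaluate is_retryable(exc) while extra error types are temporarily added."""
    patched = fake_utils.RETRYABLE_HTTP_ERRORS + tuple(extra_errors)
    with mock.patch.object(fake_utils, "RETRYABLE_HTTP_ERRORS", patched):
        return is_retryable(exc)
```

Outside the `with` block the original tuple is back in place, so the widened retry set applies only to the code that runs inside it.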
will get a patch out for this
Thanks for the quick resolution!
If you follow what I had to do for the tests in that PR, you'll see why I hate unit tests and prefer integration tests. Integration tests would have caught this issue instead of providing a false sense of security.