boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.

OSError: [Errno 24] Too many open files causes CredentialRetrievalError

adhadseKavida opened this issue

Describe the bug

I'm currently on boto3 1.34.14, running an AWS Fargate Spot job.

Around 15 minutes into the job, near its end, the job can no longer retrieve the secrets for our MongoDB. Previously the job ran for about 12 minutes without any problems.

For your information, the Fargate job runs two async jobs, one after another, the latter taking a little longer to process.

If I disable the second task, the job runs fine without any issue.

I looked on Stack Overflow for ways to increase the ulimit for nofile, but that isn't supported on the Fargate platform.

What's the issue here? I don't seem to be able to debug it locally.
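Since the failure only shows up in the long-running job, one way to narrow it down is to log the process's open file descriptor count next to its limit at a few points in the job. The following is a minimal sketch, assuming a Linux container where `/proc/self/fd` is available; `fd_usage` is a hypothetical helper name, not part of boto3 or botocore.

```python
import os
import resource


def fd_usage():
    """Return (open_fd_count, soft_limit, hard_limit) for this process.

    Assumption: Linux, where /proc/self/fd has one entry per open descriptor.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The count can include the descriptor used to read the directory itself.
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft, hard


if __name__ == "__main__":
    open_fds, soft, hard = fd_usage()
    print(f"open fds: {open_fds} / soft limit: {soft} (hard limit: {hard})")
```

If the count climbs steadily while the second async job runs, something (sockets, files, or clients) is probably being opened without being closed.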

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 174, in _new_conn
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 95, in create_connection
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 76, in create_connection
File "/usr/lib/python3.10/socket.py", line 232, in __init__
OSError: [Errno 24] Too many open files
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/botocore/httpsession.py", line 464, in send
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 525, in increment
File "/usr/local/lib/python3.10/dist-packages/urllib3/packages/six.py", line 770, in reraise
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 703, in urlopen
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 398, in _make_request
File "/usr/local/lib/python3.10/dist-packages/botocore/awsrequest.py", line 96, in request
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 244, in request
File "/usr/lib/python3.10/http/client.py", line 1283, in request
File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
File "/usr/local/lib/python3.10/dist-packages/botocore/awsrequest.py", line 123, in _send_output
File "/usr/local/lib/python3.10/dist-packages/botocore/awsrequest.py", line 223, in send
File "/usr/lib/python3.10/http/client.py", line 976, in send
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 205, in connect
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 186, in _new_conn
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fc5e898a1d0>: Failed to establish a new connection: [Errno 24] Too many open files
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/botocore/utils.py", line 3128, in _get_response
File "/usr/local/lib/python3.10/dist-packages/botocore/httpsession.py", line 493, in send
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://169.254.170.2/v2/credentials/d30c4f03-fbbd-4d6c-982f-c118f9f4702a"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/botocore/credentials.py", line 1964, in fetch_creds
File "/usr/local/lib/python3.10/dist-packages/botocore/utils.py", line 3065, in retrieve_full_uri
File "/usr/local/lib/python3.10/dist-packages/botocore/utils.py", line 3109, in _retrieve_credentials
File "/usr/local/lib/python3.10/dist-packages/botocore/utils.py", line 3148, in _get_response
botocore.exceptions.MetadataRetrievalError: Error retrieving metadata: Received error when attempting to retrieve container metadata: Could not connect to the endpoint URL: "http://<IP_ADDRESS>/v2/credentials/d30c4f03-fbbd-4d6c-982f-c118f9f4702a"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/code/app/config/mongo_connect.py", line 21, in get_connection
str_config = secret_manager.get_secret(SECRET_MANAGER.get('MONGODB'))
File "/code/app/config/base_secrets_manager.py", line 33, in get_secret
File "/usr/local/lib/python3.10/dist-packages/boto3/session.py", line 299, in client
File "/usr/local/lib/python3.10/dist-packages/botocore/session.py", line 957, in create_client
File "/usr/local/lib/python3.10/dist-packages/botocore/session.py", line 515, in get_credentials
File "/usr/local/lib/python3.10/dist-packages/botocore/credentials.py", line 2074, in load_credentials
File "/usr/local/lib/python3.10/dist-packages/botocore/credentials.py", line 1926, in load
File "/usr/local/lib/python3.10/dist-packages/botocore/credentials.py", line 1934, in _retrieve_or_fail
File "/usr/local/lib/python3.10/dist-packages/botocore/credentials.py", line 1971, in fetch_creds
botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from container-role: Error retrieving metadata: Received error when attempting to retrieve container metadata: Could not connect to the endpoint URL: "http://<IP_ADDRESS>/v2/credentials/d30c4f03-fbbd-4d6c-982f-c118f9f4702a"
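The last traceback shows `get_secret` reaching `create_client` through `boto3/session.py`, which suggests a client is built during that call. If a new client (or session) is created on every call, each one may bring its own connection pool and credential fetch. A minimal sketch of reusing a single client follows; `SecretsReader` and the `client_factory` parameter are hypothetical names for illustration and are not the code from `base_secrets_manager.py`.

```python
def _default_client_factory(service_name):
    # Imported lazily so this module loads even where boto3 is not installed.
    import boto3

    return boto3.client(service_name)


class SecretsReader:
    """Hypothetical helper: builds one Secrets Manager client and reuses it."""

    def __init__(self, client_factory=_default_client_factory):
        self._client_factory = client_factory
        self._client = None

    def _get_client(self):
        # Create the client on first use only; later calls share it.
        if self._client is None:
            self._client = self._client_factory("secretsmanager")
        return self._client

    def get_secret(self, secret_id):
        response = self._get_client().get_secret_value(SecretId=secret_id)
        return response["SecretString"]
```

Whether this is the actual source of the descriptor growth is an assumption; it only removes one place where connections could accumulate.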

Expected Behavior

The job should run fine and retrieve the credentials.

Current Behavior

The task fails to retrieve the secret from the Secrets Manager.

Reproduction Steps

Hard to reproduce with the current setup without including long-running async jobs.

Possible Solution

  • Increasing the ulimit for Fargate jobs?
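As a possible stopgap, a process can raise its own soft `nofile` limit up to the hard limit without extra privileges. The sketch below uses Python's standard `resource` module; `raise_nofile_soft_limit` is a hypothetical helper name, and whether the hard limit inside a given Fargate task is actually higher than the soft limit is an assumption that needs checking. It does not fix a descriptor leak, it only delays hitting the limit.

```python
import resource


def raise_nofile_soft_limit():
    """Try to raise the soft RLIMIT_NOFILE to the hard limit.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # The platform refused the change; keep the current limits.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("nofile limits (soft, hard):", raise_nofile_soft_limit())
```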

Additional Information/Context

No response

SDK version used

1.34.14

Environment details (OS name and version, etc.)

AWS Fargate

Thanks for reaching out. I'm not sure what the issue is here but it may be something that we would need to ask the Fargate team about.

Can you provide a full code snippet for us to try and reproduce this, and complete debug logs (with sensitive info redacted)? You can get the full logs by adding boto3.set_stream_logger('') to your script.

Also, version 1.34.14 is a bit older. Have you tried updating to the latest version (1.34.108 per the CHANGELOG)?

I'll update to a more recent version and try once again.

Additionally, I'll send the logs once I've tested it.

Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.