aio-libs / aiobotocore

asyncio support for botocore library using aiohttp

Home Page: https://aiobotocore.rtfd.io

S3 Express Session opened for all asyncio calls

LouisAuneau opened this issue · comments

Hello,

Describe the bug
We are aiming to improve our upload speed using S3 Express One Zone. Running the following code with botocore, I manage to upload files in a reasonable time:

import botocore.session

bucket = 'my-bucket-eun1-az1--x-s3'
region = 'eu-north-1'
session = botocore.session.get_session()
client = session.create_client('s3', region_name=region)

for i, file in enumerate(dataset):
    client.put_object(
        Body=file,
        Bucket=bucket,
        Key=f'file_{i}.jpg'
    )

However, I wanted to improve performance by running the uploads concurrently with the following code:

import asyncio

import aiobotocore.session

bucket = 'my-bucket-eun1-az1--x-s3'
region = 'eu-north-1'

async def main():
    session = aiobotocore.session.get_session()
    async with session.create_client('s3', region_name=region) as client:
        tasks = []
        for i, file in enumerate(dataset):
            tasks.append(client.put_object(
                Body=file,
                Bucket=bucket,
                Key=f'file_{i}.jpg'
            ))

        await asyncio.gather(*tasks)

asyncio.run(main())

However, when running multiple put_object calls concurrently, the CreateSession endpoint appears to be hit once per call, leading to the following ClientError:

ClientError: An error occurred (SlowDown) when calling the CreateSession operation (reached max retries: 4): Reduce your request rate.

Checklist

  • I have reproduced this in an environment where pip check passes without errors
  • I have provided pip freeze results
  • I have provided sample code or detailed way to reproduce
  • I have tried the same code in botocore to ensure this is an aiobotocore specific issue
  • I have tried similar code in aiohttp to ensure this is an aiobotocore specific issue -> time-consuming, as it requires re-implementing the AWS authentication system.
  • I have checked the latest and older versions of aiobotocore/aiohttp/python to see if this is a regression -> this is the first (and latest) version that supports S3 Express One Zone.

pip freeze results

absl-py==1.4.0
aiobotocore==2.9.0
aiofile==3.8.8
aiofiles==23.2.1
aiohttp==3.9.1
aiohttp-retry==2.8.3
aioitertools==0.11.0
aiopath==0.5.12
aiosignal==1.3.1
anyio==3.7.1
asn1crypto==1.5.1
asttokens==2.4.1
async-timeout==4.0.3
asyncpg==0.29.0
attrs==23.2.0
awscrt==0.19.17
azure-devops==6.0.0b4
backoff==2.2.1
boto3==1.33.13
botocore==1.33.2
botocore-stubs==1.34.17
cachetools==5.3.2
caio==0.9.13
certifi==2023.11.17
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
cloud-sql-python-connector==1.5.0
cloudpickle==2.2.1
comm==0.2.1
cryptography==41.0.7
debugpy==1.8.0
decorator==5.1.1
Deprecated==1.2.14
docstring-parser==0.15
exceptiongroup==1.2.0
executing==2.0.1
fire==0.5.0
frozenlist==1.4.1
gcloud-aio-auth==4.2.3
gcloud-aio-storage==8.3.0
google-api-core==1.34.0
google-api-python-client==1.12.8
google-auth==2.26.2
google-auth-httplib2==0.1.1
google-auth-oauthlib==0.4.6
google-cloud-core==2.4.1
google-cloud-pubsub==2.19.0
google-cloud-storage==2.14.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
greenlet==3.0.3
grpc-google-iam-v1==0.13.0
grpcio==1.60.0
grpcio-status==1.48.2
grpcio-tools==1.48.2
h11==0.14.0
h2==4.1.0
hpack==4.0.0
httpcore==1.0.2
httplib2==0.22.0
httpx==0.26.0
hyperframe==6.0.1
idna==3.6
importlib-metadata==6.11.0
ipykernel==6.28.0
ipython==8.18.1
isodate==0.6.1
jedi==0.19.1
jmespath==1.0.1
jsonschema==4.20.0
jsonschema-specifications==2023.12.1
jupyter_client==8.6.0
jupyter_core==5.7.1
kfp==1.8.22
kfp-pipeline-spec==0.1.16
kfp-server-api==1.8.5
kubernetes==25.3.0
matplotlib-inline==0.1.6
msrest==0.6.21
multidict==6.0.4
nest-asyncio==1.5.8
notion-client==2.2.1
numpy==1.23.5
oauthlib==3.2.2
opencv-python==4.9.0.80
packaging==23.2
pandas==2.0.3
parso==0.8.3
pexpect==4.9.0
pg8000==1.30.4
Pillow==9.5.0
platformdirs==4.1.0
portalocker==2.8.2
prompt-toolkit==3.0.43
proto-plus==1.23.0
protobuf==3.20.3
psutil==5.9.7
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pydantic==1.10.13
PyGithub==1.59.1
Pygments==2.17.2
PyJWT==2.8.0
PyNaCl==1.5.0
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.2
qdrant-client==1.7.0
referencing==0.32.1
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rpds-py==0.16.2
rsa==4.9
s3transfer==0.8.2
scramp==1.4.4
semantic-version==2.10.0
six==1.16.0
slackclient==2.9.4
sniffio==1.3.0
SQLAlchemy==2.0.25
stack-data==0.6.3
strip-hints==0.1.10
tabulate==0.9.0
termcolor==2.4.0
toml==0.10.2
tornado==6.4
tqdm==4.66.1
traitlets==5.14.1
typer==0.9.0
types-aiobotocore==2.9.0
types-aiobotocore-s3==2.9.0
types-awscrt==0.20.0
typing_extensions==4.9.0
tzdata==2023.4
uritemplate==3.0.1
urllib3==1.26.18
wcwidth==0.2.13
websocket-client==1.7.0
wrapt==1.16.0
yarl==1.9.4
zipp==3.17.0

Environment:

  • Python Version: 3.9.4
  • OS name and version: Debian 12

Thank you for reporting the issue. I have a few ideas and would need your help to check them out and report back:

  1. We had to implement a custom credential cache that might be to blame. Could you please try to await the first put_object task before gathering the others?

    await tasks[0]
    await asyncio.gather(*tasks[1:])

    If that helps, I will attempt to improve the caching logic.

  2. According to docs for PutObject:

    When you use this operation with a directory bucket, you must use virtual-hosted-style requests in the format Bucket_name.s3express-az_id.region.amazonaws.com.

    Have you tried providing the bucket name in this format?
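Suggestion 1 can be illustrated without touching AWS at all. The toy model below (all names are hypothetical stand-ins, not the actual aiobotocore internals) shows why gathering all coroutines at once triggers a burst of CreateSession calls, while awaiting the first coroutine to completion warms the cache for the rest:

```python
import asyncio

# Toy model of the race: "put_object" must first create a session
# (one network round trip) unless a cached one already exists.
def make_client():
    state = {'session': None, 'create_calls': 0}

    async def put_object(key):
        if state['session'] is None:
            state['create_calls'] += 1     # would hit CreateSession
            await asyncio.sleep(0.01)      # simulate the round trip
            state['session'] = 'token'
        await asyncio.sleep(0.001)         # simulate the upload itself

    return put_object, state

async def naive():
    # Gathering everything at once: each coroutine checks the cache
    # before any of them has populated it, so every one of them calls
    # CreateSession -- the burst that triggers the SlowDown error.
    put_object, state = make_client()
    await asyncio.gather(*(put_object(f'file_{i}.jpg') for i in range(10)))
    return state['create_calls']

async def warmed():
    # The suggested workaround: run the first coroutine to completion,
    # then gather the rest against the now-populated cache.
    put_object, state = make_client()
    coros = [put_object(f'file_{i}.jpg') for i in range(10)]
    await coros[0]
    await asyncio.gather(*coros[1:])
    return state['create_calls']

print(asyncio.run(naive()))   # 10 -- every coroutine called CreateSession
print(asyncio.run(warmed()))  # 1  -- only the warm-up call did
```

Note that the workaround only works because `client.put_object(...)` returns an unstarted coroutine: nothing runs until it is awaited or wrapped by `gather`.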

Thank you for your quick feedback. I tested both of your points:

  1. Awaiting the first task before the others are executed indeed works. Let me know if I can be of any help for the caching logic investigation.
  2. When providing the full endpoint (my-bucket--eun1-az1--x-s3.s3express-eun1-az1.eu-north-1.amazonaws.com in my example), I surprisingly get a NoSuchBucket error.
> Awaiting the first task before the others are executed indeed works. Let me know if I can be of any help for the caching logic investigation.

Great! I will prepare a fix ASAP.

> When providing the full endpoint (my-bucket--eun1-az1--x-s3.s3express-eun1-az1.eu-north-1.amazonaws.com in my example), I surprisingly get a NoSuchBucket error.

Well, it was worth a shot.

@jakob-keller ah, I missed that. Yeah, it needs to use an asyncio.Lock
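For reference, a minimal sketch of what that lock-based fix amounts to (the cache class and its names are hypothetical, not the actual aiobotocore implementation): concurrent callers queue on an asyncio.Lock, and only the first one performs the CreateSession round trip.

```python
import asyncio

class S3ExpressCache:
    """Toy credential cache: the lock ensures that only one coroutine
    performs the CreateSession round trip; the rest wait on the lock
    and then find the credentials already populated."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._credentials = None
        self.create_calls = 0

    async def get_credentials(self):
        async with self._lock:
            if self._credentials is None:
                self.create_calls += 1     # the single CreateSession call
                await asyncio.sleep(0.01)  # simulate the round trip
                self._credentials = 'session-token'
        return self._credentials

async def main():
    cache = S3ExpressCache()
    # Ten concurrent "uploads" all request credentials at the same time.
    tokens = await asyncio.gather(
        *(cache.get_credentials() for _ in range(10)))
    assert all(t == 'session-token' for t in tokens)
    return cache.create_calls

print(asyncio.run(main()))  # 1 -- the lock collapses the burst
```

Holding the lock across the await does serialize the initial cache misses, but once the credentials are cached the critical section is just a cheap check, so steady-state throughput is unaffected.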