terricain / aioboto3

Wrapper to use boto3 resources with the aiobotocore async backend

async upload to AWS S3

sfsf9797 opened this issue

  • Async AWS SDK for Python version: 10.1.0
  • Python version: 3.7
  • Operating System: Linux

Description

Hi, first and foremost, thank you so much for this amazing project.

I'm trying to do an async upload to S3 while running an intense task. The reason why I would like to use aioboto3 is to make the uploading process non-blocking, so I could run an intense task while the uploading task is being done in the background.

Let's say the intense task takes 4 seconds and the upload to S3 takes 1.8 seconds. I would expect the whole process (intense task plus upload) to take about 4 seconds in total.

What I Did

import asyncio
import time

import aioboto3


async def upload():
    session = aioboto3.Session()
    staging_path = 'path'
    bucket = 'bucket'
    blob_s3_key = 'key'
    print("start uploading")
    async with session.client("s3") as s3:
        try:
            with open(staging_path, "rb") as spfp:
                print(f"Uploading {blob_s3_key} to s3")
                await s3.upload_fileobj(spfp, bucket, blob_s3_key)
        except Exception as e:
            print(f"Unable to s3 upload {staging_path} to {blob_s3_key}: {e} ({type(e)})")
    print("done uploading")


async def intense_task():
    print("start doing work")
    time.sleep(4)  # stand-in for the intense task
    print("done working!!!")
    await asyncio.sleep(0.2)


async def main():
    t1 = asyncio.create_task(upload())
    t2 = asyncio.create_task(intense_task())
    await asyncio.gather(t1)


start = time.time()

await main()  # top-level await, as in a notebook; in a script use asyncio.run(main())

print(time.time() - start)

result

start uploading
Uploading key to s3
start doing work
done working!!!
done uploading
5.833537817001343

As you can see, the total time taken is 4 seconds + 1.8 seconds, so it seems the tasks are being run sequentially. Please let me know how I can make the upload to S3 run asynchronously, or whether my expectations are wrong.

thanks

hi @terrycain, I'd really appreciate it if you could take a look at this. thanks

If you are doing CPU-intensive work, I don't think you want that work to be scheduled as a task. Asynchronous tasks are generally meant for non-blocking I/O work (e.g. network calls, disk reads). If your task is doing intense CPU work, it may be preventing the other task from being scheduled.
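To make the point about blocking concrete, here is a minimal sketch that needs no AWS access. `fake_upload` is a hypothetical stand-in for the real upload (a handful of short awaits instead of an actual `upload_fileobj` call), and all durations are arbitrary. It times the same pair of tasks twice: once with work that calls `time.sleep` and once with work that awaits `asyncio.sleep`.

```python
import asyncio
import time


async def fake_upload(chunks=5, seconds_per_chunk=0.06):
    # Hypothetical stand-in for a network upload: it needs the event loop
    # several times, once per "chunk", just as real I/O would.
    for _ in range(chunks):
        await asyncio.sleep(seconds_per_chunk)


async def blocking_work(seconds):
    # time.sleep never yields to the event loop, so no other task can run.
    time.sleep(seconds)


async def cooperative_work(seconds):
    # asyncio.sleep suspends this coroutine and lets other tasks progress.
    await asyncio.sleep(seconds)


async def timed(work, work_seconds=0.4):
    # Run the fake upload and the given work "concurrently" and time the pair.
    start = time.monotonic()
    await asyncio.gather(fake_upload(), work(work_seconds))
    return time.monotonic() - start


if __name__ == "__main__":
    print("blocking:   ", asyncio.run(timed(blocking_work)))
    print("cooperative:", asyncio.run(timed(cooperative_work)))
```

With the blocking version the upload can only resume after the work has finished, so the two durations largely add up. With the cooperative version the total should be close to the longer of the two.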

I'm a bit rusty on the Python async API, but from other languages, you normally can fire the async network call before you begin the intense CPU work, and then await the call before returning to the client.

So you may be able to make this small change:

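async def main():
  t1 = asyncio.create_task(upload()) # note: this only schedules upload(); it starts running at the next await
  intense_task() # don't schedule this CPU-bound work as a task (assumes a plain `def`, not `async def`)
  await t1 # await t1 to ensure completion in the case `intense_task()` finishes first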

As for your example, time.sleep(4) blocks the event loop. You'll want asyncio.sleep instead, and then you'll get numbers closer to what you're looking for. As @BrutalSimplicity said, if you are doing CPU-intensive work, look into ProcessPoolExecutor; there's a way to call it from async code without blocking (loop.run_in_executor).
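For anyone landing here later, a rough sketch of the executor approach might look like the following. It is not taken from aioboto3 itself: `fake_upload`, `run_both` and `demo` are hypothetical names, the upload is simulated with short sleeps, and the durations are arbitrary. The blocking function is a plain `def` handed to `loop.run_in_executor`, which returns an awaitable and leaves the event loop free to drive the upload.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def intense_task(seconds):
    # Plain (non-async) function standing in for the blocking work.
    time.sleep(seconds)
    return "done working"


async def fake_upload(chunks=5, seconds_per_chunk=0.08):
    # Hypothetical stand-in for `await s3.upload_fileobj(...)`.
    for _ in range(chunks):
        await asyncio.sleep(seconds_per_chunk)
    return "done uploading"


async def run_both(executor, work_seconds=0.5):
    loop = asyncio.get_running_loop()
    start = time.monotonic()
    # The work runs in the executor; the loop keeps servicing the upload.
    work = loop.run_in_executor(executor, intense_task, work_seconds)
    results = await asyncio.gather(fake_upload(), work)
    return results, time.monotonic() - start


def demo():
    # A thread pool is enough here because time.sleep releases the GIL.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return asyncio.run(run_both(pool))


if __name__ == "__main__":
    print(demo())
```

For genuinely CPU-bound pure-Python work, threads are held back by the GIL, so passing a `concurrent.futures.ProcessPoolExecutor` to `run_both` is the usual substitute. In that case the function and its arguments must be picklable, and the entry point should stay behind the `if __name__ == "__main__":` guard.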

oh okay, thanks everyone for answering. 😁