piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)

Creating one file in S3 results in multiple POST api calls

ablanchard opened this issue

Hi,

I'm using smart_open to download an image over HTTP and then upload it to S3. Following the docs, I open the source in read mode and the destination in write mode:

from smart_open import open
import boto3
import sys

boto3.set_stream_logger(name='botocore')

source = "https://m.media-amazon.com/images/I/31AkfWnD8ML.jpg"
destination = sys.argv[1]

with open(source, 'rb') as fin:
  with open(destination, 'wb') as fout:
    for line in fin:
      fout.write(line)

You can call it by passing an S3 destination path, for example: python test_smart_open.py "s3://sagemaker-studio-2f9zflynkdx/test.jpg" 2> test_smart_open.txt

I'm using boto3.set_stream_logger(name='botocore') here to get verbose logging of everything the boto3 library does.
Grepping the log file test_smart_open.txt shows that two POST calls are made on test.jpg.

2022-05-24 15:00:31,775 botocore.auth [DEBUG] CanonicalRequest:
POST
/test.jpg
--
2022-05-24 15:00:32,262 botocore.auth [DEBUG] CanonicalRequest:
POST
/test.jpg

Whereas when I do the same thing with a plain boto3 call, only one call is made.
Am I doing something wrong, or is this a bug?
Thanks

Versions

Linux-5.4.0-104-generic-x86_64-with-debian-bullseye-sid
Python 3.7.4 (default, Jan 20 2021, 16:10:08) 
[GCC 9.3.0]
smart_open 6.0.0
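
For anyone reproducing this, the grep step from the report can also be scripted. This is only a rough sketch: the helper name count_signed_requests is made up for illustration, and it assumes the HTTP method and the path sit on the two lines right after each "CanonicalRequest:" line, as they do in the excerpt above.

```python
from collections import Counter


def count_signed_requests(log_text):
    """Count (method, path) pairs that follow each 'CanonicalRequest:' line."""
    lines = log_text.splitlines()
    counts = Counter()
    for i, line in enumerate(lines):
        # Assumption: method is on the next line, path on the one after it.
        if line.rstrip().endswith('CanonicalRequest:') and i + 2 < len(lines):
            method = lines[i + 1].strip()
            path = lines[i + 2].strip()
            counts[(method, path)] += 1
    return counts


sample_log = (
    "2022-05-24 15:00:31,775 botocore.auth [DEBUG] CanonicalRequest:\n"
    "POST\n"
    "/test.jpg\n"
    "2022-05-24 15:00:32,262 botocore.auth [DEBUG] CanonicalRequest:\n"
    "POST\n"
    "/test.jpg\n"
)
post_calls = count_signed_requests(sample_log)[('POST', '/test.jpg')]
```

To run it on the real log, read test_smart_open.txt with the builtin open (not the one imported from smart_open) and pass its contents to the function.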

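I think this is because smart_open uses multipart uploads by default. A multipart upload needs at least two POST requests, one to initiate the upload and one to complete it, and the parts in between are sent with PUT.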

If you want to use single-part uploads, pass transport_params={'multipart_upload': False} in your call to the open function.
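
Applied to the script from the report, that suggestion might look like the sketch below. The function name copy_singlepart and its opener argument are hypothetical; they are there only so the sketch can be tried without network access. When no opener is given it falls back to smart_open.open. The only smart_open-specific piece is the multipart_upload key in transport_params.

```python
import shutil


def copy_singlepart(source, destination, opener=None):
    """Copy source to destination, asking for a single-part S3 upload."""
    if opener is None:
        # Imported lazily so the sketch can be tested without smart_open.
        from smart_open import open as opener
    # The key must be a quoted string; transport_params is a plain dict.
    params = {'multipart_upload': False}
    with opener(source, 'rb') as fin:
        with opener(destination, 'wb', transport_params=params) as fout:
            # Copy in chunks; binary image data has no meaningful lines.
            shutil.copyfileobj(fin, fout)
```

Note that, as far as I know, a single-part upload has to hold the whole object in memory until the file is closed, so this is best kept for small files like the image here.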

Yes, passing this parameter does the trick: I only get one call to S3.
I thought that since my file is under the default multipart part size of 50 MB, it would not trigger a multipart upload, but in fact it does.
Thanks 👍

smart_open is file-agnostic. It works with streams by design, so it has no way of knowing in advance how large your file is.
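
To illustrate why a small file still costs two POST calls, here is a toy model of a streaming multipart writer. It is not smart_open's actual code: the class ToyMultipartWriter and the request labels are invented for illustration, and the 50 MB default only mirrors the part size mentioned above. The point is that the writer must commit to a multipart upload before it has seen any data.

```python
class ToyMultipartWriter:
    """Toy streaming writer that records the HTTP verbs it would send."""

    def __init__(self, min_part_size=50 * 1024 ** 2):
        self.min_part_size = min_part_size
        self.buffer = bytearray()
        # The total size is unknown when the stream is opened, so the
        # multipart upload is started up front (first POST).
        self.requests = ['POST create-multipart-upload']

    def write(self, data):
        self.buffer.extend(data)
        # Send a part each time enough bytes have accumulated.
        while len(self.buffer) >= self.min_part_size:
            self._upload_part(self.min_part_size)

    def _upload_part(self, size):
        del self.buffer[:size]
        self.requests.append('PUT upload-part')

    def close(self):
        # Flush whatever is left as a final, possibly small, part.
        if self.buffer:
            self._upload_part(len(self.buffer))
        # Only now is the size known; finishing costs a second POST.
        self.requests.append('POST complete-multipart-upload')


writer = ToyMultipartWriter()
writer.write(b'x' * 1024)  # a small file, far below the part size
writer.close()
post_count = sum(r.startswith('POST') for r in writer.requests)
```

In this model a 1 KB write still produces one create request, one part and one complete request, which matches the two POST calls seen in the log.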