Creating one file in S3 results in multiple POST API calls
ablanchard opened this issue · comments
Hi,
I'm using smart-open to download an image from http and then upload it to S3. By reading the docs, I opened one file in read mode and one in write mode:
from smart_open import open
import boto3
import sys
boto3.set_stream_logger(name='botocore')
source = "https://m.media-amazon.com/images/I/31AkfWnD8ML.jpg"
destination = sys.argv[1]
with open(source, 'rb') as fin:
    with open(destination, 'wb') as fout:
        for line in fin:
            fout.write(line)
You can call it by passing an S3 destination path, like: python test_smart_open.py "s3://sagemaker-studio-2f9zflynkdx/test.jpg" 2> test_smart_open.txt
I'm using boto3.set_stream_logger(name='botocore') here to verbosely log everything the boto3 library does.
When grepping the logs in test_smart_open.txt, we can see that two POST calls are made for the test.jpg file.
2022-05-24 15:00:31,775 botocore.auth [DEBUG] CanonicalRequest:
POST
/test.jpg
--
2022-05-24 15:00:32,262 botocore.auth [DEBUG] CanonicalRequest:
POST
/test.jpg
Whereas when I do the same thing with a vanilla boto3 call, only one request is made.
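For comparison, the vanilla boto3 path can be sketched roughly as below. The bucket and key are taken from the example command above; since put_object needs AWS credentials, the function is only defined here, not invoked:

```python
import urllib.request

def vanilla_upload(source_url, bucket, key):
    """Sketch of the single-request path: download the image into memory,
    then upload it with one boto3 put_object call (a single request)."""
    import boto3  # imported lazily so the sketch parses without boto3 installed
    data = urllib.request.urlopen(source_url).read()
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=data)

# With credentials configured:
# vanilla_upload("https://m.media-amazon.com/images/I/31AkfWnD8ML.jpg",
#                "sagemaker-studio-2f9zflynkdx", "test.jpg")
```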
Is there something I'm doing wrong, or is it a bug?
Thanks
Versions
Linux-5.4.0-104-generic-x86_64-with-debian-bullseye-sid
Python 3.7.4 (default, Jan 20 2021, 16:10:08)
[GCC 9.3.0]
smart_open 6.0.0
I think this is because smart_open uses multipart uploads by default.
If you want to use single-part uploads, then pass transport_params={'multipart_upload': False}
in your call to the open
function.
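A minimal sketch of the corrected write side (note that the transport_params key must be the string 'multipart_upload'; the actual upload line is commented out since it needs AWS credentials):

```python
# Single-part upload with smart_open: pass multipart_upload=False via
# transport_params. The key is a string, not a bare name.
transport_params = {"multipart_upload": False}

# With credentials configured, the write side of the script above becomes:
# from smart_open import open
# with open("s3://sagemaker-studio-2f9zflynkdx/test.jpg", "wb",
#           transport_params=transport_params) as fout:
#     fout.write(data)
```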
Yes passing this parameter does the trick. I only get one call to S3.
I thought that since my file is under the default multipart part size of 50MB, it would not trigger a multipart upload, but in fact it does.
Thanks 👍
smart_open is file-agnostic: it works with streams by design, so there is no way for it to know in advance how large your file is.
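The consequence of not knowing the size up front can be sketched like this (a toy illustration, not smart_open's actual code): a writer consuming a stream can only buffer chunks and flush a part each time the buffer fills, so it commits to the multipart protocol before it can know whether the whole payload would have fit in one part.

```python
import io

MIN_PART_SIZE = 50 * 1024 * 1024  # stand-in for the 50MB default mentioned above

def stream_parts(fin, flush_part, min_part_size=MIN_PART_SIZE):
    """Toy sketch: buffer an unbounded stream, flushing a part whenever
    the buffer reaches min_part_size; the total size is never known ahead."""
    buf = io.BytesIO()
    parts = 0
    for chunk in iter(lambda: fin.read(8192), b""):
        buf.write(chunk)
        if buf.tell() >= min_part_size:
            flush_part(buf.getvalue())
            parts += 1
            buf = io.BytesIO()
    if buf.tell():
        flush_part(buf.getvalue())  # final, possibly short, part
        parts += 1
    return parts
```

With a small min_part_size, a 20000-byte stream gets split into three parts; with the real 50MB default, a small image produces a single part, but the multipart initiate/complete requests are still issued.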