Upgrade to use boto3

Question

sebastian-nagel opened this issue 7 years ago

cc-pyspark already uses boto3 to download data from s3://commoncrawl/, which gives faster multi-part downloads and fewer errors (timeouts, "503 slow down"). The same upgrade should improve the performance and robustness of cc-mrjob.

Sebastian Nagel · Answer 1 · Fri Sep 29 2017 16:39:04 GMT+0800 (China Standard Time)

A first working solution is available in the boto3 branch. It uses a temporary file and s3.client.download_fileobj(...) to fetch the data from S3:

- the StreamingBody object returned by s3.client.get_object()['Body'] does not implement the required methods of io.BufferedReader, cf. how-to-use-boto3-with-commoncrawl-streaming-data
- fast, parallelized, and configurable multi-part download
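
For readers who want to see the shape of this approach, here is a minimal sketch of downloading an object into a temporary file with download_fileobj and a tuned multi-part configuration. It is not the code from the boto3 branch: the helper names fetch_to_tempfile and make_client_and_config, the concurrency and chunk-size values, and the key placeholder in the usage note are all assumptions made for illustration.

```python
import tempfile


def fetch_to_tempfile(client, bucket, key, transfer_config=None):
    """Download s3://bucket/key into a temporary file and return it rewound.

    `client` is expected to behave like a boto3 S3 client. The returned
    object is a regular binary file, so it can be read and seeked freely.
    """
    tmp = tempfile.TemporaryFile(mode="w+b")
    try:
        # download_fileobj writes the object into the open file handle;
        # with a TransferConfig it may fetch several parts in parallel.
        client.download_fileobj(bucket, key, tmp, Config=transfer_config)
    except Exception:
        tmp.close()  # do not leak the temporary file on failure
        raise
    tmp.seek(0)  # rewind so the caller reads from the start
    return tmp


def make_client_and_config(max_concurrency=4, chunk_mb=8):
    """Build an S3 client and a multi-part transfer configuration.

    The default values are illustrative assumptions, not recommendations.
    boto3 is imported here so the helper above can also be exercised
    with a stub client when boto3 is not installed.
    """
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(
        multipart_threshold=chunk_mb * 1024 * 1024,  # use multi-part above this size
        multipart_chunksize=chunk_mb * 1024 * 1024,  # size of each part
        max_concurrency=max_concurrency,             # parallel download threads
    )
    return boto3.client("s3"), config
```

A call might look like `client, config = make_client_and_config()` followed by `warc = fetch_to_tempfile(client, "commoncrawl", key, config)`, where `key` stands for the path of a WARC file in the bucket. The trade-off is local disk space for the temporary copy in exchange for an ordinary file object that downstream readers can consume.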