commoncrawl / cc-mrjob

Demonstration of using Python to process the Common Crawl dataset with the mrjob framework

Upgrade to use boto3

sebastian-nagel opened this issue

cc-pyspark already uses boto3 to download data from s3://commoncrawl/, which gives faster multi-part downloads and fewer errors (timeouts, "503 slow down"). The same upgrade should improve the performance and robustness of cc-mrjob.

A first working solution is available in the branch boto3. It uses a temporary file and s3.client.download_fileobj(...) to fetch the data from S3:
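
A minimal sketch of this approach, assuming a boto3 S3 client. The helper names `fetch_to_tempfile` and `make_s3_client` are hypothetical and are not taken from the branch; only `boto3.client("s3")` and `download_fileobj` are actual boto3 calls.

```python
import tempfile


def fetch_to_tempfile(s3_client, bucket, key):
    """Download s3://<bucket>/<key> into a temporary file and return it, rewound.

    Hypothetical helper for illustration; not the code used in cc-mrjob.
    """
    # The temporary file is deleted automatically once it is closed.
    temp = tempfile.TemporaryFile(mode="w+b")
    try:
        # download_fileobj streams the object into any writable binary
        # file-like object; boto3 manages the transfer itself.
        s3_client.download_fileobj(bucket, key, temp)
    except Exception:
        # Do not leak the temporary file if the download fails.
        temp.close()
        raise
    # Rewind so the caller can read from the beginning.
    temp.seek(0)
    return temp


def make_s3_client():
    """Create a default boto3 S3 client (assumes credentials are configured)."""
    # Imported lazily so fetch_to_tempfile can be used with any object
    # that offers a compatible download_fileobj method.
    import boto3
    return boto3.client("s3")
```

A caller would pass the bucket name (`commoncrawl`) and the path of a WARC, WAT or WET file as the key, read from the returned file object, and close it when done so the temporary file is removed.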