commoncrawl / cc-mrjob

Demonstration of using Python to process the Common Crawl dataset with the mrjob framework


Job fails on Hadoop by assuming local WARC path

sebastian-nagel opened this issue

The CCJob fails when running in a distributed environment because it assumes that the path to the WARC file is local and does not point to the Common Crawl bucket:

+ python server_count_warc.py --step-num=0 --mapper
Traceback (most recent call last):
...
IOError: [Errno 2] No such file or directory: 'crawl-data/CC-MAIN-2016-44/segments/1476988717783.68/robotstxt/CC-MAIN-20161020183837-00000-ip-10-171-6-4.ec2.internal.warc.gz'

Seen with mrjob 0.5.6 on CDH 5.9.0.
It's actually described here: http://stackoverflow.com/questions/36812684/mrjob-determining-if-running-inline-local-emr-or-hadoop. The test

if self.options.runner in ['emr', 'hadoop']:

never matches on the Hadoop task nodes because the value of --runner is not passed through to the job by default, so the mapper falls back to the local-file branch.

Just submitted #8 to fix this.

Verified the fix on a CDH 5.9.0 Hadoop cluster. Thanks to @beeker1121 and @mpenkov!