commoncrawl / cc-mrjob

Demonstration of using Python to process the Common Crawl dataset with the mrjob framework


Job fails on Hadoop by assuming local WARC path

sebastian-nagel opened this issue

The CCJob fails when running in a distributed environment because it assumes that the path to the WARC file is local and does not point to the Common Crawl bucket:

+ python server_count_warc.py --step-num=0 --mapper
Traceback (most recent call last):
...
IOError: [Errno 2] No such file or directory: 'crawl-data/CC-MAIN-2016-44/segments/1476988717783.68/robotstxt/CC-MAIN-20161020183837-00000-ip-10-171-6-4.ec2.internal.warc.gz'

Seen with mrjob 0.5.6 on CDH 5.9.0.
It's actually described here: http://stackoverflow.com/questions/36812684/mrjob-determining-if-running-inline-local-emr-or-hadoop. The test

if self.options.runner in ['emr', 'hadoop']:

never matches on the Hadoop task nodes because the value of --runner is not passed through to the job by default, so the mapper falls back to the local-file branch.

Just submitted #8 to fix this.

Verified the fix on a CDH 5.9.0 Hadoop cluster. Thanks to @beeker1121 and @mpenkov!