commoncrawl / cc-mrjob

Demonstration of using Python to process the Common Crawl dataset with the mrjob framework


Job fails when running local job

mitcheccles opened this issue

The example jobs appear to fail when run locally.

I have cloned this repo and fetched the data using the get-data script. I then attempt to run the tag_counter example as per the README. The exact command I run is:

python tag_counter.py --conf-path mrjob.conf --no-output --output-dir out crawl-data/CC-MAIN-2014-35/segments/1408500800168.29/warc/

The response I get is:

IOError: [Errno 2] No such file or directory: '/<path>/cc-mrjob/WARC/1.0'

I get a different response if I run the above command with the -r local argument. The job appears to start executing and prints "Running step 1 of 1...". However, it then hangs indefinitely until I kill Python.

I've tried the examples on a couple of machines and keep getting the same result. I suspect I've missed some all-important step? Or maybe there is a bug?

I'm on Python 2.7, using mrjob 0.5.8.

That's because the job treats its input file as a list of WARC files to process, so it reads each line of the WARC file itself as a path. The first line of a WARC file is the version header WARC/1.0, which the job then tries to open as a file, hence the IOError above. Try the command as stated in the README:

python tag_counter.py --conf-path mrjob.conf --no-output --output-dir out input/test-1.warc
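To illustrate, here is a minimal sketch of that input-handling pattern, assuming the old warc library that code of this era used; the class name WARCPathJob and the record counting are illustrative, not the repo's exact code:

```python
from mrjob.job import MRJob

import warc  # assumed: the `warc` library used by cc-mrjob-era code


class WARCPathJob(MRJob):
    """Sketch: each mapper input line is a *path* to a WARC file."""

    def mapper(self, _, line):
        # `line` is one line of the file passed on the command line,
        # e.g. input/test-1.warc contains WARC paths, one per line.
        # Passing a WARC file directly means its first line, "WARC/1.0",
        # gets treated as a path -- producing the IOError above.
        path = line.strip()
        f = warc.open(path)  # handles .warc (and .warc.gz) files
        try:
            for record in f:
                # a real job would parse the record payload here
                yield record.header.get('WARC-Type', 'unknown'), 1
        finally:
            f.close()

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    WARCPathJob.run()
```

This is also why a directory argument won't work: mrjob feeds each line of the named input file to a mapper, so the input must be a text file listing WARC paths, not the WARC data itself.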

But the description could state this explicitly. I'll update it. If you find more points which need clarification, please report them or open a pull request. Thanks!

Ah nuts! Yes, that works now. Thank you. I don't know why, but I thought input/test-1.warc was just a placeholder string and didn't twig that it was an actual file in the repo... Doh!

Thanks for your help and for putting this tutorial together :).