bwhite / hadoopy

Python MapReduce library written in Cython. Visit us in #hadoopy on freenode. See the link below for documentation and tutorials.

Permission Denied error triggered by Python subprocess

tylere opened this issue · comments

When trying to get the hadoopy wordcount test to work, I am running into an "OSError: [Errno 13] Permission denied" error in the Python subprocess. Since the subprocess is triggered by hadoopy/_main.c, my normal Python debugging process (pdb) doesn't seem to work.

Any suggestions as to how to find out what file/directory hadoopy is trying to access (and as what user)?

I am able to run the Cloudera example grep job (standalone mode) and the hadoopy wordcount command line test, but the hadoopy python wordcount test fails.
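As a generic illustration of this failure mode (not necessarily the exact cause here), Errno 13 (EACCES) is what `subprocess` raises when it is asked to execute a file that lacks the execute bit, which is how Streaming invokes the mapper script; `os.access` is a quick way to check for it:

```python
import errno
import os
import subprocess
import tempfile

# A script written without the execute bit, like a freshly copied wc.py.
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write('#!/usr/bin/env python\nprint("ok")\n')
    script = f.name

print(os.access(script, os.X_OK))  # False: direct execution will fail

try:
    # Executing the file directly, as Streaming does, raises EACCES.
    subprocess.call([script])
except OSError as e:
    print(e.errno == errno.EACCES)  # True

os.chmod(script, 0o755)            # adding the execute bit fixes it
print(os.access(script, os.X_OK))  # True
os.unlink(script)
```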

Environment:

  • Ubuntu 10.04
  • Cloudera Hadoop Distribution CDH3 (stand-alone mode)

The Hadoop output with the Permission Denied error is listed below.

rm playground/out/ -fr; python -c "import hadoopy; out = hadoopy.launch('playground/wc-input-alice.txt', 'playground/out/', 'wc.py')"
/\----------Hadoop Output----------/\
hadoopy: Running[hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar -output playground/out/ -input playground/wc-input-alice.txt -mapper "python wc.py pipe map" -reducer "python wc.py pipe reduce" -file wc.py -jobconf mapred.job.name=wc -io typedbytes -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat -inputformat AutoInputFormat]
11/11/16 11:41:08 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [wc.py] [] /tmp/streamjob2130356558435301069.jar tmpDir=null
11/11/16 11:41:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/11/16 11:41:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/11/16 11:41:09 INFO mapred.FileInputFormat: Total input paths to process : 1
11/11/16 11:41:09 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
11/11/16 11:41:09 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /app/hadoop/tmp/mapred/staging/taericks-1131214868/.staging/job_local_0001/.job.split.crc <- /app/hadoop/tmp/mapred/local/localRunner/.job.split.crc
11/11/16 11:41:09 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /app/hadoop/tmp/mapred/staging/taericks-1131214868/.staging/job_local_0001/job.split <- /app/hadoop/tmp/mapred/local/localRunner/job.split
11/11/16 11:41:09 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /app/hadoop/tmp/mapred/staging/taericks-1131214868/.staging/job_local_0001/job.jar <- /app/hadoop/tmp/mapred/local/localRunner/job.jar
11/11/16 11:41:09 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /app/hadoop/tmp/mapred/staging/taericks-1131214868/.staging/job_local_0001/.job.xml.crc <- /app/hadoop/tmp/mapred/local/localRunner/.job.xml.crc
11/11/16 11:41:09 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /app/hadoop/tmp/mapred/staging/taericks-1131214868/.staging/job_local_0001/.job.splitmetainfo.crc <- /app/hadoop/tmp/mapred/local/localRunner/.job.splitmetainfo.crc
11/11/16 11:41:09 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /app/hadoop/tmp/mapred/staging/taericks-1131214868/.staging/job_local_0001/job.xml <- /app/hadoop/tmp/mapred/local/localRunner/job.xml
11/11/16 11:41:09 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /app/hadoop/tmp/mapred/staging/taericks-1131214868/.staging/job_local_0001/.job.jar.crc <- /app/hadoop/tmp/mapred/local/localRunner/.job.jar.crc
11/11/16 11:41:09 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /app/hadoop/tmp/mapred/staging/taericks-1131214868/.staging/job_local_0001/job.splitmetainfo <- /app/hadoop/tmp/mapred/local/localRunner/job.splitmetainfo
11/11/16 11:41:09 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
11/11/16 11:41:09 INFO streaming.StreamJob: Running job: job_local_0001
11/11/16 11:41:09 INFO streaming.StreamJob: Job running in-process (local Hadoop)
11/11/16 11:41:09 WARN snappy.LoadSnappy: Snappy native library is available
11/11/16 11:41:09 INFO snappy.LoadSnappy: Snappy native library loaded
11/11/16 11:41:09 INFO mapred.MapTask: numReduceTasks: 1
11/11/16 11:41:09 INFO mapred.MapTask: io.sort.mb = 100
11/11/16 11:41:09 INFO mapred.MapTask: data buffer = 79691776/99614720
11/11/16 11:41:09 INFO mapred.MapTask: record buffer = 262144/327680
11/11/16 11:41:09 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
11/11/16 11:41:09 INFO streaming.PipeMapRed: PipeMapRed exec [/home/terickson/.virtualenvs/hadoopy/bin/python, wc.py, pipe, map]
11/11/16 11:41:09 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
11/11/16 11:41:09 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
11/11/16 11:41:09 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
Traceback (most recent call last):
  File "wc.py", line 15, in <module>
    hadoopy.run(mapper, reducer, doc=__doc__)
  File "_main.pyx", line 445, in _main.run (hadoopy/_main.c:5554)
  File "/usr/lib/python2.6/subprocess.py", line 480, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib/python2.6/subprocess.py", line 633, in __init__
    errread, errwrite)
  File "/usr/lib/python2.6/subprocess.py", line 1138, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied
11/11/16 11:41:09 INFO streaming.PipeMapRed: MRErrorThread done
11/11/16 11:41:09 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
11/11/16 11:41:09 INFO streaming.PipeMapRed: log:null
R/W/S=2274/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=taericks
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Wed Nov 16 11:41:09 PST 2011
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:297)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.io.WritableUtils.writeString(WritableUtils.java:100)
    at org.apache.hadoop.typedbytes.TypedBytesOutput.writeString(TypedBytesOutput.java:223)
    at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.writeText(TypedBytesWritableOutput.java:182)
    at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.write(TypedBytesWritableOutput.java:126)
    at org.apache.hadoop.streaming.io.TypedBytesInputWriter.writeTypedBytes(TypedBytesInputWriter.java:57)
    at org.apache.hadoop.streaming.io.TypedBytesInputWriter.writeValue(TypedBytesInputWriter.java:52)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)


11/11/16 11:41:09 WARN streaming.PipeMapRed: java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:297)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:569)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:125)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)

11/11/16 11:41:09 INFO streaming.PipeMapRed: mapRedFinished
11/11/16 11:41:09 WARN streaming.PipeMapRed: java.io.IOException: Stream Closed
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:297)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:569)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)

11/11/16 11:41:09 INFO streaming.PipeMapRed: mapRedFinished
11/11/16 11:41:09 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: log:null
R/W/S=2274/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=taericks
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Wed Nov 16 11:41:09 PST 2011
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:297)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.io.WritableUtils.writeString(WritableUtils.java:100)
    at org.apache.hadoop.typedbytes.TypedBytesOutput.writeString(TypedBytesOutput.java:223)
    at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.writeText(TypedBytesWritableOutput.java:182)
    at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.write(TypedBytesWritableOutput.java:126)
    at org.apache.hadoop.streaming.io.TypedBytesInputWriter.writeTypedBytes(TypedBytesInputWriter.java:57)
    at org.apache.hadoop.streaming.io.TypedBytesInputWriter.writeValue(TypedBytesInputWriter.java:52)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)


    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:126)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
11/11/16 11:41:10 INFO streaming.StreamJob:  map 0%  reduce 0%
11/11/16 11:41:10 INFO streaming.StreamJob: Job running in-process (local Hadoop)
11/11/16 11:41:10 ERROR streaming.StreamJob: Job not successful. Error: NA
11/11/16 11:41:10 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/terickson/.virtualenvs/hadoopy/lib/python2.6/site-packages/hadoopy/_runner.py", line 214, in launch
    raise subprocess.CalledProcessError(process.returncode, ' '.join(cmd))
subprocess.CalledProcessError: Command 'hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar -output playground/out/ -input playground/wc-input-alice.txt -mapper "python wc.py pipe map" -reducer "python wc.py pipe reduce" -file wc.py -jobconf mapred.job.name=wc -io typedbytes -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat -inputformat AutoInputFormat' returned non-zero exit status 1

I believe this is due to local execution mode; try pseudo-distributed. Hadoop's local mode is poor at simulating the distributed environment, and I've found several bugs in it that I haven't had a chance to report. One I know of is that it doesn't handle relative paths, so that could be the issue here.
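If relative paths are indeed the problem, a minimal workaround sketch is to resolve every local path before launching (the `hadoopy.launch` call is shown only as a comment, since it needs a running Hadoop):

```python
import os

# Local-mode Hadoop may resolve relative paths against an unexpected
# working directory, so resolve local paths to absolute ones up front.
input_path = os.path.abspath('playground/wc-input-alice.txt')
script_path = os.path.abspath('wc.py')

print(os.path.isabs(input_path) and os.path.isabs(script_path))  # True

# The launch call would then be, e.g.:
# hadoopy.launch(input_path, 'playground/out/', script_path)
```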

Yes, I was using the standalone configuration for Hadoop. Once I changed over to pseudo-distributed mode, I received a different error (which I will describe in a separate issue).

If hadoopy does not work with Hadoop in standalone mode, this should probably be noted in the hadoopy documentation (perhaps in an installation section).

I looked into local mode a bit more, and the issue isn't on hadoopy's side: local mode doesn't properly emulate the distributed cache with respect to .tar file decompression. Eventually we can try to get this fixed in Hadoop, but it isn't a high priority.

Updated the readme to point to this issue.

I am getting the error below when using hadoopy.launch.
The error says that the input file does not exist, but it is there in HDFS.

out = hadoopy.launch('data/word.txt','data/out2','/home/ubuntu/hadoopy-MR/wc.py')
WARNING:root:Script is not executable which is a requirement when pipe=True. A temporary copy will be modified to correct this.
WARNING:root:Adding "#!/usr/bin/env python" to script [/tmp/tmp0ZrwWewc.py]. This will make line numbers off by one from the original.
WARNING:root:18/09/06 22:12:56 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
WARNING:root:18/09/06 22:12:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARNING:root:18/09/06 22:12:56 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
ERROR:root:18/09/06 22:12:59 ERROR streaming.StreamJob: Error Launching job : Input path does not exist: hdfs://localhost:9000/user/ubuntu/data/word.txt
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/hadoopy/src/hadoopy/hadoopy/_runner.py", line 328, in launch
    raise subprocess.CalledProcessError(process.returncode, ' '.join(cmd))
subprocess.CalledProcessError: Command 'hadoop jar /usr/local/hadoop-env/hadoop-2.7.0/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar -output data/out2 -input data/word.txt -mapper "python tmp0ZrwWewc.py pipe map" -reducer "python tmp0ZrwWewc.py pipe reduce" -file /tmp/tmp0ZrwWewc.py -jobconf "mapred.job.name=tmp0ZrwWewc" -io typedbytes -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat -inputformat AutoInputFormat' returned non-zero exit status 5

Please help me.
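A note on the path in that error: Hadoop resolves a relative HDFS path against the user's HDFS home directory, /user/&lt;username&gt;. A small hypothetical helper (`resolve_hdfs` is not a real hadoopy or Hadoop function, just an illustration) mirrors the resolution shown in the log:

```python
def resolve_hdfs(path, user, namenode='hdfs://localhost:9000'):
    """Mimic how Hadoop expands a relative HDFS path (illustration only)."""
    if path.startswith('/'):
        return namenode + path
    return '%s/user/%s/%s' % (namenode, user, path)

# 'data/word.txt' becomes exactly the path in the error message:
print(resolve_hdfs('data/word.txt', 'ubuntu'))
# hdfs://localhost:9000/user/ubuntu/data/word.txt
```

So the file must exist at /user/ubuntu/data/word.txt on HDFS (e.g. confirm with `hadoop fs -ls data/word.txt`), not just on the local filesystem.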