python train.py gives a CalledProcessError

Question

kessler-frost opened this issue 6 years ago · comments

When I run python train.py on the specified CPU system I get a very long error message ending with,
Traceback (most recent call last):
  File "train.py", line 450, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os.exit()
  File "train.py", line 424, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "/home/neptune/anaconda3/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '65', '/home/neptune/anaconda3/bin/python', '-u', 'train.py']' returned non-zero exit status 134
I searched for the exit status for mpirun but wasn't able to debug the issue.
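
For context, an exit status above 128 conventionally means the child process was killed by signal number (status - 128), so 134 would correspond to signal 6, which is SIGABRT on Linux. The sketch below decodes a status that way. It assumes mpirun follows that convention here, and the helper name describe_exit_status is made up for illustration.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status (assumes the 128 + N signal convention)."""
    if status > 128:
        signum = status - 128
        try:
            # Look up the platform's name for this signal number.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return "terminated by signal {} ({})".format(signum, name)
    return "exited with code {}".format(status)


if __name__ == "__main__":
    print(describe_exit_status(134))
```

An abort like this usually means one of the launched processes crashed, so the useful message tends to be earlier in the output rather than in the final CalledProcessError.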

hardmaru · Answer 1 · Tue Jul 17 2018 13:36:11 GMT+0800 (China Standard Time)

Have you tried running ESTool with a simple experiment to check that MPI is installed correctly? Also, I think train.py is configured for a 64-core machine. If you are using fewer cores, pass in a flag to specify that (instructions are in the ESTool repo and the blog posts).


Sankalp Sanand · Answer 2 · Tue Jul 17 2018 14:19:34 GMT+0800 (China Standard Time)

Yes, I've tried running ESTool with a simple experiment from your estool repo using python train.py bullet_racecar -n 8 -t 4, and it ran without any issue or error. I even tried python train.py bullet_ant -e 16 -n 64 -t 4 after installing pybullet, and that ran successfully too. But I still couldn't get the same thing to work on doom. And yes, I am using a 64-core machine with 200GB RAM on gcloud for all of the experiments, just as you mentioned in the blog post.

David Foster · Answer 3 · Tue Jul 17 2018 20:23:15 GMT+0800 (China Standard Time)

Could be related to this:
AppliedDataSciencePartners/WorldModels#3

Ensure you've only got one MPI library on your machine (e.g. try running this if you're on Linux):
sudo apt-get remove openmpi-bin

If you have multiple MPI installations then comm.Get_size() returns 1, so the following assert statement fails:
num_worker = comm.Get_size()
assert len(packet_list) == num_worker-1
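
One rough way to spot more than one MPI installation is to list every launcher visible on PATH. The sketch below does that; the function name find_launchers is hypothetical, and checking for both mpirun and mpiexec is an assumption about how the launchers are named. Finding two entries is only a hint, not proof of a conflict, since mpi4py may also have been built against a different MPI than the launcher that runs.

```python
import os


def find_launchers(names=("mpirun", "mpiexec"), path=None):
    """List every executable MPI launcher on PATH, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    found = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for name in names:
            candidate = os.path.join(directory, name)
            # Keep only regular files that the current user may execute.
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                found.append(candidate)
    return found


if __name__ == "__main__":
    for launcher in find_launchers():
        print(launcher)
```

The first entry printed is the one that a bare mpirun call, like the one in mpi_fork, would pick up.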

Sankalp Sanand · Answer 4 · Wed Jul 18 2018 03:24:50 GMT+0800 (China Standard Time)

I tried that, but it opens up a new box of errors like
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'
or
a message saying lib12.so was not found, or something like that.

Interestingly though, when I changed the number of workers from 64 to 32 or 24 by running
python train.py -n 32
it started producing the expected output:
('doomrnn', (1, 35, 269.67, 149.75, 480.81, 69.88, 0.09914, 269.67, 480))

I guess the issue only appears when we use 64 cores (which is odd).

David Foster · Answer 5 · Wed Jul 18 2018 04:53:54 GMT+0800 (China Standard Time)

The number of workers has to be less than the number of cores. How many cores have you got?
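
A quick way to test that constraint before launching is sketched below. It assumes, based on the mpirun -np 65 in the traceback, that train.py starts num_worker + 1 processes; the helper name check_worker_count is made up for illustration.

```python
import os


def check_worker_count(num_worker, cores=None):
    """Return True if num_worker + 1 MPI processes fit on the available cores."""
    if cores is None:
        # os.cpu_count() can return None on unusual platforms.
        cores = os.cpu_count() or 1
    return num_worker + 1 <= cores


if __name__ == "__main__":
    for n in (24, 32, 64):
        print(n, check_worker_count(n))
```

Under that assumption, 64 workers on a 64-core machine would not pass the check, while 32 or 24 would.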

Try uninstalling Open MPI and installing MPICH instead:

sudo apt-get install mpich

Sankalp Sanand · Answer 6 · Wed Jul 18 2018 12:23:36 GMT+0800 (China Standard Time)

I. Combinations I've tried that seemed to work (without uninstalling openmpi):

64 Core proc, python train.py -n 24 or python train.py 32
24 Core proc, python train.py -n 24

II. Combinations that did not work:

with openmpi -
64 Core proc, python train.py

with mpich -
64 Core proc, python train.py
64 Core proc, python train.py -n 32

Also, I'm using Anaconda 4.2 in all of my experiments because Python 3.6 was causing issues with boost libraries.
If someone is able to do a clean installation of all the project dependencies on a 64-core machine, I'd suggest they try the solution by @davidADSP. I've exhausted all of my gcloud credits and am stuck with a 24-core machine on a new account.

David Foster · Answer 7 · Wed Jul 18 2018 15:32:48 GMT+0800 (China Standard Time)

Do you get the same problems with the car racing task or is it just doom?

Sankalp Sanand · Answer 8 · Wed Jul 18 2018 17:09:33 GMT+0800 (China Standard Time)

I don't know about a 64-core machine, but on the 24-core one python train.py -n 24 runs successfully for the car racing task. For a while this issue was also present on the 24-core machine, but I was able to work around it by installing everything in this particular order:
pip install tensorflow==1.8 gym==0.9.4 cma==2.2

conda install libgcc

apt-get install -y python-numpy cmake zlib1g-dev libjpeg-dev libboost-all-dev gcc libsdl2-dev wget unzip git

pip install mpi4py==2

pip install ppaquette-gym-doom

Were you able to reproduce this error?

I think this issue is caused by an error in any one of the threads while they are executing. When I looked carefully, I found different causes on different runs.
For example, one time I got the same AssertionError you referenced:
Traceback (most recent call last):
  File "05_train_controller.py", line 461, in <module>
    main(args)
  File "05_train_controller.py", line 410, in main
    master()
  File "05_train_controller.py", line 319, in master
    send_packets_to_slaves(packet_list)
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError

Another time I got this in the middle of a whole screen of text:
ImportError: libXft.so.2: cannot open shared object file: No such file or directory

So I guess this is caused by dependency issues (the same one across all the threads).
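
Since every worker prints its own copy of the failure, it can help to reduce the captured output to the distinct error lines. The sketch below is one way to do that, assuming the combined output has been saved as text. The function name unique_error_lines and the regular expression are illustrative choices, not part of train.py.

```python
import re

# Matches lines that start with an exception name such as
# "ImportError: ..." or "subprocess.CalledProcessError: ...".
ERROR_LINE = re.compile(r"[\w.]*(?:Error|Exception)\b")


def unique_error_lines(log_text):
    """Return the distinct exception lines in log_text, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if ERROR_LINE.match(stripped) and stripped not in seen:
            seen.append(stripped)
    return seen


if __name__ == "__main__":
    sample = (
        'File "05_train_controller.py", line 233, in send_packets_to_slaves\n'
        "AssertionError\n"
        "ImportError: libXft.so.2: cannot open shared object file: No such file or directory\n"
        "AssertionError\n"
    )
    for line in unique_error_lines(sample):
        print(line)
```

If the same line shows up once per worker, that points toward a shared dependency problem rather than something specific to one process.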

Sankalp Sanand · Answer 9 · Wed Jul 18 2018 21:47:58 GMT+0800 (China Standard Time)

Now I've come across a new error. I created a completely new instance, did the installation in the order mentioned above, then ran python train.py, and this occurred:

RuntimeError: can't start new thread

I guess all of the other errors were resolved by doing a clean installation in that order.
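
One possible cause of "can't start new thread" is a per-user limit on processes and threads, although nothing in this thread confirms that was the case here. The sketch below prints that limit on Linux using the standard resource module (Unix only); the function names are made up for illustration.

```python
import resource
import threading


def thread_headroom():
    """Return (soft_limit, hard_limit, live_threads_in_this_process)."""
    # On Linux, RLIMIT_NPROC counts threads as well as processes for the user.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return soft, hard, threading.active_count()


def format_limit(limit):
    """Render a resource limit, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard, live = thread_headroom()
    print("soft limit:", format_limit(soft))
    print("hard limit:", format_limit(hard))
    print("threads in this process:", live)
```

A low soft limit combined with 65 launched processes, each starting its own threads, would be one thing to rule out.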

hardmaru · Answer 10 · Sat Jul 21 2018 08:11:09 GMT+0800 (China Standard Time)

Hi @kessler-frost

I'm not sure how to resolve this to be honest. The only diff I see is the python version I used (3.5.2)

I ran train.py today on a fresh machine (to check another issue on another thread) for ~ half a day and it seemed to work on my machine:

https://github.com/hardmaru/WorldModelsExperiments/blob/master/doomrnn/trainlog/train.log.txt

Sankalp Sanand · Answer 11 · Sat Jul 21 2018 14:34:22 GMT+0800 (China Standard Time)

@hardmaru thank you. I don't understand why this is happening either; we are both using the same Anaconda distribution (Python 3.5.2). I guess I'll close this issue until someone comes across it again.

Antonio-git-lab · Answer 12 · Thu Mar 26 2020 17:23:11 GMT+0800 (China Standard Time)

Hello, while running train.py I got this error. Can someone help me, please?
  File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 445, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os._exit()
  File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 419, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 266, in check_call
    retcode = call(*popenargs, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 247, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 957, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
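
This traceback ends inside Popen, which suggests the mpirun executable itself could not be found on PATH. A small pre-flight check, sketched below, would make that explicit before the subprocess call. The function name resolve_launcher is hypothetical, and trying mpiexec as a fallback is an assumption: some MPI distributions ship the launcher under that name, but train.py as quoted only calls mpirun.

```python
import shutil


def resolve_launcher(candidates=("mpirun", "mpiexec"), path=None):
    """Return the full path of the first MPI launcher found, or raise."""
    for name in candidates:
        # shutil.which searches PATH (or the given path) for an executable.
        found = shutil.which(name, path=path)
        if found:
            return found
    raise FileNotFoundError(
        "No MPI launcher found on PATH; tried: " + ", ".join(candidates)
    )


if __name__ == "__main__":
    try:
        print("Using launcher:", resolve_launcher())
    except FileNotFoundError as err:
        print(err)
```

If the check fails, installing an MPI implementation and adding its bin directory to PATH is the first thing to try.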