hardmaru / WorldModelsExperiments

World Models Experiments

python train.py gives a CalledProcessError

kessler-frost opened this issue

When I run python train.py on the specified CPU system, I get a very long error message ending with:
Traceback (most recent call last):
  File "train.py", line 450, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os.exit()
  File "train.py", line 424, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "/home/neptune/anaconda3/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '65', '/home/neptune/anaconda3/bin/python', '-u', 'train.py']' returned non-zero exit status 134
I searched for the exit status for mpirun but wasn't able to debug the issue.
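A note on reading that status: by the usual shell convention, an exit status above 128 means the process was killed by signal number (status - 128), so 134 corresponds to signal 6, which is SIGABRT on Linux. Whether mpirun passes a worker's status through in exactly this way is an assumption here, and describe_exit_status is a hypothetical helper name, not part of train.py. A minimal sketch:

```python
import signal


def describe_exit_status(status):
    """Describe a shell-style exit status (hypothetical helper)."""
    if status == 0:
        return "success"
    if status > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return "terminated by signal {} ({})".format(signum, name)
    return "exited with error code {}".format(status)


print(describe_exit_status(134))
```

If the status does map to SIGABRT, some worker process probably aborted (a failed assert in native code, or an abort called by the MPI runtime), so the real cause should be further up in the long error output.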

Yes, I've tried running ESTool with a simple experiment from your estool repo using python train.py bullet_racecar -n 8 -t 4, and it ran without any errors. I also tried python train.py bullet_ant -e 16 -n 64 -t 4 after installing pybullet, and that ran successfully too. But I was still unable to do the same on doom. And yes, I am using a 64-core machine with 200GB RAM on gcloud for all of the experiments, just as you mentioned in the blog post.

Could be related to this:
AppliedDataSciencePartners/WorldModels#3

Ensure you've only got one MPI library on your machine (e.g. try running this if you're on Linux):
sudo apt-get remove openmpi-bin

If you have multiple MPI libraries installed, then comm.Get_size() returns 1, so the following assert statement fails:
num_worker = comm.Get_size()
assert len(packet_list) == num_worker-1
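To make that failure easier to spot than a bare AssertionError, the world size can be checked up front. The sketch below assumes the layout implied by the traceback, where train.py launches num_worker+1 processes (workers plus one master). check_world_size and current_world_size are hypothetical helper names, not functions from the repo.

```python
def check_world_size(world_size, expected_workers):
    """Return None if the MPI world looks right, else a diagnostic message."""
    expected_size = expected_workers + 1  # workers plus one master rank
    if world_size == expected_size:
        return None
    if world_size == 1:
        return ("MPI world size is 1: each process thinks it is alone. "
                "The mpirun launcher and the MPI library that mpi4py was "
                "built against may not match.")
    return "MPI world size is {}, expected {}.".format(world_size, expected_size)


def current_world_size():
    # Imported lazily so check_world_size works without mpi4py installed.
    from mpi4py import MPI
    return MPI.COMM_WORLD.Get_size()
```

Inside a process started by mpirun, calling check_world_size(current_world_size(), 64) would return a message instead of tripping the assert later.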

Tried that, but it opens up a new box of errors, such as
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'
or a message saying something like lib12.so was not found.

Interestingly though, when I changed the number of workers from 64 to 32 or 24 by running
python train.py -n 32
it started producing the expected output:
('doomrnn', (1, 35, 269.67, 149.75, 480.81, 69.88, 0.09914, 269.67, 480))

I guess the issue only comes up when we use all 64 cores (which is odd).

The number of workers has to be less than the number of cores. How many cores have you got?
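As a rough illustration of that rule of thumb, the worker count can be capped from the detected core count before launching. This is a conservative sketch, not something train.py does itself; max_workers and clamp_workers are hypothetical names, and reserving exactly one core for the master process is an assumption.

```python
import os


def max_workers(cores=None):
    """Largest worker count that still leaves one core for the master."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, cores - 1)


def clamp_workers(requested, cores=None):
    """Cap the requested worker count at max_workers(cores)."""
    return min(requested, max_workers(cores))


print(clamp_workers(64, cores=64))
```

The capped value could then be passed on the command line as python train.py -n <value>.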

Try uninstalling Open MPI and installing mpich instead:

sudo apt-get install mpich

I. I've tried the following combinations, which seemed to work (without uninstalling openmpi):

  1. 64 Core proc, python train.py -n 24 or python train.py -n 32
  2. 24 Core proc, python train.py -n 24

II. Combinations which did not work:

with openmpi -
  1. 64 Core proc, python train.py

with mpich -
  1. 64 Core proc, python train.py
  2. 64 Core proc, python train.py -n 32

Also, I'm using Anaconda 4.2 in all of my experiments because Python 3.6 was causing issues with the boost libraries.
If someone can do a clean installation of all the project dependencies on a 64-core machine, I'd suggest they try the solution by @davidADSP. I've exhausted all of my gcloud credits and am stuck with a 24-core machine on a new account.

Do you get the same problems with the car racing task or is it just doom?

I don't know about a 64-core machine, but on 24 cores python train.py -n 24 executes successfully for the car racing task. For a while this issue was also present on the 24-core machine, but I was able to work around it by installing things in this particular order:
pip install tensorflow==1.8 gym==0.9.4 cma==2.2

conda install libgcc

apt-get install -y python-numpy cmake zlib1g-dev libjpeg-dev libboost-all-dev gcc libsdl2-dev wget unzip git

pip install mpi4py==2

pip install ppaquette-gym-doom

Were you able to reproduce this error?

I think this issue is caused by an error in any one of the threads while they execute. When I looked carefully, I found that the reasons differed between runs.
For example, one time I got the same AssertionError you referenced:
Traceback (most recent call last):
  File "05_train_controller.py", line 461, in <module>
    main(args)
  File "05_train_controller.py", line 410, in main
    master()
  File "05_train_controller.py", line 319, in master
    send_packets_to_slaves(packet_list)
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError

Then another time I got this in the middle of a whole screen of text:
ImportError: libXft.so.2: cannot open shared object file: No such file or directory

So I guess this is caused by dependency issues (the same one across all the threads).

Now I've come across a new error. I created a completely new instance, did the installation as described above, then executed python train.py, and this occurred:

RuntimeError: can't start new thread

I guess all of the other errors were resolved by doing a clean installation in that order.

Hi @kessler-frost

I'm not sure how to resolve this to be honest. The only diff I see is the python version I used (3.5.2)

I ran train.py today on a fresh machine (to check another issue on another thread) for ~ half a day and it seemed to work on my machine:

https://github.com/hardmaru/WorldModelsExperiments/blob/master/doomrnn/trainlog/train.log.txt

@hardmaru thank you. I don't understand why this is happening either; we are both using the same Anaconda distribution (Python 3.5.2). I guess I'll close this issue until someone comes across it again.

Hello, while running train.py I got this error. Can someone help me, please?
  File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 445, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os._exit()
  File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 419, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 266, in check_call
    retcode = call(*popenargs, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 247, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 957, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
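
This FileNotFoundError is raised by subprocess when the executable it was asked to run ("mpirun" in the mpi_fork call above) cannot be found on PATH. A quick way to confirm that before running train.py is sketched below. find_mpi_launcher is a hypothetical helper; including "mpiexec" as a second candidate is an assumption, since some MPI distributions ship their launcher under that name, and train.py itself only calls mpirun.

```python
import shutil


def find_mpi_launcher(candidates=("mpirun", "mpiexec")):
    """Return the full path of the first MPI launcher found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


launcher = find_mpi_launcher()
if launcher is None:
    print("No MPI launcher found on PATH; install an MPI implementation first.")
else:
    print("Found MPI launcher:", launcher)
```

If nothing is found, installing an MPI implementation and making sure its launcher is on PATH would be the first thing to try.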