Compiling and running tests?
thvasilo opened this issue · comments
Hello, I thought I'd help out with testing the new features that @chenqin is going to be working on, but I can't seem to be able to get the tests to compile and run properly.
The README in the test folder refers to a keepalive.sh script, but that doesn't seem to exist.
I'm able to build the project through CMake; however, running make in the tests folder fails for the lazy_recover target with an unexpected error:
g++ -c -Wall -O3 -msse2 -Wno-unknown-pragmas -fPIC -I../include -std=c++0x -o lazy_recover.o lazy_recover.cc
In file included from src/../include/rabit/internal/engine.h:10:0,
from src/engine_mpi.cc:14:
src/../include/rabit/internal/../serializable.h:12:10: fatal error: dmlc/io.h: No such file or directory
#include "dmlc/io.h"
^~~~~~~~~~~
compilation terminated.
Makefile:88: recipe for target 'engine_mpi.o' failed
make[1]: *** [engine_mpi.o] Error 1
Which is weird, because the io.h header is indeed under ../include/dmlc.
When I try to run the rest of the tests that do compile, only the ring allreduce test runs to completion; the others raise runtime errors.
I'm wondering if I'm doing something wrong when running these tests; I simply call make -f test.mk <test-name>.
Do you have instructions for running the tests @chenqin ?
I think the MPI engine is still not working, at least on my Mac. I don't see why we need to support an MPI engine within an MPI engine :) Anyway, I submitted a reference fix: c6f6b25
As for testing locally, what I did was run the command below from the xgboost/rabit directory, pointing to my own dmlc-core checkout at xgboost/dmlc-core. Pay attention to the relative directory layout.
../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=10 --local-num-attempt=10 test/model_recover 10000 mock=0,0,1,0 mock=1,1,1,0
Thanks @chenqin, compiling the tests under the xgboost directory is actually fine; the error is gone in that case. Still weird that it doesn't work in base rabit though, but I tested your fix and it seems to compile fine now. Haven't tried that branch under xgboost though. I assume some shenanigans with different dmlc-core versions are the cause; I had cloned the dmlc-core master into rabit to compile.
OK so I tried running such an example and here's my truncated output:
> ../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=10 --local-num-attempt=10 test/model_recover 10000 mock=0,0,1,0
2019-03-14 11:27:21,968 INFO start listen on 10.112.11.162:9091
[0] reload-trail=0, init iter=0
[1] reload-trail=0, init iter=0
[9] reload-trail=0, init iter=0
[8] reload-trail=0, init iter=0
[6] reload-trail=0, init iter=0
[2] reload-trail=0, init iter=0
[7] reload-trail=0, init iter=0
[5] reload-trail=0, init iter=0
[3] reload-trail=0, init iter=0
[4] reload-trail=0, init iter=0
2019-03-14 11:27:22,955 INFO @tracker All of 10 nodes getting started
[0] !!!TestMax pass, iter=0
[3] !!!TestMax pass, iter=0
[0]@@@Hit Mock Error:Broadcast
[5] !!!TestMax pass, iter=0
[1] !!!TestMax pass, iter=0
Exception in thread Thread-2:
Traceback (most recent call last):
File "/home/tvas/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/tvas/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/Dropbox/SICS/repos/xgboost/dmlc-core/tracker/dmlc_tracker/local.py", line 42, in exec_cmd
raise RuntimeError('Get nonzero return code=%d on %s %s' % (ret, cmd, env))
RuntimeError: Get nonzero return code=254 on test/model_recover 10000 mock=0,0,1,0 --local-num-attempt=10 {<other-env-vars>.... '_': '../dmlc-core/tracker/dmlc-submit', 'DMLC_NUM_WORKER': '10', 'DMLC_NUM_SERVER': '0', 'DMLC_TRACKER_URI': '10.112.11.162', 'DMLC_TRACKER_PORT': '9091', 'DMLC_TASK_ID': '0', 'DMLC_ROLE': 'worker', 'DMLC_JOB_CLUSTER': 'local'}
[9] !!!TestMax pass, iter=0
[2] !!!TestMax pass, iter=0
[6] !!!TestMax pass, iter=0
[8] !!!TestMax pass, iter=0
[4] !!!TestMax pass, iter=0
[7] !!!TestMax pass, iter=0
Then the process just hangs. Is this expected behavior? I'm not sure how to determine whether a test run was successful.
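One way to tell a pass from a failure or a hang is to wrap the run in a timeout and inspect the exit code. The following is a minimal sketch, not part of rabit or dmlc-core: the helper name run_with_timeout and the timeout values are hypothetical, and the stand-in command would be replaced with the dmlc-submit invocation above.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome.

    Returns "pass" for exit code 0, "fail" for a nonzero exit code,
    and "hang" if the command does not finish within timeout_s seconds.
    Hypothetical helper for illustration; not a rabit or dmlc-core API.
    """
    try:
        # subprocess.run kills the direct child when the timeout expires.
        # Processes spawned by that child may need separate cleanup.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "pass" if result.returncode == 0 else "fail"


if __name__ == "__main__":
    # Stand-in command; substitute the real test command as a list of arguments.
    print(run_with_timeout([sys.executable, "-c", "pass"], 30))
```

With a wrapper like this, a run that stalls after a worker exits with a nonzero code is reported as a hang rather than blocking the terminal indefinitely.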
Did you update dmlc-core and point to one with the latest patch, dmlc/dmlc-core#512?
But yes, that is something that might need a bit of improvement. Basically, when one worker dies and is not brought back up, everyone else waits and retries indefinitely. One approach we could take:
- workers maintain a heartbeat to the tracker and report status since the last report
- the tracker piggybacks network status and coordinates with the resource management framework (YARN/K8s/Spark) for failed worker recovery
- reactivate the network
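The heartbeat idea in the first bullet could look roughly like the sketch below. This is an illustration under stated assumptions, not how the rabit tracker works: the class HeartbeatMonitor, its methods, and the timeout value are all hypothetical, and the coordination with YARN/K8s/Spark and the network reactivation are left out.

```python
import time


class HeartbeatMonitor:
    """Tracker-side bookkeeping of when each worker last reported.

    Hypothetical sketch; not part of rabit or dmlc-core.
    """

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # worker rank -> timestamp of last heartbeat

    def beat(self, rank, now=None):
        """Record a heartbeat from a worker (now is injectable for testing)."""
        self.last_seen[rank] = time.monotonic() if now is None else now

    def dead_workers(self, now=None):
        """Return the sorted ranks whose last heartbeat is older than timeout_s."""
        now = time.monotonic() if now is None else now
        return sorted(
            rank for rank, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=5.0)
    monitor.beat(0, now=0.0)
    monitor.beat(1, now=4.0)
    # Ranks returned here are the ones a tracker could ask the
    # resource manager to restart instead of waiting forever.
    print(monitor.dead_workers(now=6.0))
```

The point of the design is that the tracker, rather than each surviving worker, decides when a peer is gone, so the remaining workers do not have to retry indefinitely.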
Yes, after updating to the latest dmlc-core this works in base rabit as well; the tests are actually terminating now.
So, for anyone looking for how to run the tests in rabit:
- Clone rabit.
- Clone dmlc-core under the rabit base directory.
- Run make in the rabit/test/ directory.
- Run tests using make -f test.mk <name-of-test>, for example make -f test.mk model_recover_10_10k_die_same
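Since the original compile error came down to dmlc/io.h not being found, it can help to check the directory layout before running make. The sketch below assumes the layout described in the steps above (dmlc-core cloned inside the rabit directory, test.mk inside test/); the function name missing_paths and the REQUIRED list are hypothetical and not part of the rabit build.

```python
from pathlib import Path

# Paths, relative to the rabit checkout, that the steps above rely on.
# Assumed from this thread; adjust if your layout differs.
REQUIRED = (
    "dmlc-core/include/dmlc/io.h",
    "test/test.mk",
)


def missing_paths(rabit_root):
    """Return the required relative paths that are not files under rabit_root."""
    root = Path(rabit_root)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the rabit base directory; an empty list means the layout looks right.
    print(missing_paths("."))
```

An empty result only confirms that the files are where the steps expect them; it does not check that the dmlc-core version is recent enough.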
Adding how to run the unit tests for completeness:
Using @chenqin's trick to build a local copy of Gtest:
cd /path/to/rabit/
wget -nc https://github.com/google/googletest/archive/release-1.7.0.zip
unzip -n release-1.7.0.zip
mv googletest-release-1.7.0 gtest && cd gtest
cmake . && make
mkdir lib && mv libgtest.a lib
cd ..
rm -rf release-1.7.0.zip
Then we can build and run the unit tests.
mkdir build
cd build
GTEST_ROOT=/path/to/rabit/gtest/ cmake -DRABIT_BUILD_TESTS=ON ..
make test
Note that this step requires PR #124