dmlc / rabit

Reliable Allreduce and Broadcast Interface for distributed machine learning

Compiling and running tests?

thvasilo opened this issue

Hello, I thought I'd help out with testing the new features that @chenqin is going to be working on, but I can't get the tests to compile and run properly.

The README in the test folder refers to a keepalive.sh script but that doesn't seem to exist.

I'm able to build the project through CMake. However, running make in the test folder fails for the lazy_recover target with an unexpected error:

g++ -c -Wall -O3 -msse2  -Wno-unknown-pragmas -fPIC -I../include  -std=c++0x -o lazy_recover.o lazy_recover.cc
In file included from src/../include/rabit/internal/engine.h:10:0,
                 from src/engine_mpi.cc:14:
src/../include/rabit/internal/../serializable.h:12:10: fatal error: dmlc/io.h: No such file or directory
 #include "dmlc/io.h"
          ^~~~~~~~~~~
compilation terminated.
Makefile:88: recipe for target 'engine_mpi.o' failed
make[1]: *** [engine_mpi.o] Error 1

This is weird, because the io.h header is indeed under ../include/dmlc.

When I try to run the rest of the tests that do compile, only the ring allreduce test runs to completion; the others raise runtime errors.

I'm wondering if I'm doing something wrong when running these tests. I simply call make -f test.mk <test-name>.

Do you have instructions for running the tests, @chenqin?

I think the MPI engine is still not working, at least on my Mac. I don't see why we need to support an MPI engine within an MPI engine :) Anyway, I submitted a reference fix: c6f6b25

As for testing locally, what I did was run the command below from the xgboost/rabit directory, pointing to my own dmlc-core checkout at xgboost/dmlc-core. Pay attention to the relative directory layout.

../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=10 --local-num-attempt=10 test/model_recover 10000 mock=0,0,1,0 mock=1,1,1,0

Thanks @chenqin, compiling the tests under the xgboost directory is actually fine; the error is gone in that case. It's still weird that it doesn't work in base rabit, but I tested your fix and it seems to compile fine now. I haven't tried that branch under xgboost though. I assume some shenanigans with different dmlc-core versions are the cause: I had cloned the dmlc-core master into rabit to compile.

OK so I tried running such an example and here's my truncated output:

> ../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=10 --local-num-attempt=10 test/model_recover 10000 mock=0,0,1,0

2019-03-14 11:27:21,968 INFO start listen on 10.112.11.162:9091
[0] reload-trail=0, init iter=0
[1] reload-trail=0, init iter=0
[9] reload-trail=0, init iter=0
[8] reload-trail=0, init iter=0
[6] reload-trail=0, init iter=0
[2] reload-trail=0, init iter=0
[7] reload-trail=0, init iter=0
[5] reload-trail=0, init iter=0
[3] reload-trail=0, init iter=0
[4] reload-trail=0, init iter=0
2019-03-14 11:27:22,955 INFO @tracker All of 10 nodes getting started
[0] !!!TestMax pass, iter=0
[3] !!!TestMax pass, iter=0
[0]@@@Hit Mock Error:Broadcast
[5] !!!TestMax pass, iter=0
[1] !!!TestMax pass, iter=0
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/tvas/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/tvas/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Dropbox/SICS/repos/xgboost/dmlc-core/tracker/dmlc_tracker/local.py", line 42, in exec_cmd
    raise RuntimeError('Get nonzero return code=%d on %s %s' % (ret, cmd, env))
RuntimeError: Get nonzero return code=254 on test/model_recover 10000 mock=0,0,1,0 --local-num-attempt=10 {<other-env-vars>....  '_': '../dmlc-core/tracker/dmlc-submit', 'DMLC_NUM_WORKER': '10', 'DMLC_NUM_SERVER': '0', 'DMLC_TRACKER_URI': '10.112.11.162', 'DMLC_TRACKER_PORT': '9091', 'DMLC_TASK_ID': '0', 'DMLC_ROLE': 'worker', 'DMLC_JOB_CLUSTER': 'local'}
[9] !!!TestMax pass, iter=0
[2] !!!TestMax pass, iter=0
[6] !!!TestMax pass, iter=0
[8] !!!TestMax pass, iter=0
[4] !!!TestMax pass, iter=0
[7] !!!TestMax pass, iter=0

Then the process just hangs. Is this expected behavior? I'm not sure how to determine whether a test run was successful.

Did you update dmlc-core and point to one with the latest patch, dmlc/dmlc-core#512?
But yes, that is something that might need a bit of improvement. Basically, when one worker dies and is not brought back up, everyone else waits and retries indefinitely. One approach we could take is:

  1. Workers maintain a heartbeat to the tracker and report their status since the last report.
  2. The tracker piggybacks network status and coordinates with the resource management framework (YARN/K8s/Spark) for failed worker recovery.
  3. Reactivate the network.
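
To make the first step of that proposal concrete, here is a minimal sketch of tracker-side heartbeat bookkeeping. It is only an illustration of the idea, not code from rabit or dmlc-core: the `HeartbeatMonitor` class, its `report` and `dead_workers` methods, and the `timeout_s` parameter are all hypothetical names, and the network transport and the hand-off to a resource manager are left out.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Hypothetical tracker-side record of when each worker last reported."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s      # silence longer than this counts as dead
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen: Dict[int, float] = {}
        self.status: Dict[int, str] = {}

    def report(self, rank: int, status: str = "ok") -> None:
        """Record a heartbeat (and the status it carries) from one worker."""
        self.last_seen[rank] = self.clock()
        self.status[rank] = status

    def dead_workers(self) -> List[int]:
        """Return the ranks whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(rank for rank, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


if __name__ == "__main__":
    fake_now = [0.0]                    # fake clock so the demo is deterministic
    monitor = HeartbeatMonitor(timeout_s=5.0, clock=lambda: fake_now[0])
    for rank in range(3):
        monitor.report(rank)            # all three workers report at t=0
    fake_now[0] = 4.0
    monitor.report(1)
    monitor.report(2)                   # worker 0 stays silent
    fake_now[0] = 8.0
    print(monitor.dead_workers())       # [0]
```

With something like this in place, the tracker could pass the list of silent ranks to whatever restarts workers, instead of letting the remaining workers wait indefinitely.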

Yes, after updating to the latest dmlc-core this works in base rabit as well; the tests are actually terminating now.

So, for anyone looking for how to run the tests in rabit:

  1. Clone rabit.
  2. Clone dmlc-core under the rabit base directory.
  3. Run make in the rabit/test/ directory.
  4. Run tests using make -f test.mk <name-of-test>, for example make -f test.mk model_recover_10_10k_die_same.

Adding how to run the unit tests for completeness:

Using @chenqin's trick to build a local copy of Gtest:

cd /path/to/rabit/
wget -nc https://github.com/google/googletest/archive/release-1.7.0.zip
unzip -n release-1.7.0.zip
mv googletest-release-1.7.0 gtest && cd gtest
cmake . && make
mkdir lib && mv libgtest.a lib
cd ..
rm -rf release-1.7.0.zip

Then we can build and run the unit tests.

mkdir build
cd build
GTEST_ROOT=/path/to/rabit/gtest/ cmake -DRABIT_BUILD_TESTS=ON ..
make test

Note that this step requires PR #124