PatWie / cluster-smi

nvidia-smi but for an entire GPU cluster

Cannot compile

thangvubk opened this issue · comments

Thank you for your excellent work. However, when I compile with Go, I get an error. I am totally new to Go; can you show me what the problem is?
image

> I am totally new to Go

Me too, so take the answer below as coming from a fellow beginner:

You need Go newer than v1.5, as the repository relies on the vendor directory (enabled by default since Go 1.6).
Go also expects a single workspace (GOPATH) for all projects. Mine is:

# in .bashrc
export GOROOT=/home/patwie/libs/go # contains "api, bin, doc, lib, misc, pkg, robots.txt, ..."
export GOPATH=/home/patwie/gocode # see below

My structure looks like:

/home
  /patwie
    /gocode
      /src
        /github.com
          /patwie
            /cluster-smi
        /golang.org  # not needed
        /gopkg.in    # not needed

Did you git clone the entire repository like this?

mkdir -p ${GOPATH}/src/github.com/patwie
cd ${GOPATH}/src/github.com/patwie
git clone https://github.com/PatWie/cluster-smi.git
cd cluster-smi
cp cluster-smi.example.env cluster-smi.env
make

You need to clone this repo into exactly one of these paths:

/usr/local/go/src/github.com/patwie/cluster-smi
# or
/root/work/src/github.com/patwie/cluster-smi
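A quick way to check the location before running make is to compare the current directory with the path Go expects. This is only a minimal sketch: the helper name expected_repo_dir is made up for illustration, it assumes GOPATH holds a single directory, and it falls back to $HOME/go (Go's default workspace in newer releases) when GOPATH is unset.

```shell
# Hypothetical helper: print where Go expects the checkout for a given GOPATH.
# A single trailing slash on the argument is dropped.
expected_repo_dir() {
  printf '%s/src/github.com/patwie/cluster-smi\n' "${1%/}"
}

# Compare the expected location with the current directory.
want="$(expected_repo_dir "${GOPATH:-$HOME/go}")"
if [ "$(pwd)" = "$want" ]; then
  echo "OK: repository is inside the Go workspace"
else
  echo "Expected to be in: $want"
fi
```

If the second message appears, moving the checkout to the printed path (or adjusting GOPATH) should bring the layout in line with the structure shown above.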

Thank you for your quick response. That problem is gone. I have just left my lab, so I tested on another computer without an Nvidia GPU. Now there is a problem with the nvml.h header, which I think will be resolved on a computer with the Nvidia driver.

I tested on the GPU server, but I got the same errors (libzmq and nvml.h) :(
image

See my other project:
https://github.com/PatWie/tf_zmq

which basically says:

# compile ZMQ library for c++
cd /path/to/your_lib_folder
git clone https://github.com/zeromq/libzmq
cd libzmq
./autogen.sh
./configure --prefix=/path/to/your_lib_folder/libzmq/dist
make
make install

and add this to your .bashrc:

export PKG_CONFIG_PATH=/path/to/your_lib_folder/libzmq/dist/lib/pkgconfig/:$PKG_CONFIG_PATH
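To see whether the PKG_CONFIG_PATH change took effect, one can ask pkg-config directly. The following is a sketch rather than part of the project's build; the function name check_libzmq is hypothetical, and it only reports what pkg-config resolves in the current shell.

```shell
# Hypothetical helper: report which libzmq version pkg-config resolves, if any.
check_libzmq() {
  if ! command -v pkg-config >/dev/null 2>&1; then
    echo "pkg-config not installed"
  elif pkg-config --exists libzmq; then
    echo "libzmq $(pkg-config --modversion libzmq)"
  else
    echo "libzmq not found; check PKG_CONFIG_PATH"
  fi
}

check_libzmq
```

Remember to open a new shell (or source your .bashrc) first, otherwise the exported variable is not visible yet.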

Please let me know if this helps, so this can be documented somewhere in the readme here.

Edit: cluster-smi-node only works on GPU machines, so you need the CUDA toolkit. You might need the CUDA_INSTALL_PATH environment variable as well.
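To find out where nvml.h lives on a given machine, a small search below the CUDA root can help. This is a sketch under assumptions: the helper name find_nvml_header is made up, and /usr/local/cuda is used as a fallback only because it is a common install location, not because the project requires it.

```shell
# Hypothetical helper: list nvml.h files below a CUDA root.
# -L follows symlinks, since the CUDA root is often a symlink itself.
find_nvml_header() {
  find -L "$1" -name nvml.h 2>/dev/null || true
}

# Use CUDA_INSTALL_PATH if set, otherwise an assumed default location.
cuda_root="${CUDA_INSTALL_PATH:-/usr/local/cuda}"
headers="$(find_nvml_header "$cuda_root")"
if [ -n "$headers" ]; then
  echo "$headers"
else
  echo "no nvml.h below $cuda_root"
fi
```

If nothing is found, the toolkit may be installed elsewhere, or the header may ship in a separate package on your distribution.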

Thank you for your reply. I got one step further. I followed your instructions and the libzmq problem is gone. The remaining problem is nvml.h:

image

I have the cuda-toolkit installed in /usr/local/cuda, and I see that it is different from your nvml.o:

image

What should I do?

A little bit further :). I modified the link in your nvml.go and it compiles successfully. But when I run smi-server, there is a conflict with the ZeroMQ lib.

image

See the readme in https://github.com/PatWie/cluster-smi/tree/master/vendor/github.com/pebbe/zmq4

go get github.com/pebbe/zmq4
If you need support for ZeroMQ 4.2 DRAFT, checkout the branch draft4.2.

Why not download the tarball with the correct version from
http://download.zeromq.org/

Unfortunately, ZMQ can only be dynamically linked to the go-app.

So you mean I need another version of Go?

I also have questions.

  1. Do both the server and the nodes need the cuda-toolkit?
  2. Can I execute cluster-smi on the nodes, or just on the server?

Thank you!

You need another version of ZMQ.

  1. Only cluster-smi-node should need the CUDA toolkit. The others should run and compile on machines without CUDA.

  2. Is the image in the readme so bad? You can place all these apps on completely different machines, as long as they can communicate. You should be able to call cluster-smi even from a different network if the firewall allows it. But you should compile all apps once on a machine that supports CUDA, as this is easier.

Here the setup is: one dumb machine runs cluster-smi-server with the ports open. Several different machines with GPUs run cluster-smi-node. And cluster-smi can run anywhere.

Feel free to update the readme if this is presented confusingly there.

Aha. Your figure is pretty cool, but the server looks like a modem, which should be changed. In my opinion, when the binaries run on multiple machines, it is important to clarify where to build the code and where to run the binaries. Previously, I thought that I had to build the code on every node and run the client and server respectively.

For the zmq lib, I found on GitHub that the latest version is 4.2.3, but the error says the installed version is 4.2.4. I also have a question: will the zmq4.go in the /vendor directory automatically use the latest version of zmq?

Could you please send me your binaries at thangvubk@gmail.com? I think I could use your binaries in case I cannot build the code :((

Try godep update <name> to update the package in the vendor directory. I cannot provide precompiled binaries, as they include specific settings:

# ip of cluster-smi-server
cluster_smi_server_ip="127.0.0.1"
# port of cluster-smi-server, which nodes send to
cluster_smi_server_port_gather="9080"
# port of cluster-smi-server, where clients subscribe to
cluster_smi_server_port_distribute="9081"
# tick for receiving data in milliseconds
cluster_smi_tick_ms="1000"
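To make the role of the two ports concrete, the settings can be loaded in a shell and combined into the endpoints that nodes and clients would use. This is only an illustration: the tcp:// scheme and the variable names gather_endpoint and distribute_endpoint are assumptions, and how the build actually consumes cluster-smi.env is not shown here.

```shell
# Write the example settings to a temporary file so the sketch is self-contained.
env_file="$(mktemp)"
cat > "$env_file" <<'EOF'
cluster_smi_server_ip="127.0.0.1"
cluster_smi_server_port_gather="9080"
cluster_smi_server_port_distribute="9081"
cluster_smi_tick_ms="1000"
EOF

# Load the settings into the current shell.
. "$env_file"

# Build the two endpoints (the tcp:// scheme is assumed for illustration).
gather_endpoint="tcp://${cluster_smi_server_ip}:${cluster_smi_server_port_gather}"
distribute_endpoint="tcp://${cluster_smi_server_ip}:${cluster_smi_server_port_distribute}"

echo "nodes send to:        $gather_endpoint"
echo "clients subscribe to: $distribute_endpoint"
rm -f "$env_file"
```

For a real cluster, the IP would be the address of the machine running cluster-smi-server, and both ports have to be reachable from the nodes and clients.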

Further, this would not help in your case: ZMQ is dynamically linked, so a binary from me would not solve version mismatches. I will consider trying to update libzmq here.

Yes. I think it would be nice if you checked the compilation in your code. :) Thank you

Finally, it works beautifully. I downloaded version 4.1.4 of zeromq and compiled it. I really appreciate your support. Thank you very much and have a nice day :D

You are welcome!

Since you are able to compile this project, cluster-top might be interesting for you as well.

Wow. This is great. Thank you very much :)

It would be perfect if you could add the processes using the GPU to cluster-smi (as in nvidia-smi)
image

I cannot imagine how to display this information. For multiple machines this list would be long. Any suggestions #5?

Hi @PatWie ,

It seems I successfully compiled cluster-smi. However, when I want to launch cluster-smi-node or cluster-smi-router, I get the following error:

./cluster-smi-node: error while loading shared libraries: libzmq.so.4: cannot open shared object file: No such file or directory

I followed all steps in the README, including the compilation of zmq (from http://files.patwie.com/mirror/zeromq-4.1.0-rc1.tar.gz)

Just so you know: I have no experience at all with Go, so I might have missed something that is totally obvious to you.

Thanks for your help,
-Ivan

Seems like the only missing step is adding the path to the libzmq.so directory to the LD_LIBRARY_PATH environment variable.
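As a sketch of that step: the snippet below prepends the libzmq lib directory to LD_LIBRARY_PATH without adding it twice. The helper name prepend_ld_path is hypothetical, and the path is the placeholder install prefix used earlier in this thread, so replace it with your own.

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH unless already listed.
prepend_ld_path() {
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$1:"*) ;;  # already present, nothing to do
    *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
  export LD_LIBRARY_PATH
}

# Placeholder path: the lib directory below your libzmq install prefix.
prepend_ld_path /path/to/your_lib_folder/libzmq/dist/lib
echo "$LD_LIBRARY_PATH"
```

Afterwards, running ldd ./cluster-smi-node and looking at the libzmq line should show a resolved path instead of "not found". Putting the call into your .bashrc makes it persistent.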

That was exactly the missing step. Thank you:)
I will add this step to the README.md and create a new PR