adaptivecomputing / torque

Torque Repository

No GPU activity when the job is running

patrick-douglas opened this issue · comments

First, I'm not using cgroups (are they required?).
My issue is as follows.
I'm using Torque version 5.1.3, because later versions failed to install on Linux Mint.
I configured it with the following commands:
#me@root:

cd torque-5.1.3-1462984387_205d70d
./configure --with-debug --enable-nvidia-gpus --with-sendmail
make
make install

cp contrib/init.d/debian.pbs_server /etc/init.d/pbs_server
cp contrib/init.d/debian.pbs_sched /etc/init.d/pbs_sched
cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd

sysv-rc-conf pbs_server on
sysv-rc-conf trqauthd on
sysv-rc-conf pbs_sched on

echo '/usr/local/lib'>/etc/ld.so.conf.d/torque.conf
ldconfig

service trqauthd restart
echo '/usr/local/lib'>/etc/ld.so.conf.d/torque.conf
echo "master.lbn.com">/var/spool/torque/server_name
./torque.setup root
echo "node01.lbn.com np=12 gpus=1" > /var/spool/torque/server_priv/nodes

service trqauthd restart
service pbs_server restart
service pbs_sched start

qmgr -c 'set server auto_node_np = True'

make packages
#Then I ssh into node01.lbn.com and run the following:
#root@node02
apt-get update
apt-get install g++ libssl-dev libxml2-dev sysv-rc-conf libboost-all-dev -y
cd torque-5.1.3-1462984387_205d70d
./configure --with-debug --enable-nvidia-gpus
make -j 2
make install -j 2

./torque-package-clients-linux-x86_64.sh --install
./torque-package-devel-linux-x86_64.sh --install
./torque-package-doc-linux-x86_64.sh --install
./torque-package-mom-linux-x86_64.sh --install
./torque-package-server-linux-x86_64.sh --install

echo '/usr/local/lib'>/etc/ld.so.conf.d/torque.conf
ldconfig

cp contrib/init.d/debian.pbs_mom /etc/init.d/pbs_mom
cp contrib/init.d/debian.trqauthd /etc/init.d/trqauthd

sysv-rc-conf trqauthd on
sysv-rc-conf pbs_mom on

service trqauthd restart

echo '$pbsserver master'>/var/spool/torque/mom_priv/config
echo '$logevent 225'>>/var/spool/torque/mom_priv/config
echo '$usercp *:/home /home'>>/var/spool/torque/mom_priv/config

service pbs_mom start

After running this, I run the "pbsnodes" command and node01 looks fine: I can see all the GPU information. However, when I submit a job, nvidia-smi shows the GPU status changing to Exclusive-process and the GPU activity stays at 0%, even though the task is still running (I can see it in "top").
My GPU is an NVIDIA Tesla K40c, but I have already tested with a GeForce GT 430 without success.
NOTE: I'm running CUDA 8.0 and the latest NVIDIA drivers.
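
One way to check whether a job is actually handed the GPU is a small job script like the sketch below. It assumes the job was submitted with a GPU request (for example -l nodes=1:ppn=1:gpus=1) and that Torque sets PBS_GPUFILE to a file listing one "<host>-gpu<index>" entry per line. The helper name gpu_ids_from_file is hypothetical, and exporting CUDA_VISIBLE_DEVICES from that list is an assumption about what the application needs, not something the setup above is known to do.

```shell
#!/bin/sh
# Hypothetical helper: read a Torque GPU file (one "<host>-gpu<index>"
# entry per line) and print the indices as a comma-separated list.
gpu_ids_from_file() {
    sed -n 's/.*-gpu\([0-9][0-9]*\)$/\1/p' "$1" | paste -sd, -
}

# Inside a job, PBS_GPUFILE is expected to point at the GPUs assigned to it.
if [ -n "${PBS_GPUFILE:-}" ] && [ -r "$PBS_GPUFILE" ]; then
    CUDA_VISIBLE_DEVICES=$(gpu_ids_from_file "$PBS_GPUFILE")
    export CUDA_VISIBLE_DEVICES
    echo "assigned GPUs: $CUDA_VISIBLE_DEVICES"
else
    echo "PBS_GPUFILE is not set or not readable; no GPU seems to be assigned to this job"
fi
```

If the script reports that PBS_GPUFILE is not set, the job was probably submitted without a GPU request, which would be consistent with a process that shows up in "top" while the GPU stays idle.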
Please help me!
Thank you in advance