NERSC / slurm-ray-cluster

Not Getting Past ray.init

cupdike opened this issue

Hi, I modified submit-ray-cluster.sbatch to run on a single node using a container. When it runs, everything appears to start up fine, but then the head, the worker, and the Python MNIST script are all cancelled.
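
Roughly, the single-node change amounts to the following (a sketch reconstructed from the srun trace further down; the container name, mount path, and example script come from that trace, so the actual script may differ in detail, and CODE is just shorthand here for the mount path):

# Single node: reuse the one allocated hostname for both the head and the worker
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=(${nodes[0]} ${nodes[0]})
node_1=${nodes_array[0]}

ip=$(srun --gres=gpu:1 --nodes=1 --ntasks=1 -w "$node_1" hostname --ip-address)
export ip_head="$ip:6379"
export redis_password=$(uuidgen)

# Everything runs inside the same pyxis container on that node
CODE=/home/updikca1/slurm/rayDocker/slurm-ray-cluster   # mounted as /code

srun -u -l --job-name=RayHead --gres=gpu:1 --nodes=1 --ntasks=1 -w "$node_1" \
     --container-mounts=$CODE:/code --container-name=ray-torch \
     /code/start-head.sh "$ip" "$redis_password" &
sleep 10

srun -u -l --job-name=RayWorker1 --gres=gpu:1 --nodes=1 --ntasks=1 -w "$node_1" \
     --container-mounts=$CODE:/code --container-name=ray-torch \
     /code/start-worker.sh "$ip_head" "$redis_password" &
sleep 20

srun -u -l --job-name=AiPythonScript --gpus-per-task=0 --nodes=1 --ntasks=1 \
     --container-mounts=$CODE:/code --container-name=ray-torch \
     python -u /code/examples/mnist_pytorch_trainable.py --ray-address "$ip_head"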

I can tell from some logging statements I added that the MNIST code never gets past ray.init, but I can't find any way to learn more about what's going on.
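
For reference, the connection step in the example script is essentially the stock Ray call. A minimal sketch of what runs up to the failure point (the print statements correspond to the trace output below; reading the password from the redis_password environment variable is my assumption about how the script picks it up):

import argparse
import os

import ray

print("inside mnist_pytorch_trainable.py")

if __name__ == "__main__":
    print("inside __main__")
    parser = argparse.ArgumentParser()
    parser.add_argument("--ray-address", type=str)
    args = parser.parse_args()

    # Assumption: the password exported by the sbatch script arrives via the environment
    redis_password = os.environ.get("redis_password")
    print(args.ray_address, redis_password)

    # Last thing that happens: Ray logs "Connecting to existing Ray cluster ..."
    # and then the task aborts; nothing after this call is ever reached.
    ray.init(address=args.ray_address, _redis_password=redis_password)
    print("past ray.init")  # never printed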

Any suggestions on what to try next? The sacct output for the job and the full trace from the batch script are below.

$ sacct -j 1023
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1023         submit-ra+      batch                    24     FAILED      6:0 
1023.batch        batch                               24     FAILED      6:0 
1023.extern      extern                               24  COMPLETED      0:0 
1023.0         hostname                                8  COMPLETED      0:0 
1023.1          RayHead                                8  CANCELLED     0:15 
1023.2       RayWorker1                                8  CANCELLED     0:15 
1023.3       AiPythonS+                                8  CANCELLED      0:6 
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output (/usr/share/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ set -o pipefail
+ export NCCL_BLOCKING_WAIT=1
+ NCCL_BLOCKING_WAIT=1
++ uuidgen
+ redis_password=adc28f91-2b23-4eb8-851e-33d41694d40c
+ export redis_password
++ scontrol show hostnames compute1
+ nodes=compute1
+ nodes_array=(${nodes[0]} ${nodes[0]})
+ echo NODES ARRAY: compute1 compute1
NODES ARRAY: compute1 compute1
+ node_1=compute1
++ srun --gres=gpu:1 --nodes=1 --ntasks=1 -w compute1 hostname --ip-address
+ ip=10.111.245.102
+ port=6379
+ ip_head=10.111.245.102:6379
+ export ip_head
+ echo 'IP Head: 10.111.245.102:6379'
IP Head: 10.111.245.102:6379
+ echo 'STARTING HEAD at compute1'
STARTING HEAD at compute1
+ sleep 10
+ srun -u -l --job-name=RayHead --gres=gpu:1 --nodes=1 --ntasks=1 -w compute1 --container-mounts=/home/updikca1/slurm/rayDocker/slurm-ray-cluster:/code --container-name=ray-torch /code/start-head.sh 10.111.245.102 adc28f91-2b23-4eb8-851e-33d41694d40c
0: pyxis: reusing existing container filesystem
0: pyxis: starting container ...
0: starting ray head node
0: 2021-07-22 18:41:20,012	INFO scripts.py:560 -- Local node IP: 10.111.245.102
0: 2021-07-22 18:41:20,041	WARNING utils.py:510 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
0: 2021-07-22 18:41:21,712	INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
0: 2021-07-22 18:41:22,730	SUCC scripts.py:592 -- --------------------
0: 2021-07-22 18:41:22,730	SUCC scripts.py:593 -- Ray runtime started.
0: 2021-07-22 18:41:22,730	SUCC scripts.py:594 -- --------------------
0: 2021-07-22 18:41:22,730	INFO scripts.py:596 -- Next steps
0: 2021-07-22 18:41:22,730	INFO scripts.py:597 -- To connect to this Ray runtime from another node, run
0: 2021-07-22 18:41:22,730	INFO scripts.py:601 --   ray start --address='10.111.245.102:6379' --redis-password='adc28f91-2b23-4eb8-851e-33d41694d40c'
0: 2021-07-22 18:41:22,730	INFO scripts.py:606 -- Alternatively, use the following Python code:
0: 2021-07-22 18:41:22,731	INFO scripts.py:609 -- import ray
0: 2021-07-22 18:41:22,731	INFO scripts.py:610 -- ray.init(address='auto', _redis_password='adc28f91-2b23-4eb8-851e-33d41694d40c')
0: 2021-07-22 18:41:22,731	INFO scripts.py:618 -- If connection fails, check your firewall settings and network configuration.
0: 2021-07-22 18:41:22,731	INFO scripts.py:623 -- To terminate the Ray runtime, run
0: 2021-07-22 18:41:22,731	INFO scripts.py:624 --   ray stop
0: 
+ worker_num=1
+ (( i=1 ))
+ (( i<=1 ))
+ node_i=compute1
+ echo 'STARTING WORKER 1 at compute1'
STARTING WORKER 1 at compute1
+ sleep 5
+ srun -u -l --job-name=RayWorker1 --gres=gpu:1 --nodes=1 --ntasks=1 -w compute1 --container-mounts=/home/updikca1/slurm/rayDocker/slurm-ray-cluster:/code --container-name=ray-torch /code/start-worker.sh 10.111.245.102:6379 adc28f91-2b23-4eb8-851e-33d41694d40c
0: pyxis: reusing existing container filesystem
0: pyxis: starting container ...
0: starting ray worker node
0: 2021-07-22 18:41:29,106	INFO scripts.py:670 -- Local node IP: 10.111.245.102
0: 2021-07-22 18:41:29,109	WARNING utils.py:510 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
0: 2021-07-22 18:41:29,121	SUCC scripts.py:683 -- --------------------
0: 2021-07-22 18:41:29,121	SUCC scripts.py:684 -- Ray runtime started.
0: 2021-07-22 18:41:29,122	SUCC scripts.py:685 -- --------------------
0: 2021-07-22 18:41:29,122	INFO scripts.py:687 -- To terminate the Ray runtime, run
0: 2021-07-22 18:41:29,122	INFO scripts.py:688 --   ray stop
0: 
+ (( i++  ))
+ (( i<=1 ))
+ sleep 20
+ HOROVOD_LOG_LEVEL=debug
+ srun -u -l --job-name=AiPythonScript -u --gpus-per-task=0 --nodes=1 --ntasks=1 --container-mounts=/home/updikca1/slurm/rayDocker/slurm-ray-cluster:/code --container-name=ray-torch python -u /code/examples/mnist_pytorch_trainable.py --ray-address 10.111.245.102:6379
0: pyxis: reusing existing container filesystem
0: pyxis: starting container ...
0: inside mnist_pytorch_trainable.py
0: inside __main__
0: 10.111.245.102:6379 adc28f91-2b23-4eb8-851e-33d41694d40c
0: 2021-07-22 18:41:54,938	INFO worker.py:735 -- Connecting to existing Ray cluster at address: 10.111.245.102:6379
srun: error: compute1: task 0: Aborted