bytedance / byteps

A high performance and generic framework for distributed DNN training

"unsupported van type: 1" error when launching RDMA

Ruinhuang opened this issue

I tried to launch RDMA with PyTorch.
This is my command:
```shell
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=10.0.0.100
export DMLC_PS_ROOT_PORT=9000
bpslaunch
```

This is the error output:
```
BytePS launching scheduler
Command: python3 -c 'import byteps.server'

[08:35:53] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[08:35:53] src/postoffice.cc:25: Creating Van: 1
[08:35:53] 3rdparty/ps-lite/include/dmlc/logging.h:276: [08:35:53] src/van.cc:97: unsupported van type: 1

Stack trace returned 10 entries:
[bt] (0) /opt/conda/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x28a2b) [0x7f1e9798aa2b]
[bt] (1) /opt/conda/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x28d31) [0x7f1e9798ad31]
[bt] (2) /opt/conda/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x534d8) [0x7f1e979b54d8]
[bt] (3) /opt/conda/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x49c23) [0x7f1e979abc23]
[bt] (4) /opt/conda/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x4e584) [0x7f1e979b0584]
[bt] (5) /opt/conda/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps_server+0xdaa) [0x7f1e979872ba]
[bt] (6) /opt/conda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f1e97ab59dd]
[bt] (7) /opt/conda/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f1e97ab5067]
[bt] (8) /opt/conda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x10da8) [0x7f1e97acbda8]
[bt] (9) /opt/conda/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x1108c) [0x7f1e97acc08c]
```

This is the `ibv_devinfo` output:
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.27.2008
node_guid: b859:9f03:001b:a952
sys_image_guid: b859:9f03:001b:a952
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000010
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 18
port_lid: 15
port_lmc: 0x00
link_layer: InfiniBand

Did you check out the latest version of ps-lite? This seems to be a bug introduced by https://github.com/bytedance/ps-lite/blob/6ecbd23c67e2c6a401df4de7c11a72572f3e8a3a/src/postoffice.cc#L19.

I think the fastest way to solve your problem is to set `export DMLC_ENABLE_RDMA=rdma`.

I built BytePS from source; the version is 0.2.5 and the ps-lite version is 28330e.
I set `export DMLC_ENABLE_RDMA=rdma`,
but it still shows the error:
`src/van.cc:97: unsupported van type: rdma`

Can you make sure that the RDMA-related libraries are installed properly? A quick way to verify is `cd byteps/3rdparty/ps-lite; make -j USE_RDMA=1`.
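
As a rough pre-check before rebuilding, one can look for the RDMA userspace libraries in the dynamic linker cache. The sketch below assumes that the relevant libraries are `libibverbs` and `librdmacm` (an assumption, not stated in this thread) and that `ldconfig -p` is available. `missing_libs` is a hypothetical helper name, not part of BytePS or ps-lite.

```shell
# Hypothetical helper: reads `ldconfig -p` output on stdin and prints
# every library name given as an argument that is NOT listed there.
missing_libs() {
    listing="$(cat)"
    for lib in "$@"; do
        # A library counts as present if "<name>.so" appears in the listing.
        printf '%s\n' "$listing" | grep -q "${lib}\.so" || printf '%s\n' "$lib"
    done
}

# Usage (library names are an assumption):
#   ldconfig -p | missing_libs libibverbs librdmacm
# Empty output means both names were found in the linker cache.
```

This only checks for runtime libraries. A successful `make -j USE_RDMA=1` also needs the development headers, so the build itself remains the more reliable test.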

This issue was caused by the value of `DMLC_PS_ROOT_URI`: it has to be the ib0 IP, not the node IP.
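
A minimal sketch of reading the ib0 address from the system rather than typing it by hand, assuming the `ip` tool from iproute2 is available on the scheduler host. `first_ipv4` is a hypothetical helper, not part of BytePS.

```shell
# Hypothetical helper: reads `ip -4 -o addr show dev <iface>` output on stdin
# and prints the first IPv4 address without its prefix length.
first_ipv4() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "inet") {          # the address follows the "inet" keyword
                split($(i + 1), a, "/")  # drop the "/24"-style prefix length
                print a[1]
                exit
            }
        }
    }'
}

# Usage on the scheduler host (ib0 is the interface name used in this thread):
#   export DMLC_INTERFACE=ib0
#   export DMLC_PS_ROOT_URI="$(ip -4 -o addr show dev ib0 | first_ipv4)"
#   echo "$DMLC_PS_ROOT_URI"   # check that this is the ib0 address before launching
```

Workers and servers need the same `DMLC_PS_ROOT_URI` value as the scheduler, so on those hosts the address has to be copied over rather than derived locally.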