TRI-ML / dd3d

Official PyTorch implementation of DD3D: Is Pseudo-Lidar needed for Monocular 3D Object detection? (ICCV 2021), Dennis Park*, Rares Ambrus*, Vitor Guizilini, Jie Li, and Adrien Gaidon.

Multi-node training

EphChem opened this issue

Hi there,
Thank you so much for this release!
While trying to run multi-node training, I noticed that this repo appears to be equipped for it, based on the following lines:

# Multi-node training often fails with "received 0 items of ancdata" error.

dd3d/Makefile, line 42 in da25b61:

-H ${MPI_HOSTS} \

Have you trained on multiple nodes (not just multiple GPUs), where you have to provide two different IP addresses from within the Docker containers provided in this repo? If so, did it work for you? When I launch training on two different machines, the code hangs and I don't see any terminal output.

Thank you in advance!
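
For anyone hitting a similar silent hang, one thing worth ruling out first is whether the containers can reach each other at all. Below is a minimal sketch of such a check, not something taken from this repo. The helper names `can_connect` and `check_hosts` are hypothetical, and the port to test is an assumption: it depends on how the launcher reaches the other node (often SSH on port 22, but the containers may be configured differently).

```python
import socket
from typing import Dict, List


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_hosts(hosts: List[str], port: int, timeout: float = 3.0) -> Dict[str, bool]:
    """Map each host to whether a TCP connection on `port` succeeded."""
    return {host: can_connect(host, port, timeout) for host in hosts}


# Example call (hypothetical addresses; the port is an assumption):
#   check_hosts(["10.0.0.1", "10.0.0.2"], port=22)
```

Running this from inside each container against the other machine's address shows whether the hang could be a plain networking problem. A successful connection on one port does not guarantee that the ports used during training are open as well.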