Line 351 of /stitching_resnet_swim/train.py: "--local_rank" argument error

Question

Xinrt opened this issue a year ago · comments

env:
pytorch 2.0.0
pytorch-cuda 11.7
python 3.10.10

As the title says: should the argument parsed in train.py change from "--local_rank" to "--local-rank"?
With "--local_rank", I get the following error:
train.py: error: unrecognized arguments: --local-rank=5 (see the attachment for the complete error output)
local_rank_error.txt

Zizheng Pan · Answer 1 · Tue Mar 28 2023 07:51:48 GMT+0800 (China Standard Time)

Hi @Xinrt, can you please let me know your running commands? I can't see any errors on my local machine with the instructions on the readme.

Xinran Tang · Answer 2 · Tue Mar 28 2023 07:59:20 GMT+0800 (China Standard Time)

Hello,
I used the command for stitching a ResNet-18 with a ResNet-50 on ImageNet with 8 GPUs:
./distributed_train.sh 8 \
[path/to/imagenet] \
-b 128 \
--stitch_config configs/resnet18_resnet50.json \
--sched cosine \
--epochs 30 \
--lr 0.05 \
--amp --remode pixel \
--reprob 0.6 \
--aa rand-m9-mstd0.5-inc1 \
--resplit --split-bn -j 10 --dist-bn reduce
and replaced [path/to/imagenet] with my own path.

Zizheng Pan · Answer 3 · Tue Mar 28 2023 08:14:15 GMT+0800 (China Standard Time)

Hi @Xinrt, I found the issue. It seems you are using the latest PyTorch 2.0, and the old launch API from previous versions appears to be deprecated:

...conda_envs/torch121/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

For now, you can fix this issue by downgrading your PyTorch version, for example to PyTorch 1.12 + CUDA 11.3, which should work for all the code available in this repo. I will try to find some time to make this repo compatible with the latest version of PyTorch.
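
For readers who cannot downgrade, here is a minimal sketch of the change the warning above suggests. It assumes train.py builds its arguments with argparse; `build_parser` and `parse_local_rank` are hypothetical names used only for illustration and are not taken from this repo. The option is registered under both spellings, and the default falls back to the `LOCAL_RANK` environment variable mentioned in the warning.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the parser in train.py (an assumption,
    # not the repo's actual code).
    parser = argparse.ArgumentParser(description="local rank parsing sketch")
    parser.add_argument(
        "--local_rank", "--local-rank",  # accept both spellings
        dest="local_rank",
        type=int,
        # Fall back to the environment variable named in the warning,
        # or 0 when it is not set.
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser


def parse_local_rank(argv=None) -> int:
    # parse_known_args ignores the other training flags in this sketch.
    args, _unknown = build_parser().parse_known_args(argv)
    return args.local_rank
```

With this in place, both `--local-rank=5` and `--local_rank 5` resolve to the same `local_rank` value, and a launcher that passes no flag at all is covered by the environment variable.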

Xinran Tang · Answer 4 · Tue Mar 28 2023 08:17:44 GMT+0800 (China Standard Time)

Got it, thank you so much! I will try to use PyTorch 1.12

xiangtianheng · Answer 5 · Tue Mar 28 2023 09:36:32 GMT+0800 (China Standard Time)

Hi @HubHop, I am working with @Xinrt.
Do you use conda for environment setup? Would you mind sharing the conda env file for this project?
I read requirements.txt and the Requirements section of the README in /stitching_resnet_swim, but several dependencies still seemed to be missing when we ran your code.

Zizheng Pan · Answer 6 · Tue Mar 28 2023 10:08:55 GMT+0800 (China Standard Time)

Hi @xiangtianheng, preparing the Python env for this project is pretty easy. I just updated the README, where you can find how to create a conda env for your experiments. I have tested this env and it can run all the code in this repo.

xiangtianheng · Answer 7 · Tue Mar 28 2023 13:04:16 GMT+0800 (China Standard Time)

Thanks for the info! I am able to run the code now with the dependency provided! Feel free to close this issue.