junsukchoe / ADL

Attention-based Dropout Layer for Weakly Supervised Object Localization (CVPR 2019 Oral)


pytorch code is not clear

GuoleiSun opened this issue · comments

Hi,

Great paper, and thanks for providing the PyTorch code. I tried it, but it is written for a distributed setup. Could you clean it up a little so that it can easily run on any system with a few GPUs? Also, there are some errors in the code; could you correct them?

Thanks

Hi Guolei,

We are sorry for the inconvenience. A mistake was introduced when I cleaned up the code. We have revised the code to correct the errors; if any remain, please let us know.

Although our code uses the distributed-training modules, you can also run it on a single machine. We ourselves usually run it on a single PC with 1 or 2 GPU(s).

Thanks

Hi,
I tried to use one GPU and set args.multiprocessing_distributed to False, but I got the following error.
I use Python 3.7 (>3.3), and my PyTorch version is:
--> python -c "import torch; print(torch.__version__)"
--> 1.1.0

File "train.py", line 354, in train
for name, module in model.module.named_modules():
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in getattr
type(self).name, name))
AttributeError: 'VGG' object has no attribute 'module'

Hi,

I solved the problem by removing .module from model.module.named_modules().
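For reference, a small guard like the one below works whether or not the model is wrapped in DataParallel / DistributedDataParallel, so the call site does not have to change. This is only a sketch of the workaround, not code from the repository:

import torch.nn as nn

def unwrap(model):
    # (Distributed)DataParallel wrappers expose the original network as .module;
    # a plain nn.Module does not, which is what triggers the AttributeError above.
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# Works for both a bare model and a wrapped one, e.g. in train.py:
# for name, module in unwrap(model).named_modules(): ...
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
for name, module in unwrap(net).named_modules():
    print(name, type(module).__name__)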

But how do I use 2 or more GPUs to train the model? Which arguments should I use? Could you provide a script like run1.sh for single-GPU and multi-GPU training?

Thanks a lot

Hi Guolei,

You do not need to set args.multiprocessing_distributed to False. The only thing you need to change is the GPU list.

Here are the examples:

for 1 GPU:

gpu=0
name1=vgg_ADL1
epoch=200
decay=60
model=vgg16_ADL
server=tcp://127.0.0.1:01
batch=32
wd=5e-4
lr=0.001
ADL_pos="3M 4M 53"


CUDA_VISIBLE_DEVICES=${gpu} python train.py -a ${model} --dist-url ${server} \
--multiprocessing-distributed --world-size 1 --pretrained \
--data ../CUB_200_2011/ --dataset CUB \
--train-list datalist/CUB/train.txt \
--test-list datalist/CUB/test.txt \
--data-list datalist/CUB/ \
--ADL-pos ${ADL_pos} --ADL-rate 0.75 --ADL-thr 0.8 \
--task wsol \
--batch-size ${batch} --epochs ${epoch} --LR-decay ${decay} --wd ${wd} --lr ${lr} --nest --name ${name1}

for 2 GPUs:

gpu=0,1
name1=vgg_ADL1
epoch=200
decay=60
model=vgg16_ADL
server=tcp://127.0.0.1:01
batch=32
wd=5e-4
lr=0.001
ADL_pos="3M 4M 53"


CUDA_VISIBLE_DEVICES=${gpu} python train.py -a ${model} --dist-url ${server} \
--multiprocessing-distributed --world-size 1 --pretrained \
--data ../CUB_200_2011/ --dataset CUB \
--train-list datalist/CUB/train.txt \
--test-list datalist/CUB/test.txt \
--data-list datalist/CUB/ \
--ADL-pos ${ADL_pos} --ADL-rate 0.75 --ADL-thr 0.8 \
--task wsol \
--batch-size ${batch} --epochs ${epoch} --LR-decay ${decay} --wd ${wd} --lr ${lr} --nest --name ${name1}
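
For context, train.py spawns one worker process per visible GPU, which is why the same command works for one or several GPUs. Roughly, the launcher behaves like the simplified sketch below (not the exact code in train.py; the address is illustrative and stands for whatever you pass as --dist-url):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, dist_url):
    # Each spawned process drives one GPU and joins the same process group
    # at dist_url (the tcp://host:port address given by --dist-url).
    dist.init_process_group(backend="nccl", init_method=dist_url,
                            world_size=ngpus_per_node, rank=gpu)
    torch.cuda.set_device(gpu)
    # ... build the model, wrap it in DistributedDataParallel, and train ...

if __name__ == "__main__":
    # CUDA_VISIBLE_DEVICES controls how many GPUs torch can see, so
    # gpu=0 spawns one worker and gpu=0,1 spawns two (GPUs assumed available).
    ngpus_per_node = torch.cuda.device_count()
    mp.spawn(main_worker, nprocs=ngpus_per_node,
             args=(ngpus_per_node, "tcp://127.0.0.1:23456"))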

OK.
When I run "bash scripts/run1.sh", I get the following error:

Traceback (most recent call last):
File "train.py", line 387, in
main()
File "train.py", line 70, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/raid/guolei/wsol/ADL-master/Pytorch/train.py", line 102, in main_worker
world_size=args.world_size, rank=args.rank)
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Permission denied

I didn't change anything. In run1.sh, you use one GPU, right?

Yes, that's right. On our system, the code runs without any errors when we use only one GPU.

Did you change the GPU list? If your system has only one GPU, you should set the first line of run1.sh to gpu=0.

Actually, my system has 8 GPUs. I tried different GPU ids, and all of them give the same error as above. Could you check the reason?

Sure!

Could you change the port number? 01 is a privileged port and may be blocked on your system. You can use other ports instead, e.g., 8889 or 8890.

For example:

gpu=0
name1=vgg_ADL1
epoch=200
decay=60
model=vgg16_ADL
server=tcp://127.0.0.1:8889
batch=32
wd=5e-4
lr=0.001
ADL_pos="3M 4M 53"


CUDA_VISIBLE_DEVICES=${gpu} python train.py -a ${model} --dist-url ${server} \
--multiprocessing-distributed --world-size 1 --pretrained \
--data ../CUB_200_2011/ --dataset CUB \
--train-list datalist/CUB/train.txt \
--test-list datalist/CUB/test.txt \
--data-list datalist/CUB/ \
--ADL-pos ${ADL_pos} --ADL-rate 0.75 --ADL-thr 0.8 \
--task wsol \
--batch-size ${batch} --epochs ${epoch} --LR-decay ${decay} --wd ${wd} --lr ${lr} --nest --name ${name1}
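
If you want to check in advance whether a given port can be used, a quick bind test like this is enough (just a sketch; ports below 1024 are privileged on Linux, which is why 01 can fail with Permission denied):

import socket

def port_is_free(port, host="127.0.0.1"):
    # Try to bind the port ourselves; binding a privileged port (< 1024) as a
    # normal user, or a port already in use, raises OSError.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((host, port))
        return True
    except OSError:
        return False

print(port_is_free(1))     # usually False for a non-root user (privileged port)
print(port_is_free(8889))  # True if nothing else is listening on it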

I changed the port to 8888 and it works. Great! Thanks a lot.
I will run it and let you know what results I get.

I'm glad to hear that!

Please note that results from this PyTorch implementation can be slightly different from the results in the paper. We used the TensorFlow implementation for all experiments, as mentioned in the paper.

If you have any further questions, just let us know.

I understand. Thanks

I am sorry to reopen this, but I noticed that some of the configuration parameters here differ from those in the readme file and utils_args. I am using the TensorFlow version.

For VGG & CUB

  • Here it seems epoch=200, while according to the readme example (and the default in utils_args) it is 105
  • Here it seems base-lr=0.001, while according to the readme example it is 0.01

I am trying to replicate the VGG on CUB and I am getting different results

  • My classification accuracy (top-1) is 70-71%, while the paper reports 65%
  • My best localization accuracy (top-1) is 47%, while the paper reports 52%

So, I am trying to figure out what I am doing wrong

@ahmdtaha

Thanks for your comment.

Recently I noticed that the TensorFlow code in this repository is slightly different from the submission version. This is because I cleaned up the code while working on improving classification accuracy with ADL (one of our future plans, as mentioned in the paper). In addition, we use different training settings for the PyTorch and TensorFlow versions (the PyTorch implementation is not yet stable). I am sorry for the inconvenience. I will revise the code soon, but unfortunately I do not have the resources to test it right now; I can probably test it after the CVPR 2020 deadline.

In the meantime, you can use this command to reproduce our results:

python CAM-VGG.py --gpu 0 --data /notebooks/dataset/CUB200/ --cub --base-lr 0.01 --logdir VGGGAP_CUB --load VGG --batch 128 --attdrop 3 4 53 --threshold 0.80 --keep_prob 0.25

Thanks for your reply