junsukchoe / ADL

Attention-based Dropout Layer for Weakly Supervised Object Localization (CVPR 2019 Oral)


pytorch code is not clear

GuoleiSun opened this issue · comments

Hi,

Great paper, and thanks for providing the PyTorch code. I tried it, but it is written for a distributed setup. Could you clean it up a little so that it can easily run on any system with a few GPUs? Also, there are some errors in the code; could you correct them?

Thanks

Hi Guolei,

We are sorry for the inconvenience. A mistake was introduced when I cleaned up the code. We have revised the code to correct the errors; if any remain, please let us know.

Although our code uses the distributed-training modules, you can also run it on a single machine. We ourselves usually run it on a single PC with 1 or 2 GPU(s).

Thanks

Hi,
I tried to use one GPU and set args.multiprocessing_distributed to False, but I got the following error.
I use Python 3.7 (>3.3), and my PyTorch version is:
--> python -c "import torch; print(torch.__version__)"
--> 1.1.0

File "train.py", line 354, in train
for name, module in model.module.named_modules():
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in getattr
type(self).name, name))
AttributeError: 'VGG' object has no attribute 'module'

Hi,

I solved the problem by removing .module from model.module.named_modules().
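For reference, a small guard like the one below works whether or not the model is wrapped in DataParallel / DistributedDataParallel, so the call site does not have to change. This is only a sketch of the workaround, not code from the repository:

import torch.nn as nn

def unwrap(model):
    # (Distributed)DataParallel wrappers expose the original network as .module;
    # a plain nn.Module does not, which is what triggers the AttributeError above.
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# Works for both a bare model and a wrapped one, e.g. in train.py:
# for name, module in unwrap(model).named_modules(): ...
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
for name, module in unwrap(net).named_modules():
    print(name, type(module).__name__)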

But how do I use 2 or more GPUs to train the model? Which arguments should I use? Could you provide a script like run1.sh for single-GPU and multi-GPU training?

Thanks a lot

Hi Guolei,

You do not need to set args.multiprocessing_distributed to False. The only thing you need to change is the GPU list.

Here are the examples:

for 1 GPU:

gpu=0
name1=vgg_ADL1
epoch=200
decay=60
model=vgg16_ADL
server=tcp://127.0.0.1:01
batch=32
wd=5e-4
lr=0.001
ADL_pos="3M 4M 53"


CUDA_VISIBLE_DEVICES=${gpu} python train.py -a ${model} --dist-url ${server} \
--multiprocessing-distributed --world-size 1 --pretrained \
--data ../CUB_200_2011/ --dataset CUB \
--train-list datalist/CUB/train.txt \
--test-list datalist/CUB/test.txt \
--data-list datalist/CUB/ \
--ADL-pos ${ADL_pos} --ADL-rate 0.75 --ADL-thr 0.8 \
--task wsol \
--batch-size ${batch} --epochs ${epoch} --LR-decay ${decay} --wd ${wd} --lr ${lr} --nest --name ${name1}

for 2 GPUs:

gpu=0,1
name1=vgg_ADL1
epoch=200
decay=60
model=vgg16_ADL
server=tcp://127.0.0.1:01
batch=32
wd=5e-4
lr=0.001
ADL_pos="3M 4M 53"


CUDA_VISIBLE_DEVICES=${gpu} python train.py -a ${model} --dist-url ${server} \
--multiprocessing-distributed --world-size 1 --pretrained \
--data ../CUB_200_2011/ --dataset CUB \
--train-list datalist/CUB/train.txt \
--test-list datalist/CUB/test.txt \
--data-list datalist/CUB/ \
--ADL-pos ${ADL_pos} --ADL-rate 0.75 --ADL-thr 0.8 \
--task wsol \
--batch-size ${batch} --epochs ${epoch} --LR-decay ${decay} --wd ${wd} --lr ${lr} --nest --name ${name1}
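
For context, train.py spawns one worker process per visible GPU, which is why the same command works for one or several GPUs. Roughly, the launcher behaves like the simplified sketch below (not the exact code in train.py; the address is illustrative and stands for whatever you pass as --dist-url):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, dist_url):
    # Each spawned process drives one GPU and joins the same process group
    # at dist_url (the tcp://host:port address given by --dist-url).
    dist.init_process_group(backend="nccl", init_method=dist_url,
                            world_size=ngpus_per_node, rank=gpu)
    torch.cuda.set_device(gpu)
    # ... build the model, wrap it in DistributedDataParallel, and train ...

if __name__ == "__main__":
    # CUDA_VISIBLE_DEVICES controls how many GPUs torch can see, so
    # gpu=0 spawns one worker and gpu=0,1 spawns two (GPUs assumed available).
    ngpus_per_node = torch.cuda.device_count()
    mp.spawn(main_worker, nprocs=ngpus_per_node,
             args=(ngpus_per_node, "tcp://127.0.0.1:23456"))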

OK.
When I run "bash scripts/run1.sh", I get the following error:

Traceback (most recent call last):
File "train.py", line 387, in
main()
File "train.py", line 70, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/raid/guolei/wsol/ADL-master/Pytorch/train.py", line 102, in main_worker
world_size=args.world_size, rank=args.rank)
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/home/ubuntu/anaconda2/envs/Senet/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Permission denied

I didn't change anything. In run1.sh, you use one GPU, right?

Yes, that's right. On our system, the code runs without any errors when we use only one GPU.

Did you change the GPU list? If your system has only one GPU, you should set the first line of run1.sh to gpu=0.

Actually, my system has 8 GPUs. I tried different GPU ids, and all of them give the same error as above. Could you check the reason?

Sure!

Could you change the port number? 01 is a privileged port and may be blocked on your system. You can use other ports instead, e.g., 8889 or 8890.

For example:

gpu=0
name1=vgg_ADL1
epoch=200
decay=60
model=vgg16_ADL
server=tcp://127.0.0.1:8889
batch=32
wd=5e-4
lr=0.001
ADL_pos="3M 4M 53"


CUDA_VISIBLE_DEVICES=${gpu} python train.py -a ${model} --dist-url ${server} \
--multiprocessing-distributed --world-size 1 --pretrained \
--data ../CUB_200_2011/ --dataset CUB \
--train-list datalist/CUB/train.txt \
--test-list datalist/CUB/test.txt \
--data-list datalist/CUB/ \
--ADL-pos ${ADL_pos} --ADL-rate 0.75 --ADL-thr 0.8 \
--task wsol \
--batch-size ${batch} --epochs ${epoch} --LR-decay ${decay} --wd ${wd} --lr ${lr} --nest --name ${name1}
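
If you want to check in advance whether a given port can be used, a quick bind test like this is enough (just a sketch; ports below 1024 are privileged on Linux, which is why 01 can fail with Permission denied):

import socket

def port_is_free(port, host="127.0.0.1"):
    # Try to bind the port ourselves; binding a privileged port (< 1024) as a
    # normal user, or a port already in use, raises OSError.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((host, port))
        return True
    except OSError:
        return False

print(port_is_free(1))     # usually False for a non-root user (privileged port)
print(port_is_free(8889))  # True if nothing else is listening on it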

I changed the port to 8888 and it works. Great! Thanks a lot.
I will run it and let you know what results I get.

I'm glad to hear that!

Please note that results from this PyTorch implementation can be slightly different from the results in the paper. We used the TensorFlow implementation for all experiments, as mentioned in the paper.

If you have any further questions, just let us know.

I understand. Thanks

I am sorry to reopen this, but I noticed that some of the configuration parameters here differ from those in the readme file and utils_args. I am using the TensorFlow version.

For VGG & CUB

  • Here it seems epoch=200, while according to the readme example (and the default in utils_args) it is 105
  • Here it seems base-lr=0.001, while according to the readme example it is 0.01

I am trying to replicate the VGG on CUB and I am getting different results

  • My classification accuracy (top-1) is 70-71%, while the paper reports 65%
  • My best localization accuracy (top-1) is 47%, while the paper reports 52%

So, I am trying to figure out what I am doing wrong

@ahmdtaha

Thanks for your comment.

Recently I noticed that the TensorFlow code in this repository is slightly different from the submission version. This is because I cleaned up the code while working on improving classification accuracy with ADL (one of our future plans, as mentioned in the paper). In addition, we use different training settings for the PyTorch and TensorFlow versions (the PyTorch implementation is not yet stable). I am sorry for the inconvenience. I will revise the code soon, but unfortunately I do not have the resources to test it right now; I can probably test it after the CVPR 2020 deadline.

In the meantime, you can use this command to reproduce our results:

python CAM-VGG.py --gpu 0 --data /notebooks/dataset/CUB200/ --cub --base-lr 0.01 --logdir VGGGAP_CUB --load VGG --batch 128 --attdrop 3 4 53 --threshold 0.80 --keep_prob 0.25

Thanks for your reply