yz93 / LAVT-RIS

CUDA memory

Huntersxsx opened this issue

Hello, thanks for your great work!
You mentioned that 'The released lavt_one weights were trained using 8 x 32G V100 cards (max mem on each card was about 13G)', but I only have 8 x 11G 2080Ti GPUs. Therefore, I tried to use swin-tiny instead of swin-base. Unfortunately, 'CUDA out of memory. Tried to allocate 170.00 MiB (GPU 5; 10.76 GiB total capacity; 9.31 GiB already allocated; 47.12 MiB free; 9.68 GiB reserved in total by PyTorch)' still occurs.
I use the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 train.py \
--model lavt_one --dataset refcoco --model_id refcoco \
--batch-size 8 --lr 0.00005 --wd 1e-2 \
--swin_type tiny --pretrained_swin_weights ./pretrained_weights/swin_tiny_patch4_window7_224.pth \
--epochs 40 --img_size 480 2>&1 | tee ./models/refcoco/output
I want to know whether my command is incorrect, or whether 8 x 2080Ti cards cannot support even swin-tiny.
Looking forward to your reply. Thank you~

Hi, no problem.

11 GB should be enough for a swin-tiny.

Change --batch-size to 4 instead of 8. This argument is the number of samples per GPU card.

For instance, if you use 8 cards and --batch-size 8, then the total batch size would be 64. Changing it to 4 would give you a total batch size of 4x8=32.

That should solve the problem.
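
For reference, here is the same launch command with only --batch-size lowered to 4 (every other argument kept from your command above); this is a sketch, so please double-check it against your local setup:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 train.py \
--model lavt_one --dataset refcoco --model_id refcoco \
--batch-size 4 --lr 0.00005 --wd 1e-2 \
--swin_type tiny --pretrained_swin_weights ./pretrained_weights/swin_tiny_patch4_window7_224.pth \
--epochs 40 --img_size 480 2>&1 | tee ./models/refcoco/output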

Thank you, it works!
I have another concern about the test process.
Your concurrent work, CVPR2022-CRIS, uses different checkpoints for different validation sets, as stated in one of their issues.
In other words, they use 3 different checkpoints to test on RefCOCO val, testA, and testB. I wonder whether you choose the same checkpoint for all 3 validation sets of the same dataset, or whether you train 3 different checkpoints as CRIS did?

Hi,

We use one checkpoint for evaluation on all subsets of a dataset.

It is incorrect to use separate checkpoints for those subsets. We validated via the 'val' set (i.e., deciding on the weights); then we used the validated weights for evaluation on the test sets. The concepts of validating and testing really go back to the very basics of machine learning.

If a dataset simply doesn't have a test set, or the annotations for the test data are not available, then reporting on the validation set alone is what it is. But if there are test sets, it would be very wrong to fit the model to the test sets. One can only fit the model to the validation set, and evaluate it on the test sets.

Having said that, as the name 'val' suggests, the 'val' subset is the validation set. And as the names 'testA' and 'testB' suggest, those subsets are test sets.

It is very wrong to refer to all of those subsets as "validation" sets. They are not; their respective names already define what they are. All I can recommend is to never do that. Doing it intentionally, to get good scores, borders on unethical practice.
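
To make the protocol concrete, below is a minimal sketch of the validate-then-test procedure in terms of the evaluation script. The flag names (--split, --resume, etc.) and the checkpoint paths are assumptions modeled on the training command earlier in this thread, not verified against test.py, so please check the actual argument names in the repository:
# 1. Model selection: evaluate candidate checkpoints on the 'val' split only,
#    and keep the single checkpoint with the best val score.
python test.py --model lavt_one --swin_type tiny --dataset refcoco --split val \
--resume ./checkpoints/candidate_epoch40.pth --img_size 480
# 2. Reporting: evaluate that one chosen checkpoint, unchanged, on the test splits.
python test.py --model lavt_one --swin_type tiny --dataset refcoco --split testA \
--resume ./checkpoints/best_on_val.pth --img_size 480
python test.py --model lavt_one --swin_type tiny --dataset refcoco --split testB \
--resume ./checkpoints/best_on_val.pth --img_size 480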