TengdaHan / DPC

Video Representation Learning by Dense Predictive Coding. Tengda Han, Weidi Xie, Andrew Zisserman.


Batch size per GPU

ruoshiliu opened this issue

Hi,

I noticed that in the implementation, if you specify batch_size as 256 and train on 4 GPUs in parallel, the loss is effectively the sum of the losses over 4 mini-batches of size 64 (this is also mentioned in the paper). Could you please confirm this is correct?

If so, is there a specific reason for doing it this way? Intuitively, we could gather the predicted features from all 256 samples and obtain an N x N similarity matrix instead of N x (N/4), right? (Here N = B * pred_step * spatial_size**2.)
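A rough shape count for illustration, assuming pred_step = 3 and a 4x4 last feature map (these are my guesses, not values read from the config):

```python
# Illustrative shape count; pred_step and spatial_size are assumed values.
batch_size   = 256   # value passed on the command line
num_gpu      = 4
pred_step    = 3     # number of future steps predicted
spatial_size = 4     # side length of the last feature map

B2 = batch_size // num_gpu                       # per-GPU micro-batch: 64
N  = batch_size * pred_step * spatial_size ** 2  # 12288 rows in a "full-batch" matrix
N2 = B2 * pred_step * spatial_size ** 2          # 3072 rows/cols actually compared per GPU

print(B2, N, N2)  # 64 12288 3072
```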

Thanks!

You're right: you can gather the features from all GPUs first and then compute a single, larger similarity matrix. I think this (more negative samples for the contrastive loss) should give better performance.

When I ran the experiments, I was worried about speed and memory, so I computed the loss inside each GPU. PyTorch may be faster now, so you could give it a try.
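In case it helps anyone reading this later, here is a minimal sketch of gathering features across GPUs before building the similarity matrix. It assumes a torch.distributed (DDP) setup rather than nn.DataParallel, and the function and variable names are made up for illustration:

```python
import torch
import torch.distributed as dist

def gather_features(local_feat):
    """Collect feature tensors from every GPU so the similarity matrix
    covers the whole batch. all_gather itself does not propagate
    gradients, so the local slice is put back to keep its grad."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_feat) for _ in range(world_size)]
    dist.all_gather(gathered, local_feat)
    gathered[dist.get_rank()] = local_feat  # re-insert the grad-carrying copy
    return torch.cat(gathered, dim=0)

# pred, gt: [N_local, D] predicted / ground-truth features on this GPU
# score = torch.matmul(gather_features(pred), gather_features(gt).t())  # [N, N]
```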

I see, so memory and speed are potentially bottlenecks here. In that case, I would suggest renaming the batch_size argument or adding a comment to explain that a batch of N is split into micro-batches across the GPUs. I found it a bit misleading and only realized what was happening after printing the shape of the similarity matrix.

Anyway, thanks for confirming!

Thank you for the suggestion. I added a comment here:

DPC/dpc/main.py

Line 212 in 38d8fc8

# similarity matrix is computed inside each gpu, thus here B == num_gpu * B2