tamerthamoqa / facenet-pytorch-glint360k

A PyTorch implementation of the 'FaceNet' paper for training a facial recognition model with Triplet Loss using the glint360k dataset. A pre-trained model using Triplet Loss is available for download.

triplet_loss_dataloader.py

YoonSeongGyeol opened this issue

Hello, I'm Daniel.
While running your project, a question came up.

In dataloader/triplet_loss_dataloader, each process is assigned a share of the triplets to generate; for each triplet it randomly picks a (pos, neg) class pair and then randomly selects images from those classes.
However, when using np.random.choice, I confirmed that every process outputs the same random values.
After switching to a per-process np.random.RandomState(), each process uses different random values (a minimal reproduction is sketched below).
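
A minimal standalone sketch of the behaviour I observed (the class count and worker count here are made up, not the project's code):

```python
import multiprocessing as mp
import numpy as np

NUM_CLASSES = 1000  # hypothetical number of identities


def sample_classes(proc_id):
    # With a fork-based start method (the Linux default), every worker
    # inherits the parent's global NumPy random state, so these "random"
    # (pos, neg) draws come out identical in every process.
    pos_class, neg_class = np.random.choice(NUM_CLASSES, size=2, replace=False)
    return proc_id, int(pos_class), int(neg_class)


if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        # Typically prints the same (pos, neg) pair for all four workers.
        print(pool.map(sample_classes, range(4)))
```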

Please let me know whether my understanding of this process is correct.

Thank you.
Daniel

Hi Daniel,

Thank you very much for catching this one. The intention was only to speed up the triplet generation process, not to replicate the same generated triplets across the spawned processes, hehe. I have edited the dataloader as you described: the RandomState() object is now initialized with seed=None, so each process gets a fresh random seed and then randomly chooses the elements required for triplet creation.
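
Roughly, the per-process generation now looks like the following sketch (the function name and the dict-of-image-lists layout are illustrative, not the exact dataloader code):

```python
import numpy as np


def generate_triplets(num_triplets, num_classes, images_per_class):
    # images_per_class: dict mapping class id -> list of image paths/ids,
    # where every class is assumed to have at least two images.
    # seed=None draws fresh OS entropy, so each spawned process gets
    # a different random stream instead of replicated triplets.
    rng = np.random.RandomState(seed=None)
    triplets = []
    for _ in range(num_triplets):
        pos_class, neg_class = rng.choice(num_classes, size=2, replace=False)
        anc_img, pos_img = rng.choice(images_per_class[pos_class], size=2, replace=False)
        neg_img = rng.choice(images_per_class[neg_class])
        triplets.append((anc_img, pos_img, neg_img, pos_class, neg_class))
    return triplets
```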

To be clear, the current pre-trained model was trained on 10 million triplets that were generated without the multi-processing method.

The reason I am using the "triplet generation" method is to have some kind of naive reproducibility when changing training parameters. The plan for future experiments is to use a set number of human identities per triplet batch, with the dataloader generating and yielding a set number of triplets per training iteration instead of working from a pre-generated list of triplets as in the current version.
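
As a rough illustration of that planned online sampler (purely hypothetical, nothing like this is implemented yet):

```python
import numpy as np


def sample_triplet_batch(rng, images_per_class, classes_per_batch, triplets_per_batch):
    # images_per_class is the same class-id -> images mapping as above.
    # Restrict each batch to a fixed number of identities, then build
    # that batch's triplets on the fly from those identities only.
    batch_classes = rng.choice(len(images_per_class), size=classes_per_batch, replace=False)
    triplets = []
    for _ in range(triplets_per_batch):
        pos_class, neg_class = rng.choice(batch_classes, size=2, replace=False)
        anc_img, pos_img = rng.choice(images_per_class[pos_class], size=2, replace=False)
        neg_img = rng.choice(images_per_class[neg_class])
        triplets.append((anc_img, pos_img, neg_img))
    return triplets
```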

However, there are two current issues I am dealing with that you should be aware of before using this project:

1- After some training "epochs", the BatchNorm2D operation requires more VRAM and causes a CUDA Out of Memory exception. Since one epoch took around 11 hours on my PC, I was training one epoch per day and shutting the process down afterwards so I could use my PC for other things; that way I managed to get the 256 batch size training to work, but it would still hit an OOM if left running for several epochs. I would therefore recommend using a lower batch size that initially allocates around 40-60% of your GPU VRAM (see the VRAM check sketched after this list).

2- I tried switching to CPU for the iterations that caused the OOM in order to continue training. Unfortunately, switching to CPU had a negative impact on the model's performance metrics, and I still don't know why that is the case.
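
For the batch size recommendation in point 1, a quick way to check the initially allocated fraction of VRAM (run once after building the model and doing a warm-up forward/backward pass):

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    # memory_allocated only counts tensors PyTorch has allocated, which is
    # a reasonable proxy for the initial footprint of a given batch size.
    allocated = torch.cuda.memory_allocated(device)
    total = torch.cuda.get_device_properties(device).total_memory
    print(f"Initial allocation: {allocated / total:.0%} of VRAM (aim for roughly 40-60%)")
```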

Again, thank you very much for catching the issue.

Hello.

Thank you for answering my question.
My PC has 4 TITAN GPUs (12 GB each), so I used multi-GPU training (data parallel); in effect each GPU processes a batch of 256/4 = 64.
Currently, one epoch (10,000,000 triplets) finishes in approximately 3 hours.
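
The multi-GPU part is just the standard DataParallel wrapping (the backbone here is a placeholder, not the actual model):

```python
import torch
import torch.nn as nn

# Placeholder backbone; nn.DataParallel splits each 256-sample batch
# across the visible GPUs, so with 4 GPUs each one processes 64 samples.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 512))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```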

There is no problem at present; the only slight difference is that performance is a bit lower, since I mostly use torch.cuda.empty_cache() to avoid OOM.
Training now runs without any problems.
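
Concretely, I call it at the end of each iteration, roughly like this toy loop (the model, data, and margin are placeholders):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 128)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.TripletMarginLoss(margin=0.2)

for step in range(5):
    # Random stand-in batches of anchor / positive / negative images.
    anc, pos, neg = (torch.randn(64, 3, 112, 112, device=device) for _ in range(3))
    loss = criterion(model(anc), model(pos), model(neg))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if device == "cuda":
        torch.cuda.empty_cache()  # frees cached blocks; trades a little speed for headroom
```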

We may work on this as well. I have noticed that triplet generation is not a very fast process; dataframes are probably not that fast for this kind of usage.
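
One illustrative option (not implemented) would be to group the dataframe once into per-class NumPy arrays and sample from those, instead of filtering the dataframe for every triplet:

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the real image list.
df = pd.DataFrame({"class": [0, 0, 1, 1, 2, 2],
                   "image_id": [10, 11, 20, 21, 30, 31]})

# One-off grouping: class id -> NumPy array of its image ids.
images_per_class = {c: g["image_id"].to_numpy() for c, g in df.groupby("class")}

rng = np.random.RandomState(seed=None)
pos_class, neg_class = rng.choice(list(images_per_class), size=2, replace=False)
anc_img, pos_img = rng.choice(images_per_class[pos_class], size=2, replace=False)
neg_img = rng.choice(images_per_class[neg_class])
print(anc_img, pos_img, neg_img)
```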