tamerthamoqa / facenet-pytorch-glint360k

A PyTorch implementation of the 'FaceNet' paper for training a facial recognition model with Triplet Loss using the glint360k dataset. A pre-trained model using Triplet Loss is available for download.

triplet_loss_dataloader.py

YoonSeongGyeol opened this issue

Hello, I'm Daniel.
While running your project, a question came up.

In dataloader/triplet_loss_dataloader, each process is assigned a share of the triplets to generate; for each triplet it randomly picks a (pos, neg) class pair and then randomly selects images from those classes.
However, when using np.random.choice, I confirmed that every process outputs the same random values.
After switching to a per-process np.random.RandomState(), each process uses different random values (a minimal reproduction is sketched below).
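
A minimal standalone sketch of the behaviour I observed (the class count and worker count here are made up, not the project's code):

```python
import multiprocessing as mp
import numpy as np

NUM_CLASSES = 1000  # hypothetical number of identities


def sample_classes(proc_id):
    # With a fork-based start method (the Linux default), every worker
    # inherits the parent's global NumPy random state, so these "random"
    # (pos, neg) draws come out identical in every process.
    pos_class, neg_class = np.random.choice(NUM_CLASSES, size=2, replace=False)
    return proc_id, int(pos_class), int(neg_class)


if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        # Typically prints the same (pos, neg) pair for all four workers.
        print(pool.map(sample_classes, range(4)))
```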

Please let me know whether my understanding of this process is correct.

Thank you.
Daniel

Hi Daniel,

Thank you very much for catching this one. The intention was only to speed up the triplet generation process, not to replicate the same generated triplets across the spawned processes, hehe. I have edited the dataloader as you described: the RandomState() object is now initialized with seed=None, so each process gets a fresh random seed and then randomly chooses the elements required for triplet creation.
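
Roughly, the per-process generation now looks like the following sketch (the function name and the dict-of-image-lists layout are illustrative, not the exact dataloader code):

```python
import numpy as np


def generate_triplets(num_triplets, num_classes, images_per_class):
    # images_per_class: dict mapping class id -> list of image paths/ids,
    # where every class is assumed to have at least two images.
    # seed=None draws fresh OS entropy, so each spawned process gets
    # a different random stream instead of replicated triplets.
    rng = np.random.RandomState(seed=None)
    triplets = []
    for _ in range(num_triplets):
        pos_class, neg_class = rng.choice(num_classes, size=2, replace=False)
        anc_img, pos_img = rng.choice(images_per_class[pos_class], size=2, replace=False)
        neg_img = rng.choice(images_per_class[neg_class])
        triplets.append((anc_img, pos_img, neg_img, pos_class, neg_class))
    return triplets
```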

To be clear, the current pre-trained model was trained on 10 million triplets that were generated without the multi-processing method.

The reason I am using the "triplet generation" method is to have some kind of naive reproducibility when changing training parameters. The plan for future experiments is to use a set number of human identities per triplet batch, with the dataloader generating and yielding a set number of triplets per training iteration instead of working from a pre-generated list of triplets as in the current version.
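
As a rough illustration of that planned online sampler (purely hypothetical, nothing like this is implemented yet):

```python
import numpy as np


def sample_triplet_batch(rng, images_per_class, classes_per_batch, triplets_per_batch):
    # images_per_class is the same class-id -> images mapping as above.
    # Restrict each batch to a fixed number of identities, then build
    # that batch's triplets on the fly from those identities only.
    batch_classes = rng.choice(len(images_per_class), size=classes_per_batch, replace=False)
    triplets = []
    for _ in range(triplets_per_batch):
        pos_class, neg_class = rng.choice(batch_classes, size=2, replace=False)
        anc_img, pos_img = rng.choice(images_per_class[pos_class], size=2, replace=False)
        neg_img = rng.choice(images_per_class[neg_class])
        triplets.append((anc_img, pos_img, neg_img))
    return triplets
```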

However, there are two current issues I am dealing with that you should be aware of before using this project:

1- After some training "epochs", the BatchNorm2D operation requires more VRAM and causes a CUDA Out of Memory exception. Since one epoch took around 11 hours on my PC, I was training one epoch per day and shutting the process down afterwards so I could use my PC for other things; that way I managed to get the 256 batch size training to work, but it would still hit an OOM if left running for several epochs. I would therefore recommend using a lower batch size that initially allocates around 40-60% of your GPU VRAM (see the VRAM check sketched after this list).

2- I tried switching to CPU for the iterations that caused the OOM in order to continue training. Unfortunately, switching to CPU had a negative impact on the model's performance metrics, and I still don't know why that is the case.
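
For the batch size recommendation in point 1, a quick way to check the initially allocated fraction of VRAM (run once after building the model and doing a warm-up forward/backward pass):

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    # memory_allocated only counts tensors PyTorch has allocated, which is
    # a reasonable proxy for the initial footprint of a given batch size.
    allocated = torch.cuda.memory_allocated(device)
    total = torch.cuda.get_device_properties(device).total_memory
    print(f"Initial allocation: {allocated / total:.0%} of VRAM (aim for roughly 40-60%)")
```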

Again, thank you very much for catching the issue.

Hello.

Thank you for answering my question.
My PC has 4 TITAN GPUs (12 GB each), so I used multi-GPU training (data parallel); in effect each GPU processes a batch of 256/4 = 64.
Currently, one epoch (10,000,000 triplets) finishes in approximately 3 hours.
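
The multi-GPU part is just the standard DataParallel wrapping (the backbone here is a placeholder, not the actual model):

```python
import torch
import torch.nn as nn

# Placeholder backbone; nn.DataParallel splits each 256-sample batch
# across the visible GPUs, so with 4 GPUs each one processes 64 samples.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 512))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```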

There is no problem at present; the only slight difference is that performance is a bit lower, since I mostly use torch.cuda.empty_cache() to avoid OOM.
Training now runs without any problems.
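
Concretely, I call it at the end of each iteration, roughly like this toy loop (the model, data, and margin are placeholders):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 128)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.TripletMarginLoss(margin=0.2)

for step in range(5):
    # Random stand-in batches of anchor / positive / negative images.
    anc, pos, neg = (torch.randn(64, 3, 112, 112, device=device) for _ in range(3))
    loss = criterion(model(anc), model(pos), model(neg))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if device == "cuda":
        torch.cuda.empty_cache()  # frees cached blocks; trades a little speed for headroom
```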

We may work on this as well. I have noticed that triplet generation is not a very fast process; dataframes are probably not that fast for this kind of usage.
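
One illustrative option (not implemented) would be to group the dataframe once into per-class NumPy arrays and sample from those, instead of filtering the dataframe for every triplet:

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the real image list.
df = pd.DataFrame({"class": [0, 0, 1, 1, 2, 2],
                   "image_id": [10, 11, 20, 21, 30, 31]})

# One-off grouping: class id -> NumPy array of its image ids.
images_per_class = {c: g["image_id"].to_numpy() for c, g in df.groupby("class")}

rng = np.random.RandomState(seed=None)
pos_class, neg_class = rng.choice(list(images_per_class), size=2, replace=False)
anc_img, pos_img = rng.choice(images_per_class[pos_class], size=2, replace=False)
neg_img = rng.choice(images_per_class[neg_class])
print(anc_img, pos_img, neg_img)
```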