leeesangwon / PyTorch-Image-Retrieval

A PyTorch framework for an image retrieval task including implementation of N-pair Loss (NIPS 2016) and Angular Loss (ICCV 2017).

Hard negative mining?

batrlatom opened this issue:

Did you try hard negative mining or some other technique to improve recall and accuracy? For example, I have images similar to the Online Products or a clothing dataset, with about 10,000 identities divided into about 30 supercategories, e.g. jeans (300 identities), shoes (400 identities), etc. When I use your code, it is able to distinguish jeans from shoes, but properly distinguishing one shoe from another is problematic. Do you have any tips on how to correctly identify among such a large number of identities that are also clustered into a number of supercategories?

I'm afraid we didn't try any hard negative mining methods, because we thought N-pair loss and Angular loss could overcome the outlier problem by utilizing more sample pairs than Triplet loss.
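For intuition: in an N-pair batch, each anchor is scored against the positives of all N classes at once, so every update already involves N-1 negatives. A minimal sketch of that objective; the function and variable names here are illustrative, not this repository's actual code:

```python
import torch
import torch.nn.functional as F

def n_pair_loss(anchors, positives):
    """N-pair loss (Sohn, NIPS 2016). `anchors` and `positives` are (N, D)
    embedding batches where row i of both tensors shares one identity and
    the N identities are all distinct, so for anchor i the other N-1
    positives act as negatives."""
    logits = anchors @ positives.t()  # (N, N) similarity matrix
    targets = torch.arange(len(anchors), device=anchors.device)
    # Row-wise softmax cross-entropy: the diagonal (true pair) should win.
    return F.cross_entropy(logits, targets)
```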
That said, for the remaining hard cases, referring to hard negative mining implementations for the triplet loss might still be helpful to you.
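For example, a batch-hard mining scheme in the spirit of Hermans et al. ("In Defense of the Triplet Loss", 2017) looks roughly like the sketch below; it assumes every identity appears at least twice per batch, and the margin value is arbitrary:

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """For each anchor, mine the hardest positive (farthest same-label
    sample) and the hardest negative (closest other-label sample) within
    the batch. Assumes every label occurs at least twice in the batch."""
    dist = torch.cdist(embeddings, embeddings, p=2)  # (B, B) distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # Hardest positive: largest distance among same-label pairs (not self).
    hardest_pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    # Hardest negative: smallest distance among different-label pairs.
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values

    return F.relu(hardest_pos - hardest_neg + margin).mean()
```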

Thanks.

Yes, in the end I was able to use your code with AutoAugment and googlenet to fit 72 pairs per batch, and the problem is gone for now. So for me it is better to use a smaller network like googlenet with 72 pairs than, for example, densenet161 with 15 pairs.

A few more questions, if I may ...

I also tried smaller networks like mnasnet and mobilenet (code from the torchvision models), but I was unable to achieve any good results. In fact, recall and precision decreased every epoch. Do you see any specific reason why those particular networks are problematic to train? Could it be about hyperparameters like the learning rate? Every other network from torchvision.models works like a charm. I cannot understand why googlenet gives me better results than those two when they are supposed to be superior in every way.

Do you have specific settings (model with respect to GPU RAM, learning rate, number of classes, input image size, image cropping with SSD, etc.) that worked best for you?

Does self-attention work the same way as weakly supervised object localization? Alibaba uses weakly supervised object localization to mask the image so that the noisy background is removed. I used https://github.com/junsukchoe/ADL/tree/master/Pytorch to learn bounding boxes from weak labels and to crop the images that later went into the trainer, roughly as sketched below. Does self-attention achieve the same (or a similar) effect end to end?
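The cropping step I mean looks roughly like this (a simplified sketch; the box format and helper name are mine, not ADL's actual API):

```python
from PIL import Image

def crop_to_weak_box(image_path, box, out_size=(288, 288), pad_ratio=0.1):
    """Crop an image to a weakly supervised bounding box before training.
    `box` is assumed to be (x1, y1, x2, y2) in pixels, e.g. produced by an
    ADL-style localizer; the padding ratio is an arbitrary choice."""
    img = Image.open(image_path).convert('RGB')
    x1, y1, x2, y2 = box
    # Pad the box slightly so tight localizations don't clip the object.
    dx, dy = pad_ratio * (x2 - x1), pad_ratio * (y2 - y1)
    left, top = max(0, x1 - dx), max(0, y1 - dy)
    right, bottom = min(img.width, x2 + dx), min(img.height, y2 + dy)
    return img.crop((left, top, right, bottom)).resize(out_size)
```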

When I test the code with, for example, 100 reference classes, I get recall@1 of about 50%; with 1000 classes I get about 30%. The more classes I add, the lower the recall. Is this normal? I thought I would be able to achieve 80%+ recall@1 as stated in the N-pair paper (for the Online Products dataset, for example, with 22k classes). Were you able to achieve recall over 75% in a real-world setting (1000+ different classes in the references)?
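For reference, I compute recall@1 roughly as below (a sketch with my own names; it assumes L2-normalized embeddings, so the dot product is cosine similarity):

```python
import torch

def recall_at_1(query_emb, query_labels, ref_emb, ref_labels):
    """Fraction of queries whose single nearest reference shares the same
    identity label. Both embedding matrices are assumed L2-normalized."""
    sim = query_emb @ ref_emb.t()   # (num_queries, num_refs) cosine scores
    nearest = sim.argmax(dim=1)     # index of each query's top-1 reference
    return (ref_labels[nearest] == query_labels).float().mean().item()
```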

  1. We also tried PNASNet, which achieves a better classification score than DenseNet thanks to AutoML.
    However, we observed that DenseNet shows better performance for the image retrieval task.
    We could not find any clear reason for this, but one of our assumptions is that PNASNet is too optimized for the classification task. The only thing AutoML considers while optimizing a neural network is increasing classification performance, so it could trade off a 0.1% improvement in classification against a 10% degradation in retrieval performance.

  2. We used two K80 GPUs for the training, with an input size of 288, num-classes of 42, and densenet161.
    Also, we only took the data augmentation method from SSD; we did not crop the images using an object detection algorithm. For the LR policy, I don't remember the exact setup, but I probably used the default one. (A rough sketch of this setup is given after this list.)

  3. I have no experience with weakly supervised object localization, so I cannot confirm whether self-attention and weak object localization perform a similar role.

  4. Because our experiment was conducted on a confidential dataset, I'm sorry I cannot give you a detailed answer about this. The dataset we used contains a very large amount of real-world data, and we were able to get satisfactory results.
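In plain PyTorch, the backbone setup from point 2 corresponds to something like the sketch below (not the repository's exact training code; the 128-d embedding size is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

# DenseNet-161 trunk with a linear embedding head, 288x288 inputs.
# The 128-d embedding size is illustrative, not the repo's exact value.
trunk = models.densenet161(pretrained=True)
in_features = trunk.classifier.in_features   # 2208 for densenet161
trunk.classifier = nn.Identity()             # expose the pooled features

embedder = nn.Sequential(trunk, nn.Linear(in_features, 128))

x = torch.randn(4, 3, 288, 288)              # dummy batch
emb = F.normalize(embedder(x), dim=1)        # L2-normalized embeddings
```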

Thanks.