About the dataset

slacklife opened this issue 4 years ago

Hi @yl-1993, really amazing work!
I have some questions about the dataset described in https://github.com/yl-1993/learn-to-cluster/blob/master/DATASET.md:

Are the features in ms1m_part1.tar.gz extracted with resnet50_part0_train.pth.tar?
What is the training data of the model resnet50_part0_train.pth.tar? Is it the images of part0_train.bin?
If the pretrained model is trained on the images of part0_train.bin and clustering is then trained on part1_train.bin, what if I extract features with a model trained on a different image set and use those features to train clustering?

Thanks!

Lei Yang · Answer 1 · Sat Aug 08 2020 01:45:43 GMT+0800 (China Standard Time)

Hi @slacklife, thanks for checking out our work.

Yes, all the features, from part0 to part9, are extracted with the same model trained on part0.
The training data is part0. We first train on part0 to get a model and then use the model to extract features for part0.
Yes. For our experiments, we wanted to re-use the annotated data, so we use the same labeled data to train both the feature extractor and the clustering model. This indicates that we can discover more knowledge from the same amount of training data. I think using separate data for training the feature extractor and the clustering model will give better results. Separate data helps reduce the statistical gap in edge similarities between training and testing; otherwise, the edge similarities on the training graphs are much higher than those on the testing graphs. (Actually, I have received some emails about trying similar ideas, and they seemed to work well.)
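
One rough way to check for the edge-similarity gap mentioned above is to compare the average cosine similarity of k-nearest-neighbour edges on the training features against the same statistic on the testing features. The sketch below is not code from the learn-to-cluster repository: the function names `mean_knn_edge_similarity` and `similarity_gap` are hypothetical, the toy features are made up, and it assumes the graph edges are the top-k cosine neighbours of each sample.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_knn_edge_similarity(features, k):
    """Average cosine similarity over each sample's k nearest neighbours."""
    if not 0 < k < len(features):
        raise ValueError("k must be between 1 and len(features) - 1")
    normed = [l2_normalize(f) for f in features]
    total = 0.0
    for i, a in enumerate(normed):
        # Cosine similarity to every other sample (self is excluded).
        sims = [
            sum(x * y for x, y in zip(a, b))
            for j, b in enumerate(normed)
            if j != i
        ]
        sims.sort(reverse=True)
        total += sum(sims[:k])
    return total / (len(normed) * k)


def similarity_gap(train_features, test_features, k):
    """Positive when training-graph edges are, on average, more similar."""
    return (mean_knn_edge_similarity(train_features, k)
            - mean_knn_edge_similarity(test_features, k))


# Made-up 2-D features: tight clusters stand in for data the extractor
# has already seen, looser clusters for data it has not seen.
train_toy = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]]
test_toy = [[1.0, 0.0], [1.0, 0.5], [0.0, 1.0], [0.5, 1.0]]
gap = similarity_gap(train_toy, test_toy, k=1)
print(f"mean edge-similarity gap (train - test): {gap:.4f}")
```

A clearly positive gap on real features would be consistent with the concern above; this brute-force version is quadratic in the number of samples, so it only suits a small subsample.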

Howard · Answer 2 · Tue Aug 11 2020 11:08:28 GMT+0800 (China Standard Time)

Hi, many thanks for your reply!
I have another question I would like to figure out.

HFsoftmax is designed for massive classification, and its result is slightly lower than that of full softmax in Table 1 of the paper.
There are only about 8.6K identities in part0, far fewer than the 100K identities in the full Celeb-1M and the 672K identities in MegaFace. Why do you use hfsoftmax to train resnet50_part0_train.pth.tar instead of full softmax or CosFace (I found a CosFace classifier implementation in hfsoftmax)? These methods might give a better result.

Lei Yang · Answer 3 · Wed Aug 12 2020 16:55:40 GMT+0800 (China Standard Time)

@slacklife You are right. Actually, to avoid interference from irrelevant factors, we use the full softmax for training instead of hfsoftmax and CosFace. To be more clear, we mentioned hfsoftmax here only to use the shared code for face recognition.

Another paper from our group, CDP, has studied the influence of models trained with different losses. It shows that a better initial model often leads to better clustering results.

Carrie Hou · Answer 4 · Tue Dec 01 2020 17:16:25 GMT+0800 (China Standard Time)

Hi @slacklife, could you unzip the file “resnet50_part0_train.pth.tar”?

Lei Yang · Answer 5 · Wed Dec 02 2020 12:10:30 GMT+0800 (China Standard Time)

Hi @houguanqun, please refer to the discussion of pretrained model in #75.