CompVis / metric-learning-divide-and-conquer

Source code for the paper "Divide and Conquer the Embedding Space for Metric Learning", CVPR 2019

Why the need to divide feature space into d/K

avn3r-dn opened this issue

Currently, the paper suggests dividing the d embedding dimensions into K feature spaces; with d=128 and K=8, the embedding per cluster is 16-dimensional. I understand this was mainly done to keep the comparison of embedding sizes against other models fair. But theoretically, can't we just use a new embedding space of d*K = 1024? If so, have you experimented with how the two compare: 128d vs. 1024d, where one uses d/K dimensions per cluster (sketched below) and the other uses the entire embedding space per cluster?

Regards.
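
For concreteness, here is a minimal PyTorch sketch of the split the question describes. The dimensions (d=128, K=8) come from the thread; the batch size and tensor names are assumptions for illustration, not the repo's actual code.

```python
import torch

# Dimensions from the thread: d = 128, K = 8 learners -> d/K = 16 dims each.
d, K = 128, 8
chunk = d // K

x = torch.randn(32, d)             # a batch of 32 embeddings (batch size assumed)
# Slice the d-dim embedding into K non-overlapping sub-embeddings,
# one per learner/cluster.
subspaces = x.split(chunk, dim=1)  # tuple of K tensors, each of shape (32, 16)
assert len(subspaces) == K and subspaces[0].shape == (32, chunk)
```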

Hi,

A 1024d embedding space would obviously give better results due to its larger capacity.

Best,
Artsiom

Hey Artsiom,

Sorry, I didn't explain my question well enough. My question is: why do we need to divide the feature space into K chunks? I understand the value of the K clusters and K losses, but what do we gain from dividing the feature space into K chunks? The only gain I can see is keeping the feature space the same size as the original.

Assuming the number of features is not a constraint, can't I just skip splitting the features into K chunks? For example, you use 128d and divide it into 8 chunks of 16 dimensions, which you then merge at the output. Can't I just use the full 128d per cluster directly and then merge them into 128d x 8 dimensions?

Regards.

Yes, you can use 128d x 8, and it is the same as splitting 1024d into 8 parts :). And, as you pointed out, the comparison to the 128d baselines would no longer be fair.
But if you now compare 128d x 8 trained with the proposed method against a 1024d baseline, then our method should give better results.

Best,
Artsiom
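
A small sketch of the equivalence Artsiom describes: K independent 128d learners concatenated at the output are exactly one 1024d embedding split into K chunks. The backbone feature size (512) and the linear layers are assumptions for illustration, not the repo's actual architecture.

```python
import torch

K, d_learner = 8, 128
feats = torch.randn(32, 512)  # backbone features (512 is an assumed size)

# K independent 128-d learners whose outputs are concatenated ...
learners = [torch.nn.Linear(512, d_learner) for _ in range(K)]
merged = torch.cat([f(feats) for f in learners], dim=1)  # shape (32, 1024)

# ... give exactly one 1024-d embedding split into 8 chunks of 128.
chunks = merged.split(d_learner, dim=1)
assert torch.equal(chunks[0], learners[0](feats))
```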

The split of the embedding space into K chunks serves the purpose of making sampling easier. If you do not divide the embedding space into K chunks, you have to resort to hard example mining, which does not yield comparable results (as most of the triplets are more or less useless).
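
As a rough illustration of that point, a per-learner loss might look like the sketch below: each learner only sees samples assigned to its cluster and only its own d/K-dim slice, so even naive triplet selection stays informative. The function name, shapes, and the simple hardest-positive/closest-negative selection are assumptions for the sketch, not the paper's exact mining scheme.

```python
import torch
import torch.nn.functional as F

def learner_loss(embeddings, labels, cluster_ids, k, chunk=16, margin=0.2):
    """Hypothetical helper: loss for learner k, computed only on samples in
    cluster k and only on learner k's d/K-dim slice of the embedding."""
    mask = cluster_ids == k
    e = embeddings[mask][:, k * chunk:(k + 1) * chunk]
    y = labels[mask]
    # Within a cluster the samples are already similar to each other,
    # so simple triplet selection remains useful without hard mining.
    dist = torch.cdist(e, e)
    pos = y.unsqueeze(0) == y.unsqueeze(1)
    d_pos = (dist * pos.float()).max(dim=1).values                  # hardest positive
    d_neg = dist.masked_fill(pos, float('inf')).min(dim=1).values   # closest negative
    return F.relu(d_pos - d_neg + margin).mean()
```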

@fwahhab89 we split the embedding space not only for the sampling but also to make the learners less correlated.