SeanLee97 / AnglE

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard

Home Page: https://arxiv.org/abs/2309.12871

Which feature to use?

yuanze1024 opened this issue · comments

Thank you for your work. I'm new to NLP, and I want to know which feature I should use to cluster similar sentences.

After running UAE (non-retrieval), I get an (n, 1024) feature. Should I use the start token's feature, the same as with E5?

And BTW, I found that with E5, "A red teddy bear wearing a blue shirt" is very similar to "A blue teddy bear wearing a red shirt". Similarly, "A man riding a horse" ends up close to "A horse riding a man". Is that a problem for all such algorithms?
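For reference, a minimal sketch of this kind of comparison, assuming the intfloat/e5-large-v2 checkpoint and the sentence-transformers library (the "query: " prefix follows E5's documented usage; both are assumptions, not necessarily the exact setup above):

```python
# Sketch: check how an E5 model scores sentences with swapped words.
# Checkpoint name and the "query: " prefix are assumptions based on
# E5's documented usage.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('intfloat/e5-large-v2')
a = model.encode('query: A man riding a horse', normalize_embeddings=True)
b = model.encode('query: A horse riding a man', normalize_embeddings=True)
print(util.cos_sim(a, b))  # high similarity despite the swapped roles
```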

commented

Hi @yuanze1024, thanks for following our work.

  1. Right, if you get an (n, 1024) feature, you should take the first token's vector as the sentence embedding. Alternatively, you can use our library angle_emb to extract sentence embeddings, as illustrated in UAE (non-retrieval); see the sketch after this list.

  2. I think so, because there are few such hard cases in existing training datasets. If you want to improve performance on these hard cases, you should collect more hard examples and fine-tune the model on them.
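Here is a minimal sketch of the angle_emb usage from point 1, following the README of this repo. WhereIsAI/UAE-Large-V1 is the released UAE checkpoint; the clustering step and cluster count are illustrative assumptions, and the exact API may differ between library versions:

```python
# Sketch: extract CLS-pooled sentence embeddings with angle_emb
# and cluster them. The clustering step is illustrative only.
from angle_emb import AnglE
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# pooling_strategy='cls' takes the first token's vector -- the
# "first one" of the (n, 1024) feature discussed above.
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1',
                              pooling_strategy='cls')  # add .cuda() if a GPU is available

sentences = [
    'A man riding a horse',
    'Someone rides a horse',
    'A red teddy bear wearing a blue shirt',
    'A stuffed bear in a shirt',
]
vecs = angle.encode(sentences, to_numpy=True)  # shape: (4, 1024)

# L2-normalize so Euclidean k-means approximates cosine-based clustering.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(normalize(vecs))
print(labels)
```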

OK, I see. Thank you for your really quick response.