SeanLee97 / AnglE

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard

Home Page: https://arxiv.org/abs/2309.12871

Which feature to use?

yuanze1024 opened this issue · comments

Thank you for your work. I'm new to NLP, and I want to know which feature I should use to cluster similar sentences.

After running UAE (non-retrieval), I get an (n, 1024) feature. Should I use the start token's feature, the same as with E5?

And BTW, I found that with E5, "A red teddy bear wearing a blue shirt" is very similar to "A blue teddy bear wearing a red shirt". Similarly, "A man riding a horse" ends up close to "A horse riding a man". Is that a problem for all such algorithms?
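For reference, a minimal sketch of this kind of comparison, assuming the intfloat/e5-large-v2 checkpoint and the sentence-transformers library (the "query: " prefix follows E5's documented usage; both are assumptions, not necessarily the exact setup above):

```python
# Sketch: check how an E5 model scores sentences with swapped words.
# Checkpoint name and the "query: " prefix are assumptions based on
# E5's documented usage.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('intfloat/e5-large-v2')
a = model.encode('query: A man riding a horse', normalize_embeddings=True)
b = model.encode('query: A horse riding a man', normalize_embeddings=True)
print(util.cos_sim(a, b))  # high similarity despite the swapped roles
```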

commented

Hi @yuanze1024, thanks for following our work.

  1. Right, if you get an (n, 1024) feature, you should take the first token's vector as the sentence embedding. Alternatively, you can use our library angle_emb to extract sentence embeddings, as illustrated in UAE (non-retrieval); see the sketch after this list.

  2. I think so, because there are few such hard cases in existing training datasets. If you want to improve performance on these hard cases, you should collect more hard examples and fine-tune the model on them.
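Here is a minimal sketch of the angle_emb usage from point 1, following the README of this repo. WhereIsAI/UAE-Large-V1 is the released UAE checkpoint; the clustering step and cluster count are illustrative assumptions, and the exact API may differ between library versions:

```python
# Sketch: extract CLS-pooled sentence embeddings with angle_emb
# and cluster them. The clustering step is illustrative only.
from angle_emb import AnglE
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# pooling_strategy='cls' takes the first token's vector -- the
# "first one" of the (n, 1024) feature discussed above.
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1',
                              pooling_strategy='cls')  # add .cuda() if a GPU is available

sentences = [
    'A man riding a horse',
    'Someone rides a horse',
    'A red teddy bear wearing a blue shirt',
    'A stuffed bear in a shirt',
]
vecs = angle.encode(sentences, to_numpy=True)  # shape: (4, 1024)

# L2-normalize so Euclidean k-means approximates cosine-based clustering.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(normalize(vecs))
print(labels)
```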

OK, I see. Thank you for your really quick response.