SeanLee97 / AnglE

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard

Home Page: https://arxiv.org/abs/2309.12871


UAE - explanation of Non-Retrieval vs Retrieval

legolego opened this issue · comments

Hello, could you please add a little explanation of the difference between Non-Retrieval and Retrieval tasks for UAE? Why would one be used instead of the other? I'm looking to create sentence embeddings to store in a database. Thank you!

commented

Hi @legolego , thanks for following our work.

In UAE, we use different approaches for retrieval and non-retrieval tasks, each serving a different purpose. Retrieval tasks aim to find relevant documents, so the related documents may not be strictly semantically similar to the query.

For instance, when querying "How about chatgpt?", the related documents should contain information pertaining to "chatgpt", such as "chatgpt is amazing..." or "chatgpt is bad...".

Conversely, non-retrieval tasks, such as semantic textual similarity, require sentences that are semantically similar. For example, a sentence semantically similar to "How about chatgpt?" could be "What is your opinion about chatgpt?".

To distinguish between these two types of tasks, we use different prompts. For retrieval tasks, we use the prompt "Represent this sentence for searching relevant passages: {text}" (Prompts.C in angle_emb). For non-retrieval tasks, we set the prompt to empty, i.e., just input your text without specifying a prompt.

So, if your scenario is retrieval-related, it is highly recommended to set the prompt with angle.set_prompt(prompt=Prompts.C). If not, leave the prompt empty or use angle.set_prompt(prompt=None).
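For example, a minimal sketch of both modes (assuming the released `WhereIsAI/UAE-Large-V1` checkpoint; exact `set_prompt`/`encode` signatures may differ slightly between `angle_emb` versions):

```python
from angle_emb import AnglE, Prompts

# Load the UAE checkpoint (drop .cuda() on a CPU-only machine)
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Retrieval: prepend Prompts.C to the query; the passages themselves
# are encoded without a prompt
angle.set_prompt(prompt=Prompts.C)
query_vec = angle.encode({'text': 'How about chatgpt?'}, to_numpy=True)

# Non-retrieval (e.g. STS): no prompt, just encode the raw sentences
angle.set_prompt(prompt=None)
doc_vecs = angle.encode([
    'How about chatgpt?',
    'What is your opinion about chatgpt?',
], to_numpy=True)
```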

Thank you for replying, that makes it clearer.
I would like to experiment with addition and subtraction of sentence embeddings, something like KING - MAN + WOMAN = QUEEN, but for combinations of ideas in sentences. The goal would be to find sentences similar in meaning to the result of the arithmetic.
Would this be a non-retrieval task because semantic similarity is important?

commented

@legolego That is a very interesting idea! You can try the non-retrieval embedding. If the performance is less than expected, you can fine-tune the model on arithmetic datasets; we have provided a friendly interface for fine-tuning. Because our pretraining data does not include arithmetic datasets, we cannot guarantee good performance on arithmetic similarity.
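A minimal sketch of that experiment with non-retrieval embeddings might look like the following (the corpus, the paraphrased king/man/woman sentences, and the `cos_sim` helper are all hypothetical illustrations, not part of AnglE):

```python
import numpy as np
from angle_emb import AnglE

angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Sentence-level analogue of KING - MAN + WOMAN (hypothetical sentences)
a, b, c = angle.encode([
    'A king rules the country.',
    'A man rules the country.',
    'A woman rules the country.',
], to_numpy=True)
target = a - b + c

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank a small hypothetical corpus against the arithmetic result; ideally
# something like 'A queen rules the country.' should rank highest
corpus = [
    'A queen rules the country.',
    'A man walks his dog.',
    'The weather is nice today.',
]
ranked = sorted(zip(corpus, angle.encode(corpus, to_numpy=True)),
                key=lambda pair: -cos_sim(target, pair[1]))
for sent, vec in ranked:
    print(f'{cos_sim(target, vec):.4f}  {sent}')
```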

Thank you for confirming! Do you have an example of a semantic arithmetic dataset like that? I've never heard of one. Searching Google gave me results about arithmetic with numbers, but not arithmetic with the ideas in sentences.

commented

Sorry, I do not know much about arithmetic semantics; I mainly focus on textual similarity.

Thank you for your answers!

Hi! I know this issue is closed, but related to non-retrieval vs retrieval: how was this handled during training? When providing positive and negative pairs, did you add Prompts.C at some point? Thank you beforehand.