SeanLee97 / AnglE

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard

Home Page: https://arxiv.org/abs/2309.12871


UAE - explanation of Non-Retrieval vs Retrieval

legolego opened this issue · comments

Hello, could you please add a little explanation of the difference between Non-Retrieval and Retrieval tasks for UAE? Why would one be used instead of the other? I'm looking to create sentence embeddings to store in a database. Thank you!

commented

Hi @legolego , thanks for following our work.

In UAE, we use different approaches for retrieval and non-retrieval tasks, each serving a different purpose. Retrieval tasks aim to find relevant documents, so the related documents may not be strictly semantically similar to the query.

For instance, when querying "How about chatgpt?", the related documents should contain information pertaining to "chatgpt", such as "chatgpt is amazing..." or "chatgpt is bad...".

Conversely, non-retrieval tasks, such as semantic textual similarity, require sentences that are semantically similar. For example, a sentence semantically similar to "How about chatgpt?" could be "What is your opinion about chatgpt?".

To distinguish between these two types of tasks, we use different prompts. For retrieval tasks, we use the prompt "Represent this sentence for searching relevant passages: {text}" (Prompts.C in angle_emb). For non-retrieval tasks, we set the prompt to empty, i.e., just input your text without specifying a prompt.

So, if your scenario is retrieval-related, it is highly recommended to set the prompt with angle.set_prompt(prompt=Prompts.C). If not, leave the prompt empty or use angle.set_prompt(prompt=None).
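For example, a minimal sketch of both modes (assuming the released `WhereIsAI/UAE-Large-V1` checkpoint; exact `set_prompt`/`encode` signatures may differ slightly between `angle_emb` versions):

```python
from angle_emb import AnglE, Prompts

# Load the UAE checkpoint (drop .cuda() on a CPU-only machine)
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Retrieval: prepend Prompts.C to the query; the passages themselves
# are encoded without a prompt
angle.set_prompt(prompt=Prompts.C)
query_vec = angle.encode({'text': 'How about chatgpt?'}, to_numpy=True)

# Non-retrieval (e.g. STS): no prompt, just encode the raw sentences
angle.set_prompt(prompt=None)
doc_vecs = angle.encode([
    'How about chatgpt?',
    'What is your opinion about chatgpt?',
], to_numpy=True)
```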

Thank you for replying, that makes it clearer.
I would like to experiment with addition and subtraction of sentence embeddings, something like KING - MAN + WOMAN = QUEEN, but for combinations of ideas in sentences. The goal would be to find sentences similar in meaning to the result of the arithmetic.
Would this be a non-retrieval task because semantic similarity is important?

commented

@legolego That is a very interesting idea! You can try the non-retrieval embedding. If the performance is less than expected, you can fine-tune the model on arithmetic datasets; we have provided a friendly interface for fine-tuning. Because our pretraining data does not include arithmetic datasets, we cannot guarantee good performance on arithmetic similarity.
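A minimal sketch of that experiment with non-retrieval embeddings might look like the following (the corpus, the paraphrased king/man/woman sentences, and the `cos_sim` helper are all hypothetical illustrations, not part of AnglE):

```python
import numpy as np
from angle_emb import AnglE

angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Sentence-level analogue of KING - MAN + WOMAN (hypothetical sentences)
a, b, c = angle.encode([
    'A king rules the country.',
    'A man rules the country.',
    'A woman rules the country.',
], to_numpy=True)
target = a - b + c

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank a small hypothetical corpus against the arithmetic result; ideally
# something like 'A queen rules the country.' should rank highest
corpus = [
    'A queen rules the country.',
    'A man walks his dog.',
    'The weather is nice today.',
]
ranked = sorted(zip(corpus, angle.encode(corpus, to_numpy=True)),
                key=lambda pair: -cos_sim(target, pair[1]))
for sent, vec in ranked:
    print(f'{cos_sim(target, vec):.4f}  {sent}')
```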

Thank you for confirming! Do you have an example of a semantic arithmetic dataset like that? I've never heard of one. Searching Google gave me results about arithmetic with numbers, but not arithmetic with the ideas in sentences.

commented

Sorry, I do not know much about arithmetic semantics; I mainly focus on textual similarity.

Thank you for your answers!

Hi! I know this issue is closed, but related to non-retrieval vs retrieval: how was this handled during training? When providing positive and negative pairs, did you add Prompts.C at some point? Thank you beforehand.