Training data for UAE-Large-V1

Question

memray opened this issue 6 months ago

Hi,

Awesome work! Can you share the details about what data was used for adapting WhereIsAI/UAE-Large-V1 from BGE-large? Can you share the data as well?

Thanks!

Sean · Answer 1 · Thu Dec 14 2023 10:06:26 GMT+0800 (China Standard Time)

Hi @memray, many thanks for following our work!

We're sorry for the inconvenience: we have not published our training details yet.
Below is the training data that was used for fine-tuning UAE.

- high_q_sts: a high-quality, challenging STS (semantic textual similarity) dataset collected through human annotation.
- retrieval: we converted multiple QA datasets into retrieval-task data. In addition, we collected real retrieval data (positive samples) from search engines and used some techniques to generate negative samples.
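
For illustration only: the answer does not say which QA datasets were converted or which techniques generated the negatives, so the following is a minimal sketch of one common approach, not the UAE authors' method. It assumes each QA example is a (question, answer) pair and draws negatives at random from the answers to other questions. The function name `qa_to_retrieval_triples` and the output field names are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def qa_to_retrieval_triples(
    qa_pairs: List[Tuple[str, str]],
    num_negatives: int = 1,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Turn (question, answer) pairs into query/positive/negatives records.

    Assumption: negatives are sampled at random from answers that belong
    to other questions. This is a generic baseline, not the technique
    used to train UAE, which the thread does not describe.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    all_answers = [answer for _, answer in qa_pairs]
    triples = []
    for question, answer in qa_pairs:
        # Exclude any text identical to the positive from the candidate pool.
        candidates = [a for a in all_answers if a != answer]
        if not candidates:
            continue  # no negative available for this example
        k = min(num_negatives, len(candidates))
        triples.append(
            {
                "query": question,
                "positive": answer,
                "negatives": rng.sample(candidates, k),
            }
        )
    return triples


if __name__ == "__main__":
    pairs = [
        ("What is the capital of France?", "Paris is the capital of France."),
        ("Who wrote Hamlet?", "Hamlet was written by William Shakespeare."),
        ("What is H2O?", "H2O is the chemical formula for water."),
    ]
    for record in qa_to_retrieval_triples(pairs, num_negatives=2):
        print(record)
```

Random negatives are usually easy for a model to separate from the positive; harder negatives are often mined with a retriever instead, but whether anything like that was done here is not stated.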

We are now working on next-generation sentence embeddings. Once we release our new sentence embedding model, we will open-source the training details for UAE.