Training data for UAE-Large-V1
memray opened this issue
Rui Meng commented
Hi,
Awesome work! Can you share the details about what data was used for adapting WhereIsAI/UAE-Large-V1 from BGE-large? Can you share the data as well?
Thanks!
Sean commented
Hi @memray, many thanks for following our work!
We're sorry for the inconvenience; we have not yet published our training details.
Below is the training data that was used for fine-tuning UAE.
- high_q_sts: a high-quality, challenging STS dataset collected through human annotation.
- retrieval: we converted multiple QA datasets into retrieval format. In addition, we collected real retrieval data (positive samples) from search engines and used some techniques to generate negative samples.
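For readers wondering what converting QA data into retrieval format can look like, here is a minimal sketch. It is an illustration only, not the procedure used for UAE: the thread does not say which negative-sampling techniques were applied, so this sketch assumes the simplest option, random negatives drawn from answers to other questions. The function name `qa_to_triplets`, the triplet keys, and the sample data are all hypothetical.

```python
import random
from typing import Dict, List, Tuple


def qa_to_triplets(
    qa_pairs: List[Tuple[str, str]], seed: int = 0
) -> List[Dict[str, str]]:
    """Turn (question, answer) pairs into (query, positive, negative) triplets.

    ASSUMPTION: negatives are sampled at random from the answers of other
    questions. The actual technique used for UAE is not described.
    """
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    answers = [answer for _, answer in qa_pairs]
    triplets = []
    for question, answer in qa_pairs:
        # Candidate negatives: every answer that differs from the positive.
        candidates = [a for a in answers if a != answer]
        if not candidates:
            continue  # no usable negative for this pair, skip it
        triplets.append(
            {
                "query": question,
                "positive": answer,
                "negative": rng.choice(candidates),
            }
        )
    return triplets


# Hypothetical toy data, for illustration only.
qa_pairs = [
    ("What is the capital of France?", "Paris is the capital of France."),
    ("Who wrote Hamlet?", "Hamlet was written by William Shakespeare."),
    ("What is the boiling point of water?", "Water boils at 100 degrees Celsius at sea level."),
]

triplets = qa_to_triplets(qa_pairs)
for triplet in triplets:
    print(triplet)
```

Random negatives are usually easy for a model to tell apart from the positive; harder negatives (for example, passages retrieved by a lexical ranker that do not answer the query) are a common alternative, but whether anything like that was used here is not stated.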
We are now working on next-generation sentence embeddings. After we release the new sentence embedding model, we will open-source our training details for UAE.