princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821

Number of rows in NLI dataset

xlpczv opened this issue


Hello, I have a question about the NLI dataset.
The paper states that 314k samples from the NLI dataset are used for supervised SimCSE training.
However, the dataset provided in your GitHub repository contains only 275,601 rows.
What accounts for the difference between the data you provided and the data described in the paper?

Additionally, could you provide the other supervised datasets you used in the experiments, for example QQP?
That would be very helpful for my research.

Thank you very much for the wonderful GitHub repository.

Hi,

The 314k figure refers to the NLI dataset without hard negatives. When hard negatives are used, some examples are filtered out because they have no corresponding hard negative. This filtered dataset is the one we used for our final and strongest model.
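For readers building their own data, the filtering described above can be sketched roughly as follows. This is only an illustration, not the preprocessing script used for the paper: the function name `build_triplets`, the `(premise, hypothesis, label)` record layout, and the choice of the first contradiction as the hard negative are all assumptions.

```python
from collections import defaultdict


def build_triplets(records):
    """Turn (premise, hypothesis, label) records into (anchor, positive, hard_negative) triplets.

    Hypothetical helper: entailment hypotheses act as positives and contradiction
    hypotheses as hard negatives. Premises with no contradiction are dropped.
    """
    entailments = defaultdict(list)
    contradictions = defaultdict(list)
    for premise, hypothesis, label in records:
        if label == "entailment":
            entailments[premise].append(hypothesis)
        elif label == "contradiction":
            contradictions[premise].append(hypothesis)

    triplets = []
    for premise, positives in entailments.items():
        negatives = contradictions.get(premise)
        if not negatives:
            continue  # no hard negative available, so this example is filtered out
        for positive in positives:
            # Assumption: use the first contradiction as the hard negative.
            triplets.append((premise, positive, negatives[0]))
    return triplets


# Toy records, invented for illustration only.
records = [
    ("A man plays guitar.", "A person makes music.", "entailment"),
    ("A man plays guitar.", "Nobody is playing.", "contradiction"),
    ("A dog runs.", "An animal moves.", "entailment"),
    ("A dog runs.", "A dog is outside.", "neutral"),
]

num_pairs = sum(1 for _, _, label in records if label == "entailment")
triplets = build_triplets(records)
print(num_pairs, len(triplets))
```

In this toy input there are two entailment pairs but only one triplet, because the second premise has no contradiction. The same effect would explain a triplet file having fewer rows than the pair-only dataset.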

Sorry, we no longer have copies of the other datasets we used. However, we did not apply any special processing to them, so you can download the originals from their respective sources.

commented

Hi, thank you for the answer. I will keep this advice in mind when generating my own data. Thank you.