Number of rows in NLI dataset

Question

xlpczv opened this issue a year ago

Hello, I have a question about the NLI dataset.
The paper states that 314k samples are used for supervised SimCSE training on the NLI dataset.
However, the dataset provided in your GitHub repository has only 275,601 rows.
What accounts for the difference between the data you provided and the figure reported in the paper?

Additionally, could you provide the other supervised datasets you used in the experiments, such as QQP?
That would be very helpful for my research.

Thank you very much for the wonderful GitHub repository.

Tianyu Gao · Answer 1 · Sat Apr 15 2023 01:18:58 GMT+0800 (China Standard Time)

Hi,

The 314k figure refers to the NLI dataset without hard negatives. When using hard negatives, some of the examples are filtered out because they don't have a corresponding hard negative. The released dataset, with hard negatives, is the one that we used for our final and strongest model.
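For readers who want to see why the row count drops, here is a minimal sketch of this kind of filtering. It is not the authors' preprocessing script. It assumes each NLI row is a (premise, hypothesis, label) tuple, that entailment hypotheses serve as positives and contradiction hypotheses as hard negatives, and that the first contradiction found is used; the function name `build_triplets` and the toy rows are hypothetical.

```python
from collections import defaultdict


def build_triplets(rows):
    """Group NLI rows by premise and keep only premises that have both
    an entailment hypothesis (positive) and a contradiction hypothesis
    (hard negative). This is an assumed scheme, not the authors' exact code."""
    entail = defaultdict(list)
    contra = defaultdict(list)
    for premise, hypothesis, label in rows:
        if label == "entailment":
            entail[premise].append(hypothesis)
        elif label == "contradiction":
            contra[premise].append(hypothesis)

    triplets = []
    for premise, positives in entail.items():
        negatives = contra.get(premise)
        if not negatives:
            # No hard negative for this premise, so its pairs are dropped.
            continue
        for positive in positives:
            # Assumption: pair every positive with the first hard negative.
            triplets.append((premise, positive, negatives[0]))
    return triplets


# Hypothetical toy data, not taken from the real dataset.
toy_rows = [
    ("A man plays guitar.", "A person makes music.", "entailment"),
    ("A man plays guitar.", "The man is asleep.", "contradiction"),
    ("A dog runs.", "An animal moves.", "entailment"),
    ("A dog runs.", "A dog is outside.", "neutral"),
    ("Two kids laugh.", "Children are happy.", "entailment"),
    ("Two kids laugh.", "The kids are crying.", "contradiction"),
]

pairs = [row for row in toy_rows if row[2] == "entailment"]
triplets = build_triplets(toy_rows)
print(len(pairs), len(triplets))
```

In the toy data, the "A dog runs." premise has an entailment pair but no contradiction, so it counts toward the pair total but produces no triplet. The same effect would explain a gap between a pair count and a smaller triplet count.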

Sorry that we don't have a copy of the other datasets used anymore. However, we didn't do any special processing to those datasets and you can download the original ones from their corresponding sources.

xlpczv · Answer 2 · Tue Apr 18 2023 12:35:44 GMT+0800 (China Standard Time)

Hi, thank you for the answer. I will keep this advice in mind when generating my own data. Thank you.