princeton-nlp / LM-BFF

[ACL 2021] LM-BFF: Better Few-shot Fine-tuning of Language Models https://arxiv.org/abs/2012.15723

Weird SST2 dataset size

sh0416 opened this issue

I've just started reproducing this work, beginning with the SST-2 dataset.

I found that the SST-2 size reported in this paper is 6.9k, but the size reported in the GLUE paper is 69k.

I double-checked this: the SST-2 training set distributed in this repository has 6.9k examples, while the one distributed through Hugging Face datasets has 69k.

It looks like some kind of filtering has been applied to this data.

Could you clarify what it is?

Thank you

I want to reproduce the SST-2 Fine-tuning (full) result in Table 3.
The caption says the dataset sizes are given in Table B.1, and Table B.1 lists the SST-2 training set size as 6,920.

When I downloaded the SST-2 training data from Hugging Face datasets, it had around 69,000 examples, roughly ten times larger than the dataset distributed in this repository.
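
For reference, this is roughly how I checked the split sizes (a minimal sketch, assuming the Hugging Face `datasets` library and the `glue`/`sst2` dataset identifier):

```python
# Sketch: print the split sizes of the SST-2 distribution on Hugging Face.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
for split, ds in sst2.items():
    print(split, len(ds))
# The train split here is about ten times larger than the 6,920-sentence
# training set distributed in this repository.
```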

Also, I am curious whether Fine-tuning (full) is the traditional approach without templates and label words.

Hi, thanks for the interest. The SST-2 training set commonly distributed by GLUE and others is split into densely labeled phrases, whereas the dev and test sets are full sentences (note that their sizes match ours). We use the original (unsplit) sentences, hence the order-of-magnitude size difference. This is the same as in https://github.com/openai/generating-reviews-discovering-sentiment, for example.
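
As a quick sanity check, a sketch along these lines (assuming the Hugging Face `datasets` library and the `glue`/`sst2` identifier) shows the length gap between the phrase-level train split and the sentence-level validation split:

```python
# Sketch: compare average example length between the GLUE SST-2 train split
# (densely labeled phrases) and the validation split (full sentences).
# The gap illustrates why the phrase-level split is an order of magnitude
# larger than the sentence-level training set used here.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")

def avg_tokens(split):
    texts = sst2[split]["sentence"]
    return sum(len(t.split()) for t in texts) / len(texts)

print("train (phrases):        avg", round(avg_tokens("train"), 1), "tokens")
print("validation (sentences): avg", round(avg_tokens("validation"), 1), "tokens")
```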

The fine-tuning (full) is the traditional approach.

Great, thank you for clarifying the details.