Problem about the GIDS dataset

Question

ShomyLiu opened this issue 5 years ago · comments

Hi,

I checked the GIDS dataset and found that the max_len of all sentences is 100. However, some values in SubPos and ObjPos are larger than 100.

So it seems that the max_len of 100 is the length after preprocessing rather than the real maximum length of the sentences. Is that right?

Thanks

Shikhar Vashishth · Answer 1 · Mon Jan 28 2019 12:19:46 GMT+0800 (China Standard Time)

Hi @ShomyLiu,
Yes, that is true. Based on the distribution of sentence lengths in the GIDS dataset, we decided to fix max_len to 100. Using the real maximum length would require a lot of padding, and the batches would then not fit in GPU memory. The majority of sentences have fewer than 100 words, so this decision doesn't affect performance much.
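
For anyone who wants to check this on their own copy of the data, here is a rough sketch of the idea, not the repository's actual preprocessing code. It assumes sentences are plain token lists and that position values are integers; the thread does not say whether SubPos and ObjPos hold absolute or relative offsets. The helper names `pad_or_truncate` and `count_out_of_range` and the `<pad>` token are made up for illustration.

```python
MAX_LEN = 100  # the cap mentioned in the thread


def pad_or_truncate(tokens, max_len=MAX_LEN, pad="<pad>"):
    """Cut a token list to max_len, or right-pad it up to max_len."""
    clipped = tokens[:max_len]
    return clipped + [pad] * (max_len - len(clipped))


def count_out_of_range(positions, max_len=MAX_LEN):
    """Count position values that fall at or beyond max_len.

    Such values can appear when positions were computed on the
    original sentence, before it was cut to max_len.
    """
    return sum(1 for p in positions if p >= max_len)


# Hypothetical sentences: one shorter and one longer than the cap.
short = pad_or_truncate(["w"] * 30)
long_ = pad_or_truncate(["w"] * 250)
print(len(short), len(long_))

# Hypothetical position values, two of them beyond the cap.
print(count_out_of_range([5, 99, 100, 180]))
```

With a fixed cap, every sentence in a batch takes exactly max_len slots, so memory grows with batch size times 100 rather than with the length of the longest sentence in the dataset.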