Question about token independent Replace operation

IstiaqAnsari opened this issue 2 years ago · comments

I don't think this question is an issue about the code, but I wasn't sure where to ask this.
Of the 5000 labels in the edit-operation prediction vocabulary, 3802 are seemingly random words that replace original words in a sentence containing a grammatical error. These Replace tokens look very random, as if they are just covering the English word vocabulary.
For example
$REPLACE_electric
$REPLACE_sister
$APPEND_car
$REPLACE_fantastic
$REPLACE_examination
$APPEND_city
$REPLACE_eaten
These words seem random and have no grammatical significance. Why are they in the output space? What is the reason behind this?
If they are there, shouldn't every word in English be in the output space? My guess is that the English vocabulary is too large, so only the most frequent English words were selected for this operation. Could that be the only reason?

Alex Skurzhanskyi · Answer 1 · Tue Dec 28 2021 23:37:07 GMT+0800 (China Standard Time)

Hello
These words were used to fix grammatical errors in the training data. Indeed, as we need to have a limited number of labels, not all words are present here – only those with the highest frequency in training data. That means that people often tend to make errors in such words (like spelling).
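
To illustrate the idea of keeping only the most frequent labels, here is a minimal sketch in Python. It is not the project's actual code: the function name `build_label_vocab`, the toy data, and the `$KEEP` label (standing in for "no change") are all assumptions made for illustration. It counts how often each edit label occurs in tagged training data and keeps the top `max_labels`.

```python
from collections import Counter


def build_label_vocab(tagged_sentences, max_labels):
    """Return the max_labels most frequent edit labels (hypothetical helper)."""
    # Count every label across all sentences.
    counts = Counter(label for sent in tagged_sentences for label in sent)
    # most_common sorts by descending frequency and truncates to max_labels.
    return [label for label, _ in counts.most_common(max_labels)]


# Toy data, made up for illustration: one list of per-token edit labels per sentence.
# "$KEEP" is an assumed placeholder label meaning "leave the token unchanged".
toy_data = [
    ["$KEEP", "$REPLACE_eaten", "$KEEP"],
    ["$KEEP", "$APPEND_city", "$KEEP", "$REPLACE_eaten"],
    ["$REPLACE_sister", "$KEEP", "$APPEND_city"],
    ["$KEEP", "$REPLACE_zebra"],
]

vocab = build_label_vocab(toy_data, max_labels=3)
print(vocab)
```

With a cap of three labels, the rare `$REPLACE_sister` and `$REPLACE_zebra` fall out of the vocabulary, while labels that fix frequently occurring errors stay in. The same effect at a larger scale would explain why the label set looks like a list of ordinary English words.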