shuiba0's repositories
GLUE-X
We collect 13 publicly available datasets as OOD test data and conduct evaluations on 8 classic NLP tasks over popularly used models. Our findings confirm that the OOD accuracy in NLP tasks needs to be paid more attention to since the significant performance decay compared to ID accuracy has been found in all settings.