yvchen / JointSLU

ATIS data split (train,dev,test) issues

kpe opened this issue

I noticed some issues with the data split in the ATIS dataset (see the visualization of the label distributions here):

  • duplicated data samples: 397 (out of 5871)
  • no training data for 5 intent labels and 7 slot labels
  • up to 20% of the labels present in the training dataset are not present at all in the dev or test datasets
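
For anyone who wants to check these three points on their own copy of the data, here is a minimal sketch. It assumes each split has already been loaded as a list of `(utterance, slot_labels, intent)` tuples, with `slot_labels` stored as a tuple so samples are hashable; the function name `audit_splits` and the keys of the returned dict are made up for illustration and are not part of this repository.

```python
from collections import Counter


def audit_splits(train, dev, test):
    """Report duplicates and label coverage across three splits.

    Assumed sample layout (hypothetical, not the repo's format):
    (utterance: str, slot_labels: tuple of str, intent: str).
    """
    all_samples = train + dev + test
    counts = Counter(all_samples)
    # Every occurrence beyond the first one counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())

    def intents(split):
        return {intent for _, _, intent in split}

    def slots(split):
        return {label for _, slot_labels, _ in split for label in slot_labels}

    heldout = dev + test
    return {
        "total": len(all_samples),
        "duplicates": duplicates,
        # Labels that are evaluated on but never seen during training.
        "intents_without_train_data": intents(heldout) - intents(train),
        "slots_without_train_data": slots(heldout) - slots(train),
        # Training labels that a given evaluation split never exercises.
        "train_intents_missing_in_dev": intents(train) - intents(dev),
        "train_intents_missing_in_test": intents(train) - intents(test),
        "train_slots_missing_in_dev": slots(train) - slots(dev),
        "train_slots_missing_in_test": slots(train) - slots(test),
    }
```

This counts exact duplicates only (same utterance, slot labels, and intent); near-duplicates would need a looser comparison.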

I believe this could make the interpretation of model performance measures somewhat unreliable, so I tried to build an alternative, more balanced data split here.
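
To illustrate the general idea (this is not necessarily how the linked split was built), one simple approach is to drop exact duplicates and then split per intent, so that every intent with enough samples shows up in train, dev, and test. The sketch below assumes the same `(utterance, slot_labels, intent)` tuple layout; `resplit`, `dev_frac`, `test_frac`, and `seed` are hypothetical names, and the 10%/10% fractions are arbitrary defaults.

```python
import random
from collections import defaultdict


def resplit(samples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Deduplicate and split per intent (illustrative sketch only).

    Assumed sample layout: (utterance: str, slot_labels: tuple, intent: str).
    """
    # dict.fromkeys drops exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(samples))

    by_intent = defaultdict(list)
    for sample in unique:
        by_intent[sample[2]].append(sample)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for intent in sorted(by_intent):
        group = by_intent[intent]
        rng.shuffle(group)
        n = len(group)
        if n >= 3:
            # At least one sample per split, and always one left for train.
            n_dev = min(max(1, round(n * dev_frac)), n - 2)
            n_test = min(max(1, round(n * test_frac)), n - n_dev - 1)
        else:
            # Too few samples to spread around: keep them all in train.
            n_dev = n_test = 0
        dev.extend(group[:n_dev])
        test.extend(group[n_dev:n_dev + n_test])
        train.extend(group[n_dev + n_test:])
    return train, dev, test
```

Note that this only balances intents. Slot labels are not stratified here, so a rare slot label can still end up in a single split and would need separate handling.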