dataset split is invalid (because of reset_index)

Question

jsrozner opened this issue 4 years ago · comments

In the main() function, you have the following:

train_dataset = df.sample(frac=train_size, random_state=config.SEED).reset_index(drop=True)
val_dataset = df.drop(train_dataset.index).reset_index(drop=True)

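But reset_index(drop=True) replaces the sampled rows' original labels with 0..n-1, so df.drop(train_dataset.index) drops by the new index: assuming df has a default integer index, it removes the first n rows of df rather than the rows that were actually sampled. As a result, your train and eval datasets overlap. Reset the index only after doing the drop: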

train_dataset = df.sample(frac=train_size, random_state=config.SEED)  # <-- removed reset_index(drop=True)
val_dataset = df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)
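
Here is a minimal sketch for checking this on a toy frame. The DataFrame, the row_id column, and the train_size and SEED values are made up for illustration and are not taken from the notebook; it also assumes df has a default integer index.

```python
import pandas as pd

# Toy data: row_id, train_size and SEED are illustrative assumptions,
# not values from the notebook.
df = pd.DataFrame({"row_id": range(100)})
train_size = 0.8
SEED = 42

# Original split: the index is reset before the drop, so drop() removes
# labels 0..79 from df instead of the rows that were sampled.
bad_train = df.sample(frac=train_size, random_state=SEED).reset_index(drop=True)
bad_val = df.drop(bad_train.index).reset_index(drop=True)

# Corrected split: drop by the sampled rows' original labels, then reset.
train = df.sample(frac=train_size, random_state=SEED)
val = df.drop(train.index).reset_index(drop=True)
train = train.reset_index(drop=True)

# Rows that appear in both train and validation for each variant.
bad_overlap = set(bad_train["row_id"]) & set(bad_val["row_id"])
good_overlap = set(train["row_id"]) & set(val["row_id"])

print("overlap with early reset_index:", len(bad_overlap))
print("overlap with reset_index after drop:", len(good_overlap))
```

With the early reset, the validation set is always the last 20 rows of df no matter which rows were sampled, so any of those rows that were also sampled end up in both sets. With the reset moved after the drop, the two sets are disjoint and together cover all of df.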

Abhishek Kumar Mishra · Answer 1 · Wed Aug 12 2020 10:03:40 GMT+0800 (China Standard Time)

@jsrozner, thanks for the feedback. I have updated the notebook with the corrected split.