dataset split is invalid (because of reset_index)
jsrozner opened this issue · comments
In the main() function, you have the following:
train_dataset=df.sample(frac=train_size, random_state = config.SEED).reset_index(drop=True)
val_dataset=df.drop(train_dataset.index).reset_index(drop=True)
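But since you reset the index of train_dataset, its index becomes 0..n-1, so df.drop(train_dataset.index) drops the first n rows of df instead of the rows that were actually sampled. As a result, your train and eval datasets overlap. You can fix this by resetting the index only after doing the drop: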
train_dataset=df.sample(frac=train_size, random_state = config.SEED) # <-- removed reset_index(drop=True)
val_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)
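To check this on a small example, here is a minimal sketch that runs both versions on a toy DataFrame. The `text` column, the 10-row frame, and the `SEED` and `train_size` values are made up for illustration and stand in for the notebook's `df` and `config.SEED`.

```python
import pandas as pd

# Toy data; the column name and values are hypothetical.
df = pd.DataFrame({"text": [f"row {i}" for i in range(10)]})
train_size = 0.8
SEED = 42  # stand-in for config.SEED

# Original version: the index is reset before the drop, so train_bad.index
# is 0..7 and the drop removes the first 8 rows of df, not the sampled rows.
train_bad = df.sample(frac=train_size, random_state=SEED).reset_index(drop=True)
val_bad = df.drop(train_bad.index).reset_index(drop=True)

# Corrected version: drop by the original sampled index, then reset.
train_ok = df.sample(frac=train_size, random_state=SEED)
val_ok = df.drop(train_ok.index).reset_index(drop=True)
train_ok = train_ok.reset_index(drop=True)

# Rows that appear in both splits.
overlap_bad = set(train_bad["text"]) & set(val_bad["text"])
overlap_ok = set(train_ok["text"]) & set(val_ok["text"])

print("val rows (original): ", list(val_bad["text"]))
print("overlap (original):  ", sorted(overlap_bad))
print("overlap (corrected): ", sorted(overlap_ok))
```

In the original version the validation set is always the last rows of df regardless of which rows were sampled, so it overlaps with train unless the sample happened to leave out exactly those rows. In the corrected version the overlap is always empty and the two splits together cover every row of df.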
@jsrozner, thanks for the feedback. I have updated the notebook with the corrected split.