abhimishra91 / transformers-tutorials

GitHub repo with tutorials to fine-tune transformers for different NLP tasks

dataset split is invalid (because of reset_index)

jsrozner opened this issue

In the main() function, you have the following:

train_dataset=df.sample(frac=train_size, random_state = config.SEED).reset_index(drop=True)

val_dataset=df.drop(train_dataset.index).reset_index(drop=True)

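But because you reset the index on train_dataset before calling drop, train_dataset.index is just 0..len(train_dataset)-1 and no longer matches the sampled rows' labels in df. The drop therefore removes the first len(train_dataset) rows of df rather than the sampled rows, so your train and eval datasets overlap. Reset the index only after doing the drop: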

train_dataset = df.sample(frac=train_size, random_state=config.SEED)  # <-- removed reset_index(drop=True)
val_dataset = df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)
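
A minimal sketch for checking the overlap on a toy DataFrame. The `id` column, the `split_buggy` and `split_fixed` helpers, and the `SEED` and `TRAIN_SIZE` constants are made up for illustration; they stand in for `config.SEED` and `train_size` and are not taken from the notebook.

```python
import pandas as pd

SEED = 42          # hypothetical stand-in for config.SEED
TRAIN_SIZE = 0.8   # hypothetical stand-in for train_size

# Toy frame with an explicit "id" column so rows stay traceable
# after the index has been reset.
df = pd.DataFrame({"id": range(100), "text": [f"row {i}" for i in range(100)]})


def split_buggy(frame):
    """Split as in the original code: reset the index before dropping."""
    train = frame.sample(frac=TRAIN_SIZE, random_state=SEED).reset_index(drop=True)
    # train.index is now 0..len(train)-1, so this drops the first
    # len(train) rows of frame, not the rows that were sampled.
    val = frame.drop(train.index).reset_index(drop=True)
    return train, val


def split_fixed(frame):
    """Split as proposed above: drop first, reset the index afterwards."""
    train = frame.sample(frac=TRAIN_SIZE, random_state=SEED)
    val = frame.drop(train.index).reset_index(drop=True)
    return train.reset_index(drop=True), val


buggy_train, buggy_val = split_buggy(df)
fixed_train, fixed_val = split_fixed(df)

# Rows that appear in both train and validation for each variant.
buggy_overlap = set(buggy_train["id"]) & set(buggy_val["id"])
fixed_overlap = set(fixed_train["id"]) & set(fixed_val["id"])

print("buggy overlap size:", len(buggy_overlap))
print("fixed overlap size:", len(fixed_overlap))
```

With the buggy version the validation set is always the last 20 rows of `df`, whatever was sampled, so it shares rows with the training set. With the fixed version the two sets are disjoint and together cover every row.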

@jsrozner, thanks for the feedback. I have updated the notebook with the corrected split.