mlfoundations / open_lm

A repository for research on medium-sized language models.

Dataloading Epoch Update Bug

sedrick-keh-tri opened this issue

In get_wds_dataset, the code loops over all datasets and creates a shared_epoch for each one, but the function returns a DataInfo object that holds only one of those shared_epoch objects.

Thus, when we call data["train"].set_epoch(epoch) in train_one_epoch, it only updates the epoch number for one of the datasets. All other datasets are stuck at epoch=0 and end up sampling the same data over and over.
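A minimal sketch of the problematic shape, for illustration only. The class and function names below mirror open_clip-style data code and are simplified assumptions, not the exact open_lm source:

```python
from dataclasses import dataclass
from multiprocessing import Value
from typing import Optional


class SharedEpoch:
    """Epoch counter shared with dataloader worker processes."""

    def __init__(self, epoch: int = 0):
        self.shared_epoch = Value("i", epoch)

    def set_value(self, epoch: int) -> None:
        self.shared_epoch.value = epoch

    def get_value(self) -> int:
        return self.shared_epoch.value


@dataclass
class DataInfo:
    dataloader: object
    shared_epoch: Optional[SharedEpoch] = None

    def set_epoch(self, epoch: int) -> None:
        # Only the single SharedEpoch stored here gets updated.
        if self.shared_epoch is not None:
            self.shared_epoch.set_value(epoch)


def get_wds_dataset(shard_lists):
    # A SharedEpoch is created for every dataset in the mixture ...
    shared_epochs = [SharedEpoch() for _ in shard_lists]
    dataloader = None  # placeholder for the combined webdataset loader
    # ... but only one of them ends up in the returned DataInfo, so
    # data["train"].set_epoch(epoch) never reaches the others and they
    # keep sampling as if epoch == 0.
    return DataInfo(dataloader=dataloader, shared_epoch=shared_epochs[-1])
```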

Thanks for catching this. Are you sampling with or without replacement? I think this only affects the former, since in the latter case the dataloader is constructed from scratch each epoch. Regardless, I'll push a fix.
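One possible shape for such a fix, purely as a sketch (the field and method names are assumptions, not the actual patch, and SharedEpoch is the helper from the sketch above): store every per-dataset SharedEpoch in the returned DataInfo and propagate set_epoch to all of them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataInfo:
    dataloader: object
    # Keep every per-dataset SharedEpoch instead of just one.
    shared_epochs: List["SharedEpoch"] = field(default_factory=list)

    def set_epoch(self, epoch: int) -> None:
        # Propagate the new epoch to every dataset in the mixture so that
        # with-replacement sampling reshuffles each of them.
        for shared_epoch in self.shared_epochs:
            shared_epoch.set_value(epoch)


def get_wds_dataset(shard_lists):
    # SharedEpoch as defined in the previous sketch.
    shared_epochs = [SharedEpoch() for _ in shard_lists]
    dataloader = None  # placeholder for the combined webdataset loader
    return DataInfo(dataloader=dataloader, shared_epochs=shared_epochs)
```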

Yup, I think we were sampling with replacement. Thanks for pushing the fix!