Duplicate ids in Dolma v1.7
Vedaad Shakib commented
Hi,
While downloading and processing Dolma v1.7, I noticed that many samples in the dataset share the same `id` field. For example, the Project Gutenberg source contains 175 duplicates that can be found just by looking at the `id` column. One example of a duplicate `id` is `8fddd3535f86e159339e1ff9be64fdda` in the RefinedWeb split. This was surprising given that you did significant deduplication in Dolma v1.7. Is this a bug in the dataset?
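
A minimal sketch of how such duplicates can be counted, in case it helps with reproducing this. It assumes the shards are gzip-compressed JSON Lines files in which each record has an `id` key; the function name `count_duplicate_ids` is just illustrative and not part of any Dolma tooling.

```python
import gzip
import json
from collections import Counter
from pathlib import Path
from typing import Iterable


def count_duplicate_ids(paths: Iterable[Path]) -> Counter:
    """Return a Counter of ids that occur more than once across the given shards."""
    counts: Counter = Counter()
    for path in paths:
        # Assumption: each shard is gzip-compressed JSON Lines with an "id" key.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                counts[json.loads(line)["id"]] += 1
    # Keep only the ids that were seen at least twice.
    return Counter({doc_id: n for doc_id, n in counts.items() if n > 1})
```

Calling it on the shards of one source, e.g. `count_duplicate_ids(sorted(Path("some/local/dir").glob("*.json.gz")))` (the directory is a placeholder), gives the duplicated ids and how often each appears; `len()` of the result is the number of distinct duplicated ids.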