allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.

Home Page: https://allenai.github.io/dolma/



Possible bug in `local_shuffle`?

hwijeen opened this issue · comments

Hi, thanks for the great library! It's good to see such a library written in Python, and it is a great resource for learning about the data side of LLM pretraining.

I was looking at the part where data is shuffled and noticed that local_shuffle does not work the way I expected. I expected each process to gather a local_shuffle number of tokenized documents (each line in a json.gz file) from the source paths (json.gz files), shuffle those, and then write them out via mmap.

But it seems that the code does the shuffling and writing for each individual document, rather than for a batch of local_shuffle documents. I think this makes local shuffling a no-op, and it also results in more frequent writes, which may have performance implications. Something like de-indenting lines 121 and below might be a fix?

Is this a bug, or am I misunderstanding something? Thank you!
I am tagging @soldni who wrote this file :)
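To make the difference concrete, here is a simplified, hypothetical sketch of the two loop structures being discussed (the function names and the `docs` list are illustrative, not dolma's actual code):

```python
import random


def shuffle_per_document(docs):
    """Buggy shape: shuffle and flush inside the per-document loop.

    The accumulator never holds more than one document, so
    random.shuffle is a no-op and the input order is preserved.
    """
    written = []
    accumulator = []
    for doc in docs:
        accumulator.append(doc)
        random.shuffle(accumulator)  # shuffling a 1-element list does nothing
        written.extend(accumulator)
        accumulator.clear()
    return written


def shuffle_per_buffer(docs, local_shuffle):
    """Fixed shape: only shuffle and flush once the buffer is full."""
    written = []
    accumulator = []
    for doc in docs:
        accumulator.append(doc)
        if len(accumulator) >= local_shuffle:
            random.shuffle(accumulator)
            written.extend(accumulator)
            accumulator.clear()
    # flush (and shuffle) any remaining partial buffer
    random.shuffle(accumulator)
    written.extend(accumulator)
    return written
```

With the first shape, the output order is always identical to the input order no matter what local_shuffle is set to; with the second, documents are mixed within each local_shuffle-sized window.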

To verify that this is really the case, I added the following print line and got this output:

# dolma/python/dolma/tokenizer/executor.py:121
# shuffle sequence order to ensure that the sequences are well mixed
print(f"Shuffling {len(accumulator)} sequences")
random.shuffle(accumulator)

Output:

Shuffling 1 sequences
Shuffling 1 sequences
Shuffling 1 sequences
...

The command I used was: dolma tokens --documents ./downloaded --destination ./downloaded --tokenizer.name_or_path allenai/OLMo-1B --tokenizer.bos_token_id 1 --tokenizer.eos_token_id 2 --processes 2

Great catch! Fixed in #140.