allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.

Home Page: https://allenai.github.io/dolma/



Possible bug in `local_shuffle`?

hwijeen opened this issue · comments

Hi, thanks for the great library! It's good to see such a library written in Python, and it is a great resource for learning about the data side of LLM pretraining.

I was looking at the part where data is shuffled and noticed that local_shuffle does not work the way I expected. I expected each process to gather a local_shuffle number of tokenized documents (each line in a json.gz file) from the source paths (json.gz files), shuffle those, and then write them out via mmap.

But it seems that the code does the shuffling and writing for each individual document, rather than for a batch of local_shuffle documents. I think this makes local shuffling a no-op, and it also results in more frequent writes, which may have performance implications. Something like de-indenting lines 121 and below might be a fix?

Is this a bug, or am I misunderstanding something? Thank you!
I am tagging @soldni who wrote this file :)
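To make the difference concrete, here is a simplified, hypothetical sketch of the two loop structures being discussed (the function names and the `docs` list are illustrative, not dolma's actual code):

```python
import random


def shuffle_per_document(docs):
    """Buggy shape: shuffle and flush inside the per-document loop.

    The accumulator never holds more than one document, so
    random.shuffle is a no-op and the input order is preserved.
    """
    written = []
    accumulator = []
    for doc in docs:
        accumulator.append(doc)
        random.shuffle(accumulator)  # shuffling a 1-element list does nothing
        written.extend(accumulator)
        accumulator.clear()
    return written


def shuffle_per_buffer(docs, local_shuffle):
    """Fixed shape: only shuffle and flush once the buffer is full."""
    written = []
    accumulator = []
    for doc in docs:
        accumulator.append(doc)
        if len(accumulator) >= local_shuffle:
            random.shuffle(accumulator)
            written.extend(accumulator)
            accumulator.clear()
    # flush (and shuffle) any remaining partial buffer
    random.shuffle(accumulator)
    written.extend(accumulator)
    return written
```

With the first shape, the output order is always identical to the input order no matter what local_shuffle is set to; with the second, documents are mixed within each local_shuffle-sized window.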

To verify that this is really the case, I added the following print line and got this output:

# dolma/python/dolma/tokenizer/executor.py:121
# shuffle sequence order to ensure that the sequences are well mixed
print(f"Shuffling {len(accumulator)} sequences")
random.shuffle(accumulator)

Output:

Shuffling 1 sequences
Shuffling 1 sequences
Shuffling 1 sequences
...

The command I used was: dolma tokens --documents ./downloaded --destination ./downloaded --tokenizer.name_or_path allenai/OLMo-1B --tokenizer.bos_token_id 1 --tokenizer.eos_token_id 2 --processes 2

Great catch! Fixed in #140.