allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.

Home Page: https://allenai.github.io/dolma/

Issue with ring tokenizer

davidbrandfonbrener opened this issue

This line seems to throw a division-by-zero error when ring_size < len(source_paths).

Basically it seems that len(tokenizer_ring) gets decremented here each time a tokenizer is exhausted. When that happens the inner loop exits via break, but the outer loop keeps going and divides by 0 once len(tokenizer_ring) has reached 0.

I'm not exactly sure what the right fix is, and it seems things work fine as long as ring_size * processes >= num_files. Any clarity here would be appreciated, thanks!
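
To make the failure mode concrete, here is a minimal sketch of the pattern as I understand it. This is not the actual dolma code: `tokenize_file`, `drain_buggy`, and `drain_refill` are hypothetical stand-ins, and I am assuming the division by 0 comes from a modulo over the ring length. The second function shows one possible guard (refill the ring from the remaining paths and stop when the ring is empty), which may or may not be the right fix for the real implementation.

```python
from typing import Iterator, List


def tokenize_file(path: str) -> Iterator[str]:
    # Hypothetical stand-in for a per-file tokenizer: two fake items per file.
    for k in range(2):
        yield f"{path}:{k}"


def drain_buggy(source_paths: List[str], ring_size: int) -> List[str]:
    # Assumed shape of the problem: the ring is filled once and never refilled.
    paths = list(source_paths)
    out: List[str] = []
    ring = [tokenize_file(paths.pop()) for _ in range(min(ring_size, len(paths)))]
    while paths or ring:
        for i in range(1000):
            j = i % len(ring)  # ZeroDivisionError once the ring is empty
            try:
                out.append(next(ring[j]))
            except StopIteration:
                ring.pop(j)  # the ring shrinks here
                break  # leaves the inner loop only; the outer loop continues
    return out


def drain_refill(source_paths: List[str], ring_size: int) -> List[str]:
    # One possible guard: replace an exhausted tokenizer with the next path,
    # and let the outer loop depend on the ring alone.
    paths = list(source_paths)
    out: List[str] = []
    ring = [tokenize_file(paths.pop()) for _ in range(min(ring_size, len(paths)))]
    while ring:
        for i in range(1000):
            j = i % len(ring)  # the ring is non-empty whenever this runs
            try:
                out.append(next(ring[j]))
            except StopIteration:
                ring.pop(j)
                if paths:
                    ring.append(tokenize_file(paths.pop()))
                break
    return out
```

In this sketch, `drain_buggy(["a", "b", "c"], ring_size=1)` raises ZeroDivisionError because paths remain after the ring has emptied, while `drain_refill` with the same arguments processes all three files.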