allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.

Home Page: https://allenai.github.io/dolma/

Can I use the dolma toolkit to process my own datasets?

Tendo33 opened this issue

I collected some data myself with a crawler, and I was wondering whether I can use the dolma toolkit to remove duplicates.

Yes! You can use our dolma dedupe command. Please let us know if you have any questions!
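Before running the dedupe command on crawled data, the data probably needs to be in the layout the toolkit reads. The sketch below is not from this thread. It assumes, based on my reading of the dolma documentation, that documents are stored as gzip-compressed JSON lines with `id`, `text`, and `source` fields under a `documents/` directory. The helper names, the `my-crawler` source label, and the `part-0000.jsonl.gz` file name are hypothetical, so please check the data format page before relying on it.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_dolma_documents(records, out_dir, source="my-crawler"):
    """Write crawled records as gzip-compressed JSON lines, one document per line.

    Assumption: each document needs "id", "text", and "source" fields and
    lives under <out_dir>/documents/. Verify this against the dolma docs.
    """
    docs_dir = Path(out_dir) / "documents"
    docs_dir.mkdir(parents=True, exist_ok=True)
    out_path = docs_dir / "part-0000.jsonl.gz"  # hypothetical shard name

    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for i, record in enumerate(records):
            doc = {
                # Use the URL as the id when the crawler kept it, else the index.
                "id": str(record.get("url", i)),
                "text": record["text"],
                "source": source,
            }
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return out_path


def read_documents(path):
    """Read the documents back, mainly to sanity-check the output."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]


if __name__ == "__main__":
    crawled = [
        {"url": "https://example.com/a", "text": "Hello world."},
        {"url": "https://example.com/b", "text": "Hello world."},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = write_dolma_documents(crawled, tmp)
        print(path.name, len(read_documents(path)))
```

Once the files are in place, `dolma dedupe` can be pointed at the `documents/` directory. Its options are not covered in this thread, so refer to the toolkit documentation for them.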