allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.

Home Page: https://allenai.github.io/dolma/

A Question about the meaning of dolma_v1.6_cc_en

aleien95 opened this issue

Hello, I found that dolma_v1.6_cc_en is split into cc_en_head, cc_en_middle and cc_en_tail. What do these names mean?

Hi @aleien95,

The names refer to the buckets into which the CCNet pipeline sorts documents extracted from Common Crawl. CCNet estimates how similar each document is to Wikipedia pages using a KenLM statistical language model: documents are scored by their perplexity under a model trained on Wikipedia, and lower perplexity means more Wikipedia-like text. The most similar documents are placed in cc_en_head, followed by cc_en_middle and then cc_en_tail.
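To make the bucketing concrete, here is a minimal sketch of the idea, not the actual CCNet or Dolma code. It assumes each document already has a perplexity score from a Wikipedia-trained language model (lower means more Wikipedia-like) and that the cutoffs are taken at the terciles of those scores. The function names `tercile_cutoffs` and `assign_bucket` and the sample scores are hypothetical.

```python
from typing import List, Tuple


def tercile_cutoffs(perplexities: List[float]) -> Tuple[float, float]:
    """Return (head_cutoff, tail_cutoff) splitting the scores into rough thirds.

    Assumption: cutoffs come from the score distribution itself; the real
    pipeline's thresholds are not reproduced here.
    """
    ordered = sorted(perplexities)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def assign_bucket(perplexity: float, head_cutoff: float, tail_cutoff: float) -> str:
    """Map a perplexity score to a bucket name (lower score = closer to Wikipedia)."""
    if perplexity < head_cutoff:
        return "cc_en_head"
    if perplexity < tail_cutoff:
        return "cc_en_middle"
    return "cc_en_tail"


# Hypothetical perplexity scores for six documents.
scores = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
head_cutoff, tail_cutoff = tercile_cutoffs(scores)
buckets = [assign_bucket(s, head_cutoff, tail_cutoff) for s in scores]
```

With these sample scores, the two lowest-perplexity documents land in cc_en_head, the next two in cc_en_middle, and the last two in cc_en_tail.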

We retain the same layout out of convenience.

Hope this helps! Feel free to reopen this issue if you have more questions.

Best,
Luca