A Question about the meaning of dolma_v1.6_cc_en
aleien95 opened this issue · comments
Hello, I found that dolma_v1.6_cc_en is split into cc_en_head, cc_en_middle, and cc_en_tail. What do these names mean?
Hi @aleien95,
The names refer to the buckets into which the CCNet pipeline organizes documents extracted from Common Crawl. The CCNet pipeline estimates how similar documents are to Wikipedia pages using a KenLM statistical language model. Documents that are highly similar are placed in cc_en_head, followed by cc_en_middle and cc_en_tail.
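For illustration only, here is a minimal sketch of the bucketing idea. It assumes that similarity is expressed as a perplexity score (lower means closer to Wikipedia) and that the cutoffs split the scores into three equal parts. The function names, the sample scores, and the tercile cutoffs are all hypothetical; this is not the CCNet or Dolma code, and the real thresholds are not given in this thread.

```python
from typing import Dict, List, Tuple


def tercile_cutoffs(perplexities: List[float]) -> Tuple[float, float]:
    """Return (low, high) cutoffs that split the scores into three equal parts.

    Assumption: equal-sized thirds. The real pipeline's thresholds may differ.
    """
    if not perplexities:
        raise ValueError("need at least one perplexity score")
    ranked = sorted(perplexities)
    n = len(ranked)
    return ranked[n // 3], ranked[(2 * n) // 3]


def assign_bucket(perplexity: float, low: float, high: float) -> str:
    """Map one perplexity score to a bucket name (lower = more Wikipedia-like)."""
    if perplexity < low:
        return "cc_en_head"
    if perplexity < high:
        return "cc_en_middle"
    return "cc_en_tail"


def bucket_documents(scores: Dict[str, float]) -> Dict[str, str]:
    """Assign every document id to head, middle, or tail."""
    low, high = tercile_cutoffs(list(scores.values()))
    return {doc_id: assign_bucket(p, low, high) for doc_id, p in scores.items()}


# Hypothetical perplexity scores, one per document id.
example_scores = {"a": 100.0, "b": 250.0, "c": 400.0,
                  "d": 900.0, "e": 1500.0, "f": 3000.0}
buckets = bucket_documents(example_scores)
```

In this sketch the two lowest-perplexity documents end up in cc_en_head and the two highest in cc_en_tail.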
We retain the same layout out of convenience.
Hope this helps! Feel free to reopen this issue if you have more questions.
Best,
Luca