allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.

Home Page: https://allenai.github.io/dolma/

Clarification Needed on "C4 NoPunc" in Data Processing

codefly13 opened this issue

I am currently working with a dataset and noticed the term "C4 NoPunc" used in the context of data quality filtering. I would like to clarify what exactly this term refers to. Specifically, does "C4 NoPunc" mean:

  1. All C4 quality filters are applied except for the "lines_with_no_ending_punctuation" rule. That is, lines are not removed solely because they lack ending punctuation.

  2. Only the "lines_with_no_ending_punctuation" rule is used in quality filtering. This means that the sole criterion for removing lines is the absence of ending punctuation, and no other C4 quality filters are applied.

Could you please provide some insight into which of these interpretations is correct, or if there's another meaning entirely?

Hi @codefly13!

It's the latter (interpretation 2): only the lines_with_no_ending_punctuation rule is used in quality filtering, and no other C4 quality filters are applied.

I'm closing this issue assuming that the above answers your question, but please re-open it in case you need further clarification!

Best,
Luca
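
For readers who land on this thread later, here is a minimal sketch of what the answer above amounts to: a filter that drops lines not ending in terminal punctuation and applies no other C4 rule. It is not Dolma's actual implementation. The function name `drop_lines_without_ending_punctuation` and the `TERMINAL_PUNCTUATION` set are assumptions for illustration (the set loosely follows the C4 description of period, exclamation mark, question mark, and end quotation mark); the real tagger may split text and define punctuation differently.

```python
# Hypothetical sketch, not Dolma code: keep only lines that end in
# terminal punctuation, and apply no other C4 quality filter.

# Assumed set of terminal punctuation marks (illustrative only).
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def drop_lines_without_ending_punctuation(text: str) -> str:
    """Return `text` with every line lacking ending punctuation removed."""
    kept = [
        line
        for line in text.splitlines()
        # Trailing whitespace is ignored when checking the last character.
        if line.rstrip().endswith(TERMINAL_PUNCTUATION)
    ]
    return "\n".join(kept)


sample = (
    "Welcome to my site\n"
    "This is a full sentence.\n"
    "Home | About | Contact\n"
    "Is this kept?"
)
cleaned = drop_lines_without_ending_punctuation(sample)
print(cleaned)
```

In this sketch the navigation-style lines are dropped and the two punctuated sentences survive. Under interpretation 1, by contrast, this would be the one rule left out while the remaining C4 filters ran.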