allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.

Home Page:https://allenai.github.io/dolma/

Need help customizing python/dolma/taggers/c4.py

mihara-bot opened this issue

Dear authors,
I tried to implement the rule on page 57 of your Dolma paper: 'Remove documents with more than half of their line not ending in...'.
To do so, I modified lines 107-130 of python/dolma/taggers/c4.py as follows:

        start = count = 0
        # number of lines that do not end in ".", "?", "!" or '"'
        no_ending_punc_count = 0
        for sent in text.split("\n"):
            end = start + len(sent)
            if end != len(text):
                # account for the newline
                end += 1

            # strip leading and trailing whitespace
            sent = sent.strip()

            if not sent.endswith((".", "?", "!", '"')):
                spans.append(Span(start, end, type="lines_with_no_ending_punctuation"))
                no_ending_punc_count += 1

            if len(sent.split()) < MIN_WORDS_PER_LINE:
                spans.append(Span(start, end, type="lines_with_too_few_words"))

            count += 1
            start = end

        spans.append(Span(0, len(doc.text), type="line_count", score=count))
        # str.split always returns at least one element, so count >= 1 here
        spans.append(
            Span(
                0,
                len(doc.text),
                type="lines_with_no_ending_punctuation_ratio",
                score=no_ending_punc_count / count,
            )
        )
        return DocResult(doc=doc, spans=spans)
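
To check the ratio logic separately from the tagger, here is a minimal standalone sketch that does not import dolma. The function names `no_ending_punctuation_ratio` and `should_remove` and the `threshold` parameter are my own, hypothetical names, not part of the library; the 0.5 default only reflects my reading of "more than half" in the rule.

```python
from typing import Tuple

# Line endings treated as terminal punctuation, copied from the snippet above.
END_PUNCTUATION: Tuple[str, ...] = (".", "?", "!", '"')


def no_ending_punctuation_ratio(text: str) -> float:
    """Return the fraction of lines that do not end in terminal punctuation."""
    lines = text.split("\n")  # always yields at least one element
    missing = sum(
        1 for line in lines if not line.strip().endswith(END_PUNCTUATION)
    )
    return missing / len(lines)


def should_remove(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: True when more than `threshold` of the lines
    lack ending punctuation (assumed reading of the rule)."""
    return no_ending_punctuation_ratio(text) > threshold


sample = "First line.\nsecond line without punctuation\nThird line?\nfourth"
print(no_ending_punctuation_ratio(sample))  # 2 of the 4 lines lack punctuation
print(should_remove(sample))
```

This returns the expected ratio on its own, so I believe the counting itself is correct and the problem is in how the new span reaches the c4_v2 output.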

However, the new 'lines_with_no_ending_punctuation_ratio' span does not seem to take effect: the results of c4_v2 do not contain this field.
Could you please help me with this C4 rule?
Many thanks! :)

Best regards,
Xinlin Zhuang