huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Understand the output of deduplication

Manel-Hik opened this issue · comments

Hi,
I have an Arabic split from Common Crawl (CC) that I am trying to deduplicate.
I ran datatrove on a small sample, and my output folder now contains two files:
0000.c4_dup and 0000.c4_sig
Could you help me understand this output?
I cannot read their contents: c/00000.c4_sig is not UTF-8 encoded, and both seem to be binary files.
Where should I find the new, deduplicated text?
Thanks in advance.

Hi, can you share the pipeline code you used? You can use this example for reference: https://github.com/huggingface/datatrove/blob/main/examples/sentence_deduplication.py
Step 3 is the one that takes the original data and the .c4_dup files and removes the duplicate sections
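
To make the three steps concrete, here is a minimal, self-contained sketch of the idea in plain Python. It is not datatrove's implementation: the helper names (`split_sentences`, `compute_signatures`, `find_duplicates`, `filter_documents`), the single-sentence hashing, and the in-memory data are all assumptions made for illustration. It only shows why the first two steps produce signature and duplicate-index data rather than text, and why the cleaned text appears only after the third step re-reads the original documents.

```python
import hashlib
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_signature(sentence):
    """Hash a whitespace- and case-normalized sentence."""
    normalized = " ".join(sentence.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def compute_signatures(docs):
    """Step 1: emit one (signature, doc_id, sentence_id) record per sentence."""
    records = []
    for doc_id, text in enumerate(docs):
        for sent_id, sentence in enumerate(split_sentences(text)):
            records.append((sentence_signature(sentence), doc_id, sent_id))
    return records


def find_duplicates(records):
    """Step 2: flag every occurrence of a signature after the first one."""
    seen = set()
    duplicates = set()
    # Sorting groups equal signatures and puts the lowest doc_id first.
    for sig, doc_id, sent_id in sorted(records):
        if sig in seen:
            duplicates.add((doc_id, sent_id))
        else:
            seen.add(sig)
    return duplicates


def filter_documents(docs, duplicates):
    """Step 3: re-read the original documents and drop the flagged sentences."""
    cleaned = []
    for doc_id, text in enumerate(docs):
        kept = [
            sentence
            for sent_id, sentence in enumerate(split_sentences(text))
            if (doc_id, sent_id) not in duplicates
        ]
        if kept:
            cleaned.append(" ".join(kept))
    return cleaned


docs = [
    "Cookies help us deliver our services. Today we review a new phone.",
    "Cookies help us deliver our services. The weather is sunny.",
]
records = compute_signatures(docs)          # signatures only, no text
duplicates = find_duplicates(records)       # positions of repeated sentences
cleaned = filter_documents(docs, duplicates)  # the deduplicated text
print(cleaned)
```

In this toy version, only `cleaned` holds readable text; `records` and `duplicates` are intermediate bookkeeping, which is roughly the role the signature and duplicate files play between the steps.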

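Hi,
Yes, I used that example:
`from datatrove.executor.base import PipelineExecutor
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import SentenceDedupFilter, SentenceDedupSignature, SentenceFindDedups
from datatrove.pipeline.readers import JsonlReader


def run_example():
    # Step 1: read the already-prepared data and compute sentence signatures
    pipeline_1 = [
        JsonlReader("cc_sample_100k"),
        SentenceDedupSignature(output_folder="try_cc/"),
    ]

    # Step 2: compare the signatures and record which sentences are duplicates
    pipeline_2 = [
        SentenceFindDedups(
            data_folder="try_cc/",
            output_folder="try_cc/",
        )
    ]

    # Step 3: re-read the original data and remove the duplicated sentences
    pipeline_3 = [
        JsonlReader(data_folder="cc_sample_100k"),
        SentenceDedupFilter(data_folder="try_cc/"),
    ]

    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=1, tasks=1)
    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=1, tasks=1)
    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, workers=1, tasks=1)

    print(executor_1.run())
    print(executor_2.run())
    print(executor_3.run())


if __name__ == "__main__":
    run_example()`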
My data is already prepared for deduplication, which is why I edited pipeline 1.
Where can I see the deduplicated text? I need to understand the output.
I got .c4_dup and .c4_sig files in my output folder (try_cc).
Thanks.