Understand the output of deduplication
Manel-Hik opened this issue
Hi
I have an Arabic split from Common Crawl (CC) that I am trying to deduplicate.
I used datatrove for this on a small example.
I got two files in my output folder:
0000.c4_dup and 0000.c4_sig
Could you help me understand this output?
I cannot read their content: c/00000.c4_sig is not UTF-8 encoded, and both seem to be binary files.
Where should I see the new deduplicated text?
Thanks in advance.
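Files like these appear to be binary intermediate data rather than text, so a text editor will report an encoding error. As a minimal sketch, the raw bytes of any file can be previewed without assuming an encoding. The helper name `preview_bytes` and the sample file are hypothetical, and the demo bytes are made up; they do not reflect the real layout of a signature file.

```python
import tempfile
from pathlib import Path


def preview_bytes(path, n=16):
    """Return the file size and a hex preview of its first n bytes."""
    data = Path(path).read_bytes()
    return len(data), data[:n].hex(" ")


# Demo on a throwaway file with made-up bytes (NOT a real .c4_sig file).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(bytes([0xFF, 0xFE, 0x00, 0x2A]))
    size, head = preview_bytes(sample)
    print(size, head)
```

Pointing `preview_bytes` at one of the files in the output folder would show its size and leading bytes, which is enough to confirm that it is not plain text.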
Hi, can you share the pipeline code you used? You can use this example for reference: https://github.com/huggingface/datatrove/blob/main/examples/sentence_deduplication.py
Step 3 is the one that takes the original data and the .c4_dup files and removes the duplicate sections
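To make the three steps concrete, here is a simplified pure-Python illustration of the general technique: hash sentence windows, find repeated hashes, then remove the repeated spans. It is a sketch under stated assumptions, not datatrove's implementation. The window size `SPAN`, the naive sentence splitter, the hash function, and all function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

SPAN = 3  # assumption: number of consecutive sentences hashed together


def split_sentences(text):
    # Naive splitter, for illustration only.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def signatures(doc_id, text, span=SPAN):
    """Step 1: hash every window of `span` consecutive sentences."""
    sents = split_sentences(text)
    for i in range(len(sents) - span + 1):
        window = " ".join(sents[i:i + span])
        digest = hashlib.sha1(window.encode("utf-8")).digest()[:8]
        yield digest, doc_id, i


def find_duplicates(all_sigs):
    """Step 2: mark every occurrence of a hash after the first one."""
    seen = set()
    dups = defaultdict(set)
    for digest, doc_id, idx in sorted(all_sigs):
        if digest in seen:
            dups[doc_id].add(idx)
        else:
            seen.add(digest)
    return dups


def filter_document(text, dup_starts, span=SPAN):
    """Step 3: drop the sentences covered by a duplicate window."""
    sents = split_sentences(text)
    drop = {j for i in dup_starts for j in range(i, i + span)}
    return " ".join(s for j, s in enumerate(sents) if j not in drop)


docs = {
    0: "A one. B two. C three. D four.",
    1: "X zero. A one. B two. C three.",
}
sigs = [sig for doc_id, text in docs.items() for sig in signatures(doc_id, text)]
dups = find_duplicates(sigs)
cleaned = {d: filter_document(t, dups.get(d, set())) for d, t in docs.items()}
print(cleaned)
```

In this toy run the first document keeps all of its sentences and the second keeps only "X zero.", because its other three sentences repeat a window already seen. Note that the deduplicated text exists only as the output of the third step. In a datatrove pipeline, persisting it would normally need a writer block (for example `JsonlWriter`) after the filter, which is an assumption about the setup rather than something stated in this thread.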
Hi
Yes, I used that example:
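```python
# Imports as in the linked sentence_deduplication.py example
from datatrove.executor.base import PipelineExecutor
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import SentenceDedupFilter, SentenceDedupSignature, SentenceFindDedups
from datatrove.pipeline.readers import JsonlReader


def run_example():
    # Step 1: read the data and compute sentence signatures
    pipeline_1 = [
        JsonlReader("cc_sample_100k"),
        SentenceDedupSignature(output_folder="try_cc/"),
    ]

    # Step 2: find duplicates among the signatures
    pipeline_2 = [
        SentenceFindDedups(data_folder="try_cc/", output_folder="try_cc/"),
    ]

    # Step 3: re-read the same data and filter out the duplicates
    pipeline_3 = [
        JsonlReader(data_folder="cc_sample_100k"),
        SentenceDedupFilter(data_folder="try_cc/"),
    ]

    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=1, tasks=1)
    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=1, tasks=1)
    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, workers=1, tasks=1)

    print(executor_1.run())
    print(executor_2.run())
    print(executor_3.run())


if __name__ == "__main__":
    run_example()
```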
My data is already prepared for deduplication, which is why I edited pipeline 1.
Where can I see the deduplicated text data? I need to understand the output.
I only got .c4_dup and .c4_sig files in my output folder (try_cc).
Thanks.