huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.


Why is read_files_shard() taking too long?

mohataher opened this issue · comments

So I'm currently using the library to process a large huggingface dataset.

I have prepared the code to deduplicate the dataset:

# Deduplication using the datatrove minhash configuration in 4 stages; the resulting folder MINHASH_BASE_PATH contains the removed and output documents.
minhash_config = MinhashConfig(use_64bit_hashes=True)  # better precision -> fewer false positives (collisions)
INPUT_READER = ParquetReader(f"./The Arabic Pile/Temporary Parquet/{dataset_category}/Parquet")
MINHASH_BASE_PATH = f"./{dataset_category}/{dataset_category}_BasePath/"
LOGS_FOLDER = f"./{dataset_category}/{dataset_category}_Logging__Directory/"
LOCAL_LOGS_FOLDER = f"./{dataset_category}/{dataset_category}_Local_Logs_Folder/"
TOTAL_TASKS = 10
run_deduplication(minhash_config, INPUT_READER, MINHASH_BASE_PATH, LOGS_FOLDER, LOCAL_LOGS_FOLDER, TOTAL_TASKS)

and the function implementation is:

def run_deduplication(minhash_config, INPUT_READER, MINHASH_BASE_PATH, LOGS_FOLDER, LOCAL_LOGS_FOLDER, TOTAL_TASKS):
    pipeline_1 = [
        INPUT_READER,
        MinhashDedupSignature(output_folder=f"{MINHASH_BASE_PATH}/signatures", config=minhash_config)
    ]

    pipeline_2 = [
        MinhashDedupBuckets(
            input_folder=f"{MINHASH_BASE_PATH}/signatures",
            output_folder=f"{MINHASH_BASE_PATH}/buckets",
            config=minhash_config,
        ),
    ]
    pipeline_3 = [
        MinhashDedupCluster(
            input_folder=f"{MINHASH_BASE_PATH}/buckets",
            output_folder=f"{MINHASH_BASE_PATH}/remove_ids",
            config=minhash_config,
        )
    ]

    pipeline_4 = [
        INPUT_READER,
        TokensCounter(),  # nice way to see how many tokens we had before and after deduplication
        MinhashDedupFilter(
            input_folder=f"{MINHASH_BASE_PATH}/remove_ids",
            exclusion_writer=JsonlWriter(f"{MINHASH_BASE_PATH}/removed"),
        ),
        JsonlWriter(output_folder=f"{MINHASH_BASE_PATH}/deduplicated_output")
    ]

    num_workers = 1  # I used 8 in other runs and still had the same issue
    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=num_workers, tasks=TOTAL_TASKS, logging_dir=f"{LOGS_FOLDER}/signatures")

    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=num_workers, tasks=minhash_config.num_buckets, logging_dir=f"{LOGS_FOLDER}/buckets")

    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, tasks=1, logging_dir=f"{LOGS_FOLDER}/clusters")

    executor_4: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_4, workers=num_workers, tasks=TOTAL_TASKS, logging_dir=f"{LOGS_FOLDER}/filter")

    print('running exec 1')
    print('running exec 1', executor_1.run())
    print('running exec 2')
    print('running exec 2', executor_2.run())
    print('running exec 3')
    print('running exec 3', executor_3.run())
    print('running exec 4')
    print('running exec 4', executor_4.run())

The good news is that the pipeline finished the first 9 tasks very quickly. However, task 10 has now been running for over 12 hours on a ~6GB dataset. Logs below:


running exec 1

2024-03-24 05:18:02.895 | INFO     | datatrove.executor.local:run:118 - Skipping 9 already completed tasks # I have re-run it after changing the number of workers to see if that has any effect.
2024-03-24 05:18:03.844 | INFO     | datatrove.utils.logging:add_task_logger:47 - Launching pipeline for rank=0
2024-03-24 05:18:03.847 | INFO     | datatrove.utils.logging:log_pipeline:76 - 
--- 🛠️ PIPELINE 🛠
📖 - READER: 📒 Parquet
🫂 - DEDUP: 🎯 MinHash stage 1
2024-03-24 05:18:03.854 | INFO     | datatrove.pipeline.readers.base:read_files_shard:193 - Reading input file 
2024-03-24 18:25:36.251 | INFO     | datatrove.pipeline.dedup.minhash:run:236 - Sorting buckets...

I'm running this on a local M1 Pro with 16GB of RAM. I should note that the majority of the text is in Arabic.

Is there anything I'm doing wrong?

Hi! Your config seems fine to me; I'm not sure what the issue could be, especially since it seems to reach the bucket-sorting stage.

You mention the data is in Arabic. By default, when minhashing, we use an English word tokenizer that splits the text into words mostly based on spacing. I'm not sure how spacing works in Arabic, but maybe there's a very large text somewhere without spaces? Then again, it wouldn't reach the sorting part in that case. Maybe there is some other weird outlier somewhere?
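
If you want to check for such an outlier, something like this rough, untested sketch (not part of datatrove; it assumes the text column is called "text", which is the library's default, and the glob pattern is just a placeholder for your input files) would report the longest document in each parquet file:

import glob

import pyarrow.parquet as pq

# rough outlier check: print the row count and the longest "text" value per input file
for path in glob.glob("./The Arabic Pile/Temporary Parquet/*/Parquet/*.parquet"):
    texts = pq.read_table(path, columns=["text"]).column("text").to_pylist()
    lengths = [len(t) for t in texts if t is not None]
    print(path, "rows:", len(texts), "max chars:", max(lengths, default=0))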

Unfortunately I don't think I'll be able to help much more without taking a look at the actual data.

I have profiled the nltk English tokenizer on Arabic text and saw no major difference between Arabic and English, so we can rule this one out. I have also noticed that for each bucket there are 10 minhash.sig files; file 00 is 7.48GB and the rest of them are empty.
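
For reference, the quick check was just listing the size of every file in the signatures folder, roughly like this (the glob pattern is approximate):

import glob
import os

# list each signature file produced by stage 1 together with its size
for f in sorted(glob.glob(f"{MINHASH_BASE_PATH}/signatures/**/*", recursive=True)):
    if os.path.isfile(f):
        print(f, f"{os.path.getsize(f) / 1e9:.2f} GB")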

I just have a few questions:

  1. Does the code use multiprocessing in the last task, when reading files and sorting buckets? Or is it single-threaded?
  2. Does the dataset size (~1B rows) require any changes to the bucket size or other config values to optimise speed?
  3. Why aren't hashes distributed among buckets?
  1. No. The MinhashDedupSignature block does not use multiprocessing, and if your executor has workers=1, neither does the executor.
  2. The way to optimize this is to have the dataset split across many different files and then use a large number of tasks (files will be distributed across these tasks, so you can parallelize more); see the sketch after this list. Bucket size is meant to change your preferred similarity threshold, not so much speed.
  3. They are. If you look at the output from step 1, each worker writes the relevant hashes to each bucket's folder for step 2.
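
For example, one way to pre-split a single large parquet file into smaller shards before running the pipeline (a rough sketch; the helper name, shard size and paths are placeholders, not datatrove API):

import os

import pyarrow as pa
import pyarrow.parquet as pq

def split_parquet(input_file: str, output_dir: str, rows_per_shard: int = 500_000):
    """Split one big parquet file into many shards so each datatrove task gets its own file."""
    os.makedirs(output_dir, exist_ok=True)
    reader = pq.ParquetFile(input_file)
    for i, batch in enumerate(reader.iter_batches(batch_size=rows_per_shard)):
        pq.write_table(pa.Table.from_batches([batch]), f"{output_dir}/shard_{i:05d}.parquet")

split_parquet("one_big_file.parquet", "sharded_parquet/", rows_per_shard=500_000)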

Ahh, I assume now the issue is that we are passing one huge parquet file, so one worker/task is picking up the whole job. Obviously, increasing the number of workers/tasks won't have any effect on performance, as we have already noticed.

I also noticed that my team is using ParquetReader after converting the HF dataset to a parquet file. Maybe using a HuggingFaceDatasetReader would perform better, by helping the code split the data more efficiently instead of us breaking up our parquet file? Or do we still need to break that dataset up ourselves?

Can confirm: changing our reader to HuggingFaceDatasetReader, with its internal sharding, boosted performance.
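
For reference, the change was roughly the following (the dataset name is a placeholder, and dataset_options is the dict we forward to load_dataset; check the reader's docstring for the exact parameters, as they may differ between datatrove versions):

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# replaces the ParquetReader over a single converted file;
# the reader shards the dataset across tasks internally
INPUT_READER = HuggingFaceDatasetReader(
    dataset="our-org/the-arabic-pile",   # placeholder dataset name
    dataset_options={"split": "train"},  # forwarded to datasets.load_dataset
    text_key="text",                     # column containing the document text
)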

Silly mistake on our end to put such an enormous amount of data in a single parquet file.

It would be nice to see a warning message when a single huge file is thrown at the pipeline.