OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.

Home Page: https://optimalscale.github.io/LMFlow/

JSON dataset loading takes a long time

biaoliu-kiritsugu opened this issue · comments

In the following code in src/lmflow/datasets, loading the dataset with json takes a very long time for me. Is this normal? How can I reduce the loading time?

for single_file in data_files:
    with open(single_file) as fin:
        json_data = json.load(fin)
        if KEY_TYPE not in json_data.keys():
            raise ValueError(
                f'"{KEY_TYPE}" field must be specified for data, e.g.'
                '{\n'
                f'   "{KEY_TYPE}": "text_only",\n'
                f'   "{KEY_INSTANCES}": [\n'
                '       { "text": "Sentence 1: This is a sentence." },\n'
                '       { "text": "Sentence 2: This is another sentence." }\n'
                f'   ]\n'
                '}'
            )
        if self.type is None:
            self.type = json_data[KEY_TYPE]
        elif self.type != json_data[KEY_TYPE]:
            raise ValueError(
                'All task files must have same data types. Previous'
                f' files have type "{self.type}", but in file'
                f' {single_file}, it has type "{json_data[KEY_TYPE]}".'
            )
commented

Thanks for your interest in LMFlow! When loading the dataset for the first time, LMFlow needs to tokenize it. After that, the tokenized dataset is cached, so later runs should be much faster. You can also pass --preprocessing_num_workers 20 to accelerate the process via parallelism.

Also, the slowness could be because the json file is too large. In that case, we recommend splitting it into smaller files, each no more than several megabytes.
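The splitting suggested above can be scripted. Below is a minimal sketch, assuming the `{"type": ..., "instances": [...]}` file layout from the loader snippet; the function name and shard-size parameter are made up for illustration.

```python
import json


def split_json_dataset(path, max_instances_per_shard=10000):
    """Split one large LMFlow-style json file into smaller shards.

    Assumes the {"type": ..., "instances": [...]} layout shown in the
    loader snippet; the shard size is a tunable guess, not an LMFlow limit.
    """
    with open(path) as fin:
        data = json.load(fin)

    instances = data["instances"]
    stem = path.rsplit(".json", 1)[0]
    shard_paths = []

    # Write each slice of instances to its own file, copying the type field
    # so every shard passes the loader's consistency check.
    for i in range(0, len(instances), max_instances_per_shard):
        shard_path = f"{stem}_part{i // max_instances_per_shard}.json"
        with open(shard_path, "w") as fout:
            json.dump(
                {
                    "type": data["type"],
                    "instances": instances[i : i + max_instances_per_shard],
                },
                fout,
            )
        shard_paths.append(shard_path)

    return shard_paths
```

Each shard keeps the original `type`, so the loader's "all task files must have same data types" check still passes across the resulting files.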

Hope this information can be helpful 😄