Convert dataset with many dataloaders
germanjke opened this issue
```
train: 0it [00:00, ?it/s]Too many dataloader workers: 224 (max is dataset.n_shards=1). Stopping 223 dataloader workers. To parallelize data loading, we give each process some shards (or data sources) to process. Therefore it's unnecessary to have a number of workers greater than dataset.n_shards=1. To enable more parallelism, please split the dataset in more files than 1.
train: 16897it [08:29, 43.96it/s]
```
Hello! I get a warning like the one above when running `scripts/data_prep/convert_dataset_hf.py`. Is there a way to make use of all the dataloader workers to parallelize the processing? Thanks!
Thanks, I solved my issue by splitting the dataset into many smaller files. After that, I use `load_dataset` with a list of files, like this:

```python
dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})
```