mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

Convert dataset with many dataloaders

germanjke opened this issue

```
train: 0it [00:00, ?it/s]Too many dataloader workers: 224 (max is dataset.n_shards=1). Stopping 223 dataloader workers. To parallelize data loading, we give each process some shards (or data sources) to process. Therefore it's unnecessary to have a number of workers greater than dataset.n_shards=1. To enable more parallelism, please split the dataset in more files than 1. train: 16897it [08:29, 43.96it/s]
```

Hello! I get this warning when running scripts/data_prep/convert_dataset_hf.py. Is there some way to use all of the dataloader workers to parallelize the processing? Thanks!

Thanks, I solved my issue by splitting the dataset into many parquet files; after that I use load_dataset with a list of files, like this:

```python
dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})
```
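
For reference, a minimal sketch of the splitting step, using the datasets library's Dataset.shard and Dataset.to_parquet; the dataset name and shard count here are hypothetical, not from the original report:

```python
from datasets import load_dataset

# Hypothetical source dataset; pick num_shards >= your number of
# dataloader workers so every worker gets at least one shard.
ds = load_dataset('my_dataset', split='train')  # hypothetical dataset name
num_shards = 224

# Write each shard to its own parquet file.
for i in range(num_shards):
    ds.shard(num_shards=num_shards, index=i).to_parquet(f'train-{i:05d}.parquet')

# Reloading from the file list gives dataset.n_shards == num_shards,
# so up to num_shards workers can load data in parallel.
files = [f'train-{i:05d}.parquet' for i in range(num_shards)]
sharded = load_dataset('parquet', data_files={'train': files})
```

With one file per shard, the "Too many dataloader workers" warning should go away as long as the number of workers does not exceed n_shards.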