EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics


Reading data is slow!

Lisennlp opened this issue · comments

I followed the README:

  git lfs clone https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps
  python utils/unshard_memmap.py --input_file ./pythia_deduped_pile_idxmaps/pile_0.87_deduped_text_document-00000-of-00082.bin --num_shards 83 --output_dir ./pythia_pile_idxmaps/
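A small pre-flight check (a sketch, not from the README) can confirm that all 83 shards were actually pulled by git lfs before unsharding; the glob pattern mirrors the shard naming above and the local directory layout is an assumption:

  # Sketch: count the downloaded shard files before running unshard_memmap.py.
  # The path and naming pattern are assumed from the command above.
  import glob

  shards = sorted(glob.glob(
      "./pythia_deduped_pile_idxmaps/pile_0.87_deduped_text_document-*-of-00082.bin"))
  print(f"found {len(shards)} shards (expected 83)")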

I got a 600+ GB file, and then I used gpt-neox's dataloader to read the data, which was very slow: it takes about 6 s to read a 2048-token sample. May I ask why?
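One way to rule out raw disk throughput as the cause is to time memory-mapped slices of the merged .bin directly, outside the dataloader. This is a minimal sketch, assuming the merged file path, a uint16 token dtype, and a 2048-token sequence length; none of these values come from the issue itself, so verify them against your setup:

  # Sketch: time contiguous 2048-token reads from the memory-mapped .bin file
  # to see whether plain mmap reads are already slow on this machine.
  import time
  import numpy as np

  BIN_PATH = "./pythia_pile_idxmaps/pile_0.87_deduped_text_document.bin"  # assumed path
  SEQ_LEN = 2048  # assumed sequence length

  data = np.memmap(BIN_PATH, dtype=np.uint16, mode="r")  # dtype is an assumption

  start = time.perf_counter()
  for i in range(100):
      # Copy each slice so the bytes are actually fetched, not lazily mapped.
      chunk = np.array(data[i * SEQ_LEN:(i + 1) * SEQ_LEN])
  elapsed = time.perf_counter() - start
  print(f"avg read time per {SEQ_LEN}-token slice: {elapsed / 100 * 1000:.2f} ms")

If these raw reads are fast, the slowdown is more likely in the dataloader or index lookup than in the memmapped file itself.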


I only got a 386 GB file: "386G Jan 30 13:28 pile_0.87_deduped_text_document.bin".
And I didn't get the '*.idx' file; should we use the downloaded idx file directly?