EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics


Reading data is slow!

Lisennlp opened this issue · comments

I followed the README:

  git lfs clone https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps
  python utils/unshard_memmap.py --input_file ./pythia_deduped_pile_idxmaps/pile_0.87_deduped_text_document-00000-of-00082.bin --num_shards 83 --output_dir ./pythia_pile_idxmaps/
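A small pre-flight check (a sketch, not from the README) can confirm that all 83 shards were actually pulled by git lfs before unsharding; the glob pattern mirrors the shard naming above and the local directory layout is an assumption:

  # Sketch: count the downloaded shard files before running unshard_memmap.py.
  # The path and naming pattern are assumed from the command above.
  import glob

  shards = sorted(glob.glob(
      "./pythia_deduped_pile_idxmaps/pile_0.87_deduped_text_document-*-of-00082.bin"))
  print(f"found {len(shards)} shards (expected 83)")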

I got a 600+ GB file, and then I used gpt-neox's dataloader to read the data, which was very slow: it takes about 6 s to read a 2048-token sample. May I ask why?
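One way to rule out raw disk throughput as the cause is to time memory-mapped slices of the merged .bin directly, outside the dataloader. This is a minimal sketch, assuming the merged file path, a uint16 token dtype, and a 2048-token sequence length; none of these values come from the issue itself, so verify them against your setup:

  # Sketch: time contiguous 2048-token reads from the memory-mapped .bin file
  # to see whether plain mmap reads are already slow on this machine.
  import time
  import numpy as np

  BIN_PATH = "./pythia_pile_idxmaps/pile_0.87_deduped_text_document.bin"  # assumed path
  SEQ_LEN = 2048  # assumed sequence length

  data = np.memmap(BIN_PATH, dtype=np.uint16, mode="r")  # dtype is an assumption

  start = time.perf_counter()
  for i in range(100):
      # Copy each slice so the bytes are actually fetched, not lazily mapped.
      chunk = np.array(data[i * SEQ_LEN:(i + 1) * SEQ_LEN])
  elapsed = time.perf_counter() - start
  print(f"avg read time per {SEQ_LEN}-token slice: {elapsed / 100 * 1000:.2f} ms")

If these raw reads are fast, the slowdown is more likely in the dataloader or index lookup than in the memmapped file itself.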


I only got a 386 GB file: "386G Jan 30 13:28 pile_0.87_deduped_text_document.bin".
And I didn't get the '*.idx' file; should we use the downloaded idx file directly?