microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

There appears to be no data at this website...

hellangleZ opened this issue · comments

/pretrain_gpt_1.3B_seq_parallel.sh: line 151: ds_ssh: command not found
--2023-11-22 03:43:40-- https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.bin
Resolving the-eye.eu (the-eye.eu)... 162.213.130.6
Connecting to the-eye.eu (the-eye.eu)|162.213.130.6|:443... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

I have also encountered this problem. How should I solve it?

Same issue. The website https://the-eye.eu/public/AI/ does not contain the pile_neox data path. How can I solve this problem?

I only solved it by downloading the file manually: I found the download link in the project files, and then I could only reach the resource and download it by adding the link to Xunlei with an SVIP account (a Chinese download manager).

Which link? I just accessed the website, and there is no file at /pile_neox/data/BookCorpusDataset_text_document.bin.

It's still this link: https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.bin
But I used Xunlei (a Chinese download manager) to add the link and download it.
So I think a general-purpose download manager can fetch the file, while opening the link directly in a browser fails. This looks like a bug caused by a change in the website's page structure, while the storage behind the site still serves the file.

Correct, this looks to be related to the website the data is hosted on and not to Megatron-DeepSpeed. Closing this issue for now; using the link above should work.

Does anyone know how this file was generated? I'd like to create one myself, since I can't download it either.

python tools/preprocess_data.py \
    --input my-corpus.json \
    --output-prefix my-gpt2 \
    --vocab-file gpt2-vocab.json \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod
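For context, `preprocess_data.py` expects a "loose JSON" input: one JSON object per line, with the document text stored under a key named `text` by default. The file name `my-corpus.json` above is just a placeholder. A minimal sketch of producing such a file from your own documents (the example texts here are made up):

```python
import json

# Hypothetical example documents; in practice these come from your own corpus.
docs = [
    "The first document's full text goes here.",
    "The second document's full text goes here.",
]

# Write "loose JSON": one JSON object per line, text under the "text" key,
# which is the default key preprocess_data.py looks for.
with open("my-corpus.json", "w", encoding="utf-8") as f:
    for text in docs:
        f.write(json.dumps({"text": text}) + "\n")
```

Running the `preprocess_data.py` command above on the resulting file should then emit the tokenized `.bin` and `.idx` pair (e.g. `my-gpt2_text_document.bin`), which is the same binary format as the `BookCorpusDataset_text_document.bin` file everyone is trying to download.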