mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

convert_dataset_hf.py example stuck

eldarkurtic opened this issue

Hi,

Converting the C4 dataset to the StreamingDataset format gets stuck without any errors. More specifically, I am running:

# Convert C4 dataset to StreamingDataset format
python convert_dataset_hf.py \
  --dataset c4 --data_subset en \
  --out_root /ssdpool/eldar/opt125m/c4 --splits train val train_small val_small \
  --concat_tokens 2048 --tokenizer facebook/opt-125m

and the output looks like this (stuck at "train: 54%"):

Downloading builder script: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.29k/3.29k [00:00<00:00, 14.4MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.77k/7.77k [00:00<00:00, 21.4MB/s]
Converting train to MDS format...
Note: the progress bar is based on the dataset length before tokenization, and may finish at a value before 100%.
train:  54%|███████████████████████████████████████████████████████████████████████████▊                                                                | 52083426/96205664 [1:13:39<7:46:00, 1578.00it/s

Restarting the entire process from scratch doesn't help. The --out_root location has plenty of space available (> 2TB). Any ideas what might be happening and how to actually debug it further?
(I am using the latest llm-foundry installed from source)
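
One generic way to see where a silently hung Python process is waiting is the standard-library `faulthandler` module, which can periodically dump the stack of every thread. The following is only a minimal sketch: the helper names `enable_hang_diagnostics` and `disable_hang_diagnostics` are hypothetical, they are not part of llm-foundry, and the thread does not establish whether the stall is in downloading, tokenizing, or writing.

```python
import faulthandler
import sys


def enable_hang_diagnostics(interval_s=600.0, file=None):
    """Dump all thread stacks to `file` every `interval_s` seconds.

    Hypothetical helper (not part of llm-foundry). `file` must be a real
    file object with a file descriptor; it defaults to sys.stderr.
    """
    if file is None:
        file = sys.stderr
    # Also report hard crashes such as segmentation faults.
    faulthandler.enable(file=file)
    # A watchdog thread writes a traceback each time the interval elapses.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=file)


def disable_hang_diagnostics():
    """Stop the periodic dumps and the crash handler."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
```

Calling `enable_hang_diagnostics()` near the start of a local copy of the conversion script would, under these assumptions, print a traceback every ten minutes. If consecutive dumps keep showing the same frame (for example a socket read), that points to where the process is blocked.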

To add a bit more info: some attempts fail with the following output:

Converting train to MDS format...
Note: the progress bar is based on the dataset length before tokenization, and may finish at a value before 100%.
train:   1%|| 650520/96205664 [07:34<30:03, 52974.56it/s]'(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: 94f3cdba-c04a-4586-9fb2-fd84a5f4b2ea)')' thrown while requesting GET https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00143-of-01024.json.gz
Retrying in 1s [Retry 1/5].
'(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: 341311e0-7811-41cb-8452-3b4457b311be)')' thrown while requesting GET https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00252-of-01024.json.gz
Retrying in 1s [Retry 1/5].
'(MaxRetryError("HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /datasets/allenai/c4/e65ee475e3b6682b57bfa3f7b9c1cdabf36a7282fc793865df63dbe6a6a3d1fe?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27c4-train.00227-of-01024.json.gz%3B+filename%3D%22c4-train.00227-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&Expires=1706355753&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwNjM1NTc1M319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9kYXRhc2V0cy9hbGxlbmFpL2M0L2U2NWVlNDc1ZTNiNjY4MmI1N2JmYTNmN2I5YzFjZGFiZjM2YTcyODJmYzc5Mzg2NWRmNjNkYmU2YTZhM2QxZmU~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=LZVow3Zgh6ecIquJYbI3YEj-FLWGIJ0NvSbBFwvInX7pGinq5gZL1DyfPRZ41cKFguPW-9PmSNSWy60Ig1R2X9Sr0fTcmsa3JGxsmURqxDGA4SeeBYXT89xSXxX4M9RNNubCEyY2hO3FY3MdMLDobz8fia-2ZvlXNagjVr60YiCwsR33rm5esIFQpLe85IFw~HfIDwBzqgeGLkoQ2DYy0qPXt9dUB-gWKIoxoKyMvFcQ~SXZ3nuE-PJQulCqzQ8TNde1ESVi28RD0LIPFStziqqFO0C2kQ5XCwL1kiFfvAdmv~D7B82KTdNvtcH7q5axZO9tRj~ma-4oBOdVPozwZQ__&Key-Pair-Id=KVTP0A1DKRTAX (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')))"), '(Request ID: 34f8402f-fd9d-43c5-97da-4c3d355b9b83)')' thrown while requesting GET https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00227-of-01024.json.gz
Retrying in 1s [Retry 1/5].
'(MaxRetryError("HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /datasets/allenai/c4/acc052da4501691fdd28f269c633ed826840f2634be49b5e2c5e1273de0cb4a8?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27c4-train.00189-of-01024.json.gz%3B+filename%3D%22c4-train.00189-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&Expires=1706355753&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwNjM1NTc1M319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9kYXRhc2V0cy9hbGxlbmFpL2M0L2FjYzA1MmRhNDUwMTY5MWZkZDI4ZjI2OWM2MzNlZDgyNjg0MGYyNjM0YmU0OWI1ZTJjNWUxMjczZGUwY2I0YTg~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=LjuKY8fuGEV-hTND0v~i31UPcBfFyWPyxM3epgvOz2yf35kbYRAMlKA2jZVKCz1mWkW5DtgjpTHy--Et2mz2VcxrfQ6KJN2owrNpj6yYR9N7Kcv7Md-0ElLmS2baMU1vkrpGppiZ5Vm~yQqnAX0ou6dmo2w~qYxUH~Gxi8-BHhRlLBhqyHr3iAzlusWIe9Cj5j8zsLA9mqhqZ~pbZ74o2DZFAHn4i3DfFjK1MD8r5ubaLOBJ6Tu-Yuk44FCsYMMtO96B5XQS8YPiScLAjPz1Sz2q9nLEKvPEi-24AHqlXGj6MfsCNXpJgwESHRyTSMCULjpff~3g~ELY9eNDrPH~Fw__&Key-Pair-Id=KVTP0A1DKRTAX (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')))"), '(Request ID: 96eaf595-6784-4093-a937-14c1d1c1a9ac)')' thrown while requesting GET https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00189-of-01024.json.gz
Retrying in 1s [Retry 1/5].

The errors you posted look like connection errors to Hugging Face, which are unfortunately not infrequent. Going to close, but feel free to open a new issue if you run into problems that appear to be with LLM Foundry itself.
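
For transient connection errors like the ones in the log above, a common general-purpose mitigation is to retry the failing operation with exponential backoff. The sketch below is illustrative only: `retry_with_backoff` and its parameters are hypothetical names, and this is not how llm-foundry or the Hugging Face libraries implement the retries shown in the log.

```python
import random
import time


def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call `fn()` and retry on the given exceptions with growing delays.

    Hypothetical helper for illustration. The delay doubles after each
    failed attempt (capped at `max_delay`) and gets up to 10% random
    jitter so that parallel workers do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error.
            delay = min(max_delay, base_delay * 2 ** attempt)
            delay += random.uniform(0, 0.1 * delay)
            print(f"{exc!r}; retrying in {delay:.1f}s "
                  f"[Retry {attempt + 1}/{max_retries}]")
            sleep(delay)
```

Here `fn` would be any zero-argument callable that performs the network request. Injecting `sleep` keeps the helper easy to test without real waiting.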