Getting a datasets.utils.info_utils.NonMatchingSplitsSizesError when downloading the openwebtext dataset from Hugging Face
lovodkin93 opened this issue · comments
Hello,
I am trying to download the openwebtext dataset from Hugging Face, but I keep getting the following error:
Downloading data: 100%|________________________________________________________________________________________________________________| 12.9G/12.9G [25:43<00:00, 8.35MB/s]
/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/download/download_manager.py:527: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass `DownloadConfig(num_proc=<num_proc>)` to the initializer instead.
warnings.warn(
Extracting data files: 100%|________________________________________________________________________________________________________| 20610/20610 [9:43:42<00:00, 1.70s/it]
Traceback (most recent call last):
File "ssd_process_data.py", line 485, in <module>
main()
File "ssd_process_data.py", line 369, in main
raw_datasets["train"] = load_dataset(
File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1782, in load_dataset
builder_instance.download_and_prepare(
File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 872, in download_and_prepare
self._download_and_prepare(
File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
super()._download_and_prepare(
File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 985, in _download_and_prepare
verify_splits(self.info.splits, split_dict)
File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 100, in verify_splits
raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=39769494896, num_examples=8013769, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=39769065791, num_examples=8013740, shard_lengths=[101000, 100000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 101000, 101000, 101000, 101000, 102000, 102000, 100000, 101000, 100000, 101000, 102000, 101000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 101000, 101000, 102000, 101000, 102000, 101000, 101000, 100000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 100000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 102000, 102000, 101000, 101000, 102000, 102000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 102000, 101000, 13740], dataset_name='openwebtext')}]
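To make the size of the mismatch easier to see, here is a small sketch that compares the expected and recorded figures from the error message. The numbers are copied verbatim from the traceback; the variable names are only illustrative and are not part of the datasets library.

```python
# Figures copied from the NonMatchingSplitsSizesError message in the traceback.
# The dictionary layout and names are illustrative, not a datasets API.
expected = {"num_bytes": 39769494896, "num_examples": 8013769}
recorded = {"num_bytes": 39769065791, "num_examples": 8013740}

# How much the prepared split falls short of what the dataset metadata declares.
missing_examples = expected["num_examples"] - recorded["num_examples"]
missing_bytes = expected["num_bytes"] - recorded["num_bytes"]
missing_fraction = missing_examples / expected["num_examples"]

print(f"missing examples: {missing_examples}")
print(f"missing bytes:    {missing_bytes}")
print(f"fraction missing: {missing_fraction:.6%}")
```

The recorded train split is short by 29 examples (429105 bytes) relative to the expected metadata, so the mismatch is tiny compared with the roughly 8 million examples in the split.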
I have tried forcing a re-download of the dataset by passing the download_mode="force_redownload" parameter, but it yielded the same error.
I have also tried passing the ignore_verifications=True parameter, but this in turn yielded the following error:
raw_datasets["train"] = load_dataset(
File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1754, in load_dataset
verification_mode = VerificationMode(
File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 339, in __call__
return cls.__new__(cls, value)
File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 663, in __new__
raise ve_exc
ValueError: 'none' is not a valid VerificationMode
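For reference, the two attempts described above can be sketched roughly as follows. This is a minimal reconstruction, not the actual code from ssd_process_data.py: the helper names build_attempts and run_attempt are hypothetical, and the split="train" argument is an assumption based on the raw_datasets["train"] assignment in the traceback. Only download_mode="force_redownload" and ignore_verifications=True come from the description above.

```python
def build_attempts():
    """Return the keyword arguments for the two load_dataset attempts.

    Hypothetical helper: the dataset name comes from the issue text,
    while split="train" is an assumption, not taken from the script.
    """
    base = {"path": "openwebtext", "split": "train"}
    return {
        # Attempt 1: force a fresh download of the source files.
        "force_redownload": {**base, "download_mode": "force_redownload"},
        # Attempt 2: ask the library to skip the split-size verification.
        "ignore_verifications": {**base, "ignore_verifications": True},
    }


def run_attempt(name):
    """Run one attempt; needs the datasets package and network access."""
    from datasets import load_dataset  # imported lazily so the sketch loads without it

    return load_dataset(**build_attempts()[name])


if __name__ == "__main__":
    # Print the arguments only; nothing is downloaded here.
    for name, kwargs in build_attempts().items():
        print(name, kwargs)
```

Calling run_attempt("force_redownload") or run_attempt("ignore_verifications") would trigger the actual download, so the sketch only prints the arguments by default.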
Has anyone encountered this problem, or does anyone know what I can do?