FileNotFoundError: error when loading C4 dataset
W-215 opened this issue · comments
Describe the bug
I can't load the C4 dataset.
When I switch the datasets package to version 2.12.2, I instead get: raise datasets.utils.info_utils.ExpectedMoreSplits: {'train'}
How can I fix this?
Steps to reproduce the bug
1. from datasets import load_dataset
2. dataset = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation')
3. The call raises FileNotFoundError:
FileNotFoundError: Couldn't find a dataset script at local_path/c4_val/allenai/c4/c4.py or any data file in the same directory. Couldn't find 'allenai/c4' on the Hugging Face Hub either: FileNotFoundError: Unable to find 'hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-validation.00003-of-00008.json.gz' with any supported extension ['.csv', '.tsv', '.json', '.jsonl', '.parquet', '.geoparquet', '.gpq', '.arrow', '.txt', '.tar', '.blp', '.bmp', '.dib', '.bufr', '.cur', '.pcx', '.dcx', '.dds', '.ps', '.eps', '.fit', '.fits', '.fli', '.flc', '.ftc', '.ftu', '.gbr', '.gif', '.grib', '.h5', '.hdf', '.png', '.apng', '.jp2', '.j2k', '.jpc', '.jpf', '.jpx', '.j2c', '.icns', '.ico', '.im', '.iim', '.tif', '.tiff', '.jfif', '.jpe', '.jpg', '.jpeg', '.mpg', '.mpeg', '.msp', '.pcd', '.pxr', '.pbm', '.pgm', '.ppm', '.pnm', '.psd', '.bw', '.rgb', '.rgba', '.sgi', '.ras', '.tga', '.icb', '.vda', '.vst', '.webp', '.wmf', '.emf', '.xbm', '.xpm', '.BLP', '.BMP', '.DIB', '.BUFR', '.CUR', '.PCX', '.DCX', '.DDS', '.PS', '.EPS', '.FIT', '.FITS', '.FLI', '.FLC', '.FTC', '.FTU', '.GBR', '.GIF', '.GRIB', '.H5', '.HDF', '.PNG', '.APNG', '.JP2', '.J2K', '.JPC', '.JPF', '.JPX', '.J2C', '.ICNS', '.ICO', '.IM', '.IIM', '.TIF', '.TIFF', '.JFIF', '.JPE', '.JPG', '.JPEG', '.MPG', '.MPEG', '.MSP', '.PCD', '.PXR', '.PBM', '.PGM', '.PPM', '.PNM', '.PSD', '.BW', '.RGB', '.RGBA', '.SGI', '.RAS', '.TGA', '.ICB', '.VDA', '.VST', '.WEBP', '.WMF', '.EMF', '.XBM', '.XPM', '.aiff', '.au', '.avr', '.caf', '.flac', '.htk', '.svx', '.mat4', '.mat5', '.mpc2k', '.ogg', '.paf', '.pvf', '.raw', '.rf64', '.sd2', '.sds', '.ircam', '.voc', '.w64', '.wav', '.nist', '.wavex', '.wve', '.xi', '.mp3', '.opus', '.AIFF', '.AU', '.AVR', '.CAF', '.FLAC', '.HTK', '.SVX', '.MAT4', '.MAT5', '.MPC2K', '.OGG', '.PAF', '.PVF', '.RAW', '.RF64', '.SD2', '.SDS', '.IRCAM', '.VOC', '.W64', '.WAV', '.NIST', '.WAVEX', '.WVE', '.XI', '.MP3', '.OPUS', '.zip']
Expected behavior
The dataset loads successfully.
Environment info
python version 3.9
datasets version 2.19.2
same problem here
Hello,
Are you sure you are really using datasets version 2.19.2? We just made the patch release yesterday specifically to fix this issue.
I can't reproduce the error:
In [1]: from datasets import load_dataset
In [2]: ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation')
Downloading readme: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41.1k/41.1k [00:00<00:00, 596kB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40.7M/40.7M [00:04<00:00, 8.50MB/s]
Generating validation split: 45576 examples [00:01, 44956.75 examples/s]
In [3]: ds
Out[3]:
Dataset({
features: ['text', 'timestamp', 'url'],
num_rows: 45576
})
Thank you for your reply. The ExpectedMoreSplits error occurred with datasets version 2.12.2. After I updated to datasets version 2.19.2, I encountered the FileNotFoundError problem mentioned above.
That might be due to a corrupted cache.
Please retry loading the dataset, passing download_mode="force_redownload":
ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")
If the above command does not fix the issue, then you will need to fix the cache manually, by removing the corresponding directory inside ~/.cache/huggingface/.
The two methods you mentioned above do not solve the problem: the command line interface shows Downloading readme: 41.1kB [00:00, 281kB/s], and then the FileNotFoundError appears. It is worth noting that I have no problem loading other datasets with the initial method, such as the wikitext datasets.
Same issue encountered.
I really think the issue is caused by a corrupted cache, somewhere between versions 2.12.0 (a 2.12.2 version does not exist) and 2.19.2.
Are you sure you removed all the corresponding corrupted directories within the cache?
You can easily check if the issue is caused by a corrupted cache by removing the entire cache:
mv ~/.cache/huggingface ~/.cache/huggingface.bak
and then reloading the dataset:
ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")
@albertvillanova Thanks for the reply. I tried removing the entire cache and reloading the dataset as you suggested. However, the same issue still exists.
As a test, I switched to a new platform (a Windows system) that had never downloaded a Hugging Face dataset before, and the dataset loaded successfully. So the "corrupted cache" explanation makes sense. Besides ~/.cache/huggingface, is there any other directory that may hold cached files?
As a side note, I am using datasets==2.20.0 and the proxy export HF_ENDPOINT=https://hf-mirror.com.
Hi @ZhangGe6,
As far as I know, that directory is the only one where the cache is saved, unless you configured another one. You can check it:
import datasets.config
print(datasets.config.HF_CACHE_HOME)
# ~/.cache/huggingface
print(datasets.config.HF_DATASETS_CACHE)
# ~/.cache/huggingface/datasets
print(datasets.config.HF_MODULES_CACHE)
# ~/.cache/huggingface/modules
print(datasets.config.DOWNLOADED_DATASETS_PATH)
# ~/.cache/huggingface/datasets/downloads
print(datasets.config.EXTRACTED_DATASETS_PATH)
# ~/.cache/huggingface/datasets/downloads/extracted
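If you prefer to clear only the entries for this one dataset rather than the whole cache, you can locate its directory first. A minimal sketch, assuming the default cache layout where each Hub dataset lives under a `<namespace>___<name>` directory (`dataset_cache_dir` is a hypothetical helper, not part of the datasets API):

```python
from pathlib import Path

def dataset_cache_dir(repo_id: str,
                      cache_root: str = "~/.cache/huggingface/datasets") -> Path:
    # Assumption: `datasets` caches each Hub dataset under a directory
    # named <namespace>___<name>, e.g. allenai/c4 -> allenai___c4.
    namespace, _, name = repo_id.partition("/")
    dirname = f"{namespace}___{name}" if name else namespace
    return Path(cache_root).expanduser() / dirname

# Inspect the candidate directory before deleting anything.
c4_dir = dataset_cache_dir("allenai/c4")
print(c4_dir, c4_dir.exists())
```

Checking `exists()` first lets you confirm you are removing the right directory instead of wiping the entire cache.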
Additionally, datasets uses huggingface_hub, but its cache directory should also be inside ~/.cache/huggingface, unless you configured another one. You can check it:
import huggingface_hub.constants
print(huggingface_hub.constants.HF_HOME)
# ~/.cache/huggingface
print(huggingface_hub.constants.HF_HUB_CACHE)
# ~/.cache/huggingface/hub
@albertvillanova I checked the directories you listed, and they match the ones you provided. I am going to look for more clues and will post what I find here.
I've had a similar problem, and for some reason decreasing the number of workers in the dataloader solved it.
Same issue.
Hi folks. I finally found that it is a network issue that makes the Hugging Face Hub unreachable (in China).
When I run the following script
from datasets import load_dataset
ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")
without setting export HF_ENDPOINT=https://hf-mirror.com, I get the following error log:
Traceback (most recent call last):
File ".\demo.py", line 8, in <module>
ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")
File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2594, in load_dataset
builder_instance = load_dataset_builder(
File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2266, in load_dataset_builder
dataset_module = dataset_module_factory(
File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 1914, in dataset_module_factory
raise e1 from None
File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 1845, in dataset_module_factory
raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({e.__class__.__name__})") from e
ConnectionError: Couldn't reach 'allenai/c4' on the Hub (ConnectionError)
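A ConnectionError like this can be separated from a genuinely missing file with a quick reachability probe before calling load_dataset. A standard-library-only sketch (`endpoint_reachable` is a hypothetical helper; it only tests HTTP connectivity, not whether the dataset itself is served):

```python
import urllib.error
import urllib.request

def endpoint_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP(S) request to `url` gets any response at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # The server answered (even with a 4xx/5xx status), so it is reachable.
        return True
    except Exception:
        # DNS failure, timeout, blocked connection, ...
        return False

# e.g. compare endpoint_reachable("https://huggingface.co")
# against the mirror URL you set in HF_ENDPOINT.
```

If the probe fails for the default endpoint but succeeds for a mirror, the problem is network reachability rather than the dataset or the cache.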
After setting export HF_ENDPOINT=https://hf-mirror.com, I get the following error, which is exactly the same as the one we are debugging in this issue:
Downloading readme: 41.1kB [00:00, 41.1MB/s]
Traceback (most recent call last):
File ".\demo.py", line 8, in <module>
ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")
File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2594, in load_dataset
builder_instance = load_dataset_builder(
File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2266, in load_dataset_builder
dataset_module = dataset_module_factory(
raise FileNotFoundError(
FileNotFoundError: Couldn't find a dataset script at C:\Users\ZhangGe\Desktop\allenai\c4\c4.py or any data file in the same directory. Couldn't find 'allenai/c4' on the Hugging Face Hub either: FileNotFoundError: Unable to find 'hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-validation.00003-of-00008.json.gz' with any supported extension ['.csv', '.tsv', '.json', '.jsonl', '.parquet', ... (same extension list as in the first error above)]
Using proxy software that bypasses the internet access restrictions imposed in China, I can download the dataset using the same script:
Downloading readme: 100%|███████████████████████████████████████████| 41.1k/41.1k [00:00<00:00, 312kB/s]
Downloading data: 100%|████████████████████████████████████████████| 40.7M/40.7M [00:19<00:00, 2.07MB/s]
Generating validation split: 45576 examples [00:00, 54883.48 examples/s]
So allenai/c4 is still unreachable even after setting export HF_ENDPOINT=https://hf-mirror.com.
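As a side note, the endpoint override generally has to be in place before the first import of datasets / huggingface_hub, because the value is typically read once at import time; setting it later in the same process may have no effect. A sketch of that assumption:

```python
import os

# Set the endpoint BEFORE the first `import datasets` / `import huggingface_hub`.
# Assumption: the endpoint is read once when those modules are imported, so
# exporting HF_ENDPOINT in the shell before launching Python achieves the same.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

endpoint = os.environ.get("HF_ENDPOINT", "https://huggingface.co")
# from datasets import load_dataset  # imports after this point see the mirror
```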
I have created an issue to inform the maintainers of hf-mirror: padeoe/hf-mirror-site#30
Thanks for the investigation: so it is ultimately an issue with the specific endpoint you are using.
You properly opened an issue in their repo, so they can fix it.
I am closing this issue here.