huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

FileNotFoundError: error when loading C4 dataset

W-215 opened this issue · comments

Describe the bug

I can't load the C4 dataset.

When I switch the datasets package to 2.12.2, I instead get datasets.utils.info_utils.ExpectedMoreSplits: {'train'}.

How can I fix this?

Steps to reproduce the bug

1. from datasets import load_dataset
2. dataset = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation')
3. The call fails with:
FileNotFoundError: Couldn't find a dataset script at local_path/c4_val/allenai/c4/c4.py or any data file in the same directory. Couldn't find 'allenai/c4' on the Hugging Face Hub either: FileNotFoundError: Unable to find 'hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-validation.00003-of-00008.json.gz' with any supported extension ['.csv', '.tsv', '.json', '.jsonl', '.parquet', '.geoparquet', '.gpq', '.arrow', '.txt', '.tar', '.blp', '.bmp', '.dib', '.bufr', '.cur', '.pcx', '.dcx', '.dds', '.ps', '.eps', '.fit', '.fits', '.fli', '.flc', '.ftc', '.ftu', '.gbr', '.gif', '.grib', '.h5', '.hdf', '.png', '.apng', '.jp2', '.j2k', '.jpc', '.jpf', '.jpx', '.j2c', '.icns', '.ico', '.im', '.iim', '.tif', '.tiff', '.jfif', '.jpe', '.jpg', '.jpeg', '.mpg', '.mpeg', '.msp', '.pcd', '.pxr', '.pbm', '.pgm', '.ppm', '.pnm', '.psd', '.bw', '.rgb', '.rgba', '.sgi', '.ras', '.tga', '.icb', '.vda', '.vst', '.webp', '.wmf', '.emf', '.xbm', '.xpm', '.BLP', '.BMP', '.DIB', '.BUFR', '.CUR', '.PCX', '.DCX', '.DDS', '.PS', '.EPS', '.FIT', '.FITS', '.FLI', '.FLC', '.FTC', '.FTU', '.GBR', '.GIF', '.GRIB', '.H5', '.HDF', '.PNG', '.APNG', '.JP2', '.J2K', '.JPC', '.JPF', '.JPX', '.J2C', '.ICNS', '.ICO', '.IM', '.IIM', '.TIF', '.TIFF', '.JFIF', '.JPE', '.JPG', '.JPEG', '.MPG', '.MPEG', '.MSP', '.PCD', '.PXR', '.PBM', '.PGM', '.PPM', '.PNM', '.PSD', '.BW', '.RGB', '.RGBA', '.SGI', '.RAS', '.TGA', '.ICB', '.VDA', '.VST', '.WEBP', '.WMF', '.EMF', '.XBM', '.XPM', '.aiff', '.au', '.avr', '.caf', '.flac', '.htk', '.svx', '.mat4', '.mat5', '.mpc2k', '.ogg', '.paf', '.pvf', '.raw', '.rf64', '.sd2', '.sds', '.ircam', '.voc', '.w64', '.wav', '.nist', '.wavex', '.wve', '.xi', '.mp3', '.opus', '.AIFF', '.AU', '.AVR', '.CAF', '.FLAC', '.HTK', '.SVX', '.MAT4', '.MAT5', '.MPC2K', '.OGG', '.PAF', '.PVF', '.RAW', '.RF64', '.SD2', '.SDS', '.IRCAM', '.VOC', '.W64', '.WAV', '.NIST', '.WAVEX', '.WVE', '.XI', '.MP3', '.OPUS', '.zip']

Expected behavior

The dataset loads successfully.

Environment info

Python version: 3.9
datasets version: 2.19.2

same problem here

Hello,

Are you sure you are really using datasets version 2.19.2? We just made the patch release yesterday specifically to fix this issue.

I can't reproduce the error:

In [1]: from datasets import load_dataset

In [2]: ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation')
Downloading readme: 100%|████████████████████| 41.1k/41.1k [00:00<00:00, 596kB/s]
Downloading data: 100%|████████████████████| 40.7M/40.7M [00:04<00:00, 8.50MB/s]
Generating validation split: 45576 examples [00:01, 44956.75 examples/s]

In [3]: ds
Out[3]: 
Dataset({
    features: ['text', 'timestamp', 'url'],
    num_rows: 45576
})
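If in doubt, a quick way to confirm which version is actually being imported (for example, in case an older install shadows the new one) is to print it; a minimal check:

import datasets

print(datasets.__version__)
# should print 2.19.2 if the patched release is the one in use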

Thank you for your reply. The ExpectedMoreSplits error was encountered with datasets version 2.12.2. After I updated the version, that is, to datasets version 2.19.2, I encountered the FileNotFoundError problem mentioned above.

That might be due to a corrupted cache.

Please retry loading the dataset, passing download_mode="force_redownload":

ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")

If the above command does not fix the issue, then you will need to fix the cache manually, by removing the corresponding directory inside ~/.cache/huggingface/.
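For reference, a minimal sketch of what removing that directory programmatically could look like, assuming the default cache location and the usual namespace___name directory naming of the datasets cache (inspect what the glob matches before deleting anything):

import shutil
from pathlib import Path

import datasets.config

# Default cache root for prepared datasets: ~/.cache/huggingface/datasets
cache_root = Path(datasets.config.HF_DATASETS_CACHE)

# The per-dataset directory is usually named like "allenai___c4"; this
# pattern is an assumption, so double-check the matches before removal.
for path in cache_root.glob("allenai___c4*"):
    print("removing", path)
    shutil.rmtree(path)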

The two methods you mentioned above do not solve this problem: the command line interface shows Downloading readme: 41.1kB [00:00, 281kB/s], and then the FileNotFoundError appears. It is worth noting that I have no problem loading other datasets (such as wikitext) with the initial method.

Same issue encountered.

I really think the issue is caused by a cache corrupted somewhere between versions 2.12.0 (there is no 2.12.2 release) and 2.19.2.

Are you sure you removed all the corresponding corrupted directories within the cache?

You can easily check if the issue is caused by a corrupted cache by removing the entire cache:

mv ~/.cache/huggingface ~/.cache/huggingface.bak

and then reloading the dataset:

ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")

@albertvillanova Thanks for the reply. I tried removing the entire cache and reloading the dataset as you suggest. However, the same issue still exists.

As a test, I switched to a new platform (a Windows system) that had never downloaded a Hugging Face dataset before, and the dataset loaded successfully. So I think the "corrupted cache" explanation makes sense. I wonder, besides ~/.cache/huggingface, is there any other directory that may hold cached data?

As a side note, I am using datasets==2.20.0 and the mirror endpoint export HF_ENDPOINT=https://hf-mirror.com.

Hi @ZhangGe6,

As far as I know, that directory is the only one where the cache is saved, unless you configured another one. You can check it:

import datasets.config

print(datasets.config.HF_CACHE_HOME)
# ~/.cache/huggingface

print(datasets.config.HF_DATASETS_CACHE)
# ~/.cache/huggingface/datasets

print(datasets.config.HF_MODULES_CACHE)
# ~/.cache/huggingface/modules

print(datasets.config.DOWNLOADED_DATASETS_PATH)
# ~/.cache/huggingface/datasets/downloads

print(datasets.config.EXTRACTED_DATASETS_PATH)
# ~/.cache/huggingface/datasets/downloads/extracted

Additionally, datasets uses huggingface_hub, but its cache directory should also be inside ~/.cache/huggingface, unless you configured another one. You can check it:

import huggingface_hub.constants

print(huggingface_hub.constants.HF_HOME)
# ~/.cache/huggingface

print(huggingface_hub.constants.HF_HUB_CACHE)
# ~/.cache/huggingface/hub
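
If a different location had been configured, it would normally be through environment variables such as HF_HOME, HF_DATASETS_CACHE or HF_HUB_CACHE, set before the libraries are imported. A quick sketch to check whether any override is in effect:

import os

# Unset means the defaults shown above apply.
for var in ("HF_HOME", "HF_DATASETS_CACHE", "HF_HUB_CACHE", "HF_ENDPOINT"):
    print(var, "=", os.environ.get(var, "<not set>"))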

@albertvillanova I checked the directories you listed and found that they match the defaults you provided. I will look for more clues and update what I find here.

I've had a similar problem, and for some reason decreasing the number of workers in the dataloader solved it.

Same issue.

Hi folks. I finally found that this is a network issue that makes the Hugging Face Hub unreachable (in China).

I run the following script:

from datasets import load_dataset

ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")

Without setting export HF_ENDPOINT=https://hf-mirror.com, I get the following error log

Traceback (most recent call last):
  File ".\demo.py", line 8, in <module>
    ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2594, in load_dataset
    builder_instance = load_dataset_builder(
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2266, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 1914, in dataset_module_factory
    raise e1 from None
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 1845, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({e.__class__.__name__})") from e
ConnectionError: Couldn't reach 'allenai/c4' on the Hub (ConnectionError)

After setting export HF_ENDPOINT=https://hf-mirror.com, I get the following error, which is exactly the same as what we are debugging in this issue

Downloading readme: 41.1kB [00:00, 41.1MB/s]
Traceback (most recent call last):
  File ".\demo.py", line 8, in <module>
    ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2594, in loa    builder_instance = load_dataset_builder(
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2266, in load_dataset_builder
    dataset_module = dataset_module_factory(
    raise FileNotFoundError(
FileNotFoundError: Couldn't find a dataset script at C:\Users\ZhangGe\Desktop\allenai\c4\c4.py or any data file in the same directory. Couldn't find 'allenai/c4' on the Hugging Face Hub either: FileNotFoundError: Unable to find 'hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-validation.00003-of-00008.json.gz' with any supported extension ['.csv', '.tsv', '.json', '.jsonl', '.parquet', ...]

Using proxy software that bypasses the internet access restrictions imposed in China, I can download the dataset with the same script:

Downloading readme: 100%|███████████████████████████████████████████| 41.1k/41.1k [00:00<00:00, 312kB/s] 
Downloading data: 100%|████████████████████████████████████████████| 40.7M/40.7M [00:19<00:00, 2.07MB/s] 
Generating validation split: 45576 examples [00:00, 54883.48 examples/s]

So allenai/c4 is still unreachable even after setting export HF_ENDPOINT=https://hf-mirror.com.

I have created an issue to inform the maintainers of hf-mirror: padeoe/hf-mirror-site#30
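
For anyone debugging the same setup, one way to check whether a given endpoint can actually serve the repo is to query it directly with huggingface_hub (a sketch; the mirror URL is the one used above, and the result depends on what the mirror proxies):

from huggingface_hub import HfApi

# Point the client at the mirror used above instead of https://huggingface.co
api = HfApi(endpoint="https://hf-mirror.com")

try:
    info = api.dataset_info("allenai/c4")
    print("reachable, revision:", info.sha)
except Exception as err:
    print("not reachable via this endpoint:", err)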

Thanks for the investigation: so, in the end, it is an issue with the specific endpoint you are using.

You properly opened an issue in their repo, so they can fix it.

I am closing this issue here.