huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

FileNotFoundError: error when loading C4 dataset

W-215 opened this issue · comments

Describe the bug

I can't load the C4 dataset.

When I switch the datasets package to 2.12.2, I instead get datasets.utils.info_utils.ExpectedMoreSplits: {'train'}.

How can I fix this?

Steps to reproduce the bug

1. from datasets import load_dataset
2. dataset = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation')
3. The call fails with:
FileNotFoundError: Couldn't find a dataset script at local_path/c4_val/allenai/c4/c4.py or any data file in the same directory. Couldn't find 'allenai/c4' on the Hugging Face Hub either: FileNotFoundError: Unable to find 'hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-validation.00003-of-00008.json.gz' with any supported extension ['.csv', '.tsv', '.json', '.jsonl', '.parquet', '.geoparquet', '.gpq', '.arrow', '.txt', '.tar', '.blp', '.bmp', '.dib', '.bufr', '.cur', '.pcx', '.dcx', '.dds', '.ps', '.eps', '.fit', '.fits', '.fli', '.flc', '.ftc', '.ftu', '.gbr', '.gif', '.grib', '.h5', '.hdf', '.png', '.apng', '.jp2', '.j2k', '.jpc', '.jpf', '.jpx', '.j2c', '.icns', '.ico', '.im', '.iim', '.tif', '.tiff', '.jfif', '.jpe', '.jpg', '.jpeg', '.mpg', '.mpeg', '.msp', '.pcd', '.pxr', '.pbm', '.pgm', '.ppm', '.pnm', '.psd', '.bw', '.rgb', '.rgba', '.sgi', '.ras', '.tga', '.icb', '.vda', '.vst', '.webp', '.wmf', '.emf', '.xbm', '.xpm', '.BLP', '.BMP', '.DIB', '.BUFR', '.CUR', '.PCX', '.DCX', '.DDS', '.PS', '.EPS', '.FIT', '.FITS', '.FLI', '.FLC', '.FTC', '.FTU', '.GBR', '.GIF', '.GRIB', '.H5', '.HDF', '.PNG', '.APNG', '.JP2', '.J2K', '.JPC', '.JPF', '.JPX', '.J2C', '.ICNS', '.ICO', '.IM', '.IIM', '.TIF', '.TIFF', '.JFIF', '.JPE', '.JPG', '.JPEG', '.MPG', '.MPEG', '.MSP', '.PCD', '.PXR', '.PBM', '.PGM', '.PPM', '.PNM', '.PSD', '.BW', '.RGB', '.RGBA', '.SGI', '.RAS', '.TGA', '.ICB', '.VDA', '.VST', '.WEBP', '.WMF', '.EMF', '.XBM', '.XPM', '.aiff', '.au', '.avr', '.caf', '.flac', '.htk', '.svx', '.mat4', '.mat5', '.mpc2k', '.ogg', '.paf', '.pvf', '.raw', '.rf64', '.sd2', '.sds', '.ircam', '.voc', '.w64', '.wav', '.nist', '.wavex', '.wve', '.xi', '.mp3', '.opus', '.AIFF', '.AU', '.AVR', '.CAF', '.FLAC', '.HTK', '.SVX', '.MAT4', '.MAT5', '.MPC2K', '.OGG', '.PAF', '.PVF', '.RAW', '.RF64', '.SD2', '.SDS', '.IRCAM', '.VOC', '.W64', '.WAV', '.NIST', '.WAVEX', '.WVE', '.XI', '.MP3', '.OPUS', '.zip']

Expected behavior

The dataset loads successfully.

Environment info

Python version: 3.9
datasets version: 2.19.2

same problem here

Hello,

Are you sure you are really using datasets version 2.19.2? We just made the patch release yesterday specifically to fix this issue.

I can't reproduce the error:

In [1]: from datasets import load_dataset

In [2]: ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation')
Downloading readme: 100%|████████████████████| 41.1k/41.1k [00:00<00:00, 596kB/s]
Downloading data: 100%|████████████████████| 40.7M/40.7M [00:04<00:00, 8.50MB/s]
Generating validation split: 45576 examples [00:01, 44956.75 examples/s]

In [3]: ds
Out[3]: 
Dataset({
    features: ['text', 'timestamp', 'url'],
    num_rows: 45576
})
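If in doubt, a quick way to confirm which version is actually being imported (for example, in case an older install shadows the new one) is to print it; a minimal check:

import datasets

print(datasets.__version__)
# should print 2.19.2 if the patched release is the one in use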

Thank you for your reply. The ExpectedMoreSplits error was encountered with datasets version 2.12.2. After I updated the version, that is, to datasets version 2.19.2, I encountered the FileNotFoundError problem mentioned above.

That might be due to a corrupted cache.

Please retry loading the dataset, passing download_mode="force_redownload":

ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")

If the above command does not fix the issue, then you will need to fix the cache manually, by removing the corresponding directory inside ~/.cache/huggingface/.
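For reference, a minimal sketch of what removing that directory programmatically could look like, assuming the default cache location and the usual namespace___name directory naming of the datasets cache (inspect what the glob matches before deleting anything):

import shutil
from pathlib import Path

import datasets.config

# Default cache root for prepared datasets: ~/.cache/huggingface/datasets
cache_root = Path(datasets.config.HF_DATASETS_CACHE)

# The per-dataset directory is usually named like "allenai___c4"; this
# pattern is an assumption, so double-check the matches before removal.
for path in cache_root.glob("allenai___c4*"):
    print("removing", path)
    shutil.rmtree(path)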

The two methods you mentioned above do not solve this problem: the command line interface shows Downloading readme: 41.1kB [00:00, 281kB/s], and then the FileNotFoundError appears. It is worth noting that I have no problem loading other datasets (such as wikitext) with the initial method.

Same issue encountered.

I really think the issue is caused by a cache corrupted somewhere between versions 2.12.0 (there is no 2.12.2 release) and 2.19.2.

Are you sure you removed all the corresponding corrupted directories within the cache?

You can easily check if the issue is caused by a corrupted cache by removing the entire cache:

mv ~/.cache/huggingface ~/.cache/huggingface.bak

and then reloading the dataset:

ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")

@albertvillanova Thanks for the reply. I tried removing the entire cache and reloading the dataset as you suggest. However, the same issue still exists.

As a test, I switched to a new platform (a Windows system) that had never downloaded a Hugging Face dataset before, and the dataset loaded successfully. So I think the "corrupted cache" explanation makes sense. I wonder, besides ~/.cache/huggingface, is there any other directory that may hold cached data?

As a side note, I am using datasets==2.20.0 and the mirror endpoint export HF_ENDPOINT=https://hf-mirror.com.

Hi @ZhangGe6,

As far as I know, that directory is the only one where the cache is saved, unless you configured another one. You can check it:

import datasets.config

print(datasets.config.HF_CACHE_HOME)
# ~/.cache/huggingface

print(datasets.config.HF_DATASETS_CACHE)
# ~/.cache/huggingface/datasets

print(datasets.config.HF_MODULES_CACHE)
# ~/.cache/huggingface/modules

print(datasets.config.DOWNLOADED_DATASETS_PATH)
# ~/.cache/huggingface/datasets/downloads

print(datasets.config.EXTRACTED_DATASETS_PATH)
# ~/.cache/huggingface/datasets/downloads/extracted

Additionally, datasets uses huggingface_hub, but its cache directory should also be inside ~/.cache/huggingface, unless you configured another one. You can check it:

import huggingface_hub.constants

print(huggingface_hub.constants.HF_HOME)
# ~/.cache/huggingface

print(huggingface_hub.constants.HF_HUB_CACHE)
# ~/.cache/huggingface/hub
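
If a different location had been configured, it would normally be through environment variables such as HF_HOME, HF_DATASETS_CACHE or HF_HUB_CACHE, set before the libraries are imported. A quick sketch to check whether any override is in effect:

import os

# Unset means the defaults shown above apply.
for var in ("HF_HOME", "HF_DATASETS_CACHE", "HF_HUB_CACHE", "HF_ENDPOINT"):
    print(var, "=", os.environ.get(var, "<not set>"))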

@albertvillanova I checked the directories you listed and found that they match the defaults you provided. I will look for more clues and update what I find here.

I've had a similar problem, and for some reason decreasing the number of workers in the dataloader solved it.

Same issue.

Hi folks. I finally found that this is a network issue that makes the Hugging Face Hub unreachable (in China).

I run the following script:

from datasets import load_dataset

ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")

Without setting export HF_ENDPOINT=https://hf-mirror.com, I get the following error log

Traceback (most recent call last):
  File ".\demo.py", line 8, in <module>
    ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2594, in load_dataset
    builder_instance = load_dataset_builder(
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2266, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 1914, in dataset_module_factory
    raise e1 from None
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 1845, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({e.__class__.__name__})") from e
ConnectionError: Couldn't reach 'allenai/c4' on the Hub (ConnectionError)

After setting export HF_ENDPOINT=https://hf-mirror.com, I get the following error, which is exactly the same as what we are debugging in this issue

Downloading readme: 41.1kB [00:00, 41.1MB/s]
Traceback (most recent call last):
  File ".\demo.py", line 8, in <module>
    ds = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation', download_mode="force_redownload")
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2594, in loa    builder_instance = load_dataset_builder(
  File "D:\SoftwareInstall\Python\lib\site-packages\datasets\load.py", line 2266, in load_dataset_builder
    dataset_module = dataset_module_factory(
    raise FileNotFoundError(
FileNotFoundError: Couldn't find a dataset script at C:\Users\ZhangGe\Desktop\allenai\c4\c4.py or any data file in the same directory. Couldn't find 'allenai/c4' on the Hugging Face Hub either: FileNotFoundError: Unable to find 'hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-validation.00003-of-00008.json.gz' with any supported extension ['.csv', '.tsv', '.json', '.jsonl', '.parquet', ...]

Using proxy software that bypasses the internet access restrictions imposed in China, I can download the dataset with the same script:

Downloading readme: 100%|███████████████████████████████████████████| 41.1k/41.1k [00:00<00:00, 312kB/s] 
Downloading data: 100%|████████████████████████████████████████████| 40.7M/40.7M [00:19<00:00, 2.07MB/s] 
Generating validation split: 45576 examples [00:00, 54883.48 examples/s]

So allenai/c4 is still unreachable even after setting export HF_ENDPOINT=https://hf-mirror.com.

I have created an issue to inform the maintainers of hf-mirror: padeoe/hf-mirror-site#30
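
For anyone debugging the same setup, one way to check whether a given endpoint can actually serve the repo is to query it directly with huggingface_hub (a sketch; the mirror URL is the one used above, and the result depends on what the mirror proxies):

from huggingface_hub import HfApi

# Point the client at the mirror used above instead of https://huggingface.co
api = HfApi(endpoint="https://hf-mirror.com")

try:
    info = api.dataset_info("allenai/c4")
    print("reachable, revision:", info.sha)
except Exception as err:
    print("not reachable via this endpoint:", err)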

Thanks for the investigation: so, in the end, it is an issue with the specific endpoint you are using.

You properly opened an issue in their repo, so they can fix it.

I am closing this issue here.