huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page:https://huggingface.co/docs/datasets

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

load_dataset with data_dir and cache_dir set fail with not supported

fah opened this issue · comments

commented

Describe the bug

with python 3.11 I execute:

from transformers import Wav2Vec2Processor, Data2VecAudioModel
import torch
from torch import nn
from datasets import load_dataset, concatenate_datasets

# load demo audio and set processor
dataset_clean = load_dataset("librispeech_asr", "clean", split="validation", data_dir="data", cache_dir="cache")

This fails in the last line with

Found cached dataset librispeech_asr (file:///Users/as/Documents/Project/git/audio2vec/cache/librispeech_asr/clean-data_dir=data/2.1.0/cff5df6e7955c80a67f80e27e7e655de71c689e2d2364bece785b972acb37fe7)
Traceback (most recent call last):
  File "/Users/as/Documents/Project/git/audio2vec/src/music2vec-v1.py", line 7, in <module>
    dataset_clean = load_dataset("librispeech_asr", "clean", split="validation", data_dir="data", cache_dir="cache")
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/as/anaconda3/lib/python3.11/site-packages/datasets/load.py", line 1810, in load_dataset
    ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/as/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1113, in as_dataset
    raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

Steps to reproduce the bug

I setup an venv with requirements.txt

transformers==4.40.2
torch==2.2.2
datasets==2.16.0
fsspec==2023.9.2

pip freeze is:

aiohttp==3.9.5
aiosignal==1.3.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
datasets==2.16.0
dill==0.3.7
filelock==3.14.0
frozenlist==1.4.1
fsspec==2023.9.2
huggingface-hub==0.23.0
idna==3.7
Jinja2==3.1.4
MarkupSafe==2.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
networkx==3.3
numpy==1.26.4
packaging==24.0
pandas==2.2.2
pyarrow==16.0.0
pyarrow-hotfix==0.6
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.4.28
requests==2.31.0
safetensors==0.4.3
six==1.16.0
sympy==1.12
tokenizers==0.19.1
torch==2.2.2
tqdm==4.66.4
transformers==4.40.2
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
xxhash==3.4.1
yarl==1.9.4

I execute this on a M1 Mac.

Expected behavior

I don't understand the error message. Why is "local" caching not supported. Would it possible to give some additional hint with the error message how to solve this issue?

Environment info

source ....
python -u example.py