huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

Support for pathlib.Path in datasets 2.19.0

lamyiowce opened this issue

Describe the bug

After the recent update of datasets, Dataset.save_to_disk no longer accepts a pathlib.Path. It was supported in 2.18.0 and previous versions. Is this intentional? Was it supported before only because of a Python duck-typing miracle?

Steps to reproduce the bug

from datasets import Dataset
import pathlib

path = pathlib.Path("./my_out_path")
Dataset.from_dict(
    {"text": ["hello world"], "label": [777], "split": ["train"]}
).save_to_disk(path)

This results in an error when using datasets 2.19.0:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/Users/jb/scratch/venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 1515, in save_to_disk
    fs, _ = url_to_fs(dataset_path, **(storage_options or {}))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jb/scratch/venv/lib/python3.11/site-packages/fsspec/core.py", line 383, in url_to_fs
    chain = _un_chain(url, kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jb/scratch/venv/lib/python3.11/site-packages/fsspec/core.py", line 323, in _un_chain
    if "::" in path
       ^^^^^^^^^^^^
TypeError: argument of type 'PosixPath' is not iterable
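
The failing line in the traceback is the `"::" in path` membership test. As far as I can tell, the same TypeError can be reproduced without datasets or fsspec installed, because pathlib objects do not support the `in` operator. The function name below is made up for illustration and is not part of fsspec; this is only a sketch of the check, not the library's actual code.

```python
import pathlib


def contains_chain_separator(path):
    """Mimic the `"::" in path` check shown in the traceback (illustrative only)."""
    return "::" in path


p = pathlib.PurePosixPath("./my_out_path")

try:
    contains_chain_separator(p)
    raised = False
except TypeError:
    # Path objects define neither __contains__ nor __iter__,
    # so the membership test fails.
    raised = True

# The same check on a plain string runs without error.
ok = contains_chain_separator(str(p))
```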

Converting to str works, however.

Dataset.from_dict(
    {"text": ["hello world"], "label": [777], "split": ["train"]}
).save_to_disk(str(path))
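
A slightly stricter variant of the workaround above could use `os.fspath` instead of `str`. This is a minimal sketch, and `to_path_str` is a hypothetical helper name, not something datasets provides. `os.fspath` returns strings unchanged and converts `os.PathLike` objects, but raises TypeError for anything else instead of silently stringifying it.

```python
import os
import pathlib


def to_path_str(path):
    """Return `path` as a plain string; accepts str or os.PathLike.

    Hypothetical helper for illustration, not part of datasets.
    """
    return os.fspath(path)


local = to_path_str(pathlib.PurePosixPath("my_out_path"))
unchanged = to_path_str("my_out_path")

# Intended usage (requires datasets to be installed):
# dataset.save_to_disk(to_path_str(path))
```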

Expected behavior

My dataset gets saved to disk without an error.

Environment info

aiohttp==3.9.5
aiosignal==1.3.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
datasets==2.19.0
dill==0.3.8
filelock==3.14.0
frozenlist==1.4.1
fsspec==2024.3.1
huggingface-hub==0.23.2
idna==3.7
multidict==6.0.5
multiprocess==0.70.16
numpy==1.26.4
packaging==24.0
pandas==2.2.2
pyarrow==16.1.0
pyarrow-hotfix==0.6
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
requests==2.32.3
six==1.16.0
tqdm==4.66.4
typing_extensions==4.12.0
tzdata==2024.1
urllib3==2.2.1
xxhash==3.4.1
yarl==1.9.4