huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

`Dataset.with_format` behaves inconsistently with documentation

iansheng opened this issue

Describe the bug

The actual behavior of `Dataset.with_format` is inconsistent with the documentation:
https://huggingface.co/docs/datasets/use_with_pytorch#n-dimensional-arrays
https://huggingface.co/docs/datasets/v2.19.0/en/use_with_tensorflow#n-dimensional-arrays

The documentation states:

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
In particular, a PyTorch formatted dataset outputs nested lists instead of a single tensor.
A TensorFlow formatted dataset outputs a RaggedTensor instead of a single tensor.

However, I get a single tensor by default in both cases, which contradicts this description.

The current behavior actually seems more reasonable to me, so the documentation should be updated.

Steps to reproduce the bug

>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': tensor([[1, 2],
        [3, 4]])}
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
       [3, 4]])>}
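
A single tensor is only possible here because each row of `data` is rectangular: every inner list has the same length. The pure-Python sketch below checks that property for the reproduction input. It is only an illustration and makes no claim about how `datasets` decides on the output type; `infer_shape` is a hypothetical helper, not part of any library.

```python
def infer_shape(value):
    """Return the shape of a rectangular nested list, or None if it is ragged.

    Hypothetical helper for illustration only; not part of `datasets`.
    """
    if not isinstance(value, list):
        return ()  # a scalar has an empty shape
    child_shapes = [infer_shape(item) for item in value]
    if any(shape is None for shape in child_shapes):
        return None  # a nested level is already ragged
    if len(set(child_shapes)) > 1:
        return None  # siblings disagree on shape, so this level is ragged
    inner = child_shapes[0] if child_shapes else ()
    return (len(value),) + inner


# Same input as in the reproduction steps.
data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

row_shape = infer_shape(data[0])            # one row of the "data" column
ragged_shape = infer_shape([[1, 2], [3]])   # inner lists of unequal length

print(row_shape)
print(ragged_shape)
```

For the first row this yields the same `(2, 2)` shape that the tensors above report, while the second input has no single shape and could not be packed into one dense tensor.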

Expected behavior

>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': [tensor([1, 2]), tensor([3, 4])]}
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.RaggedTensor [[1, 2], [3, 4]]>}

Environment info

datasets==2.19.1
torch==2.1.0
tensorflow==2.13.1

Hi! It seems this paragraph of the documentation was outdated.

I fixed it here: #6956