`Dataset.with_format` behaves inconsistently with documentation
iansheng opened this issue
Describe the bug
The actual behavior of `Dataset.with_format` is inconsistent with the documentation:
https://huggingface.co/docs/datasets/use_with_pytorch#n-dimensional-arrays
https://huggingface.co/docs/datasets/v2.19.0/en/use_with_tensorflow#n-dimensional-arrays
If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
In particular, a PyTorch formatted dataset outputs nested lists instead of a single tensor.
A TensorFlow formatted dataset outputs a RaggedTensor instead of a single tensor.
However, I get a single tensor by default in both cases, which contradicts this description.
The current behavior actually seems more reasonable to me, so the documentation should be updated rather than the code.
Steps to reproduce the bug
>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': tensor([[1, 2],
        [3, 4]])}
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
       [3, 4]])>}
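For context on why a single tensor is possible here at all: each sample in the reproduction is rectangular (every row has the same length), so it fits a dense `(2, 2)` shape. The sketch below is a pure-Python illustration of that distinction only. The helper `regular_shape` is hypothetical and is not part of the `datasets` API, and it makes no claim about how `datasets` decides what to return.

```python
from typing import Any, Optional, Tuple


def regular_shape(value: Any) -> Optional[Tuple[int, ...]]:
    """Return the shape of a rectangular nested list, or None if it is ragged.

    Hypothetical helper for illustration; not part of the `datasets` API.
    """
    if not isinstance(value, list):
        return ()  # a scalar leaf has an empty shape
    child_shapes = [regular_shape(item) for item in value]
    if any(shape is None for shape in child_shapes):
        return None  # some child is already ragged
    if len(set(child_shapes)) > 1:
        return None  # children disagree on shape
    inner = child_shapes[0] if child_shapes else ()
    return (len(value),) + inner


sample = [[1, 2], [3, 4]]      # same values as ds[0]["data"] in the reproduction
ragged = [[1, 2], [3, 4, 5]]   # rows of different lengths

print(regular_shape(sample))  # (2, 2)
print(regular_shape(ragged))  # None
```

A sample like `ragged` has no dense shape, which is the kind of input for which nested lists or a `RaggedTensor` would be the natural representation.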
Expected behavior
>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': [tensor([1, 2]), tensor([3, 4])]}
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.RaggedTensor [[1, 2], [3, 4]]>}
Environment info
datasets==2.19.1
torch==2.1.0
tensorflow==2.13.1
Hi! It seems this paragraph of the documentation was outdated.
I fixed it in #6956.