`Dataset.with_format` behaves inconsistently with documentation

Question

iansheng opened this issue 2 months ago

Describe the bug

The actual behavior of `Dataset.with_format` is inconsistent with the documentation:
https://huggingface.co/docs/datasets/use_with_pytorch#n-dimensional-arrays
https://huggingface.co/docs/datasets/v2.19.0/en/use_with_tensorflow#n-dimensional-arrays

The documentation says that if your dataset consists of N-dimensional arrays, they are treated as nested lists by default.
In particular, it says a PyTorch formatted dataset outputs nested lists instead of a single tensor,
and a TensorFlow formatted dataset outputs a RaggedTensor instead of a single tensor.

However, I get a single tensor by default with both formats, which contradicts that description.

The current behavior actually seems more reasonable to me, so I think the documentation should be updated rather than the code.

Steps to reproduce the bug

>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': tensor([[1, 2],
        [3, 4]])}
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
       [3, 4]])>}
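
To make the reproduction easier to re-run across versions, the check can be scripted. The following is only a minimal sketch: `describe_value` and `check_torch_format` are hypothetical helper names introduced for illustration, not part of `datasets`, and the sketch assumes that a single tensor can be recognized by its `.shape` attribute while the documented nested-list output would be a plain Python list.

```python
def describe_value(value):
    """Classify a formatted value as a single array-like or a nested list.

    Hypothetical helper for this issue; not part of the `datasets` API.
    """
    # torch.Tensor and numpy.ndarray both expose a `.shape` attribute.
    if hasattr(value, "shape"):
        return f"single array-like with shape {tuple(value.shape)}"
    # The outdated documentation described this case: a list of 1-D tensors.
    if isinstance(value, (list, tuple)):
        return f"nested list of {len(value)} items"
    return f"other ({type(value).__name__})"


def check_torch_format():
    """Re-run the torch part of the reproduction (needs `datasets` and `torch`)."""
    # Imported lazily so that describe_value stays usable without `datasets`.
    from datasets import Dataset

    data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    ds = Dataset.from_dict({"data": data}).with_format("torch")
    return describe_value(ds[0]["data"])
```

Calling `check_torch_format()` in the environment listed below should report a single array-like if the behavior matches the session above, and a nested list if it matches the documentation.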

Expected behavior

>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': [tensor([1, 2]), tensor([3, 4])]}
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.RaggedTensor [[1, 2], [3, 4]]>}

Environment info

datasets==2.19.1
torch==2.1.0
tensorflow==2.13.1

Quentin Lhoest · Answer 1 · Wed Jun 05 2024 00:40:40 GMT+0800 (China Standard Time)

Hi! It seems the documentation was outdated in this paragraph.

I fixed it here: #6956