huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

Invalid YAML in README.md: unknown tag !<tag:yaml.org,2002:python/tuple>

juanqui opened this issue

Describe the bug

I wrote a notebook that loads an existing dataset, processes it, and uploads the result as a private dataset with dataset.push_to_hub(...) at the end. The push fails with:

ValueError: Invalid metadata in README.md.
- Invalid YAML in README.md: unknown tag !<tag:yaml.org,2002:python/tuple> (50:11)

 47 |             - 4
 48 |             - 4
 49 |             - 8
 50 |           - !!binary |
----------------^
 51 |             TwAAAA==
 52 |           '1': !!python/object/apply:nump ...
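
One way to read the trace, offered as a diagnostic sketch rather than a confirmed root cause: the `!!binary` payload on line 51 of the YAML excerpt is base64, so the standard library can show what it holds. The sketch assumes the bytes are UTF-32 little-endian text, which is how NumPy typically stores fixed-width Unicode strings on little-endian machines; that encoding is an assumption, not something the trace states.

```python
import base64

# Payload copied from line 51 of the YAML excerpt in the error message.
payload = "TwAAAA=="

# Decode the base64 text into raw bytes.
raw = base64.b64decode(payload)

# Assumption: the bytes are UTF-32 little-endian text (4 bytes per character).
decoded = raw.decode("utf-32-le")

print(raw)      # the four raw bytes
print(decoded)  # the character they encode under the assumption above
```

Under that assumption the payload decodes to `O`, which matches the first label name. Together with the `!!python/object/apply:nump ...` tag, this suggests (but does not prove) that the label names reached the YAML metadata as NumPy string scalars rather than built-in `str` objects.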

My dataset has train and validation splits. These are the features:

{'c1': Value(dtype='string', id=None),
 'c2': Value(dtype='string', id=None),
 'c3': [{'value': Value(dtype='string', id=None),
   'start': Value(dtype='int64', id=None),
   'end': Value(dtype='int64', id=None),
   'label': Value(dtype='string', id=None)}],
 'c4': Value(dtype='string', id=None),
 'c5': Value(dtype='string', id=None),
 'c6': Value(dtype='string', id=None),
 'c7': Value(dtype='string', id=None),
 'c8': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'c9': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'c10': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'labels': Sequence(feature=ClassLabel(names=['O', 'B-ABC', 'I-ABC', ...], id=None), length=-1, id=None),
 'c12': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

This used to work until I decided to cast the labels feature to a Sequence(ClassLabel(...)) type with:

ds['train'] = ds['train'].cast_column("labels", Sequence(ClassLabel(names=list(labels))))
ds['validation'] = ds['validation'].cast_column("labels", Sequence(ClassLabel(names=list(labels))))
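
If `labels` happens to be a NumPy array or to contain NumPy string scalars (an assumption; this report does not show how `labels` is built), `list(labels)` would keep those scalar types, and they may not serialize to plain YAML strings. A minimal sketch of normalizing the names to built-in `str` first is shown here. `to_plain_names` and `FakeScalar` are hypothetical names used for illustration only; they are not part of `datasets` or NumPy.

```python
from typing import Iterable, List


def to_plain_names(labels: Iterable[object]) -> List[str]:
    """Return label names as built-in str objects, preserving order."""
    # str() applied to an instance of a str subclass yields an exact built-in str.
    return [str(name) for name in labels]


class FakeScalar(str):
    """Stand-in for a str subclass such as a NumPy string scalar (illustration only)."""


# Hypothetical input: label names that are str subclasses, not plain str.
names = to_plain_names([FakeScalar("O"), FakeScalar("B-ABC"), FakeScalar("I-ABC")])

# Show the resulting element types.
print([type(n).__name__ for n in names])
```

The normalized list could then be passed as `ClassLabel(names=to_plain_names(labels))` in the two `cast_column` calls above. Whether this resolves the `push_to_hub` error depends on the assumption about `labels` holding true.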

Steps to reproduce the bug

  1. Start with any token classification dataset.
  2. Add a labels column with data such as [0,0,0,12,13,13,13,0,0].
  3. Cast the labels column from Sequence to Sequence(ClassLabel) with:
labels = ['O', 'B-TEST', 'I-TEST']
ds = ds.cast_column("labels", Sequence(ClassLabel(names=labels)))
  4. Push to hub with ds.push_to_hub("me/awesome-stuff-dataset")

Expected behavior

I expected push_to_hub to successfully push my dataset to the hub without error.

Environment info

Python 3.11.9

datasets==2.19.1
transformers==4.41.1
PyYAML==6.0.1