Invalid YAML in README.md: unknown tag !<tag:yaml.org,2002:python/tuple>
juanqui opened this issue · comments
Juan Villa commented
Describe the bug
I wrote a notebook that loads an existing dataset, processes it, and uploads it as a private dataset using dataset.push_to_hub(...) at the end. The push to hub is failing with:
ValueError: Invalid metadata in README.md.
- Invalid YAML in README.md: unknown tag !<tag:yaml.org,2002:python/tuple> (50:11)
47 | - 4
48 | - 4
49 | - 8
50 | - !!binary |
----------------^
51 | TwAAAA==
52 | '1': !!python/object/apply:nump ...
My dataset has train and validation splits. These are the features:
{'c1': Value(dtype='string', id=None),
'c2': Value(dtype='string', id=None),
'c3': [{'value': Value(dtype='string', id=None),
'start': Value(dtype='int64', id=None),
'end': Value(dtype='int64', id=None),
'label': Value(dtype='string', id=None)}],
'c4': Value(dtype='string', id=None),
'c5': Value(dtype='string', id=None),
'c6': Value(dtype='string', id=None),
'c7': Value(dtype='string', id=None),
'c8': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
'c9': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
'c10': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
'labels': Sequence(feature=ClassLabel(names=['O', 'B-ABC', 'I-ABC', ...], id=None), length=-1, id=None),
'c12': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
This used to work until I decided to cast the labels feature to a Sequence(ClassLabel(...)) type with:
ds['train'] = ds['train'].cast_column("labels", Sequence(ClassLabel(names=list(labels))))
ds['validation'] = ds['validation'].cast_column("labels", Sequence(ClassLabel(names=list(labels))))
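One thing worth checking before the cast, as a sketch only: the `!!python/object/apply:nump ...` tag in the error hints that some label names may not be built-in `str` objects (for example numpy.str_), though the issue does not confirm this. The helpers `non_plain_names` and `to_plain_names` and the `StrLike` stand-in class below are hypothetical names introduced for illustration; they are not part of datasets or PyYAML.

```python
def non_plain_names(names):
    """Return (index, type name) for every entry that is not an exact built-in str."""
    return [(i, type(n).__name__) for i, n in enumerate(names) if type(n) is not str]


def to_plain_names(names):
    """Convert every entry to an exact built-in str."""
    return [str(n) for n in names]


class StrLike(str):
    """Hypothetical stand-in for a non-builtin string type such as numpy.str_."""


# Assumed input: a label list that mixes plain and non-plain strings.
labels = [StrLike("O"), "B-TEST", StrLike("I-TEST")]

print(non_plain_names(labels))  # entries that are not exact str

clean = to_plain_names(labels)
print(non_plain_names(clean))   # empty once every name is a plain str
```

If the check reports anything, the cleaned list could be passed to the cast instead, for example ClassLabel(names=to_plain_names(labels)). Whether this resolves the push_to_hub error is an assumption, not something the report establishes.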
Steps to reproduce the bug
- Start with any token classification dataset.
- Add a labels column with data such as [0,0,0,12,13,13,13,0,0].
- Cast the label column from Sequence to Sequence(ClassLabel) with:
labels = ['O', 'B-TEST', 'I-TEST']
ds = ds.cast_column("labels", Sequence(ClassLabel(names=labels)))
- Push to hub with ds.push_to_hub("me/awesome-stuff-dataset")
Expected behavior
I expected push_to_hub to successfully push my dataset to the hub without error.
Environment info
Python 3.11.9
datasets==2.19.1
transformers==4.41.1
PyYAML==6.0.1