TypeError: Invalid `datasets`. `datasets` must have compatible element specs.
bpucla opened this issue · comments
@shayne-longpre Thank you for sharing the reproducing script!
I got the following error when I was trying to reproduce flan2021_submix. The venv is the one you specified in the repo (flan/v2/requirements.txt). It seems that some datasets have different fields. Any suggestions would be appreciated!
Traceback (most recent call last):
File "flan/v2/run_example.py", line 93, in <module>
dataset = selected_mixture.get_dataset(
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/seqio/dataset_providers.py", line 1758, in get_dataset
dataset = self._sample_fn(datasets, rates, sample_seed)
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py", line 371, in new_func
return func(*args, **kwargs)
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/experimental/ops/interleave_ops.py", line 148, in sample_from_datasets_v2
return dataset_ops.Dataset.sample_from_datasets(
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3571, in sample_from_datasets
return sample_from_datasets_op._sample_from_datasets( # pylint: disable=protected-access
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/sample_from_datasets_op.py", line 119, in _sample_from_datasets
return directed_interleave_op._directed_interleave( # pylint: disable=protected-access
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/directed_interleave_op.py", line 25, in _directed_interleave
return _DirectedInterleaveDataset(
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/directed_interleave_op.py", line 50, in __init__
raise TypeError(f"Invalid `datasets`. `datasets` must have compatible "
TypeError: Invalid `datasets`. `datasets` must have compatible element specs.
Dataset 0 element_spec={'_task_name': TensorSpec(shape=(), dtype=tf.string, name=None), '_task_source': TensorSpec(shape=(), dtype=tf.string, name=None), '_template_type': TensorSpec(shape=(), dtype=tf.string, name=None), '_template_idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'inputs_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'inputs': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'targets_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'targets': TensorSpec(shape=(None,), dtype=tf.int32, name=None)}.
Dataset 19 element_spec={'_template_type': TensorSpec(shape=(), dtype=tf.string, name=None), '_template_idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'inputs_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'inputs': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'targets_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'targets': TensorSpec(shape=(None,), dtype=tf.int32, name=None)}.
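The mismatch between the two element specs is easier to see when the field names are compared as sets. This is only a minimal sketch in plain Python: the field names are copied from the error message, and the variable names are illustrative rather than anything from the FLAN code.

```python
# Field names copied from the two element specs in the error message.
dataset_0_fields = {
    "_task_name", "_task_source", "_template_type", "_template_idx",
    "inputs_pretokenized", "inputs", "targets_pretokenized", "targets",
}
dataset_19_fields = {
    "_template_type", "_template_idx",
    "inputs_pretokenized", "inputs", "targets_pretokenized", "targets",
}

# Fields present in dataset 0 but absent from dataset 19.
missing_in_19 = sorted(dataset_0_fields - dataset_19_fields)
print(missing_in_19)  # ['_task_name', '_task_source']
```

The remaining six fields are identical in both specs, so the two metadata fields are the only difference.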
@bpucla I'm not sure why this would happen after reviewing the code. Are you able to load datasets one at a time and let me know which one(s) are missing _task_source and _task_name?
If you don't care about this metadata you could also remove this line of code: https://github.com/google-research/FLAN/blob/main/flan/v2/run_example.py#L99
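One way to run the per-dataset check suggested above is sketched below. It assumes the field names of each task's dataset have already been collected (for example from the keys of each dataset's `element_spec` after loading the tasks one at a time); the helper name `find_tasks_missing_metadata` and the task names `task_a` and `task_b` are hypothetical and not part of the FLAN repo.

```python
from typing import Dict, Iterable, List

# Metadata fields the mixture expects every task to provide.
REQUIRED_FIELDS = ("_task_name", "_task_source")


def find_tasks_missing_metadata(
    specs: Dict[str, Iterable[str]],
    required: Iterable[str] = REQUIRED_FIELDS,
) -> Dict[str, List[str]]:
    """Map each task name to the required fields its spec lacks."""
    required = tuple(required)
    report = {}
    for task_name, fields in specs.items():
        present = set(fields)
        missing = [name for name in required if name not in present]
        if missing:
            report[task_name] = missing
    return report


# Hypothetical input: field names per task, e.g. list(dataset.element_spec).
example_specs = {
    "task_a": ["_task_name", "_task_source", "_template_type", "inputs", "targets"],
    "task_b": ["_template_type", "inputs", "targets"],
}
print(find_tasks_missing_metadata(example_specs))
# {'task_b': ['_task_name', '_task_source']}
```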
@shayne-longpre Thank you for your prompt response! After playing around, I found that 6/64 datasets in flan2021_submix are missing _task_source and _task_name, and I can find task_names for the other 58/64. If you point out where I can find the whole dataset list for flan2021_submix, I can take the set difference and tell which 6 datasets have the issue.
I can find the dataset lists for other mixtures like T0 (constants_t0.py) and niv2 (constants_niv2.py). I see there is a summary list for all mixtures, but I cannot find the one specific to flan2021_submix.
Thanks so much for doing this! From the summary list you linked, just subtract keys that start with cot_, stream_, t0_task_adaptation, tfds_natural_instructions, qrecc, and wiki_dialog.
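The subtraction described above can be sketched as a simple prefix filter. This is an illustration only: the prefixes are the ones listed in the comment, while the function name `flan2021_candidates` and the sample keys are made up and do not come from the summary list itself.

```python
# Prefixes of keys that do not belong to flan2021_submix, per the comment above.
EXCLUDED_PREFIXES = (
    "cot_",
    "stream_",
    "t0_task_adaptation",
    "tfds_natural_instructions",
    "qrecc",
    "wiki_dialog",
)


def flan2021_candidates(all_keys):
    """Keep only the keys that start with none of the excluded prefixes."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return [key for key in all_keys if not key.startswith(EXCLUDED_PREFIXES)]


# Hypothetical sample keys, for illustration only.
sample_keys = [
    "cot_example",
    "fix_punct",
    "stream_example",
    "true_case",
    "wiki_dialog_example",
    "qrecc",
]
print(flan2021_candidates(sample_keys))  # ['fix_punct', 'true_case']
```

Comparing the filtered keys against the 58 task names that do carry metadata would then leave the tasks that are missing it.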
I faced the same issue. These are the tasks where _task_name and _task_source are missing:
'fix_punct_template_0to10_no_opt_zero_shot',
'fix_punct_template_0to10_zero_shot',
'opinion_abstracts_idebate_template_0to10_no_opt_zero_shot',
'opinion_abstracts_idebate_template_0to10_zero_shot',
'opinion_abstracts_rotten_tomatoes_template_0to10_no_opt_zero_shot',
'opinion_abstracts_rotten_tomatoes_template_0to10_zero_shot',
'para_crawl_enes_template_0to10_no_opt_zero_shot',
'para_crawl_enes_template_0to10_zero_shot',
'true_case_template_0to10_no_opt_zero_shot',
'true_case_template_0to10_zero_shot',
'word_segment_template_0to10_no_opt_zero_shot',
'word_segment_template_0to10_zero_shot'
@ari9dam Thanks for outlining these! I see where the issue is and will push a fix.
Still facing it.
I'm not sure whether it is due to caching. How do I delete only those cached files?
I deleted the two folders from the cache that are specified as the source in the task_config. Still the same:
fix_punct_template_0to10_no_opt_zero_shot
fix_punct_template_0to10_zero_shot
opinion_abstracts_idebate_template_0to10_no_opt_zero_shot
opinion_abstracts_idebate_template_0to10_zero_shot
opinion_abstracts_rotten_tomatoes_template_0to10_no_opt_zero_shot
opinion_abstracts_rotten_tomatoes_template_0to10_zero_shot
para_crawl_enes_template_0to10_no_opt_zero_shot
para_crawl_enes_template_0to10_zero_shot
true_case_template_0to10_no_opt_zero_shot
true_case_template_0to10_zero_shot
word_segment_template_0to10_no_opt_zero_shot
word_segment_template_0to10_zero_shot
@ari9dam Hmm okay sorry! Reviewing this today. Thanks for re-flagging!