google-research / FLAN

TypeError: Invalid `datasets`. `datasets` must have compatible element specs.

bpucla opened this issue · comments

@shayne-longpre Thank you for sharing the reproducing script!

I got the following error when trying to reproduce flan2021_submix. The venv is the one you specified in the repo (flan/v2/requirements.txt). It seems that some datasets have different fields. Any suggestions would be appreciated!

Traceback (most recent call last):
  File "flan/v2/run_example.py", line 93, in <module>
    dataset = selected_mixture.get_dataset(
  File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/seqio/dataset_providers.py", line 1758, in get_dataset
    dataset = self._sample_fn(datasets, rates, sample_seed)
  File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py", line 371, in new_func
    return func(*args, **kwargs)
  File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/experimental/ops/interleave_ops.py", line 148, in sample_from_datasets_v2
    return dataset_ops.Dataset.sample_from_datasets(
  File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3571, in sample_from_datasets
    return sample_from_datasets_op._sample_from_datasets( # pylint: disable=protected-access
  File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/sample_from_datasets_op.py", line 119, in _sample_from_datasets
    return directed_interleave_op._directed_interleave( # pylint: disable=protected-access
  File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/directed_interleave_op.py", line 25, in _directed_interleave
    return _DirectedInterleaveDataset(
  File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/directed_interleave_op.py", line 50, in __init__
    raise TypeError(f"Invalid `datasets`. `datasets` must have compatible "
TypeError: Invalid `datasets`. `datasets` must have compatible element specs.
Dataset 0 element_spec={'_task_name': TensorSpec(shape=(), dtype=tf.string, name=None), '_task_source': TensorSpec(shape=(), dtype=tf.string, name=None), '_template_type': TensorSpec(shape=(), dtype=tf.string, name=None), '_template_idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'inputs_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'inputs': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'targets_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'targets': TensorSpec(shape=(None,), dtype=tf.int32, name=None)}.
Dataset 19 element_spec={'_template_type': TensorSpec(shape=(), dtype=tf.string, name=None), '_template_idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'inputs_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'inputs': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'targets_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'targets': TensorSpec(shape=(None,), dtype=tf.int32, name=None)}.

@bpucla I'm not sure why this would happen after reviewing the code. Are you able to load datasets one at a time and let me know which one(s) are missing _task_source and _task_name?

If you don't care about this metadata you could also remove this line of code: https://github.com/google-research/FLAN/blob/main/flan/v2/run_example.py#L99
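
For anyone running the per-dataset check suggested above, one option is to compare each dataset's feature names against the two metadata keys. The sketch below is only an illustration: `find_missing_metadata` and `specs_by_task` are hypothetical names, not part of FLAN or seqio. It assumes you have already collected each task's element spec as a dictionary (for example from the `element_spec` attribute of the `tf.data.Dataset` each task returns). The sample data reuses the feature names from the error message above.

```python
from typing import Dict, Iterable, Mapping, Set

# Metadata keys that the failing datasets lack, per the error message.
REQUIRED_KEYS = ("_task_name", "_task_source")


def find_missing_metadata(
    specs_by_task: Mapping[str, Mapping[str, object]],
    required: Iterable[str] = REQUIRED_KEYS,
) -> Dict[str, Set[str]]:
    """Return {task_name: missing_keys} for every task lacking a required key.

    `specs_by_task` maps a task name to its element spec (any mapping whose
    keys are feature names). Hypothetical helper, not a FLAN/seqio API.
    """
    required_set = set(required)
    report: Dict[str, Set[str]] = {}
    for name, spec in specs_by_task.items():
        missing = required_set - set(spec)
        if missing:
            report[name] = missing
    return report


# Feature names copied from "Dataset 0" and "Dataset 19" in the traceback.
# The values are placeholders; only the keys matter for this check.
common = [
    "_template_type", "_template_idx",
    "inputs_pretokenized", "inputs",
    "targets_pretokenized", "targets",
]
specs = {
    "dataset_0": dict.fromkeys(["_task_name", "_task_source"] + common),
    "dataset_19": dict.fromkeys(common),
}

report = find_missing_metadata(specs)
for task, missing in report.items():
    print(task, "is missing", sorted(missing))
```

Datasets that carry both keys do not appear in the report, so an empty result would mean the element specs agree on this metadata.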

@shayne-longpre Thank you for your prompt response! After playing around, I found that 6/64 datasets in flan2021_submix are missing _task_source and _task_name, and I can find the task names for the other 58/64. If you can point me to the full dataset list for flan2021_submix, I can take the set difference and tell you which 6 datasets have the issue.

I can find the dataset lists for other mixtures such as T0 (constants_t0.py) and niv2 (constants_niv2.py). I see there is a summary list for all mixtures, but I cannot find one specific to flan2021_submix.

Thanks so much for doing this! From the summary list you linked, just subtract the keys that start with cot_, stream_, t0_task_adaptation, tfds_natural_instructions, qrecc, and wiki_dialog.
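
The subtraction described above, followed by the set difference against the tasks that loaded correctly, could be sketched roughly as follows. This is a minimal sketch under assumptions: the function names are hypothetical, the summary keys are taken to be plain strings, and the sample keys are placeholders (two task names from this thread plus made-up entries), not the real summary list.

```python
# Prefixes to exclude, as listed in the comment above.
EXCLUDED_PREFIXES = (
    "cot_",
    "stream_",
    "t0_task_adaptation",
    "tfds_natural_instructions",
    "qrecc",
    "wiki_dialog",
)


def flan2021_candidates(summary_keys):
    """Keep only the summary keys that do not start with an excluded prefix."""
    # str.startswith accepts a tuple of prefixes.
    return {k for k in summary_keys if not k.startswith(EXCLUDED_PREFIXES)}


def tasks_with_issue(summary_keys, loaded_ok):
    """Set difference: candidate tasks minus the ones that loaded correctly."""
    return sorted(flan2021_candidates(summary_keys) - set(loaded_ok))


# Placeholder data for illustration only.
summary_keys = [
    "fix_punct_template_0to10_zero_shot",  # named later in this thread
    "true_case_template_0to10_zero_shot",  # named later in this thread
    "hypothetical_ok_task",                # made-up task that loads fine
    "cot_hypothetical_task",               # made-up, excluded by prefix
    "qrecc_hypothetical_task",             # made-up, excluded by prefix
]
loaded_ok = ["hypothetical_ok_task"]

problem_tasks = tasks_with_issue(summary_keys, loaded_ok)
print(problem_tasks)
```

With the real summary keys and the 58 task names that were found, the same difference should leave only the affected tasks.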

I faced the same issue. These are the tasks where _task_name and _task_source are missing:
'fix_punct_template_0to10_no_opt_zero_shot',
'fix_punct_template_0to10_zero_shot',
'opinion_abstracts_idebate_template_0to10_no_opt_zero_shot',
'opinion_abstracts_idebate_template_0to10_zero_shot',
'opinion_abstracts_rotten_tomatoes_template_0to10_no_opt_zero_shot',
'opinion_abstracts_rotten_tomatoes_template_0to10_zero_shot',
'para_crawl_enes_template_0to10_no_opt_zero_shot',
'para_crawl_enes_template_0to10_zero_shot',
'true_case_template_0to10_no_opt_zero_shot',
'true_case_template_0to10_zero_shot',
'word_segment_template_0to10_no_opt_zero_shot',
'word_segment_template_0to10_zero_shot'

@ari9dam Thanks for outlining these! I see where the issue is and will push a fix.

@ari9dam @bpucla This should have fixed it: #53

Still facing it.

Not sure if it is due to caching. How do I delete only those cached files?

I deleted the two folders from the cache that are specified as the source in the task_config. Still the same:
fix_punct_template_0to10_no_opt_zero_shot
fix_punct_template_0to10_zero_shot
opinion_abstracts_idebate_template_0to10_no_opt_zero_shot
opinion_abstracts_idebate_template_0to10_zero_shot
opinion_abstracts_rotten_tomatoes_template_0to10_no_opt_zero_shot
opinion_abstracts_rotten_tomatoes_template_0to10_zero_shot
para_crawl_enes_template_0to10_no_opt_zero_shot
para_crawl_enes_template_0to10_zero_shot
true_case_template_0to10_no_opt_zero_shot
true_case_template_0to10_zero_shot
word_segment_template_0to10_no_opt_zero_shot
word_segment_template_0to10_zero_shot

@ari9dam Hmm okay sorry! Reviewing this today. Thanks for re-flagging!

@ari9dam This seems to fix it, based on a quick test: #55