luyug / Reranker

Build Text Rerankers with Deep Language Models

Problem with reading dataset

HerrKrishna opened this issue · comments

I tried to follow the training section of the readme.
I get the following error:

Traceback (most recent call last):
File "C:\Users\Christoph.Schneider\PycharmProjects\SentBertHelpDesk\try_reranker.py", line 22, in <module>
train_dataset = GroupedTrainDataset(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\reranker\data.py", line 31, in __init__
self.nlp_dataset = datasets.load_dataset(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\load.py", line 742, in load_dataset
builder_instance.download_and_prepare(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 574, in download_and_prepare
self._download_and_prepare(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 652, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 1041, in _prepare_split
for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\tqdm\std.py", line 1133, in __iter__
for obj in iterable:
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\packaged_modules\json\json.py", line 96, in _generate_tables
pa_table = pa_table.cast(self.config.schema)
File "pyarrow\table.pxi", line 1409, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['qry', 'pos', 'neg'], ['neg', 'pos', 'qry']
train.zip

I've attached the training file that I use. It follows the format described in the readme.
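The ValueError above reports two lists with the same field names in different orders. One way to test whether key order alone is the cause is to rewrite each line of the JSON-lines file with its keys in a chosen order. The sketch below is only an illustration: `reorder_record` is a hypothetical helper, the sample record is made up, and the order passed in is an assumption to adjust to whatever the schema expects.

```python
import json
from typing import List


def reorder_record(line: str, key_order: List[str]) -> str:
    """Re-serialize one JSON object with its top-level keys in key_order.

    Keys not listed in key_order are dropped; a missing key raises KeyError.
    """
    record = json.loads(line)
    missing = [k for k in key_order if k not in record]
    if missing:
        raise KeyError(f"record is missing keys: {missing}")
    # Python dicts preserve insertion order, so json.dumps emits keys as built here.
    return json.dumps({k: record[k] for k in key_order})


# Made-up record for illustration only.
example = '{"qry": {"qid": "q1", "query": [101, 2054]}, "pos": [], "neg": []}'
reordered = reorder_record(example, ["neg", "pos", "qry"])
print(reordered)
```

Applying this to every line of the training file and writing the results to a new file leaves the original data untouched while the key order is being tested.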

What version of datasets are you using?

Thank you for helping. I'm using datasets 1.8.0.
I've reordered neg, pos, and qry. Now I get this error:

Traceback (most recent call last):
File "C:\Users\Christoph.Schneider\PycharmProjects\SentBertHelpDesk\try_reranker.py", line 25, in <module>
train_dataset = GroupedTrainDataset(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\reranker\data.py", line 31, in __init__
self.nlp_dataset = datasets.load_dataset(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\load.py", line 742, in load_dataset
builder_instance.download_and_prepare(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 574, in download_and_prepare
self._download_and_prepare(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 652, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 1041, in _prepare_split
for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\tqdm\std.py", line 1133, in __iter__
for obj in iterable:
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\packaged_modules\json\json.py", line 96, in _generate_tables
pa_table = pa_table.cast(self.config.schema)
File "pyarrow\table.pxi", line 1414, in pyarrow.lib.Table.cast
File "pyarrow\table.pxi", line 277, in pyarrow.lib.ChunkedArray.cast
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\pyarrow\compute.py", line 281, in cast
return call_function("cast", [arr], options)
File "pyarrow\_compute.pyx", line 465, in pyarrow._compute.call_function
File "pyarrow\_compute.pyx", line 294, in pyarrow._compute.Function.call
File "pyarrow\error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<qid: string, passage: list<item: int64>> to struct using function cast_struct

Can you help with that?

Please first try our tested environment setup (torch==1.6.0, transformers==4.2.0, datasets==1.1.3, plus pyarrow==2.0.0) to see where the regression comes from. Meanwhile, your data does not seem to be in the correct format.
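A quick way to check the data format before loading is to validate each record against the expected layout. The sketch below assumes the layout from the Reranker readme is `qry` as an object with `qid` (string) and `query` (list of token ids), and `pos` and `neg` as lists of objects with `pid` (string) and `passage` (list of token ids). That layout is an assumption here and should be confirmed against the readme; `check_record` is a hypothetical helper, not part of the library. Under that assumption, the struct in the second error, which pairs `qid` with `passage`, would not match either shape.

```python
from typing import Any, Dict, List


def _is_int_list(value: Any) -> bool:
    """True if value is a list containing only ints (bools excluded)."""
    return isinstance(value, list) and all(
        isinstance(t, int) and not isinstance(t, bool) for t in value
    )


def check_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of format problems for one record (empty if none found)."""
    problems: List[str] = []
    qry = record.get("qry")
    if not isinstance(qry, dict):
        problems.append("qry must be an object")
    else:
        if not isinstance(qry.get("qid"), str):
            problems.append("qry.qid must be a string")
        if not _is_int_list(qry.get("query")):
            problems.append("qry.query must be a list of token ids")
    for field in ("pos", "neg"):
        passages = record.get(field)
        if not isinstance(passages, list):
            problems.append(f"{field} must be a list")
            continue
        for i, item in enumerate(passages):
            if not isinstance(item, dict):
                problems.append(f"{field}[{i}] must be an object")
                continue
            if not isinstance(item.get("pid"), str):
                problems.append(f"{field}[{i}].pid must be a string")
            if not _is_int_list(item.get("passage")):
                problems.append(f"{field}[{i}].passage must be a list of token ids")
    return problems


# Made-up records: the first follows the assumed layout, the second puts
# "passage" inside qry, mirroring the struct shown in the error message.
good = {"qry": {"qid": "q1", "query": [101, 2054]},
        "pos": [{"pid": "p1", "passage": [101, 3000]}],
        "neg": [{"pid": "p2", "passage": [101, 4000]}]}
bad = {"qry": {"qid": "q1", "passage": [101, 2054]}, "pos": [], "neg": []}
print(check_record(good))
print(check_record(bad))
```

Running `check_record` over `json.loads(line)` for each line of the training file lists the offending records before `datasets.load_dataset` is ever called.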