deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

Home Page: https://farm.deepset.ai

Problem creating DataSilo for multi task learning

johann-petrak opened this issue · comments

I am trying to get started with using FARM for multi task learning (for now, just two simple classification heads, but eventually I want to implement my own head layers for one of the two).

I am using the latest git master branch at bec0a9a for this.

Sadly I could not find any example or instructions on how to do this, so I tried the code below, which felt like the most logical approach: first define the things common to both heads (e.g. the text column) in the TextClassificationProcessor, then add the specific fields for each task:

from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import TextClassificationProcessor
from farm.data_handler.data_silo import DataSilo

tokenizer = Tokenizer.load(
    pretrained_model_name_or_path="bert-base-german-cased",
    do_lower_case=False)

LABEL_LIST_COARSE = ["OTHER", "OFFENSE"]
LABEL_LIST_FINE = ["OTHER", "ABUSE", "INSULT", "PROFANITY"]

mtl_processor = TextClassificationProcessor(tokenizer=tokenizer,
                                            max_seq_len=128,
                                            data_dir="../data",
                                            train_filename="germeval2019_ALL_cleaned.tsv",
                                            test_filename="germeval2019_ALL_cleaned.tsv",  # check resubstitution error!
                                            dev_split=0.1,
                                            text_column_name="text")
mtl_processor.add_task(name="coarse", label_list=LABEL_LIST_COARSE, metric="acc", label_column_name="coarse")
mtl_processor.add_task(name="fine", label_list=LABEL_LIST_FINE, metric="acc", label_column_name="fine")

BATCH_SIZE = 32

data_silo = DataSilo(
    processor=mtl_processor,
    batch_size=BATCH_SIZE)

This throws the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-8-8af0dfbe6ff0> in <module>
      1 BATCH_SIZE = 32
      2 
----> 3 data_silo = DataSilo(
      4     processor=mtl_processor,
      5     batch_size=BATCH_SIZE)

~/work-git/FARM-forked/farm/data_handler/data_silo.py in __init__(self, processor, batch_size, eval_batch_size, distributed, automatic_loading, max_multiprocessing_chunksize, max_processes, caching, cache_path)
    111             # In most cases we want to load all data automatically, but in some cases we rather want to do this
    112             # later or load from dicts instead of file (https://github.com/deepset-ai/FARM/issues/85)
--> 113             self._load_data()
    114 
    115     @classmethod

~/work-git/FARM-forked/farm/data_handler/data_silo.py in _load_data(self, train_dicts, dev_dicts, test_dicts)
    220             train_file = self.processor.data_dir / self.processor.train_filename
    221             logger.info("Loading train set from: {} ".format(train_file))
--> 222             self.data["train"], self.tensor_names = self._get_dataset(train_file)
    223         else:
    224             logger.info("No train set is being loaded")

~/work-git/FARM-forked/farm/data_handler/data_silo.py in _get_dataset(self, filename, dicts)
    139         # loading dicts from file (default)
    140         if dicts is None:
--> 141             dicts = list(self.processor.file_to_dicts(filename))
    142             #shuffle list of dicts here if we later want to have a random dev set splitted from train set
    143             if str(self.processor.train_filename) in str(filename):

~/work-git/FARM-forked/farm/data_handler/processor.py in file_to_dicts(self, file)
    604             column_mapping[task["label_column_name"]] = task["label_name"]
    605             column_mapping[task["text_column_name"]] = "text"
--> 606         dicts = read_tsv(
    607             filename=file,
    608             delimiter=self.delimiter,

~/work-git/FARM-forked/farm/data_handler/utils.py in read_tsv(filename, rename_columns, quotechar, delimiter, skiprows, header, proxies, max_samples)
     58     # read file into df - but only read those cols we need
     59     columns_needed = list(rename_columns.keys())
---> 60     df = pd.read_csv(
     61         filename,
     62         sep=delimiter,

~/.conda/envs/farm-dev/lib/python3.8/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    608     kwds.update(kwds_defaults)
    609 
--> 610     return _read(filepath_or_buffer, kwds)
    611 
    612 

~/.conda/envs/farm-dev/lib/python3.8/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    460 
    461     # Create the parser.
--> 462     parser = TextFileReader(filepath_or_buffer, **kwds)
    463 
    464     if chunksize or iterator:

~/.conda/envs/farm-dev/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    817             self.options["has_index_names"] = kwds["has_index_names"]
    818 
--> 819         self._engine = self._make_engine(self.engine)
    820 
    821     def close(self):

~/.conda/envs/farm-dev/lib/python3.8/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1048             )
   1049         # error: Too many arguments for "ParserBase"
-> 1050         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1051 
   1052     def _failover_to_python(self):

~/.conda/envs/farm-dev/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1861 
   1862         # GH20529, validate usecol arg before TextReader
-> 1863         self.usecols, self.usecols_dtype = _validate_usecols_arg(kwds["usecols"])
   1864         kwds["usecols"] = self.usecols
   1865 

~/.conda/envs/farm-dev/lib/python3.8/site-packages/pandas/io/parsers.py in _validate_usecols_arg(usecols)
   1239 
   1240         if usecols_dtype not in ("empty", "integer", "string"):
-> 1241             raise ValueError(msg)
   1242 
   1243         usecols = set(usecols)

ValueError: 'usecols' must either be list-like of all strings, all unicode, all integers or a callable.

Sadly pandas only complains about what it did not get and does not say what it actually got, but when I check the values of the variables
columns_needed and rename_columns I see:

columns_needed:  ['coarse', None, 'fine']
rename_columns {'coarse': 'coarse_label', None: 'text', 'fine': 'fine_label'}
header:  0

So for some reason None gets passed where "text" would probably be needed?
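
Looking at file_to_dicts in the traceback, it seems to build the column mapping from every registered task, so a task added via add_task without a text_column_name contributes None as a key. A rough sketch of the relevant bit, pieced together from the traceback (not the verbatim FARM source):

column_mapping = {}
for task in self.tasks.values():
    column_mapping[task["label_column_name"]] = task["label_name"]
    # If add_task() was called without text_column_name, this key is None
    column_mapping[task["text_column_name"]] = "text"

# read_tsv() then hands list(column_mapping.keys()) to pandas as usecols,
# and that stray None is what triggers the ValueError above.
dicts = read_tsv(
    filename=file,
    delimiter=self.delimiter,
    rename_columns=column_mapping)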

Note that when I add the parameters for the coarse task directly to the TextClassificationProcessor constructor and only add the fine task afterwards, columns_needed contains both text AND None in addition to coarse and fine.

So, am I doing this wrong? How is this supposed to work?

I know we have a good issue on MTL here: #724
Unfortunately I don't have access to the linked colab notebook that should give a good example of MTL in FARM. Maybe you can ask the author to give access, and maybe also write an MTL example in FARM? 😄

If you cannot find the information there, @tholor, could you look into the tasks and MTL setup described here, please?

I did not add this to #724 because that issue seems to be about the actual learning, while my problem is already with just loading the data for it.
I was hoping that the author(s) who designed the data loading strategy for multiple tasks could have a look at this.
Once the data can be loaded, I will be happy to share the actual MTL example if I get it to run (which I really need to happen).

I remember that when I still had access to the colab linked in #724, the data loading part was covered there as well... and the author wanted to create an example script for MTL.

Our "tasks" setup is not explicitly tested by us for MTL, we only used BertStyleLMProcessor for MTL preprocessing in one processor. @tholor might be able to help on how to use "tasks"

Is there an issue or document that describes the design of the whole MTL process, including the data management and the actual training/inference part?
It is a bit hard to start from just an example and the source code.

In FARM, unfortunately there is no such document. We did not work much with MTL in FARM to be honest...

OK, so when I try adding all possible parameters to the add_task invocation like this:

mtl_processor.add_task(name="coarse", 
                       task_type="classification",
                       label_list=LABEL_LIST_COARSE, 
                       metric="acc", 
                       text_column_name="text",
                       label_column_name="coarse")
mtl_processor.add_task(name="fine", 
                       task_type="classification",
                       label_list=LABEL_LIST_FINE, 
                       metric="acc", 
                       text_column_name="text",
                       label_column_name="fine")

Creating the data silo now works without an exception. This is one of the many cases where we need to update the documentation to 1) include information about defaults and 2) state which of the several kwargs are required.
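
For reference, judging by the rename_columns values printed earlier and the field names in the tracebacks, a registered task seems to end up looking roughly like this internally (an inferred sketch with partly assumed values, not the verbatim FARM source), which would explain why text_column_name and task_type have to be supplied explicitly when they are not set via the constructor:

{
    "label_list": LABEL_LIST_COARSE,
    "metric": "acc",
    "label_name": "coarse_label",              # matches the rename_columns output above
    "label_tensor_name": "coarse_label_ids",   # naming pattern assumed
    "label_column_name": "coarse",
    "text_column_name": "text",                # None when not passed to add_task()
    "task_type": "classification",             # None when not passed to add_task()
}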

BTW, if the task_type parameter is set to an incorrect value (e.g. "text_classification"), the following exception occurs instead of an error message about the allowed values:

Preprocessing Dataset ../data/germeval2019_ALL_cleaned.tsv:   0%|          | 0/15459 [00:00<?, ? Dicts/s]

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/johann/software/anaconda/envs/farm-dev/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/data/johann/work-git/FARM-forked/farm/data_handler/data_silo.py", line 132, in _dataset_from_chunk
    dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts, indices=indices)
  File "/data/johann/work-git/FARM-forked/farm/data_handler/processor.py", line 654, in dataset_from_dicts
    label_dict = self.convert_labels(dictionary)
  File "/data/johann/work-git/FARM-forked/farm/data_handler/processor.py", line 699, in convert_labels
    ret[task["label_tensor_name"]] = label_ids
UnboundLocalError: local variable 'label_ids' referenced before assignment
"""

The above exception was the direct cause of the following exception:

UnboundLocalError                         Traceback (most recent call last)
<ipython-input-16-8af0dfbe6ff0> in <module>
      1 BATCH_SIZE = 32
      2 
----> 3 data_silo = DataSilo(
      4     processor=mtl_processor,
      5     batch_size=BATCH_SIZE)

/data/johann/work-git/FARM-forked/farm/data_handler/data_silo.py in __init__(self, processor, batch_size, eval_batch_size, distributed, automatic_loading, max_multiprocessing_chunksize, max_processes, caching, cache_path)
    111             # In most cases we want to load all data automatically, but in some cases we rather want to do this
    112             # later or load from dicts instead of file (https://github.com/deepset-ai/FARM/issues/85)
--> 113             self._load_data()
    114 
    115     @classmethod

/data/johann/work-git/FARM-forked/farm/data_handler/data_silo.py in _load_data(self, train_dicts, dev_dicts, test_dicts)
    220             train_file = self.processor.data_dir / self.processor.train_filename
    221             logger.info("Loading train set from: {} ".format(train_file))
--> 222             self.data["train"], self.tensor_names = self._get_dataset(train_file)
    223         else:
    224             logger.info("No train set is being loaded")

/data/johann/work-git/FARM-forked/farm/data_handler/data_silo.py in _get_dataset(self, filename, dicts)
    183                 desc += f" {filename}"
    184             with tqdm(total=len(dicts), unit=' Dicts', desc=desc) as pbar:
--> 185                 for dataset, tensor_names, problematic_samples in results:
    186                     datasets.append(dataset)
    187                     # update progress bar (last step can have less dicts than actual chunk_size)

~/software/anaconda/envs/farm-dev/lib/python3.8/multiprocessing/pool.py in next(self, timeout)
    866         if success:
    867             return value
--> 868         raise value
    869 
    870     __next__ = next                    # XXX

UnboundLocalError: local variable 'label_ids' referenced before assignment
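
My guess is that convert_labels only assigns label_ids inside branches for the known task types, so an unrecognised task_type falls through to the final assignment with nothing set. A minimal, self-contained reproduction of that pattern (illustrative only, not FARM code):

def convert_labels_sketch(task_type, label, label_list):
    if task_type == "classification":
        label_ids = [label_list.index(label)]
    elif task_type == "regression":
        label_ids = [float(label)]
    # no else branch: an unknown task_type leaves label_ids unassigned
    return label_ids

# Raises UnboundLocalError, just like the traceback above:
convert_labels_sketch("text_classification", "OTHER", ["OTHER", "OFFENSE"])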

OK, turns out this is not a bug.

Thanks for self-serving here @johann-petrak 😄
Which settings solved the issue then?

My first attempt was to specify the text column in the constructor and then add a task using

mtl_processor.add_task(name="coarse", label_list=LABEL_LIST_COARSE, metric="acc", label_column_name="coarse")

(this was suggested somewhere in some release notes, I think).

The second attempt, which worked, was to instead include ALL possible parameters for the add_task method:

mtl_processor.add_task(name="coarse", 
                       task_type="classification",
                       label_list=LABEL_LIST_COARSE, 
                       metric="acc", 
                       text_column_name="text",
                       label_column_name="coarse")
mtl_processor.add_task(name="fine", 
                       task_type="classification",
                       label_list=LABEL_LIST_FINE, 
                       metric="acc", 
                       text_column_name="text",
                       label_column_name="fine")

I did not systematically try to find out which of the parameters I added was/were actually the crucial one(s), but I assume task_type should not be missing.

Nice, thanks! Hopefully other people will find this info as well.

Once I get the MTL example to run through, I will add it to the examples dir, which I hope will help.