JayYip / m3tl

BERT for Multitask Learning

Home Page: https://jayyip.github.io/m3tl/

How to prepare two sequences as input for bert-multitask-learning?

rudra0713 opened this issue · comments

Hi, I have a dataset that involves two sequences, and the task is to classify the sequence pair. I am not sure how to prepare the input in this case. So far, I have been working with only one sequence, using the following format:

["Everyone", "should", "be", "happy", "."]

How do I extend this to two sequences? Do I have to insert a "[SEP]" token myself?
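For context, a single-sequence setup along these lines can be sketched as below, following the (inputs, labels) return convention shown in the reply that follows; the import path, function name, and label value are assumptions, not the library's documented example:

from bert_multitask_learning import preprocessing_fn  # assumed import path; adjust to your installed version

@preprocessing_fn
def single_seq_cls(params, mode):
    # One token list per example, paired with one label per example.
    inputs = [["Everyone", "should", "be", "happy", "."]]
    labels = ["true"]
    return inputs, labels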

Sorry, I misread your question. You can prepare something like:

@preprocessing_fn
def proc_fn(params, mode):
    # Return (inputs, labels): each element of inputs is a dict whose 'a' and 'b'
    # keys hold the two token sequences; labels has one label per example.
    return [{'a': ["Everyone", "should", "be", "happy", "."], 'b': ["you're", "right"]}], ['true']
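Extending the same convention to a whole pair-classification dataset could look roughly like the sketch below; the rows variable and its layout are made up for illustration. The library is expected to add [CLS]/[SEP] and join the 'a' and 'b' segments itself (as the tokenized output later in this thread shows), so no manual [SEP] should be needed:

@preprocessing_fn  # same decorator as above
def pair_cls(params, mode):
    # Hypothetical in-memory dataset: (tokens_a, tokens_b, label) per row.
    rows = [
        (["Everyone", "should", "be", "happy", "."], ["you're", "right"], "true"),
        (["Marriage", "came", "from", "religion", "."], ["That", "is", "debatable", "."], "false"),
    ]
    inputs = [{'a': a, 'b': b} for a, b, _ in rows]
    labels = [label for _, _, label in rows]
    return inputs, labels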

I prepared two sequences following your format. Here's an example:

{'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'], 'b': ["you're", 'right']}

After printing tokens in the add_special_tokens_with_seqs function in utils.py, I got this:

tokens -> ['[CLS]', 'a', 'b', '[SEP]']

I was expecting 'a' and 'b' to be replaced by the original sequences. Is this okay?

For a single-sequence task, when I printed tokens, I got the desired output:

tokens -> ['[CLS]', 'marriage', 'came', 'from', 'religion', '.', '[SEP]']
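That collapsed output matches what plain Python produces when the pair dict itself is handed to code expecting a list of tokens: iterating a dict yields its keys ('a', 'b'), not the two sequences. A minimal illustration of the symptom, independent of the library's actual code path:

example = {'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'], 'b': ["you're", 'right']}

# Iterating the dict yields only its keys, which would then be tokenized.
print(list(example))                          # ['a', 'b']
print(['[CLS]'] + list(example) + ['[SEP]'])  # ['[CLS]', 'a', 'b', '[SEP]']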

Maybe it's a bug. Could you confirm that the example argument of create_single_problem_single_instance is a tuple like the one below?

({'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'], 'b': ["you're", 'right']}, 'some label')

After adding this print, this is what I found.

If the mode in the preprocessing function is 'train' or 'eval', the output of example aligns with what you mentioned:

example (from the create_single_problem_single_instance function) -> ({'a': ['we', 'Should', 'be', 'optimistic', 'about', 'the', 'future', '.'], 'b': ['Anything', 'that', 'improves', 'rush', 'hour', 'traffic', 'ca', "n't", 'be', 'all', 'that', 'bad', '.']}, 0)
tokens (from the add_special_tokens_with_seqs function) -> ['[CLS]', 'we', 'should', 'be', 'op', '##timi', '##stic', 'about', 'the', 'future', '.', '[SEP]', 'anything', 'that', 'improve', '##s', 'rus', '##h', 'hour', 'traffic', 'ca', 'n', "##'", '##t', 'be', 'all', 'that', 'bad', '.', '[SEP]']

But when the mode is 'infer', right before the accuracies for the particular task are printed, there is no print of 'example', and the tokens become:

tokens -> ['[CLS]', 'a', 'b', '[SEP]']

Also, for the same dataset and same split, I previously got 76% accuracy with a BERT model, but in the multitask setting, for that same task alone, I am getting only 48.71% accuracy.
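One user-side way to narrow this down (a sketch; pair_cls_debug and the example data are made up) is to log what the preprocessing function returns per mode. If the pair dicts look identical for 'train', 'eval', and 'infer', the collapse to ['[CLS]', 'a', 'b', '[SEP]'] points at the infer-mode path inside the library, which would also help explain the accuracy drop, since every example would then reduce to the same four tokens:

@preprocessing_fn  # same decorator as in the example above
def pair_cls_debug(params, mode):
    inputs = [{'a': ['we', 'should', 'be', 'optimistic', 'about', 'the', 'future', '.'],
               'b': ['anything', 'that', 'improves', 'rush', 'hour', 'traffic', "can't", 'be', 'bad', '.']}]
    labels = [0]
    # Temporary debug print: confirm the same structure is returned for every mode.
    print(mode, inputs[0])
    return inputs, labels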

But when the mode is 'infer', right before the accuracies for the particular task are printed, there is no print of 'example', and the tokens become:
tokens -> ['[CLS]', 'a', 'b', '[SEP]']

This is a bug. I'll fix it later.

Also, for the same dataset and same split, I previously got 76% accuracy with a BERT model, but in the multitask setting, for that same task alone, I am getting only 48.71% accuracy.

That's weird. Maybe it's caused by another bug. Could you provide more info?

Sorry, I accidentally closed this. Reopening now.