JayYip / m3tl

BERT for Multitask Learning

Home Page: https://jayyip.github.io/m3tl/

How to prepare two sequences as input for bert-multitask-learning?

rudra0713 opened this issue · comments

Hi, I have a dataset that involves two sequences, and the task is to classify the sequence pair. I am not sure how to prepare the input in this case. So far, I have been working with only one sequence, using the following format:

["Everyone", "should", "be", "happy", "."]

How do I extend this to two sequences? Do I have to insert a "[SEP]" token myself?
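For context, a single-sequence setup along these lines can be sketched as below, following the (inputs, labels) return convention shown in the reply that follows; the import path, function name, and label value are assumptions, not the library's documented example:

from bert_multitask_learning import preprocessing_fn  # assumed import path; adjust to your installed version

@preprocessing_fn
def single_seq_cls(params, mode):
    # One token list per example, paired with one label per example.
    inputs = [["Everyone", "should", "be", "happy", "."]]
    labels = ["true"]
    return inputs, labels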

Sorry, I misread your question. You can prepare something like:

@preprocessing_fn
def proc_fn(params, mode):
    # Return (inputs, labels): each element of inputs is a dict whose 'a' and 'b'
    # keys hold the two token sequences; labels has one label per example.
    return [{'a': ["Everyone", "should", "be", "happy", "."], 'b': ["you're", "right"]}], ['true']
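Extending the same convention to a whole pair-classification dataset could look roughly like the sketch below; the rows variable and its layout are made up for illustration. The library is expected to add [CLS]/[SEP] and join the 'a' and 'b' segments itself (as the tokenized output later in this thread shows), so no manual [SEP] should be needed:

@preprocessing_fn  # same decorator as above
def pair_cls(params, mode):
    # Hypothetical in-memory dataset: (tokens_a, tokens_b, label) per row.
    rows = [
        (["Everyone", "should", "be", "happy", "."], ["you're", "right"], "true"),
        (["Marriage", "came", "from", "religion", "."], ["That", "is", "debatable", "."], "false"),
    ]
    inputs = [{'a': a, 'b': b} for a, b, _ in rows]
    labels = [label for _, _, label in rows]
    return inputs, labels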

I prepared two sequences following your format. Here's an example:

{'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'], 'b': ["you're", 'right']}

After printing tokens in the add_special_tokens_with_seqs function in utils.py, I got this:

tokens -> ['[CLS]', 'a', 'b', '[SEP]']

I was expecting 'a' and 'b' to be replaced by the original sequences. Is this okay?

For a single-sequence task, when I printed tokens, I got the desired output:

tokens -> ['[CLS]', 'marriage', 'came', 'from', 'religion', '.', '[SEP]']
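That collapsed output matches what plain Python produces when the pair dict itself is handed to code expecting a list of tokens: iterating a dict yields its keys ('a', 'b'), not the two sequences. A minimal illustration of the symptom, independent of the library's actual code path:

example = {'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'], 'b': ["you're", 'right']}

# Iterating the dict yields only its keys, which would then be tokenized.
print(list(example))                          # ['a', 'b']
print(['[CLS]'] + list(example) + ['[SEP]'])  # ['[CLS]', 'a', 'b', '[SEP]']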

Maybe it's a bug. Could you confirm that the example argument of create_single_problem_single_instance is a tuple like the one below?

({'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'], 'b': ["you're", 'right']}, 'some label')

After adding this print, this is what I found.

If the mode in the preprocessing function is 'train' or 'eval', the output of example aligns with what you mentioned:

example (from the create_single_problem_single_instance function) -> ({'a': ['we', 'Should', 'be', 'optimistic', 'about', 'the', 'future', '.'], 'b': ['Anything', 'that', 'improves', 'rush', 'hour', 'traffic', 'ca', "n't", 'be', 'all', 'that', 'bad', '.']}, 0)
tokens (from the add_special_tokens_with_seqs function) -> ['[CLS]', 'we', 'should', 'be', 'op', '##timi', '##stic', 'about', 'the', 'future', '.', '[SEP]', 'anything', 'that', 'improve', '##s', 'rus', '##h', 'hour', 'traffic', 'ca', 'n', "##'", '##t', 'be', 'all', 'that', 'bad', '.', '[SEP]']

But when the mode is 'infer', right before the accuracies for the particular task are printed, there is no print of 'example', and the tokens become:

tokens -> ['[CLS]', 'a', 'b', '[SEP]']

Also, for the same dataset and same split, I previously got 76% accuracy with a BERT model, but in the multitask setting, for that same task alone, I am getting only 48.71% accuracy.
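One user-side way to narrow this down (a sketch; pair_cls_debug and the example data are made up) is to log what the preprocessing function returns per mode. If the pair dicts look identical for 'train', 'eval', and 'infer', the collapse to ['[CLS]', 'a', 'b', '[SEP]'] points at the infer-mode path inside the library, which would also help explain the accuracy drop, since every example would then reduce to the same four tokens:

@preprocessing_fn  # same decorator as in the example above
def pair_cls_debug(params, mode):
    inputs = [{'a': ['we', 'should', 'be', 'optimistic', 'about', 'the', 'future', '.'],
               'b': ['anything', 'that', 'improves', 'rush', 'hour', 'traffic', "can't", 'be', 'bad', '.']}]
    labels = [0]
    # Temporary debug print: confirm the same structure is returned for every mode.
    print(mode, inputs[0])
    return inputs, labels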

But when the mode is 'infer', right before the accuracies for the particular task are printed, there is no print of 'example', and the tokens become:
tokens -> ['[CLS]', 'a', 'b', '[SEP]']

This is a bug. I'll fix it later.

Also, for the same dataset and same split, I previously got 76% accuracy with a BERT model, but in the multitask setting, for that same task alone, I am getting only 48.71% accuracy.

That's weird. Maybe it's caused by another bug. Could you provide more info?

Sorry, I accidentally closed this. Reopening now.