prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit


Three custom languages and two tasks — is this a good place to start?

jbmaxwell opened this issue · comments

commented

I have aligned datasets for three different custom languages. Each corpus is a flat text file where each line is a sentence, and documents are separated by empty lines. All sentences and documents match between the datasets. There are two tasks I'd like to be able to perform: 1) translate between the languages, and 2) infill sentences from any single language. For the translation task, given languages A, B, and C, it's actually not likely I'll ever go from C -> A or B -> A, but I definitely want to translate A -> B and A -> C. Other translations that would be helpful would be B -> C and C -> B.
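Just to make the data layout concrete, here's a minimal sketch (in Python, with a placeholder file name) of how I'm reading one corpus, assuming one sentence per line and blank lines between documents:

```python
# Minimal sketch, assuming one sentence per line and documents separated by
# blank lines; "corpus.A.txt" is a placeholder name for one of the corpora.
def load_documents(path):
    documents, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                current.append(line)
            elif current:  # a blank line closes the current document
                documents.append(current)
                current = []
    if current:
        documents.append(current)
    return documents

docs_a = load_documents("corpus.A.txt")  # list of documents, each a list of sentences
```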

From the MBART examples at HuggingFace it looks like MBartForConditionalGeneration could perhaps do task 1 (though maybe not in all directions listed above?), and BartForConditionalGeneration could do task 2. But is there any reason why MBartForConditionalGeneration couldn't do both? That is, if I pass an input with a <mask> token to MBART, will it perform the infilling, just as BART would? If so, then does your toolkit make sense as a place to start?
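For reference, the Hugging Face docs suggest MBartForConditionalGeneration can do this kind of mask filling directly. A rough sketch, using the public mbart-large-cc25 checkpoint only as a stand-in (my custom languages would obviously need their own pretrained model):

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

# Stand-in checkpoint; a model covering custom languages A/B/C would need to be
# pretrained first (e.g. with yanmtt) before this applies to them.
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")

# mBART is pretrained with text infilling, so a <mask> span can be filled in
# by generating in the chosen target language.
inputs = tokenizer("UN Chief Says There Is No <mask> in Syria", return_tensors="pt")
generated = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```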

Any thoughts very much appreciated.

commented

I've just started digging in to give yanmtt a try.

One question: my use case isn't natural language and is better suited to a pre-determined vocabulary (which I've already created), so I'm wondering if there's a way to use BertTokenizer with my custom vocab.txt for MBART training. I've done this successfully in the past with other models that don't normally use BertTokenizer (GPT-2 and RoBERTa).
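For context, what I've done with those other models is roughly the following, where vocab.txt is the pre-built vocabulary mentioned above (just a sketch of the approach, not something I've gotten working inside yanmtt):

```python
from transformers import BertTokenizer

# Sketch of the custom-vocab approach I've used elsewhere; "vocab.txt" is the
# pre-built, non-natural-language vocabulary described above.
tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)
print(tokenizer.tokenize("a sample sentence in the custom language"))
```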

BTW: I'm most interested in sentence infilling, though translation could also be useful.

commented

Okay, after hacking in a BertTokenizer I've hit the error mentioned in another issue (2 vs 5 args). I see now that BertTokenizer won't be possible, so I'll close this (and try creating a new tokenizer, as suggested).
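If it helps anyone else landing here, my understanding is that the "new tokenizer" route means training a SentencePiece model over the corpora first and building the subword tokenizer from that. Something roughly like this (file names and vocab size are placeholders):

```python
import sentencepiece as spm

# Rough sketch: train a SentencePiece model over the three corpora so a
# subword tokenizer can be built from it; paths and vocab_size are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.A.txt,corpus.B.txt,corpus.C.txt",
    model_prefix="custom_spm",
    vocab_size=8000,
    model_type="unigram",
)
```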

Hi,

Can you send me a chat request on Hangouts at prajdabre@gmail.com?

I'll be happy to explain and answer whatever questions you have, because it looks like your use case will need several rounds of back-and-forth interaction.

commented

Ah, okay. I've sent the invite.
I did find another script for training BART in the Hugging Face GitHub repo, but it doesn't seem to be quite what I need. I'm a little surprised that sentence infilling isn't a "bigger thing" than it is, but since it really doesn't seem to be a common task, it's kind of tricky to get the help needed to build a working system. Thanks.