prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit


Extending IndicBART or IndicBERT

singhakr opened this issue · comments

I basically want to pre-train models from scratch, including the tokenizer, for the languages included in IndicBART and IndicBERT plus some more languages, so as to build something like IndicBARTExt and IndicBERTExt.

While going through some related issues, I noticed that there are some conventions about language codes, and that it is possible to reuse pre-existing language codes for new languages.

Is there some way to add new language codes, say, for IndicBART/IndicBERT without much change to the Python code that is called from the pre-training shell script? Or will it require considerable changes?

Or, if I can continue pre-training of the IndicBART and IndicBERT models on some more Indian languages, preferably with additional language codes, that would be even better. I have a vague idea of how to do it, but not exactly in terms of the toolkit code. Any help or pointers would be appreciated.

There is already an IndicBERT v2 which supports 24 Indic languages: https://huggingface.co/ai4bharat/IndicBERTv2-SS
AFAIK it was trained from scratch.

As for IndicBARTv2, one is in the works and should be out soon.

Regarding what you want to do, you will need to figure out the following:

  1. Extend the AlbertTokenizer with additional language tokens.
  2. Corresponding to the IDs of the language indicator tokens, extend the embedding matrix of the IndicBART model checkpoint (the .pt or .bin file containing the weights).

This will involve no change to the codebase.
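
Something along these lines should work for those two steps (a rough sketch using the Hugging Face hub copy of IndicBART, not YANMTT itself; `<2sat>` and `<2kok>` are just placeholder codes for whatever new languages you add):

```python
from transformers import AlbertTokenizer, AutoModelForSeq2SeqLM

# Load the hub copy of IndicBART and its AlbertTokenizer.
tokenizer = AlbertTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, keep_accents=True
)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART")

# 1. Extend the tokenizer with new language indicator tokens
#    (placeholder codes; follow the existing <2xx> convention).
new_lang_tokens = ["<2sat>", "<2kok>"]
tokenizer.add_tokens(new_lang_tokens, special_tokens=True)

# 2. Grow the embedding matrix so the new token IDs get (randomly initialised) rows.
model.resize_token_embeddings(len(tokenizer))

# Save the extended tokenizer and checkpoint for continued pre-training.
tokenizer.save_pretrained("indicbart-ext-tokenizer")
model.save_pretrained("indicbart-ext")
```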

The URL you included is giving a 404 error. Is there some other place where it may be described?

> Extend the AlbertTokenizer with additional language tokens.

I should be able to do this.

> Corresponding to the IDs of the language indicator tokens, extend the embedding matrix of the IndicBART model checkpoint (the .pt or .bin file containing the weights).

For this, can I simply load the pre-trained model and call the training script or function again on more data, or are there other points I should keep in mind?

And do the language codes matter, or can one simply reuse existing codes? I remember reading in the paper that languages are not distinguished during training so as to allow zero-shot learning, but I guess the code will matter when actually translating.

BTW, I also want to fine-tune on some basic NLP tasks like POS tagging, NER, etc. Will the pre-training differ for bilingual and monolingual tasks? There are, I think, two different scripts for monolingual and translation tasks.

The URL is working now. I have data for three languages; two of them are not in the list of 24, and for one I may have some more data.

For IndicBERT, I have posted an issue on their repository.

For IndicBART, I am not clear on how to use the language codes when continuing pre-training or when fine-tuning.

Should the pre-training be different for multilingual parallel corpora and multilingual monolingual corpora? Or will only the fine-tuning be different? In either case, how should I proceed? I don't have prior experience working with BERT.

> The URL you included is giving a 404 error. Is there some other place where it may be described?

> Extend the AlbertTokenizer with additional language tokens.

> I should be able to do this.

> Corresponding to the IDs of the language indicator tokens, extend the embedding matrix of the IndicBART model checkpoint (the .pt or .bin file containing the weights).

> For this, can I simply load the pre-trained model and call the training script or function again on more data, or are there other points I should keep in mind?

You will have to look into how to resize the embedding layer; this will require hacking YANMTT a bit. Take a look here for hints: https://discuss.huggingface.co/t/adding-new-tokens-while-preserving-tokenization-of-adjacent-tokens/12604/3
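
A bare-bones sketch of that embedding hack on a saved checkpoint could look like this (the file names and the state_dict key patterns are assumptions; inspect your own checkpoint to find the real embedding and output-projection keys):

```python
import torch

NUM_NEW_TOKENS = 2  # one extra row per newly added language token

# Hypothetical checkpoint path; some checkpoints wrap the weights under a "model" key.
ckpt = torch.load("indicbart_model.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

for key, weight in list(state_dict.items()):
    # Grow every vocabulary-sized matrix: input/shared embeddings and the LM head.
    # The substrings below are guesses; adjust them to the keys in your checkpoint.
    if (
        isinstance(weight, torch.Tensor)
        and weight.dim() == 2
        and ("embed_tokens" in key or "shared" in key or "lm_head" in key)
    ):
        extra = torch.empty(NUM_NEW_TOKENS, weight.shape[1], dtype=weight.dtype)
        torch.nn.init.normal_(extra, mean=0.0, std=0.02)
        state_dict[key] = torch.cat([weight, extra], dim=0)

# NOTE: if the checkpoint also stores a vocabulary-sized bias (e.g. a final logits bias),
# it needs to be padded with NUM_NEW_TOKENS entries as well.
torch.save(ckpt, "indicbart_model_extended.pt")
```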

> And do the language codes matter, or can one simply reuse existing codes? I remember reading in the paper that languages are not distinguished during training so as to allow zero-shot learning, but I guess the code will matter when actually translating.

You can reuse them, but it's a hacky solution.

> BTW, I also want to fine-tune on some basic NLP tasks like POS tagging, NER, etc. Will the pre-training differ for bilingual and monolingual tasks? There are, I think, two different scripts for monolingual and translation tasks.

YANMTT is not designed for basic NLP tasks; it's designed for NLG tasks. However, you can treat the NLP task as an NLG task and see what happens. One script is for pre-training, but the fine-tuning script can also be used indirectly for pre-training. I'm planning to retire the pre-training script since the latter can already do everything.
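
As a toy illustration of treating an NLP task as an NLG task, POS tagging can be written out as plain source/target text files that any seq2seq fine-tuning setup can consume (the file names here are just examples):

```python
# Cast POS tagging as sequence-to-sequence generation: the source line is the raw
# sentence and the target line is the corresponding tag sequence.
tagged = [("She", "PRON"), ("reads", "VERB"), ("books", "NOUN")]

src_line = " ".join(tok for tok, _ in tagged)   # "She reads books"
tgt_line = " ".join(tag for _, tag in tagged)   # "PRON VERB NOUN"

with open("train.src", "w") as f_src, open("train.tgt", "w") as f_tgt:
    f_src.write(src_line + "\n")
    f_tgt.write(tgt_line + "\n")
```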

> The URL is working now. I have data for three languages; two of them are not in the list of 24, and for one I may have some more data.

> For IndicBERT, I have posted an issue on their repository.

> For IndicBART, I am not clear on how to use the language codes when continuing pre-training or when fine-tuning.

For fine-tuning, you can take a look here: https://github.com/AI4Bharat/indic-bart
For continued pre-training, you can reuse the commands in the aforementioned repo, except that your source and target corpora are the same monolingual corpus files.

> Should the pre-training be different for multilingual parallel corpora and multilingual monolingual corpora? Or will only the fine-tuning be different? In either case, how should I proceed? I don't have prior experience working with BERT.

Rather than jumping into continued pre-training, I recommend that you first get used to pre-training models from scratch with YANMTT. Once you familiarize yourself with this, things will get easier. Look into the examples folder for help.

Also, I recommend going through the issues section; most of your questions can be answered there:

#37
#36
#34
#33
#54
#31
#17
#5