prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit

Getting an error when pretraining with a new language (Sanskrit)

Aniruddha-JU opened this issue · comments

We are trying to pre-train a model initialized with IndicBART. We use the command below:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs sa --mono_src examples/data/train.sa --batch_size 8 --batch_size_indicates_lines --shard_files --model_path ai4bharat/IndicBART

We are getting the error below:

Calling AlbertTokenizer.from_pretrained() with the path to a single file or url is deprecated
Traceback (most recent call last):
File "pretrain_nmt.py", line 968, in <module>
run_demo()
File "pretrain_nmt.py", line 965, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/aniruddha/machine_translation/yanmtt/pretrain_nmt.py", line 85, in model_create_load_run_save
tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1789, in from_pretrained
resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/albert/tokenization_albert.py", line 153, in __init__
self.sp_model.Load(vocab_file)
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: /sentencepiece/python/bundled/sentencepiece/src/sentencepiece_processor.cc(848) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

Hi,

I think you are not using the version of transformers that I have provided with the toolkit. Either that or your sentencepiece version is not the one in the requirements.txt file.

Kindly uninstall any existing version of transformers with "pip uninstall transformers" and then install the version I have provided in the transformers folder with "cd transformers && python setup.py install".
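
If you are unsure which copies Python is actually picking up, a quick check like this should tell you (a minimal sketch; the exact versions to expect are pinned in requirements.txt):

# Sanity check: confirm which transformers/sentencepiece installations are active.
import transformers
import sentencepiece

print(transformers.__version__)   # should match the version of the bundled transformers folder
print(transformers.__file__)      # tells you which installation is actually being imported
print(sentencepiece.__version__)  # should match the version pinned in requirements.txt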

Also, your command needs some fixing.

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs XX --mono_src examples/data/train.sa --batch_size 8 --batch_size_indicates_lines --shard_files --model_path <local path like /home/raj/model_folder/model>

XX should be one of the 11 language tokens that the model supports. Currently, I have not yet included a method for specifying new languages, so the way to bypass this is to use any of the existing tokens -- as, bn, gu, hi, kn, ml, mr, or, pa, ta, te. Ideally, choose a token that you don't plan to use in any fine-tuning experiments.
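
For example, a small guard like this (a minimal sketch; the token list is the one given above) catches an unsupported code before a run is launched:

# The 11 language tokens IndicBART supports, per the list above.
SUPPORTED_LANGS = {"as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"}

lang = "sa"  # the value passed via --langs
if lang not in SUPPORTED_LANGS:
    raise ValueError(f"'{lang}' is not an IndicBART language token; reuse one of {sorted(SUPPORTED_LANGS)}")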

Hi, thanks for your reply. I am getting the error below when I use this command:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src /home/aniruddha/sanjana/train.hi --batch_size 8 --batch_size_indicates_lines --shard_files --model_path /home/aniruddha/IndicBART.ckpt --port 8080


Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/mbart/tokenization_mbart.py", line 97, in __init__
super().__init__(*args, tokenizer_file=tokenizer_file, **kwargs)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 135, in __init__
self.sp_model.Load(str(vocab_file))
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "None": No such file or directory Error #2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/aniruddha/machine_translation/yanmtt/pretrain_nmt.py", line 85, in model_create_load_run_save
tok = MBartTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1789, in from_pretrained
resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1863, in _from_pretrained
"Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.


But when I point --model_path at a blank folder, the code runs:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src /home/aniruddha/sanjana/train.hi --batch_size 8 --batch_size_indicates_lines --shard_files --model_path IndicBART --port 8080

Hi,

The error made me realize that there was a tiny bug.

elif "IndicBART" in args.pretrained_model: tok = MBartTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)

Should be:

elif "IndicBART" in args.pretrained_model: tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)

I'm surprised that it actually worked; it should have thrown an error.
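
The OSError: Not found: "None" above is consistent with this: each tokenizer class looks for a differently named SentencePiece file in the hub repo, so the MBart class resolves nothing and hands vocab_file=None to sentencepiece. A sketch of the difference (constant names as in the bundled transformers; exact values may differ across versions):

# Each tokenizer class resolves a different vocab file name from the hub repo.
from transformers.models.albert.tokenization_albert import VOCAB_FILES_NAMES as ALBERT_FILES
from transformers.models.mbart.tokenization_mbart import VOCAB_FILES_NAMES as MBART_FILES

print(ALBERT_FILES)  # {'vocab_file': 'spiece.model'} -- the file IndicBART actually ships
print(MBART_FILES)   # {'vocab_file': 'sentencepiece.bpe.model'} -- absent, so vocab_file resolves to None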

Also, the --model_path should be specified as /home/aniruddha/IndicBART.ckpt/model.

That is, it should be path + "/" + prefix, where path = /home/aniruddha/IndicBART.ckpt and prefix = model.
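
In code terms (a minimal sketch of how the value is split; the directory creation step is my addition):

import os

model_path = "/home/aniruddha/IndicBART.ckpt/model"  # path + "/" + prefix

save_dir = os.path.dirname(model_path)  # /home/aniruddha/IndicBART.ckpt -- must be writable
prefix = os.path.basename(model_path)   # model -- saved checkpoint files start with this
os.makedirs(save_dir, exist_ok=True)

# Checkpoints are then written as prefix plus a suffix, e.g. the
# model.pure_model file mentioned later in this thread:
print(model_path + ".pure_model")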

That's something I should clarify better in the documentation.

Please pull the latest code after 15 mins.

Hi,
I realized this and changed it earlier. I also have one query: the --model_path argument in the above command is not used to initialize any model if we are using the --use_official_pretrained and --pretrained_model arguments. Am I right? Can you please verify?

--model_path is the place where the model is saved. --pretrained_model is where the parameters are loaded from.

So we should not give any existing model path, right? Rather, I am giving a new path where the new pre-trained model will be saved. Am I right? Please confirm it once. With --model_path ai4bharat/IndicBART, this ai4bharat/IndicBART is a new directory.

Since we are using args.use_official_pretrained, we don't need to give any existing model path, because in your code --model_path is used to store the model, config, and tokenizer. Am I right?

Both paths are needed: one is for loading, one is for saving. If you don't use a pretrained model, then just use --model_path.

If you don't specify --model_path, then the model will be saved under the default value of that argument (please check the code).
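
Roughly, the relevant arguments behave like this (a hypothetical argparse sketch; the real definitions and default values live in pretrain_nmt.py):

import argparse

parser = argparse.ArgumentParser()
# Where parameters are loaded FROM (a local path, or a HF hub id when
# --use_official_pretrained is also given).
parser.add_argument("--pretrained_model", default="")
# Treat --pretrained_model / --tokenizer_name_or_path as official HF hub identifiers.
parser.add_argument("--use_official_pretrained", action="store_true")
# Local path + prefix where checkpoints are SAVED; "default_model" is a
# placeholder here, the actual default is in the code.
parser.add_argument("--model_path", default="default_model")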

--model_path should be a local path. I think there is some confusion:

  1. ai4bharat/IndicBART is not a local path. It is an identifier on the Hugging Face hub.
  2. Since it is a pretrained model, it should be passed to --pretrained_model.
  3. Since it is an official model on the Hugging Face hub, you also need to specify an additional flag: --use_official_pretrained

In my fixed version of the code, if --use_official_pretrained is used, then the config and model are loaded from --pretrained_model and the tokenizer is loaded from --tokenizer_name_or_path.
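
As a sketch of that flow (the AlbertTokenizer call is the one from the fix above; using MBartForConditionalGeneration as the model class is my assumption):

from types import SimpleNamespace
from transformers import AlbertTokenizer, MBartForConditionalGeneration

args = SimpleNamespace(use_official_pretrained=True,
                       pretrained_model="ai4bharat/IndicBART",
                       tokenizer_name_or_path="ai4bharat/IndicBART")

if args.use_official_pretrained:
    # config + weights come from the hub identifier given to --pretrained_model
    model = MBartForConditionalGeneration.from_pretrained(args.pretrained_model)
    # tokenizer comes from --tokenizer_name_or_path
    tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)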

Your use case is simple: fine-tune IndicBART on your own monolingual data. The following command is sufficient:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src ../data/hi/hi.txt.00 --batch_size 8 --batch_size_indicates_lines --shard_files --model_path /tmp/model --port 8080

--pretrained_model ai4bharat/IndicBART because you want to load the official IndicBART model from the HF hub. If you had instead downloaded the IndicBART model from https://github.com/AI4Bharat/indic-bart, you would have to first download the model checkpoint and tokenizer locally and then specify their paths via --pretrained_model and --tokenizer_name_or_path.

--use_official_pretrained because you are loading the official IndicBART model from the HF hub.

--model_path /tmp/model because you want to save your model in the /tmp folder. Model files will have several suffixes depending on their use; you will only be looking at the file model.pure_model.
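
If you later want to inspect or reuse those weights, treating the .pure_model file as a plain PyTorch state dict should work (an assumption on my part; check how the toolkit saves it if in doubt):

import torch

# Path from the command above: --model_path /tmp/model plus the .pure_model suffix.
state_dict = torch.load("/tmp/model.pure_model", map_location="cpu")
print(len(state_dict), "tensors loaded")  # quick sanity check that the file parsed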

Hi, thank you for your reply. Yes, --model_path should be a local path. I had actually created a local directory named ai4bharat/IndicBART, mirroring the Hugging Face model name, and I have verified that the model is saved to this path. Thank you for your reply.