prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit

Getting an error when pretraining with a new language (Sanskrit)

Aniruddha-JU opened this issue · comments

We are trying to pre-train a model initialized with IndicBART. We use the command below:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs sa --mono_src examples/data/train.sa --batch_size 8 --batch_size_indicates_lines --shard_files --model_path ai4bharat/IndicBART

We are getting the error below:

Calling AlbertTokenizer.from_pretrained() with the path to a single file or url is deprecated
Traceback (most recent call last):
File "pretrain_nmt.py", line 968, in <module>
run_demo()
File "pretrain_nmt.py", line 965, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/aniruddha/machine_translation/yanmtt/pretrain_nmt.py", line 85, in model_create_load_run_save
tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1789, in from_pretrained
resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/albert/tokenization_albert.py", line 153, in __init__
self.sp_model.Load(vocab_file)
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: /sentencepiece/python/bundled/sentencepiece/src/sentencepiece_processor.cc(848) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

Hi,

I think you are not using the version of transformers that I have provided with the toolkit. Either that or your sentencepiece version is not the one in the requirements.txt file.

Kindly uninstall any existing version of transformers with "pip uninstall transformers" and then install the version I have provided in the transformers folder with "cd transformers && python setup.py install".
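
If you are unsure which copies Python is actually picking up, a quick check like this should tell you (a minimal sketch; the exact versions to expect are pinned in requirements.txt):

# Sanity check: confirm which transformers/sentencepiece installations are active.
import transformers
import sentencepiece

print(transformers.__version__)   # should match the version of the bundled transformers folder
print(transformers.__file__)      # tells you which installation is actually being imported
print(sentencepiece.__version__)  # should match the version pinned in requirements.txt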

Also, your command needs some fixing.

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs XX --mono_src examples/data/train.sa --batch_size 8 --batch_size_indicates_lines --shard_files --model_path <local path like /home/raj/model_folder/model>

XX should be one of the 11 language tokens that the model supports. Currently, I have not yet included a method for specifying new languages, so the way to bypass this is to use any of the existing tokens -- as, bn, gu, hi, kn, ml, mr, or, pa, ta, te. Ideally, choose a token that you don't plan to use in any fine-tuning experiments.
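
For example, a small guard like this (a minimal sketch; the token list is the one given above) catches an unsupported code before a run is launched:

# The 11 language tokens IndicBART supports, per the list above.
SUPPORTED_LANGS = {"as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"}

lang = "sa"  # the value passed via --langs
if lang not in SUPPORTED_LANGS:
    raise ValueError(f"'{lang}' is not an IndicBART language token; reuse one of {sorted(SUPPORTED_LANGS)}")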

Hi, thanks for your reply. I am getting the error below when I use this command:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src /home/aniruddha/sanjana/train.hi --batch_size 8 --batch_size_indicates_lines --shard_files --model_path /home/aniruddha/IndicBART.ckpt --port 8080


Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/mbart/tokenization_mbart.py", line 97, in __init__
super().__init__(*args, tokenizer_file=tokenizer_file, **kwargs)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 135, in __init__
self.sp_model.Load(str(vocab_file))
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "None": No such file or directory Error #2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/aniruddha/machine_translation/yanmtt/pretrain_nmt.py", line 85, in model_create_load_run_save
tok = MBartTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1789, in from_pretrained
resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1863, in _from_pretrained
"Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.


But when I point --model_path at a blank folder, the code runs:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src /home/aniruddha/sanjana/train.hi --batch_size 8 --batch_size_indicates_lines --shard_files --model_path IndicBART --port 8080

Hi,

The error made me realize that there was a tiny bug.

elif "IndicBART" in args.pretrained_model: tok = MBartTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)

Should be:

elif "IndicBART" in args.pretrained_model: tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)

I'm surprised that it actually worked; it should have thrown an error.
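
The OSError: Not found: "None" above is consistent with this: each tokenizer class looks for a differently named SentencePiece file in the hub repo, so the MBart class resolves nothing and hands vocab_file=None to sentencepiece. A sketch of the difference (constant names as in the bundled transformers; exact values may differ across versions):

# Each tokenizer class resolves a different vocab file name from the hub repo.
from transformers.models.albert.tokenization_albert import VOCAB_FILES_NAMES as ALBERT_FILES
from transformers.models.mbart.tokenization_mbart import VOCAB_FILES_NAMES as MBART_FILES

print(ALBERT_FILES)  # {'vocab_file': 'spiece.model'} -- the file IndicBART actually ships
print(MBART_FILES)   # {'vocab_file': 'sentencepiece.bpe.model'} -- absent, so vocab_file resolves to None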

Also, the --model_path should be specified as /home/aniruddha/IndicBART.ckpt/model.

That is, it should be path + "/" + prefix, where path = /home/aniruddha/IndicBART.ckpt and prefix = model.
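
In code terms (a minimal sketch of how the value is split; the directory creation step is my addition):

import os

model_path = "/home/aniruddha/IndicBART.ckpt/model"  # path + "/" + prefix

save_dir = os.path.dirname(model_path)  # /home/aniruddha/IndicBART.ckpt -- must be writable
prefix = os.path.basename(model_path)   # model -- saved checkpoint files start with this
os.makedirs(save_dir, exist_ok=True)

# Checkpoints are then written as prefix plus a suffix, e.g. the
# model.pure_model file mentioned later in this thread:
print(model_path + ".pure_model")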

That's something I should clarify better in the documentation.

Please pull the latest code after 15 mins.

Hi,
I realized this and changed it earlier. I also have one query: the --model_path argument in the above command is not used to initialize any model if we are using the --use_official_pretrained and --pretrained_model arguments. Am I right? Can you please verify?

--model_path is the place where the model is saved. --pretrained_model is where the parameters are loaded from.

So we should not give any existing model path, right? Rather, I am giving a new path where the new pre-trained model will be saved. Am I right? Please confirm it once. With --model_path ai4bharat/IndicBART, this ai4bharat/IndicBART is a new directory.

Since we are using args.use_official_pretrained, we don't need to give any existing model path, because in your code --model_path is used to store the model, config, and tokenizer. Am I right?

Both paths are needed: one is for loading, one is for saving. If you don't use a pretrained model, then just use --model_path.

If you don't specify --model_path, then the model will be saved under the default value of that argument (please check the code).
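
Roughly, the relevant arguments behave like this (a hypothetical argparse sketch; the real definitions and default values live in pretrain_nmt.py):

import argparse

parser = argparse.ArgumentParser()
# Where parameters are loaded FROM (a local path, or a HF hub id when
# --use_official_pretrained is also given).
parser.add_argument("--pretrained_model", default="")
# Treat --pretrained_model / --tokenizer_name_or_path as official HF hub identifiers.
parser.add_argument("--use_official_pretrained", action="store_true")
# Local path + prefix where checkpoints are SAVED; "default_model" is a
# placeholder here, the actual default is in the code.
parser.add_argument("--model_path", default="default_model")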

--model_path should be a local path. I think there is some confusion:

  1. ai4bharat/IndicBART is not a local path. It is an identifier on the Hugging Face hub.
  2. Since it is a pretrained model, it should be passed to --pretrained_model.
  3. Since it is an official model on the Hugging Face hub, you also need to specify an additional flag: --use_official_pretrained

In my fixed version of the code, if --use_official_pretrained is used, then the config and model are loaded from --pretrained_model and the tokenizer is loaded from --tokenizer_name_or_path.
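
As a sketch of that flow (the AlbertTokenizer call is the one from the fix above; using MBartForConditionalGeneration as the model class is my assumption):

from types import SimpleNamespace
from transformers import AlbertTokenizer, MBartForConditionalGeneration

args = SimpleNamespace(use_official_pretrained=True,
                       pretrained_model="ai4bharat/IndicBART",
                       tokenizer_name_or_path="ai4bharat/IndicBART")

if args.use_official_pretrained:
    # config + weights come from the hub identifier given to --pretrained_model
    model = MBartForConditionalGeneration.from_pretrained(args.pretrained_model)
    # tokenizer comes from --tokenizer_name_or_path
    tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)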

Your use case is simple: fine-tune IndicBART on your own monolingual data. The following command is sufficient:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src ../data/hi/hi.txt.00 --batch_size 8 --batch_size_indicates_lines --shard_files --model_path /tmp/model --port 8080

--pretrained_model ai4bharat/IndicBART because you want to load the official IndicBART model from the HF hub. If you had instead downloaded the IndicBART model from https://github.com/AI4Bharat/indic-bart, you would have to first download the model checkpoint and tokenizer locally and then specify their paths via --pretrained_model and --tokenizer_name_or_path.

--use_official_pretrained because you are loading the official IndicBART model from the HF hub.

--model_path /tmp/model because you want to save your model in the /tmp folder. Model files will have several suffixes depending on their use; you will only be looking at the file model.pure_model.
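
If you later want to inspect or reuse those weights, treating the .pure_model file as a plain PyTorch state dict should work (an assumption on my part; check how the toolkit saves it if in doubt):

import torch

# Path from the command above: --model_path /tmp/model plus the .pure_model suffix.
state_dict = torch.load("/tmp/model.pure_model", map_location="cpu")
print(len(state_dict), "tensors loaded")  # quick sanity check that the file parsed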

Hi, thank you for your reply. Yes, --model_path should be a local path. I had actually created a local directory named ai4bharat/IndicBART, mirroring the Hugging Face model name, and I have verified that the model is saved to this path. Thank you for your reply.