prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit


Some problem with loss

raullese opened this issue · comments

Hi, after training for 1.5 million steps with the settings from issue #39, I checked the loss in TensorBoard:

[TensorBoard training-loss curve]

Referring to run_train.log, the printed loss has kept growing, from about 2 to about 6 over these 1.5 million steps. I'm wondering if this is reasonable for the training loss?

It's not a problem with the loss. Your training has diverged. It could be a result of overfitting, too high a learning rate, a need for additional gradient clipping, etc.

I am not sure how to fix this. You should resume from a checkpoint saved before the divergence.
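For reference, gradient clipping caps the global L2 norm of the gradients before the optimizer step; in PyTorch this is done with `torch.nn.utils.clip_grad_norm_`. The underlying math, sketched in plain Python with made-up gradient values:

```python
import math

def clip_grad_norm(grads, max_norm):
    # Compute the global L2 norm across all gradient values.
    total_norm = math.sqrt(sum(g * g for g in grads))
    # If it exceeds max_norm, rescale every gradient by the same factor
    # so the global norm becomes exactly max_norm.
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Toy example: gradients (3, 4) have norm 5, so they get scaled by 1/5.
clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
```

In a real training loop you would call `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` between `loss.backward()` and `optimizer.step()`.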

@prajdabre Thank you. Seems tricky; I'll keep looking at it. Actually, my setup is pretty much the same as in the original paper.

By the way, do you know how to convert this project's .bin model to a .pt model? When I try to use the generated model in the fairseq pipeline, I get an error: the key best_loss is not found. It seems fairseq expects the model in .pt format.

I also found that the open-source mBART-50 large from Hugging Face can be loaded in fairseq, but the model generated by this project cannot.

Personally, I have had my own experiences with model divergence. All I could do was go back to a checkpoint before the divergence and continue from there with a different learning rate, gradient clipping, etc. The BLOOM people also had to do this kind of "open surgery" on larger models quite often.

I am assuming that you are talking about the .bin model in the deploy folder. I'm not sure how to make it compatible with fairseq. What procedure are you using to do this: https://github.com/facebookresearch/fairseq/tree/main/examples/multilingual#mbart50-models ?
I looked into the .pt checkpoint and it contains a lot of extra state, such as optimizer states.
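For illustration only: a fairseq .pt checkpoint is essentially a plain Python dict saved with torch.save, holding the weights plus training metadata, and the "best_loss" error above suggests fairseq is looking for such metadata keys. A hypothetical sketch of wrapping a bare state dict (all key names other than "model" are guesses based on the error; check the checkpoint loading code of your fairseq version before relying on them):

```python
# Stand-in for a real state dict of tensors loaded from the .bin file,
# e.g. with torch.load("pytorch_model.bin").
state_dict = {"model.decoder.layers.11.fc2.bias": [0.0]}

# Hypothetical fairseq-style wrapper dict. "best_loss" is the key the
# error message reported as missing; the others are assumptions.
checkpoint = {
    "model": state_dict,
    "best_loss": float("inf"),
    "optimizer_history": [],
    "extra_state": {},
}

# With PyTorch available, this would then be saved as:
# torch.save(checkpoint, "model.pt")
```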

Also, how are you using HF models in fairseq? If you point me to it, I might be able to tell you what you need. HF model checkpoints are not in .pt format; I imagine some conversion is done.

Yeah, we use the pipeline from there: https://github.com/facebookresearch/fairseq/tree/main/examples/multilingual#mbart50-models, with that project's open-source mBART-50, and we also fine-tune with the script from this project.
The problem is that errors occur when we use the model from yanmtt.
It mainly affects loading the model: our parameter names look like 'module.model.decoder.layers.11.fc2.bias', while the open-source ones look like 'model.decoder.layers.11.fc2.bias'.
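For what it's worth, that "module." prefix is what PyTorch's DataParallel/DistributedDataParallel wrapper prepends to every parameter name when a wrapped model's state dict is saved. A minimal sketch of stripping it, assuming the mismatch is only this prefix (the example key is taken from the message above):

```python
def strip_module_prefix(state_dict):
    # Remove the "module." prefix that DataParallel/DistributedDataParallel
    # prepends to parameter names, leaving other keys untouched.
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Toy example with a placeholder value instead of a real tensor.
sd = {"module.model.decoder.layers.11.fc2.bias": 0}
renamed = strip_module_prefix(sd)
# renamed now has the key "model.decoder.layers.11.fc2.bias"
```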

And when we worked around this problem, we found that we can't translate from en to vi:

[screenshot of the failed en-to-vi translation]

One more point: we reproduced the test results in the paper with the open-source mBART-50 downloaded from https://github.com/facebookresearch/fairseq/tree/main/examples/multilingual#mbart50-models, and they are basically the same as the results in the paper.

If you use the models with the "pure_model" suffix, you won't have the parameter-naming issue.

Honestly, I did not build YANMTT to be compatible with fairseq, so I'm afraid I can't help you there; fairseq compatibility is not really my goal.

As for not being able to translate from en to vi, I will need to know your training flow to understand why that happens.