grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)

Optimize the tokenization

HillZhang1999 opened this issue

First, thanks for your excellent work. Here is my question:

  • I used your code to reproduce the results in your paper, but found that CPU utilization was very high during training, especially in stage 1, while GPU utilization fluctuated and was sometimes only 50~60% rather than 100%.
  • After debugging, I suspect the cause is the dynamic word-piece tokenization performed in the indexer.
  • I also made minor changes to adapt your code for Chinese GEC, e.g., 1) upgraded AllenNLP to the latest version; 2) dropped the word-piece step, since in Chinese we use characters directly as input units. In the Chinese experiments I found that, after these adaptations, CPU usage dropped sharply and training was greatly accelerated (e.g., 900M sentences take ~5 days for English but only ~1 day for Chinese).
  • So I wonder whether your implementation could be accelerated by upgrading AllenNLP to the latest version, or by preprocessing the data (i.e., doing word-piece or BPE segmentation) statically before training, as in the sketch below.
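A rough sketch of the static pre-tokenization idea, using the HuggingFace tokenizers library; the file names and the plain one-sentence-per-line format are hypothetical (GECToR's real training files also carry edit tags, so a real cache would need to preserve that alignment):

```python
# Hypothetical offline pre-tokenization script (not part of the GECToR code base):
# run word-piece tokenization once and cache the per-word sub-token ids, so the
# indexer does not have to re-tokenize every sentence on every epoch.
import pickle
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

cache = []
with open("train.src", encoding="utf-8") as f:   # assumed: one tokenized sentence per line
    for line in f:
        words = line.rstrip("\n").split()
        # keep the sub-token ids per word so word-level labels can still be aligned
        cache.append([tokenizer.encode(w, add_special_tokens=False) for w in words])

with open("train.wordpieces.pkl", "wb") as f:
    pickle.dump(cache, f)
```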

That's a good suggestion. Indeed, tokenization may require heavy CPU usage.
I don't see how updating AllenNLP can help. Maybe you have a suggestion on how we can optimize the code?

Hi, upgrading AllenNLP can speed up training, since we can simply set the parameter use_amp to true in the GradientDescentTrainer and thus train GECToR with Automatic Mixed Precision (of course, some extra adaptations are needed to support GECToR-specific features like cold_steps); see the sketch below.
As for the tokenization problem, I guess we could tokenize the training data before the training process and load the cached results during training, to avoid the heavy CPU usage and redundant tokenization?
Thank you for your kind reply!
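For reference, a minimal sketch of what the AMP switch looks like after upgrading to AllenNLP >= 1.0, where the trainer class is GradientDescentTrainer; the model, optimizer, and data loaders are assumed to be built the same way as in the existing train.py, and only the trainer construction changes:

```python
# Sketch only: enabling Automatic Mixed Precision with AllenNLP's
# GradientDescentTrainer. The arguments `model`, `optimizer`, and the two
# data loaders are assumed to come from the existing training script.
from allennlp.training import GradientDescentTrainer

def build_trainer(model, optimizer, train_loader, dev_loader, serialization_dir, n_epochs):
    return GradientDescentTrainer(
        model=model,
        optimizer=optimizer,
        data_loader=train_loader,
        validation_data_loader=dev_loader,
        serialization_dir=serialization_dir,
        num_epochs=n_epochs,
        cuda_device=0,
        use_amp=True,   # mixed-precision training via torch.cuda.amp
    )
```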

Hey! I'm wondering how to modify the code to support cold_steps with AMP. I tried, but if I freeze the encoder for the first few epochs, it doesn't seem possible to unfreeze it later with "params.requires_grad = True": the loss will not decrease and accuracy does not improve. Have you figured out a possible solution? I need some help.

One simple solution is to save the model parameters to disk after finishing the cold steps. Then you can start a new training run, reload the model parameters, and unfreeze the BERT encoder.
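A rough sketch of that two-run workaround; the name `model` and the "text_field_embedder" filter are illustrative, not GECToR's exact code:

```python
# Illustrative two-run workaround: finish the cold steps, save, then restart,
# reload, and unfreeze the encoder before continuing training.
import torch

# End of run 1 (cold steps only): persist the warmed-up weights.
torch.save(model.state_dict(), "after_cold_steps.th")

# Start of run 2: reload everything, then unfreeze the BERT encoder.
model.load_state_dict(torch.load("after_cold_steps.th"))
for name, param in model.named_parameters():
    if "text_field_embedder" in name:   # assumed name of the encoder sub-module
        param.requires_grad = True
```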

Okay, I found that it works if requires_grad is set inside the forward method. Thank you, by the way~
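A minimal, self-contained illustration of that trick, with toy modules standing in for the BERT encoder and the tag classifier (none of this is GECToR's actual code):

```python
# Toy sketch: the encoder starts frozen and is unfrozen from inside forward()
# after `cold_steps` training steps, which avoids toggling requires_grad
# between optimizer steps in the trainer.
import torch

class TinyTagger(torch.nn.Module):
    def __init__(self, cold_steps=100):
        super().__init__()
        self.encoder = torch.nn.Linear(16, 16)   # stand-in for the BERT encoder
        self.head = torch.nn.Linear(16, 4)       # stand-in for the tag classifier
        self.cold_steps = cold_steps
        self.steps_seen = 0
        for p in self.encoder.parameters():      # frozen during the cold steps
            p.requires_grad = False

    def forward(self, x):
        if self.training:
            self.steps_seen += 1
            if self.steps_seen > self.cold_steps:
                for p in self.encoder.parameters():
                    p.requires_grad = True       # unfreeze from inside forward()
        return self.head(self.encoder(x))
```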

Hi @HillZhang1999, could you please describe the changes you made to use the latest AllenNLP? Thanks!

Maybe you can refer to this repo: https://github.com/HillZhang1999/MuCGEC/tree/main/models/seq2edit-based-CGEC

Thanks for replying so quickly!
It looks like you did not use the tokenization file from this code base? I tried to replace the existing ones with pretrainedIndexer and pretrainedEmbedder directly, but the predicted results are different.
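If "pretrainedIndexer and pretrainedEmbedder" refers to AllenNLP's mismatched pair (my assumption; the thread does not name the exact classes), a minimal setup looks roughly like this:

```python
# Assumed interpretation: AllenNLP's "mismatched" indexer/embedder, which run
# word-piece tokenization internally but return one vector per original word,
# so GECToR's word-level tags still line up.
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer
from allennlp.modules.token_embedders import PretrainedTransformerMismatchedEmbedder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder

model_name = "bert-base-cased"
token_indexers = {"bert": PretrainedTransformerMismatchedIndexer(model_name=model_name)}
text_field_embedder = BasicTextFieldEmbedder(
    {"bert": PretrainedTransformerMismatchedEmbedder(model_name=model_name)}
)
```

One plausible source of the differing predictions: the mismatched embedder pools (averages) the sub-token vectors of each word, while, as far as I understand, the original GECToR indexer/embedder select sub-tokens by offset, so the per-word representations are not identical even with the same checkpoint.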

And if you would like to train a seq2edit GEC model without the AllenNLP bundle but with faster speed, I made a DeepSpeed + PyTorch + Transformers implementation; you can refer to this repo:
https://github.com/blcuicall/CCL2022-CLTC/tree/main/baselines/track3/seq2edit