grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)

Optimize the tokenization

HillZhang1999 opened this issue

First, thanks for your excellent work. Here is my question:

  • I used your code to reproduce the results in your paper, but found that CPU utilization was very high during training, especially in stage 1, while GPU utilization fluctuated and was sometimes only 50~60% rather than 100%.
  • After debugging, I suspect the cause is the dynamic word-piece tokenization performed in the indexer.
  • I also made minor changes to adapt your code for Chinese GEC, e.g., 1) upgraded AllenNLP to the latest version; 2) dropped the word-piece step, since in Chinese we use characters directly as input units. In the Chinese experiments I found that, after these adaptations, CPU usage dropped sharply and training was greatly accelerated (e.g., 900M sentences take ~5 days for English but only ~1 day for Chinese).
  • So I wonder whether your implementation could be accelerated by upgrading AllenNLP to the latest version, or by preprocessing the data (i.e., doing word-piece or BPE segmentation) statically before training, as in the sketch below.
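A rough sketch of the static pre-tokenization idea, using the HuggingFace tokenizers library; the file names and the plain one-sentence-per-line format are hypothetical (GECToR's real training files also carry edit tags, so a real cache would need to preserve that alignment):

```python
# Hypothetical offline pre-tokenization script (not part of the GECToR code base):
# run word-piece tokenization once and cache the per-word sub-token ids, so the
# indexer does not have to re-tokenize every sentence on every epoch.
import pickle
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

cache = []
with open("train.src", encoding="utf-8") as f:   # assumed: one tokenized sentence per line
    for line in f:
        words = line.rstrip("\n").split()
        # keep the sub-token ids per word so word-level labels can still be aligned
        cache.append([tokenizer.encode(w, add_special_tokens=False) for w in words])

with open("train.wordpieces.pkl", "wb") as f:
    pickle.dump(cache, f)
```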

That's a good suggestion. Indeed, tokenization may require heavy CPU usage.
I don't see how updating AllenNLP can help. Maybe you have a suggestion on how we can optimize the code?

Hi, upgrading AllenNLP can speed up training, since we can simply set the parameter use_amp to true in the GradientDescentTrainer and thus train GECToR with Automatic Mixed Precision (of course, some extra adaptations are needed to support GECToR-specific features like cold_steps); see the sketch below.
As for the tokenization problem, I guess we could tokenize the training data before the training process and load the cached results during training, to avoid the heavy CPU usage and redundant tokenization?
Thank you for your kind reply!
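For reference, a minimal sketch of what the AMP switch looks like after upgrading to AllenNLP >= 1.0, where the trainer class is GradientDescentTrainer; the model, optimizer, and data loaders are assumed to be built the same way as in the existing train.py, and only the trainer construction changes:

```python
# Sketch only: enabling Automatic Mixed Precision with AllenNLP's
# GradientDescentTrainer. The arguments `model`, `optimizer`, and the two
# data loaders are assumed to come from the existing training script.
from allennlp.training import GradientDescentTrainer

def build_trainer(model, optimizer, train_loader, dev_loader, serialization_dir, n_epochs):
    return GradientDescentTrainer(
        model=model,
        optimizer=optimizer,
        data_loader=train_loader,
        validation_data_loader=dev_loader,
        serialization_dir=serialization_dir,
        num_epochs=n_epochs,
        cuda_device=0,
        use_amp=True,   # mixed-precision training via torch.cuda.amp
    )
```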

Hey! I'm wondering how to modify the code to support cold_steps with AMP. I tried, but if I freeze the encoder for the first few epochs, it doesn't seem possible to unfreeze it later with "params.requires_grad = True": the loss will not decrease and accuracy does not improve. Have you figured out a possible solution? I need some help.

One simple solution is to save the model parameters to disk after finishing the cold steps. Then you can start a new training run, reload the model parameters, and unfreeze the BERT encoder.
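A rough sketch of that two-run workaround; the name `model` and the "text_field_embedder" filter are illustrative, not GECToR's exact code:

```python
# Illustrative two-run workaround: finish the cold steps, save, then restart,
# reload, and unfreeze the encoder before continuing training.
import torch

# End of run 1 (cold steps only): persist the warmed-up weights.
torch.save(model.state_dict(), "after_cold_steps.th")

# Start of run 2: reload everything, then unfreeze the BERT encoder.
model.load_state_dict(torch.load("after_cold_steps.th"))
for name, param in model.named_parameters():
    if "text_field_embedder" in name:   # assumed name of the encoder sub-module
        param.requires_grad = True
```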

Okay, I found that it works if requires_grad is set inside the forward method. Thank you, by the way~
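A minimal, self-contained illustration of that trick, with toy modules standing in for the BERT encoder and the tag classifier (none of this is GECToR's actual code):

```python
# Toy sketch: the encoder starts frozen and is unfrozen from inside forward()
# after `cold_steps` training steps, which avoids toggling requires_grad
# between optimizer steps in the trainer.
import torch

class TinyTagger(torch.nn.Module):
    def __init__(self, cold_steps=100):
        super().__init__()
        self.encoder = torch.nn.Linear(16, 16)   # stand-in for the BERT encoder
        self.head = torch.nn.Linear(16, 4)       # stand-in for the tag classifier
        self.cold_steps = cold_steps
        self.steps_seen = 0
        for p in self.encoder.parameters():      # frozen during the cold steps
            p.requires_grad = False

    def forward(self, x):
        if self.training:
            self.steps_seen += 1
            if self.steps_seen > self.cold_steps:
                for p in self.encoder.parameters():
                    p.requires_grad = True       # unfreeze from inside forward()
        return self.head(self.encoder(x))
```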

Hi @HillZhang1999, could you please describe the changes you made to use the latest AllenNLP? Thanks!

Maybe you can refer to this repo: https://github.com/HillZhang1999/MuCGEC/tree/main/models/seq2edit-based-CGEC

Thanks for replying so quickly!
It looks like you did not use the tokenization file from this code base? I tried to replace the existing ones with pretrainedIndexer and pretrainedEmbedder directly, but the predicted results are different.
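If "pretrainedIndexer and pretrainedEmbedder" refers to AllenNLP's mismatched pair (my assumption; the thread does not name the exact classes), a minimal setup looks roughly like this:

```python
# Assumed interpretation: AllenNLP's "mismatched" indexer/embedder, which run
# word-piece tokenization internally but return one vector per original word,
# so GECToR's word-level tags still line up.
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer
from allennlp.modules.token_embedders import PretrainedTransformerMismatchedEmbedder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder

model_name = "bert-base-cased"
token_indexers = {"bert": PretrainedTransformerMismatchedIndexer(model_name=model_name)}
text_field_embedder = BasicTextFieldEmbedder(
    {"bert": PretrainedTransformerMismatchedEmbedder(model_name=model_name)}
)
```

One plausible source of the differing predictions: the mismatched embedder pools (averages) the sub-token vectors of each word, while, as far as I understand, the original GECToR indexer/embedder select sub-tokens by offset, so the per-word representations are not identical even with the same checkpoint.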

And if you would like to train a seq2edit GEC model without the AllenNLP bundle but with faster speed, I made a DeepSpeed + PyTorch + Transformers implementation; you can refer to this repo:
https://github.com/blcuicall/CCL2022-CLTC/tree/main/baselines/track3/seq2edit