grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)

Use GEC with latest transformers and allennlp modules

Jiltseb opened this issue

I want to use GEC with the latest transformers library (v4.4.2). However, this causes several module errors in gector that seem difficult to fix. I have also tried allennlp v1.5.0 but ran into errors.

Note: there is no issue getting GEC to work with the versions specified in requirements.txt. It's just that I want to use it in a virtual environment with the latest transformers/allennlp versions.

Any help is highly appreciated! @skurzhanskyi

I also tried with the latest versions. It seems a lot of the code uses deprecated functionality that needs to be rewritten.

@skurzhanskyi any news on this? We are still stuck with this.

Hi @Jiltseb
We have plans to update transformers this month

Hi, any update on this?

I had to change the code to make it fit the new allennlp (I can do a PR if needed), but I'm still facing many issues while loading models or running predictions.

I tried all 3 pretrained models and cannot make any of them work...

Thanks in advance

Hi, there's a branch with transformers==4.2.2. You can check it here:
https://github.com/grammarly/gector/tree/update_transformers_support_fasttokenizers
At the same time, the pretrained models produce poor output with this code; we're in the middle of retraining the models.

Hi @skurzhanskyi, I just tried but unfortunately, I got the same errors...

I'm trying to use the GecBERTModel class directly to integrate it into my code.
I have a higher version of allennlp, and I modified the imports to work with it; however, most of the errors come from missing keys or bad loading of the models.
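
For reference, this is roughly how I'm loading it. The constructor arguments are what I inferred from predict.py and the README, so treat them as assumptions rather than the exact API:

from gector.gec_model import GecBERTModel

# Rough sketch of my integration (argument names inferred from predict.py /
# the README; they may differ in your version of the repo).
model = GecBERTModel(
    vocab_path="data/output_vocabulary",   # vocabulary directory from the repo
    model_paths=["xlnet_0_gector.th"],     # one of the pretrained checkpoints
    model_name="xlnet",                    # transformer backbone
    special_tokens_fix=0,                  # depends on the checkpoint, per the README
    iterations=5,                          # number of correction passes
    min_error_probability=0.0,
    lowercase_tokens=False,
)

# handle_batch expects a list of tokenized sentences and returns the
# corrected batch plus the number of edits that were applied.
batch = ["How ar you my firend ?".split()]
corrected, total_updates = model.handle_batch(batch)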

I will wait for the new release.

Have a great day

Hi, sorry to post again...

I managed to make it work with transformers 4.6.1 and allennlp 2.6.0.

However, the output of handle_batch(my_string_sentence.split()) doesn't correct anything...
For example:

handle_batch("How ar you my firend ?".split())
[['How', 'ar', 'you', 'my', 'firend', '?']] 0

To do this, I removed some @overrides decorators and added the function

def as_padded_tensor_dict(
    self,
    tokens: Dict[str, List[int]],
    padding_lengths: Dict[str, int],
) -> Dict[str, torch.Tensor]:
    # Convert the indexed "bert" ids and word-piece offsets into tensors,
    # matching the dict-of-tensors interface that newer allennlp expects.
    return {
        "input_ids": torch.tensor(tokens["bert"]),
        "offsets": torch.tensor(tokens["bert-offsets"]),
    }

in tokenizer_indexers.py, which, as far as I can tell, replaces the old pad_token_sequence.

I also had to remove the mask in the seq2labels_model, as it was always true and not of the same size...
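
To be concrete, my workaround for the mask amounted to something like this sketch (my own hack, not the original gector code; I assume 0 is the padding id):

import torch

# Rebuild the padding mask from the padded input ids instead of using the
# incoming mask, which for me was all-True and had the wrong length.
def padding_mask(input_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # input_ids: (batch_size, seq_len); True for real tokens, False for padding.
    return input_ids != pad_id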

I know I did some hacky things and was hoping it would work, as I don't know the codebase.

I hope it can help you and that we can have a really nice open-source, state-of-the-art grammar corrector (which we can train in other languages) :)

Have a great day

@skurzhanskyi Any update on the release with the new retrained models?

Hi @Jiltseb
Sorry for the late reply.
Unfortunately, we've run into problems getting the same quality of models with the branch code, so we cannot move to it completely.
In case you don't need the pretrained model, you can try using this branch.

Hi @Jiltseb @ierezell @abhinavdayal
We have great news: we just merged #133, a new GECToR version that supports the latest transformers & torch. There are also new pretrained models (BERT, RoBERTa, XLNet). The scores are slightly different but still comparable.