hooshvare / parsbert

🤗 ParsBERT: Transformer-based Model for Persian Language Understanding

Home Page: https://doi.org/10.1007/s11063-021-10528-4


Preprocessing code/details

chrisji opened this issue

Hi - nice work on this!

Sorry if I'm not looking hard enough, but have you released the code/details for the sequence preprocessing used in pre-training?

I.e. Steps 1 & 2 in the paper:

(1) removing all the trivial and junk characters and (2) standardizing the corpus with respect to Persian characters.

Is this at all handled by the included tokenizer, or should one recreate the preprocessing step for fine-tuning?

Thanks

Hi @chrisji,

We haven't released the preprocessing code yet because the paper is still going through the acceptance process. We applied a hierarchy of Persian-specific preprocessing steps to the data before the pre-training stage (tokenization, model preparation). However, with the new version of Tokenizers (by Hugging Face) you can define normalization, pre-tokenization, and post-processing directly on your tokenizer, as in the code below. It is only a schema; you can add your own steps or define custom components.

from tokenizers import (
    Tokenizer,
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
)

# Minimal vocabulary just to make the schema runnable; in practice the
# vocabulary comes from training the tokenizer on your corpus.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4}

tokenizer = Tokenizer(models.WordPiece(vocab, unk_token="[UNK]"))

# Normalization (text cleanup, accent/case handling) and BERT-style splitting.
tokenizer.normalizer = normalizers.BertNormalizer()
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

# Post-processing wraps sequences with [CLS]/[SEP]; it needs those tokens and their ids.
tokenizer.post_processor = processors.BertProcessing(
    ("[SEP]", vocab["[SEP]"]),
    ("[CLS]", vocab["[CLS]"]),
)
tokenizer.decoder = decoders.WordPiece()
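On the original question about step (2), standardizing Persian characters: BertNormalizer does not do this by itself, but the same Tokenizers API lets you chain character-level replacements into the normalizer. The snippet below is only a hedged sketch of that idea (it is not the ParsBERT authors' released pipeline), mapping a couple of common Arabic character variants to their Persian forms:

from tokenizers import normalizers

# Illustrative sketch only: standardize two common Arabic-vs-Persian character
# variants before the usual BERT cleanup. The actual ParsBERT preprocessing
# hierarchy may cover more cases and differ in detail.
persian_normalizer = normalizers.Sequence([
    normalizers.Replace("ي", "ی"),  # Arabic Yeh -> Persian Yeh
    normalizers.Replace("ك", "ک"),  # Arabic Kaf -> Persian Kaf
    normalizers.BertNormalizer(lowercase=False, strip_accents=False),
])

print(persian_normalizer.normalize_str("علي"))  # -> "علی"

If you assign such a Sequence to tokenizer.normalizer in the schema above, the same standardization is applied consistently at both pre-training and fine-tuning time.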