hooshvare / parsbert

🤗 ParsBERT: Transformer-based Model for Persian Language Understanding

Home Page: https://doi.org/10.1007/s11063-021-10528-4


Preprocessing code/details

chrisji opened this issue

Hi - nice work on this!

Sorry if I'm not looking hard enough, but have you released the code/details for the sequence preprocessing used in pre-training?

I.e. Steps 1 & 2 in the paper:

(1) removing all the trivial and junk characters and (2) standardizing the corpus with respect to Persian characters.

Is this at all handled by the included tokenizer, or should one recreate the preprocessing step for fine-tuning?

Thanks

Hi @chrisji,

We haven't released the preprocessing code yet because the paper is still going through the acceptance process. We applied a hierarchy of Persian-specific preprocessing steps to the data before the pre-training stage (tokenization, model preparation). However, with the new version of Tokenizers (by Hugging Face) you can define normalization, pre-tokenization, and post-processing directly on your tokenizer, as in the code below. It is only a schema; you can add your own steps or define custom components.

from tokenizers import (
    Tokenizer,
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
)

# Minimal vocabulary just to make the schema runnable; in practice the
# vocabulary comes from training the tokenizer on your corpus.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4}

tokenizer = Tokenizer(models.WordPiece(vocab, unk_token="[UNK]"))

# Normalization (text cleanup, accent/case handling) and BERT-style splitting.
tokenizer.normalizer = normalizers.BertNormalizer()
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

# Post-processing wraps sequences with [CLS]/[SEP]; it needs those tokens and their ids.
tokenizer.post_processor = processors.BertProcessing(
    ("[SEP]", vocab["[SEP]"]),
    ("[CLS]", vocab["[CLS]"]),
)
tokenizer.decoder = decoders.WordPiece()
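On the original question about step (2), standardizing Persian characters: BertNormalizer does not do this by itself, but the same Tokenizers API lets you chain character-level replacements into the normalizer. The snippet below is only a hedged sketch of that idea (it is not the ParsBERT authors' released pipeline), mapping a couple of common Arabic character variants to their Persian forms:

from tokenizers import normalizers

# Illustrative sketch only: standardize two common Arabic-vs-Persian character
# variants before the usual BERT cleanup. The actual ParsBERT preprocessing
# hierarchy may cover more cases and differ in detail.
persian_normalizer = normalizers.Sequence([
    normalizers.Replace("ي", "ی"),  # Arabic Yeh -> Persian Yeh
    normalizers.Replace("ك", "ک"),  # Arabic Kaf -> Persian Kaf
    normalizers.BertNormalizer(lowercase=False, strip_accents=False),
])

print(persian_normalizer.normalize_str("علي"))  # -> "علی"

If you assign such a Sequence to tokenizer.normalizer in the schema above, the same standardization is applied consistently at both pre-training and fine-tuning time.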