Modernize the Melusine tokenizer
hugo-quantmetry opened this issue
Description of Problem:
Depending on the context, "tokenization" may cover different functionalities. For example:
- Gensim (`gensim.utils.tokenize`): tokenization is limited to splitting text into tokens.
- HuggingFace tokenizers (`encode` methods): a full NLP tokenization pipeline including text normalization, pre-tokenization, the tokenizer model, and post-processing.
Tokenization in Melusine is currently a hybrid which covers the following functionalities:
- Splitting
- Stopwords removal
- Name flagging
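To make the hybrid behavior concrete, here is an illustrative sketch (not the actual Melusine implementation; the stopword and name lists are hypothetical) of the three steps the current tokenizer mixes together:

```python
import re

STOPWORDS = {"how", "are", "you"}  # hypothetical stopword list
NAMES = {"john"}                   # hypothetical name list for flagging

def tokenize(text):
    # 1. Splitting: break the text into word tokens
    tokens = re.findall(r"\w+", text.lower())
    # 2. Stopwords removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. Name flagging: replace known names with a generic flag token
    return ["flag_name_" if t in NAMES else t for t in tokens]

print(tokenize("Hello John how are you"))  # ['hello', 'flag_name_']
```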
It seems to me that the full NLP tokenization pipeline is currently spread across the Melusine package (`prepare_data.cleaning`, `nlp_tools.tokenizer`, and even the `prepare_data` method of `models.train.NeuralModel`).
This issue can be split into a few questions:
- How can we refactor the code to make the full tokenization pipeline stand out?
- How can we easily configure the tokenization pipeline? (Ex: a user-friendly, readable tokenizer.json file)
- How can we package the tokenizer to ensure reproducibility?
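One possible shape for such a user-friendly config file (all field names here are hypothetical, just to illustrate the idea):

```json
{
  "tokenizer_regex": "\\w+",
  "stopwords": ["le", "la", "les", "de", "un", "une"],
  "flags": {
    "names": "flag_name_",
    "phone_numbers": "flag_phone_"
  }
}
```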
Overview of the Solution:
I suggest creating a revamped MelusineTokenizer class with its own load and save methods.
The class should neatly package the functionalities commonly found in a "full NLP tokenization pipeline", such as:
- Text cleaning?
- Flagging (phone numbers, email addresses, etc)
- Splitting
- Stopwords removal
The tokenizer could be saved to and loaded from a human-readable "json" file.
Examples:

```python
tokenizer = MelusineTokenizer(tokenizer_regex, stopwords, flags)
tokens = tokenizer.tokenize("Hello John how are you")
tokenizer.save("tokenizer.json")
tokenizer_reloaded = MelusineTokenizer.load("tokenizer.json")
```
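A minimal sketch of what such a class could look like, assuming a json round-trip of the full configuration (the constructor signature and flag format are assumptions, not the final API):

```python
import json
import re

class MelusineTokenizer:
    """Sketch of the proposed tokenizer; the real API may differ."""

    def __init__(self, tokenizer_regex, stopwords=None, flags=None):
        self.tokenizer_regex = tokenizer_regex
        self.stopwords = set(stopwords or [])
        # Hypothetical flag format: token -> replacement flag
        self.flags = flags or {}  # e.g. {"john": "flag_name_"}

    def tokenize(self, text):
        # Splitting, then stopwords removal, then flagging
        tokens = re.findall(self.tokenizer_regex, text.lower())
        tokens = [t for t in tokens if t not in self.stopwords]
        return [self.flags.get(t, t) for t in tokens]

    def save(self, path):
        # Persist the full configuration in a human-readable json file
        config = {
            "tokenizer_regex": self.tokenizer_regex,
            "stopwords": sorted(self.stopwords),
            "flags": self.flags,
        }
        with open(path, "w") as f:
            json.dump(config, f, indent=2)

    @classmethod
    def load(cls, path):
        # Rebuild an identical tokenizer from the saved config
        with open(path) as f:
            config = json.load(f)
        return cls(config["tokenizer_regex"], config["stopwords"], config["flags"])
```

Because save writes every parameter the constructor needs, load rebuilds a tokenizer that produces identical tokens, which addresses the reproducibility question above.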
Definition of Done:
- The new tokenizer class works as expected.
- The tokenizer can be saved to / loaded from a human-readable config file.
- The tokenizer centralizes all tokenization functionalities in the broader sense.