Modernize the Melusine tokenizer
hugo-quantmetry opened this issue
Description of Problem:
Depending on the context, "tokenization" may cover different functionalities. For example:
- Gensim (`gensim.utils.tokenize`): tokenization is limited to splitting text into tokens.
- HuggingFace tokenizers (`encode` methods): a full NLP tokenization pipeline including text normalization, pre-tokenization, the tokenizer model, and post-processing.
Tokenization in Melusine is currently a hybrid which covers the following functionalities:
- Splitting
- Stopwords removal
- Name flagging
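To make the hybrid behavior concrete, here is an illustrative sketch (not the actual Melusine implementation; the stopword and name lists are hypothetical) of the three steps the current tokenizer mixes together:

```python
import re

STOPWORDS = {"how", "are", "you"}  # hypothetical stopword list
NAMES = {"john"}                   # hypothetical name list for flagging

def tokenize(text):
    # 1. Splitting: break the text into word tokens
    tokens = re.findall(r"\w+", text.lower())
    # 2. Stopwords removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. Name flagging: replace known names with a generic flag token
    return ["flag_name_" if t in NAMES else t for t in tokens]

print(tokenize("Hello John how are you"))  # ['hello', 'flag_name_']
```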
It seems to me that the full NLP tokenization pipeline is currently spread across the Melusine package (`prepare_data.cleaning`, `nlp_tools.tokenizer`, and even the `prepare_data` method of `models.train.NeuralModel`).
This issue can be split into a few questions:
- How can we refactor the code to make the full tokenization pipeline stand out?
- How can we easily configure the tokenization pipeline? (Ex: a user-friendly, readable tokenizer.json file)
- How can we package the tokenizer to ensure reproducibility?
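One possible shape for such a user-friendly config file (all field names here are hypothetical, just to illustrate the idea):

```json
{
  "tokenizer_regex": "\\w+",
  "stopwords": ["le", "la", "les", "de", "un", "une"],
  "flags": {
    "names": "flag_name_",
    "phone_numbers": "flag_phone_"
  }
}
```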
Overview of the Solution:
I suggest creating a revamped MelusineTokenizer class with its own load and save methods.
The class should neatly package the functionalities commonly found in a "full NLP tokenization pipeline", such as:
- Text cleaning?
- Flagging (phone numbers, email addresses, etc)
- Splitting
- Stopwords removal
The tokenizer could be saved to and loaded from a human-readable "json" file.
Examples:

```python
tokenizer = MelusineTokenizer(tokenizer_regex, stopwords, flags)
tokens = tokenizer.tokenize("Hello John how are you")
tokenizer.save("tokenizer.json")
tokenizer_reloaded = MelusineTokenizer.load("tokenizer.json")
```
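A minimal sketch of what such a class could look like, assuming a json round-trip of the full configuration (the constructor signature and flag format are assumptions, not the final API):

```python
import json
import re

class MelusineTokenizer:
    """Sketch of the proposed tokenizer; the real API may differ."""

    def __init__(self, tokenizer_regex, stopwords=None, flags=None):
        self.tokenizer_regex = tokenizer_regex
        self.stopwords = set(stopwords or [])
        # Hypothetical flag format: token -> replacement flag
        self.flags = flags or {}  # e.g. {"john": "flag_name_"}

    def tokenize(self, text):
        # Splitting, then stopwords removal, then flagging
        tokens = re.findall(self.tokenizer_regex, text.lower())
        tokens = [t for t in tokens if t not in self.stopwords]
        return [self.flags.get(t, t) for t in tokens]

    def save(self, path):
        # Persist the full configuration in a human-readable json file
        config = {
            "tokenizer_regex": self.tokenizer_regex,
            "stopwords": sorted(self.stopwords),
            "flags": self.flags,
        }
        with open(path, "w") as f:
            json.dump(config, f, indent=2)

    @classmethod
    def load(cls, path):
        # Rebuild an identical tokenizer from the saved config
        with open(path) as f:
            config = json.load(f)
        return cls(config["tokenizer_regex"], config["stopwords"], config["flags"])
```

Because save writes every parameter the constructor needs, load rebuilds a tokenizer that produces identical tokens, which addresses the reproducibility question above.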
Definition of Done:
- The new tokenizer class works as expected.
- The tokenizer can be saved to / loaded from a human-readable config file.
- The tokenizer centralizes all tokenization functionalities in the broader sense.