MAIF / melusine

📧 Melusine: Use Python to automate your email processing workflow

Home Page: https://maif.github.io/melusine

Modernize the Melusine tokenizer

hugo-quantmetry opened this issue · comments

Description of Problem:
Depending on the context, tokenization may cover different functionalities. For example:

  • Gensim (gensim.utils.tokenize): tokenization is limited to splitting text into tokens
  • HuggingFace tokenizers (encode methods): a full NLP tokenization pipeline including text normalization, pre-tokenization, the tokenizer model, and post-processing
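To make the "splitting only" end of the spectrum concrete, here is a minimal sketch (hypothetical, not the Gensim implementation) of a tokenizer that does nothing but extract word-like tokens:

```python
import re

def split_tokenize(text: str) -> list:
    # Splitting only: extract word-like tokens.
    # No normalization, no stopword removal, no flagging.
    return re.findall(r"\w+", text)

print(split_tokenize("Hello John, how are you?"))
# → ['Hello', 'John', 'how', 'are', 'you']
```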

Tokenization in Melusine is currently a hybrid which covers the following functionalities:

  • Splitting
  • Stopwords removal
  • Name flagging

It seems to me that the Full NLP tokenization pipeline is a bit spread across the Melusine package (prepare_data.cleaning, nlp_tools.tokenizer and even the prepare_data method of the models.train.NeuralModel).

This issue can be split into a few questions:

  • How can we refactor the code to make the full tokenization pipeline stand out?
  • How can we easily configure the tokenization pipeline? (e.g. a user-friendly, readable tokenizer.json file)
  • How can we package the tokenizer to ensure repeatability?

Overview of the Solution:
I suggest creating a revamped MelusineTokenizer class with its own load and save methods.
The class should neatly package the functionalities commonly found in a "Full NLP Tokenization pipeline", such as:

  • Text cleaning ?
  • Flagging (phone numbers, email addresses, etc)
  • Splitting
  • Stopwords removal

The tokenizer could be saved to and loaded from a human-readable JSON file.

Examples:

tokenizer = MelusineTokenizer(tokenizer_regex, stopwords, flags)
tokens = tokenizer.tokenize("Hello John how are you")
tokenizer.save("tokenizer.json")
tokenizer_reloaded = MelusineTokenizer.load("tokenizer.json")
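A minimal sketch of what such a class could look like, supporting the usage above. The constructor arguments, the regex-based splitting, and the flags mapping are assumptions for illustration, not a final design:

```python
import json
import re

class MelusineTokenizer:
    """Sketch of the proposed tokenizer; its configuration is JSON-serializable."""

    def __init__(self, tokenizer_regex: str, stopwords: list, flags: dict):
        self.tokenizer_regex = tokenizer_regex
        self.stopwords = set(stopwords)
        self.flags = flags  # e.g. {"john": "flag_name_"}

    def tokenize(self, text: str) -> list:
        # Splitting, then stopword removal, then flagging
        tokens = re.findall(self.tokenizer_regex, text.lower())
        tokens = [t for t in tokens if t not in self.stopwords]
        return [self.flags.get(t, t) for t in tokens]

    def save(self, path: str) -> None:
        config = {
            "tokenizer_regex": self.tokenizer_regex,
            "stopwords": sorted(self.stopwords),
            "flags": self.flags,
        }
        with open(path, "w", encoding="utf-8") as f:
            json.dump(config, f, indent=2)

    @classmethod
    def load(cls, path: str) -> "MelusineTokenizer":
        with open(path, encoding="utf-8") as f:
            return cls(**json.load(f))

tokenizer = MelusineTokenizer(r"\w+", ["how", "are"], {"john": "flag_name_"})
print(tokenizer.tokenize("Hello John how are you"))
# → ['hello', 'flag_name_', 'you']
tokenizer.save("tokenizer.json")
reloaded = MelusineTokenizer.load("tokenizer.json")
print(reloaded.tokenize("Hello John how are you"))
# → ['hello', 'flag_name_', 'you']
```

Because the whole configuration round-trips through a plain dict, the saved tokenizer.json stays human-readable and diffable.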

Definition of Done:
The new tokenizer class works as described above.
The tokenizer can be read from / saved into a human-readable config file.
The tokenizer centralizes all tokenization functionalities in the broad sense.