MAIF / melusine

📧 Melusine: Use python to automatize your email processing workflow

Home Page: https://maif.github.io/melusine

use flashtext as a replacement for regex

remiadon opened this issue

FlashText:

  • is written in pure Python and has no extra dependencies
  • can extract and replace keywords made up of multiple words
  • offers significant speedups compared to re.sub and re.findall when the keyword list is large
  • can take "spelling errors" into consideration, via levenshtein distance
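To illustrate where the speedup comes from, here is a minimal sketch of trie-based keyword replacement next to the equivalent regex alternation. It is not FlashText's actual code or API: `build_trie`, `trie_replace` and `regex_replace` are hypothetical names written for this example, and the keywords are made up. The trie version walks the text once regardless of how many keywords are registered.

```python
import re


def build_trie(keywords):
    """Build a character trie; the "_end" key stores the replacement."""
    root = {}
    for keyword, replacement in keywords.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node["_end"] = replacement
    return root


def trie_replace(text, trie):
    """Replace the longest keyword starting at each word boundary, in one pass."""
    out = []
    i, n = 0, len(text)
    while i < n:
        at_word_start = i == 0 or not text[i - 1].isalnum()
        best_end, best_repl = -1, None
        if at_word_start and text[i].isalnum():
            node, j = trie, i
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                # Only accept a match that ends on a word boundary.
                if "_end" in node and (j == n or not text[j].isalnum()):
                    best_end, best_repl = j, node["_end"]
        if best_repl is not None:
            out.append(best_repl)
            i = best_end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def regex_replace(text, keywords):
    """Baseline: one big alternation, as a regex-based cleaner might do."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: keywords[m.group(0).lower()], text)


# Made-up keywords for illustration only.
KEYWORDS = {"big apple": "New York", "bay area": "California"}
TRIE = build_trie(KEYWORDS)
```

Both functions give the same result on a simple input such as `"I love Big Apple and Bay Area."`; the difference is how the cost grows as `KEYWORDS` gets longer.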

I see melusine uses a lot of regex for preprocessing/cleaning.
I wonder if FlashText would be useful to melusine.

Hi Remi! Thanks for the tip!

It is indeed a lot faster than plain regex for keyword lists of more than 500 items.

So we decided to use it for name flagging, which looks names up in a .csv file containing several thousand entries. This made the computation 20 times faster! The update will be included in the next version.
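As a rough illustration of name flagging driven by a CSV list, here is a simplified sketch. It is not Melusine's implementation and does not use FlashText: it handles single-word names only, via a set lookup per token. The `name` column, the sample names, the `FLAG_NAME` token, and the `load_names` / `flag_names` helpers are all assumptions made for this example.

```python
import csv
import io
import re

# Hypothetical stand-in for the names .csv file (assumed "name" column).
NAMES_CSV = "name\nremi\nsacha\ntiphaine\n"


def load_names(csv_file):
    """Read the assumed 'name' column into a lowercase set."""
    reader = csv.DictReader(csv_file)
    return {row["name"].strip().lower() for row in reader}


def flag_names(text, names, flag="FLAG_NAME"):
    """Replace every word found in `names` with `flag`, in one pass over the text."""
    def substitute(match):
        word = match.group(0)
        return flag if word.lower() in names else word

    return re.sub(r"\w+", substitute, text)


NAMES = load_names(io.StringIO(NAMES_CSV))
```

Because membership in a set is checked per word, the cost of `flag_names` depends on the length of the text rather than on the number of names, which is the same property that makes a keyword-dictionary approach attractive for a list of several thousand entries.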

The other regexes use smaller keyword lists (~100 items), so switching them to FlashText is not worthwhile for now.