body extraction

Question

caronpe opened this issue 4 years ago

Hello!
Your project looks awesome and I really want to try it.
I see that your project uses very clean text for the email body data.
In "real" email life, the content of a mail body is very dirty (HTML, encoding, formatting, multipart, different languages...).
Do you handle that? (Or do you only work with your internal company emails?)

Faithfully

Deleted user · Answer 1 · Tue Jan 07 2020 17:50:49 GMT+0800 (China Standard Time)

Melusine has the core code to deal with emails in any language, as long as the user (you and me) gives it the patterns (in English, in Spanish, or for a new mailbox we haven't seen yet).

Melusine was designed for company emails in French.
Those emails already come in many different shapes because they come from several mailboxes.

The configuration of Melusine provides many patterns for French emails. You can complete or replace them with English patterns. See how to customize Melusine in https://github.com/MAIF/melusine/blob/master/tutorial/tutorial10_conf_file.ipynb

If you are familiar with regular expressions, you can copy the conf.json file to a custom_conf.json and adapt Melusine to your needs.

You can replace the regular expressions that work for French with regular expressions that work for English.

An example in Python:

Add some custom patterns (regular expressions) to the "build_historic" part, which splits a message thread into its individual messages:

import os
import json
from melusine.config.config import ConfigJsonReader
conf_melusine = ConfigJsonReader()
conf_melusine.reset_config_path()
conf_dict = conf_melusine.get_config_file()

add_to_build_historic = [
    r">?\s*The[^;\n]{0,30}[;|\n]{0,1}[^;\n]{0,30}at[^;\n]{0,30};{0,1}[^;\n]{0,30}written\s*:?.{,100}?(?:\n[A-Z][A-Za-z]{,2}:|>{3}).*?\n",
    r"^(?:From|at|The|Cci?|Object|Date|Subject): .*?\n\s*",
    r">+.{,70}Real address .*?\n\s*",
    r"^\*{3,}\s+",
    r"TR\s?:.*?\n",
    r"Fwd\s?:.*?\n",
    r"(?:Hello.{,10}\s*)You have contacted the .{5,80} Our response :",
]

conf_dict["regex"]["build_historic"]["transition_list"] = (
    add_to_build_historic + conf_dict["regex"]["build_historic"]["transition_list"]
)
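
To get a feel for what a transition pattern does, here is a minimal sketch that uses only Python's `re` module. It is not Melusine's own splitting code: the `split_on_transitions` helper, the use of `re.MULTILINE`, and the sample thread are assumptions made for illustration.

```python
import re

# Two of the transition patterns from the list above.
transition_patterns = [
    r"^(?:From|at|The|Cci?|Object|Date|Subject): .*?\n\s*",
    r"Fwd\s?:.*?\n",
]


def split_on_transitions(text, patterns):
    """Split a thread wherever one of the patterns matches (hypothetical helper)."""
    # Wrap each pattern in a non-capturing group so re.split returns only the pieces.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns), flags=re.MULTILINE)
    parts = combined.split(text)
    return [part.strip() for part in parts if part and part.strip()]


# Made-up thread: a reply followed by the quoted original message.
thread = (
    "Thanks, I will check the contract.\n"
    "From: advisor@example.com\n"
    "Please find the contract attached.\n"
)

messages = split_on_transitions(thread, transition_patterns)
print(messages)
```

Running a candidate pattern against a few real threads this way is a quick check before adding it to the transition list.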

To add a flagging pattern (for street addresses here):

conf_dict["regex"]["cleaning"]["flags_dict"][
    r"\s*[0-9]{1,4}\s*(?:street|avenue|boulevard|road)(?:\s|of){,5}(?:(?:\s|\,)?\b\w+\b(?:\s|\'|\,)?){,6}(?:(?:\s|\'|\,|\-)?(?:\b[A-Z]+\w+\b|flag_cp_)(?:\s|\'|\,|\-)?){,3}"
] = " flag_adress_ "
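
The idea behind a flags dictionary can be sketched with plain `re.sub`: each key is a pattern and each value is the flag that replaces it. The sketch below is an assumption about the general mechanism, not Melusine's implementation; the `apply_flags` helper, the two simplified patterns, and the `flag_mail_` name are hypothetical.

```python
import re

# Hypothetical flags, much simpler than the address pattern above.
flags_dict = {
    r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b": " flag_mail_ ",  # e-mail addresses
    r"\b\d{5}\b": " flag_cp_ ",                       # 5-digit postal codes
}


def apply_flags(text, flags):
    """Replace every match of each pattern by its flag (hypothetical helper)."""
    for pattern, flag in flags.items():
        text = re.sub(pattern, flag, text)
    # Collapse the extra spaces introduced by the padded flags.
    return re.sub(r"\s+", " ", text).strip()


flagged = apply_flags(
    "Write to jane.doe@example.com or visit us in 79000 Niort", flags_dict
)
print(flagged)
```

Because the patterns are applied in dictionary order, a broad pattern placed early can consume text that a later, more specific pattern was meant to flag.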

To add footers:

add_to_footer = [
    r"powered by .*?(?:(?:https?:\/\/)?(?:www\.)?[-a-zA-Z0-9:%._\\+~#=]{2,256}\.[a-zA-Z]{2,4}(?:[-a-zA-Z0-9:%_\\+.~#?&\/=]*)|\s|\b\w+\b|\(|\)){,12}",
    r"Ce message a [ée]t[ée] g[ée]n[ée]r[ée] automatiquement par [a-z-A-Z-0-9() .]{,50}",
    r"This e-mail and any attachments.*system.",
    r"This message may contains.{,80}electronic communication see",
    r"You also can consult.{5,250} you asked",
    r"This message was automatically generated by.*?\n"
]
conf_dict["regex"]["mail_segmenting"]["segmenting_dict"]["FOOTER"] = (
    add_to_footer + conf_dict["regex"]["mail_segmenting"]["segmenting_dict"]["FOOTER"]
)

To add a "HELLO" pattern:

add_to_hello = [r".{,40}happy?\s*new.{,30}", "(?:hello|hi).{,20}"]
conf_dict["regex"]["mail_segmenting"]["segmenting_dict"]["HELLO"] = (
    add_to_hello + conf_dict["regex"]["mail_segmenting"]["segmenting_dict"]["HELLO"]
)
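
Before adding greeting patterns, it can help to try them on a few sentences. The following sketch assumes the text is lowercased before matching and that patterns are anchored at the start of the sentence with `re.match`; both are assumptions for illustration, and `looks_like_greeting` is a hypothetical helper, not a Melusine function.

```python
import re

add_to_hello = [r".{,40}happy?\s*new.{,30}", r"(?:hello|hi).{,20}"]


def looks_like_greeting(sentence, patterns):
    """Return True if any pattern matches at the start of the lowercased sentence."""
    lowered = sentence.lower()
    return any(re.match(pattern, lowered) for pattern in patterns)


print(looks_like_greeting("Hello Mrs Smith,", add_to_hello))
print(looks_like_greeting("Happy new year to the whole team", add_to_hello))
print(looks_like_greeting("Please find my claim attached", add_to_hello))
```

Note that a short alternative such as `hi` also matches the beginning of words like "history", so a word boundary (`\b`) may be worth adding depending on your data.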

Finally, replace the default Melusine conf file with your custom conf file:

# Assumes the CONF environment variable points to a writable directory.
path_to_custom_melusine = os.path.join(os.environ["CONF"], "custom_melusine_conf.json")
with open(path_to_custom_melusine, "w", encoding="utf-8") as json_file:
    # ensure_ascii=False keeps accented characters readable in the JSON file.
    json.dump(conf_dict, json_file, indent=4, ensure_ascii=False)
conf_melusine.set_config_path(file_path=path_to_custom_melusine)
print("Melusine custom file edited : ", path_to_custom_melusine)

After that, most of the work lies in finding the right regular expressions.
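
One way to make that search less painful is to check each candidate pattern against a handful of sample texts before putting it in the conf file. This is only a sketch: `check_patterns` is a hypothetical helper, the second pattern is a simplified stand-in, and the samples are made up.

```python
import re


def check_patterns(patterns, samples):
    """Map each pattern to the samples it matches; fail early on invalid patterns."""
    report = {}
    for pattern in patterns:
        try:
            compiled = re.compile(pattern)
        except re.error as exc:
            raise ValueError(f"Invalid pattern {pattern!r}: {exc}") from exc
        report[pattern] = [sample for sample in samples if compiled.search(sample)]
    return report


candidate_footers = [
    r"This message was automatically generated by.*?\n",
    r"powered by \w+",  # simplified, hypothetical pattern
]
samples = [
    "This message was automatically generated by the claims portal\n",
    "Best regards,\nJohn\n",
]

report = check_patterns(candidate_footers, samples)
for pattern, hits in report.items():
    print(f"{pattern!r} matched {len(hits)} sample(s)")
```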

Tiphaine Fabre · Answer 2 · Tue Jan 07 2020 18:33:08 GMT+0800 (China Standard Time)

Hello @caronpe !
Thank you !

To complete the answer:
A lot of cleaning and formatting is already done by Melusine: text_to_lowercase, remove_accents, remove_line_break, remove_superior_symbol, remove_apostrophe, remove_multiple_spaces_and_strip_text, etc., as well as build_historic, which detects and extracts the individual messages of a conversation with multiple replies and transfers.
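
To give an idea of what such cleaning steps do, here is a rough standard-library approximation. It is not Melusine's code: `basic_clean` and `remove_accents_sketch` are hypothetical helpers, and details such as replacing apostrophes and `>` symbols with a space are assumptions.

```python
import re
import unicodedata


def remove_accents_sketch(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def basic_clean(text):
    """Chain a few simple cleaning steps (hypothetical approximation)."""
    text = text.lower()                                # lowercase
    text = remove_accents_sketch(text)                 # remove accents
    text = text.replace("\r", " ").replace("\n", " ")  # remove line breaks
    text = text.replace(">", " ")                      # remove quoting symbols
    text = text.replace("'", " ")                      # remove apostrophes
    return re.sub(r"\s+", " ", text).strip()           # collapse spaces and strip


cleaned = basic_clean("> Bonjour,\n> J'ai déjà  envoyé le dossier.")
print(cleaned)
```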

The project is already used on the everyday mail received by a French insurance company (15K/day), not only on internal company emails.

In that case, more specific cleaning and transformations are applied. The config.json file is edited (to detect some specific greetings, for example), and there are some custom preparation steps to handle HTML tags and to extract parts of emails in specific formats, such as insurance claims or internet contact forms. You can add specific cleaning functions to the TransformerScheduler pipeline.
Encodings are handled by a proprietary solution on top of Melusine. There is a standard for MIME messages and an associated Python library: https://docs.python.org/2/library/email.mime.html but it's never perfect.
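
As an illustration of the MIME and HTML handling mentioned above, here is a minimal sketch based on Python 3's standard `email` and `html.parser` modules. It is not the proprietary solution referred to in this answer: `extract_text_body` and `strip_html` are hypothetical helpers, and real-world emails with broken encodings or malformed HTML will need more care.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes and normalize whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def extract_text_body(raw_bytes):
    """Decode a raw MIME message and return its body as plain text."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer the text/plain part; fall back to text/html.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    content = part.get_content()  # decoded to str using the declared charset
    if part.get_content_type() == "text/html":
        content = strip_html(content)
    return content.strip()


# Made-up HTML-only message, built and serialized just for the demonstration.
sample = EmailMessage()
sample["Subject"] = "Claim"
sample["From"] = "client@example.com"
sample["To"] = "contact@example.com"
sample.set_content(
    "<html><body><p>Hello,</p><p>My name is Zoë &amp; I have a claim.</p></body></html>",
    subtype="html",
)

body = extract_text_body(sample.as_bytes())
print(body)
```

The resulting string could then be stored in the body column and passed through the usual cleaning steps.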

caronpe · Answer 3 · Tue Jan 07 2020 22:19:32 GMT+0800 (China Standard Time)

Thank you for your long and very precise answer!
Thanks to it, I discovered more about your project and the features you provide.

Actually, I'm more at the data parsing step, and to answer @TFA-MAIF: I'm currently using the Python email library, but as you say, "it's never perfect". Extracting clean text data from emails can be a real nightmare...

I will follow your project and I hope to contribute!