MAIF / melusine

📧 Melusine: Use python to automatize your email processing workflow

Home Page: https://maif.github.io/melusine

Extract Body

CameliaSNCF opened this issue

Python version : 3.8.10

Melusine version : 2.3.2

Operating System : Windows

Hello, I have an issue with the segmentation of the body to clean. During segmentation, Melusine tags the CC of the email as body, so the body_header_extract function treats the CC as the body to clean.

segmentingBody

extractBody

In the example above, the part that was selected as the body (yellow) does not correspond to the actual body of the email (delimited in blue).

Thank you for your help.
Best regards.
Camelia

Could you share the different steps of your pipeline? By that I mean all the steps from your original email to structured_body.

Best regards

Hello, our data doesn't come in the same format as yours, so here is how it starts:

CorpsDeTexte1
CorpsDeTexte2

I then remove the useless parts and try to get the email as close to your format as I can:

body3

Then I apply the segmenting pipeline:

Segmenting3

And finally the cleaning pipeline:

cleaning

Is this enough information to understand the problem?

Hello,

Yes thank you.
As your email has a specific format, you need to adapt the regexes that parse the message history, because the default regex used to split messages stops after the ";" of the "A: " field. You want it to go further.
You will need to adapt the regexes under the key ["build_historic"]["transition_list"] in conf.json to match your format.
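
To make the symptom concrete, here is a minimal sketch using only Python's standard re module. The sample header and the pattern are assumptions made up for illustration. They are not Melusine's default regex and not the data from the screenshots. The sketch only shows how a transition pattern that ends at the first ";" leaves the remaining recipients and the Cc line in what is then treated as the body.

```python
import re

# Hypothetical reply header, loosely modelled on the "A: " field discussed above.
SAMPLE = (
    "De: alice@example.com\n"
    "A: bob@example.com; carol@example.com\n"
    "Cc: dave@example.com\n"
    "\n"
    "Bonjour, voici le document."
)

# Illustrative pattern (NOT Melusine's default): the match ends at the first ";".
STOPS_AT_SEMICOLON = re.compile(r"De\s*:.*?\nA\s*:.*?;", re.DOTALL)


def split_on_transition(pattern, text):
    """Return (header, remainder) around the first match, or (None, text)."""
    match = pattern.search(text)
    if match is None:
        return None, text
    return match.group(0), text[match.end():]


header, remainder = split_on_transition(STOPS_AT_SEMICOLON, SAMPLE)
print(repr(header))     # header is cut right after the first recipient
print(repr(remainder))  # second recipient and Cc line end up with the body
```

Testing a candidate pattern this way on a few raw emails, before editing conf.json, can help confirm where the match actually stops.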

Best regards.

Also let us know when you find the perfect regex. It could be a contribution if it is not too specific :)

Hello, thank you very much for your answer, it was very useful.
I changed the regex where you indicated. Since the ';' is only used as a separator between the recipients of the email and what I care about is the line break, I removed the ';' from the regex and it now works.

regex
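
The regex itself was only shared as a screenshot, so the following is a hedged sketch of the general idea rather than the actual pattern used here. It assumes a hypothetical header with "De:", "A:" and "Cc:" lines and consumes each header line up to its line break, so a ";" between recipients no longer ends the match.

```python
import re

# Hypothetical reply header (assumed for illustration, not the real data).
SAMPLE = (
    "De: alice@example.com\n"
    "A: bob@example.com; carol@example.com\n"
    "Cc: dave@example.com\n"
    "\n"
    "Bonjour, voici le document."
)

# Illustrative pattern (NOT the regex from the screenshot):
# the "De:" line, then one or more "A:" / "Cc:" lines, each read up to its line break.
STOPS_AT_LINE_BREAK = re.compile(r"De\s*:[^\n]*\n(?:(?:A|Cc)\s*:[^\n]*\n)+")

match = STOPS_AT_LINE_BREAK.search(SAMPLE)
header = match.group(0)
body = SAMPLE[match.end():].strip()

print(repr(header))  # all recipients and the Cc line stay in the header
print(repr(body))    # only the actual message text remains
```

A pattern written for conf.json would need its backslashes escaped for JSON (for example "\\s" instead of "\s").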

Thank you again.
Best regards

Camelia