Extract Body
CameliaSNCF opened this issue
Python version : 3.8.10
Melusine version : 2.3.2
Operating System : Windows
Hello, I have an issue with how the body to clean is extracted. During segmentation, Melusine tags the CC of the email as body, so the body_header_extract function treats the CC as the body to clean.
In the example above, the part that was selected as the body (yellow) does not correspond to the actual body of the email (between blue).
Thank you for your help.
Best regards.
Camelia
Could you share the different steps of your pipeline? By that I mean all the steps from your original email to structured_body.
Best regards
Is this enough information to understand the problem?
Hello,
Yes thank you.
As your email has a specific format, you need to adapt the regexes that parse the email history. The default regex used to split messages stops after the ";" of the "A: " field, and you want it to go further.
You will need to adapt the regex under the key ["build_historic"]["transition_list"] in conf.json to match your format.
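To make this concrete, here is a minimal sketch for trying out a candidate pattern with plain `re` before touching conf.json. Everything in it is an assumption for illustration: the sample email, the header field names ("De", "Date", "A", "Cc", "Objet"), and both patterns are hypothetical, and `STOPS_AT_SEMICOLON` only imitates the failure described above; it is not Melusine's actual default regex. The `conf_fragment` layout (a list of regex strings under the key) is also assumed, so check it against your own conf.json.

```python
import re

# Hypothetical reply with a French Outlook-style header block
# (made up for illustration, not the reporter's email).
EMAIL = (
    "Hello, here is my reply.\n"
    "\n"
    "De : Alice Martin <alice@example.com>\n"
    "Date : lundi 3 mai 2021 10:12\n"
    "A : Bob Durand <bob@example.com>; Carol Petit <carol@example.com>\n"
    "Cc : Dave Roux <dave@example.com>; Eve Blanc <eve@example.com>\n"
    "Objet : Dossier 123\n"
    "\n"
    "Hello, this is the original message.\n"
)

# Illustrative pattern that stops at the first ";" of the "A :" field,
# so the remaining recipients and the Cc line leak into the next message.
STOPS_AT_SEMICOLON = re.compile(
    r"^De ?:.*\n(?:Date ?:.*\n)?A ?:[^;\n]*;",
    re.MULTILINE,
)

# Candidate pattern that consumes each header line up to its line break,
# including the optional Cc and Objet lines.
FULL_HEADER = re.compile(
    r"^De ?:.*\n"
    r"(?:Date ?:.*\n)?"
    r"A ?:.*\n"
    r"(?:Cc ?:.*\n)?"
    r"(?:Objet ?:.*\n)?",
    re.MULTILINE,
)

short_parts = STOPS_AT_SEMICOLON.split(EMAIL)
full_parts = FULL_HEADER.split(EMAIL)

print("Cc leaks into the body:", "Cc :" in short_parts[-1])
print("Previous message:", full_parts[-1].strip())

# Assumed shape of the conf.json entry: a list of regex strings.
# The flags Melusine applies may differ from re.MULTILINE, so the "^"
# anchors may need adjusting once the pattern is moved into the config.
conf_fragment = {"build_historic": {"transition_list": [FULL_HEADER.pattern]}}
```

Running both patterns over a few of your real emails is a quick way to confirm that the Cc line ends up in the header rather than in the body.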
Best regards.
Also, let us know when you find the perfect regex. It could become a contribution if it is not too specific :)