alephdata / aleph

Search and browse documents and data; find the people and companies you look for.

Home Page: http://docs.aleph.occrp.org

BUG: Aleph doesn't process all .msg email formats correctly

brrttwrks opened this issue

Describe the bug
The .msg email file format has had several versions, and Aleph does not appear to parse all of them correctly. As a result, we have to convert these files to .eml before ingesting them into Aleph. The tool I've been using for the conversion is msgconvert (https://www.matijs.net/software/msgconv/). The current behavior is problematic because Aleph gives the impression that it handles .msg files, yet only some are processed correctly; others show only part of the email body and none of the attachments. If Aleph could detect the different versions and parse each accordingly, we would not need to pre-process the files, and journalists would not be surprised by the results.
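
To illustrate the kind of detection meant here, below is a minimal sketch of a pre-ingest check. It is not how Aleph handles .msg files, and the names `sniff_msg` and `sniff_path` are hypothetical. It assumes the files are Outlook .msg files stored in an OLE2 (Compound File Binary) container, and it only reports the container's major version from the file header. Whether that version correlates with the files that parse badly is an open question, so treat it as a triage aid rather than a fix.

```python
import struct

# Signature that opens every OLE2 / Compound File Binary container.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def sniff_msg(header: bytes) -> str:
    """Classify the leading bytes of a file that claims to be a .msg email.

    Hypothetical helper: returns "not-ole2" when the signature is missing
    or the header is too short, otherwise "ole2-v<major version>".
    """
    if len(header) < 28 or not header.startswith(OLE2_MAGIC):
        return "not-ole2"
    # The minor and major container versions are little-endian 16-bit
    # values at byte offsets 24 and 26 of the header.
    _minor, major = struct.unpack_from("<HH", header, 24)
    return f"ole2-v{major}"


def sniff_path(path: str) -> str:
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as fh:
        return sniff_msg(fh.read(32))
```

Running `sniff_path` over a folder of .msg files before ingest would at least show whether the problematic files share a container version, or are not OLE2 files at all, which might help narrow down which ones still need converting with msgconvert.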

To Reproduce
Steps to reproduce the behavior:

  1. Will share with you separately as the only examples I have are sensitive.

Expected behavior
All msg versions get parsed and ingested properly in Aleph.

Aleph version
4.0.0rc1

Screenshots
Cannot share.

Additional context
None