TeamMsgExtractor / msg-extractor

Extracts emails and attachments saved in Microsoft Outlook's .msg files

Emails taken from where you are the sender omit the From Email address

michaelcgilligan opened this issue · comments

For emails from Outlook (at least my very old version, Outlook 2016), extracting an email taken from the Sent folder (i.e. an email where I am the sender) fails to pull the full From email address; it only gets the display name.

I noticed that on line 1338 of message_base.py there is an attempt to get the email address another way, but for my test emails it failed to get the address and only got the display name. After digging through the email manually, I found that the email address was stored in a different substg stream, so simply changing line 1338 from email = self._getStringStream('__substg1.0_5D01') to email = self._getStringStream('__substg1.0_5D01') or self._getStringStream('__substg1.0_808B') resolved the issue (at least for me). I am not sure how reliable that alternative stream is across all instances of Outlook, but I know it worked for every email I attempted to decode.

Okay, so here is the deal. You managed to find the email in a named property (hence why it starts with 80 in the name) so this is not a valid way to handle this.

But basically, the problem is that Outlook generally doesn't save that value to the file at all, and if it does make it in, it is likely to be under a very special property.

For your file, you should probably iterate through all the named properties to find the name of the property that holds the email address you were looking for, to see if there is even a chance of this being a reliable place to pull from.

Additionally, see #89

My first message was made right after I woke up, so I missed out on some information that I should have elaborated on, mostly about the named properties.

So msg files have two main types of properties: ID based and name based properties. ID based properties are represented only by a 2 byte ID (which we read as 4 hex characters) followed by a 2 byte type. Multi-types, string streams, and binary streams are stored in a stream named like the one originally being accessed: it starts with __substg1.0_, followed by the 4 hex characters representing the ID, followed by the 4 hex characters representing the type. Multi-types have an additional 9 characters, but that's not important here. All the other types are stored inside the properties stream, because they can be packed together and not take up an enormous amount of extra space.
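To make the naming scheme concrete, here is a minimal sketch that splits a stream name into its ID and type. The names parse_substg_name and SubstgName are made up for this illustration and are not part of extract-msg; the sketch also assumes the extra 9 characters on multi-types are a dash followed by 8 hex characters.

```python
import re
from typing import NamedTuple, Optional

# "__substg1.0_" + 4 hex chars (ID) + 4 hex chars (type), optionally followed
# by 9 more characters for multi-types (assumed here: "-" plus 8 hex chars).
_SUBSTG = re.compile(
    r"__substg1\.0_([0-9A-Fa-f]{4})([0-9A-Fa-f]{4})(-[0-9A-Fa-f]{8})?"
)


class SubstgName(NamedTuple):
    """Hypothetical container for the parts of a substg stream name."""
    prop_id: int
    prop_type: int
    multi_suffix: Optional[str]


def parse_substg_name(name: str) -> Optional[SubstgName]:
    """Return the ID and type encoded in a stream name, or None if it is not one."""
    match = _SUBSTG.fullmatch(name)
    if match is None:
        return None
    return SubstgName(
        prop_id=int(match.group(1), 16),
        prop_type=int(match.group(2), 16),
        multi_suffix=match.group(3),
    )
```

For example, parse_substg_name('__substg1.0_5D01001F') gives ID 0x5D01 with type 0x001F, while the 808B stream from the original report parses to an ID of 0x8000 or above, which is what marks it as a named property.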

Named properties actually use this exact same storage style: a 2 byte ID and a 2 byte type, multi, string, and byte being in the substg streams, etc. etc.

However, that is where the similarities end. Named properties have their own identification system, which means there is zero guarantee that the same named property in 2 different files will end up in the same internal stream in those files. If it does, it's likely because both files had the exact same set of named properties (or starting set) and Outlook generated them in the exact same order. The msg file has a few additional streams that tell how to map the identification system of named properties to the identification system other properties use. Named properties are stored starting with the ID 0x8000, and the index of the property in the list is added to that to get its ID. So if you have 7 named properties, you'll see properties from 0x8000 to 0x8006 in the storages.
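The index arithmetic is small enough to write out. This is only an illustration of the numbering described above, with made-up helper names; it does not read the mapping streams.

```python
NAMED_PROPERTY_BASE = 0x8000  # first ID handed out to a named property


def named_property_id(index: int) -> int:
    """ID used in the storages for the named property at this list index."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return NAMED_PROPERTY_BASE + index


def named_property_ids(count: int) -> list:
    """All IDs, as 4 hex characters, for a file holding `count` named properties."""
    return [f"{named_property_id(i):04X}" for i in range(count)]


def is_named_property_id(prop_id: int) -> bool:
    """True when an ID falls at or above the named property base."""
    return prop_id >= NAMED_PROPERTY_BASE
```

With 7 named properties, named_property_ids(7) runs from '8000' to '8006'. An ID like 0x808B only says the property sits at index 0x8B in that particular file's list; it says nothing about which property that is.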

This is why your method will not work: because the property is named, that stream could contain literally any named property. The reason it hasn't failed in your tests is likely that the same version of Outlook on the same device created all the files.

Moving on to the other problem, it is simply very common for outlook to just not put the sender's email into the msg file at all if the sender is the one who saved it.

So now what can you do to try and help make it more likely for extract-msg to find the email of the sender? Well, what you need to do is track down the name of the property that is storing that value, so we can determine whether the property is even what we are expecting it to be, or whether it just happens to contain the correct value in your instance. To do this, you can try to iterate over the named properties and their values until you find the right one. Unfortunately, I don't believe I documented how to do that particularly well, though I believe some of the issue threads have code that can do this. Alternatively, you can try the msg-explorer package that I made, which lets you use a GUI to explore the file and, more importantly, explore its named properties. Simply double click a property whose value is a stream until you get to one with the expected email.

If the property is good, then I can add this to help find the sender's email on more files.

Closing due to inactivity