eneam / mboxviewer

A small but powerful app for viewing MBOX files

Error - Remove Duplicate Mails

okica-11 opened this issue · comments

When I try to run Remove Duplicate Mails I get an error: "Encountered an improper argument."

Testing on a smaller number of e-mails suggests the error may occur when š, č or ž appears in the contact name or subject, or perhaps even in the content of the e-mail.

Additionally, it would be very nice if you could export or save all de-duplicated e-mails as a new mbox file.

commented

Thanks for raising the issue. I am not sure whether it is related to the listed characters, but I can't rule it out since the code is not fully UNICODE.

I am guessing the mbox program stops and shows a message box with the listed error, right?

I would appreciate it if you could run the instrumented version included in the mbox package and redo the test. Please read the HELP file to learn how to run the instrumented version and which files to check for.

Thanks,
Zbigniew

commented

Regarding the ability to export de-duplicated e-mails, this is already supported assuming no crash. After the de-duplication is done, right click on any mail in "User Selected Mails" and select "Save as Mail Archive File". Duplicate mails are shown in "Found Mails".

I will be releasing a new version soon. I might be able to resolve the reported issue in that release, but I will need you to run the instrumented version. It may help to resolve the crash.

Thanks,
Zbigniew

Hi again.

I tried to run mboxviewer under *\mbox\ReleasePlusStackTrace but I get the same error: "Encountered an improper argument." There is no error code (link). Interestingly, not a single dump .txt file is generated anywhere after this error message.

(Thank you for pointing out "Save as Mail Archive File". Clearly, I have missed it. )

commented

Thanks for the test. Unfortunately, for historical reasons, the source code is not fully instrumented to catch all exceptions, and Windows objects can generate exceptions anywhere. I will consider adding tracing and exception handling around the code that removes duplicate mails in this release and will ask you to rerun the test. In the past I would generate a private version for some users, but Windows seems to complain about untrusted executables and most users prefer the official executable. I will provide an update when I have a new release.

FYI, Mbox viewer relies on the message_id in the message header to detect and remove duplicates. It looks like the message_id can be any string of characters. Mbox viewer uses standard Microsoft C++ classes everywhere, and the message box doesn't say much about the problem. There is a bit more information in the Event Viewer, but it is also tricky to use.
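For readers who want to see the idea of message_id-based detection in isolation, here is a minimal Python sketch. It is not Mbox viewer's code (which is C++); the function name `split_by_message_id` and the choice to keep every mail that lacks a Message-ID are assumptions made for illustration only.

```python
from email import message_from_string


def split_by_message_id(messages):
    """Return (unique, duplicates), keeping the first mail seen per Message-ID.

    Assumption for this sketch: a mail without a Message-ID header cannot be
    compared, so it is always treated as unique.
    """
    seen = set()
    unique, duplicates = [], []
    for msg in messages:
        # str() guards against header objects; the ID is treated as an opaque string.
        msg_id = str(msg.get("Message-ID", "")).strip()
        if msg_id and msg_id in seen:
            duplicates.append(msg)
        else:
            unique.append(msg)
            if msg_id:
                seen.add(msg_id)
    return unique, duplicates
```

The input can be any iterable of `email.message.Message` objects, for example the messages yielded by iterating over a standard-library `mailbox.mbox` object.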

Thank You

Thank you very much. I will let you know if I notice (find) what could or does cause this error.

commented

I believe I found the issue. I "optimized" the code too much, and apparently you have a large number of duplicate mails. The coming release should resolve the issue.

commented

There is a workaround if you would like to try it before the new release is ready.

  1. Copy all mails from All Mails to Found Mails by executing Find Advanced. Make sure that all fields are unchecked except, say, the FROM field, and that the search string is set to '*'. You should see all mails in Found Mails.
  2. Go back to All Mails and copy all mails to User Selected Mails. Select Remove Duplicate Mails. Unique mails will be in User Selected Mails and duplicate mails in Found Mails.

I've just tried it and the workaround works! You are correct regarding the number of e-mails and their duplicates. My mbox contains 158573 e-mails; after de-duplication the number is (strangely) exactly 30000. Is there a limitation?

The number (and size) of the mbox is so high because it's a merge of multiple "full" exports of my Google mail over the past 6 years or so. That also explains my need for de-duplication.

Thank you for your help. I donated a little something via SourceForge :).

commented

Glad it worked, and I appreciate the donation too. I don't know what I was thinking, but the term "duplication" evidently confused me during the initial implementation. I guess "duplication" suggested at most two instances of the same mail, and one of the arrays didn't grow when needed. Now the array can grow up to the total number of mails, i.e. up to 158573, so there is no longer a limit on the number of unique and duplicate mails. It looks strange that you ended up with exactly 30,000 unique mails, but that must be what it is. Check Found Mails to see whether anything looks suspect; it should not.

The coming release will support a command line option to merge the mbox and/or eml files listed in a file. The mbox files can be located in different folders. I will let you know when the release is ready.

Thank You,
Zbigniew

commented

Did you take another look at the unusual 30,000 (exact number) unique e-mails? I don't see issues removing duplication on a couple of my test mbox files. Maybe you could download and merge the latest gmail mbox and see what happens?

I tried it again with the same mbox files and the results are the same. I have now requested a new archive from Google and will let you know when I manage to get it, download it, and test it.

After merging the data with the new mbox backup, the total number of e-mails grew to 170871. After de-duplication there are 30195 e-mails, so it seems de-duplication works correctly :).

commented

Thanks. It was really strange that you ended up with a nice round count of exactly 30,000. Currently, the message_id, date and time, From and To fields are used to de-duplicate. I think that should work fine, but I will review the algorithm and let you know if I make any changes.
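A rough Python sketch of de-duplication on a combined key of those four fields follows. It only illustrates the idea and is not the Mbox viewer implementation; the helper names `dedup_key` and `remove_duplicates` are hypothetical, and comparing the raw header strings (rather than parsed dates or addresses) is a simplifying assumption.

```python
from email import message_from_string

# Header fields named in the comment above; matching is on their raw text.
KEY_FIELDS = ("Message-ID", "Date", "From", "To")


def dedup_key(msg):
    """Build a hashable key from the four header fields (missing ones become '')."""
    return tuple(str(msg.get(name, "")).strip() for name in KEY_FIELDS)


def remove_duplicates(messages):
    """Return (unique, duplicates); the first mail with a given key is kept."""
    seen = set()
    unique, duplicates = [], []
    for msg in messages:
        key = dedup_key(msg)
        if key in seen:
            duplicates.append(msg)
        else:
            seen.add(key)
            unique.append(msg)
    return unique, duplicates
```

With a combined key, two mails that share a message_id but differ in date, sender or recipient are both kept, which is stricter than matching on the message_id alone.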

commented

I released a new version, v1.0.3.17, of the mbox viewer to resolve the crash while executing the Remove Duplicate Mails option.
I also slightly enhanced the check used to remove duplicate mails. With previous releases you may still end up, in some cases, with a few duplicate mails in the output.

README describes all changes.

If you find time, please verify and provide feedback.

I have tested the new version, but unfortunately on different mbox files (I have already deleted the old archives). Selecting mails from All Mails, moving them to User Selected Mails and de-duplicating them now works without errors :).

commented

Closing. User doesn't see the reported problem anymore.