eneam / mboxviewer

A small but powerful app for viewing MBOX files


Semi-interactive Re-Threading (based on user prompts/intervention)

afarlie opened this issue · comments

''' What is the problem you are trying to solve?'''

Archive.org provides a series of mbox archive files that contain postings to various newsgroups (USENET/NNTP). (See Issue #35 for a more general request regarding USENET support.)

However, there are a few usability issues with these archives.

  1. As provided, some of these postings do not contain dates in a form that Mboxviewer parses fully.
    (On examination, this may be because these postings contain only a partial date rather than the full date/time format specified in the relevant RFC.)

    In some instances this results in replies appearing in listings 'before' the posting they are replying to.

  2. In a few instances, the postings do not contain a complete References: field. This is due to various factors, including interactions with Google Groups, which has used varying internal threading/referencing approaches that are not necessarily the same as those in the relevant RFC for USENET more widely.

'''What would you like the application to do?'''

Provide a mechanism whereby a user can select a 'posting' and then, through a series of semi-interactive prompts, 're-order' or 're-thread' postings in order to reconstruct a nominal corrected thread/posting order, based on header information (such as Message-IDs, Google Thread IDs, partial dates and other information in the postings concerned).

The application would then cautiously modify/update information in the mbox file and its internal index, potentially reconstructing a new References field (and other relevant fields) for postings as required. There is no need to provide the ability to reconstruct or add complete dates from 'partial' dates, although if implemented this should use a separate header field from the original Date field (such as RevisedDate or DateAnnotation), which the application could use in the future to reconstruct a nominal order.
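As a rough illustration of the "separate header field" idea, the sketch below adds a revised date to a message while leaving the original Date field untouched. It uses Python's standard email package; the function name annotate_revised_date is hypothetical, and the header name RevisedDate is simply the one suggested above, not an established standard.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import format_datetime


def annotate_revised_date(msg, revised, header="RevisedDate"):
    """Store a reconstructed date in a separate header; never touch Date."""
    # Deleting a header that is absent is a no-op, so repeated runs
    # leave exactly one annotation behind.
    del msg[header]
    msg[header] = format_datetime(revised)
    return msg


raw = "Date: 2003/05/17\nSubject: example\n\nbody\n"
msg = annotate_revised_date(
    message_from_string(raw),
    datetime(2003, 5, 17, 10, 15, tzinfo=timezone.utc),
)
```

Keeping the original Date value intact means the annotation can always be discarded or recomputed later.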

commented

The proposed feature seems to place a large burden on users and may require non-trivial enhancements to MBox Viewer to make reordering more user friendly. Not exactly sure what interaction you think would be helpful.

Before reordering could take place, mails belonging to the same thread need to be discovered. Discovery could start from a single mail and keep expanding by searching newly discovered message ids, etc. Mail pruning would have to be supported too. Eventually, reordering based on time would need to take place. Alternatively, user could reorder discovered mails manually. Final result would have to be recorded in the new mbox file.
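The discovery step described above can be sketched as a breadth-first expansion over message ids. This is only an illustration, not how MBox Viewer works internally: the function name discover_thread and the input shape (a dict mapping each Message-ID to the set of ids found in its In-Reply-To and References fields) are assumptions.

```python
from collections import deque


def discover_thread(seed_id, messages):
    """Return the ids of all archived messages linked to seed_id.

    messages: dict mapping Message-ID -> set of ids that message refers to.
    """
    # Treat "refers to" as an undirected link so that discovery can walk
    # both up to ancestors and down to replies.
    neighbours = {}
    for mid, refs in messages.items():
        neighbours.setdefault(mid, set()).update(refs)
        for ref in refs:
            neighbours.setdefault(ref, set()).add(mid)

    seen = {seed_id}
    queue = deque([seed_id])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

    # Ids that are referenced but missing from the archive still connect
    # their repliers, but are not reported as thread members.
    return {mid for mid in seen if mid in messages}
```

Pruning, as mentioned above, would then amount to letting the user remove ids from the returned set before any reordering is attempted.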

Current MBox Viewer can discover conversation threads based on Gmail thread id or based on the message ids. Discovery based on message ids is not perfect and could/should be enhanced.

I suspect newsfeeds have a large number of newsgroups, so you would have to apply the reordering steps to a potentially large number of groups. It sounds like a tremendous effort unless reordering is automatic or requires very little input from users.

I tried to find example newsfeed mbox files by browsing http://archive.org/ but had a hard time finding such files. A few concrete links would be helpful.

I suspect the proposed feature will not be used by a large number of users. Small(er) complexity and small(er) implementation effort help in accepting this and other feature requests.

I will soon be traveling for the next two months. I may have a bit of time to think about the request, but implementation will have to wait, assuming the request is accepted after detailed analysis.

> The proposed feature seems to place a large burden on users and may require non-trivial enhancements to MBox Viewer to make reordering more user friendly. Not exactly sure what interaction you think would be helpful.

I appreciate that the request is intended for more experienced/competent users with some knowledge of the relevant RFC, message headers and protocols.

> Before reordering could take place, mails belonging to the same thread need to be discovered. Discovery could start from a single mail and keep expanding by searching newly discovered message ids, etc. Mail pruning would have to be supported too. Eventually, reordering based on time would need to take place. Alternatively, user could reorder discovered mails manually. Final result would have to be recorded in the new mbox file.

That was exactly what I had in mind, veering more towards tool-assisted manual re-ordering, perhaps with an overridable warning if the application detected that a user was trying to place a mail/posting outside the thread that its References field or Google thread details indicate it belongs to.

Currently, MBox Viewer displays a flat list of the postings/mail it has found in an mbox. As a first step to enabling re-ordering, would it be possible to use some kind of 'tree-view' type display, so that threading information can be more explicitly represented?
(I'm not sure what cross-platform libraries for 'tree-view' UI elements exist currently, but I am reasonably confident some variant of a treeview exists on most major platforms.)

In respect of re-ordering by 'time': some of the postings in the NNTP mbox files I've been looking at write the date as YYYY/MM/DD, placing a fuller canonical (?) date in the X-Server-Date field. Greater tolerance of varying Date formats may be needed.

Compare "

> Current MBox Viewer can discover conversation threads based on Gmail thread id or based on the message ids. Discovery based on message ids is not perfect and could/should be enhanced.

Another possibility (at least for newsgroup postings) is to look at the References field to determine posting order. My reading of https://www.rfc-editor.org/rfc/rfc1036 is that these are listed in order of the replies, but this may have been updated in a subsequent RFC I've not yet found. I'm not sure how Google thread IDs work, and I'm not sure they would be covered by an RFC in the way semi-standardised Usenet headers are. (If you have technical information on this that you can document, please consider drafting an unofficial FAQ/RFC about it.)
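To make the References idea concrete, here is a minimal sketch that treats the References value as a list of ancestors, oldest first, and attaches each posting to the nearest ancestor that is actually present in the archive. The function names and the input shape (a dict of Message-ID to raw References value, in mbox order) are assumptions for illustration; postings caught in a reference cycle are simply left out.

```python
import re

# A Message-ID is taken to be anything between angle brackets.
MSG_ID = re.compile(r"<[^<>\s]+>")


def parse_references(value):
    """Split a raw References value into a list of message ids."""
    return MSG_ID.findall(value or "")


def thread_order(headers):
    """Return (depth, message id) pairs in depth-first thread order.

    headers: dict mapping Message-ID -> raw References header value.
    """
    children = {}
    roots = []
    for mid, refs_value in headers.items():
        refs = parse_references(refs_value)
        # Walk the ancestor list from newest to oldest and stop at the
        # first one that exists in this archive.
        parent = next(
            (r for r in reversed(refs) if r in headers and r != mid), None
        )
        if parent is None:
            roots.append(mid)
        else:
            children.setdefault(parent, []).append(mid)

    ordered = []
    stack = [(0, root) for root in reversed(roots)]
    while stack:
        depth, mid = stack.pop()
        ordered.append((depth, mid))
        for child in reversed(children.get(mid, [])):
            stack.append((depth + 1, child))
    return ordered
```

Falling back to an older ancestor when the direct parent is missing keeps a reply inside its thread even when the archive has gaps.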

> I suspect newsfeeds have a large number of newsgroups, so you would have to apply the reordering steps to a potentially large number of groups. It sounds like a tremendous effort unless reordering is automatic or requires very little input from users.

I wanted to apply this to 'single group' mbox archives. Attempting to do it over numerous folders would IMO be beyond the computational capacity of most consumer hardware. (Google has entire data centers to analyse Usenet/NNTP headers, and even then it's not perfect.)

> I tried to find example newsfeed mbox files by browsing http://archive.org/ but had a hard time finding such files. A few concrete links would be helpful.

(See my previous comment for some example mbox files containing NNTP/Usenet content.)

commented

I will likely create a new package to release some minor enhancements and fixes along with enhancements to the date and time handling. No changes to the discovery of conversation threads, however.

I did prototype enhancements to address the date and time issue I observed in the zip file you provided. If the date/time format does not follow the RFC specification, MBox Viewer will search the entire mail header for any instance of a date/time that can be used instead of the date/time specified in the Date: field.

In the provided zip files, there are many Date fields specified as year/month/day with the time missing. If MBox Viewer can't find another instance of a date/time, or parsing of the found date/time fails, the date/time in the Date field is used. However, the lack of a time spec will result in many emails with the same date/time.
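The fallback described above might look roughly like the following sketch. It is not MBox Viewer's code: the function names are hypothetical, the list of fallback headers is an assumption (X-Server-Date is mentioned earlier in this thread, NNTP-Posting-Date is added as a guess), and instead of scanning every header it only checks that short list.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumption: these headers, when present, carry an RFC-style date/time.
FALLBACK_HEADERS = ("X-Server-Date", "NNTP-Posting-Date")


def _parse_rfc_date(value):
    """Return an aware datetime for an RFC-style date string, or None."""
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # No usable zone information: treat the value as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def best_date(headers):
    """headers: dict of header name -> raw value (exact-case keys).

    Returns (datetime, name of the header it came from), or None.
    """
    for name in ("Date",) + FALLBACK_HEADERS:
        value = headers.get(name)
        if value:
            parsed = _parse_rfc_date(value)
            if parsed is not None:
                return parsed, name
    # Last resort: a date-only Date field such as 2003/05/17.
    raw = headers.get("Date", "").strip()
    try:
        day_only = datetime.strptime(raw, "%Y/%m/%d")
    except ValueError:
        return None
    return day_only.replace(tzinfo=timezone.utc), "Date"
```

Returning the source header alongside the value lets a caller show the user which dates were reconstructed rather than read from Date.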

MBox Viewer will be able to handle date formats that consist of year, month and day in any order and with different separators. In the case where both the month and day are < 12, MBox Viewer will prefer year/month/day or day/month/year. To make a better decision, MBox Viewer would have to inspect all emails from the sender to discover the date format employed by that user. It is likely that more enhancements to date/time processing are needed.
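A minimal sketch of that preference rule, under stated assumptions: the year is taken to be the four-digit field, it sits either first or last, and the function name parse_loose_date is hypothetical. When the preferred reading gives an impossible month, the day and month are swapped.

```python
import re
from datetime import date

# Three numeric fields separated by '/', '.', '-' or whitespace.
LOOSE_DATE = re.compile(r"(\d+)[/.\-\s]+(\d+)[/.\-\s]+(\d+)", re.ASCII)


def parse_loose_date(text):
    """Parse a year/month/day style string in a tolerant way, or return None."""
    match = LOOSE_DATE.fullmatch(text.strip())
    if match is None:
        return None
    first, second, third = match.groups()
    if len(first) == 4:
        # Year first: prefer year/month/day.
        year, month, day = int(first), int(second), int(third)
    elif len(third) == 4:
        # Year last: prefer day/month/year.
        day, month, year = int(first), int(second), int(third)
    else:
        return None
    if month > 12 and day <= 12:
        # The preferred reading is impossible, so try the other order.
        month, day = day, month
    try:
        return date(year, month, day)
    except ValueError:
        return None
```

A value such as 03/04/2003 stays ambiguous; the sketch reads it as 3 April 2003, and only a per-sender analysis like the one suggested above could do better.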

Once I create the new release, before my travel, I might be able to spend some time investigating the ordering issue. Current MBox Viewer relies on the Gmail thread ids if present; otherwise it relies on the From addresses, message ids and the message id in the In-Reply-To field. The References field is not used currently. There was some work done with References in the past but it was commented out. Support for References needs to optimize the memory footprint, since a potentially large number of message ids from the References fields must be stored.
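One common way to limit the memory cost of storing many message ids is to store each distinct id string once and keep only small integer indexes per message. The sketch below illustrates that idea only; the class MessageIdTable is hypothetical and says nothing about how MBox Viewer stores its data.

```python
from array import array


class MessageIdTable:
    """Stores each distinct Message-ID once and hands out integer indexes."""

    def __init__(self):
        self._index = {}  # message id -> integer index
        self._ids = []    # integer index -> message id

    def intern(self, message_id):
        idx = self._index.get(message_id)
        if idx is None:
            idx = len(self._ids)
            self._index[message_id] = idx
            self._ids.append(message_id)
        return idx

    def lookup(self, idx):
        return self._ids[idx]

    def encode(self, message_ids):
        """Turn one References list into a compact array of indexes."""
        return array("I", (self.intern(m) for m in message_ids))


table = MessageIdTable()
refs_of_b = table.encode(["<a>"])
refs_of_c = table.encode(["<a>", "<b>"])
```

Because replies in a long thread repeat the same ancestors in their References fields, the shared table grows with the number of distinct ids rather than with the total length of all References values.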

As you noted, MBox Viewer currently displays a flat list of the postings/mail it has found in an mbox. However, when you order emails by subject or by the message ids in the In-Reply-To field, threads are distinguished by different colors, see below.

image

It is likely that mail ordering will need to leverage the From address, message ids, Reply-To and In-Reply-To fields as well as the References field. As you noted, the References field may help to order mails with the same date/time.

I didn't find any information on use of Googlethread ID's.

commented

Hi,

I released v1.0.3.35 to enhance date/time discovery. It is probably not what you are looking for. It looks like you are looking to display messages based on date/time and message order which MBox Viewer doesn't support. Message order can likely be determined based on Message Ids. Likely with some issues.

I examined the content of the mbox files you provided as follows:

  1. I set File->Options->Subject Sort Type ->time ordered threads
  2. Clicked on Subject column to order messages by subjects
  3. Selected View->View Raw Message Headers
  4. Browsed messages

The field "X-Google-Thread: f4600,deae22583cc56b0a" shows the news group ID and the message thread.

In "X-Google-Thread: f4600,deae22583cc56b0a,start", the trailing "start" marks the first message in the thread. However, I see cases where there are multiple start messages within the same thread.
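Based only on the two example values above, the field can be read as comma-separated parts: a group ID, a thread ID, and an optional "start" flag. The sketch below parses that assumed layout and flags threads that carry more than one start message; the format is inferred from these examples rather than from any specification, and all names are hypothetical.

```python
from collections import Counter
from typing import NamedTuple, Optional


class GoogleThread(NamedTuple):
    group_id: str
    thread_id: str
    is_start: bool


def parse_google_thread(value: str) -> Optional[GoogleThread]:
    """Parse an X-Google-Thread value such as 'f4600,deae22583cc56b0a,start'."""
    fields = [field.strip() for field in value.split(",")]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        return None
    return GoogleThread(fields[0], fields[1], "start" in fields[2:])


def threads_with_multiple_starts(values):
    """Return the (group id, thread id) pairs that have more than one start."""
    starts = Counter()
    for value in values:
        parsed = parse_google_thread(value)
        if parsed is not None and parsed.is_start:
            starts[(parsed.group_id, parsed.thread_id)] += 1
    return {key for key, count in starts.items() if count > 1}
```

Threads reported by the second function are the ones where the start flag alone cannot identify the root posting, so another signal such as References or the date would be needed.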

I didn't examine References closely, but that is the only way to determine the order of send/response. The In-Reply-To field is used only occasionally.

Showing thread messages as a tree would be a challenge for the current architecture of MBox Viewer.

commented

In addition to RFC 1036, which you highlighted, there are a number of follow-up RFCs attempting to standardize the news protocol.

https://www.rfc-editor.org/rfc/rfc1036

Follow-up RFCs:

https://datatracker.ietf.org/doc/rfc1849/

https://www.rfc-editor.org/rfc/rfc5536

https://www.rfc-editor.org/rfc/rfc5537

etc ...

Processing of news messages seems to be quite challenging due to the many non-standard implementations.