snutesh / EnronSearch

Converting the Enron Email Dataset to mbox Format

The Enron Email Dataset is distributed in maildir format, which means that each message is stored in a separate file. This is unwieldy to work with. Here's how you can convert maildir into mbox, where all messages in a folder are stored in a single mbox file.

Go fetch the dataset and then unpack:

$ tar xvfz enron_mail_20150507.tgz

The dataset should unpack into a directory called maildir. Use the script count_messages.sh to gather an inventory of the messages in each folder:

$ ./count_messages.sh

Verify the total number of messages in the dataset:

$ ./count_messages.sh | cut -d' ' -f1 | awk '{s+=$1} END {print s}'
517401
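
If you'd rather do the inventory from Python, here is a minimal sketch. It is not the contents of count_messages.sh; it assumes an inventory simply means the number of regular files in each folder under maildir, printed as a count followed by the folder path (the same column order the cut -f1 above implies). The function name count_messages is made up for illustration.

```python
import os


def count_messages(root):
    """Map each folder under root to the number of files it directly contains.

    Assumption: every regular file in the tree is one email message.
    Folders that hold only subfolders are left out of the result.
    """
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames:
            counts[dirpath] = len(filenames)
    return counts


if __name__ == "__main__":
    counts = count_messages("maildir")
    for folder in sorted(counts):
        # Count first, then the folder path, separated by one space.
        print(counts[folder], folder)
    print("total:", sum(counts.values()))
```

Run this before the conversion script, since the conversion rearranges the folders.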

Now run the conversion script:

$ ./convert_enron_to_mbox.py

It might take a bit, so go grab a cup of coffee...

Note that the script is destructive: it alters the original structure of the dataset. This is necessary to put each folder into a proper maildir layout that Python tools can process (in particular, the script creates cur/ and new/ directories, which are part of the expected layout).
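
To give a feel for what such a conversion involves, here is a rough sketch for a single folder using Python's standard mailbox module. This is not the code in convert_enron_to_mbox.py; the function name folder_to_mbox and its arguments are hypothetical, and it assumes the folder contains one message per file with no subfolders of interest. Like the real script, it is destructive: it moves the message files into cur/.

```python
import mailbox
import os
import shutil


def folder_to_mbox(folder, mbox_path):
    """Copy every message in one dataset folder into a single mbox file.

    Returns the number of messages written. Destructive: the loose
    message files in `folder` are moved into `folder`/cur.
    """
    # mailbox.Maildir expects cur/, new/ and tmp/ inside the folder.
    for sub in ("cur", "new", "tmp"):
        os.makedirs(os.path.join(folder, sub), exist_ok=True)

    # Move the loose message files into cur/ so Maildir can see them.
    for name in os.listdir(folder):
        src = os.path.join(folder, name)
        if os.path.isfile(src):
            shutil.move(src, os.path.join(folder, "cur", name))

    source = mailbox.Maildir(folder, factory=None, create=False)
    target = mailbox.mbox(mbox_path)
    count = 0
    try:
        for key in source.iterkeys():
            target.add(source[key])
            count += 1
        target.flush()
    finally:
        target.close()
        source.close()
    return count
```

For example, folder_to_mbox("maildir/allen-p/_sent_mail", "enron/enron.allen-p._sent_mail") would write one mbox file for that folder, assuming the enron/ directory already exists. Try it on a copy of the data first.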

After the script completes, the resulting mbox files are stored in the enron/ directory:

$ ls enron | wc
    3311    3311   93804

The repo includes ReadMbox.java, a very simple Java program that uses the JavaMail API to read the mbox files. The dependent jars are checked into the repo for convenience, so you can compile directly:

$ javac -cp lib/javax.mail-1.5.6.jar:lib/mbox.jar ReadMbox.java

You can now examine a particular mbox file:

$ java -cp .:lib/javax.mail-1.5.6.jar:lib/mbox.jar ReadMbox enron/enron.allen-p._sent_mail

The program prints out the subject line of each email.
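
If you don't have a JDK handy, roughly the same check can be done with Python's standard mailbox module. This is a sketch, not a port of ReadMbox.java; the function name subjects is made up for illustration.

```python
import mailbox
import sys


def subjects(mbox_path):
    """Return the Subject header of every message in an mbox file, in order.

    Messages without a Subject header yield an empty string.
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [msg.get("Subject", "") for msg in box]
    finally:
        box.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    for subject in subjects(sys.argv[1]):
        print(subject)
```

Saved under a name of your choice (say read_mbox.py, a hypothetical filename), it takes the mbox path as its only argument, just like the Java program.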

To verify the integrity of the entire dataset in mbox format, run:

$ ./verify_mbox.sh > mbox.log &

The command above runs in the background. Once it finishes, confirm that the total matches the message count from the original maildir:

$ cut -d' ' -f3 mbox.log | awk '{s+=$1} END {print s}'
517401
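
The same total can be cross-checked from Python. The sketch below is independent of verify_mbox.sh (whose internals are not shown here); it assumes every regular file in the given directory is an mbox file, and the function name total_messages is hypothetical.

```python
import mailbox
import os


def total_messages(mbox_dir):
    """Sum the message counts of all mbox files directly inside mbox_dir."""
    total = 0
    for name in sorted(os.listdir(mbox_dir)):
        path = os.path.join(mbox_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        box = mailbox.mbox(path, create=False)
        try:
            total += len(box)
        finally:
            box.close()
    return total


if __name__ == "__main__":
    print(total_messages("enron"))
```

Parsing every mbox file takes a while on the full dataset.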
