vmarkovtsev / ggmbox

Google Groups raw email crawler and parser

ggmbox

Google Groups raw email crawler and parser. Turbo speed and reliable! The downloaded messages are in RFC 822 format, taken verbatim from the Google servers.

Installation

Docker

Docker is the simplest option; the image is available on DockerHub. Prepend docker run -it --rm vmarkovtsev/ggmbox to all the commands in the "Usage" section.

Crawler

Requirements: Python 3 and Scrapy. Download the ggmbox.py file.

Parser

Requirements: Go.

go get -v github.com/vmarkovtsev/ggmbox

Usage

Crawler

scrapy runspider -a name=golang-nuts -o result.json -t json ggmbox.py

Replace "golang-nuts" with the actual group name. The raw emails will be saved by default to the corresponding directory.
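Because the messages are stored verbatim in RFC 822 format, any standard mail library should be able to read them. The sketch below uses Python's standard email package. It assumes that every file under the output directory holds exactly one message, which this README does not spell out; parse_raw_email and iter_messages are hypothetical helper names and are not part of ggmbox.

```python
import email
import email.policy
from pathlib import Path


def parse_raw_email(raw: bytes) -> dict:
    """Parse one RFC 822 message into a few common headers plus the plain-text body."""
    msg = email.message_from_bytes(raw, policy=email.policy.default)
    # get_body() picks the text/plain part, or the message itself if it is not multipart.
    body = msg.get_body(preferencelist=("plain",))
    return {
        "subject": msg["Subject"],
        "from": msg["From"],
        "date": msg["Date"],
        "text": body.get_content() if body is not None else "",
    }


def iter_messages(root: str):
    """Yield (path, parsed message) for every regular file under root.

    Assumption: each file contains a single raw email.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            yield path, parse_raw_email(path.read_bytes())
```

For example, iterating over iter_messages("golang-nuts") and printing message["subject"] would list the subject of every downloaded message, provided the directory layout matches the assumption above.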

scrapy runspider -a name=chromium-dev -a prefix=a/chromium.org -o result.json -t json ggmbox.py

Note the "prefix" argument: it sets the name of the parent (here, a/chromium.org). Some groups require it.

Parser

./parse golang-nuts > dataset.csv

Replace "golang-nuts" with the name of the directory that contains the raw emails. The plain text threads will be written to dataset.csv, one thread per line. Special characters are escaped.
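Since the output holds one thread per line, it can be consumed as a stream without loading the whole file. The following is a minimal sketch that relies only on that one-thread-per-line property; it makes no assumption about the columns or the exact escaping scheme, which are not described here. iter_threads and summarize are hypothetical helper names.

```python
from typing import Iterator


def iter_threads(path: str) -> Iterator[str]:
    """Yield one thread per non-empty line, reading the file lazily."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                yield line


def summarize(path: str) -> dict:
    """Return the number of threads and the length of the longest one in characters."""
    count = 0
    longest = 0
    for thread in iter_threads(path):
        count += 1
        longest = max(longest, len(thread))
    return {"threads": count, "longest_chars": longest}
```

Calling summarize("dataset.csv") gives a quick sanity check that the thread count is in the expected range before any further processing.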

Performance

Crawler

The golang-nuts group was fully fetched on 24/02/2018, with 30043 topics and 192654 messages, in 3 hours over a 1 Gbps connection. The raw emails occupied 1.6 GB on disk.

For comparison, icy/google-group-crawler ran for 1 day, fetched only 63%, and then stopped without reporting any errors; henryk/gggd fetched only 3% within one hour and then stopped unexpectedly too.

Parser

It takes 7 seconds to parse 1.6 GB of raw emails on a 32-core machine.
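The figures in this section can be turned into rates with a little arithmetic. The sketch below uses only the numbers quoted above and treats "1.6 GB" as decimal gigabytes, which is an assumption; derived_rates is a hypothetical helper name.

```python
def derived_rates() -> dict:
    """Derive throughput figures from the numbers reported in the Performance section."""
    topics = 30043
    messages = 192654
    crawl_seconds = 3 * 3600
    raw_bytes = 1.6e9  # assumption: 1.6 GB means 1.6 * 10^9 bytes
    parse_seconds = 7

    return {
        "crawl_messages_per_second": messages / crawl_seconds,
        "messages_per_topic": messages / topics,
        "avg_message_bytes": raw_bytes / messages,
        "crawl_mb_per_second": raw_bytes / crawl_seconds / 1e6,
        "parse_mb_per_second": raw_bytes / parse_seconds / 1e6,
    }
```

Under that assumption, the crawl averages roughly 17.8 messages per second at about 8.3 kB per message and 6.4 messages per topic, and the parser processes roughly 229 MB per second.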

Contributions

...are welcome! See CONTRIBUTING.md and CODE_OF_CONDUCT.md.

License

MIT.
