CristianCantoro / mailman-archive-scraper

Python script that scrapes public and private Mailman archive HTML pages and republishes them to local files, and generates an RSS feed of recent emails.

Home Page: http://www.gyford.com/phil/writing/2009/05/13/mailman_archive_scraper.php

Mailman Archive Scraper

By Phil Gyford phil@gyford.com
v1.13, 2010-01-15

Latest version is available from http://github.com/philgyford/mailman-archive-scraper/

This script will scrape the archive pages generated by the Mailman mailing list manager http://www.gnu.org/software/mailman/index.html and republish them as files on the local file system. In addition it can optionally do a number of things:

  • Create an RSS feed of recent messages.
  • Scrape private Mailman archives (if you have a valid email address and password).
  • Remove all email addresses from the files (both the 'phil@gyford.com' and 'phil at gyford dot com' formats).
  • Replace the URL for the 'more info on this list' links with another.
  • Remove one or more levels of quoted emails.
  • Search and replace any custom strings you specify.
  • Add custom HTML into the <head> section of the re-published pages.
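
To make the email-removal and quote-removal options more concrete, here is a minimal sketch of how such clean-up could work on plain message text. It is not the script's own code: the function names `strip_emails` and `strip_quotes`, the placeholder string and the regular expressions are all assumptions made for illustration, and the patterns are heuristics that will miss unusual addresses. It also assumes quoted lines start with '>' characters, whereas scraped HTML may encode them as entities.

```python
import re

# Matches addresses such as 'phil@gyford.com' (heuristic, not RFC-complete).
AT_SIGN_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Matches obfuscated addresses such as 'phil at gyford dot com'.
SPELLED_RE = re.compile(r"[\w.+-]+ at [\w-]+(?: dot [\w-]+)+")


def strip_emails(text, placeholder="[email removed]"):
    """Replace addresses in both formats with a placeholder (hypothetical helper)."""
    text = AT_SIGN_RE.sub(placeholder, text)
    return SPELLED_RE.sub(placeholder, text)


def strip_quotes(text, max_level=0):
    """Drop lines quoted more deeply than max_level (hypothetical helper).

    The quote level of a line is the number of leading '>' markers,
    ignoring whitespace between them.
    """
    kept = []
    for line in text.splitlines():
        level = 0
        rest = line
        while rest.lstrip().startswith(">"):
            rest = rest.lstrip()[1:]
            level += 1
        if level <= max_level:
            kept.append(line)
    return "\n".join(kept)
```

For example, `strip_quotes(body, max_level=1)` would keep unquoted text and the first level of quoting while dropping anything quoted twice or more.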

Why would you want to do this? Three reasons:

  1. You want to create your own HTML archive of a mailing list hosted elsewhere.

  2. You want to create a public version of a private archive. We hope you have permission to do this, of course. The options listed above allow you to do things like anonymise names and phone numbers, remove email addresses, etc.

  3. To have an RSS feed of recent messages.

There may be more efficient ways to do this if you have access to the database in which the Mailman archive is stored. If you don't, and can only access the web pages, this script is for you.

This script doesn't store any state locally between sessions, so every time it's run it will have to scrape several pages, even if nothing has changed (particularly if you want an RSS feed of n recent messages). There is a half-second delay between each fetch of a remote page, which slows things down but will hopefully prevent hammering web servers.
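
The pause between fetches can be sketched roughly as follows. This is a Python 3 illustration of the idea, not the script's actual fetching code: the function name `fetch_all` and its `fetch` and `sleep` parameters are assumptions, included so the loop can be tried without touching a real server.

```python
import time
import urllib.request


def fetch_all(urls, delay=0.5, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests.

    `fetch` and `sleep` are injectable (hypothetical parameters) so the
    loop can be exercised without network access.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as response:
                return response.read()
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every fetch except the first
        pages.append(fetch(url))
    return pages
```

Because nothing is cached between runs, every page is requested again each time, so the total waiting time grows with the number of pages scraped.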

There are caveats. I have only tested this with a couple of Mailman archives (one private, one public) and it seems to work fine. I'm sure that some people will find problems with different installations -- unscrapeable HTML, different URLs and filepaths, etc. Feel free to suggest fixes.

Installation

  1. Put the directory containing the MailmanArchiveScraper.py script somewhere you want to run it from.
  2. Make a copy of the MailmanArchiveScraper-example.cfg file and call it MailmanArchiveScraper.cfg.
  3. Set the configuration options in that file (see below).
  4. Install the required extra Python modules:
  5. Make sure the MailmanArchiveScraper.py script is executable (chmod +x).

Configuration

There is help in the configuration file for each setting. The minimum things you'll need to set are:

  1. domain -- The domain name that your Mailman pages are on.
  2. list_name -- Name of your mailing list.
  3. email and password -- Required if your Mailman archive is password protected.
  4. publish_dir -- The path to the local directory the files should be republished to.
  5. publish_url -- Needed if you're going to publish the messages to a website.
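
As a rough illustration of how these settings fit together, the sketch below parses an INI-style fragment with Python 3's standard `configparser` and checks that the always-required values are present. The section name `[Mailman]`, the example values and the `load_settings` helper are assumptions made for this example; the real layout is whatever MailmanArchiveScraper-example.cfg uses.

```python
import configparser

# Hypothetical fragment; the section name and values are illustrative only.
EXAMPLE_CFG = """
[Mailman]
domain = lists.example.org
list_name = my-list
email =
password =
publish_dir = /var/www/archive
publish_url = http://www.example.org/archive
"""

# Settings needed in every case; email and password matter only for
# password-protected archives, publish_url only when publishing to a website.
REQUIRED = ("domain", "list_name", "publish_dir")


def load_settings(text, section="Mailman"):
    """Parse INI text and return the section as a dict (hypothetical helper)."""
    # interpolation=None so a '%' in a password is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    settings = dict(parser[section])
    missing = [key for key in REQUIRED if not settings.get(key)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    return settings
```

Leaving `email` and `password` empty, as in the fragment, corresponds to scraping a public archive.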

What would also be nice:

  • Sending each message on as an email. I can't see how to do this simply: the script retains no state between runs, so it can't tell which emails have already been sent.
