ratsinfo-scraper
Scrape documents with associated metadata from http://ratsinfo.dresden.de
INSTALLATION
Get the code
git clone https://github.com/Mic92/ratsinfo-scraper.git
Get jruby
curl -L https://get.rvm.io | bash
rvm install jruby
Install bundler
gem install bundler
Install Dependencies
cd ratsinfo-scraper
bundle install
USAGE
To start scraping use:
rake
This extracts all documents to the path given by the environment variable DOWNLOAD_PATH (defaults to "data") and converts them to plain text.
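As an illustration of the DOWNLOAD_PATH fallback, here is a minimal Ruby sketch. The helper name `download_path` is hypothetical and not taken from the scraper's Rakefile; it only shows how an environment lookup with a "data" default can be written.

```ruby
# Hypothetical helper (not part of the scraper): resolve the download
# directory from the environment, falling back to "data" when
# DOWNLOAD_PATH is not set.
def download_path(env = ENV)
  env.fetch("DOWNLOAD_PATH", "data")
end

puts download_path
```

With a lookup like this, running `DOWNLOAD_PATH=/tmp/ratsinfo rake` would redirect the output for a single run.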
To scrape an individual session, for example http://ratsinfo.dresden.de/to0040.php?__ksinr=100, use:
rake scrape_session[100]
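The task argument corresponds to the `__ksinr` query parameter in the example URL. The following sketch shows that mapping; the constant `SESSION_URL` and the helper `session_url` are hypothetical names, and the URL pattern is assumed from the single example above.

```ruby
require "uri"

# Base URL taken from the example above; the exact pattern is an assumption.
SESSION_URL = "http://ratsinfo.dresden.de/to0040.php"

# Hypothetical helper: build the session page URL for a numeric session id.
# Integer() also accepts strings, which is how rake passes task arguments.
def session_url(id)
  uri = URI(SESSION_URL)
  uri.query = URI.encode_www_form("__ksinr" => Integer(id))
  uri.to_s
end

puts session_url(100)
```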
To display all tasks use:
rake -T
The download directory will have the following scheme:
- each session has its own directory, named after the session id
- every document belonging to this session is extracted to this directory
- if a document is a PDF, the scraper tries to convert it to text using the pdf-reader gem
- additionally a file called metadata.json is created. This is a machine-readable version of the index.htm file, which is contained in the document archives
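The layout above can be walked with a few lines of Ruby. This is a sketch under the stated scheme only: the helper `each_session` is a hypothetical name, and everything in a session directory other than metadata.json is simply treated as a document file.

```ruby
require "json"

# Hypothetical helper: visit every session directory below the download
# path and yield its name, its parsed metadata.json and the names of the
# remaining files in that directory.
def each_session(download_path = "data")
  return enum_for(:each_session, download_path) unless block_given?

  pattern = File.join(download_path, "*", "metadata.json")
  Dir.glob(pattern).sort.each do |meta_file|
    session_dir = File.dirname(meta_file)
    metadata = JSON.parse(File.read(meta_file))
    documents = Dir.children(session_dir).sort - ["metadata.json"]
    yield File.basename(session_dir), metadata, documents
  end
end

each_session(ENV.fetch("DOWNLOAD_PATH", "data")) do |dir, metadata, documents|
  puts "#{dir}: #{metadata["id"]} (#{documents.size} files)"
end
```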
The metadata.json file follows this structure ("optional" means the value may be null for strings or an empty array; required values should always be available):
Key | Value
---|---
id | (required) a human-readable identifier of the session, e.g. "SR/003/2009"
description | (required) the long name of the session, e.g. "3. Sitzung des Stadtrates"
committee | (required) the board holding the session, e.g. "Stadtrat"
started_at | (required) the time when the session started (converted from CEST time), e.g. "2009-10-01T14:00:00Z"
ended_at | (optional) the time when the session ended (converted from CEST time), e.g. "2009-10-01T18:30:00Z"
location | (optional) the location where the session took place, e.g. "Landeshauptstadt Dresden, im Neuen Rathaus, Plenarsaal,Rathausplatz 1, 01067 Dresden"
download_at | (required) the time when the archive was downloaded
documents | documents associated with the session (excluding those associated with parts)
parts | (optional) each session can contain an array of parts. A part is an object containing the following keys:
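To show how a consumer might read such a file, here is a hedged Ruby sketch that checks the keys marked "(required)" in the table above and parses the start and end times. The helper `load_metadata` is a hypothetical name. The `download_at` value in the sample is a made-up placeholder, and treating a missing `documents` or `parts` key as an empty array is an assumption.

```ruby
require "json"
require "time"

# Keys marked "(required)" in the table above.
REQUIRED_KEYS = %w[id description committee started_at download_at].freeze

# Hypothetical helper: parse a metadata.json string, check the required
# keys and convert started_at and ended_at to Time objects.
def load_metadata(json)
  data = JSON.parse(json)
  missing = REQUIRED_KEYS.select { |key| data[key].nil? }
  raise ArgumentError, "missing keys: #{missing.join(", ")}" unless missing.empty?

  data.merge(
    "started_at" => Time.iso8601(data["started_at"]),
    "ended_at"   => data["ended_at"] && Time.iso8601(data["ended_at"]),
    "documents"  => data["documents"] || [], # assumption: absent means empty
    "parts"      => data["parts"] || []      # assumption: absent means empty
  )
end

# Sample built from the example values in the table; download_at is a
# made-up placeholder.
sample = <<~JSON
  {
    "id": "SR/003/2009",
    "description": "3. Sitzung des Stadtrates",
    "committee": "Stadtrat",
    "started_at": "2009-10-01T14:00:00Z",
    "ended_at": null,
    "location": null,
    "download_at": "2013-01-01T00:00:00Z",
    "documents": [],
    "parts": []
  }
JSON

session = load_metadata(sample)
puts session["started_at"].year
```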
TODO
- resume where the last scan stopped
- templates for custom tasks
- clean up task
- some kind of tests