ratsinfo-scraper

Scrape documents with associated metadata from http://ratsinfo.dresden.de

INSTALLATION

Get the code

git clone https://github.com/Mic92/ratsinfo-scraper.git

Get jruby

curl -L https://get.rvm.io | bash
rvm install jruby

Install bundler

gem install bundler

Install Dependencies

cd ratsinfo-scraper
bundle install

USAGE

To start scraping use:

rake

This will extract all documents to the directory given by the environment variable DOWNLOAD_PATH (defaults to "data") and convert them to plain text.
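The DOWNLOAD_PATH fallback described above can be sketched in Ruby as follows. This is a minimal illustration, not the scraper's own code: the helper name download_path is hypothetical, and only the variable name and its default "data" come from this README.

```ruby
# Hypothetical helper (not taken from the scraper's source): resolve the
# download directory from DOWNLOAD_PATH, falling back to "data" when the
# variable is not set.
def download_path(env = ENV)
  env.fetch("DOWNLOAD_PATH", "data")
end

puts download_path
```

To write somewhere else, set the variable when invoking rake, for example DOWNLOAD_PATH=/tmp/ratsinfo rake.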

To scrape an individual session, for example http://ratsinfo.dresden.de/to0040.php?__ksinr=100, use:

rake scrape_session[100]
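The number passed to the task corresponds to the __ksinr parameter of the session URL. As a small sketch of that mapping, assuming the URL pattern quoted above holds for every session, a hypothetical helper session_url (not part of the scraper) could look like this:

```ruby
# Hypothetical helper: build the session page URL for a numeric session id,
# following the URL pattern quoted in this README. Integer() rejects
# anything that is not a whole number.
def session_url(session_id)
  "http://ratsinfo.dresden.de/to0040.php?__ksinr=#{Integer(session_id)}"
end

puts session_url(100)
```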

To display all tasks use:

rake -T

The download directory has the following layout:

each session has its own directory, named after the session id
every document belonging to the session is extracted to this directory
if a document is a PDF, the scraper tries to convert it to text using the pdf-reader gem
additionally, a file called metadata.json is created. It is a machine-readable version of the index.htm file contained in the document archive
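Given this layout, the scraped sessions can be read back with a few lines of Ruby. The sketch below is only an illustration: the helper each_session is hypothetical, and it assumes that every session directory sits directly below the download directory and contains a metadata.json file.

```ruby
require "json"

# Hypothetical helper: yields each session directory name together with its
# parsed metadata.json. Directories without a metadata.json are skipped.
def each_session(root)
  return enum_for(:each_session, root) unless block_given?

  Dir.glob(File.join(root, "*", "metadata.json")).sort.each do |path|
    session_dir = File.basename(File.dirname(path))
    yield session_dir, JSON.parse(File.read(path))
  end
end

each_session(ENV.fetch("DOWNLOAD_PATH", "data")) do |dir, metadata|
  puts "#{dir}: #{metadata["description"]}"
end
```

If the download directory does not exist yet, the loop simply yields nothing.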

The metadata.json file follows this structure (optional means the value may be null for strings or an empty array; required values should always be present):

Key

Value

id

(required) a human-readable identifier of the session, ex: "SR/003/2009"

description

(required) the long name of the session, ex "3. Sitzung des Stadtrates"

committee

(required) the board holding the session, ex "Stadtrat"

started_at

(required) the time when the session started (converted from CEST to UTC), ex "2009-10-01T14:00:00Z"

ended_at

(optional) the time when the session ended (converted from CEST to UTC), ex "2009-10-01T18:30:00Z"

location

(optional) the location where the session took place, ex "Landeshauptstadt Dresden, im Neuen Rathaus, Plenarsaal,Rathausplatz 1, 01067 Dresden"

download_at

(required) the time when the archive was downloaded

documents

documents associated with the session (excluding those associated with parts)

file_name	the file name as it is in the session directory, ex: "00003144.pdf"
description	name of the document, ex: "Vorlage Gremien"

parts

(optional) each session can contain an array of parts. A part is an object containing the following keys:

description

(required) name of the part, ex "Beschlussvorlagen zu VOB-Vergaben"

template_id

(optional) some parts use templates; further information: http://ratsinfo.dresden.de/vo0042.php

documents

(optional) array of documents associated with this part

file_name	the file name as it is in the session directory, ex: "00003144.pdf"
description	name of the document, ex: "Vorlage Gremien"

decision

(optional) some sessions ended with a decision made by the committee, ex: "Zustimmung"

vote_result

(optional) object

pro	(required) votes for the subject, ex: 1
contra	(required) votes against the subject, ex: 2
abstention	(required) abstentions, i.e. votes neither for nor against the subject, ex: 0
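To make the structure more concrete, the following Ruby sketch parses a small metadata document and works with it. The sample is assembled from the example values in the table above and is not a real scraper output. The helpers document_file_names and started_at_cest are hypothetical, and the fixed offset of +02:00 assumes the session took place during summer time (CEST).

```ruby
require "json"
require "time"

# Sample metadata built from the example values in the table above.
sample_json = <<~JSON
  {
    "id": "SR/003/2009",
    "description": "3. Sitzung des Stadtrates",
    "committee": "Stadtrat",
    "started_at": "2009-10-01T14:00:00Z",
    "ended_at": null,
    "documents": [
      { "file_name": "00003144.pdf", "description": "Vorlage Gremien" }
    ],
    "parts": [
      {
        "description": "Beschlussvorlagen zu VOB-Vergaben",
        "template_id": null,
        "documents": []
      }
    ]
  }
JSON

# Hypothetical helper: file names of all documents of a session, including
# those attached to parts. Array(nil) is [], which covers optional keys.
def document_file_names(metadata)
  session_docs = Array(metadata["documents"])
  part_docs = Array(metadata["parts"]).flat_map { |part| Array(part["documents"]) }
  (session_docs + part_docs).map { |doc| doc["file_name"] }
end

# Hypothetical helper: the UTC start time shifted back to CEST (UTC+2).
def started_at_cest(metadata)
  Time.iso8601(metadata["started_at"]).getlocal("+02:00")
end

metadata = JSON.parse(sample_json)
puts document_file_names(metadata)
puts started_at_cest(metadata).strftime("%Y-%m-%d %H:%M")
```

Wrapping optional values in Array() avoids separate nil checks for sessions that have no parts or no documents.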

TODO

continue where the last scan stopped
templates for custom tasks
clean up task
some kind of tests
