mwalmsley / gaceta

Scraping utility for downloading Gaceta parliamentary vote history

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

gaceta

Scraping utility for downloading Gaceta parliamentary vote history.

Install

Only needs pandas, requests, and beautifulsoup4.

pip install -r requirements.txt

Running

python download_votes.py
python parse_votes.py
python join_votes.py

The server is quite unreliable - it seems to go through periods where any request returns a 404. The code is designed to stop if this happens. If you get errors like

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://gaceta.diputados.gob.mx/voto58/

then try again later.

Set overwrite=True in download_votes.py to always download every file, even if they already exist.

Website file structure

Plain text files are hosted on the server with urls like

http://gaceta.diputados.gob.mx/votoN/ordiXY/votoMMDD-I.txt

For example

http://gaceta.diputados.gob.mx/voto64/voto64/ordi22/voto0630-1.txt

votoN is votes during the Nth parliament. For example, voto64 is votes during the "LXIV Legislatura (2018-2021)", where LXIV=64 (took me way too long to notice).

ordiXY is the parliamentary year and period that year. This page lists them in the form

  • "first year (of the parliament), first ordinary period"
  • "first year (of the parliament), second ordinary period"
  • "first year (of the parliament), first extraordinary period"
  • ...

so e.g. ordi22 is the second year of the parliament, and the second ordinary session that year.

The text file name votoMMDD-I.txt is the date and the index of the proposal for that day. So e.g. voto1219-2 is the second proposal of the day of Dec 19th.

About

Scraping utility for downloading Gaceta parliamentary vote history


Languages

Language:Python 100.0%