chrishein / bora_crawler

BORA (Boletín Oficial de la República Argentina) Crawler implemented using Scrapy

BORA Crawler

BORA (Boletín Oficial de la República Argentina) Crawler implemented using Scrapy.

BORA is the Official Gazette of Argentina, where the government publishes public and legal notices, including company incorporations and changes to a company's structure and shareholders.

More details can be found in this article.

This crawler saves the following information for each notice:

  • id: Notice ID in the BORA website
  • company: Name of the company
  • date: Date of publication
  • type: Type of publication (e.g. company constitution, company modification)
  • content: Text of publication

The content of the publication contains unstructured text and must be further processed in order to extract data.
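As a rough illustration of the record layout described above, the following sketch models a notice with the five listed fields and loads a JSON array of such records. The `Notice` class, the `load_notices` helper, and the sample data are hypothetical and not part of this repository; the field types (in particular `id` and `date` as strings) are assumptions.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class Notice:
    """One crawled notice, mirroring the five fields listed above."""
    id: str        # assumed to be a string; the real type may differ
    company: str
    date: str      # kept as raw text; the exported date format is not assumed
    type: str
    content: str   # unstructured text that still needs further processing


def load_notices(json_text):
    """Parse a JSON array of notice objects into Notice records."""
    return [Notice(**record) for record in json.loads(json_text)]


# Invented sample data, for illustration only.
SAMPLE = """[
  {"id": "1", "company": "Example SA", "date": "2011-01-03",
   "type": "constitution", "content": "Placeholder text"},
  {"id": "2", "company": "Sample SRL", "date": "2011-01-04",
   "type": "modification", "content": "Placeholder text"}
]"""

notices = load_notices(SAMPLE)

# Count how many notices of each type were loaded.
counts_by_type = Counter(notice.type for notice in notices)
```

Keeping `content` as plain text reflects the point above: any structured data has to be extracted from it in a separate processing step.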

Running the Spider

To run the spider and save the crawled items in JSON use:

scrapy crawl bora -o items_bora.json -a start_date=YYYY-mm-dd -a end_date=YYYY-mm-dd

start_date and end_date are optional; they default to 2011-01-01 and the current date, respectively.
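Scrapy passes `-a` arguments to the spider as strings, so the date handling described above could look roughly like the sketch below. The `resolve_date_range` function and `DEFAULT_START` constant are hypothetical names, not the spider's actual code, and rejecting a start date later than the end date is an added assumption.

```python
from datetime import date, datetime

# Default start date stated in this README.
DEFAULT_START = date(2011, 1, 1)


def resolve_date_range(start_date=None, end_date=None, today=None):
    """Turn optional YYYY-mm-dd strings into a (start, end) pair of dates.

    Missing values fall back to 2011-01-01 and the current date.
    The `today` parameter exists only to make the function testable.
    """
    today = today or date.today()
    start = (datetime.strptime(start_date, "%Y-%m-%d").date()
             if start_date else DEFAULT_START)
    end = (datetime.strptime(end_date, "%Y-%m-%d").date()
           if end_date else today)
    if start > end:
        # Assumption: an inverted range is treated as a usage error.
        raise ValueError("start_date must not be later than end_date")
    return start, end
```

For example, `resolve_date_range("2015-03-02", "2015-03-10")` yields the two corresponding `date` objects, while calling it with no arguments yields the range from 2011-01-01 to today.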

Deploying to Scrapinghub

When deploying to Scrapinghub, make sure you use the scrapy stack, as explained here, in order to avoid SSL errors.

License

Distributed under the MIT License. See LICENSE file for further details.

Languages

Language:Python 100.0%