goshaQ / ridiculous-scraper

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ridiculous-scraper

Getting started

Docker Compose

The provided docker-compose.yml is configured to launch two services, one of which is the graph database and the other is the scraper. The easiest way to deploy the project is to run it on a single machine by the following commands:

git clone https://github.com/goshaQ/ridiculous-scraper.git
cd ridiculous-scraper/
docker-compose up

About

The Scraper

The website we are interested in allows to search for a company by Register code. Because our goal is to extract as much information about Estonian companies as possible, the obvious choice is to iterate over all possible values and generate a search request every iteration. In the project, the CreditinfoScraper class takes this responsibility.

The information about a company is retrieved from the server response. Basically, we process each row of the table describing the company manually and extract usefull information. That is done in that way because the data have poor structure. After extraction, the result is stored in the graph database. But if there already exists a record about the company.

To speed up the process we employ a data parallelism approach, that is we use a thread pool to process the result of each search request. However, because of some restriction on the number of generated requests per second imposed by the website, we actually make a delay between search requests.

Graph Data Model

We use pretty simple data model which contains two entities and single relationship between them. For now, the main objective is to capture the relationship between a company and its representatives. All of the company details are stored as properties of the corresponding node. For sure, this is not an optimal design, however further improvements require a better understanding of how should the interaction with the database take place, or what the database search queries are of interest.

The definition of the data model is shown in the image below.

About

License:MIT License


Languages

Language:Python 98.8%Language:Shell 1.2%