developers-italia-backend

Backend & crawler for the OSS catalog of Developers Italia


How it works

The crawler finds and retrieves the publiccode.yml files from the organizations in the whitelist.

It then creates YAML files used by the Jekyll build chain to generate the static pages of developers.italia.it.

Elasticsearch 6.8 is used to store the data; it must be active and ready to accept connections before the crawler is started.
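
A quick way to verify that Elasticsearch is ready before starting the crawler (assuming the default local endpoint on port 9200):

    # Query the cluster health endpoint; a "green" or "yellow" status
    # means Elasticsearch is ready to accept connections.
    curl -s 'http://localhost:9200/_cluster/health?pretty'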

Setup and deployment processes

The crawler can either be run manually on the target machine or deployed as a Docker container in Kubernetes via its Helm chart.
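
For the Kubernetes route, the usual Helm workflow applies; a minimal sketch, assuming the chart is checked out locally in a helm-chart/ directory (release name and chart path are illustrative, not taken from this repository):

    # Install or upgrade the crawler release from the local chart
    helm upgrade --install crawler ./helm-chart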

Manually configure and build the crawler

  1. cd crawler

  2. Save the auth tokens to domains.yml.

  3. Rename config.toml.example to config.toml and set the variables

    NOTE: The application also supports environment variables in place of the config.toml file. Keep in mind that environment variables take precedence over the values in the configuration file.

  4. Build the crawler binary with make (the full sequence is sketched after this list)
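
Putting the steps together, a minimal sketch of the manual setup (file names as referenced above):

    cd crawler
    cp domains.yml.example domains.yml    # then add your auth tokens
    mv config.toml.example config.toml    # then set the variables
    make                                  # builds the binary invoked below as bin/crawler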

Docker

The repository has a Dockerfile, used to build the production image, and a docker-compose.yml file to set up the development environment.

  1. Copy the .env.example file to .env and modify the environment variables as needed. .env.example holds a detailed description of each variable.

    cp .env.example .env
  2. Save your auth tokens to domains.yml

    cp crawler/domains.yml.example crawler/domains.yml
    editor crawler/domains.yml
  3. Start the environment:

    docker-compose up
    

Run the crawler

  • Crawl mode (all items in the whitelists): bin/crawler crawl whitelist/*.yml (see the examples after this list)

    • crawl supports blacklists (see below for details). The crawler matches each repository URL in its list against the ones listed in the blacklists; on a match, it prints a warning log and skips all operations on that repository. Furthermore, it immediately removes the blacklisted repository from Elasticsearch if it is present.

      It also generates:

      • amministrazioni.yml containing all the Public Administrations with their name, website URL and iPA code.

      • softwares.yml containing all the software that the crawler scraped, validated and saved into Elasticsearch.

        The structure is similar to the publiccode data structure, with some additional fields such as vitality and vitality score.

      • software-riuso.yml containing all the software in softwares.yml having an iPA code.

      • software-open-source.yml containing all the software in softwares.yml with no iPA code.

  • One mode (single repository URL): bin/crawler one [repo url] whitelist/*.yml

    • In this mode, a single repository at a time is evaluated. If the organization is present in the whitelist, its iPA code is matched against the ones there; otherwise it is set to null and the slug ends with a random code (instead of the iPA code). Furthermore, the iPA code validation, a simple check within the whitelists ensuring that the code belongs to the selected PA, is skipped.
    • one supports blacklists (see below for details): if [repo url] is present in one of the indicated blacklists, the crawler exits immediately. This ignores every repository defined in those lists, preventing unauthorized loading into the catalog.
  • bin/crawler updateipa downloads iPA data and writes it into Elasticsearch

  • bin/crawler delete [URL] deletes software from Elasticsearch using its code hosting URL specified in publiccode.url

  • bin/crawler download-whitelist downloads organizations and repositories from the onboarding portal repository and saves them to a whitelist file
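
For example (the repository URL below is just a hypothetical placeholder):

    # Crawl every organization listed in the whitelists
    bin/crawler crawl whitelist/*.yml

    # Evaluate a single repository
    bin/crawler one https://github.com/example-org/example-repo whitelist/*.yml

    # Remove a repository from Elasticsearch by its code hosting URL
    bin/crawler delete https://github.com/example-org/example-repo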

Crawler whitelists

The whitelist directory contains the list of organizations to crawl from.

whitelist/manual-reuse.yml is a list of Public Administration repositories that for various reasons were not onboarded with developers-italia-onboarding, while whitelist/thirdparty.yml contains the non-PA repos.

Here's an example of what the files might look like:

- id: "Comune di Bagnacavallo" # generic name of the organization.
  codice-iPA: "c_a547" # codice-iPA
  organizations: # list of organization urls.
    - "https://github.com/gith002"

Crawler blacklists

Blacklists are needed to exclude individual repositories that are not in line with our guidelines.

You can set BLACKLIST_FOLDER in config.toml to point to a directory where blacklist files are located. Blacklisting is currently supported by the one and crawl commands.
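
For example, in config.toml (the folder name is illustrative):

    BLACKLIST_FOLDER = "blacklist"

Each file in that folder lists the repositories to exclude. The exact schema is defined by the crawler's blacklist files; a minimal hypothetical sketch, mirroring the whitelist style above:

    - url: "https://github.com/example-org/excluded-repo" # hypothetical entry
      reason: "not in line with the catalog guidelines"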

Authors

Developers Italia is a project by AgID and the Italian Digital Team, which developed the crawler and maintains this repository.


License

GNU Affero General Public License v3.0

