developers-italia-backend


Backend & crawler for the OSS catalog of Developers Italia

The crawler finds and retrieves all the publiccode.yml files from the organizations registered on GitHub/Bitbucket/GitLab listed in the whitelists, and then generates the YAML files that are later used by the Jekyll build chain to generate the static pages of developers.italia.it.
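
Each publiccode.yml file describes the software it accompanies through a set of standardized top-level keys. The fragment below is only an illustrative sketch (values are made up, the file is incomplete and not validated; refer to the publiccode.yml standard for the actual schema):

    publiccodeYmlVersion: "0.2"
    name: Example Software
    url: "https://github.com/example-org/example-software.git"
    releaseDate: "2019-01-01"
    developmentStatus: stable
    softwareType: standalone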

Components

Dependencies

Set-up

Stack

  1. rename .env.example to .env and fill the variables with your values

    • default Elasticsearch user and password are elastic:elastic
    • default Kibana user and password are kibana:kibana
  2. rename elasticsearch/config/searchguard/sg_internal_users.yml.example to elasticsearch/config/searchguard/sg_internal_users.yml and insert the correct passwords

    Hashed passwords can be generated with:

    docker exec -t -i developers-italia-backend_elasticsearch elasticsearch/plugins/search-guard-6/tools/hash.sh -p <password>
  3. insert the kibana password in kibana/config/kibana.yml

  4. configure the nginx proxy for the elasticsearch host with the following directives:

    limit_req_zone $binary_remote_addr zone=elasticsearch_limit:10m rate=10r/s;
    
    server {
        ...
        location / {
            limit_req zone=elasticsearch_limit burst=20 nodelay;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_pass http://localhost:9200;
            proxy_ssl_session_reuse off;
            proxy_cache_bypass $http_upgrade;
            proxy_redirect off;
        }
    }
    
  5. you might need to run sysctl -w vm.max_map_count=262144 and make the setting permanent in /etc/sysctl.conf in order for Elasticsearch to start, as documented here (see the example after this list)

  6. start the Docker stack: make up
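
As mentioned in step 5, a common way to raise the memory map limit immediately and keep it across reboots is the standard sysctl procedure (run as root; this is a sketch, not a project-specific script):

    # apply the setting right away (lost on reboot)
    sysctl -w vm.max_map_count=262144
    # persist it and reload the sysctl configuration
    echo "vm.max_map_count=262144" >> /etc/sysctl.conf
    sysctl -p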

Crawler

  1. cd crawler
  2. Fill your domains.yml file with configuration values (like host-specific basic auth tokens)
  3. Rename config.toml.example to config.toml and fill the variables
  4. build the crawler binary: make
  5. start the crawler: bin/crawler crawl whitelist/*.yml
  6. configure it in crontab as desired (see the example after this list)
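
As a sketch of step 6, a crontab entry that runs the crawler every night at 02:00 could look like the following (the checkout path is a placeholder):

    # hypothetical crontab entry for the nightly crawl
    0 2 * * * cd /path/to/developers-italia-backend/crawler && bin/crawler crawl whitelist/*.yml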

Tools

  • bin/crawler updateipa downloads IPA data and writes it into Elasticsearch
  • bin/crawler download-whitelist downloads orgs and repos from the onboarding portal and writes them to a whitelist file
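
Both are subcommands of the same binary built above and can be run from the crawler directory, for example:

    bin/crawler download-whitelist
    bin/crawler updateipa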

Troubleshooting

  • The Docker logs show that the Elasticsearch container needs more virtual memory, and the stack hangs at Stalling for Elasticsearch....

    Increase the container's virtual memory: https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-cli-run-prod-mode

  • When running make build for the crawler image, a fatal memory error occurs: "fatal error: out of memory"

    You probably need to increase the memory of the Docker machine VM: docker-machine stop && VBoxManage modifyvm default --cpus 2 && VBoxManage modifyvm default --memory 2048 && docker-machine start

Development

In order to access Elasticsearch with write permissions from the outside, you can forward port 9200 via SSH with ssh -L9200:localhost:9200 and set ELASTIC_URL = "http://localhost:9200/" in your local config.toml.
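
For example (the host name is a placeholder for your Elasticsearch server):

    # forward the remote Elasticsearch port to your machine
    ssh -L9200:localhost:9200 user@your-elasticsearch-host

    # then, in crawler/config.toml, point the crawler at the tunnel
    ELASTIC_URL = "http://localhost:9200/"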

See also

Authors

Developers Italia is a project by AgID and the Italian Digital Team, which developed the crawler and maintains this repository.
