Crawling@Home GPU controlled Hetzner Cloud swarm of scrapers

Help us build a billion-scale image-caption dataset by filtering Common Crawl with OpenAI CLIP. At the time of this writing we are up to 35 million high quality pairs ready for training various models but we still expect your help to advance to the potential 30 billion pairs estimated to exist in the commoncrawl data. This dataset is intended for public use and towards a truly open access to AI for everyone !

Concept

This scraping task comes with specific characteristics: link lists might be old and images might not be online anymore, even entire domains might be missing. Also there are seldom multiple links pointing to the same domain, so the DNS queries are many and often. Finally after the actual scraping there is a computational intensive task to calculate similarities between images themselves and their captions.

On a normal CPU machine, scraping and filtering take almost the same time. On a GPU though filtering is much faster, in order of 60x faster than on single CPU.

Hence this concept for crawling@home where a cental GPU machine can drive a swarm of cloud workers then perform computing intensive task on GPU.

At this time the script is tested on a single GPU driving 32 workers. At full load we estimate getting about 6M pairs per 24 hours for the cost of using the local GPU and 6 Euro in Hetzner could computing.

Remember to watch your progress at http://cah.io.community/

Recent updates

Due to new features introduced in CAH tracking server and client, we have been able to further improve the architecture and obtain top performance by completely separating CPU workers from GPU workers.

Thus the code migrated to:

Swarm control: use infrastructure.py to control the swarm at Hetzner Cloud via commands like python3 infrastructure.py up 20 fsn1 where up means bring up swarm, 20 is the desired number of nodes, and fsn1 is the desired datacenter location.
CPU clients: a) worker.py is used by swarm nodes but it can be ran from any CPU only computer with good network link to the internet. It only require one CPU core and 2GB RAM b) cpuclient.ipynb is a Jupyter Notebook that can be ran from Google Colab, Kaggle or any other CPU only powered Jupyter environment. Please check terms for cloud services as they could considder this script as crawler/scraper and potentially cancel the account running it
GPU clients only consume max 3.5GB of GPU VRAM so any GPU card with 4GB VRAM or more is deemed compatible: a) run python3 gpu.py from any Linux based PC with an Nvidia GPU and correct drivers installed b) run gpuclient.ipynb from any jupyter environment with GPU such as Google Colab with GPU accelerator.

If you want to install on your own box, then

Prerequisites

Ubuntu box with 4GB+ Nvidia GPU
Nvidia driver installed
Cuda toolkit 11.0 (also corresponding cudnn is recommended for future)
check driver installation with nvidia-smi command
your user is able to run sudo commands
install python3-pip and git packages

Distributed infrastructure setup and run

Make an account at Hetzner Cloud (https://www.hetzner.com/) and issue an API token
create the .env file and paste your HCLOUD API key in it. optionally, if you have more than one account, paste all API keys each on a separate line
bring up infrastructure at any time with python3 infrastructure.py up N in order to raise N nodes. It will scan all API keys and create maximum available servers on each until N limit is met
tear down infrastructure at any time with python3 infrastructure.py down in order to shutdown things (and save cash). this will shut down all cloud servers that belong to all API tokens saved in the .env file. Be aware, this command will delete all servers in the accounts even if they are NOT related to this project !!!

If you wish to SSH into any droplet you can use this command: ssh -oStrictHostKeyChecking=no -oIdentitiesOnly=yes -i~/.ssh/id_cah crawl@<<droplet_ip>>. The crawling script is ran as a service, check logs with tail -f crawl.log. Access service status or commands with sudo systemctl stop|restart|start crawl

If you are asked for any droplet root password at any time, it means you need to rerun git pull and source conda-setup.sh to refresh the files and regenerate the ssh keys pair.

How to run GPU node from home computer

run git clone https://github.com/rvencu/crawlingathome-gpu-hcloud --branch staged-clients, to download crawlingathome GPU node script
run cd crawlingathome-gpu-hcloud, to enter the newly created directory
run source conda-setup.sh to setup the environment if you use anaconda. otherwise use source pip-setup.sh. the script will ask for a nickame to be used on leaderboard as well as for the sudo password
run python3 gpu.py, to start Distributed Crawling with Central GPU Processing with a swarm of N scrapers! The script will run in a loop that can be interrupted at any time with Ctrl-C. The cloud infrastructure will be automatically shut down after logs from all nodes would have been collected on GPU computer. Change N with any number you like provided it is withing your cloud account limits.

Notebook version

open the notebook from Google Colab or Kaggle by looking it up on Github or using direct url https://raw.githubusercontent.com/rvencu/crawlingathome-gpu-hcloud/staged-clients/gpuclient.ipynb or https://raw.githubusercontent.com/rvencu/crawlingathome-gpu-hcloud/staged-clients/cpuclient.ipynb or clicking the button above
run all the cells and insert proper values into the form (nickname, leave group size as 16 for best results)

deprecated:

scripted notebook run on Kaggle (alpha version)

make a Kaggle account and issue an API Token
from project folder run . kaggle.sh
input nickname, Hetzner API token and number of desired nodes in the swarm when asked for
the script will stop automatically in 9 hours. relaunch it once per day for 3 days per week

TODO

This work is based on code written by:

This is a subproject ran by the community around https://github.com/lucidrains/DALLE-pytorch

Alternative single computer solutions to contribute to the Crawling@Home dataset

this notebook that can run in Google Colab and Kaggle: [![Open In Colab] (https://colab.research.google.com/assets/colab-badge.svg)] (https://colab.research.google.com/github/rvencu/crawlingathome-gpu-hcloud/blob/main/gpucah.ipynb) (https://raw.githubusercontent.com/rvencu/crawlingathome-worker/colab-mod-asks/fastcah.ipynb)
this notebook in Google Colab: [![Open In Colab] (https://colab.research.google.com/assets/colab-badge.svg)] (https://colab.research.google.com/github/ARKseal/crawlingathome-worker/blob/colab-gpu/colab-gpu.ipynb)
this notebook in Google Colab: [![Open In Colab] (https://colab.research.google.com/assets/colab-badge.svg)] (https://colab.research.google.com/drive/1o8MndyY-l9vaox8pb0xfe7VQXUt8Qq0s)
this repo for autonomous script (on home computer or cloud virtual computer): https://github.com/rvencu/crawlingathome-worker/tree/master
this alternate repo for the same: https://github.com/christophschuhmann/crawlingathome-worker

About

GPU controlled Hetzner Cloud workers swarm for Crawling@Home project

MIT License

Languages

Language:Python 55.3%Language:Jupyter Notebook 39.3%Language:Shell 3.4%Language:Dockerfile 2.0%