Crawling@Home Server

The server powering Crawling@Home's effort to filter Common Crawl with CLIP and build a large-scale image-text dataset.

UPDATE

jobs/open.json is now too big to store on GitHub. You can download it from here.

git clone https://github.com/TheoCoombes/crawlingathome-server
cd crawlingathome-server
pip install -r requirements.txt

The jobs data is already compiled for Common Crawl. To start the server, run main.py:

python main.py
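Because jobs/open.json is no longer shipped with the repository, it may help to confirm the downloaded file is in place before starting the server. The following is a minimal sketch, not part of the project: the helper name `jobs_file_ready` is hypothetical, and it only checks that the file exists and parses as JSON, without assuming anything about its contents.

```python
import json
from pathlib import Path


def jobs_file_ready(path="jobs/open.json"):
    """Return True if `path` exists and contains valid JSON.

    Hypothetical helper for illustration; the server itself may
    load or validate the file differently.
    """
    p = Path(path)
    if not p.is_file():
        return False
    try:
        with p.open("r", encoding="utf-8") as f:
            json.load(f)  # parse only; the structure is not inspected
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    return True


if __name__ == "__main__":
    if jobs_file_ready():
        print("jobs/open.json found and readable")
    else:
        print("jobs/open.json is missing or is not valid JSON")
```

Running this from the repository root before `python main.py` gives an early, readable message if the download step was skipped.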

You can change the server's host and port in config.py.
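As an illustration of the kind of settings involved, here is a hedged sketch of what a host and port configuration could look like. The names `HOST`, `PORT`, and `validate_address` are assumptions made for this example; check the actual config.py for the names and defaults the server uses.

```python
# Hypothetical settings; the real config.py may use different names and values.
HOST = "0.0.0.0"  # listen on all interfaces; "127.0.0.1" limits access to this machine
PORT = 8080       # any free TCP port in the range 1-65535


def validate_address(host, port):
    """Return (host, port) unchanged, or raise ValueError if either is unusable."""
    if not isinstance(host, str) or not host:
        raise ValueError("host must be a non-empty string")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        raise ValueError("port must be an integer between 1 and 65535")
    return host, port


if __name__ == "__main__":
    host, port = validate_address(HOST, PORT)
    print(f"Server would bind to {host}:{port}")
```

Binding to 0.0.0.0 exposes the server on every network interface, so 127.0.0.1 is the safer choice for local testing.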


MIT License
