SherlockScraper

A Scrapy bot that crawls URLs to a configurable depth and searches the scraped pages for a file (URL) matching a query

Setup

Clone the project and go to the bot directory

git clone https://github.com/DevIhor/SherlockScraper.git
cd SherlockScraper

Create a virtual environment and activate it (it must be active whenever you run the scraper)

python3 -m venv .venv
source .venv/bin/activate

Install the Python requirements

pip install -r requirements.txt

Alternatively, set up the environment with pipenv. Install pipenv

python3.10 -m pip install -U pipenv

Create the virtual environment and install the Python requirements

python3.10 -m pipenv install --deploy --dev --ignore-pipfile

Activate the virtual environment

pipenv shell

Or, if the virtual environment lives in .venv inside the project, activate it directly

source .venv/bin/activate

Run the scraper

python start.py --help

When you have finished working with the bot, deactivate the virtual environment

deactivate

--help - show all available options;
--start_point ... - set the start URL to scrape;
--domain_zone ... - set the domain zone of the URLs to scrape;
--query ... - set the query to search for on every scraped web page;
--full_search - also search for the query inside .js files;
--links_per_url ... - set the number of URLs to extract from each web page and scrape further in depth;
--scraping_deep_level ... - set the maximum depth level to which web pages are scraped;
--concurrency ... - set the number of concurrent requests.
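
To make the option list concrete, here is a minimal sketch of how such flags could be declared with argparse. It is not taken from start.py: the function name build_parser, the help texts, and the defaults (None meaning "no limit") are assumptions made for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags; not the project's code."""
    parser = argparse.ArgumentParser(prog="start.py")
    parser.add_argument("--start_point", help="start URL to scrape")
    parser.add_argument("--domain_zone", help="domain zone of the URLs to scrape")
    parser.add_argument("--query", help="query to search for on scraped pages")
    parser.add_argument("--full_search", action="store_true",
                        help="also search for the query inside .js files")
    # Assumption: None stands for "no limit"; the real defaults may differ.
    parser.add_argument("--links_per_url", type=int, default=None)
    parser.add_argument("--scraping_deep_level", type=int, default=None)
    parser.add_argument("--concurrency", type=int, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_parser().parse_args([
    "--start_point=https://ukr.net",
    "--domain_zone=.net",
    "--query=analytics.js",
    "--links_per_url=10",
    "--full_search",
])
print(args.start_point, args.links_per_url, args.full_search)
```

Options that are not passed keep their default, so in this sketch args.scraping_deep_level stays None.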

Run the scraper without limits

python start.py --start_point="https://ukr.net" --domain_zone=".net" --query="analytics.js"

Run the scraper with limits

python start.py --start_point="https://ukr.net" --domain_zone=".net" --query="analytics.js" --links_per_url=10 --scraping_deep_level=4 --full_search
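
The difference between the two runs can be illustrated with a small breadth-first crawl over an in-memory link graph. This is only a sketch under stated assumptions, not the bot's actual Scrapy code: the LINKS graph and the crawl function are hypothetical, depth is assumed to start at 0 for the start URL, and the domain zone filter is assumed to be applied before the per-page link limit.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real web pages.
LINKS = {
    "https://a.net": ["https://b.net", "https://c.com", "https://d.net"],
    "https://b.net": ["https://e.net"],
    "https://d.net": ["https://f.net"],
    "https://e.net": ["https://g.net"],
}


def crawl(start, domain_zone, links_per_url=None, max_depth=None):
    """Breadth-first walk returning (url, depth) pairs in visiting order."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if max_depth is not None and depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        # Keep only links whose host ends with the requested domain zone.
        in_zone = [link for link in LINKS.get(url, [])
                   if (urlparse(link).hostname or "").endswith(domain_zone)]
        # Slicing with None keeps every link, which models the unlimited run.
        for link in in_zone[:links_per_url]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


unlimited = crawl("https://a.net", ".net")
limited = crawl("https://a.net", ".net", links_per_url=1, max_depth=2)
print(len(unlimited), len(limited))
```

In this toy graph the unlimited walk visits every reachable .net page, while the limited walk follows one link per page and stops expanding at depth 2.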

output.txt - list of scraped URLs

query_output.txt - list of URLs of web pages that contain the query string

output.csv - list of scraped URLs with the depth level of each URL

query_output.csv - list of URLs of web pages that contain the query string, together with the URL that contains the query string
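
The CSV output can be post-processed with the standard csv module. The sketch below counts scraped URLs per depth level. The column layout is an assumption (URL in the first column, depth level in the second), so check an actual output.csv before relying on it; the helper name pages_per_depth and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def pages_per_depth(csv_file):
    """Count rows per depth level; assumes rows shaped like `url,depth`."""
    counts = Counter()
    for row in csv.reader(csv_file):
        if len(row) < 2 or not row[1].strip().isdigit():
            continue  # skip blank lines, a possible header row, or malformed rows
        counts[int(row[1])] += 1
    return dict(counts)


# Made-up sample data standing in for a real output.csv.
sample = io.StringIO(
    "https://ukr.net,0\n"
    "https://example.net/a,1\n"
    "https://example.net/b,1\n"
)
print(pages_per_depth(sample))
```

For a real run, pass an open file instead of the sample, for example open("output.csv", newline="").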

To start the scraper, you need at least one web browser installed (Chrome, Firefox, Microsoft Edge, or Safari).
If Safari is the only browser installed on your computer, enable the Safari driver by running the following command.

safaridriver --enable