A Scrapy bot that crawls URLs in depth and tries to find a file (URL) matching a given query.
Clone the project and go to the bot directory
git clone https://github.com/DevIhor/SherlockScraper.git
cd SherlockScraper
Create a virtual environment and activate it (it must be active before running the scraper)
python3 -m venv .venv
source .venv/bin/activate
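Note: on Windows, the activation command is .venv\Scripts\activate instead.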
Install the Python requirements
pip install -r requirements.txt
Alternatively, install pipenv
python3.10 -m pip install -U pipenv
Create the virtual environment and install the Python requirements
python3.10 -m pipenv install --deploy --dev --ignore-pipfile
Activate the virtual environment
pipenv shell
Enter the virtual environment (skip this if you are already in a pipenv shell)
source .venv/bin/activate
Run the scraper
python start.py --help
When you are done working with the bot, deactivate the virtual environment
deactivate
--help - shows all information
--start_point ... - sets the start URL to scrape
--domain_zone ... - sets the domain zone for URLs to scrape
--query ... - sets the query to search for on all scraped web pages
--full_search - enables searching for the query inside .js files as well
--links_per_url ... - sets the number of URLs to extract, and scrape in depth, per web page
--scraping_deep_level ... - sets the depth level for scraping web pages
--concurrency ... - sets the number of concurrent requests
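To make these options concrete, the sketch below shows the core crawling idea as a minimal Scrapy spider: start from one URL, follow a limited number of same-zone links per page down to a maximum depth, and record pages whose body contains the query string. This is an illustration only, not the project's actual implementation; the class name, argument names, and output format are hypothetical.

```python
# Illustrative sketch only; not the project's actual implementation.
import scrapy
from scrapy.http import TextResponse


class QuerySpider(scrapy.Spider):
    name = "query_sketch"  # hypothetical name

    def __init__(self, start_point, domain_zone, query,
                 links_per_url=10, deep_level=4, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_point]
        self.domain_zone = domain_zone
        self.query = query
        self.links_per_url = int(links_per_url)
        self.deep_level = int(deep_level)

    def parse(self, response, depth=0):
        if not isinstance(response, TextResponse):
            return  # skip binary responses (images, archives, ...)
        # Record pages whose body mentions the query string.
        if self.query in response.text:
            yield {"url": response.url, "depth": depth}
        if depth >= self.deep_level:
            return
        # Follow at most links_per_url links one level deeper,
        # staying inside the requested domain zone.
        for href in response.css("a::attr(href)").getall()[:self.links_per_url]:
            url = response.urljoin(href)
            if self.domain_zone in url:
                yield response.follow(url, callback=self.parse,
                                      cb_kwargs={"depth": depth + 1})
```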
Run the scraper without limits
python start.py --start_point="https://ukr.net" --domain_zone=".net" --query="analytics.js"
Run the scraper with limits
python start.py --start_point="https://ukr.net" --domain_zone=".net" --query="analytics.js" --links_per_url=10 --scraping_deep_level=4 --full_search
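As a rough upper bound, these two options cap the crawl size: with --links_per_url=10 and --scraping_deep_level=4, each page contributes at most 10 new links for 4 levels, i.e. on the order of 10^4 = 10,000 pages in the worst case (assuming every extracted link is unique and within the domain zone).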
output.txt - list of scraped URLs
query_output.txt - list of web-page URLs that contain the query string
output.csv - list of scraped URLs plus each URL's depth level
query_output.csv - list of web-page URLs that contain the query string, plus the URL containing the query string
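If you want to post-process the results, the CSV outputs can be read with the Python standard library. A minimal sketch, assuming one result per row as described above (the exact column layout may differ):

```python
# Minimal sketch: print every row recorded in query_output.csv.
# Adjust indices if the actual column layout differs.
import csv

with open("query_output.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
```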
- To start the scraper, you need at least one web browser installed (Chrome, Firefox, MS Edge, Safari).
- If Safari is the only browser installed on your computer, enable its driver by running the following command.
safaridriver --enable