A scraper to collect data from websites
Python version 3.7+
- Clone this repository:
git clone https://github.com/Marketionist/py-web-scraper.git
- Create virtual environment and activate it:
python -m venv py-web-scraper/
source py-web-scraper/bin/activate
- Switch to py-web-scraper folder and install all dependencies:
cd py-web-scraper && pip install -r requirements.txt
- Download Playwright browsers:
playwright install
- Create a data-to-scrape.csv file with 2 rows and 4 cells in each row:
  1. URL of the page that you want to scrape
  2. selector (CSS or XPath) for the first parameter that you want to scrape
  3. selector (CSS or XPath) for the second parameter that you want to scrape
  4. selector (CSS or XPath) for the third parameter that you want to scrape

As it is a .csv file, each value (cell) is separated by a comma. For example:
https://www.bbc.com/news/technology,[class*="gel-3/5@xxl"] .qa-status-date-output,[class*="gel-3/5@xxl"] .gs-c-promo-heading__title,[class*="gel-3/5@xxl"] .gs-c-section-link
https://news.ycombinator.com/,.rank,.storylink,.age
Or if you just want to get the first 3 article titles:
https://www.cnn.com/business/tech,(//*[ancestor::*[ul[descendant::*[contains(@data-analytics, "Top stories _list-xs_")]]] and contains(@class, "cd__headline-text")])[1],(//*[ancestor::*[ul[descendant::*[contains(@data-analytics, "Top stories _list-xs_")]]] and contains(@class, "cd__headline-text")])[2],(//*[ancestor::*[ul[descendant::*[contains(@data-analytics, "Top stories _list-xs_")]]] and contains(@class, "cd__headline-text")])[3]
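The rows above can be read with Python's built-in csv module. Here is a minimal sketch of such parsing — the function name and the parsing details are illustrative assumptions, not web_scraper.py's actual implementation:

```python
import csv

# Sketch: parse data-to-scrape.csv into (URL, [selector1, selector2,
# selector3]) pairs. The function name and the exact parsing logic are
# assumptions for illustration, not the scraper's real code.
def read_scrape_targets(path="data-to-scrape.csv"):
    targets = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 4:
                url, *selectors = row[:4]
                targets.append((url, selectors))
    return targets
```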
- Create a data-processed.csv file with any data that you do not want displayed in the scraper output (if there are several items, put one item/string per line)
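One plausible way such an exclusion list could be applied — the function name and filtering approach here are assumptions for illustration, not the scraper's actual code:

```python
# Sketch: drop any scraped value that already appears in
# data-processed.csv (one item per line). This mirrors the README's
# description but is not web_scraper.py's real implementation.
def filter_processed(items, processed_path="data-processed.csv"):
    try:
        with open(processed_path) as f:
            seen = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        seen = set()  # no exclusion file yet: keep everything
    return [item for item in items if item not in seen]
```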
To run the script, execute:
python web_scraper.py
Alternatively, instead of creating data-to-scrape.csv, you can set a path to the file with links and selectors by specifying the INCOMING_DATA_SOURCE environment variable like this:
INCOMING_DATA_SOURCE=my-file-with-data-to-scrape.csv python web_scraper.py
In addition, if you want to see the browser while the script is running, you can enable it by setting the HEADED environment variable to True like this:
INCOMING_DATA_SOURCE=my-file-with-data-to-scrape.csv HEADED=True python web_scraper.py
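Environment variables like these are typically read with os.environ. A sketch of how the two variables above could map to settings — the helper name and the default values are assumptions, not the scraper's actual code:

```python
import os

# Sketch: read the two environment variables described above.
# The helper name and the defaults are assumptions for illustration.
def get_config(env=None):
    env = os.environ if env is None else env
    data_source = env.get("INCOMING_DATA_SOURCE", "data-to-scrape.csv")
    # Treat HEADED=True (any casing) as "show the browser window";
    # Playwright's launch() would then receive headless=not headed.
    headed = env.get("HEADED", "").strip().lower() == "true"
    return data_source, headed
```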
If you want to configure a cron job to run this script twice per day, you can set the time like this:
0 10,22 * * *
It will run every day at 10:00am and 10:00pm (https://crontab.guru/)
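A complete crontab line would also need to change into the project folder and use the virtualenv's interpreter. A sketch with placeholder paths — adjust /path/to to your setup:

```shell
# Runs the scraper at 10:00 and 22:00 every day; /path/to is a placeholder.
# Appending to scraper.log keeps the output inspectable later.
0 10,22 * * * cd /path/to/py-web-scraper && ./bin/python web_scraper.py >> scraper.log 2>&1
```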
If you would like to install any additional dependencies (for example, scrapy), run:
pip install scrapy && pip freeze > requirements.txt
If this script was helpful to you, please give it a ★ Star on GitHub