A scraper to collect data from websites
Python version 3.7+
- Clone this repository:
git clone https://github.com/Marketionist/py-web-scraper.git
- Create virtual environment and activate it:
python -m venv py-web-scraper/
source py-web-scraper/bin/activate
- Switch to py-web-scraper folder and install all dependencies:
cd py-web-scraper && pip install -r requirements.txt
- Download Playwright browsers:
playwright install
- Create a data-to-scrape.csv file with 2 rows and 4 cells in each row:
  1. URL of the page that you want to scrape
  2. selector (CSS or XPath) for the first parameter that you want to scrape
  3. selector (CSS or XPath) for the second parameter that you want to scrape
  4. selector (CSS or XPath) for the third parameter that you want to scrape

As it is a .csv file, each value (cell) is separated by a comma. For example:
https://www.bbc.com/news/technology,[class*="gel-3/5@xxl"] .qa-status-date-output,[class*="gel-3/5@xxl"] .gs-c-promo-heading__title,[class*="gel-3/5@xxl"] .gs-c-section-link
https://news.ycombinator.com/,.rank,.storylink,.age
Or if you just want to get the first 3 article titles:
https://www.cnn.com/business/tech,(//*[ancestor::*[ul[descendant::*[contains(@data-analytics, "Top stories _list-xs_")]]] and contains(@class, "cd__headline-text")])[1],(//*[ancestor::*[ul[descendant::*[contains(@data-analytics, "Top stories _list-xs_")]]] and contains(@class, "cd__headline-text")])[2],(//*[ancestor::*[ul[descendant::*[contains(@data-analytics, "Top stories _list-xs_")]]] and contains(@class, "cd__headline-text")])[3]
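The rows above can be read with Python's built-in csv module. Here is a minimal sketch of such parsing — the function name and the parsing details are illustrative assumptions, not web_scraper.py's actual implementation:

```python
import csv

# Sketch: parse data-to-scrape.csv into (URL, [selector1, selector2,
# selector3]) pairs. The function name and the exact parsing logic are
# assumptions for illustration, not the scraper's real code.
def read_scrape_targets(path="data-to-scrape.csv"):
    targets = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 4:
                url, *selectors = row[:4]
                targets.append((url, selectors))
    return targets
```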
- Create a data-processed.csv file with any data that you do not want displayed in the scraper output (if there are several items, put one item/string per line)
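One plausible way such an exclusion list could be applied — the function name and filtering approach here are assumptions for illustration, not the scraper's actual code:

```python
# Sketch: drop any scraped value that already appears in
# data-processed.csv (one item per line). This mirrors the README's
# description but is not web_scraper.py's real implementation.
def filter_processed(items, processed_path="data-processed.csv"):
    try:
        with open(processed_path) as f:
            seen = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        seen = set()  # no exclusion file yet: keep everything
    return [item for item in items if item not in seen]
```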
To run the script, execute:
python web_scraper.py
Alternatively, instead of creating data-to-scrape.csv, you can set a path to the file with links and selectors by specifying the INCOMING_DATA_SOURCE environment variable like this:
INCOMING_DATA_SOURCE=my-file-with-data-to-scrape.csv python web_scraper.py
In addition, if you want to see the browser while the script is running, you can enable it by setting the HEADED environment variable to True like this:
INCOMING_DATA_SOURCE=my-file-with-data-to-scrape.csv HEADED=True python web_scraper.py
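Environment variables like these are typically read with os.environ. A sketch of how the two variables above could map to settings — the helper name and the default values are assumptions, not the scraper's actual code:

```python
import os

# Sketch: read the two environment variables described above.
# The helper name and the defaults are assumptions for illustration.
def get_config(env=None):
    env = os.environ if env is None else env
    data_source = env.get("INCOMING_DATA_SOURCE", "data-to-scrape.csv")
    # Treat HEADED=True (any casing) as "show the browser window";
    # Playwright's launch() would then receive headless=not headed.
    headed = env.get("HEADED", "").strip().lower() == "true"
    return data_source, headed
```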
If you want to configure a cron job to run this script twice per day, you can set the time like this:
0 10,22 * * *
It will run every day at 10:00am and 10:00pm (https://crontab.guru/)
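A complete crontab line would also need to change into the project folder and use the virtualenv's interpreter. A sketch with placeholder paths — adjust /path/to to your setup:

```shell
# Runs the scraper at 10:00 and 22:00 every day; /path/to is a placeholder.
# Appending to scraper.log keeps the output inspectable later.
0 10,22 * * * cd /path/to/py-web-scraper && ./bin/python web_scraper.py >> scraper.log 2>&1
```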
If you would like to install any additional dependencies (for example, scrapy), run:
pip install scrapy && pip freeze > requirements.txt
If this script was helpful to you, please give it a ★ Star on GitHub