This is an automation tool that fetches article data from cointelegraph.com. The articles on the website's homepage are collected, and their metadata (including title, author, date, etc.) is saved as output.
- Install brew from here.
- Install git and pipenv using brew.
brew install git pipenv
- Clone this repository.
git clone https://github.com/dub-basu/cointelegraph-scraper/
- Install the Python dependencies and create a virtual environment using the Pipfile provided.
cd cointelegraph-scraper
pipenv install
- Download the Firefox driver for Selenium from here. Extract the tar.gz file and copy its contents to any directory in your $PATH.
- Activate the Python environment using pipenv.
pipenv shell
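
Before running the tool, it can help to confirm that the Firefox driver really is reachable through $PATH. The snippet below is a minimal sketch and not part of this repository; it assumes the extracted driver binary is named `geckodriver` (the usual name of Mozilla's Firefox driver), and `find_driver` is a hypothetical helper name.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of `name` if it is on PATH, otherwise None.

    The default name "geckodriver" is an assumption about the binary
    extracted from the tar.gz file.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("geckodriver was not found on PATH; Selenium may fail to start Firefox")
    else:
        print(f"geckodriver found at {path}")
```

If the driver is reported as missing, re-check which directory the extracted file was copied to and whether that directory is listed in $PATH.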
Downloading new articles from the website is a two-step process.
- Step 1: Enter the following command, replacing the date in the given format YYYY-MM-DD.
python main.py step1 --date YYYY-MM-DD
This will open a browser instance and keep loading the page until it finds an article that was posted before the provided date. All the articles between the current date and the provided date will be downloaded in the form of an HTML file (sources.html by default).
- Step 2: Enter the following command, replacing the date in the given format YYYY-MM-DD.
python main.py step2 --date YYYY-MM-DD
This will parse the article URLs from the HTML file obtained in step 1 and fetch their data one by one.
- Once completed, the results will be saved in the downloads directory in the form of CSV files.
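
The logic behind the two steps can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the code in main.py: the function and class names (`parse_cutoff`, `should_stop_loading`, `LinkCollector`, `extract_links`) are hypothetical, and the sample markup is invented, since the real structure of sources.html is not described here.

```python
from datetime import date, datetime
from html.parser import HTMLParser


def parse_cutoff(text):
    """Parse a YYYY-MM-DD string (the format --date expects) into a date."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def should_stop_loading(oldest_loaded, cutoff):
    """Step 1 idea: stop once an article posted before the cutoff is on the page."""
    return oldest_loaded < cutoff


class LinkCollector(HTMLParser):
    """Step 2 idea: collect the href of every <a> tag in the saved HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Keep each URL once, in the order it first appears.
            if href and href not in self.links:
                self.links.append(href)


def extract_links(html_text):
    """Return the unique link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# Invented sample markup; the real sources.html may look different.
sample = (
    '<div><a href="/news/example-one">One</a>'
    '<a href="/news/example-two">Two</a>'
    '<a href="/news/example-one">Duplicate</a></div>'
)

if __name__ == "__main__":
    cutoff = parse_cutoff("2021-03-15")
    print(should_stop_loading(date(2021, 3, 14), cutoff))  # True
    print(extract_links(sample))  # ['/news/example-one', '/news/example-two']
```

A date string that does not match the pattern, such as 2021/03/15, raises a ValueError in `parse_cutoff`, which is one reason to stick to the YYYY-MM-DD format shown in the commands.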
- Choose this option if you want to update the data present in a CSV file that was obtained from step 2 above.
A new file will be created in the downloads directory.
python main.py update --filepath <path-to-your-csv-file>
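
The update behaviour described above (read an existing CSV, refresh its rows, write a new file instead of overwriting the old one) can be sketched as follows. This is a hedged illustration, not the tool's implementation: `update_csv`, the `refresh` callback, and the `_updated` file-name suffix are all hypothetical, and the actual CSV columns are not specified here.

```python
import csv
from pathlib import Path


def update_csv(src, out_dir, refresh):
    """Write a refreshed copy of `src` into `out_dir` and return its path.

    `refresh` is a hypothetical callback that takes one row (a dict) and
    returns the updated row. It must keep the same column names, because
    csv.DictWriter rejects unknown keys by default.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme; the real tool may name the new file differently.
    out_path = out_dir / f"{src.stem}_updated.csv"

    with src.open(newline="", encoding="utf-8") as f_in:
        reader = csv.DictReader(f_in)
        fieldnames = reader.fieldnames or []
        rows = [refresh(dict(row)) for row in reader]

    with out_path.open("w", newline="", encoding="utf-8") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    return out_path
```

Writing to a separate file keeps the original CSV intact, so a failed or partial update does not destroy previously fetched data.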