jpellman / gcp-docs-scraper

A simple BeautifulSoup script to scrape docs from the GCP website and then generate epub files using pandoc.

Google Cloud Documentation Scraper

Create the conda environment:

conda env create -f environment.yml

Modify the services variable in the gcpDocScraper.py script so that it contains one URI for each GCP service you want to scrape. Each URI must point to a page that includes the side navbar.
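As a rough illustration of the step above, the variable might look something like the following. This is only a sketch: it assumes services is a plain list of strings, which may not match how gcpDocScraper.py actually declares it, and the two URIs are example documentation landing pages rather than values taken from the script. The looks_like_gcp_docs_uri helper is hypothetical and only checks the scheme and host; it cannot confirm that a page has the side navbar.

```python
from urllib.parse import urlparse

# Assumed shape: a list with one documentation URI per GCP service.
# The real variable in gcpDocScraper.py may be structured differently.
services = [
    "https://cloud.google.com/storage/docs",
    "https://cloud.google.com/compute/docs",
]


def looks_like_gcp_docs_uri(uri):
    """Hypothetical sanity check: HTTPS and hosted on cloud.google.com."""
    parts = urlparse(uri)
    return parts.scheme == "https" and parts.netloc == "cloud.google.com"


# Report any entries that do not pass the basic check.
bad = [uri for uri in services if not looks_like_gcp_docs_uri(uri)]
print("suspicious URIs:", bad)
```

Open each URI in a browser before running the scraper to check that the side navbar is actually present on the page.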

Run the script:

python gcpDocScraper.py

When the script finishes, its output is split across three directories. The csvs directory contains semicolon-delimited CSV files that can be used to track reading. The epubs directory contains the generated epub files. The html directory contains the scraped HTML pages, reduced to the article content only (minus the sidebar and all the other nonsense that Google clutters its pages with).
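Because the CSV files are semicolon-delimited, they need an explicit delimiter when read programmatically. The sketch below uses Python's standard csv module on an in-memory sample. The column names (Title, URL, Read) and the sample row are made up for illustration, since the actual columns written by the script are not documented here.

```python
import csv
import io


def read_tracking_rows(handle):
    """Return every row of a semicolon-delimited CSV as a list of strings."""
    return list(csv.reader(handle, delimiter=";"))


# Hypothetical sample data; the real files under csvs may use other columns.
sample = io.StringIO(
    "Title;URL;Read\n"
    "Overview;https://cloud.google.com/storage/docs;no\n"
)

rows = read_tracking_rows(sample)
header, entries = rows[0], rows[1:]
print(header)
print(len(entries), "entries")
```

To read one of the generated files instead of the sample, pass a file handle opened with newline="" (as the csv module documentation recommends) to read_tracking_rows.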


License: MIT License


Languages

Language: Python 100.0%