
dietgrail-scraping

DietGrail Website Scraping using Python

Prerequisites

This script uses the following applications and Python libraries to crawl data from the DietGrail GI and GL of Foods website; a short sketch of how the Python libraries fit together follows the lists below.

OS and Applications

  • macOS Monterey (12.2.1)
  • Firefox 99.0
  • pyenv
  • Python 3.9.11
  • Geckodriver

Python Libraries

  • pip
  • Beautiful Soup 4
  • Requests
  • Pandas
  • Selenium 4.1.3
  • PyAutoGUI
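
A rough sketch of how these libraries fit together is shown below. It is illustrative only: the URL, selectors, and output file name are placeholders, not the ones used by scrape_selenium_10.py, and PyAutoGUI and Requests are omitted here.

    # Minimal sketch: drive Firefox with Selenium, parse with Beautiful Soup,
    # collect rows with pandas. URL and selectors below are placeholders.
    from selenium import webdriver
    from bs4 import BeautifulSoup
    import pandas as pd

    driver = webdriver.Firefox()                        # geckodriver must be on PATH
    driver.get("https://example.com/gi-and-gl-table")   # placeholder URL
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()

    rows = []
    for tr in soup.select("table tr"):                  # placeholder selector
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)

    pd.DataFrame(rows).to_csv("output.csv", index=False)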

Installations

  1. Install the Python version manager (pyenv) and Geckodriver:

    brew install pyenv
    brew install geckodriver
    
  2. Install Python 3.9.11:

    pyenv install 3.9.11
    pyenv global 3.9.11
    pyenv exec python -V
    
  3. Install pip (get-pip.py can be downloaded from https://bootstrap.pypa.io/get-pip.py):

    curl -O https://bootstrap.pypa.io/get-pip.py
    pyenv exec python3 get-pip.py
    
  4. Create .pyenvrc file:

    echo 'eval "$(pyenv init -)"' > ~/.pyenvrc
    
  5. Install Python packages:

    pyenv exec pip install beautifulsoup4
    pyenv exec pip install requests
    pyenv exec pip install selenium
    pyenv exec pip install pyautogui
    pyenv exec pip install pandas
    
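One optional way to confirm that the packages were installed into the pyenv interpreter is to import them and print their versions (this check is not part of the repository):

    # Sanity check: all required packages import and report a version.
    import bs4, requests, selenium, pyautogui, pandas

    for name, module in [("beautifulsoup4", bs4), ("requests", requests),
                         ("selenium", selenium), ("pyautogui", pyautogui),
                         ("pandas", pandas)]:
        print(name, module.__version__)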

List of Files

   scrape_selenium_10.py
   setting_scrape.txt

How to Run

  1. Edit the setting_scrape.txt file; refer to the Script Settings section below.

  2. Source the pyenv environment file:

    source ~/.pyenvrc
    
  3. Run command:

    pyenv exec python3 scrape_selenium_10.py
    
  4. Other running options, e.g. run Firefox in headless mode (an in-code equivalent is sketched after this list):

    MOZ_HEADLESS=1 pyenv exec python3 scrape_selenium_10.py
    
  5. Output files and folders (see the folder sketch after this list):

  • Web pages are saved in the offline_pages folder.
  • The output .csv file is saved in the csv folder.
  • Chart files are saved in the charts folder.
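
For reference, headless mode can also be configured in code rather than through MOZ_HEADLESS=1, and the three output folders can be created up front. The snippet below is only a sketch of that idea; the folder names come from the list above, while the URL, file names, and data are placeholders:

    import os
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    # Make sure the output folders from the list above exist.
    for folder in ("offline_pages", "csv", "charts"):
        os.makedirs(folder, exist_ok=True)

    options = Options()
    options.add_argument("-headless")            # same effect as MOZ_HEADLESS=1
    driver = webdriver.Firefox(options=options)

    driver.get("https://example.com/page-1")     # placeholder URL
    with open("offline_pages/page_1.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)              # keep an offline copy of the page
    driver.quit()

    # Placeholder data written where the real script saves its .csv output.
    pd.DataFrame([{"food": "example", "gi": 0, "gl": 0}]).to_csv("csv/example.csv", index=False)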

Script Settings

Parameter              Default  Unit  Description
WEBPAGE_TIMEOUT        15       sec   Initial wait time for the first page to load
WEBPAGE_LOAD           2.7      sec   Wait time between pages
WEBPAGE_PAUSE          10       page  Number of pages to scrape between pauses
WEBPAGE_PAUSE_TIME     90       sec   Duration of each pause
WEBPAGE_CHART_ON       0        -     Enable chart scraping (0: OFF, 1: ON)
WEBPAGE_OFFLINE_PARSE  0        -     Process already-saved offline pages only (0: OFF, 1: ON)
GI_START_PAGE          1        page  First page to scrape
GI_STOP_PAGE           219      page  Last page to scrape in this run
GI_LAST_PAGE           219      page  Last page of the website; must not be smaller than GI_STOP_PAGE
GI_ROW_NUM             14       row   Number of rows on a page
GI_ROW_NUM_LAST        4        row   Number of rows on the last page
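
The exact syntax of setting_scrape.txt is defined by scrape_selenium_10.py and is not documented here. Purely as an illustration, assuming a simple KEY=VALUE layout, the parameters above could be read like this:

    # Illustration only: assumes KEY=VALUE lines, which may not match the real
    # format that scrape_selenium_10.py expects in setting_scrape.txt.
    settings = {}
    with open("setting_scrape.txt") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                settings[key.strip()] = value.strip()

    timeout = float(settings.get("WEBPAGE_TIMEOUT", "15"))
    print("WEBPAGE_TIMEOUT =", timeout)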

Other Notes

  • Sometimes the DietGrail GI and GL of Foods website does not respond in the remote-controlled Firefox session, and the page has to be clicked manually.
  • After roughly 50 pages have been downloaded and saved, the website stops responding; wait about 30 seconds to 1 minute for it to recover (see the pause sketch below).
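
The WEBPAGE_LOAD, WEBPAGE_PAUSE, and WEBPAGE_PAUSE_TIME settings exist to work around this throttling. The loop below sketches the pattern of waiting between pages and taking a longer break every N pages; the values mirror the defaults above and page_urls is a placeholder:

    import time

    WEBPAGE_LOAD = 2.7        # seconds to wait between pages
    WEBPAGE_PAUSE = 10        # take a longer break every 10 pages
    WEBPAGE_PAUSE_TIME = 90   # seconds to wait during that break

    page_urls = [f"https://example.com/page-{n}" for n in range(1, 220)]  # placeholder

    for i, url in enumerate(page_urls, start=1):
        # ... download and save the page here ...
        time.sleep(WEBPAGE_LOAD)
        if i % WEBPAGE_PAUSE == 0:
            time.sleep(WEBPAGE_PAUSE_TIME)    # give the website time to recover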

Screenshots

DietGrail GI and GL of Foods first page

First Page

DietGrail GI and GL of Foods last page

Last Page

DietGrail GI and GL of Foods with chart

With Chart



License: MIT License