
dietgrail-scraping

DietGrail Website Scraping using Python

Prerequisites

This script uses the following applications and Python libraries to crawl data from the DietGrail GI and GL of Foods website; a short sketch of how the Python libraries fit together follows the lists below.

OS and Applications

  • macOS Monterey (12.2.1)
  • Firefox 99.0
  • pyenv
  • Python 3.9.11
  • Geckodriver

Python Libraries

  • pip
  • Beautiful Soup 4
  • Requests
  • Pandas
  • Selenium 4.1.3
  • PyAutoGUI
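
A rough sketch of how these libraries fit together is shown below. It is illustrative only: the URL, selectors, and output file name are placeholders, not the ones used by scrape_selenium_10.py, and PyAutoGUI and Requests are omitted here.

    # Minimal sketch: drive Firefox with Selenium, parse with Beautiful Soup,
    # collect rows with pandas. URL and selectors below are placeholders.
    from selenium import webdriver
    from bs4 import BeautifulSoup
    import pandas as pd

    driver = webdriver.Firefox()                        # geckodriver must be on PATH
    driver.get("https://example.com/gi-and-gl-table")   # placeholder URL
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()

    rows = []
    for tr in soup.select("table tr"):                  # placeholder selector
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)

    pd.DataFrame(rows).to_csv("output.csv", index=False)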

Installations

  1. Install the Python version manager (pyenv) and Geckodriver:

    brew install pyenv
    brew install geckodriver
    
  2. Install Python 3.9.11:

    pyenv install 3.9.11
    pyenv global 3.9.11
    pyenv exec python -V
    
  3. Install pip (get-pip.py can be downloaded from https://bootstrap.pypa.io/get-pip.py):

    curl -O https://bootstrap.pypa.io/get-pip.py
    pyenv exec python3 get-pip.py
    
  4. Create .pyenvrc file:

    echo 'eval "$(pyenv init -)"' > ~/.pyenvrc
    
  5. Install Python packages:

    pyenv exec pip install beautifulsoup4
    pyenv exec pip install requests
    pyenv exec pip install selenium
    pyenv exec pip install pyautogui
    pyenv exec pip install pandas
    
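One optional way to confirm that the packages were installed into the pyenv interpreter is to import them and print their versions (this check is not part of the repository):

    # Sanity check: all required packages import and report a version.
    import bs4, requests, selenium, pyautogui, pandas

    for name, module in [("beautifulsoup4", bs4), ("requests", requests),
                         ("selenium", selenium), ("pyautogui", pyautogui),
                         ("pandas", pandas)]:
        print(name, module.__version__)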

List of Files

   scrape_selenium_10.py
   setting_scrape.txt

How to Run

  1. Edit the setting_scrape.txt file; refer to the Script Settings section below.

  2. Source the pyenv environment file:

    source ~/.pyenvrc
    
  3. Run command:

    pyenv exec python3 scrape_selenium_10.py
    
  4. Other running options, e.g. run Firefox in headless mode (an in-code equivalent is sketched after this list):

    MOZ_HEADLESS=1 pyenv exec python3 scrape_selenium_10.py
    
  5. Output files and folders (see the folder sketch after this list):

  • Web pages are saved in the offline_pages folder.
  • The output .csv file is saved in the csv folder.
  • Chart files are saved in the charts folder.
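
For reference, headless mode can also be configured in code rather than through MOZ_HEADLESS=1, and the three output folders can be created up front. The snippet below is only a sketch of that idea; the folder names come from the list above, while the URL, file names, and data are placeholders:

    import os
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    # Make sure the output folders from the list above exist.
    for folder in ("offline_pages", "csv", "charts"):
        os.makedirs(folder, exist_ok=True)

    options = Options()
    options.add_argument("-headless")            # same effect as MOZ_HEADLESS=1
    driver = webdriver.Firefox(options=options)

    driver.get("https://example.com/page-1")     # placeholder URL
    with open("offline_pages/page_1.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)              # keep an offline copy of the page
    driver.quit()

    # Placeholder data written where the real script saves its .csv output.
    pd.DataFrame([{"food": "example", "gi": 0, "gl": 0}]).to_csv("csv/example.csv", index=False)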

Script Settings

Parameter              Default  Unit  Description
WEBPAGE_TIMEOUT        15       sec   Initial wait time for the first page to load
WEBPAGE_LOAD           2.7      sec   Wait time between pages
WEBPAGE_PAUSE          10       page  Number of pages to scrape between pauses
WEBPAGE_PAUSE_TIME     90       sec   Duration of each pause
WEBPAGE_CHART_ON       0        -     Enable chart scraping (0: OFF, 1: ON)
WEBPAGE_OFFLINE_PARSE  0        -     Process already-saved offline pages only (0: OFF, 1: ON)
GI_START_PAGE          1        page  First page to scrape
GI_STOP_PAGE           219      page  Last page to scrape in this run
GI_LAST_PAGE           219      page  Last page of the website; must not be smaller than GI_STOP_PAGE
GI_ROW_NUM             14       row   Number of rows on a page
GI_ROW_NUM_LAST        4        row   Number of rows on the last page
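
The exact syntax of setting_scrape.txt is defined by scrape_selenium_10.py and is not documented here. Purely as an illustration, assuming a simple KEY=VALUE layout, the parameters above could be read like this:

    # Illustration only: assumes KEY=VALUE lines, which may not match the real
    # format that scrape_selenium_10.py expects in setting_scrape.txt.
    settings = {}
    with open("setting_scrape.txt") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                settings[key.strip()] = value.strip()

    timeout = float(settings.get("WEBPAGE_TIMEOUT", "15"))
    print("WEBPAGE_TIMEOUT =", timeout)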

Other Notes

  • Sometimes the DietGrail GI and GL of Foods website does not respond in the remote-controlled Firefox session, and the page has to be clicked manually.
  • After roughly 50 pages have been downloaded and saved, the website stops responding; wait about 30 seconds to 1 minute for it to recover (see the pause sketch below).
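
The WEBPAGE_LOAD, WEBPAGE_PAUSE, and WEBPAGE_PAUSE_TIME settings exist to work around this throttling. The loop below sketches the pattern of waiting between pages and taking a longer break every N pages; the values mirror the defaults above and page_urls is a placeholder:

    import time

    WEBPAGE_LOAD = 2.7        # seconds to wait between pages
    WEBPAGE_PAUSE = 10        # take a longer break every 10 pages
    WEBPAGE_PAUSE_TIME = 90   # seconds to wait during that break

    page_urls = [f"https://example.com/page-{n}" for n in range(1, 220)]  # placeholder

    for i, url in enumerate(page_urls, start=1):
        # ... download and save the page here ...
        time.sleep(WEBPAGE_LOAD)
        if i % WEBPAGE_PAUSE == 0:
            time.sleep(WEBPAGE_PAUSE_TIME)    # give the website time to recover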

Screenshots

DietGrail GI and GL of Foods first page

First Page

DietGrail GI and GL of Foods last page

Last Page

DietGrail GI and GL of Foods with chart

With Chart



License: MIT License