DiGA and Study Data Extraction and Processing

This repository contains scripts for extracting, processing, and crawling data related to DiGA (Digitale Gesundheitsanwendung) and study data. Each script is described in the Usage section.

Installation

Clone the repository:

git clone https://github.com/suriija/digaCrawler.git
cd digaCrawler

Install the required dependencies:

pip install pandas
pip install selenium
pip install webdriver-manager

The sqlite3 module is part of the Python standard library and does not need to be installed with pip.

Usage

study_data_extraction_and_transformation.py

This script is used for extracting and transforming study data from Excel files.

Ensure your Excel files are named and formatted correctly.
Modify the excel_file_path and sheet_name variables in the script to match your file and sheet names.

Run the script:

python study_data_extraction_and_transformation.py
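
Before running the script, it can help to confirm that the configured path really points at an existing Excel file. The following is a minimal sketch and not part of the repository: check_excel_path is a hypothetical helper, and the set of accepted suffixes is an assumption.

```python
from pathlib import Path

# Assumed set of Excel suffixes; adjust to the files you actually use.
EXCEL_SUFFIXES = {".xlsx", ".xlsm", ".xls"}


def check_excel_path(excel_file_path):
    """Return the path as a Path object, or raise if it is unusable."""
    path = Path(excel_file_path)
    if path.suffix.lower() not in EXCEL_SUFFIXES:
        raise ValueError(f"Not an Excel file: {path}")
    if not path.is_file():
        raise FileNotFoundError(f"Excel file not found: {path}")
    return path
```

Calling check_excel_path(excel_file_path) at the top of a script turns a misspelled file name into an immediate, readable error instead of a failure partway through processing.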

diga_data_extraction_and_processing.py

This script is used for extracting and processing DiGA data and importing it into an SQLite database.

Ensure your Excel files are named and formatted correctly.
Modify the excel_file, sheet_name, database_file, and other relevant variables in the script to match your files and database.

Run the script:

python diga_data_extraction_and_processing.py
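
The import into SQLite can be pictured roughly as follows. This is a sketch under assumptions rather than the script's actual code: the table name diga, its two columns, and the helper import_rows are hypothetical, and the rows are plain dictionaries so that the example needs only the standard library (the script itself takes its data from Excel files).

```python
import sqlite3


def import_rows(database_file, rows):
    """Insert rows into a hypothetical `diga` table and return its row count."""
    conn = sqlite3.connect(database_file)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS diga "
                "(name TEXT PRIMARY KEY, manufacturer TEXT)"
            )
            # INSERT OR REPLACE keeps repeated runs from duplicating entries.
            conn.executemany(
                "INSERT OR REPLACE INTO diga (name, manufacturer) "
                "VALUES (:name, :manufacturer)",
                rows,
            )
        return conn.execute("SELECT COUNT(*) FROM diga").fetchone()[0]
    finally:
        conn.close()


# Placeholder rows for illustration, not real DiGA entries.
example_rows = [
    {"name": "Example App A", "manufacturer": "Example GmbH"},
    {"name": "Example App B", "manufacturer": "Sample AG"},
]
print(import_rows(":memory:", example_rows))
```

Passing ":memory:" keeps the example from touching any file on disk; replace it with your database_file value to write to a real database.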

Crawling-selenium-version2.ipynb

This notebook uses Selenium to scrape information from a DiGA (Digitale Gesundheitsanwendung) website. It extracts the listed health apps and their details.

Chrome and ChromeDriver Issues

Selenium only runs without errors if the Chrome and ChromeDriver versions are compatible. Look up a driver version that matches your installed Chrome version in one of the JSON endpoints listed at 'https://github.com/GoogleChromeLabs/chrome-for-testing#json-api-endpoints'. Then pass that endpoint as latest_release_url and the matching version as driver_version in:

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager(
    latest_release_url='https://googlechromelabs.github.io/chrome-for-testing/last-known-good-versions-with-downloads.json',
    driver_version='124.0.6367.91').install()), options=options)

Here, the driver version chosen is 124.0.6367.91.
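
One way to pick a driver version is to compare major versions (the number before the first dot), which is the usual rule of thumb for Chrome and ChromeDriver compatibility. The sketch below assumes the JSON file has a top-level "channels" object whose entries each carry a "version" string. pick_driver_version is a hypothetical helper, and the sample data is made up for illustration rather than downloaded.

```python
def major(version):
    """Return the major component, e.g. '124' for '124.0.6367.91'."""
    return version.split(".")[0]


def pick_driver_version(chrome_version, known_good):
    """Return a listed version with the same major as chrome_version, or None."""
    for channel in known_good.get("channels", {}).values():
        if major(channel["version"]) == major(chrome_version):
            return channel["version"]
    return None


# Illustrative data in the assumed layout, not a real copy of the JSON file.
sample = {
    "channels": {
        "Stable": {"version": "124.0.6367.91"},
        "Beta": {"version": "125.0.6422.4"},
    }
}
print(pick_driver_version("124.0.6367.60", sample))  # 124.0.6367.91
```

If the function returns None, no listed version shares your Chrome's major version, and you would need to update Chrome or consult another endpoint.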

How to Run in Jupyter Notebook

Download the Jupyter Notebook file to your local machine.
Open the notebook using Jupyter Notebook or Jupyter Lab.
Run each cell in the notebook sequentially. You can do this by clicking the "Run" button or using the shortcut Shift + Enter.
The notebook extracts information from the DiGA website and displays the results in the output of the respective cells. The extracted data is also collected in a DataFrame and saved to a CSV file.
