Raw and processed data for dsbbfinddx/FINDCov19TrackerShiny
Most of the test data is scraped automatically by a combination of Python- and R-based solutions. COVID-19 tests are queried twice per day (early in the morning and late in the evening). Because countries change their way of reporting from time to time, manual action is needed on a daily basis for some countries.
- Most countries are scraped via Python using Selenium or JSON libraries; the results are placed in `automated/selenium/`.
- Countries which report in PDF (or other non-HTML formats) are queried via R functions and placed in `automated/fetch/`.
- Last, country information gathered via manual website visits is added, and everything is combined into a single information source listing the number of tests from all sources (located at `automated/merged/`).
The R package {FindCovTracker}, which powers most of the automated actions run via GitHub Actions, takes this data and writes `processed/coronavirus_test.csv`, which is then taken as input for the final Shiny App.
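For orientation, the Shiny App only needs this single processed file as its input. Below is a minimal R sketch of reading it; the `date` column used for ordering is an assumption for illustration, not a documented schema.

```r
# Minimal sketch: load the processed file that the Shiny App consumes.
# The 'date' column referenced below is an assumption for illustration;
# the actual schema may differ.
tests <- read.csv("processed/coronavirus_test.csv", stringsAsFactors = FALSE)

# Inspect the most recent rows (assuming a 'date' column exists)
head(tests[order(tests$date, decreasing = TRUE), ])
```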
The following section explains the workflow in greater detail, including links to all R functions from the {FindCovTracker} R package and how conflicts/errors are handled in the individual stages.
- The Selenium Python code located in `selenium/` is run; specifically, `python3 selenium/run.py` is executed.
- The result is uploaded as a JSON file to the `automated/selenium/` directory, prefixed with the respective date. `new_tests` values are calculated from the difference to the previous day (see the sketch after this list).
- What happens on error? Countries which do not return a value after the timeout specified in `selenium/test.py` will be reported as `NA`. The country will also be listed in `all-countries-error.csv`.
- `fetch_test_data()` processes the countries specified in the respective upstream file with dedicated functions for the given file type (e.g. PDF).
- What happens on error? The functions operate with a "try/catch" approach and return `NA` in case something does not work (see the sketch after this list). The country will also be listed in the `countries-error.csv` file of the respective day.
The third step in the CI workflow combines the results from Selenium, the fetch functions, and manual updates when they are available in `manual/processed/`. The function `get_test_data()` writes a combined data source to `automated/merged/`. In addition, the list of countries which errored (`all-countries-error.csv`) is written.
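Conceptually, this merge step stacks the per-source tables for the day, writes the combined file, and records which countries came back without a value. The sketch below is an illustration only; the output file names and the `tests_cumulative` column are assumptions, not taken from the actual package.

```r
# Rough sketch of the merge step: combine Selenium, fetch, and manual results,
# write the merged file, and list countries without a value for the day.
# The merged file name and the 'tests_cumulative' column are assumptions.
merge_daily_sources <- function(date, selenium, fetched, manual) {
  merged <- rbind(selenium, fetched, manual)

  write.csv(merged,
            file.path("automated", "merged", paste0(date, "-merged.csv")),
            row.names = FALSE)

  errored <- merged[is.na(merged$tests_cumulative), "country", drop = FALSE]
  write.csv(errored, "all-countries-error.csv", row.names = FALSE)

  invisible(merged)
}
```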
The last step performs some analysis on the previous workflow steps. In particular, `combine_all_tests()` (sketched after this list):

- writes the file which lists all countries that still need manual processing (`$DATE-need-processing.csv`),
- writes `coronavirus_tests_new.csv`, which lists information from all dates and all countries that have been processed so far.
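The analysis step can be thought of as splitting the merged data into countries that still lack a value (which go into the `$DATE-need-processing.csv` file) and appending the day's results to the cumulative history. Again, this is a hedged sketch rather than the actual `combine_all_tests()` body; column names and the exact append logic are assumptions.

```r
# Hedged sketch of the analysis step: flag countries that still need manual
# processing and append today's results to the cumulative file. Column names
# and the exact append logic are assumptions.
analyse_daily_results <- function(date, merged,
                                  history_path = "coronavirus_tests_new.csv") {
  needs_manual <- merged[is.na(merged$tests_cumulative), ]
  write.csv(needs_manual,
            paste0(date, "-need-processing.csv"),
            row.names = FALSE)

  history <- if (file.exists(history_path)) {
    read.csv(history_path, stringsAsFactors = FALSE)
  } else {
    NULL
  }
  write.csv(rbind(history, merged), history_path, row.names = FALSE)
}
```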
This step exists twice in the GHA workflow file:

- Job: `run-analysis` runs when automated scraping has happened before and therefore includes a `needs` condition.
- Job: `run-analysis-manual` runs only if the commit message contains "manually processed countries". In this scenario the scraping jobs are not triggered.
The reasoning here is that if the `.csv` file containing the manually processed information for countries is uploaded, it should only be merged into the final file. The automated data scraping should not be triggered again, since a new run could potentially lead to new failures for some countries. These newly failing countries would then be missing for that day because they were not processed manually beforehand.