Raw and processed data for dsbbfinddx/FINDCov19TrackerShiny
Most of the test data is scraped automatically by a combination of Python- and R-based solutions. COVID-19 tests are queried twice per day (early in the morning and late in the evening). Because countries change their way of reporting from time to time, manual action is needed on a daily basis for some countries.
- Most countries are scraped via Python using Selenium or JSON libraries; the results are placed in `automated/selenium/`.
- Countries which report in PDF (or other non-HTML formats) are queried via R functions and placed in `automated/fetch/`.
- Last, country information gathered via manual website visits is added, and everything is combined into a single information source listing the number of tests from all sources (located at `automated/merged/`).
The R package {FindCovTracker}, which powers most of the automated actions run via GitHub Actions, takes this data and writes `processed/coronavirus_test.csv`, which is then taken as input for the final Shiny App.
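For orientation, the Shiny App only needs this single processed file as its input. Below is a minimal R sketch of reading it; the `date` column used for ordering is an assumption for illustration, not a documented schema.

```r
# Minimal sketch: load the processed file that the Shiny App consumes.
# The 'date' column referenced below is an assumption for illustration;
# the actual schema may differ.
tests <- read.csv("processed/coronavirus_test.csv", stringsAsFactors = FALSE)

# Inspect the most recent rows (assuming a 'date' column exists)
head(tests[order(tests$date, decreasing = TRUE), ])
```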
The following section explains the workflow in greater detail, including links to all R functions from the {FindCovTracker} R package and how conflicts/errors are handled in the individual stages.
- The Selenium Python code located in `selenium/` is run; specifically, `python3 selenium/run.py` is executed.
- The result is uploaded as a JSON file to the `automated/selenium/` directory, prefixed with the respective date. `new_tests` values are calculated from the difference to the previous day (see the sketch after this list).
- What happens on error? Countries which do not return a value after the timeout specified in `selenium/test.py` will be reported as `NA`. The country will also be listed in `all-countries-error.csv`.
- `fetch_test_data()` processes the countries specified in the respective upstream file with dedicated functions for the given file type (e.g. PDF).
- What happens on error? The functions operate with a "try/catch" approach and return `NA` in case something does not work (see the sketch after this list). The country will also be listed in the `countries-error.csv` file of the respective day.
The third step in the CI workflow combines the results from Selenium, the fetch functions, and manual updates when they are available in `manual/processed/`. The function `get_test_data()` writes a combined data source to `automated/merged/`. In addition, the list of countries which errored (`all-countries-error.csv`) is written.
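Conceptually, this merge step stacks the per-source tables for the day, writes the combined file, and records which countries came back without a value. The sketch below is an illustration only; the output file names and the `tests_cumulative` column are assumptions, not taken from the actual package.

```r
# Rough sketch of the merge step: combine Selenium, fetch, and manual results,
# write the merged file, and list countries without a value for the day.
# The merged file name and the 'tests_cumulative' column are assumptions.
merge_daily_sources <- function(date, selenium, fetched, manual) {
  merged <- rbind(selenium, fetched, manual)

  write.csv(merged,
            file.path("automated", "merged", paste0(date, "-merged.csv")),
            row.names = FALSE)

  errored <- merged[is.na(merged$tests_cumulative), "country", drop = FALSE]
  write.csv(errored, "all-countries-error.csv", row.names = FALSE)

  invisible(merged)
}
```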
The last step performs some analysis on the previous workflow steps. In particular, `combine_all_tests()` (sketched after this list):

- writes the file which lists all countries that still need manual processing (`$DATE-need-processing.csv`),
- writes `coronavirus_tests_new.csv`, which lists information from all dates and all countries that have been processed so far.
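The analysis step can be thought of as splitting the merged data into countries that still lack a value (which go into the `$DATE-need-processing.csv` file) and appending the day's results to the cumulative history. Again, this is a hedged sketch rather than the actual `combine_all_tests()` body; column names and the exact append logic are assumptions.

```r
# Hedged sketch of the analysis step: flag countries that still need manual
# processing and append today's results to the cumulative file. Column names
# and the exact append logic are assumptions.
analyse_daily_results <- function(date, merged,
                                  history_path = "coronavirus_tests_new.csv") {
  needs_manual <- merged[is.na(merged$tests_cumulative), ]
  write.csv(needs_manual,
            paste0(date, "-need-processing.csv"),
            row.names = FALSE)

  history <- if (file.exists(history_path)) {
    read.csv(history_path, stringsAsFactors = FALSE)
  } else {
    NULL
  }
  write.csv(rbind(history, merged), history_path, row.names = FALSE)
}
```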
This step exists twice in the GHA workflow file:

- Job: `run-analysis` runs when automated scraping has happened before and therefore includes a `needs` condition.
- Job: `run-analysis-manual` runs only if the commit message contains "manually processed countries". In this scenario the scraping jobs are not triggered.
The reasoning here is that if the `.csv` file containing the manually processed information for countries is uploaded, it should only be merged into the final file. The automated data scraping should not be triggered again, since a new run could potentially lead to new failures for some countries. These newly failing countries would then be missing for that day because they were not processed manually beforehand.